Update Linux to v5.10.109
Sourced from [1]
[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.109.tar.xz
Change-Id: I19bca9fc6762d4e63bcf3e4cba88bbe560d9c76c
Signed-off-by: Olivier Deprez <olivier.deprez@arm.com>
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
new file mode 100644
index 0000000..7b5964b
--- /dev/null
+++ b/Documentation/filesystems/9p.rst
@@ -0,0 +1,195 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+v9fs: Plan 9 Resource Sharing for Linux
+=======================================
+
+About
+=====
+
+v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
+
+This software was originally developed by Ron Minnich <rminnich@sandia.gov>
+and Maya Gokhale. Additional development by Greg Watson
+<gwatson@lanl.gov> and most recently Eric Van Hensbergen
+<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
+<rsc@swtch.com>.
+
+The best detailed explanation of the Linux implementation and applications of
+the 9p client is available in the form of a USENIX paper:
+
+ https://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
+
+Other applications are described in the following papers:
+
+ * XCPU & Clustering
+ http://xcpu.org/papers/xcpu-talk.pdf
+ * KVMFS: control file system for KVM
+ http://xcpu.org/papers/kvmfs.pdf
+ * CellFS: A New Programming Model for the Cell BE
+ http://xcpu.org/papers/cellfs-talk.pdf
+ * PROSE I/O: Using 9p to enable Application Partitions
+ http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
+ * VirtFS: A Virtualization Aware File System pass-through
+ http://goo.gl/3WPDg
+
+Usage
+=====
+
+For remote file server::
+
+ mount -t 9p 10.10.1.2 /mnt/9
+
+For Plan 9 From User Space applications (http://swtch.com/plan9)::
+
+ mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
+
+For server running on QEMU host with virtio transport::
+
+ mount -t 9p -o trans=virtio <mount_tag> /mnt/9
+
+where mount_tag is the tag associated by the server to each of the exported
+mount points. Each 9P export is seen by the client as a virtio device with an
+associated "mount_tag" property. Available mount tags can be
+seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+
+Options
+=======
+
+ ============= ===============================================================
+ trans=name select an alternative transport. Valid options are
+ currently:
+
+ ======== ============================================
+ unix specifying a named pipe mount point
+ tcp specifying a normal TCP/IP connection
+ fd used passed file descriptors for connection
+ (see rfdno and wfdno)
+ virtio connect to the next virtio channel available
+ (from QEMU with trans_virtio module)
+ rdma connect to a specified RDMA channel
+ ======== ============================================
+
+ uname=name user name to attempt mount as on the remote server. The
+ server may override or ignore this value. Certain user
+ names may require authentication.
+
+ aname=name aname specifies the file tree to access when the server is
+ offering several exported file systems.
+
+ cache=mode specifies a caching policy. By default, no caches are used.
+
+ none
+ default no cache policy, metadata and data
+ alike are synchronous.
+ loose
+ no attempts are made at consistency,
+ intended for exclusive, read-only mounts
+ fscache
+ use FS-Cache for a persistent, read-only
+ cache backend.
+ mmap
+ minimal cache that is only used for read-write
+ mmap. Northing else is cached, like cache=none
+
+ debug=n specifies debug level. The debug level is a bitmask.
+
+ ===== ================================
+ 0x01 display verbose error messages
+ 0x02 developer debug (DEBUG_CURRENT)
+ 0x04 display 9p trace
+ 0x08 display VFS trace
+ 0x10 display Marshalling debug
+ 0x20 display RPC debug
+ 0x40 display transport debug
+ 0x80 display allocation debug
+ 0x100 display protocol message debug
+ 0x200 display Fid debug
+ 0x400 display packet debug
+ 0x800 display fscache tracing debug
+ ===== ================================
+
+ rfdno=n the file descriptor for reading with trans=fd
+
+ wfdno=n the file descriptor for writing with trans=fd
+
+ msize=n the number of bytes to use for 9p packet payload
+
+ port=n port to connect to on the remote server
+
+ noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
+
+ version=name Select 9P protocol version. Valid options are:
+
+ ======== ==============================
+ 9p2000 Legacy mode (same as noextend)
+ 9p2000.u Use 9P2000.u protocol
+ 9p2000.L Use 9P2000.L protocol
+ ======== ==============================
+
+ dfltuid attempt to mount as a particular uid
+
+ dfltgid attempt to mount with a particular gid
+
+ afid security channel - used by Plan 9 authentication protocols
+
+ nodevmap do not map special files - represent them as normal files.
+ This can be used to share devices/named pipes/sockets between
+ hosts. This functionality will be expanded in later versions.
+
+ access there are four access modes.
+ user
+ if a user tries to access a file on v9fs
+ filesystem for the first time, v9fs sends an
+ attach command (Tattach) for that user.
+ This is the default mode.
+ <uid>
+ allows only user with uid=<uid> to access
+ the files on the mounted filesystem
+ any
+ v9fs does single attach and performs all
+ operations as one user
+ clien
+ ACL based access check on the 9p client
+ side for access validation
+
+ cachetag cache tag to use the specified persistent cache.
+ cache tags for existing cache sessions can be listed at
+ /sys/fs/9p/caches. (applies only to cache=fscache)
+ ============= ===============================================================
+
+Behavior
+========
+
+This section aims at describing 9p 'quirks' that can be different
+from a local filesystem behaviors.
+
+ - Setting O_NONBLOCK on a file will make client reads return as early
+ as the server returns some data instead of trying to fill the read
+ buffer with the requested amount of bytes or end of file is reached.
+
+Resources
+=========
+
+Protocol specifications are maintained on github:
+http://ericvh.github.com/9p-rfc/
+
+9p client and server implementations are listed on
+http://9p.cat-v.org/implementations
+
+A 9p2000.L server is being developed by LLNL and can be found
+at http://code.google.com/p/diod/
+
+There are user and developer mailing lists available through the v9fs project
+on sourceforge (http://sourceforge.net/projects/v9fs).
+
+News and other information is maintained on a Wiki.
+(http://sf.net/apps/mediawiki/v9fs/index.php).
+
+Bug reports are best issued via the mailing list.
+
+For more information on the Plan 9 Operating System check out
+http://plan9.bell-labs.com/plan9
+
+For information on Plan 9 from User Space (Plan 9 applications and libraries
+ported to Linux/BSD/OSX/etc) check out https://9fans.github.io/plan9port/
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
deleted file mode 100644
index fec7144..0000000
--- a/Documentation/filesystems/9p.txt
+++ /dev/null
@@ -1,161 +0,0 @@
- v9fs: Plan 9 Resource Sharing for Linux
- =======================================
-
-ABOUT
-=====
-
-v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
-
-This software was originally developed by Ron Minnich <rminnich@sandia.gov>
-and Maya Gokhale. Additional development by Greg Watson
-<gwatson@lanl.gov> and most recently Eric Van Hensbergen
-<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox
-<rsc@swtch.com>.
-
-The best detailed explanation of the Linux implementation and applications of
-the 9p client is available in the form of a USENIX paper:
- http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
-
-Other applications are described in the following papers:
- * XCPU & Clustering
- http://xcpu.org/papers/xcpu-talk.pdf
- * KVMFS: control file system for KVM
- http://xcpu.org/papers/kvmfs.pdf
- * CellFS: A New Programming Model for the Cell BE
- http://xcpu.org/papers/cellfs-talk.pdf
- * PROSE I/O: Using 9p to enable Application Partitions
- http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
- * VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
-
-USAGE
-=====
-
-For remote file server:
-
- mount -t 9p 10.10.1.2 /mnt/9
-
-For Plan 9 From User Space applications (http://swtch.com/plan9)
-
- mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
-
-For server running on QEMU host with virtio transport:
-
- mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-
-where mount_tag is the tag associated by the server to each of the exported
-mount points. Each 9P export is seen by the client as a virtio device with an
-associated "mount_tag" property. Available mount tags can be
-seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
-
-OPTIONS
-=======
-
- trans=name select an alternative transport. Valid options are
- currently:
- unix - specifying a named pipe mount point
- tcp - specifying a normal TCP/IP connection
- fd - used passed file descriptors for connection
- (see rfdno and wfdno)
- virtio - connect to the next virtio channel available
- (from QEMU with trans_virtio module)
- rdma - connect to a specified RDMA channel
-
- uname=name user name to attempt mount as on the remote server. The
- server may override or ignore this value. Certain user
- names may require authentication.
-
- aname=name aname specifies the file tree to access when the server is
- offering several exported file systems.
-
- cache=mode specifies a caching policy. By default, no caches are used.
- none = default no cache policy, metadata and data
- alike are synchronous.
- loose = no attempts are made at consistency,
- intended for exclusive, read-only mounts
- fscache = use FS-Cache for a persistent, read-only
- cache backend.
- mmap = minimal cache that is only used for read-write
- mmap. Northing else is cached, like cache=none
-
- debug=n specifies debug level. The debug level is a bitmask.
- 0x01 = display verbose error messages
- 0x02 = developer debug (DEBUG_CURRENT)
- 0x04 = display 9p trace
- 0x08 = display VFS trace
- 0x10 = display Marshalling debug
- 0x20 = display RPC debug
- 0x40 = display transport debug
- 0x80 = display allocation debug
- 0x100 = display protocol message debug
- 0x200 = display Fid debug
- 0x400 = display packet debug
- 0x800 = display fscache tracing debug
-
- rfdno=n the file descriptor for reading with trans=fd
-
- wfdno=n the file descriptor for writing with trans=fd
-
- msize=n the number of bytes to use for 9p packet payload
-
- port=n port to connect to on the remote server
-
- noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
-
- version=name Select 9P protocol version. Valid options are:
- 9p2000 - Legacy mode (same as noextend)
- 9p2000.u - Use 9P2000.u protocol
- 9p2000.L - Use 9P2000.L protocol
-
- dfltuid attempt to mount as a particular uid
-
- dfltgid attempt to mount with a particular gid
-
- afid security channel - used by Plan 9 authentication protocols
-
- nodevmap do not map special files - represent them as normal files.
- This can be used to share devices/named pipes/sockets between
- hosts. This functionality will be expanded in later versions.
-
- access there are four access modes.
- user = if a user tries to access a file on v9fs
- filesystem for the first time, v9fs sends an
- attach command (Tattach) for that user.
- This is the default mode.
- <uid> = allows only user with uid=<uid> to access
- the files on the mounted filesystem
- any = v9fs does single attach and performs all
- operations as one user
- client = ACL based access check on the 9p client
- side for access validation
-
- cachetag cache tag to use the specified persistent cache.
- cache tags for existing cache sessions can be listed at
- /sys/fs/9p/caches. (applies only to cache=fscache)
-
-RESOURCES
-=========
-
-Protocol specifications are maintained on github:
-http://ericvh.github.com/9p-rfc/
-
-9p client and server implementations are listed on
-http://9p.cat-v.org/implementations
-
-A 9p2000.L server is being developed by LLNL and can be found
-at http://code.google.com/p/diod/
-
-There are user and developer mailing lists available through the v9fs project
-on sourceforge (http://sourceforge.net/projects/v9fs).
-
-News and other information is maintained on a Wiki.
-(http://sf.net/apps/mediawiki/v9fs/index.php).
-
-Bug reports are best issued via the mailing list.
-
-For more information on the Plan 9 Operating System check out
-http://plan9.bell-labs.com/plan9
-
-For information on Plan 9 from User Space (Plan 9 applications and libraries
-ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
-
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.rst
similarity index 64%
rename from Documentation/filesystems/adfs.txt
rename to Documentation/filesystems/adfs.rst
index 5949766..5b22cae 100644
--- a/Documentation/filesystems/adfs.txt
+++ b/Documentation/filesystems/adfs.rst
@@ -1,6 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Acorn Disc Filing System - ADFS
+===============================
+
+Filesystems supported by ADFS
+-----------------------------
+
+The ADFS module supports the following Filecore formats which have:
+
+- new maps
+- new directories or big directories
+
+In terms of the named formats, this means we support:
+
+- E and E+, with or without boot block
+- F and F+
+
+We fully support reading files from these filesystems, and writing to
+existing files within their existing allocation. Essentially, we do
+not support changing any of the filesystem metadata.
+
+This is intended to support loopback mounted Linux native filesystems
+on a RISC OS Filecore filesystem, but will allow the data within files
+to be changed.
+
+If write support (ADFS_FS_RW) is configured, we allow rudimentary
+directory updates, specifically updating the access mode and timestamp.
+
Mount options for ADFS
----------------------
+ ============ ======================================================
uid=nnn All files in the partition will be owned by
user id nnn. Default 0 (root).
gid=nnn All files in the partition will be in group
@@ -12,22 +43,23 @@
ftsuffix=n When ftsuffix=0, no file type suffix will be applied.
When ftsuffix=1, a hexadecimal suffix corresponding to
the RISC OS file type will be added. Default 0.
+ ============ ======================================================
Mapping of ADFS permissions to Linux permissions
------------------------------------------------
ADFS permissions consist of the following:
- Owner read
- Owner write
- Other read
- Other write
+ - Owner read
+ - Owner write
+ - Other read
+ - Other write
(In older versions, an 'execute' permission did exist, but this
- does not hold the same meaning as the Linux 'execute' permission
- and is now obsolete).
+ does not hold the same meaning as the Linux 'execute' permission
+ and is now obsolete).
- The mapping is performed as follows:
+ The mapping is performed as follows::
Owner read -> -r--r--r--
Owner write -> --w--w---w
@@ -42,17 +74,18 @@
Possible other mode permissions -> ----rwxrwx
Hence, with the default masks, if a file is owner read/write, and
- not a UnixExec filetype, then the permissions will be:
+ not a UnixExec filetype, then the permissions will be::
-rw-------
However, if the masks were ownmask=0770,othmask=0007, then this would
- be modified to:
+ be modified to::
+
-rw-rw----
There is no restriction on what you can do with these masks. You may
wish that either read bits give read access to the file for all, but
- keep the default write protection (ownmask=0755,othmask=0577):
+ keep the default write protection (ownmask=0755,othmask=0577)::
-rw-r--r--
diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.rst
similarity index 85%
rename from Documentation/filesystems/affs.txt
rename to Documentation/filesystems/affs.rst
index a8f1a58..5776cbd 100644
--- a/Documentation/filesystems/affs.txt
+++ b/Documentation/filesystems/affs.rst
@@ -1,9 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
Overview of Amiga Filesystems
=============================
Not all varieties of the Amiga filesystems are supported for reading and
writing. The Amiga currently knows six different filesystems:
+============== ===============================================================
DOS\0 The old or original filesystem, not really suited for
hard disks and normally not used on them, either.
Supported read/write.
@@ -23,6 +27,7 @@
sense on hard disks. Supported read only.
DOS\5 The Fast File System with directory cache. Supported read only.
+============== ===============================================================
All of the above filesystems allow block sizes from 512 to 32K bytes.
Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks
@@ -36,14 +41,18 @@
Mount options for the AFFS
==========================
-protect If this option is set, the protection bits cannot be altered.
+protect
+ If this option is set, the protection bits cannot be altered.
-setuid[=uid] This sets the owner of all files and directories in the file
+setuid[=uid]
+ This sets the owner of all files and directories in the file
system to uid or the uid of the current user, respectively.
-setgid[=gid] Same as above, but for gid.
+setgid[=gid]
+ Same as above, but for gid.
-mode=mode Sets the mode flags to the given (octal) value, regardless
+mode=mode
+ Sets the mode flags to the given (octal) value, regardless
of the original permissions. Directories will get an x
permission if the corresponding r bit is set.
This is useful since most of the plain AmigaOS files
@@ -53,33 +62,41 @@
The file system will return an error when filename exceeds
standard maximum filename length (30 characters).
-reserved=num Sets the number of reserved blocks at the start of the
+reserved=num
+ Sets the number of reserved blocks at the start of the
partition to num. You should never need this option.
Default is 2.
-root=block Sets the block number of the root block. This should never
+root=block
+ Sets the block number of the root block. This should never
be necessary.
-bs=blksize Sets the blocksize to blksize. Valid block sizes are 512,
+bs=blksize
+ Sets the blocksize to blksize. Valid block sizes are 512,
1024, 2048 and 4096. Like the root option, this should
never be necessary, as the affs can figure it out itself.
-quiet The file system will not return an error for disallowed
+quiet
+ The file system will not return an error for disallowed
mode changes.
-verbose The volume name, file system type and block size will
+verbose
+ The volume name, file system type and block size will
be written to the syslog when the filesystem is mounted.
-mufs The filesystem is really a muFS, also it doesn't
+mufs
+ The filesystem is really a muFS, also it doesn't
identify itself as one. This option is necessary if
the filesystem wasn't formatted as muFS, but is used
as one.
-prefix=path Path will be prefixed to every absolute path name of
+prefix=path
+ Path will be prefixed to every absolute path name of
symbolic links on an AFFS partition. Default = "/".
(See below.)
-volume=name When symbolic links with an absolute path are created
+volume=name
+ When symbolic links with an absolute path are created
on an AFFS partition, name will be prepended as the
volume name. Default = "" (empty string).
(See below.)
@@ -123,7 +140,7 @@
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
-
+
Newly created files and directories will get the user and group ID
of the current user and a mode according to the umask.
@@ -152,11 +169,13 @@
Examples
========
-Command line:
+Command line::
+
mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose
mount /dev/sda3 /Amiga -t affs
-/etc/fstab entry:
+/etc/fstab entry::
+
/dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0
IMPORTANT NOTE
@@ -174,7 +193,8 @@
If the damage is already done, the following should fix the RDB
(where <disk> is the device name).
-DO AT YOUR OWN RISK:
+
+DO AT YOUR OWN RISK::
dd if=/dev/<disk> of=rdb.tmp count=1
cp rdb.tmp rdb.fixed
@@ -193,10 +213,14 @@
'nofilenametruncate' mount option can change that behavior.
Case is ignored by the affs in filename matching, but Linux shells
-do care about the case. Example (with /wb being an affs mounted fs):
+do care about the case. Example (with /wb being an affs mounted fs)::
+
rm /wb/WRONGCASE
-will remove /mnt/wrongcase, but
+
+will remove /mnt/wrongcase, but::
+
rm /wb/WR*
+
will not since the names are matched by the shell.
The block allocation is designed for hard disk partitions. If more
@@ -223,4 +247,4 @@
If you are interested in an Amiga Emulator for Linux, look at
-http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/
+http://web.archive.org/web/%2E/http://www.freiburg.linux.de/~uae/
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.rst
similarity index 88%
rename from Documentation/filesystems/afs.txt
rename to Documentation/filesystems/afs.rst
index 8c6ea7b..0abb155 100644
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.rst
@@ -1,8 +1,10 @@
- ====================
- kAFS: AFS FILESYSTEM
- ====================
+.. SPDX-License-Identifier: GPL-2.0
-Contents:
+====================
+kAFS: AFS FILESYSTEM
+====================
+
+.. Contents:
- Overview.
- Usage.
@@ -14,8 +16,7 @@
- The @sys substitution.
-========
-OVERVIEW
+Overview
========
This filesystem provides a fairly simple secure AFS filesystem driver. It is
@@ -35,35 +36,33 @@
(*) pioctl() system call.
-===========
-COMPILATION
+Compilation
===========
The filesystem should be enabled by turning on the kernel configuration
-options:
+options::
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
CONFIG_AFS - The AFS filesystem
-Additionally, the following can be turned on to aid debugging:
+Additionally, the following can be turned on to aid debugging::
CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled
CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled
They permit the debugging messages to be turned on dynamically by manipulating
-the masks in the following files:
+the masks in the following files::
/sys/module/af_rxrpc/parameters/debug
/sys/module/kafs/parameters/debug
-=====
-USAGE
+Usage
=====
When inserting the driver modules the root cell must be specified along with a
-list of volume location server IP addresses:
+list of volume location server IP addresses::
modprobe rxrpc
modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
@@ -71,20 +70,20 @@
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
- Documentation/networking/rxrpc.txt
+ Documentation/networking/rxrpc.rst
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
Once the module has been loaded, more modules can be added by the following
-procedure:
+procedure::
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
Where the parameters to the "add" command are the name of a cell and a list of
volume location servers within that cell, with the latter separated by colons.
-Filesystems can be mounted anywhere by commands similar to the following:
+Filesystems can be mounted anywhere by commands similar to the following::
mount -t afs "%cambridge.redhat.com:root.afs." /afs
mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge
@@ -104,8 +103,7 @@
Additional cells can be added through /proc (see later section).
-===========
-MOUNTPOINTS
+Mountpoints
===========
AFS has a concept of mountpoints. In AFS terms, these are specially formatted
@@ -123,42 +121,40 @@
unmounted, otherwise error EBUSY will be returned.
This can be used by the administrator to attempt to unmount the whole AFS tree
-mounted on /afs in one go by doing:
+mounted on /afs in one go by doing::
umount /afs
-============
-DYNAMIC ROOT
+Dynamic Root
============
A mount option is available to create a serverless mount that is only usable
-for dynamic lookup. Creating such a mount can be done by, for example:
+for dynamic lookup. Creating such a mount can be done by, for example::
mount -t afs none /afs -o dyn
This creates a mount that just has an empty directory at the root. Attempting
to look up a name in this directory will cause a mountpoint to be created that
-looks up a cell of the same name, for example:
+looks up a cell of the same name, for example::
ls /afs/grand.central.org/
-===============
-PROC FILESYSTEM
+Proc Filesystem
===============
The AFS modules creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module and
- their usage counts:
+ their usage counts::
[root@andromeda ~]# cat /proc/fs/afs/cells
USE NAME
3 cambridge.redhat.com
(*) A directory per cell that contains files that list volume location
- servers, volumes, and active servers known within that cell.
+ servers, volumes, and active servers known within that cell::
[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
USE ADDR STATE
@@ -171,8 +167,7 @@
1 Val 20000000 20000001 20000002 root.afs
-=================
-THE CELL DATABASE
+The Cell Database
=================
The filesystem maintains an internal database of all the cells it knows and the
@@ -181,7 +176,7 @@
"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
the kernel command line.
-Further cells can be added by commands similar to the following:
+Further cells can be added by commands similar to the following::
echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells
echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells
@@ -189,26 +184,25 @@
No other cell database operations are available at this time.
-========
-SECURITY
+Security
========
Secure operations are initiated by acquiring a key using the klog program. A
very primitive klog program is available at:
- http://people.redhat.com/~dhowells/rxrpc/klog.c
+ https://people.redhat.com/~dhowells/rxrpc/klog.c
-This should be compiled by:
+This should be compiled by::
make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"
-And then run as:
+And then run as::
./klog
Assuming it's successful, this adds a key of type RxRPC, named for the service
and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
-by cat'ing /proc/keys:
+by cat'ing /proc/keys::
[root@andromeda ~]# keyctl show
Session Keyring
@@ -232,20 +226,19 @@
open the file.
-=====================
-THE @SYS SUBSTITUTION
+The @sys Substitution
=====================
The list of up to 16 @sys substitutions for the current network namespace can
-be configured by writing a list to /proc/fs/afs/sysname:
+be configured by writing a list to /proc/fs/afs/sysname::
[root@andromeda ~]# echo foo amd64_linux_26 >/proc/fs/afs/sysname
-or cleared entirely by writing an empty list:
+or cleared entirely by writing an empty list::
[root@andromeda ~]# echo >/proc/fs/afs/sysname
-The current list for current network namespace can be retrieved by:
+The current list for current network namespace can be retrieved by::
[root@andromeda ~]# cat /proc/fs/afs/sysname
foo
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index bbb0c1c..a94f17d 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -86,9 +86,6 @@
.. kernel-doc:: fs/dax.c
:export:
-.. kernel-doc:: fs/direct-io.c
- :export:
-
.. kernel-doc:: fs/libfs.c
:export:
diff --git a/Documentation/filesystems/autofs-mount-control.txt b/Documentation/filesystems/autofs-mount-control.rst
similarity index 88%
rename from Documentation/filesystems/autofs-mount-control.txt
rename to Documentation/filesystems/autofs-mount-control.rst
index acc02fc..bf4b511 100644
--- a/Documentation/filesystems/autofs-mount-control.txt
+++ b/Documentation/filesystems/autofs-mount-control.rst
@@ -1,4 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+====================================================================
Miscellaneous Device control operations for the autofs kernel module
====================================================================
@@ -36,24 +38,24 @@
module source you will see a third type called an offset, which is just
a direct mount in disguise) and indirect.
-Here is a master map with direct and indirect map entries:
+Here is a master map with direct and indirect map entries::
-/- /etc/auto.direct
-/test /etc/auto.indirect
+ /- /etc/auto.direct
+ /test /etc/auto.indirect
-and the corresponding map files:
+and the corresponding map files::
-/etc/auto.direct:
+ /etc/auto.direct:
-/automount/dparse/g6 budgie:/autofs/export1
-/automount/dparse/g1 shark:/autofs/export1
-and so on.
+ /automount/dparse/g6 budgie:/autofs/export1
+ /automount/dparse/g1 shark:/autofs/export1
+ and so on.
-/etc/auto.indirect:
+/etc/auto.indirect::
-g1 shark:/autofs/export1
-g6 budgie:/autofs/export1
-and so on.
+ g1 shark:/autofs/export1
+ g6 budgie:/autofs/export1
+ and so on.
For the above indirect map an autofs file system is mounted on /test and
mounts are triggered for each sub-directory key by the inode lookup
@@ -69,23 +71,23 @@
But, each entry in direct and indirect maps can have offsets (making
them multi-mount map entries).
-For example, an indirect mount map entry could also be:
+For example, an indirect mount map entry could also be::
-g1 \
- / shark:/autofs/export5/testing/test \
- /s1 shark:/autofs/export/testing/test/s1 \
- /s2 shark:/autofs/export5/testing/test/s2 \
- /s1/ss1 shark:/autofs/export1 \
- /s2/ss2 shark:/autofs/export2
+ g1 \
+ / shark:/autofs/export5/testing/test \
+ /s1 shark:/autofs/export/testing/test/s1 \
+ /s2 shark:/autofs/export5/testing/test/s2 \
+ /s1/ss1 shark:/autofs/export1 \
+ /s2/ss2 shark:/autofs/export2
-and a similarly a direct mount map entry could also be:
+and a similarly a direct mount map entry could also be::
-/automount/dparse/g1 \
- / shark:/autofs/export5/testing/test \
- /s1 shark:/autofs/export/testing/test/s1 \
- /s2 shark:/autofs/export5/testing/test/s2 \
- /s1/ss1 shark:/autofs/export2 \
- /s2/ss2 shark:/autofs/export2
+ /automount/dparse/g1 \
+ / shark:/autofs/export5/testing/test \
+ /s1 shark:/autofs/export/testing/test/s1 \
+ /s2 shark:/autofs/export5/testing/test/s2 \
+ /s1/ss1 shark:/autofs/export2 \
+ /s2/ss2 shark:/autofs/export2
One of the issues with version 4 of autofs was that, when mounting an
entry with a large number of offsets, possibly with nesting, we needed
@@ -170,32 +172,32 @@
The control interface is opening a device node, typically /dev/autofs.
All the ioctls use a common structure to pass the needed parameter
-information and return operation results:
+information and return operation results::
-struct autofs_dev_ioctl {
- __u32 ver_major;
- __u32 ver_minor;
- __u32 size; /* total size of data passed in
- * including this struct */
- __s32 ioctlfd; /* automount command fd */
+ struct autofs_dev_ioctl {
+ __u32 ver_major;
+ __u32 ver_minor;
+ __u32 size; /* total size of data passed in
+ * including this struct */
+ __s32 ioctlfd; /* automount command fd */
- /* Command parameters */
- union {
- struct args_protover protover;
- struct args_protosubver protosubver;
- struct args_openmount openmount;
- struct args_ready ready;
- struct args_fail fail;
- struct args_setpipefd setpipefd;
- struct args_timeout timeout;
- struct args_requester requester;
- struct args_expire expire;
- struct args_askumount askumount;
- struct args_ismountpoint ismountpoint;
- };
+ /* Command parameters */
+ union {
+ struct args_protover protover;
+ struct args_protosubver protosubver;
+ struct args_openmount openmount;
+ struct args_ready ready;
+ struct args_fail fail;
+ struct args_setpipefd setpipefd;
+ struct args_timeout timeout;
+ struct args_requester requester;
+ struct args_expire expire;
+ struct args_askumount askumount;
+ struct args_ismountpoint ismountpoint;
+ };
- char path[0];
-};
+ char path[0];
+ };
The ioctlfd field is a mount point file descriptor of an autofs mount
point. It is returned by the open call and is used by all calls except
@@ -212,7 +214,7 @@
structure sent from user space.
This structure can be initialized before setting specific fields by using
-the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *).
+the void function call init_autofs_dev_ioctl(``struct autofs_dev_ioctl *``).
All of the ioctls perform a copy of this structure from user space to
kernel space and return -EINVAL if the size parameter is smaller than
@@ -389,7 +391,7 @@
set to an autofs mount type. The call returns 1 if this is a mount point
and sets out.devid field to the device number of the mount and out.magic
field to the relevant super block magic number (described below) or 0 if
-it isn't a mountpoint. In both cases the the device number (as returned
+it isn't a mountpoint. In both cases the device number (as returned
by new_encode_dev()) is returned in out.devid field.
If supplied with a file descriptor we're looking for a specific mount,
@@ -397,12 +399,12 @@
the descriptor corresponds to is considered a mountpoint if it is itself
a mountpoint or contains a mount, such as a multi-mount without a root
mount. In this case we return 1 if the descriptor corresponds to a mount
-point and and also returns the super magic of the covering mount if there
+point and also returns the super magic of the covering mount if there
is one or 0 if it isn't a mountpoint.
If a path is supplied (and the ioctlfd field is set to -1) then the path
is looked up and is checked to see if it is the root of a mount. If a
type is also given we are looking for a particular autofs mount and if
-a match isn't found a fail is returned. If the the located path is the
+a match isn't found a fail is returned. If the located path is the
root of a mount 1 is returned along with the super magic of the mount
or 0 otherwise.
diff --git a/Documentation/filesystems/autofs.txt b/Documentation/filesystems/autofs.rst
similarity index 77%
rename from Documentation/filesystems/autofs.txt
rename to Documentation/filesystems/autofs.rst
index 3af38c7..681c6a4 100644
--- a/Documentation/filesystems/autofs.txt
+++ b/Documentation/filesystems/autofs.rst
@@ -1,12 +1,9 @@
-<head>
-<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
-</head>
-
+=====================
autofs - how it works
=====================
Purpose
--------
+=======
The goal of autofs is to provide on-demand mounting and race free
automatic unmounting of various other filesystems. This provides two
@@ -28,7 +25,7 @@
first accessed a name.
Context
--------
+=======
The "autofs" filesystem module is only one part of an autofs system.
There also needs to be a user-space program which looks up names
@@ -43,7 +40,7 @@
can each be managed separately, or all managed by the same daemon.
Content
--------
+=======
An autofs filesystem can contain 3 sorts of objects: directories,
symbolic links and mount traps. Mount traps are directories with
@@ -52,9 +49,10 @@
Objects can only be created by the automount daemon: symlinks are
created with a regular `symlink` system call, while directories and
mount traps are created with `mkdir`. The determination of whether a
-directory should be a mount trap or not is quite _ad hoc_, largely for
-historical reasons, and is determined in part by the
-*direct*/*indirect*/*offset* mount options, and the *maxproto* mount option.
+directory should be a mount trap is based on a master map. This master
+map is consulted by autofs to determine which directories are mount
+points. Mount points can be *direct*/*indirect*/*offset*.
+On most systems, the default master map is located at */etc/auto.master*.
If neither the *direct* or *offset* mount options are given (so the
mount is considered to be *indirect*), then the root directory is
@@ -80,7 +78,7 @@
and whether the mount was *indirect* or not.
Mount Traps
----------------
+===========
A core element of the implementation of autofs is the Mount Traps
which are provided by the Linux VFS. Any directory provided by a
@@ -201,7 +199,7 @@
Mountpoint expiry
------------------
+=================
The VFS has a mechanism for automatically expiring unused mounts,
much as it can expire any unused dentry information from the dcache.
@@ -301,7 +299,7 @@
necessary), or has been aborted.
Communicating with autofs: detecting the daemon
------------------------------------------------
+===============================================
There are several forms of communication between the automount daemon
and the filesystem. As we have already seen, the daemon can create and
@@ -317,33 +315,39 @@
provided through an ioctl as will be described below.
Communicating with autofs: the event pipe
------------------------------------------
+=========================================
When an autofs filesystem is mounted, the 'write' end of a pipe must
be passed using the 'fd=' mount option. autofs will write
notification messages to this pipe for the daemon to respond to.
-For version 5, the format of the message is:
+For version 5, the format of the message is::
- struct autofs_v5_packet {
- int proto_version; /* Protocol version */
- int type; /* Type of packet */
- autofs_wqt_t wait_queue_token;
- __u32 dev;
- __u64 ino;
- __u32 uid;
- __u32 gid;
- __u32 pid;
- __u32 tgid;
- __u32 len;
- char name[NAME_MAX+1];
+ struct autofs_v5_packet {
+ struct autofs_packet_hdr hdr;
+ autofs_wqt_t wait_queue_token;
+ __u32 dev;
+ __u64 ino;
+ __u32 uid;
+ __u32 gid;
+ __u32 pid;
+ __u32 tgid;
+ __u32 len;
+ char name[NAME_MAX+1];
};
-where the type is one of
+And the format of the header is::
- autofs_ptype_missing_indirect
- autofs_ptype_expire_indirect
- autofs_ptype_missing_direct
- autofs_ptype_expire_direct
+ struct autofs_packet_hdr {
+ int proto_version; /* Protocol version */
+ int type; /* Type of packet */
+ };
+
+where the type is one of ::
+
+ autofs_ptype_missing_indirect
+ autofs_ptype_expire_indirect
+ autofs_ptype_missing_direct
+ autofs_ptype_expire_direct
so messages can indicate that a name is missing (something tried to
access it but it isn't there) or that it has been selected for expiry.
@@ -360,7 +364,7 @@
`wait_queue_token`.
Communicating with autofs: root directory ioctls
-------------------------------------------------
+================================================
The root directory of an autofs filesystem will respond to a number of
ioctls. The process issuing the ioctl must have the CAP_SYS_ADMIN
@@ -368,58 +372,66 @@
The available ioctl commands are:
-- **AUTOFS_IOC_READY**: a notification has been handled. The argument
- to the ioctl command is the "wait_queue_token" number
- corresponding to the notification being acknowledged.
-- **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with
- the error code `ENOENT`.
-- **AUTOFS_IOC_CATATONIC**: Causes the autofs to enter "catatonic"
- mode meaning that it stops sending notifications to the daemon.
- This mode is also entered if a write to the pipe fails.
-- **AUTOFS_IOC_PROTOVER**: This returns the protocol version in use.
-- **AUTOFS_IOC_PROTOSUBVER**: Returns the protocol sub-version which
- is really a version number for the implementation.
-- **AUTOFS_IOC_SETTIMEOUT**: This passes a pointer to an unsigned
- long. The value is used to set the timeout for expiry, and
- the current timeout value is stored back through the pointer.
-- **AUTOFS_IOC_ASKUMOUNT**: Returns, in the pointed-to `int`, 1 if
- the filesystem could be unmounted. This is only a hint as
- the situation could change at any instant. This call can be
- used to avoid a more expensive full unmount attempt.
-- **AUTOFS_IOC_EXPIRE**: as described above, this asks if there is
- anything suitable to expire. A pointer to a packet:
+- **AUTOFS_IOC_READY**:
+ a notification has been handled. The argument
+ to the ioctl command is the "wait_queue_token" number
+ corresponding to the notification being acknowledged.
+- **AUTOFS_IOC_FAIL**:
+ similar to above, but indicates failure with
+ the error code `ENOENT`.
+- **AUTOFS_IOC_CATATONIC**:
+ Causes the autofs to enter "catatonic"
+ mode meaning that it stops sending notifications to the daemon.
+ This mode is also entered if a write to the pipe fails.
+- **AUTOFS_IOC_PROTOVER**:
+ This returns the protocol version in use.
+- **AUTOFS_IOC_PROTOSUBVER**:
+ Returns the protocol sub-version which
+ is really a version number for the implementation.
+- **AUTOFS_IOC_SETTIMEOUT**:
+ This passes a pointer to an unsigned
+ long. The value is used to set the timeout for expiry, and
+ the current timeout value is stored back through the pointer.
+- **AUTOFS_IOC_ASKUMOUNT**:
+ Returns, in the pointed-to `int`, 1 if
+ the filesystem could be unmounted. This is only a hint as
+ the situation could change at any instant. This call can be
+ used to avoid a more expensive full unmount attempt.
+- **AUTOFS_IOC_EXPIRE**:
+ as described above, this asks if there is
+ anything suitable to expire. A pointer to a packet::
- struct autofs_packet_expire_multi {
- int proto_version; /* Protocol version */
- int type; /* Type of packet */
- autofs_wqt_t wait_queue_token;
- int len;
- char name[NAME_MAX+1];
- };
+ struct autofs_packet_expire_multi {
+ struct autofs_packet_hdr hdr;
+ autofs_wqt_t wait_queue_token;
+ int len;
+ char name[NAME_MAX+1];
+ };
- is required. This is filled in with the name of something
- that can be unmounted or removed. If nothing can be expired,
- `errno` is set to `EAGAIN`. Even though a `wait_queue_token`
- is present in the structure, no "wait queue" is established
- and no acknowledgment is needed.
-- **AUTOFS_IOC_EXPIRE_MULTI**: This is similar to
- **AUTOFS_IOC_EXPIRE** except that it causes notification to be
- sent to the daemon, and it blocks until the daemon acknowledges.
- The argument is an integer which can contain two different flags.
+ is required. This is filled in with the name of something
+ that can be unmounted or removed. If nothing can be expired,
+ `errno` is set to `EAGAIN`. Even though a `wait_queue_token`
+ is present in the structure, no "wait queue" is established
+ and no acknowledgment is needed.
+- **AUTOFS_IOC_EXPIRE_MULTI**:
+ This is similar to
+ **AUTOFS_IOC_EXPIRE** except that it causes notification to be
+ sent to the daemon, and it blocks until the daemon acknowledges.
+ The argument is an integer which can contain two different flags.
- **AUTOFS_EXP_IMMEDIATE** causes `last_used` time to be ignored
- and objects are expired if the are not in use.
+ **AUTOFS_EXP_IMMEDIATE** causes `last_used` time to be ignored
+ and objects are expired if the are not in use.
- **AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
- that the daemon has requested this because it is capable of
- performing the umount.
+ **AUTOFS_EXP_FORCED** causes the in use status to be ignored
+ and objects are expired ieven if they are in use. This assumes
+ that the daemon has requested this because it is capable of
+ performing the umount.
- **AUTOFS_EXP_LEAVES** will select a leaf rather than a top-level
- name to expire. This is only safe when *maxproto* is 4.
+ **AUTOFS_EXP_LEAVES** will select a leaf rather than a top-level
+ name to expire. This is only safe when *maxproto* is 4.
Communicating with autofs: char-device ioctls
----------------------------------------------
+=============================================
It is not always possible to open the root of an autofs filesystem,
particularly a *direct* mounted filesystem. If the automount daemon
@@ -429,9 +441,9 @@
which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
-The `ioctl`s that can be used on this device are described in a separate
+The 'ioctl's that can be used on this device are described in a separate
document `autofs-mount-control.txt`, and are summarised briefly here.
-Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure:
+Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct autofs_dev_ioctl {
__u32 ver_major;
@@ -469,41 +481,50 @@
Commands are:
-- **AUTOFS_DEV_IOCTL_VERSION_CMD**: does nothing, except validate and
- set version numbers.
-- **AUTOFS_DEV_IOCTL_OPENMOUNT_CMD**: return an open file descriptor
- on the root of an autofs filesystem. The filesystem is identified
- by name and device number, which is stored in `openmount.devid`.
- Device numbers for existing filesystems can be found in
- `/proc/self/mountinfo`.
-- **AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD**: same as `close(ioctlfd)`.
-- **AUTOFS_DEV_IOCTL_SETPIPEFD_CMD**: if the filesystem is in
- catatonic mode, this can provide the write end of a new pipe
- in `setpipefd.pipefd` to re-establish communication with a daemon.
- The process group of the calling process is used to identify the
- daemon.
-- **AUTOFS_DEV_IOCTL_REQUESTER_CMD**: `path` should be a
- name within the filesystem that has been auto-mounted on.
- On successful return, `requester.uid` and `requester.gid` will be
- the UID and GID of the process which triggered that mount.
-- **AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD**: Check if path is a
- mountpoint of a particular type - see separate documentation for
- details.
-- **AUTOFS_DEV_IOCTL_PROTOVER_CMD**:
-- **AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD**:
-- **AUTOFS_DEV_IOCTL_READY_CMD**:
-- **AUTOFS_DEV_IOCTL_FAIL_CMD**:
-- **AUTOFS_DEV_IOCTL_CATATONIC_CMD**:
-- **AUTOFS_DEV_IOCTL_TIMEOUT_CMD**:
-- **AUTOFS_DEV_IOCTL_EXPIRE_CMD**:
-- **AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD**: These all have the same
- function as the similarly named **AUTOFS_IOC** ioctls, except
- that **FAIL** can be given an explicit error number in `fail.status`
- instead of assuming `ENOENT`, and this **EXPIRE** command
- corresponds to **AUTOFS_IOC_EXPIRE_MULTI**.
+- **AUTOFS_DEV_IOCTL_VERSION_CMD**:
+ does nothing, except validate and
+ set version numbers.
+- **AUTOFS_DEV_IOCTL_OPENMOUNT_CMD**:
+ return an open file descriptor
+ on the root of an autofs filesystem. The filesystem is identified
+ by name and device number, which is stored in `openmount.devid`.
+ Device numbers for existing filesystems can be found in
+ `/proc/self/mountinfo`.
+- **AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD**:
+ same as `close(ioctlfd)`.
+- **AUTOFS_DEV_IOCTL_SETPIPEFD_CMD**:
+ if the filesystem is in
+ catatonic mode, this can provide the write end of a new pipe
+ in `setpipefd.pipefd` to re-establish communication with a daemon.
+ The process group of the calling process is used to identify the
+ daemon.
+- **AUTOFS_DEV_IOCTL_REQUESTER_CMD**:
+ `path` should be a
+ name within the filesystem that has been auto-mounted on.
+ On successful return, `requester.uid` and `requester.gid` will be
+ the UID and GID of the process which triggered that mount.
+- **AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD**:
+ Check if path is a
+ mountpoint of a particular type - see separate documentation for
+ details.
+
+- **AUTOFS_DEV_IOCTL_PROTOVER_CMD**
+- **AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD**
+- **AUTOFS_DEV_IOCTL_READY_CMD**
+- **AUTOFS_DEV_IOCTL_FAIL_CMD**
+- **AUTOFS_DEV_IOCTL_CATATONIC_CMD**
+- **AUTOFS_DEV_IOCTL_TIMEOUT_CMD**
+- **AUTOFS_DEV_IOCTL_EXPIRE_CMD**
+- **AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD**
+
+These all have the same
+function as the similarly named **AUTOFS_IOC** ioctls, except
+that **FAIL** can be given an explicit error number in `fail.status`
+instead of assuming `ENOENT`, and this **EXPIRE** command
+corresponds to **AUTOFS_IOC_EXPIRE_MULTI**.
Catatonic mode
---------------
+==============
As mentioned, an autofs mount can enter "catatonic" mode. This
happens if a write to the notification pipe fails, or if it is
@@ -527,7 +548,7 @@
**AUTOFS_DEV_IOCTL_OPENMOUNT_CMD** ioctl on the `/dev/autofs`.
The "ignore" mount option
--------------------------
+=========================
The "ignore" mount option can be used to provide a generic indicator
to applications that the mount entry should be ignored when displaying
@@ -542,18 +563,18 @@
mounts from consideration when reading the mounts list.
autofs, name spaces, and shared mounts
---------------------------------------
+======================================
With bind mounts and name spaces it is possible for an autofs
filesystem to appear at multiple places in one or more filesystem
name spaces. For this to work sensibly, the autofs filesystem should
-always be mounted "shared". e.g.
+always be mounted "shared". e.g. ::
-> `mount --make-shared /autofs/mount/point`
+ mount --make-shared /autofs/mount/point
The automount daemon is only able to manage a single mount location for
an autofs filesystem and if mounts on that are not 'shared', other
locations will not behave as expected. In particular access to those
-other locations will likely result in the `ELOOP` error
+other locations will likely result in the `ELOOP` error ::
-> Too many levels of symbolic links
+ Too many levels of symbolic links
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.rst
similarity index 89%
rename from Documentation/filesystems/automount-support.txt
rename to Documentation/filesystems/automount-support.rst
index b0afd3d..430f0b4 100644
--- a/Documentation/filesystems/automount-support.txt
+++ b/Documentation/filesystems/automount-support.rst
@@ -1,3 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Automount Support
+=================
+
+
Support is available for filesystems that wish to do automounting
support (such as kAFS which can be found in fs/afs/ and NFS in
fs/nfs/). This facility includes allowing in-kernel mounts to be
@@ -5,13 +12,12 @@
also be requested by userspace.
-======================
-IN-KERNEL AUTOMOUNTING
+In-Kernel Automounting
======================
-See section "Mount Traps" of Documentation/filesystems/autofs.txt
+See section "Mount Traps" of Documentation/filesystems/autofs.rst
-Then from userspace, you can just do something like:
+Then from userspace, you can just do something like::
[root@andromeda root]# mount -t afs \#root.afs. /afs
[root@andromeda root]# ls /afs
@@ -21,7 +27,7 @@
[root@andromeda root]# ls /afs/cambridge/afsdoc/
ChangeLog html LICENSE pdf RELNOTES-1.2.2
-And then if you look in the mountpoint catalogue, you'll see something like:
+And then if you look in the mountpoint catalogue, you'll see something like::
[root@andromeda root]# cat /proc/mounts
...
@@ -30,8 +36,7 @@
#afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0
-===========================
-AUTOMATIC MOUNTPOINT EXPIRY
+Automatic Mountpoint Expiry
===========================
Automatic expiration of mountpoints is easy, provided you've mounted the
@@ -43,7 +48,8 @@
hung.
(2) When a new mountpoint is created in the ->d_automount method, add
- the mnt to the list using mnt_set_expiry()
+ the mnt to the list using mnt_set_expiry()::
+
mnt_set_expiry(newmnt, &afs_vfsmounts);
(3) When you want mountpoints to be expired, call mark_mounts_for_expiry()
@@ -70,8 +76,7 @@
same expiration list.
-=======================
-USERSPACE DRIVEN EXPIRY
+Userspace Driven Expiry
=======================
As an alternative, it is possible for userspace to request expiry of any
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.rst
similarity index 83%
rename from Documentation/filesystems/befs.txt
rename to Documentation/filesystems/befs.rst
index da45e6c..79f9740 100644
--- a/Documentation/filesystems/befs.txt
+++ b/Documentation/filesystems/befs.rst
@@ -1,48 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
BeOS filesystem for Linux
+=========================
Document last updated: Dec 6, 2001
-WARNING
+Warning
=======
Make sure you understand that this is alpha software. This means that the
-implementation is neither complete nor well-tested.
+implementation is neither complete nor well-tested.
I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE!
-LICENSE
-=====
-This software is covered by the GNU General Public License.
+License
+=======
+This software is covered by the GNU General Public License.
See the file COPYING for the complete text of the license.
Or the GNU website: <http://www.gnu.org/licenses/licenses.html>
-AUTHOR
-=====
+Author
+======
The largest part of the code written by Will Dyson <will_dyson@pobox.com>
He has been working on the code since Aug 13, 2001. See the changelog for
details.
Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp>
+
His original code can still be found at:
<http://hp.vector.co.jp/authors/VA008030/bfs/>
+
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
This filesystem doesn't have a maintainer.
-WHAT IS THIS DRIVER?
-==================
-This module implements the native filesystem of BeOS http://www.beincorporated.com/
+What is this Driver?
+====================
+This module implements the native filesystem of BeOS http://www.beincorporated.com/
for the linux 2.4.1 and later kernels. Currently it is a read-only
implementation.
Which is it, BFS or BEFS?
-================
-Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
+=========================
+Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS".
But Unixware Boot Filesystem is called bfs, too. And they are already in
the kernel. Because of this naming conflict, on Linux the BeOS
filesystem is called befs.
-HOW TO INSTALL
+How to Install
==============
step 1. Install the BeFS patch into the source code tree of linux.
@@ -54,16 +60,16 @@
patch -p1 < /path/to/patch-befs-xxx
if the patching step fails (i.e. there are rejected hunks), you can try to
-figure it out yourself (it shouldn't be hard), or mail the maintainer
+figure it out yourself (it shouldn't be hard), or mail the maintainer
(Will Dyson <will_dyson@pobox.com>) for help.
step 2. Configuration & make kernel
The linux kernel has many compile-time options. Most of them are beyond the
scope of this document. I suggest the Kernel-HOWTO document as a good general
-reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
+reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html
-However, to use the BeFS module, you must enable it at configure time.
+However, to use the BeFS module, you must enable it at configure time::
cd /foo/bar/linux
make menuconfig (or xconfig)
@@ -82,35 +88,40 @@
See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for
instructions on this critical step.
-USING BFS
+Using BFS
=========
To use the BeOS filesystem, use filesystem type 'befs'.
-ex)
+ex::
+
mount -t befs /dev/fd0 /beos
-MOUNT OPTIONS
+Mount Options
=============
+
+============= ===========================================================
uid=nnn All files in the partition will be owned by user id nnn.
gid=nnn All files in the partition will be in group nnn.
iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
+============= ===========================================================
-HOW TO GET LASTEST VERSION
+How to Get Lastest Version
==========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
-ANY KNOWN BUGS?
-===========
+Any Known Bugs?
+===============
As of Jan 20, 2002:
-
+
None
-SPECIAL THANKS
+Special Thanks
==============
Dominic Giampalo ... Writing "Practical file system design with Be filesystem"
+
Hiroyuki Yamada ... Testing LinuxPPC.
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.rst
similarity index 71%
rename from Documentation/filesystems/bfs.txt
rename to Documentation/filesystems/bfs.rst
index 843ce91..ce14b90 100644
--- a/Documentation/filesystems/bfs.txt
+++ b/Documentation/filesystems/bfs.rst
@@ -1,4 +1,7 @@
-BFS FILESYSTEM FOR LINUX
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+BFS Filesystem for Linux
========================
The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which
@@ -9,22 +12,22 @@
know the partition number and the kernel must support UnixWare disk slices
(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not
depend on having UnixWare disklabel support because one can also mount
-BFS filesystem via loopback:
+BFS filesystem via loopback::
-# losetup /dev/loop0 stand.img
-# mount -t bfs /dev/loop0 /mnt/stand
+ # losetup /dev/loop0 stand.img
+ # mount -t bfs /dev/loop0 /mnt/stand
-where stand.img is a file containing the image of BFS filesystem.
+where stand.img is a file containing the image of BFS filesystem.
When you have finished using it and umounted you need to also deallocate
-/dev/loop0 device by:
+/dev/loop0 device by::
-# losetup -d /dev/loop0
+ # losetup -d /dev/loop0
-You can simplify mounting by just typing:
+You can simplify mounting by just typing::
-# mount -t bfs -o loop stand.img /mnt/stand
+ # mount -t bfs -o loop stand.img /mnt/stand
-this will allocate the first available loopback device (and load loop.o
+this will allocate the first available loopback device (and load loop.o
kernel module if necessary) automatically. If the loopback driver is not
loaded automatically, make sure that you have compiled the module and
that modprobe is functioning. Beware that umount will not deallocate
@@ -33,21 +36,21 @@
losetup(8). Read losetup(8) manpage for more info.
To create the BFS image under UnixWare you need to find out first which
-slice contains it. The command prtvtoc(1M) is your friend:
+slice contains it. The command prtvtoc(1M) is your friend::
-# prtvtoc /dev/rdsk/c0b0t0d0s0
+ # prtvtoc /dev/rdsk/c0b0t0d0s0
(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you
look for the slice with tag "STAND", which is usually slice 10. With this
-information you can use dd(1) to create the BFS image:
+information you can use dd(1) to create the BFS image::
-# umount /stand
-# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
+ # umount /stand
+ # dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512
Just in case, you can verify that you have done the right thing by checking
-the magic number:
+the magic number::
-# od -Ad -tx4 stand.img | more
+ # od -Ad -tx4 stand.img | more
The first 4 bytes should be 0x1badface.
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.rst
similarity index 95%
rename from Documentation/filesystems/btrfs.txt
rename to Documentation/filesystems/btrfs.rst
index f9dad22..d0904f6 100644
--- a/Documentation/filesystems/btrfs.txt
+++ b/Documentation/filesystems/btrfs.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
BTRFS
=====
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.rst
similarity index 87%
rename from Documentation/filesystems/caching/backend-api.txt
rename to Documentation/filesystems/caching/backend-api.rst
index c418280..19fbf6b 100644
--- a/Documentation/filesystems/caching/backend-api.txt
+++ b/Documentation/filesystems/caching/backend-api.rst
@@ -1,6 +1,8 @@
- ==========================
- FS-CACHE CACHE BACKEND API
- ==========================
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+FS-Cache Cache backend API
+==========================
The FS-Cache system provides an API by which actual caches can be supplied to
FS-Cache for it to then serve out to network filesystems and other interested
@@ -9,15 +11,14 @@
This API is declared in <linux/fscache-cache.h>.
-====================================
-INITIALISING AND REGISTERING A CACHE
+Initialising and Registering a Cache
====================================
To start off, a cache definition must be initialised and registered for each
cache the backend wants to make available. For instance, CacheFS does this in
the fill_super() operation on mounting.
-The cache definition (struct fscache_cache) should be initialised by calling:
+The cache definition (struct fscache_cache) should be initialised by calling::
void fscache_init_cache(struct fscache_cache *cache,
struct fscache_cache_ops *ops,
@@ -26,17 +27,17 @@
Where:
- (*) "cache" is a pointer to the cache definition;
+ * "cache" is a pointer to the cache definition;
- (*) "ops" is a pointer to the table of operations that the backend supports on
+ * "ops" is a pointer to the table of operations that the backend supports on
this cache; and
- (*) "idfmt" is a format and printf-style arguments for constructing a label
+ * "idfmt" is a format and printf-style arguments for constructing a label
for the cache.
The cache should then be registered with FS-Cache by passing a pointer to the
-previously initialised cache definition to:
+previously initialised cache definition to::
int fscache_add_cache(struct fscache_cache *cache,
struct fscache_object *fsdef,
@@ -44,12 +45,12 @@
Two extra arguments should also be supplied:
- (*) "fsdef" which should point to the object representation for the FS-Cache
+ * "fsdef" which should point to the object representation for the FS-Cache
master index in this cache. Netfs primary index entries will be created
here. FS-Cache keeps the caller's reference to the index object if
successful and will release it upon withdrawal of the cache.
- (*) "tagname" which, if given, should be a text string naming this cache. If
+ * "tagname" which, if given, should be a text string naming this cache. If
this is NULL, the identifier will be used instead. For CacheFS, the
identifier is set to name the underlying block device and the tag can be
supplied by mount.
@@ -58,20 +59,18 @@
is already in use. 0 will be returned on success.
-=====================
-UNREGISTERING A CACHE
+Unregistering a Cache
=====================
A cache can be withdrawn from the system by calling this function with a
-pointer to the cache definition:
+pointer to the cache definition::
void fscache_withdraw_cache(struct fscache_cache *cache);
In CacheFS's case, this is called by put_super().
-========
-SECURITY
+Security
========
The cache methods are executed one of two contexts:
@@ -89,8 +88,7 @@
This is left to the cache to handle; FS-Cache makes no effort in this regard.
-===================================
-CONTROL AND STATISTICS PRESENTATION
+Control and Statistics Presentation
===================================
The cache may present data to the outside world through FS-Cache's interfaces
@@ -101,11 +99,10 @@
and is for use by the cache as it sees fit.
-========================
-RELEVANT DATA STRUCTURES
+Relevant Data Structures
========================
- (*) Index/Data file FS-Cache representation cookie:
+ * Index/Data file FS-Cache representation cookie::
struct fscache_cookie {
struct fscache_object_def *def;
@@ -121,7 +118,7 @@
cache operations.
- (*) In-cache object representation:
+ * In-cache object representation::
struct fscache_object {
int debug_id;
@@ -150,7 +147,7 @@
initialised by calling fscache_object_init(object).
- (*) FS-Cache operation record:
+ * FS-Cache operation record::
struct fscache_operation {
atomic_t usage;
@@ -173,7 +170,7 @@
an operation needs more processing time, it should be enqueued again.
- (*) FS-Cache retrieval operation record:
+ * FS-Cache retrieval operation record::
struct fscache_retrieval {
struct fscache_operation op;
@@ -198,7 +195,7 @@
it sees fit.
- (*) FS-Cache storage operation record:
+ * FS-Cache storage operation record::
struct fscache_storage {
struct fscache_operation op;
@@ -212,16 +209,17 @@
storage.
-================
-CACHE OPERATIONS
+Cache Operations
================
The cache backend provides FS-Cache with a table of operations that can be
performed on the denizens of the cache. These are held in a structure of type:
- struct fscache_cache_ops
+ ::
- (*) Name of cache provider [mandatory]:
+ struct fscache_cache_ops
+
+ * Name of cache provider [mandatory]::
const char *name
@@ -229,7 +227,7 @@
the backend.
- (*) Allocate a new object [mandatory]:
+ * Allocate a new object [mandatory]::
struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
struct fscache_cookie *cookie)
@@ -244,7 +242,7 @@
form once lookup is complete or aborted.
- (*) Look up and create object [mandatory]:
+ * Look up and create object [mandatory]::
void (*lookup_object)(struct fscache_object *object)
@@ -263,7 +261,7 @@
to abort the lookup of that object.
- (*) Release lookup data [mandatory]:
+ * Release lookup data [mandatory]::
void (*lookup_complete)(struct fscache_object *object)
@@ -271,7 +269,7 @@
using to perform a lookup.
- (*) Increment object refcount [mandatory]:
+ * Increment object refcount [mandatory]::
struct fscache_object *(*grab_object)(struct fscache_object *object)
@@ -280,7 +278,7 @@
It should return the object pointer if successful.
- (*) Lock/Unlock object [mandatory]:
+ * Lock/Unlock object [mandatory]::
void (*lock_object)(struct fscache_object *object)
void (*unlock_object)(struct fscache_object *object)
@@ -289,7 +287,7 @@
to schedule with the lock held, so a spinlock isn't sufficient.
- (*) Pin/Unpin object [optional]:
+ * Pin/Unpin object [optional]::
int (*pin_object)(struct fscache_object *object)
void (*unpin_object)(struct fscache_object *object)
@@ -299,7 +297,7 @@
enough space in the cache to permit this.
- (*) Check coherency state of an object [mandatory]:
+ * Check coherency state of an object [mandatory]::
int (*check_consistency)(struct fscache_object *object)
@@ -308,7 +306,7 @@
if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
may also be returned.
- (*) Update object [mandatory]:
+ * Update object [mandatory]::
int (*update_object)(struct fscache_object *object)
@@ -317,7 +315,7 @@
obtained by calling object->cookie->def->get_aux()/get_attr().
- (*) Invalidate data object [mandatory]:
+ * Invalidate data object [mandatory]::
int (*invalidate_object)(struct fscache_operation *op)
@@ -329,7 +327,7 @@
fscache_op_complete() must be called on op before returning.
- (*) Discard object [mandatory]:
+ * Discard object [mandatory]::
void (*drop_object)(struct fscache_object *object)
@@ -341,7 +339,7 @@
caller. The caller will invoke the put_object() method as appropriate.
- (*) Release object reference [mandatory]:
+ * Release object reference [mandatory]::
void (*put_object)(struct fscache_object *object)
@@ -349,7 +347,7 @@
be freed when all the references to it are released.
- (*) Synchronise a cache [mandatory]:
+ * Synchronise a cache [mandatory]::
void (*sync)(struct fscache_cache *cache)
@@ -357,7 +355,7 @@
device.
- (*) Dissociate a cache [mandatory]:
+ * Dissociate a cache [mandatory]::
void (*dissociate_pages)(struct fscache_cache *cache)
@@ -365,7 +363,7 @@
cache withdrawal.
- (*) Notification that the attributes on a netfs file changed [mandatory]:
+ * Notification that the attributes on a netfs file changed [mandatory]::
int (*attr_changed)(struct fscache_object *object);
@@ -386,7 +384,7 @@
execution of this operation.
- (*) Reserve cache space for an object's data [optional]:
+ * Reserve cache space for an object's data [optional]::
int (*reserve_space)(struct fscache_object *object, loff_t size);
@@ -404,7 +402,7 @@
size if larger than that already.
- (*) Request page be read from cache [mandatory]:
+ * Request page be read from cache [mandatory]::
int (*read_or_alloc_page)(struct fscache_retrieval *op,
struct page *page,
@@ -446,7 +444,7 @@
with. This will complete the operation when all pages are dealt with.
- (*) Request pages be read from cache [mandatory]:
+ * Request pages be read from cache [mandatory]::
int (*read_or_alloc_pages)(struct fscache_retrieval *op,
struct list_head *pages,
@@ -457,7 +455,7 @@
of pages instead of one page. Any pages on which a read operation is
started must be added to the page cache for the specified mapping and also
to the LRU. Such pages must also be removed from the pages list and
- *nr_pages decremented per page.
+ ``*nr_pages`` decremented per page.
If there was an error such as -ENOMEM, then that should be returned; else
if one or more pages couldn't be read or allocated, then -ENOBUFS should
@@ -466,7 +464,7 @@
returned.
- (*) Request page be allocated in the cache [mandatory]:
+ * Request page be allocated in the cache [mandatory]::
int (*allocate_page)(struct fscache_retrieval *op,
struct page *page,
@@ -482,7 +480,7 @@
allocated, then the netfs page should be marked and 0 returned.
- (*) Request pages be allocated in the cache [mandatory]:
+ * Request pages be allocated in the cache [mandatory]::
int (*allocate_pages)(struct fscache_retrieval *op,
struct list_head *pages,
@@ -493,7 +491,7 @@
nr_pages should be treated as for the read_or_alloc_pages() method.
- (*) Request page be written to cache [mandatory]:
+ * Request page be written to cache [mandatory]::
int (*write_page)(struct fscache_storage *op,
struct page *page);
@@ -514,7 +512,7 @@
appropriately.
- (*) Discard retained per-page metadata [mandatory]:
+ * Discard retained per-page metadata [mandatory]::
void (*uncache_page)(struct fscache_object *object, struct page *page)
@@ -523,13 +521,12 @@
maintains for this page.
-==================
-FS-CACHE UTILITIES
+FS-Cache Utilities
==================
FS-Cache provides some utilities that a cache backend may make use of:
- (*) Note occurrence of an I/O error in a cache:
+ * Note occurrence of an I/O error in a cache::
void fscache_io_error(struct fscache_cache *cache)
@@ -541,7 +538,7 @@
This does not actually withdraw the cache. That must be done separately.
- (*) Invoke the retrieval I/O completion function:
+ * Invoke the retrieval I/O completion function::
void fscache_end_io(struct fscache_retrieval *op, struct page *page,
int error);
@@ -550,8 +547,8 @@
error value should be 0 if successful and an error otherwise.
- (*) Record that one or more pages being retrieved or allocated have been dealt
- with:
+ * Record that one or more pages being retrieved or allocated have been dealt
+ with::
void fscache_retrieval_complete(struct fscache_retrieval *op,
int n_pages);
@@ -562,7 +559,7 @@
completed.
- (*) Record operation completion:
+ * Record operation completion::
void fscache_op_complete(struct fscache_operation *op);
@@ -571,7 +568,7 @@
one or more pending operations to start running.
- (*) Set highest store limit:
+ * Set highest store limit::
void fscache_set_store_limit(struct fscache_object *object,
loff_t i_size);
@@ -581,7 +578,7 @@
rejected by fscache_read_alloc_page() and co with -ENOBUFS.
- (*) Mark pages as being cached:
+ * Mark pages as being cached::
void fscache_mark_pages_cached(struct fscache_retrieval *op,
struct pagevec *pagevec);
@@ -590,7 +587,7 @@
the netfs must call fscache_uncache_page() to unmark the pages.
- (*) Perform coherency check on an object:
+ * Perform coherency check on an object::
enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
const void *data,
@@ -603,29 +600,26 @@
One of three values will be returned:
- (*) FSCACHE_CHECKAUX_OKAY
-
+ FSCACHE_CHECKAUX_OKAY
The coherency data indicates the object is valid as is.
- (*) FSCACHE_CHECKAUX_NEEDS_UPDATE
-
+ FSCACHE_CHECKAUX_NEEDS_UPDATE
The coherency data needs updating, but otherwise the object is
valid.
- (*) FSCACHE_CHECKAUX_OBSOLETE
-
+ FSCACHE_CHECKAUX_OBSOLETE
The coherency data indicates that the object is obsolete and should
be discarded.
- (*) Initialise a freshly allocated object:
+ * Initialise a freshly allocated object::
void fscache_object_init(struct fscache_object *object);
This initialises all the fields in an object representation.
- (*) Indicate the destruction of an object:
+ * Indicate the destruction of an object::
void fscache_object_destroyed(struct fscache_cache *cache);
@@ -635,7 +629,7 @@
all the objects.
- (*) Indicate negative lookup on an object:
+ * Indicate negative lookup on an object::
void fscache_object_lookup_negative(struct fscache_object *object);
@@ -650,7 +644,7 @@
significant - all subsequent calls are ignored.
- (*) Indicate an object has been obtained:
+ * Indicate an object has been obtained::
void fscache_obtained_object(struct fscache_object *object);
@@ -667,7 +661,7 @@
(2) that writes may now proceed against this object.
- (*) Indicate that object lookup failed:
+ * Indicate that object lookup failed::
void fscache_object_lookup_error(struct fscache_object *object);
@@ -676,7 +670,7 @@
as possible.
- (*) Indicate that a stale object was found and discarded:
+ * Indicate that a stale object was found and discarded::
void fscache_object_retrying_stale(struct fscache_object *object);
@@ -685,7 +679,7 @@
discarded from the cache and the lookup will be performed again.
- (*) Indicate that the caching backend killed an object:
+ * Indicate that the caching backend killed an object::
void fscache_object_mark_killed(struct fscache_object *object,
enum fscache_why_object_killed why);
@@ -693,13 +687,20 @@
This is called to indicate that the cache backend preemptively killed an
object. The why parameter should be set to indicate the reason:
- FSCACHE_OBJECT_IS_STALE - the object was stale and needs discarding.
- FSCACHE_OBJECT_NO_SPACE - there was insufficient cache space
- FSCACHE_OBJECT_WAS_RETIRED - the object was retired when relinquished.
- FSCACHE_OBJECT_WAS_CULLED - the object was culled to make space.
+ FSCACHE_OBJECT_IS_STALE
+ - the object was stale and needs discarding.
+
+ FSCACHE_OBJECT_NO_SPACE
+ - there was insufficient cache space
+
+ FSCACHE_OBJECT_WAS_RETIRED
+ - the object was retired when relinquished.
+
+ FSCACHE_OBJECT_WAS_CULLED
+ - the object was culled to make space.
- (*) Get and release references on a retrieval record:
+ * Get and release references on a retrieval record::
void fscache_get_retrieval(struct fscache_retrieval *op);
void fscache_put_retrieval(struct fscache_retrieval *op);
@@ -708,7 +709,7 @@
asynchronous data retrieval and block allocation.
- (*) Enqueue a retrieval record for processing.
+ * Enqueue a retrieval record for processing::
void fscache_enqueue_retrieval(struct fscache_retrieval *op);
@@ -718,7 +719,7 @@
within the callback function.
- (*) List of object state names:
+ * List of object state names::
const char *fscache_object_states[];
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.rst
similarity index 89%
rename from Documentation/filesystems/caching/cachefiles.txt
rename to Documentation/filesystems/caching/cachefiles.rst
index 28aefcb..e58bc1f 100644
--- a/Documentation/filesystems/caching/cachefiles.txt
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -1,8 +1,10 @@
- ===============================================
- CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
- ===============================================
+.. SPDX-License-Identifier: GPL-2.0
-Contents:
+===============================================
+CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
+===============================================
+
+.. Contents:
(*) Overview.
@@ -27,8 +29,8 @@
(*) Debugging.
-========
-OVERVIEW
+
+Overview
========
CacheFiles is a caching backend that's meant to use as a cache a directory on
@@ -58,8 +60,8 @@
space.
-============
-REQUIREMENTS
+
+Requirements
============
The use of CacheFiles and its daemon requires the following features to be
@@ -79,84 +81,70 @@
filesystems being used as a cache.
-=============
-CONFIGURATION
+Configuration
=============
The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up cache ready for use. The following script commands are available:
- (*) brun <N>%
- (*) bcull <N>%
- (*) bstop <N>%
- (*) frun <N>%
- (*) fcull <N>%
- (*) fstop <N>%
-
+ brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>%
Configure the culling limits. Optional. See the section on culling
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.
- (*) dir <path>
-
+ dir <path>
Specify the directory containing the root of the cache. Mandatory.
- (*) tag <name>
-
+ tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".
- (*) debug <mask>
-
+ debug <mask>
Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:
+ == =================================================
1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())
+ == =================================================
- This mask can also be set through sysfs, eg:
+ This mask can also be set through sysfs, eg::
echo 5 >/sys/modules/cachefiles/parameters/debug
-==================
-STARTING THE CACHE
+Starting the Cache
==================
The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.
-The daemon is run as follows:
+The daemon is run as follows::
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
The flags are:
- (*) -d
-
+ ``-d``
Increase the debugging level. This can be specified multiple times and
is cumulative with itself.
- (*) -s
-
+ ``-s``
Send messages to stderr instead of syslog.
- (*) -n
-
+ ``-n``
Don't daemonise and go into background.
- (*) -f <configfile>
-
+ ``-f <configfile>``
Use an alternative configuration file rather than the default one.
-===============
-THINGS TO AVOID
+Things to Avoid
===============
Do not mount other things within the cache as this will cause problems. The
@@ -179,8 +167,7 @@
permissions to prevent random users being able to access them directly.
-=============
-CACHE CULLING
+Cache Culling
=============
The cache may need culling occasionally to make space. This involves
@@ -192,27 +179,21 @@
percentage of files available in the underlying filesystem. There are six
"limits":
- (*) brun
- (*) frun
-
+ brun, frun
If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off.
- (*) bcull
- (*) fcull
-
+ bcull, fcull
If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.
- (*) bstop
- (*) fstop
-
+ bstop, fstop
If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.
-These must be configured thusly:
+These must be configured thusly::
0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100
@@ -226,16 +207,14 @@
their atimes have changed or if the kernel module says it is still using them.
-===============
-CACHE STRUCTURE
+Cache Structure
===============
The CacheFiles module will create two directories in the directory it was
given:
- (*) cache/
-
- (*) graveyard/
+ * cache/
+ * graveyard/
The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
@@ -261,10 +240,10 @@
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
-objects:
+objects::
- INDEX INDEX INDEX DATA FILES
- ========= ========== ================================= ================
+ /INDEX /INDEX /INDEX /DATA FILES
+ /=========/==========/=================================/================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
@@ -275,7 +254,7 @@
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
-'+' prepended:
+'+' prepended::
J1223/@23/+xy...z/+kl...m/Epqr
@@ -288,11 +267,13 @@
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:
+ =============== =============== ===============
OBJECT TYPE PRINTABLE ENCODED
=============== =============== ===============
Index "I..." "J..."
Data "D..." "E..."
Special "S..." "T..."
+ =============== =============== ===============
Intermediate directories are always "@" or "+" as appropriate.
@@ -307,8 +288,7 @@
any file of an incorrect type (such as a FIFO file or a device file).
-==========================
-SECURITY MODEL AND SELINUX
+Security Model and SELinux
==========================
CacheFiles is implemented to deal properly with the LSM security features of
@@ -331,26 +311,26 @@
(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
- this is:
+ this is::
cachefiles_var_t
(2) Finds the security label of the process which issued the bind request
- (presumed to be the cachefilesd daemon), which by default will be:
+ (presumed to be the cachefilesd daemon), which by default will be::
cachefilesd_t
and asks LSM to supply a security ID as which it should act given the
- daemon's label. By default, this will be:
+ daemon's label. By default, this will be::
cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID
- based on a rule of this form in the policy.
+ based on a rule of this form in the policy::
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
- For instance:
+ For instance::
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
@@ -368,9 +348,9 @@
There are policy source files available in:
- http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+ https://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
-and later versions. In that tarball, see the files:
+and later versions. In that tarball, see the files::
cachefilesd.te
cachefilesd.fc
@@ -379,7 +359,7 @@
They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own
-directory and run:
+directory and run::
make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp
@@ -394,7 +374,7 @@
cache.
For instructions on how to add an auxiliary policy to enable the cache to be
-located elsewhere when SELinux is in enforcing mode, please see:
+located elsewhere when SELinux is in enforcing mode, please see::
/usr/share/doc/cachefilesd-*/move-cache.txt
@@ -402,8 +382,7 @@
in the sources.
-==================
-A NOTE ON SECURITY
+A Note on Security
==================
CacheFiles makes use of the split security in the task_struct. It allocates
@@ -445,17 +424,18 @@
files and directories with another security label.
-=======================
-STATISTICAL INFORMATION
+Statistical Information
=======================
-If FS-Cache is compiled with the following option enabled:
+If FS-Cache is compiled with the following option enabled::
CONFIG_CACHEFILES_HISTOGRAM=y
then it will gather certain statistics and display them through a proc file.
- (*) /proc/fs/cachefiles/histogram
+ /proc/fs/cachefiles/histogram
+
+ ::
cat /proc/fs/cachefiles/histogram
JIFS SECS LOOKUPS MKDIRS CREATES
@@ -465,36 +445,39 @@
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
columns are as follows:
+ ======= =======================================================
COLUMN TIME MEASUREMENT
======= =======================================================
LOOKUPS Length of time to perform a lookup on the backing fs
MKDIRS Length of time to perform a mkdir on the backing fs
CREATES Length of time to perform a create on the backing fs
+ ======= =======================================================
Each row shows the number of events that took a particular range of times.
Each step is 1 jiffy in size. The JIFS column indicates the particular
jiffy range covered, and the SECS field the equivalent number of seconds.
-=========
-DEBUGGING
+Debugging
=========
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
-debugging enabled by adjusting the value in:
+debugging enabled by adjusting the value in::
/sys/module/cachefiles/parameters/debug
This is a bitmask of debugging streams to enable:
+ ======= ======= =============================== =======================
BIT VALUE STREAM POINT
======= ======= =============================== =======================
0 1 General Function entry trace
1 2 Function exit trace
2 4 General
+ ======= ======= =============================== =======================
The appropriate set of values should be OR'd together and the result written to
-the control file. For example:
+the control file. For example::
echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
new file mode 100644
index 0000000..70de869
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -0,0 +1,565 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+General Filesystem Caching
+==========================
+
+Overview
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFS) and network
+filesystems::
+
+ +---------+
+ | | +--------------+
+ | NFS |--+ | |
+ | | | +-->| CacheFS |
+ +---------+ | +----------+ | | /dev/hda5 |
+ | | | | +--------------+
+ +---------+ +-->| | |
+ | | | |--+
+ | AFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ +-->| | |
+ | | | | +--------------+
+ +---------+ | +----------+ | | |
+ | | | +-->| CacheFiles |
+ | ISOFS |--+ | /var/cache |
+ | | +--------------+
+ +---------+
+
+Or to look at it another way, FS-Cache is a module that provides a caching
+facility to a network filesystem such that the cache is transparent to the
+user::
+
+ +---------+
+ | |
+ | Server |
+ | |
+ +---------+
+ | NETWORK
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ |
+ | +----------+
+ V | |
+ +---------+ | |
+ | | | |
+ | NFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ | | | +--------------+ +--------------+
+ | | | | | | | |
+ V +----------+ +-->| CacheFiles |-->| Ext3 |
+ +---------+ | /var/cache | | /dev/sda6 |
+ | | +--------------+ +--------------+
+ | VFS | ^ ^
+ | | | |
+ +---------+ +--------------+ |
+ | KERNEL SPACE | |
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
+ | USER SPACE | |
+ V | |
+ +---------+ +--------------+
+ | | | |
+ | Process | | cachefilesd |
+ | | | |
+ +---------+ +--------------+
+
+
+FS-Cache does not follow the idea of completely loading every netfs file
+opened in its entirety into a cache before permitting it to be accessed and
+then serving the pages out of that cache rather than the netfs inode because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+ cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+ must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+ one-off access of a small portion of it (such as might be done with the
+ "file" program).
+
+It instead serves the cache out in PAGE_SIZE chunks as and when requested by
+the netfs('s) using it.
+
+
+FS-Cache provides the following facilities:
+
+ (1) More than one cache can be used at once. Caches can be selected
+ explicitly by use of tags.
+
+ (2) Caches can be added / removed at any time.
+
+ (3) The netfs is provided with an interface that allows either party to
+ withdraw caching facilities from a file (required for (2)).
+
+ (4) The interface to the netfs returns as few errors as possible, preferring
+ rather to let the netfs remain oblivious.
+
+ (5) Cookies are used to represent indices, files and other objects to the
+ netfs. The simplest cookie is just a NULL pointer - indicating nothing
+ cached there.
+
+ (6) The netfs is allowed to propose - dynamically - any index hierarchy it
+ desires, though it must be aware that the index search function is
+ recursive, stack space is limited, and indices can only be children of
+ indices.
+
+ (7) Data I/O is done direct to and from the netfs's pages. The netfs
+ indicates that page A is at index B of the data-file represented by cookie
+ C, and that it should be read or written. The cache backend may or may
+ not start I/O on that page, but if it does, a netfs callback will be
+ invoked to indicate completion. The I/O may be either synchronous or
+ asynchronous.
+
+ (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
+ them as obsolete and the index hierarchy rooted at that point will get
+ recycled.
+
+ (9) The netfs provides a "match" function for index searches. In addition to
+ saying whether a match was made or not, this can also specify that an
+ entry should be updated or deleted.
+
+(10) As much as possible is done asynchronously.
+
+
+FS-Cache maintains a virtual indexing tree in which all indices, files, objects
+and pages are kept. Bits of this tree may actually reside in one or more
+caches::
+
+ FSDEF
+ |
+ +------------------------------------+
+ | |
+ NFS AFS
+ | |
+ +--------------------------+ +-----------+
+ | | | |
+ homedir mirror afs.org redhat.com
+ | | |
+ +------------+ +---------------+ +----------+
+ | | | | | |
+ 00001 00002 00007 00125 vol00001 vol00002
+ | | | | |
+ +---+---+ +-----+ +---+ +------+------+ +-----+----+
+ | | | | | | | | | | | | |
+ PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
+ | |
+ PG0 +-------+
+ | |
+ 00001 00003
+ |
+ +---+---+
+ | | |
+ PG0 PG1 PG2
+
+In the example above, you can see two netfs's being backed: NFS and AFS. These
+have different index hierarchies:
+
+ * The NFS primary index contains per-server indices. Each server index is
+ indexed by NFS file handles to get data file objects. Each data file
+ objects can have an array of pages, but may also have further child
+ objects, such as extended attributes and directory entries. Extended
+ attribute objects themselves have page-array contents.
+
+ * The AFS primary index contains per-cell indices. Each cell index contains
+ per-logical-volume indices. Each of volume index contains up to three
+ indices for the read-write, read-only and backup mirrors of those volumes.
+ Each of these contains vnode data file objects, each of which contains an
+ array of pages.
+
+The very top index is the FS-Cache master index in which individual netfs's
+have entries.
+
+Any index object may reside in more than one cache, provided it only has index
+children. Any index with non-index object children will be assumed to only
+reside in one cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/netfs-api.rst
+
+The cache backend API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/backend-api.rst
+
+A description of the internal representations and object state machine can be
+found in:
+
+ Documentation/filesystems/caching/object.rst
+
+
+Statistical Information
+=======================
+
+If FS-Cache is compiled with the following options enabled::
+
+ CONFIG_FSCACHE_STATS=y
+ CONFIG_FSCACHE_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a number of
+proc files.
+
+/proc/fs/fscache/stats
+----------------------
+
+ This shows counts of a number of events that can happen in FS-Cache:
+
++--------------+-------+-------------------------------------------------------+
+|CLASS |EVENT |MEANING |
++==============+=======+=======================================================+
+|Cookies |idx=N |Number of index cookies allocated |
++ +-------+-------------------------------------------------------+
+| |dat=N |Number of data storage cookies allocated |
++ +-------+-------------------------------------------------------+
+| |spc=N |Number of special cookies allocated |
++--------------+-------+-------------------------------------------------------+
+|Objects |alc=N |Number of objects allocated |
++ +-------+-------------------------------------------------------+
+| |nal=N |Number of object allocation failures |
++ +-------+-------------------------------------------------------+
+| |avl=N |Number of objects that reached the available state |
++ +-------+-------------------------------------------------------+
+| |ded=N |Number of objects that reached the dead state |
++--------------+-------+-------------------------------------------------------+
+|ChkAux |non=N |Number of objects that didn't have a coherency check |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of objects that passed a coherency check |
++ +-------+-------------------------------------------------------+
+| |upd=N |Number of objects that needed a coherency data update |
++ +-------+-------------------------------------------------------+
+| |obs=N |Number of objects that were declared obsolete |
++--------------+-------+-------------------------------------------------------+
+|Pages |mrk=N |Number of pages marked as being cached |
+| |unc=N |Number of uncache page requests seen |
++--------------+-------+-------------------------------------------------------+
+|Acquire |n=N |Number of acquire cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |nul=N |Number of acq reqs given a NULL parent |
++ +-------+-------------------------------------------------------+
+| |noc=N |Number of acq reqs rejected due to no cache available |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of acq reqs succeeded |
++ +-------+-------------------------------------------------------+
+| |nbf=N |Number of acq reqs rejected due to error |
++ +-------+-------------------------------------------------------+
+| |oom=N |Number of acq reqs failed on ENOMEM |
++--------------+-------+-------------------------------------------------------+
+|Lookups |n=N |Number of lookup calls made on cache backends |
++ +-------+-------------------------------------------------------+
+| |neg=N |Number of negative lookups made |
++ +-------+-------------------------------------------------------+
+| |pos=N |Number of positive lookups made |
++ +-------+-------------------------------------------------------+
+| |crt=N |Number of objects created by lookup |
++ +-------+-------------------------------------------------------+
+| |tmo=N |Number of lookups timed out and requeued |
++--------------+-------+-------------------------------------------------------+
+|Updates |n=N |Number of update cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |nul=N |Number of upd reqs given a NULL parent |
++ +-------+-------------------------------------------------------+
+| |run=N |Number of upd reqs granted CPU time |
++--------------+-------+-------------------------------------------------------+
+|Relinqs |n=N |Number of relinquish cookie requests seen |
++ +-------+-------------------------------------------------------+
+| |nul=N |Number of rlq reqs given a NULL parent |
++ +-------+-------------------------------------------------------+
+| |wcr=N |Number of rlq reqs waited on completion of creation |
++--------------+-------+-------------------------------------------------------+
+|AttrChg |n=N |Number of attribute changed requests seen |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of attr changed requests queued |
++ +-------+-------------------------------------------------------+
+| |nbf=N |Number of attr changed rejected -ENOBUFS |
++ +-------+-------------------------------------------------------+
+| |oom=N |Number of attr changed failed -ENOMEM |
++ +-------+-------------------------------------------------------+
+| |run=N |Number of attr changed ops given CPU time |
++--------------+-------+-------------------------------------------------------+
+|Allocs |n=N |Number of allocation requests seen |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of successful alloc reqs |
++ +-------+-------------------------------------------------------+
+| |wt=N |Number of alloc reqs that waited on lookup completion |
++ +-------+-------------------------------------------------------+
+| |nbf=N |Number of alloc reqs rejected -ENOBUFS |
++ +-------+-------------------------------------------------------+
+| |int=N |Number of alloc reqs aborted -ERESTARTSYS |
++ +-------+-------------------------------------------------------+
+| |ops=N |Number of alloc reqs submitted |
++ +-------+-------------------------------------------------------+
+| |owt=N |Number of alloc reqs waited for CPU time |
++ +-------+-------------------------------------------------------+
+| |abt=N |Number of alloc reqs aborted due to object death |
++--------------+-------+-------------------------------------------------------+
+|Retrvls |n=N |Number of retrieval (read) requests seen |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of successful retr reqs |
++ +-------+-------------------------------------------------------+
+| |wt=N |Number of retr reqs that waited on lookup completion |
++ +-------+-------------------------------------------------------+
+| |nod=N |Number of retr reqs returned -ENODATA |
++ +-------+-------------------------------------------------------+
+| |nbf=N |Number of retr reqs rejected -ENOBUFS |
++ +-------+-------------------------------------------------------+
+| |int=N |Number of retr reqs aborted -ERESTARTSYS |
++ +-------+-------------------------------------------------------+
+| |oom=N |Number of retr reqs failed -ENOMEM |
++ +-------+-------------------------------------------------------+
+| |ops=N |Number of retr reqs submitted |
++ +-------+-------------------------------------------------------+
+| |owt=N |Number of retr reqs waited for CPU time |
++ +-------+-------------------------------------------------------+
+| |abt=N |Number of retr reqs aborted due to object death |
++--------------+-------+-------------------------------------------------------+
+|Stores |n=N |Number of storage (write) requests seen |
++ +-------+-------------------------------------------------------+
+| |ok=N |Number of successful store reqs |
++ +-------+-------------------------------------------------------+
+| |agn=N |Number of store reqs on a page already pending storage |
++ +-------+-------------------------------------------------------+
+| |nbf=N |Number of store reqs rejected -ENOBUFS |
++ +-------+-------------------------------------------------------+
+| |oom=N |Number of store reqs failed -ENOMEM |
++ +-------+-------------------------------------------------------+
+| |ops=N |Number of store reqs submitted |
++ +-------+-------------------------------------------------------+
+| |run=N |Number of store reqs granted CPU time |
++ +-------+-------------------------------------------------------+
+| |pgs=N |Number of pages given store req processing time |
++ +-------+-------------------------------------------------------+
+| |rxd=N |Number of store reqs deleted from tracking tree |
++ +-------+-------------------------------------------------------+
+| |olm=N |Number of store reqs over store limit |
++--------------+-------+-------------------------------------------------------+
+|VmScan |nos=N |Number of release reqs against pages with no |
+| | |pending store |
++ +-------+-------------------------------------------------------+
+| |gon=N |Number of release reqs against pages stored by |
+| | |time lock granted |
++ +-------+-------------------------------------------------------+
+| |bsy=N |Number of release reqs ignored due to in-progress store|
++ +-------+-------------------------------------------------------+
+| |can=N |Number of page stores cancelled due to release req |
++--------------+-------+-------------------------------------------------------+
+|Ops |pend=N |Number of times async ops added to pending queues |
++ +-------+-------------------------------------------------------+
+| |run=N |Number of times async ops given CPU time |
++ +-------+-------------------------------------------------------+
+| |enq=N |Number of times async ops queued for processing |
++ +-------+-------------------------------------------------------+
+| |can=N |Number of async ops cancelled |
++ +-------+-------------------------------------------------------+
+| |rej=N |Number of async ops rejected due to object |
+| | |lookup/create failure |
++ +-------+-------------------------------------------------------+
+| |ini=N |Number of async ops initialised |
++ +-------+-------------------------------------------------------+
+| |dfr=N |Number of async ops queued for deferred release |
++ +-------+-------------------------------------------------------+
+| |rel=N |Number of async ops released |
+| | |(should equal ini=N when idle) |
++ +-------+-------------------------------------------------------+
+| |gc=N |Number of deferred-release async ops garbage collected |
++--------------+-------+-------------------------------------------------------+
+|CacheOp |alo=N |Number of in-progress alloc_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |luo=N |Number of in-progress lookup_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |luc=N |Number of in-progress lookup_complete() cache ops |
++ +-------+-------------------------------------------------------+
+| |gro=N |Number of in-progress grab_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |upo=N |Number of in-progress update_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |dro=N |Number of in-progress drop_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |pto=N |Number of in-progress put_object() cache ops |
++ +-------+-------------------------------------------------------+
+| |syn=N |Number of in-progress sync_cache() cache ops |
++ +-------+-------------------------------------------------------+
+| |atc=N |Number of in-progress attr_changed() cache ops |
++ +-------+-------------------------------------------------------+
+| |rap=N |Number of in-progress read_or_alloc_page() cache ops |
++ +-------+-------------------------------------------------------+
+| |ras=N |Number of in-progress read_or_alloc_pages() cache ops |
++ +-------+-------------------------------------------------------+
+| |alp=N |Number of in-progress allocate_page() cache ops |
++ +-------+-------------------------------------------------------+
+| |als=N |Number of in-progress allocate_pages() cache ops |
++ +-------+-------------------------------------------------------+
+| |wrp=N |Number of in-progress write_page() cache ops |
++ +-------+-------------------------------------------------------+
+| |ucp=N |Number of in-progress uncache_page() cache ops |
++ +-------+-------------------------------------------------------+
+| |dsp=N |Number of in-progress dissociate_pages() cache ops |
++--------------+-------+-------------------------------------------------------+
+|CacheEv |nsp=N |Number of object lookups/creations rejected due to |
+| | |lack of space |
++ +-------+-------------------------------------------------------+
+| |stl=N |Number of stale objects deleted |
++ +-------+-------------------------------------------------------+
+| |rtr=N |Number of objects retired when relinquished |
++ +-------+-------------------------------------------------------+
+| |cul=N |Number of objects culled |
++--------------+-------+-------------------------------------------------------+
+
+
+
+/proc/fs/fscache/histogram
+--------------------------
+
+ ::
+
+ cat /proc/fs/fscache/histogram
+ JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
+ ===== ===== ========= ========= ========= ========= =========
+
+ This shows the breakdown of the number of times each amount of time
+ between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
+ columns are as follows:
+
+ ========= =======================================================
+ COLUMN TIME MEASUREMENT
+ ========= =======================================================
+ OBJ INST Length of time to instantiate an object
+ OP RUNS Length of time a call to process an operation took
+ OBJ RUNS Length of time a call to process an object event took
+ RETRV DLY Time between an requesting a read and lookup completing
+ RETRIEVLS Time between beginning and end of a retrieval
+ ========= =======================================================
+
+ Each row shows the number of events that took a particular range of times.
+ Each step is 1 jiffy in size. The JIFS column indicates the particular
+ jiffy range covered, and the SECS field the equivalent number of seconds.
+
+
+
+Object List
+===========
+
+If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
+list of all the objects currently allocated and allow them to be viewed
+through::
+
+ /proc/fs/fscache/objects
+
+This will look something like::
+
+ [root@andromeda ~]# head /proc/fs/fscache/objects
+ OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
+ ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
+ 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
+ 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
+
+where the first set of columns before the '|' describe the object:
+
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
+ PARENT Debugging ID of parent object
+ STAT Object state
+ CHLDN Number of child objects of this object
+ OPS Number of outstanding operations on this object
+ OOP Number of outstanding child object management operations
+ IPR
+ EX Number of outstanding exclusive operations
+ READS Number of outstanding read operations
+ EM Object's event mask
+ EV Events raised on this object
+ F Object flags
+ S Object work item busy state mask (1:pending 2:running)
+ ======= ===============================================================
+
+and the second set of columns describe the object's cookie, if present:
+
+ ================ ======================================================
+ COLUMN DESCRIPTION
+ ================ ======================================================
+ NETFS_COOKIE_DEF Name of netfs cookie definition
+ TY Cookie type (IX - index, DT - data, hex - special)
+ FL Cookie flags
+ NETFS_DATA Netfs private data stored in the cookie
+ OBJECT_KEY Object key } 1 column, with separating comma
+ AUX_DATA Object aux data } presence may be configured
+ ================ ======================================================
+
+The data shown may be filtered by attaching the a key to an appropriate keyring
+before viewing the file. Something like::
+
+ keyctl add user fscache:objlist <restrictions> @s
+
+where <restrictions> are a selection of the following letters:
+
+ == =========================================================
+ K Show hexdump of object key (don't show if not given)
+ A Show hexdump of object aux data (don't show if not given)
+ == =========================================================
+
+and the following paired letters:
+
+ == =========================================================
+ C Show objects that have a cookie
+ c Show objects that don't have a cookie
+ B Show objects that are busy
+ b Show objects that aren't busy
+ W Show objects that have pending writes
+ w Show objects that don't have pending writes
+ R Show objects that have outstanding reads
+ r Show objects that don't have outstanding reads
+ S Show objects that have work queued
+ s Show objects that don't have work queued
+ == =========================================================
+
+If neither side of a letter pair is given, then both are implied. For example:
+
+ keyctl add user fscache:objlist KB @s
+
+shows objects that are busy, and lists their object keys, but does not dump
+their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
+not implied.
+
+By default all objects and all fields will be shown.
+
+
+Debugging
+=========
+
+If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
+debugging enabled by adjusting the value in::
+
+ /sys/module/fscache/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ ======= ======= =============================== =======================
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 Cache management Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+ 3 8 Cookie management Function entry trace
+ 4 16 Function exit trace
+ 5 32 General
+ 6 64 Page handling Function entry trace
+ 7 128 Function exit trace
+ 8 256 General
+ 9 512 Operation management Function entry trace
+ 10 1024 Function exit trace
+ 11 2048 General
+ ======= ======= =============================== =======================
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example::
+
+ echo $((1|8|64)) >/sys/module/fscache/parameters/debug
+
+will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
deleted file mode 100644
index 50f0a57..0000000
--- a/Documentation/filesystems/caching/fscache.txt
+++ /dev/null
@@ -1,448 +0,0 @@
- ==========================
- General Filesystem Caching
- ==========================
-
-========
-OVERVIEW
-========
-
-This facility is a general purpose cache for network filesystems, though it
-could be used for caching other things such as ISO9660 filesystems too.
-
-FS-Cache mediates between cache backends (such as CacheFS) and network
-filesystems:
-
- +---------+
- | | +--------------+
- | NFS |--+ | |
- | | | +-->| CacheFS |
- +---------+ | +----------+ | | /dev/hda5 |
- | | | | +--------------+
- +---------+ +-->| | |
- | | | |--+
- | AFS |----->| FS-Cache |
- | | | |--+
- +---------+ +-->| | |
- | | | | +--------------+
- +---------+ | +----------+ | | |
- | | | +-->| CacheFiles |
- | ISOFS |--+ | /var/cache |
- | | +--------------+
- +---------+
-
-Or to look at it another way, FS-Cache is a module that provides a caching
-facility to a network filesystem such that the cache is transparent to the
-user:
-
- +---------+
- | |
- | Server |
- | |
- +---------+
- | NETWORK
- ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- |
- | +----------+
- V | |
- +---------+ | |
- | | | |
- | NFS |----->| FS-Cache |
- | | | |--+
- +---------+ | | | +--------------+ +--------------+
- | | | | | | | |
- V +----------+ +-->| CacheFiles |-->| Ext3 |
- +---------+ | /var/cache | | /dev/sda6 |
- | | +--------------+ +--------------+
- | VFS | ^ ^
- | | | |
- +---------+ +--------------+ |
- | KERNEL SPACE | |
- ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
- | USER SPACE | |
- V | |
- +---------+ +--------------+
- | | | |
- | Process | | cachefilesd |
- | | | |
- +---------+ +--------------+
-
-
-FS-Cache does not follow the idea of completely loading every netfs file
-opened in its entirety into a cache before permitting it to be accessed and
-then serving the pages out of that cache rather than the netfs inode because:
-
- (1) It must be practical to operate without a cache.
-
- (2) The size of any accessible file must not be limited to the size of the
- cache.
-
- (3) The combined size of all opened files (this includes mapped libraries)
- must not be limited to the size of the cache.
-
- (4) The user should not be forced to download an entire file just to do a
- one-off access of a small portion of it (such as might be done with the
- "file" program).
-
-It instead serves the cache out in PAGE_SIZE chunks as and when requested by
-the netfs('s) using it.
-
-
-FS-Cache provides the following facilities:
-
- (1) More than one cache can be used at once. Caches can be selected
- explicitly by use of tags.
-
- (2) Caches can be added / removed at any time.
-
- (3) The netfs is provided with an interface that allows either party to
- withdraw caching facilities from a file (required for (2)).
-
- (4) The interface to the netfs returns as few errors as possible, preferring
- rather to let the netfs remain oblivious.
-
- (5) Cookies are used to represent indices, files and other objects to the
- netfs. The simplest cookie is just a NULL pointer - indicating nothing
- cached there.
-
- (6) The netfs is allowed to propose - dynamically - any index hierarchy it
- desires, though it must be aware that the index search function is
- recursive, stack space is limited, and indices can only be children of
- indices.
-
- (7) Data I/O is done direct to and from the netfs's pages. The netfs
- indicates that page A is at index B of the data-file represented by cookie
- C, and that it should be read or written. The cache backend may or may
- not start I/O on that page, but if it does, a netfs callback will be
- invoked to indicate completion. The I/O may be either synchronous or
- asynchronous.
-
- (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
- them as obsolete and the index hierarchy rooted at that point will get
- recycled.
-
- (9) The netfs provides a "match" function for index searches. In addition to
- saying whether a match was made or not, this can also specify that an
- entry should be updated or deleted.
-
-(10) As much as possible is done asynchronously.
-
-
-FS-Cache maintains a virtual indexing tree in which all indices, files, objects
-and pages are kept. Bits of this tree may actually reside in one or more
-caches.
-
- FSDEF
- |
- +------------------------------------+
- | |
- NFS AFS
- | |
- +--------------------------+ +-----------+
- | | | |
- homedir mirror afs.org redhat.com
- | | |
- +------------+ +---------------+ +----------+
- | | | | | |
- 00001 00002 00007 00125 vol00001 vol00002
- | | | | |
- +---+---+ +-----+ +---+ +------+------+ +-----+----+
- | | | | | | | | | | | | |
-PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
- | |
- PG0 +-------+
- | |
- 00001 00003
- |
- +---+---+
- | | |
- PG0 PG1 PG2
-
-In the example above, you can see two netfs's being backed: NFS and AFS. These
-have different index hierarchies:
-
- (*) The NFS primary index contains per-server indices. Each server index is
- indexed by NFS file handles to get data file objects. Each data file
- objects can have an array of pages, but may also have further child
- objects, such as extended attributes and directory entries. Extended
- attribute objects themselves have page-array contents.
-
- (*) The AFS primary index contains per-cell indices. Each cell index contains
- per-logical-volume indices. Each of volume index contains up to three
- indices for the read-write, read-only and backup mirrors of those volumes.
- Each of these contains vnode data file objects, each of which contains an
- array of pages.
-
-The very top index is the FS-Cache master index in which individual netfs's
-have entries.
-
-Any index object may reside in more than one cache, provided it only has index
-children. Any index with non-index object children will be assumed to only
-reside in one cache.
-
-
-The netfs API to FS-Cache can be found in:
-
- Documentation/filesystems/caching/netfs-api.txt
-
-The cache backend API to FS-Cache can be found in:
-
- Documentation/filesystems/caching/backend-api.txt
-
-A description of the internal representations and object state machine can be
-found in:
-
- Documentation/filesystems/caching/object.txt
-
-
-=======================
-STATISTICAL INFORMATION
-=======================
-
-If FS-Cache is compiled with the following options enabled:
-
- CONFIG_FSCACHE_STATS=y
- CONFIG_FSCACHE_HISTOGRAM=y
-
-then it will gather certain statistics and display them through a number of
-proc files.
-
- (*) /proc/fs/fscache/stats
-
- This shows counts of a number of events that can happen in FS-Cache:
-
- CLASS EVENT MEANING
- ======= ======= =======================================================
- Cookies idx=N Number of index cookies allocated
- dat=N Number of data storage cookies allocated
- spc=N Number of special cookies allocated
- Objects alc=N Number of objects allocated
- nal=N Number of object allocation failures
- avl=N Number of objects that reached the available state
- ded=N Number of objects that reached the dead state
- ChkAux non=N Number of objects that didn't have a coherency check
- ok=N Number of objects that passed a coherency check
- upd=N Number of objects that needed a coherency data update
- obs=N Number of objects that were declared obsolete
- Pages mrk=N Number of pages marked as being cached
- unc=N Number of uncache page requests seen
- Acquire n=N Number of acquire cookie requests seen
- nul=N Number of acq reqs given a NULL parent
- noc=N Number of acq reqs rejected due to no cache available
- ok=N Number of acq reqs succeeded
- nbf=N Number of acq reqs rejected due to error
- oom=N Number of acq reqs failed on ENOMEM
- Lookups n=N Number of lookup calls made on cache backends
- neg=N Number of negative lookups made
- pos=N Number of positive lookups made
- crt=N Number of objects created by lookup
- tmo=N Number of lookups timed out and requeued
- Updates n=N Number of update cookie requests seen
- nul=N Number of upd reqs given a NULL parent
- run=N Number of upd reqs granted CPU time
- Relinqs n=N Number of relinquish cookie requests seen
- nul=N Number of rlq reqs given a NULL parent
- wcr=N Number of rlq reqs waited on completion of creation
- AttrChg n=N Number of attribute changed requests seen
- ok=N Number of attr changed requests queued
- nbf=N Number of attr changed rejected -ENOBUFS
- oom=N Number of attr changed failed -ENOMEM
- run=N Number of attr changed ops given CPU time
- Allocs n=N Number of allocation requests seen
- ok=N Number of successful alloc reqs
- wt=N Number of alloc reqs that waited on lookup completion
- nbf=N Number of alloc reqs rejected -ENOBUFS
- int=N Number of alloc reqs aborted -ERESTARTSYS
- ops=N Number of alloc reqs submitted
- owt=N Number of alloc reqs waited for CPU time
- abt=N Number of alloc reqs aborted due to object death
- Retrvls n=N Number of retrieval (read) requests seen
- ok=N Number of successful retr reqs
- wt=N Number of retr reqs that waited on lookup completion
- nod=N Number of retr reqs returned -ENODATA
- nbf=N Number of retr reqs rejected -ENOBUFS
- int=N Number of retr reqs aborted -ERESTARTSYS
- oom=N Number of retr reqs failed -ENOMEM
- ops=N Number of retr reqs submitted
- owt=N Number of retr reqs waited for CPU time
- abt=N Number of retr reqs aborted due to object death
- Stores n=N Number of storage (write) requests seen
- ok=N Number of successful store reqs
- agn=N Number of store reqs on a page already pending storage
- nbf=N Number of store reqs rejected -ENOBUFS
- oom=N Number of store reqs failed -ENOMEM
- ops=N Number of store reqs submitted
- run=N Number of store reqs granted CPU time
- pgs=N Number of pages given store req processing time
- rxd=N Number of store reqs deleted from tracking tree
- olm=N Number of store reqs over store limit
- VmScan nos=N Number of release reqs against pages with no pending store
- gon=N Number of release reqs against pages stored by time lock granted
- bsy=N Number of release reqs ignored due to in-progress store
- can=N Number of page stores cancelled due to release req
- Ops pend=N Number of times async ops added to pending queues
- run=N Number of times async ops given CPU time
- enq=N Number of times async ops queued for processing
- can=N Number of async ops cancelled
- rej=N Number of async ops rejected due to object lookup/create failure
- ini=N Number of async ops initialised
- dfr=N Number of async ops queued for deferred release
- rel=N Number of async ops released (should equal ini=N when idle)
- gc=N Number of deferred-release async ops garbage collected
- CacheOp alo=N Number of in-progress alloc_object() cache ops
- luo=N Number of in-progress lookup_object() cache ops
- luc=N Number of in-progress lookup_complete() cache ops
- gro=N Number of in-progress grab_object() cache ops
- upo=N Number of in-progress update_object() cache ops
- dro=N Number of in-progress drop_object() cache ops
- pto=N Number of in-progress put_object() cache ops
- syn=N Number of in-progress sync_cache() cache ops
- atc=N Number of in-progress attr_changed() cache ops
- rap=N Number of in-progress read_or_alloc_page() cache ops
- ras=N Number of in-progress read_or_alloc_pages() cache ops
- alp=N Number of in-progress allocate_page() cache ops
- als=N Number of in-progress allocate_pages() cache ops
- wrp=N Number of in-progress write_page() cache ops
- ucp=N Number of in-progress uncache_page() cache ops
- dsp=N Number of in-progress dissociate_pages() cache ops
- CacheEv nsp=N Number of object lookups/creations rejected due to lack of space
- stl=N Number of stale objects deleted
- rtr=N Number of objects retired when relinquished
- cul=N Number of objects culled
-
-
- (*) /proc/fs/fscache/histogram
-
- cat /proc/fs/fscache/histogram
- JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
- ===== ===== ========= ========= ========= ========= =========
-
- This shows the breakdown of the number of times each amount of time
- between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
- columns are as follows:
-
- COLUMN TIME MEASUREMENT
- ======= =======================================================
- OBJ INST Length of time to instantiate an object
- OP RUNS Length of time a call to process an operation took
- OBJ RUNS Length of time a call to process an object event took
- RETRV DLY Time between an requesting a read and lookup completing
- RETRIEVLS Time between beginning and end of a retrieval
-
- Each row shows the number of events that took a particular range of times.
- Each step is 1 jiffy in size. The JIFS column indicates the particular
- jiffy range covered, and the SECS field the equivalent number of seconds.
-
-
-===========
-OBJECT LIST
-===========
-
-If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
-list of all the objects currently allocated and allow them to be viewed
-through:
-
- /proc/fs/fscache/objects
-
-This will look something like:
-
- [root@andromeda ~]# head /proc/fs/fscache/objects
- OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
- ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
- 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
- 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
-
-where the first set of columns before the '|' describe the object:
-
- COLUMN DESCRIPTION
- ======= ===============================================================
- OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
- PARENT Debugging ID of parent object
- STAT Object state
- CHLDN Number of child objects of this object
- OPS Number of outstanding operations on this object
- OOP Number of outstanding child object management operations
- IPR
- EX Number of outstanding exclusive operations
- READS Number of outstanding read operations
- EM Object's event mask
- EV Events raised on this object
- F Object flags
- S Object work item busy state mask (1:pending 2:running)
-
-and the second set of columns describe the object's cookie, if present:
-
- COLUMN DESCRIPTION
- =============== =======================================================
- NETFS_COOKIE_DEF Name of netfs cookie definition
- TY Cookie type (IX - index, DT - data, hex - special)
- FL Cookie flags
- NETFS_DATA Netfs private data stored in the cookie
- OBJECT_KEY Object key } 1 column, with separating comma
- AUX_DATA Object aux data } presence may be configured
-
-The data shown may be filtered by attaching the a key to an appropriate keyring
-before viewing the file. Something like:
-
- keyctl add user fscache:objlist <restrictions> @s
-
-where <restrictions> are a selection of the following letters:
-
- K Show hexdump of object key (don't show if not given)
- A Show hexdump of object aux data (don't show if not given)
-
-and the following paired letters:
-
- C Show objects that have a cookie
- c Show objects that don't have a cookie
- B Show objects that are busy
- b Show objects that aren't busy
- W Show objects that have pending writes
- w Show objects that don't have pending writes
- R Show objects that have outstanding reads
- r Show objects that don't have outstanding reads
- S Show objects that have work queued
- s Show objects that don't have work queued
-
-If neither side of a letter pair is given, then both are implied. For example:
-
- keyctl add user fscache:objlist KB @s
-
-shows objects that are busy, and lists their object keys, but does not dump
-their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
-not implied.
-
-By default all objects and all fields will be shown.
-
-
-=========
-DEBUGGING
-=========
-
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in:
-
- /sys/module/fscache/parameters/debug
-
-This is a bitmask of debugging streams to enable:
-
- BIT VALUE STREAM POINT
- ======= ======= =============================== =======================
- 0 1 Cache management Function entry trace
- 1 2 Function exit trace
- 2 4 General
- 3 8 Cookie management Function entry trace
- 4 16 Function exit trace
- 5 32 General
- 6 64 Page handling Function entry trace
- 7 128 Function exit trace
- 8 256 General
- 9 512 Operation management Function entry trace
- 10 1024 Function exit trace
- 11 2048 General
-
-The appropriate set of values should be OR'd together and the result written to
-the control file. For example:
-
- echo $((1|8|64)) >/sys/module/fscache/parameters/debug
-
-will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/index.rst b/Documentation/filesystems/caching/index.rst
new file mode 100644
index 0000000..033da7a
--- /dev/null
+++ b/Documentation/filesystems/caching/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Filesystem Caching
+==================
+
+.. toctree::
+ :maxdepth: 2
+
+ fscache
+ object
+ backend-api
+ cachefiles
+ netfs-api
+ operations
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.rst
similarity index 91%
rename from Documentation/filesystems/caching/netfs-api.txt
rename to Documentation/filesystems/caching/netfs-api.rst
index ba968e8..d9f14b8 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.rst
@@ -1,6 +1,8 @@
- ===============================
- FS-CACHE NETWORK FILESYSTEM API
- ===============================
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+FS-Cache Network Filesystem API
+===============================
There's an API by which a network filesystem can make use of the FS-Cache
facilities. This is based around a number of principles:
@@ -19,7 +21,7 @@
This API is declared in <linux/fscache.h>.
-This document contains the following sections:
+.. This document contains the following sections:
(1) Network filesystem definition
(2) Index definition
@@ -41,12 +43,11 @@
(18) FS-Cache specific page flags.
-=============================
-NETWORK FILESYSTEM DEFINITION
+Network Filesystem Definition
=============================
FS-Cache needs a description of the network filesystem. This is specified
-using a record of the following structure:
+using a record of the following structure::
struct fscache_netfs {
uint32_t version;
@@ -71,7 +72,7 @@
another parameter passed into the registration function.
For example, kAFS (linux/fs/afs/) uses the following definitions to describe
-itself:
+itself::
struct fscache_netfs afs_cache_netfs = {
.version = 0,
@@ -79,8 +80,7 @@
};
-================
-INDEX DEFINITION
+Index Definition
================
Indices are used for two purposes:
@@ -114,11 +114,10 @@
function is recursive. Too many layers will run the kernel out of stack.
-=================
-OBJECT DEFINITION
+Object Definition
=================
-To define an object, a structure of the following type should be filled out:
+To define an object, a structure of the following type should be filled out::
struct fscache_cookie_def
{
@@ -149,16 +148,13 @@
This is one of the following values:
- (*) FSCACHE_COOKIE_TYPE_INDEX
-
+ FSCACHE_COOKIE_TYPE_INDEX
This defines an index, which is a special FS-Cache type.
- (*) FSCACHE_COOKIE_TYPE_DATAFILE
-
+ FSCACHE_COOKIE_TYPE_DATAFILE
This defines an ordinary data file.
- (*) Any other value between 2 and 255
-
+ Any other value between 2 and 255
This defines an extraordinary object such as an XATTR.
(2) The name of the object type (NUL terminated unless all 16 chars are used)
@@ -192,9 +188,14 @@
If present, the function should return one of the following values:
- (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
- (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
- (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
+ FSCACHE_CHECKAUX_OKAY
+ - the entry is okay as is
+
+ FSCACHE_CHECKAUX_NEEDS_UPDATE
+ - the entry requires update
+
+ FSCACHE_CHECKAUX_OBSOLETE
+ - the entry should be deleted
This function can also be used to extract data from the auxiliary data in
the cache and copy it into the netfs's structures.
@@ -236,32 +237,30 @@
This function is not required for indices as they're not permitted data.
-===================================
-NETWORK FILESYSTEM (UN)REGISTRATION
+Network Filesystem (Un)registration
===================================
The first step is to declare the network filesystem to the cache. This also
involves specifying the layout of the primary index (for AFS, this would be the
"cell" level).
-The registration function is:
+The registration function is::
int fscache_register_netfs(struct fscache_netfs *netfs);
It just takes a pointer to the netfs definition. It returns 0 or an error as
appropriate.
-For kAFS, registration is done as follows:
+For kAFS, registration is done as follows::
ret = fscache_register_netfs(&afs_cache_netfs);
-The last step is, of course, unregistration:
+The last step is, of course, unregistration::
void fscache_unregister_netfs(struct fscache_netfs *netfs);
-================
-CACHE TAG LOOKUP
+Cache Tag Lookup
================
FS-Cache permits the use of more than one cache. To permit particular index
@@ -270,7 +269,7 @@
FS-Cache as to which cache should be used. The problem with doing that is that
FS-Cache will always pick the first cache that was registered.
-To get the representation for a named tag:
+To get the representation for a named tag::
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
@@ -278,7 +277,7 @@
will never return an error. It may return a dummy tag, however, if it runs out
of memory; this will inhibit caching with this tag.
-Any representation so obtained must be released by passing it to this function:
+Any representation so obtained must be released by passing it to this function::
void fscache_release_cache_tag(struct fscache_cache_tag *tag);
@@ -286,13 +285,12 @@
operation select_cache().
-==================
-INDEX REGISTRATION
+Index Registration
==================
The third step is to inform FS-Cache about part of an index hierarchy that can
be used to locate files. This is done by requesting a cookie for each index in
-the path to the file:
+the path to the file::
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_cookie *parent,
@@ -339,7 +337,7 @@
calling fscache_enable_cookie() (see below).
For example, with AFS, a cell would be added to the primary index. This index
-entry would have a dependent inode containing volume mappings within this cell:
+entry would have a dependent inode containing volume mappings within this cell::
cell->cache =
fscache_acquire_cookie(afs_cache_netfs.primary_index,
@@ -349,7 +347,7 @@
cell, 0, true);
And then a particular volume could be added to that index by ID, creating
-another index for vnodes (AFS inode equivalents):
+another index for vnodes (AFS inode equivalents)::
volume->cache =
fscache_acquire_cookie(volume->cell->cache,
@@ -359,13 +357,12 @@
volume, 0, true);
-======================
-DATA FILE REGISTRATION
+Data File Registration
======================
The fourth step is to request a data file be created in the cache. This is
identical to index cookie acquisition. The only difference is that the type in
-the object definition should be something other than index type.
+the object definition should be something other than index type::
vnode->cache =
fscache_acquire_cookie(volume->cache,
@@ -375,15 +372,14 @@
vnode, vnode->status.size, true);
-=================================
-MISCELLANEOUS OBJECT REGISTRATION
+Miscellaneous Object Registration
=================================
An optional step is to request an object of miscellaneous type be created in
the cache. This is almost identical to index cookie acquisition. The only
difference is that the type in the object definition should be something other
than index type. While the parent object could be an index, it's more likely
-it would be some other type of object such as a data file.
+it would be some other type of object such as a data file::
xattr->cache =
fscache_acquire_cookie(vnode->cache,
@@ -396,13 +392,12 @@
entries for example.
-==========================
-SETTING THE DATA FILE SIZE
+Setting the Data File Size
==========================
The fifth step is to set the physical attributes of the file, such as its size.
This doesn't automatically reserve any space in the cache, but permits the
-cache to adjust its metadata for data tracking appropriately:
+cache to adjust its metadata for data tracking appropriately::
int fscache_attr_changed(struct fscache_cookie *cookie);
@@ -417,8 +412,7 @@
to the caller. The attribute adjustment excludes read and write operations.
-=====================
-PAGE ALLOC/READ/WRITE
+Page alloc/read/write
=====================
And the sixth step is to store and retrieve pages in the cache. There are
@@ -441,7 +435,7 @@
Firstly, the netfs should ask FS-Cache to examine the caches and read the
contents cached for a particular page of a particular file if present, or else
-allocate space to store the contents if not:
+allocate space to store the contents if not::
typedef
void (*fscache_rw_complete_t)(struct page *page,
@@ -474,14 +468,14 @@
(4) When the read is complete, end_io_func() will be invoked with:
- (*) The netfs data supplied when the cookie was created.
+ * The netfs data supplied when the cookie was created.
- (*) The page descriptor.
+ * The page descriptor.
- (*) The context argument passed to the above function. This will be
+ * The context argument passed to the above function. This will be
maintained with the get_context/put_context functions mentioned above.
- (*) An argument that's 0 on success or negative for an error code.
+ * An argument that's 0 on success or negative for an error code.
If an error occurs, it should be assumed that the page contains no usable
data. fscache_readpages_cancel() may need to be called.
@@ -504,11 +498,11 @@
read any data from the cache.
-PAGE ALLOCATE
+Page Allocate
-------------
Alternatively, if there's not expected to be any data in the cache for a page
-because the file has been extended, a block can simply be allocated instead:
+because the file has been extended, a block can simply be allocated instead::
int fscache_alloc_page(struct fscache_cookie *cookie,
struct page *page,
@@ -523,12 +517,12 @@
successful.
-PAGE WRITE
+Page Write
----------
Secondly, if the netfs changes the contents of the page (either due to an
initial download or if a user performs a write), then the page should be
-written back to the cache:
+written back to the cache::
int fscache_write_page(struct fscache_cookie *cookie,
struct page *page,
@@ -566,11 +560,11 @@
Writing takes place asynchronously.
-MULTIPLE PAGE READ
+Multiple Page Read
------------------
A facility is provided to read several pages at once, as requested by the
-readpages() address space operation:
+readpages() address space operation::
int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
struct address_space *mapping,
@@ -598,7 +592,7 @@
be returned.
Otherwise, if all pages had reads dispatched, then 0 will be returned, the
- list will be empty and *nr_pages will be 0.
+ list will be empty and ``*nr_pages`` will be 0.
(4) end_io_func will be called once for each page being read as the reads
complete. It will be called in process context if error != 0, but it may
@@ -609,13 +603,13 @@
been marked appropriately and will need uncaching.
-CANCELLATION OF UNREAD PAGES
+Cancellation of Unread Pages
----------------------------
If one or more pages are passed to fscache_read_or_alloc_pages() but not then
read from the cache and also not read from the underlying filesystem then
those pages will need to have any marks and reservations removed. This can be
-done by calling:
+done by calling::
void fscache_readpages_cancel(struct fscache_cookie *cookie,
struct list_head *pages);
@@ -625,11 +619,10 @@
and any that have PG_fscache set will be uncached.
-==============
-PAGE UNCACHING
+Page Uncaching
==============
-To uncache a page, this function should be called:
+To uncache a page, this function should be called::
void fscache_uncache_page(struct fscache_cookie *cookie,
struct page *page);
@@ -644,12 +637,12 @@
Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions, so the page
-invalidation functions must use:
+invalidation functions must use::
bool fscache_check_page_write(struct fscache_cookie *cookie,
struct page *page);
-to see if a page is being written to the cache, and:
+to see if a page is being written to the cache, and::
void fscache_wait_on_page_write(struct fscache_cookie *cookie,
struct page *page);
@@ -660,7 +653,7 @@
When releasepage() is being implemented, a special FS-Cache function exists to
manage the heuristics of coping with vmscan trying to eject pages, which may
conflict with the cache trying to write pages to the cache (which may itself
-need to allocate memory):
+need to allocate memory)::
bool fscache_maybe_release_page(struct fscache_cookie *cookie,
struct page *page,
@@ -676,12 +669,12 @@
in which case the page will not be stored in the cache this time.
-BULK INODE PAGE UNCACHE
+Bulk Image Page Uncache
-----------------------
A convenience routine is provided to perform an uncache on all the pages
attached to an inode. This assumes that the pages on the inode correspond on a
-1:1 basis with the pages in the cache.
+1:1 basis with the pages in the cache::
void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
struct inode *inode);
@@ -692,12 +685,11 @@
error is returned.
-===============================
-INDEX AND DATA FILE CONSISTENCY
+Index and Data File consistency
===============================
To find out whether auxiliary data for an object is up to data within the
-cache, the following function can be called:
+cache, the following function can be called::
int fscache_check_consistency(struct fscache_cookie *cookie,
const void *aux_data);
@@ -708,7 +700,7 @@
return -ENOMEM and -ERESTARTSYS.
To request an update of the index data for an index or other object, the
-following function should be called:
+following function should be called::
void fscache_update_cookie(struct fscache_cookie *cookie,
const void *aux_data);
@@ -721,8 +713,7 @@
data blocks are added to a data file object.
-=================
-COOKIE ENABLEMENT
+Cookie Enablement
=================
Cookies exist in one of two states: enabled and disabled. If a cookie is
@@ -731,7 +722,7 @@
still possible to uncache pages and relinquish the cookie.
The initial enablement state is set by fscache_acquire_cookie(), but the cookie
-can be enabled or disabled later. To disable a cookie, call:
+can be enabled or disabled later. To disable a cookie, call::
void fscache_disable_cookie(struct fscache_cookie *cookie,
const void *aux_data,
@@ -746,7 +737,7 @@
calling fscache_uncache_all_inode_pages() afterwards to make sure all page
markings are cleared up.
-Cookies can be enabled or reenabled with:
+Cookies can be enabled or reenabled with::
void fscache_enable_cookie(struct fscache_cookie *cookie,
const void *aux_data,
@@ -771,13 +762,12 @@
that is non-NULL inside the enablement lock before proceeding.
-===============================
-MISCELLANEOUS COOKIE OPERATIONS
+Miscellaneous Cookie operations
===============================
There are a number of operations that can be used to control cookies:
- (*) Cookie pinning:
+ * Cookie pinning::
int fscache_pin_cookie(struct fscache_cookie *cookie);
void fscache_unpin_cookie(struct fscache_cookie *cookie);
@@ -790,7 +780,7 @@
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
-EIO if there's any other problem.
- (*) Data space reservation:
+ * Data space reservation::
int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
@@ -809,11 +799,10 @@
make space if it's not in use.
-=====================
-COOKIE UNREGISTRATION
+Cookie Unregistration
=====================
-To get rid of a cookie, this function should be called.
+To get rid of a cookie, this function should be called::
void fscache_relinquish_cookie(struct fscache_cookie *cookie,
const void *aux_data,
@@ -835,16 +824,14 @@
first.
-==================
-INDEX INVALIDATION
+Index Invalidation
==================
There is no direct way to invalidate an index subtree. To do this, the caller
should relinquish and retire the cookie they have, and then acquire a new one.
-======================
-DATA FILE INVALIDATION
+Data File Invalidation
======================
Sometimes it will be necessary to invalidate an object that contains data.
@@ -853,7 +840,7 @@
inode and reload from the server.
To indicate that a cache object should be invalidated, the following function
-can be called:
+can be called::
void fscache_invalidate(struct fscache_cookie *cookie);
@@ -868,13 +855,12 @@
Using the following function, the netfs can wait for the invalidation operation
to have reached a point at which it can start submitting ordinary operations
-once again:
+once again::
void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
-===========================
-FS-CACHE SPECIFIC PAGE FLAG
+FS-cache Specific Page Flag
===========================
FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
@@ -898,7 +884,7 @@
This bit does not overlap with such as PG_private. This means that FS-Cache
can be used with a filesystem that uses the block buffering code.
-There are a number of operations defined on this flag:
+There are a number of operations defined on this flag::
int PageFsCache(struct page *page);
void SetPageFsCache(struct page *page)
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.rst
similarity index 95%
rename from Documentation/filesystems/caching/object.txt
rename to Documentation/filesystems/caching/object.rst
index 100ff41..ce0e043 100644
--- a/Documentation/filesystems/caching/object.txt
+++ b/Documentation/filesystems/caching/object.rst
@@ -1,10 +1,12 @@
- ====================================================
- IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
- ====================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+In-Kernel Cache Object Representation and Management
+====================================================
By: David Howells <dhowells@redhat.com>
-Contents:
+.. Contents:
(*) Representation
@@ -18,8 +20,7 @@
(*) The set of events.
-==============
-REPRESENTATION
+Representation
==============
FS-Cache maintains an in-kernel representation of each object that a netfs is
@@ -38,7 +39,7 @@
Furthermore, both cookies and objects are hierarchical. The two hierarchies
correspond, but the cookies tree is a superset of the union of the object trees
-of multiple caches:
+of multiple caches::
NETFS INDEX TREE : CACHE 1 : CACHE 2
: :
@@ -89,8 +90,7 @@
those cookies are hidden from it.
-===============================
-OBJECT MANAGEMENT STATE MACHINE
+Object Management State Machine
===============================
Within FS-Cache, each active object is managed by its own individual state
@@ -124,7 +124,7 @@
fscache_enqueue_object()).
-PROVISION OF CPU TIME
+Provision of CPU Time
---------------------
The work to be done by the various states was given CPU time by the threads of
@@ -141,7 +141,7 @@
workqueues don't necessarily have the right numbers of threads.
-LOCKING SIMPLIFICATION
+Locking Simplification
----------------------
Because only one worker thread may be operating on any particular object's
@@ -151,8 +151,7 @@
requested from either end.
-=================
-THE SET OF STATES
+The Set of States
=================
The object state machine has a set of states that it can be in. There are
@@ -275,19 +274,17 @@
this state.
-THE SET OF EVENTS
+The Set of Events
-----------------
There are a number of events that can be raised to an object state machine:
- (*) FSCACHE_OBJECT_EV_UPDATE
-
+ FSCACHE_OBJECT_EV_UPDATE
The netfs requested that an object be updated. The state machine will ask
the cache backend to update the object, and the cache backend will ask the
netfs for details of the change through its cookie definition ops.
- (*) FSCACHE_OBJECT_EV_CLEARED
-
+ FSCACHE_OBJECT_EV_CLEARED
This is signalled in two circumstances:
(a) when an object's last child object is dropped and
@@ -296,20 +293,16 @@
This is used to proceed from the dying state.
- (*) FSCACHE_OBJECT_EV_ERROR
-
+ FSCACHE_OBJECT_EV_ERROR
This is signalled when an I/O error occurs during the processing of some
object.
- (*) FSCACHE_OBJECT_EV_RELEASE
- (*) FSCACHE_OBJECT_EV_RETIRE
-
+ FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE
These are signalled when the netfs relinquishes a cookie it was using.
The event selected depends on whether the netfs asks for the backing
object to be retired (deleted) or retained.
- (*) FSCACHE_OBJECT_EV_WITHDRAW
-
+ FSCACHE_OBJECT_EV_WITHDRAW
This is signalled when the cache backend wants to withdraw an object.
This means that the object will have to be detached from the netfs's
cookie.
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.rst
similarity index 88%
rename from Documentation/filesystems/caching/operations.txt
rename to Documentation/filesystems/caching/operations.rst
index d8976c4..9983e16 100644
--- a/Documentation/filesystems/caching/operations.txt
+++ b/Documentation/filesystems/caching/operations.rst
@@ -1,10 +1,12 @@
- ================================
- ASYNCHRONOUS OPERATIONS HANDLING
- ================================
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Asynchronous Operations Handling
+================================
By: David Howells <dhowells@redhat.com>
-Contents:
+.. Contents:
(*) Overview.
@@ -17,8 +19,7 @@
(*) Asynchronous callback.
-========
-OVERVIEW
+Overview
========
FS-Cache has an asynchronous operations handling facility that it uses for its
@@ -26,18 +27,17 @@
fscache_operation structs, though these are usually embedded into some other
structure.
-This facility is available to and expected to be be used by the cache backends,
+This facility is available to and expected to be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.
To make use of this facility, <linux/fscache-cache.h> should be #included.
-===============================
-OPERATION RECORD INITIALISATION
+Operation Record Initialisation
===============================
-An operation is recorded in an fscache_operation struct:
+An operation is recorded in an fscache_operation struct::
struct fscache_operation {
union {
@@ -50,7 +50,7 @@
};
Someone wanting to issue an operation should allocate something with this
-struct embedded in it. They should initialise it by calling:
+struct embedded in it. They should initialise it by calling::
void fscache_operation_init(struct fscache_operation *op,
fscache_operation_release_t release);
@@ -67,8 +67,7 @@
operation and waited for afterwards.
-==========
-PARAMETERS
+Parameters
==========
There are a number of parameters that can be set in the operation record's flag
@@ -87,7 +86,7 @@
If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
before submitting the operation, and the operating thread must wait for it
- to be cleared before proceeding:
+ to be cleared before proceeding::
wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
TASK_UNINTERRUPTIBLE);
@@ -101,7 +100,7 @@
page to a netfs page after the backing fs has read the page in.
If this option is used, op->fast_work and op->processor must be
- initialised before submitting the operation:
+ initialised before submitting the operation::
INIT_WORK(&op->fast_work, do_some_work);
@@ -114,7 +113,7 @@
pages that have just been fetched from a remote server.
If this option is used, op->slow_work and op->processor must be
- initialised before submitting the operation:
+ initialised before submitting the operation::
fscache_operation_init_slow(op, processor)
@@ -132,8 +131,7 @@
operations running at the same time.
-=========
-PROCEDURE
+Procedure
=========
Operations are used through the following procedure:
@@ -143,7 +141,7 @@
generic op embedded within.
(2) The submitting thread must then submit the operation for processing using
- one of the following two functions:
+ one of the following two functions::
int fscache_submit_op(struct fscache_object *object,
struct fscache_operation *op);
@@ -164,7 +162,7 @@
operation of conflicting exclusivity is in progress on the object.
If the operation is asynchronous, the manager will retain a reference to
- it, so the caller should put their reference to it by passing it to:
+ it, so the caller should put their reference to it by passing it to::
void fscache_put_operation(struct fscache_operation *op);
@@ -179,12 +177,12 @@
(4) The operation holds an effective lock upon the object, preventing other
exclusive ops conflicting until it is released. The operation can be
enqueued for further immediate asynchronous processing by adjusting the
- CPU time provisioning option if necessary, eg:
+ CPU time provisioning option if necessary, eg::
op->flags &= ~FSCACHE_OP_TYPE;
op->flags |= ~FSCACHE_OP_FAST;
- and calling:
+ and calling::
void fscache_enqueue_operation(struct fscache_operation *op)
@@ -192,13 +190,12 @@
pools.
-=====================
-ASYNCHRONOUS CALLBACK
+Asynchronous Callback
=====================
When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. This should then get at the
-container struct by using container_of():
+container struct by using container_of()::
static void fscache_write_op(struct fscache_operation *_op)
{
diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.rst
similarity index 87%
rename from Documentation/filesystems/ceph.txt
rename to Documentation/filesystems/ceph.rst
index b19b6a0..7d2ef4e 100644
--- a/Documentation/filesystems/ceph.txt
+++ b/Documentation/filesystems/ceph.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
Ceph Distributed File System
============================
@@ -15,6 +18,7 @@
* Easy deployment: most FS components are userspace daemons
Also,
+
* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)
@@ -63,7 +67,7 @@
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
-extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg:
+extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', eg::
setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
getfattr -n ceph.quota.max_bytes /some/dir
@@ -76,7 +80,7 @@
Mount Syntax
============
-The basic mount syntax is:
+The basic mount syntax is::
# mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
@@ -84,7 +88,7 @@
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
-1.2.3.4,
+1.2.3.4::
# mount -t ceph 1.2.3.4:/ /mnt/ceph
@@ -103,17 +107,17 @@
address its connection to the monitor originates from.
wsize=X
- Specify the maximum write size in bytes. Default: 16 MB.
+ Specify the maximum write size in bytes. Default: 64 MB.
rsize=X
- Specify the maximum read size in bytes. Default: 16 MB.
+ Specify the maximum read size in bytes. Default: 64 MB.
rasize=X
Specify the maximum readahead size in bytes. Default: 8 MB.
mount_timeout=X
Specify the timeout value for mount (in seconds), in the case
- of a non-responsive Ceph file system. The default is 30
+ of a non-responsive Ceph file system. The default is 60
seconds.
caps_max=X
@@ -159,18 +163,18 @@
to the default VFS implementation if this option is used.
recover_session=<no|clean>
- Set auto reconnect mode in the case where the client is blacklisted. The
+ Set auto reconnect mode in the case where the client is blocklisted. The
available modes are "no" and "clean". The default is "no".
* no: never attempt to reconnect when client detects that it has been
- blacklisted. Operations will generally fail after being blacklisted.
+ blocklisted. Operations will generally fail after being blocklisted.
* clean: client reconnects to the ceph cluster automatically when it
- detects that it has been blacklisted. During reconnect, client drops
- dirty data/metadata, invalidates page caches and writable file handles.
- After reconnect, file locks become stale because the MDS loses track
- of them. If an inode contains any stale file locks, read/write on the
- inode is not allowed until applications release all stale file locks.
+ detects that it has been blocklisted. During reconnect, client drops
+ dirty data/metadata, invalidates page caches and writable file handles.
+ After reconnect, file locks become stale because the MDS loses track
+ of them. If an inode contains any stale file locks, read/write on the
+ inode is not allowed until applications release all stale file locks.
More Information
================
@@ -179,8 +183,8 @@
https://ceph.com/
The Linux kernel client source tree is available at
- https://github.com/ceph/ceph-client.git
- git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
+ - https://github.com/ceph/ceph-client.git
+ - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
https://github.com/ceph/ceph.git
diff --git a/Documentation/filesystems/cifs/cifsroot.txt b/Documentation/filesystems/cifs/cifsroot.rst
similarity index 70%
rename from Documentation/filesystems/cifs/cifsroot.txt
rename to Documentation/filesystems/cifs/cifsroot.rst
index 0fa1a2c..4930bb4 100644
--- a/Documentation/filesystems/cifs/cifsroot.txt
+++ b/Documentation/filesystems/cifs/cifsroot.rst
@@ -1,7 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
Mounting root file system via SMB (cifs.ko)
===========================================
Written 2019 by Paulo Alcantara <palcantara@suse.de>
+
Written 2019 by Aurelien Aptel <aaptel@suse.com>
The CONFIG_CIFS_ROOT option enables experimental root file system
@@ -13,7 +17,7 @@
In order to mount, the network stack will also need to be set up by
using 'ip=' config option. For more details, see
-Documentation/filesystems/nfs/nfsroot.txt.
+Documentation/admin-guide/nfs/nfsroot.rst.
A CIFS root mount currently requires the use of SMB1+UNIX Extensions
which is only supported by the Samba server. SMB1 is the older
@@ -32,7 +36,7 @@
====================
To enable SMB1+UNIX extensions you will need to set these global
-settings in Samba smb.conf:
+settings in Samba smb.conf::
[global]
server min protocol = NT1
@@ -41,12 +45,16 @@
Kernel command line
===================
-root=/dev/cifs
+::
+
+ root=/dev/cifs
This is just a virtual device that basically tells the kernel to mount
the root file system via SMB protocol.
-cifsroot=//<server-ip>/<share>[,options]
+::
+
+ cifsroot=//<server-ip>/<share>[,options]
Enables the kernel to mount the root file system via SMB that are
located in the <server-ip> and <share> specified in this option.
@@ -65,33 +73,33 @@
Examples
========
-Export root file system as a Samba share in smb.conf file.
+Export root file system as a Samba share in smb.conf file::
-...
-[linux]
- path = /path/to/rootfs
- read only = no
- guest ok = yes
- force user = root
- force group = root
- browseable = yes
- writeable = yes
- admin users = root
- public = yes
- create mask = 0777
- directory mask = 0777
-...
+ ...
+ [linux]
+ path = /path/to/rootfs
+ read only = no
+ guest ok = yes
+ force user = root
+ force group = root
+ browseable = yes
+ writeable = yes
+ admin users = root
+ public = yes
+ create mask = 0777
+ directory mask = 0777
+ ...
-Restart smb service.
+Restart smb service::
-# systemctl restart smb
+ # systemctl restart smb
Test it under QEMU on a kernel built with CONFIG_CIFS_ROOT and
-CONFIG_IP_PNP options enabled.
+CONFIG_IP_PNP options enabled::
-# qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
- -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
- -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
+ # qemu-system-x86_64 -enable-kvm -cpu host -m 1024 \
+ -kernel /path/to/linux/arch/x86/boot/bzImage -nographic \
+ -append "root=/dev/cifs rw ip=dhcp cifsroot=//10.0.2.2/linux,username=foo,password=bar console=ttyS0 3"
1: https://wiki.samba.org/index.php/UNIX_Extensions
diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst
new file mode 100644
index 0000000..bdde7e4
--- /dev/null
+++ b/Documentation/filesystems/coda.rst
@@ -0,0 +1,1670 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Coda Kernel-Venus Interface
+===========================
+
+.. Note::
+
+ This is one of the technical documents describing a component of
+ Coda -- this document describes the client kernel-Venus interface.
+
+For more information:
+
+ http://www.coda.cs.cmu.edu
+
+For user level software needed to run Coda:
+
+ ftp://ftp.coda.cs.cmu.edu
+
+To run Coda you need to get a user level cache manager for the client,
+named Venus, as well as tools to manipulate ACLs, to log in, etc. The
+client needs to have the Coda filesystem selected in the kernel
+configuration.
+
+The server needs a user level server and at present does not depend on
+kernel support.
+
+ The Venus kernel interface
+
+ Peter J. Braam
+
+ v1.0, Nov 9, 1997
+
+ This document describes the communication between Venus and kernel
+ level filesystem code needed for the operation of the Coda file sys-
+ tem. This document version is meant to describe the current interface
+ (version 1.0) as well as improvements we envisage.
+
+.. Table of Contents
+
+ 1. Introduction
+
+ 2. Servicing Coda filesystem calls
+
+ 3. The message layer
+
+ 3.1 Implementation details
+
+ 4. The interface at the call level
+
+ 4.1 Data structures shared by the kernel and Venus
+ 4.2 The pioctl interface
+ 4.3 root
+ 4.4 lookup
+ 4.5 getattr
+ 4.6 setattr
+ 4.7 access
+ 4.8 create
+ 4.9 mkdir
+ 4.10 link
+ 4.11 symlink
+ 4.12 remove
+ 4.13 rmdir
+ 4.14 readlink
+ 4.15 open
+ 4.16 close
+ 4.17 ioctl
+ 4.18 rename
+ 4.19 readdir
+ 4.20 vget
+ 4.21 fsync
+ 4.22 inactive
+ 4.23 rdwr
+ 4.24 odymount
+ 4.25 ody_lookup
+ 4.26 ody_expand
+ 4.27 prefetch
+ 4.28 signal
+
+ 5. The minicache and downcalls
+
+ 5.1 INVALIDATE
+ 5.2 FLUSH
+ 5.3 PURGEUSER
+ 5.4 ZAPFILE
+ 5.5 ZAPDIR
+ 5.6 ZAPVNODE
+ 5.7 PURGEFID
+ 5.8 REPLACE
+
+ 6. Initialization and cleanup
+
+ 6.1 Requirements
+
+1. Introduction
+===============
+
+ A key component in the Coda Distributed File System is the cache
+ manager, Venus.
+
+ When processes on a Coda enabled system access files in the Coda
+ filesystem, requests are directed at the filesystem layer in the
+ operating system. The operating system will communicate with Venus to
+ service the request for the process. Venus manages a persistent
+ client cache and makes remote procedure calls to Coda file servers and
+ related servers (such as authentication servers) to service these
+ requests it receives from the operating system. When Venus has
+ serviced a request it replies to the operating system with appropriate
+ return codes, and other data related to the request. Optionally the
+ kernel support for Coda may maintain a minicache of recently processed
+ requests to limit the number of interactions with Venus. Venus
+ possesses the facility to inform the kernel when elements from its
+ minicache are no longer valid.
+
+ This document describes precisely this communication between the
+ kernel and Venus. The definitions of so called upcalls and downcalls
+ will be given with the format of the data they handle. We shall also
+ describe the semantic invariants resulting from the calls.
+
+ Historically Coda was implemented in a BSD file system in Mach 2.6.
+ The interface between the kernel and Venus is very similar to the BSD
+ VFS interface. Similar functionality is provided, and the format of
+ the parameters and returned data is very similar to the BSD VFS. This
+ leads to an almost natural environment for implementing a kernel-level
+ filesystem driver for Coda in a BSD system. However, other operating
+ systems such as Linux and Windows 95 and NT have virtual filesystem
+ with different interfaces.
+
+ To implement Coda on these systems some reverse engineering of the
+ Venus/Kernel protocol is necessary. Also it came to light that other
+ systems could profit significantly from certain small optimizations
+ and modifications to the protocol. To facilitate this work as well as
+ to make future ports easier, communication between Venus and the
+ kernel should be documented in great detail. This is the aim of this
+ document.
+
+2. Servicing Coda filesystem calls
+===================================
+
+ The service of a request for a Coda file system service originates in
+ a process P which accessing a Coda file. It makes a system call which
+ traps to the OS kernel. Examples of such calls trapping to the kernel
+ are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``,
+ ``rmdir``, ``chmod`` in a Unix ontext. Similar calls exist in the Win32
+ environment, and are named ``CreateFile``.
+
+ Generally the operating system handles the request in a virtual
+ filesystem (VFS) layer, which is named I/O Manager in NT and IFS
+ manager in Windows 95. The VFS is responsible for partial processing
+ of the request and for locating the specific filesystem(s) which will
+ service parts of the request. Usually the information in the path
+ assists in locating the correct FS drivers. Sometimes after extensive
+ pre-processing, the VFS starts invoking exported routines in the FS
+ driver. This is the point where the FS specific processing of the
+ request starts, and here the Coda specific kernel code comes into
+ play.
+
+ The FS layer for Coda must expose and implement several interfaces.
+ First and foremost the VFS must be able to make all necessary calls to
+ the Coda FS layer, so the Coda FS driver must expose the VFS interface
+ as applicable in the operating system. These differ very significantly
+ among operating systems, but share features such as facilities to
+ read/write and create and remove objects. The Coda FS layer services
+ such VFS requests by invoking one or more well defined services
+ offered by the cache manager Venus. When the replies from Venus have
+ come back to the FS driver, servicing of the VFS call continues and
+ finishes with a reply to the kernel's VFS. Finally the VFS layer
+ returns to the process.
+
+ As a result of this design a basic interface exposed by the FS driver
+ must allow Venus to manage message traffic. In particular Venus must
+ be able to retrieve and place messages and to be notified of the
+ arrival of a new message. The notification must be through a mechanism
+ which does not block Venus since Venus must attend to other tasks even
+ when no messages are waiting or being processed.
+
+ **Interfaces of the Coda FS Driver**
+
+ Furthermore the FS layer provides for a special path of communication
+ between a user process and Venus, called the pioctl interface. The
+ pioctl interface is used for Coda specific services, such as
+ requesting detailed information about the persistent cache managed by
+ Venus. Here the involvement of the kernel is minimal. It identifies
+ the calling process and passes the information on to Venus. When
+ Venus replies the response is passed back to the caller in unmodified
+ form.
+
+ Finally Venus allows the kernel FS driver to cache the results from
+ certain services. This is done to avoid excessive context switches
+ and results in an efficient system. However, Venus may acquire
+ information, for example from the network which implies that cached
+ information must be flushed or replaced. Venus then makes a downcall
+ to the Coda FS layer to request flushes or updates in the cache. The
+ kernel FS driver handles such requests synchronously.
+
+ Among these interfaces the VFS interface and the facility to place,
+ receive and be notified of messages are platform specific. We will
+ not go into the calls exported to the VFS layer but we will state the
+ requirements of the message exchange mechanism.
+
+
+3. The message layer
+=====================
+
+ At the lowest level the communication between Venus and the FS driver
+ proceeds through messages. The synchronization between processes
+ requesting Coda file service and Venus relies on blocking and waking
+ up processes. The Coda FS driver processes VFS- and pioctl-requests
+ on behalf of a process P, creates messages for Venus, awaits replies
+ and finally returns to the caller. The implementation of the exchange
+ of messages is platform specific, but the semantics have (so far)
+ appeared to be generally applicable. Data buffers are created by the
+ FS Driver in kernel memory on behalf of P and copied to user memory in
+ Venus.
+
+ The FS Driver while servicing P makes upcalls to Venus. Such an
+ upcall is dispatched to Venus by creating a message structure. The
+ structure contains the identification of P, the message sequence
+ number, the size of the request and a pointer to the data in kernel
+ memory for the request. Since the data buffer is re-used to hold the
+ reply from Venus, there is a field for the size of the reply. A flags
+ field is used in the message to precisely record the status of the
+ message. Additional platform dependent structures involve pointers to
+ determine the position of the message on queues and pointers to
+ synchronization objects. In the upcall routine the message structure
+ is filled in, flags are set to 0, and it is placed on the *pending*
+ queue. The routine calling upcall is responsible for allocating the
+ data buffer; its structure will be described in the next section.
+
+ A facility must exist to notify Venus that the message has been
+ created, and implemented using available synchronization objects in
+ the OS. This notification is done in the upcall context of the process
+ P. When the message is on the pending queue, process P cannot proceed
+ in upcall. The (kernel mode) processing of P in the filesystem
+ request routine must be suspended until Venus has replied. Therefore
+ the calling thread in P is blocked in upcall. A pointer in the
+ message structure will locate the synchronization object on which P is
+ sleeping.
+
+ Venus detects the notification that a message has arrived, and the FS
+ driver allow Venus to retrieve the message with a getmsg_from_kernel
+ call. This action finishes in the kernel by putting the message on the
+ queue of processing messages and setting flags to READ. Venus is
+ passed the contents of the data buffer. The getmsg_from_kernel call
+ now returns and Venus processes the request.
+
+ At some later point the FS driver receives a message from Venus,
+ namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
+ driver looks at the contents of the message and decides if:
+
+
+ * the message is a reply for a suspended thread P. If so it removes
+ the message from the processing queue and marks the message as
+ WRITTEN. Finally, the FS driver unblocks P (still in the kernel
+ mode context of Venus) and the sendmsg_to_kernel call returns to
+ Venus. The process P will be scheduled at some point and continues
+ processing its upcall with the data buffer replaced with the reply
+ from Venus.
+
+ * The message is a ``downcall``. A downcall is a request from Venus to
+ the FS Driver. The FS driver processes the request immediately
+ (usually a cache eviction or replacement) and when it finishes
+ sendmsg_to_kernel returns.
+
+ Now P awakes and continues processing upcall. There are some
+ subtleties to take account of. First P will determine if it was woken
+ up in upcall by a signal from some other source (for example an
+ attempt to terminate P) or as is normally the case by Venus in its
+ sendmsg_to_kernel call. In the normal case, the upcall routine will
+ deallocate the message structure and return. The FS routine can proceed
+ with its processing.
+
+
+ **Sleeping and IPC arrangements**
+
+ In case P is woken up by a signal and not by Venus, it will first look
+ at the flags field. If the message is not yet READ, the process P can
+ handle its signal without notifying Venus. If Venus has READ, and
+ the request should not be processed, P can send Venus a signal message
+ to indicate that it should disregard the previous message. Such
+ signals are put in the queue at the head, and read first by Venus. If
+ the message is already marked as WRITTEN it is too late to stop the
+ processing. The VFS routine will now continue. (-- If a VFS request
+ involves more than one upcall, this can lead to complicated state, an
+ extra field "handle_signals" could be added in the message structure
+ to indicate points of no return have been passed.--)
+
+
+
+3.1. Implementation details
+----------------------------
+
+ The Unix implementation of this mechanism has been through the
+ implementation of a character device associated with Coda. Venus
+ retrieves messages by doing a read on the device, replies are sent
+ with a write and notification is through the select system call on the
+ file descriptor for the device. The process P is kept waiting on an
+ interruptible wait queue object.
+
+ In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
+ call is used. The DeviceIoControl call is designed to copy buffers
+ from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
+ is issued as a synchronous call, while the getmsg_from_kernel call is
+ asynchronous. Windows EventObjects are used for notification of
+ message arrival. The process P is kept waiting on a KernelEvent
+ object in NT and a semaphore in Windows 95.
+
+
+4. The interface at the call level
+===================================
+
+
+ This section describes the upcalls a Coda FS driver can make to Venus.
+ Each of these upcalls make use of two structures: inputArgs and
+ outputArgs. In pseudo BNF form the structures take the following
+ form::
+
+
+ struct inputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_short pid; /* Common to all */
+ u_short pgid; /* Common to all */
+ struct CodaCred cred; /* Common to all */
+
+ <union "in" of call dependent parts of inputArgs>
+ };
+
+ struct outputArgs {
+ u_long opcode;
+ u_long unique; /* Keep multiple outstanding msgs distinct */
+ u_long result;
+
+ <union "out" of call dependent parts of inputArgs>
+ };
+
+
+
+ Before going on let us elucidate the role of the various fields. The
+ inputArgs start with the opcode which defines the type of service
+ requested from Venus. There are approximately 30 upcalls at present
+ which we will discuss. The unique field labels the inputArg with a
+ unique number which will identify the message uniquely. A process and
+ process group id are passed. Finally the credentials of the caller
+ are included.
+
+ Before delving into the specific calls we need to discuss a variety of
+ data structures shared by the kernel and Venus.
+
+
+
+
+4.1. Data structures shared by the kernel and Venus
+----------------------------------------------------
+
+
+ The CodaCred structure defines a variety of user and group ids as
+ they are set for the calling process. The vuid_t and vgid_t are 32 bit
+ unsigned integers. It also defines group membership in an array. On
+ Unix the CodaCred has proven sufficient to implement good security
+ semantics for Coda but the structure may have to undergo modification
+ for the Windows environment when these mature::
+
+ struct CodaCred {
+ vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */
+ vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
+ vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
+ };
+
+
+ .. Note::
+
+ It is questionable if we need CodaCreds in Venus. Finally Venus
+ doesn't know about groups, although it does create files with the
+ default uid/gid. Perhaps the list of group membership is superfluous.
+
+
+ The next item is the fundamental identifier used to identify Coda
+ files, the ViceFid. A fid of a file uniquely defines a file or
+ directory in the Coda filesystem within a cell [1]_::
+
+ typedef struct ViceFid {
+ VolumeId Volume;
+ VnodeId Vnode;
+ Unique_t Unique;
+ } ViceFid;
+
+ .. [1] A cell is agroup of Coda servers acting under the aegis of a single
+ system control machine or SCM. See the Coda Administration manual
+ for a detailed description of the role of the SCM.
+
+ Each of the constituent fields: VolumeId, VnodeId and Unique_t are
+ unsigned 32 bit integers. We envisage that a further field will need
+ to be prefixed to identify the Coda cell; this will probably take the
+ form of a Ipv6 size IP address naming the Coda cell through DNS.
+
+ The next important structure shared between Venus and the kernel is
+ the attributes of the file. The following structure is used to
+ exchange information. It has room for future extensions such as
+ support for device files (currently not present in Coda)::
+
+
+ struct coda_timespec {
+ int64_t tv_sec; /* seconds */
+ long tv_nsec; /* nanoseconds */
+ };
+
+ struct coda_vattr {
+ enum coda_vtype va_type; /* vnode type (for create) */
+ u_short va_mode; /* files access mode and type */
+ short va_nlink; /* number of references to file */
+ vuid_t va_uid; /* owner user id */
+ vgid_t va_gid; /* owner group id */
+ long va_fsid; /* file system id (dev for now) */
+ long va_fileid; /* file id */
+ u_quad_t va_size; /* file size in bytes */
+ long va_blocksize; /* blocksize preferred for i/o */
+ struct coda_timespec va_atime; /* time of last access */
+ struct coda_timespec va_mtime; /* time of last modification */
+ struct coda_timespec va_ctime; /* time file changed */
+ u_long va_gen; /* generation number of file */
+ u_long va_flags; /* flags defined for file */
+ dev_t va_rdev; /* device special file represents */
+ u_quad_t va_bytes; /* bytes of disk space held by file */
+ u_quad_t va_filerev; /* file modification number */
+ u_int va_vaflags; /* operations flags, see below */
+ long va_spare; /* remain quad aligned */
+ };
+
+
+4.2. The pioctl interface
+--------------------------
+
+
+ Coda specific requests can be made by application through the pioctl
+ interface. The pioctl is implemented as an ordinary ioctl on a
+ fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
+ a file handle and makes the ioctl call. Finally it closes the file.
+
+ The kernel involvement in this is limited to providing the facility to
+ open and close and pass the ioctl message and to verify that a path in
+ the pioctl data buffers is a file in a Coda filesystem.
+
+ The kernel is handed a data packet of the form::
+
+ struct {
+ const char *path;
+ struct ViceIoctl vidata;
+ int follow;
+ } data;
+
+
+
+ where::
+
+
+ struct ViceIoctl {
+ caddr_t in, out; /* Data to be transferred in, or out */
+ short in_size; /* Size of input buffer <= 2K */
+ short out_size; /* Maximum size of output buffer, <= 2K */
+ };
+
+
+
+ The path must be a Coda file, otherwise the ioctl upcall will not be
+ made.
+
+ .. Note:: The data structures and code are a mess. We need to clean this up.
+
+
+**We now proceed to document the individual calls**:
+
+
+4.3. root
+----------
+
+
+ Arguments
+ in
+
+ empty
+
+ out::
+
+ struct cfs_root_out {
+ ViceFid VFid;
+ } cfs_root;
+
+
+
+ Description
+ This call is made to Venus during the initialization of
+ the Coda filesystem. If the result is zero, the cfs_root structure
+ contains the ViceFid of the root of the Coda filesystem. If a non-zero
+ result is generated, its value is a platform dependent error code
+ indicating the difficulty Venus encountered in locating the root of
+ the Coda filesystem.
+
+4.4. lookup
+------------
+
+
+ Summary
+ Find the ViceFid and type of an object in a directory if it exists.
+
+ Arguments
+ in::
+
+ struct cfs_lookup_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_lookup;
+
+
+
+ out::
+
+ struct cfs_lookup_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_lookup;
+
+
+
+ Description
+ This call is made to determine the ViceFid and filetype of
+ a directory entry. The directory entry requested carries name 'name'
+ and Venus will search the directory identified by cfs_lookup_in.VFid.
+ The result may indicate that the name does not exist, or that
+ difficulty was encountered in finding it (e.g. due to disconnection).
+ If the result is zero, the field cfs_lookup_out.VFid contains the
+ targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the
+ type of object the name designates.
+
+ The name of the object is an 8 bit character string of maximum length
+ CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.)
+
+ It is extremely important to realize that Venus bitwise ors the field
+ cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should
+ not be put in the kernel name cache.
+
+ .. Note::
+
+ The type of the vtype is currently wrong. It should be
+ coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
+
+
+4.5. getattr
+-------------
+
+
+ Summary Get the attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_getattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr; /* XXXXX */
+ } cfs_getattr;
+
+
+
+ out::
+
+ struct cfs_getattr_out {
+ struct coda_vattr attr;
+ } cfs_getattr;
+
+
+
+ Description
+ This call returns the attributes of the file identified by fid.
+
+ Errors
+ Errors can occur if the object with fid does not exist, is
+ unaccessible or if the caller does not have permission to fetch
+ attributes.
+
+ .. Note::
+
+ Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
+ the attributes as well as the Fid for the instantiation of an internal
+ "inode" or "FileHandle". A significant improvement in performance on
+ such systems could be made by combining the lookup and getattr calls
+ both at the Venus/kernel interaction level and at the RPC level.
+
+ The vattr structure included in the input arguments is superfluous and
+ should be removed.
+
+
+4.6. setattr
+-------------
+
+
+ Summary
+ Set the attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_setattr_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_setattr;
+
+
+
+
+ out
+
+ empty
+
+ Description
+ The structure attr is filled with attributes to be changed
+ in BSD style. Attributes not to be changed are set to -1, apart from
+ vtype which is set to VNON. Other are set to the value to be assigned.
+ The only attributes which the FS driver may request to change are the
+ mode, owner, groupid, atime, mtime and ctime. The return value
+ indicates success or failure.
+
+ Errors
+ A variety of errors can occur. The object may not exist, may
+ be inaccessible, or permission may not be granted by Venus.
+
+
+4.7. access
+------------
+
+
+ Arguments
+ in::
+
+ struct cfs_access_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_access;
+
+
+
+ out
+
+ empty
+
+ Description
+ Verify if access to the object identified by VFid for
+ operations described by flags is permitted. The result indicates if
+ access will be granted. It is important to remember that Coda uses
+ ACLs to enforce protection and that ultimately the servers, not the
+ clients enforce the security of the system. The result of this call
+ will depend on whether a token is held by the user.
+
+ Errors
+ The object may not exist, or the ACL describing the protection
+ may not be accessible.
+
+
+4.8. create
+------------
+
+
+ Summary
+ Invoked to create a file
+
+ Arguments
+ in::
+
+ struct cfs_create_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ int excl;
+ int mode;
+ char *name; /* Place holder for data. */
+ } cfs_create;
+
+
+
+
+ out::
+
+ struct cfs_create_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_create;
+
+
+
+ Description
+ This upcall is invoked to request creation of a file.
+ The file will be created in the directory identified by VFid, its name
+ will be name, and the mode will be mode. If excl is set an error will
+ be returned if the file already exists. If the size field in attr is
+ set to zero the file will be truncated. The uid and gid of the file
+ are set by converting the CodaCred to a uid using a macro CRTOUID
+ (this macro is platform dependent). Upon success the VFid and
+ attributes of the file are returned. The Coda FS Driver will normally
+ instantiate a vnode, inode or file handle at kernel level for the new
+ object.
+
+
+ Errors
+ A variety of errors can occur. Permissions may be insufficient.
+ If the object exists and is not a file the error EISDIR is returned
+ under Unix.
+
+ .. Note::
+
+ The packing of parameters is very inefficient and appears to
+ indicate confusion between the system call creat and the VFS operation
+ create. The VFS operation create is only called to create new objects.
+ This create call differs from the Unix one in that it is not invoked
+ to return a file descriptor. The truncate and exclusive options,
+ together with the mode, could simply be part of the mode as it is
+ under Unix. There should be no flags argument; this is used in open
+ (2) to return a file descriptor for READ or WRITE mode.
+
+ The attributes of the directory should be returned too, since the size
+ and mtime changed.
+
+
+4.9. mkdir
+-----------
+
+
+ Summary
+ Create a new directory.
+
+ Arguments
+ in::
+
+ struct cfs_mkdir_in {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ char *name; /* Place holder for data. */
+ } cfs_mkdir;
+
+
+
+ out::
+
+ struct cfs_mkdir_out {
+ ViceFid VFid;
+ struct coda_vattr attr;
+ } cfs_mkdir;
+
+
+
+
+ Description
+ This call is similar to create but creates a directory.
+ Only the mode field in the input parameters is used for creation.
+ Upon successful creation, the attr returned contains the attributes of
+ the new directory.
+
+ Errors
+ As for create.
+
+ .. Note::
+
+ The input parameter should be changed to mode instead of
+ attributes.
+
+ The attributes of the parent should be returned since the size and
+ mtime changes.
+
+
+4.10. link
+-----------
+
+
+ Summary
+ Create a link to an existing file.
+
+ Arguments
+ in::
+
+ struct cfs_link_in {
+ ViceFid sourceFid; /* cnode to link *to* */
+ ViceFid destFid; /* Directory in which to place link */
+ char *tname; /* Place holder for data. */
+ } cfs_link;
+
+
+
+ out
+
+ empty
+
+ Description
+ This call creates a link to the sourceFid in the directory
+ identified by destFid with name tname. The source must reside in the
+ target's parent, i.e. the source must be have parent destFid, i.e. Coda
+ does not support cross directory hard links. Only the return value is
+ relevant. It indicates success or the type of failure.
+
+ Errors
+ The usual errors can occur.
+
+
+4.11. symlink
+--------------
+
+
+ Summary
+ create a symbolic link
+
+ Arguments
+ in::
+
+ struct cfs_symlink_in {
+ ViceFid VFid; /* Directory to put symlink in */
+ char *srcname;
+ struct coda_vattr attr;
+ char *tname;
+ } cfs_symlink;
+
+
+
+ out
+
+ none
+
+ Description
+ Create a symbolic link. The link is to be placed in the
+ directory identified by VFid and named tname. It should point to the
+ pathname srcname. The attributes of the newly created object are to
+ be set to attr.
+
+ .. Note::
+
+ The attributes of the target directory should be returned since
+ its size changed.
+
+
+4.12. remove
+-------------
+
+
+ Summary
+ Remove a file
+
+ Arguments
+ in::
+
+ struct cfs_remove_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_remove;
+
+
+
+ out
+
+ none
+
+ Description
+ Remove file named cfs_remove_in.name in directory
+ identified by VFid.
+
+
+ .. Note::
+
+ The attributes of the directory should be returned since its
+ mtime and size may change.
+
+
+4.13. rmdir
+------------
+
+
+ Summary
+ Remove a directory
+
+ Arguments
+ in::
+
+ struct cfs_rmdir_in {
+ ViceFid VFid;
+ char *name; /* Place holder for data. */
+ } cfs_rmdir;
+
+
+
+ out
+
+ none
+
+ Description
+ Remove the directory with name 'name' from the directory
+ identified by VFid.
+
+ .. Note:: The attributes of the parent directory should be returned since
+ its mtime and size may change.
+
+
+4.14. readlink
+---------------
+
+
+ Summary
+ Read the value of a symbolic link.
+
+ Arguments
+ in::
+
+ struct cfs_readlink_in {
+ ViceFid VFid;
+ } cfs_readlink;
+
+
+
+ out::
+
+ struct cfs_readlink_out {
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readlink;
+
+
+
+ Description
+ This routine reads the contents of symbolic link
+ identified by VFid into the buffer data. The buffer data must be able
+ to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
+
+ Errors
+ No unusual errors.
+
+
+4.15. open
+-----------
+
+
+ Summary
+ Open a file.
+
+ Arguments
+ in::
+
+ struct cfs_open_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_open;
+
+
+
+ out::
+
+ struct cfs_open_out {
+ dev_t dev;
+ ino_t inode;
+ } cfs_open;
+
+
+
+ Description
+ This request asks Venus to place the file identified by
+ VFid in its cache and to note that the calling process wishes to open
+ it with flags as in open(2). The return value to the kernel differs
+ for Unix and Windows systems. For Unix systems the Coda FS Driver is
+ informed of the device and inode number of the container file in the
+ fields dev and inode. For Windows the path of the container file is
+ returned to the kernel.
+
+
+ .. Note::
+
+ Currently the cfs_open_out structure is not properly adapted to
+ deal with the Windows case. It might be best to implement two
+ upcalls, one to open aiming at a container file name, the other at a
+ container file inode.
+
+
+4.16. close
+------------
+
+
+ Summary
+ Close a file, update it on the servers.
+
+ Arguments
+ in::
+
+ struct cfs_close_in {
+ ViceFid VFid;
+ int flags;
+ } cfs_close;
+
+
+
+ out
+
+ none
+
+ Description
+ Close the file identified by VFid.
+
+ .. Note::
+
+ The flags argument is bogus and not used. However, Venus' code
+ has room to deal with an execp input field, probably this field should
+ be used to inform Venus that the file was closed but is still memory
+ mapped for execution. There are comments about fetching versus not
+ fetching the data in Venus vproc_vfscalls. This seems silly. If a
+ file is being closed, the data in the container file is to be the new
+ data. Here again the execp flag might be in play to create confusion:
+ currently Venus might think a file can be flushed from the cache when
+ it is still memory mapped. This needs to be understood.
+
+
+4.17. ioctl
+------------
+
+
+ Summary
+ Do an ioctl on a file. This includes the pioctl interface.
+
+ Arguments
+ in::
+
+ struct cfs_ioctl_in {
+ ViceFid VFid;
+ int cmd;
+ int len;
+ int rwflag;
+ char *data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ out::
+
+
+ struct cfs_ioctl_out {
+ int len;
+ caddr_t data; /* Place holder for data. */
+ } cfs_ioctl;
+
+
+
+ Description
+ Do an ioctl operation on a file. The command, len and
+ data arguments are filled as usual. flags is not used by Venus.
+
+ .. Note::
+
+ Another bogus parameter. flags is not used. What is the
+ business about PREFETCHING in the Venus code?
+
+
+
+4.18. rename
+-------------
+
+
+ Summary
+ Rename a fid.
+
+ Arguments
+ in::
+
+ struct cfs_rename_in {
+ ViceFid sourceFid;
+ char *srcname;
+ ViceFid destFid;
+ char *destname;
+ } cfs_rename;
+
+
+
+ out
+
+ none
+
+ Description
+ Rename the object with name srcname in directory
+ sourceFid to destname in destFid. It is important that the names
+ srcname and destname are 0 terminated strings. Strings in Unix
+ kernels are not always null terminated.
+
+
+4.19. readdir
+--------------
+
+
+ Summary
+ Read directory entries.
+
+ Arguments
+ in::
+
+ struct cfs_readdir_in {
+ ViceFid VFid;
+ int count;
+ int offset;
+ } cfs_readdir;
+
+
+
+
+ out::
+
+ struct cfs_readdir_out {
+ int size;
+ caddr_t data; /* Place holder for data. */
+ } cfs_readdir;
+
+
+
+ Description
+ Read directory entries from VFid starting at offset and
+ read at most count bytes. Returns the data in data and returns
+ the size in size.
+
+
+ .. Note::
+
+ This call is not used. Readdir operations exploit container
+ files. We will re-evaluate this during the directory revamp which is
+ about to take place.
+
+
+4.20. vget
+-----------
+
+
+ Summary
+ instructs Venus to do an FSDB->Get.
+
+ Arguments
+ in::
+
+ struct cfs_vget_in {
+ ViceFid VFid;
+ } cfs_vget;
+
+
+
+ out::
+
+ struct cfs_vget_out {
+ ViceFid VFid;
+ int vtype;
+ } cfs_vget;
+
+
+
+ Description
+ This upcall asks Venus to do a get operation on an fsobj
+ labelled by VFid.
+
+ .. Note::
+
+ This operation is not used. However, it is extremely useful
+ since it can be used to deal with read/write memory mapped files.
+ These can be "pinned" in the Venus cache using vget and released with
+ inactive.
+
+
+4.21. fsync
+------------
+
+
+ Summary
+ Tell Venus to update the RVM attributes of a file.
+
+ Arguments
+ in::
+
+ struct cfs_fsync_in {
+ ViceFid VFid;
+ } cfs_fsync;
+
+
+
+ out
+
+ none
+
+ Description
+ Ask Venus to update RVM attributes of object VFid. This
+ should be called as part of kernel level fsync type calls. The
+ result indicates if the syncing was successful.
+
+ .. Note:: Linux does not implement this call. It should.
+
+
+4.22. inactive
+---------------
+
+
+ Summary
+ Tell Venus a vnode is no longer in use.
+
+ Arguments
+ in::
+
+ struct cfs_inactive_in {
+ ViceFid VFid;
+ } cfs_inactive;
+
+
+
+ out
+
+ none
+
+ Description
+ This operation returns EOPNOTSUPP.
+
+ .. Note:: This should perhaps be removed.
+
+
+4.23. rdwr
+-----------
+
+
+ Summary
+ Read or write from a file
+
+ Arguments
+ in::
+
+ struct cfs_rdwr_in {
+ ViceFid VFid;
+ int rwflag;
+ int count;
+ int offset;
+ int ioflag;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+
+ out::
+
+ struct cfs_rdwr_out {
+ int rwflag;
+ int count;
+ caddr_t data; /* Place holder for data. */
+ } cfs_rdwr;
+
+
+
+ Description
+ This upcall asks Venus to read or write from a file.
+
+
+ .. Note::
+
+ It should be removed since it is against the Coda philosophy that
+ read/write operations never reach Venus. I have been told the
+ operation does not work. It is not currently used.
+
+
+
+4.24. odymount
+---------------
+
+
+ Summary
+ Allows mounting multiple Coda "filesystems" on one Unix mount point.
+
+ Arguments
+ in::
+
+ struct ody_mount_in {
+ char *name; /* Place holder for data. */
+ } ody_mount;
+
+
+
+ out::
+
+ struct ody_mount_out {
+ ViceFid VFid;
+ } ody_mount;
+
+
+
+ Description
+ Asks Venus to return the rootfid of a Coda system named
+ name. The fid is returned in VFid.
+
+ .. Note::
+
+ This call was used by David for dynamic sets. It should be
+ removed since it causes a jungle of pointers in the VFS mounting area.
+ It is not used by Coda proper. Call is not implemented by Venus.
+
+
+4.25. ody_lookup
+-----------------
+
+
+ Summary
+ Looks up something.
+
+ Arguments
+ in
+
+ irrelevant
+
+
+ out
+
+ irrelevant
+
+
+ .. Note:: Gut it. Call is not implemented by Venus.
+
+
+4.26. ody_expand
+-----------------
+
+
+ Summary
+ expands something in a dynamic set.
+
+ Arguments
+ in
+
+ irrelevant
+
+ out
+
+ irrelevant
+
+ .. Note:: Gut it. Call is not implemented by Venus.
+
+
+4.27. prefetch
+---------------
+
+
+ Summary
+ Prefetch a dynamic set.
+
+ Arguments
+
+ in
+
+ Not documented.
+
+ out
+
+ Not documented.
+
+ Description
+ Venus worker.cc has support for this call, although it is
+ noted that it doesn't work. Not surprising, since the kernel does not
+ have support for it. (ODY_PREFETCH is not a defined operation).
+
+
+ .. Note:: Gut it. It isn't working and isn't used by Coda.
+
+
+
+4.28. signal
+-------------
+
+
+ Summary
+ Send Venus a signal about an upcall.
+
+ Arguments
+ in
+
+ none
+
+ out
+
+ not applicable.
+
+ Description
+ This is an out-of-band upcall to Venus to inform Venus
+ that the calling process received a signal after Venus read the
+ message from the input queue. Venus is supposed to clean up the
+ operation.
+
+ Errors
+ No reply is given.
+
+ .. Note::
+
+ We need to better understand what Venus needs to clean up and if
+ it is doing this correctly. Also we need to handle multiple upcall
+ per system call situations correctly. It would be important to know
+ what state changes in Venus take place after an upcall for which the
+ kernel is responsible for notifying Venus to clean up (e.g. open
+ definitely is such a state change, but many others are maybe not).
+
+
+5. The minicache and downcalls
+===============================
+
+
+ The Coda FS Driver can cache results of lookup and access upcalls, to
+ limit the frequency of upcalls. Upcalls carry a price since a process
+ context switch needs to take place. The counterpart of caching the
+ information is that Venus will notify the FS Driver that cached
+ entries must be flushed or renamed.
+
+ The kernel code generally has to maintain a structure which links the
+ internal file handles (called vnodes in BSD, inodes in Linux and
+ FileHandles in Windows) with the ViceFid's which Venus maintains. The
+ reason is that frequent translations back and forth are needed in
+ order to make upcalls and use the results of upcalls. Such linking
+ objects are called cnodes.
+
+ The current minicache implementations have cache entries which record
+ the following:
+
+ 1. the name of the file
+
+ 2. the cnode of the directory containing the object
+
+ 3. a list of CodaCred's for which the lookup is permitted.
+
+ 4. the cnode of the object
+
+ The lookup call in the Coda FS Driver may request the cnode of the
+ desired object from the cache, by passing its name, directory and the
+ CodaCred's of the caller. The cache will return the cnode or indicate
+ that it cannot be found. The Coda FS Driver must be careful to
+ invalidate cache entries when it modifies or removes objects.
+
+ When Venus obtains information that indicates that cache entries are
+ no longer valid, it will make a downcall to the kernel. Downcalls are
+ intercepted by the Coda FS Driver and lead to cache invalidations of
+ the kind described below. The Coda FS Driver does not return an error
+ unless the downcall data could not be read into kernel memory.
+
+
+5.1. INVALIDATE
+----------------
+
+
+ No information is available on this call.
+
+
+5.2. FLUSH
+-----------
+
+
+
+ Arguments
+ None
+
+ Summary
+ Flush the name cache entirely.
+
+ Description
+ Venus issues this call upon startup and when it dies. This
+ is to prevent stale cache information being held. Some operating
+ systems allow the kernel name cache to be switched off dynamically.
+ When this is done, this downcall is made.
+
+
+5.3. PURGEUSER
+---------------
+
+
+ Arguments
+ ::
+
+ struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
+ struct CodaCred cred;
+ } cfs_purgeuser;
+
+
+
+ Description
+ Remove all entries in the cache carrying the Cred. This
+ call is issued when tokens for a user expire or are flushed.
+
+
+5.4. ZAPFILE
+-------------
+
+
+ Arguments
+ ::
+
+ struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapfile;
+
+
+
+ Description
+ Remove all entries which have the (dir vnode, name) pair.
+ This is issued as a result of an invalidation of cached attributes of
+ a vnode.
+
+ .. Note::
+
+ Call is not named correctly in NetBSD and Mach. The minicache
+ zapfile routine takes different arguments. Linux does not implement
+ the invalidation of attributes correctly.
+
+
+
+5.5. ZAPDIR
+------------
+
+
+ Arguments
+ ::
+
+ struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_zapdir;
+
+
+
+ Description
+ Remove all entries in the cache lying in a directory
+ CodaFid, and all children of this directory. This call is issued when
+ Venus receives a callback on the directory.
+
+
+5.6. ZAPVNODE
+--------------
+
+
+
+ Arguments
+ ::
+
+ struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
+ struct CodaCred cred;
+ ViceFid VFid;
+ } cfs_zapvnode;
+
+
+
+ Description
+ Remove all entries in the cache carrying the cred and VFid
+ as in the arguments. This downcall is probably never issued.
+
+
+5.7. PURGEFID
+--------------
+
+
+ Arguments
+ ::
+
+ struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
+ ViceFid CodaFid;
+ } cfs_purgefid;
+
+
+
+ Description
+ Flush the attribute for the file. If it is a dir (odd
+ vnode), purge its children from the namecache and remove the file from the
+ namecache.
+
+
+
+5.8. REPLACE
+-------------
+
+
+ Summary
+ Replace the Fid's for a collection of names.
+
+ Arguments
+ ::
+
+ struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
+ ViceFid NewFid;
+ ViceFid OldFid;
+ } cfs_replace;
+
+
+
+ Description
+ This routine replaces a ViceFid in the name cache with
+ another. It is added to allow Venus during reintegration to replace
+ locally allocated temp fids while disconnected with global fids even
+ when the reference counts on those fids are not zero.
+
+
+6. Initialization and cleanup
+==============================
+
+
+ This section gives brief hints as to desirable features for the Coda
+ FS Driver at startup and upon shutdown or Venus failures. Before
+ entering the discussion it is useful to repeat that the Coda FS Driver
+ maintains the following data:
+
+
+ 1. message queues
+
+ 2. cnodes
+
+ 3. name cache entries
+
+ The name cache entries are entirely private to the driver, so they
+ can easily be manipulated. The message queues will generally have
+ clear points of initialization and destruction. The cnodes are
+ much more delicate. User processes hold reference counts in Coda
+ filesystems and it can be difficult to clean up the cnodes.
+
+ It can expect requests through:
+
+ 1. the message subsystem
+
+ 2. the VFS layer
+
+ 3. pioctl interface
+
+ Currently the pioctl passes through the VFS for Coda so we can
+ treat these similarly.
+
+
+6.1. Requirements
+------------------
+
+
+ The following requirements should be accommodated:
+
+ 1. The message queues should have open and close routines. On Unix
+ the opening of the character devices are such routines.
+
+ - Before opening, no messages can be placed.
+
+ - Opening will remove any old messages still pending.
+
+ - Close will notify any sleeping processes that their upcall cannot
+ be completed.
+
+ - Close will free all memory allocated by the message queues.
+
+
+ 2. At open the namecache shall be initialized to empty state.
+
+ 3. Before the message queues are open, all VFS operations will fail.
+ Fortunately this can be achieved by making sure than mounting the
+ Coda filesystem cannot succeed before opening.
+
+ 4. After closing of the queues, no VFS operations can succeed. Here
+ one needs to be careful, since a few operations (lookup,
+ read/write, readdir) can proceed without upcalls. These must be
+ explicitly blocked.
+
+ 5. Upon closing the namecache shall be flushed and disabled.
+
+ 6. All memory held by cnodes can be freed without relying on upcalls.
+
+ 7. Unmounting the file system can be done without relying on upcalls.
+
+ 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
+ get the rootfid or the attributes of the rootfid. The latter is
+ best implemented by Venus fetching these objects before attempting
+ to mount.
+
+ .. Note::
+
+ NetBSD in particular but also Linux have not implemented the
+ above requirements fully. For smooth operation this needs to be
+ corrected.
+
+
+
diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
deleted file mode 100644
index 1711ad4..0000000
--- a/Documentation/filesystems/coda.txt
+++ /dev/null
@@ -1,1676 +0,0 @@
-NOTE:
-This is one of the technical documents describing a component of
-Coda -- this document describes the client kernel-Venus interface.
-
-For more information:
- http://www.coda.cs.cmu.edu
-For user level software needed to run Coda:
- ftp://ftp.coda.cs.cmu.edu
-
-To run Coda you need to get a user level cache manager for the client,
-named Venus, as well as tools to manipulate ACLs, to log in, etc. The
-client needs to have the Coda filesystem selected in the kernel
-configuration.
-
-The server needs a user level server and at present does not depend on
-kernel support.
-
-
-
-
-
-
-
- The Venus kernel interface
- Peter J. Braam
- v1.0, Nov 9, 1997
-
- This document describes the communication between Venus and kernel
- level filesystem code needed for the operation of the Coda file sys-
- tem. This document version is meant to describe the current interface
- (version 1.0) as well as improvements we envisage.
- ______________________________________________________________________
-
- Table of Contents
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 1. Introduction
-
- 2. Servicing Coda filesystem calls
-
- 3. The message layer
-
- 3.1 Implementation details
-
- 4. The interface at the call level
-
- 4.1 Data structures shared by the kernel and Venus
- 4.2 The pioctl interface
- 4.3 root
- 4.4 lookup
- 4.5 getattr
- 4.6 setattr
- 4.7 access
- 4.8 create
- 4.9 mkdir
- 4.10 link
- 4.11 symlink
- 4.12 remove
- 4.13 rmdir
- 4.14 readlink
- 4.15 open
- 4.16 close
- 4.17 ioctl
- 4.18 rename
- 4.19 readdir
- 4.20 vget
- 4.21 fsync
- 4.22 inactive
- 4.23 rdwr
- 4.24 odymount
- 4.25 ody_lookup
- 4.26 ody_expand
- 4.27 prefetch
- 4.28 signal
-
- 5. The minicache and downcalls
-
- 5.1 INVALIDATE
- 5.2 FLUSH
- 5.3 PURGEUSER
- 5.4 ZAPFILE
- 5.5 ZAPDIR
- 5.6 ZAPVNODE
- 5.7 PURGEFID
- 5.8 REPLACE
-
- 6. Initialization and cleanup
-
- 6.1 Requirements
-
-
- ______________________________________________________________________
- 0wpage
-
- 11.. IInnttrroodduuccttiioonn
-
-
-
- A key component in the Coda Distributed File System is the cache
- manager, _V_e_n_u_s.
-
-
- When processes on a Coda enabled system access files in the Coda
- filesystem, requests are directed at the filesystem layer in the
- operating system. The operating system will communicate with Venus to
- service the request for the process. Venus manages a persistent
- client cache and makes remote procedure calls to Coda file servers and
- related servers (such as authentication servers) to service these
- requests it receives from the operating system. When Venus has
- serviced a request it replies to the operating system with appropriate
- return codes, and other data related to the request. Optionally the
- kernel support for Coda may maintain a minicache of recently processed
- requests to limit the number of interactions with Venus. Venus
- possesses the facility to inform the kernel when elements from its
- minicache are no longer valid.
-
- This document describes precisely this communication between the
- kernel and Venus. The definitions of so called upcalls and downcalls
- will be given with the format of the data they handle. We shall also
- describe the semantic invariants resulting from the calls.
-
- Historically Coda was implemented in a BSD file system in Mach 2.6.
- The interface between the kernel and Venus is very similar to the BSD
- VFS interface. Similar functionality is provided, and the format of
- the parameters and returned data is very similar to the BSD VFS. This
- leads to an almost natural environment for implementing a kernel-level
- filesystem driver for Coda in a BSD system. However, other operating
- systems such as Linux and Windows 95 and NT have virtual filesystem
- with different interfaces.
-
- To implement Coda on these systems some reverse engineering of the
- Venus/Kernel protocol is necessary. Also it came to light that other
- systems could profit significantly from certain small optimizations
- and modifications to the protocol. To facilitate this work as well as
- to make future ports easier, communication between Venus and the
- kernel should be documented in great detail. This is the aim of this
- document.
-
- 0wpage
-
- 22.. SSeerrvviicciinngg CCooddaa ffiilleessyysstteemm ccaallllss
-
- The service of a request for a Coda file system service originates in
- a process PP which accessing a Coda file. It makes a system call which
- traps to the OS kernel. Examples of such calls trapping to the kernel
- are _r_e_a_d_, _w_r_i_t_e_, _o_p_e_n_, _c_l_o_s_e_, _c_r_e_a_t_e_, _m_k_d_i_r_, _r_m_d_i_r_, _c_h_m_o_d in a Unix
- context. Similar calls exist in the Win32 environment, and are named
- _C_r_e_a_t_e_F_i_l_e_, .
-
- Generally the operating system handles the request in a virtual
- filesystem (VFS) layer, which is named I/O Manager in NT and IFS
- manager in Windows 95. The VFS is responsible for partial processing
- of the request and for locating the specific filesystem(s) which will
- service parts of the request. Usually the information in the path
- assists in locating the correct FS drivers. Sometimes after extensive
- pre-processing, the VFS starts invoking exported routines in the FS
- driver. This is the point where the FS specific processing of the
- request starts, and here the Coda specific kernel code comes into
- play.
-
- The FS layer for Coda must expose and implement several interfaces.
- First and foremost the VFS must be able to make all necessary calls to
- the Coda FS layer, so the Coda FS driver must expose the VFS interface
- as applicable in the operating system. These differ very significantly
- among operating systems, but share features such as facilities to
- read/write and create and remove objects. The Coda FS layer services
- such VFS requests by invoking one or more well defined services
- offered by the cache manager Venus. When the replies from Venus have
- come back to the FS driver, servicing of the VFS call continues and
- finishes with a reply to the kernel's VFS. Finally the VFS layer
- returns to the process.
-
- As a result of this design a basic interface exposed by the FS driver
- must allow Venus to manage message traffic. In particular Venus must
- be able to retrieve and place messages and to be notified of the
- arrival of a new message. The notification must be through a mechanism
- which does not block Venus since Venus must attend to other tasks even
- when no messages are waiting or being processed.
-
-
-
-
-
-
- Interfaces of the Coda FS Driver
-
- Furthermore the FS layer provides for a special path of communication
- between a user process and Venus, called the pioctl interface. The
- pioctl interface is used for Coda specific services, such as
- requesting detailed information about the persistent cache managed by
- Venus. Here the involvement of the kernel is minimal. It identifies
- the calling process and passes the information on to Venus. When
- Venus replies the response is passed back to the caller in unmodified
- form.
-
- Finally Venus allows the kernel FS driver to cache the results from
- certain services. This is done to avoid excessive context switches
- and results in an efficient system. However, Venus may acquire
- information, for example from the network which implies that cached
- information must be flushed or replaced. Venus then makes a downcall
- to the Coda FS layer to request flushes or updates in the cache. The
- kernel FS driver handles such requests synchronously.
-
- Among these interfaces the VFS interface and the facility to place,
- receive and be notified of messages are platform specific. We will
- not go into the calls exported to the VFS layer but we will state the
- requirements of the message exchange mechanism.
-
- 0wpage
-
- 33.. TThhee mmeessssaaggee llaayyeerr
-
-
-
- At the lowest level the communication between Venus and the FS driver
- proceeds through messages. The synchronization between processes
- requesting Coda file service and Venus relies on blocking and waking
- up processes. The Coda FS driver processes VFS- and pioctl-requests
- on behalf of a process P, creates messages for Venus, awaits replies
- and finally returns to the caller. The implementation of the exchange
- of messages is platform specific, but the semantics have (so far)
- appeared to be generally applicable. Data buffers are created by the
- FS Driver in kernel memory on behalf of P and copied to user memory in
- Venus.
-
- The FS Driver while servicing P makes upcalls to Venus. Such an
- upcall is dispatched to Venus by creating a message structure. The
- structure contains the identification of P, the message sequence
- number, the size of the request and a pointer to the data in kernel
- memory for the request. Since the data buffer is re-used to hold the
- reply from Venus, there is a field for the size of the reply. A flags
- field is used in the message to precisely record the status of the
- message. Additional platform dependent structures involve pointers to
- determine the position of the message on queues and pointers to
- synchronization objects. In the upcall routine the message structure
- is filled in, flags are set to 0, and it is placed on the _p_e_n_d_i_n_g
- queue. The routine calling upcall is responsible for allocating the
- data buffer; its structure will be described in the next section.
-
- A facility must exist to notify Venus that the message has been
- created, and implemented using available synchronization objects in
- the OS. This notification is done in the upcall context of the process
- P. When the message is on the pending queue, process P cannot proceed
- in upcall. The (kernel mode) processing of P in the filesystem
- request routine must be suspended until Venus has replied. Therefore
- the calling thread in P is blocked in upcall. A pointer in the
- message structure will locate the synchronization object on which P is
- sleeping.
-
- Venus detects the notification that a message has arrived, and the FS
- driver allow Venus to retrieve the message with a getmsg_from_kernel
- call. This action finishes in the kernel by putting the message on the
- queue of processing messages and setting flags to READ. Venus is
- passed the contents of the data buffer. The getmsg_from_kernel call
- now returns and Venus processes the request.
-
- At some later point the FS driver receives a message from Venus,
- namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS
- driver looks at the contents of the message and decides if:
-
-
- +o the message is a reply for a suspended thread P. If so it removes
- the message from the processing queue and marks the message as
- WRITTEN. Finally, the FS driver unblocks P (still in the kernel
- mode context of Venus) and the sendmsg_to_kernel call returns to
- Venus. The process P will be scheduled at some point and continues
- processing its upcall with the data buffer replaced with the reply
- from Venus.
-
- +o The message is a _d_o_w_n_c_a_l_l. A downcall is a request from Venus to
- the FS Driver. The FS driver processes the request immediately
- (usually a cache eviction or replacement) and when it finishes
- sendmsg_to_kernel returns.
-
- Now P awakes and continues processing upcall. There are some
- subtleties to take account of. First P will determine if it was woken
- up in upcall by a signal from some other source (for example an
- attempt to terminate P) or as is normally the case by Venus in its
- sendmsg_to_kernel call. In the normal case, the upcall routine will
- deallocate the message structure and return. The FS routine can proceed
- with its processing.
-
-
-
-
-
-
-
- Sleeping and IPC arrangements
-
- In case P is woken up by a signal and not by Venus, it will first look
- at the flags field. If the message is not yet READ, the process P can
- handle its signal without notifying Venus. If Venus has READ, and
- the request should not be processed, P can send Venus a signal message
- to indicate that it should disregard the previous message. Such
- signals are put in the queue at the head, and read first by Venus. If
- the message is already marked as WRITTEN it is too late to stop the
- processing. The VFS routine will now continue. (-- If a VFS request
- involves more than one upcall, this can lead to complicated state, an
- extra field "handle_signals" could be added in the message structure
- to indicate points of no return have been passed.--)
-
-
-
- 33..11.. IImmpplleemmeennttaattiioonn ddeettaaiillss
-
- The Unix implementation of this mechanism has been through the
- implementation of a character device associated with Coda. Venus
- retrieves messages by doing a read on the device, replies are sent
- with a write and notification is through the select system call on the
- file descriptor for the device. The process P is kept waiting on an
- interruptible wait queue object.
-
- In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl
- call is used. The DeviceIoControl call is designed to copy buffers
- from user memory to kernel memory with OPCODES. The sendmsg_to_kernel
- is issued as a synchronous call, while the getmsg_from_kernel call is
- asynchronous. Windows EventObjects are used for notification of
- message arrival. The process P is kept waiting on a KernelEvent
- object in NT and a semaphore in Windows 95.
-
- 0wpage
-
- 44.. TThhee iinntteerrffaaccee aatt tthhee ccaallll lleevveell
-
-
- This section describes the upcalls a Coda FS driver can make to Venus.
- Each of these upcalls make use of two structures: inputArgs and
- outputArgs. In pseudo BNF form the structures take the following
- form:
-
-
- struct inputArgs {
- u_long opcode;
- u_long unique; /* Keep multiple outstanding msgs distinct */
- u_short pid; /* Common to all */
- u_short pgid; /* Common to all */
- struct CodaCred cred; /* Common to all */
-
- <union "in" of call dependent parts of inputArgs>
- };
-
- struct outputArgs {
- u_long opcode;
- u_long unique; /* Keep multiple outstanding msgs distinct */
- u_long result;
-
- <union "out" of call dependent parts of inputArgs>
- };
-
-
-
- Before going on let us elucidate the role of the various fields. The
- inputArgs start with the opcode which defines the type of service
- requested from Venus. There are approximately 30 upcalls at present
- which we will discuss. The unique field labels the inputArg with a
- unique number which will identify the message uniquely. A process and
- process group id are passed. Finally the credentials of the caller
- are included.
-
- Before delving into the specific calls we need to discuss a variety of
- data structures shared by the kernel and Venus.
-
-
-
-
- 44..11.. DDaattaa ssttrruuccttuurreess sshhaarreedd bbyy tthhee kkeerrnneell aanndd VVeennuuss
-
-
- The CodaCred structure defines a variety of user and group ids as
- they are set for the calling process. The vuid_t and vgid_t are 32 bit
- unsigned integers. It also defines group membership in an array. On
- Unix the CodaCred has proven sufficient to implement good security
- semantics for Coda but the structure may have to undergo modification
- for the Windows environment when these mature.
-
- struct CodaCred {
- vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */
- vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
- vgid_t cr_groups[NGROUPS]; /* Group membership for caller */
- };
-
-
-
- NNOOTTEE It is questionable if we need CodaCreds in Venus. Finally Venus
- doesn't know about groups, although it does create files with the
- default uid/gid. Perhaps the list of group membership is superfluous.
-
-
- The next item is the fundamental identifier used to identify Coda
- files, the ViceFid. A fid of a file uniquely defines a file or
- directory in the Coda filesystem within a _c_e_l_l. (-- A _c_e_l_l is a
- group of Coda servers acting under the aegis of a single system
- control machine or SCM. See the Coda Administration manual for a
- detailed description of the role of the SCM.--)
-
-
- typedef struct ViceFid {
- VolumeId Volume;
- VnodeId Vnode;
- Unique_t Unique;
- } ViceFid;
-
-
-
- Each of the constituent fields: VolumeId, VnodeId and Unique_t are
- unsigned 32 bit integers. We envisage that a further field will need
- to be prefixed to identify the Coda cell; this will probably take the
- form of a Ipv6 size IP address naming the Coda cell through DNS.
-
- The next important structure shared between Venus and the kernel is
- the attributes of the file. The following structure is used to
- exchange information. It has room for future extensions such as
- support for device files (currently not present in Coda).
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- struct coda_timespec {
- int64_t tv_sec; /* seconds */
- long tv_nsec; /* nanoseconds */
- };
-
- struct coda_vattr {
- enum coda_vtype va_type; /* vnode type (for create) */
- u_short va_mode; /* files access mode and type */
- short va_nlink; /* number of references to file */
- vuid_t va_uid; /* owner user id */
- vgid_t va_gid; /* owner group id */
- long va_fsid; /* file system id (dev for now) */
- long va_fileid; /* file id */
- u_quad_t va_size; /* file size in bytes */
- long va_blocksize; /* blocksize preferred for i/o */
- struct coda_timespec va_atime; /* time of last access */
- struct coda_timespec va_mtime; /* time of last modification */
- struct coda_timespec va_ctime; /* time file changed */
- u_long va_gen; /* generation number of file */
- u_long va_flags; /* flags defined for file */
- dev_t va_rdev; /* device special file represents */
- u_quad_t va_bytes; /* bytes of disk space held by file */
- u_quad_t va_filerev; /* file modification number */
- u_int va_vaflags; /* operations flags, see below */
- long va_spare; /* remain quad aligned */
- };
-
-
-
-
- 44..22.. TThhee ppiiooccttll iinntteerrffaaccee
-
-
- Coda specific requests can be made by application through the pioctl
- interface. The pioctl is implemented as an ordinary ioctl on a
- fictitious file /coda/.CONTROL. The pioctl call opens this file, gets
- a file handle and makes the ioctl call. Finally it closes the file.
-
- The kernel involvement in this is limited to providing the facility to
- open and close and pass the ioctl message _a_n_d to verify that a path in
- the pioctl data buffers is a file in a Coda filesystem.
-
- The kernel is handed a data packet of the form:
-
- struct {
- const char *path;
- struct ViceIoctl vidata;
- int follow;
- } data;
-
-
-
- where
-
-
- struct ViceIoctl {
- caddr_t in, out; /* Data to be transferred in, or out */
- short in_size; /* Size of input buffer <= 2K */
- short out_size; /* Maximum size of output buffer, <= 2K */
- };
-
-
-
- The path must be a Coda file, otherwise the ioctl upcall will not be
- made.
-
- NNOOTTEE The data structures and code are a mess. We need to clean this
- up.
-
- We now proceed to document the individual calls:
-
- 0wpage
-
- 44..33.. rroooott
-
-
- AArrgguummeennttss
-
- iinn empty
-
- oouutt
-
- struct cfs_root_out {
- ViceFid VFid;
- } cfs_root;
-
-
-
- DDeessccrriippttiioonn This call is made to Venus during the initialization of
- the Coda filesystem. If the result is zero, the cfs_root structure
- contains the ViceFid of the root of the Coda filesystem. If a non-zero
- result is generated, its value is a platform dependent error code
- indicating the difficulty Venus encountered in locating the root of
- the Coda filesystem.
-
- 0wpage
-
- 44..44.. llooookkuupp
-
-
- SSuummmmaarryy Find the ViceFid and type of an object in a directory if it
- exists.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_lookup_in {
- ViceFid VFid;
- char *name; /* Place holder for data. */
- } cfs_lookup;
-
-
-
- oouutt
-
- struct cfs_lookup_out {
- ViceFid VFid;
- int vtype;
- } cfs_lookup;
-
-
-
- DDeessccrriippttiioonn This call is made to determine the ViceFid and filetype of
- a directory entry. The directory entry requested carries name name
- and Venus will search the directory identified by cfs_lookup_in.VFid.
- The result may indicate that the name does not exist, or that
- difficulty was encountered in finding it (e.g. due to disconnection).
- If the result is zero, the field cfs_lookup_out.VFid contains the
- targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the
- type of object the name designates.
-
- The name of the object is an 8 bit character string of maximum length
- CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.)
-
- It is extremely important to realize that Venus bitwise ors the field
- cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should
- not be put in the kernel name cache.
-
- NNOOTTEE The type of the vtype is currently wrong. It should be
- coda_vtype. Linux does not take note of CFS_NOCACHE. It should.
-
- 0wpage
-
- 44..55.. ggeettaattttrr
-
-
- SSuummmmaarryy Get the attributes of a file.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_getattr_in {
- ViceFid VFid;
- struct coda_vattr attr; /* XXXXX */
- } cfs_getattr;
-
-
-
- oouutt
-
- struct cfs_getattr_out {
- struct coda_vattr attr;
- } cfs_getattr;
-
-
-
- DDeessccrriippttiioonn This call returns the attributes of the file identified by
- fid.
-
- EErrrroorrss Errors can occur if the object with fid does not exist, is
- unaccessible or if the caller does not have permission to fetch
- attributes.
-
- NNoottee Many kernel FS drivers (Linux, NT and Windows 95) need to acquire
- the attributes as well as the Fid for the instantiation of an internal
- "inode" or "FileHandle". A significant improvement in performance on
- such systems could be made by combining the _l_o_o_k_u_p and _g_e_t_a_t_t_r calls
- both at the Venus/kernel interaction level and at the RPC level.
-
- The vattr structure included in the input arguments is superfluous and
- should be removed.
-
- 0wpage
-
- 44..66.. sseettaattttrr
-
-
- SSuummmmaarryy Set the attributes of a file.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_setattr_in {
- ViceFid VFid;
- struct coda_vattr attr;
- } cfs_setattr;
-
-
-
-
- oouutt
- empty
-
- DDeessccrriippttiioonn The structure attr is filled with attributes to be changed
- in BSD style. Attributes not to be changed are set to -1, apart from
- vtype which is set to VNON. Other are set to the value to be assigned.
- The only attributes which the FS driver may request to change are the
- mode, owner, groupid, atime, mtime and ctime. The return value
- indicates success or failure.
-
- EErrrroorrss A variety of errors can occur. The object may not exist, may
- be inaccessible, or permission may not be granted by Venus.
-
- 0wpage
-
- 44..77.. aacccceessss
-
-
- SSuummmmaarryy
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_access_in {
- ViceFid VFid;
- int flags;
- } cfs_access;
-
-
-
- oouutt
- empty
-
- DDeessccrriippttiioonn Verify if access to the object identified by VFid for
- operations described by flags is permitted. The result indicates if
- access will be granted. It is important to remember that Coda uses
- ACLs to enforce protection and that ultimately the servers, not the
- clients enforce the security of the system. The result of this call
- will depend on whether a _t_o_k_e_n is held by the user.
-
- EErrrroorrss The object may not exist, or the ACL describing the protection
- may not be accessible.
-
- 0wpage
-
- 44..88.. ccrreeaattee
-
-
- SSuummmmaarryy Invoked to create a file
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_create_in {
- ViceFid VFid;
- struct coda_vattr attr;
- int excl;
- int mode;
- char *name; /* Place holder for data. */
- } cfs_create;
-
-
-
-
- oouutt
-
- struct cfs_create_out {
- ViceFid VFid;
- struct coda_vattr attr;
- } cfs_create;
-
-
-
- DDeessccrriippttiioonn This upcall is invoked to request creation of a file.
- The file will be created in the directory identified by VFid, its name
- will be name, and the mode will be mode. If excl is set an error will
- be returned if the file already exists. If the size field in attr is
- set to zero the file will be truncated. The uid and gid of the file
- are set by converting the CodaCred to a uid using a macro CRTOUID
- (this macro is platform dependent). Upon success the VFid and
- attributes of the file are returned. The Coda FS Driver will normally
- instantiate a vnode, inode or file handle at kernel level for the new
- object.
-
-
- EErrrroorrss A variety of errors can occur. Permissions may be insufficient.
- If the object exists and is not a file the error EISDIR is returned
- under Unix.
-
- NNOOTTEE The packing of parameters is very inefficient and appears to
- indicate confusion between the system call creat and the VFS operation
- create. The VFS operation create is only called to create new objects.
- This create call differs from the Unix one in that it is not invoked
- to return a file descriptor. The truncate and exclusive options,
- together with the mode, could simply be part of the mode as it is
- under Unix. There should be no flags argument; this is used in open
- (2) to return a file descriptor for READ or WRITE mode.
-
- The attributes of the directory should be returned too, since the size
- and mtime changed.
-
- 0wpage
-
- 44..99.. mmkkddiirr
-
-
- SSuummmmaarryy Create a new directory.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_mkdir_in {
- ViceFid VFid;
- struct coda_vattr attr;
- char *name; /* Place holder for data. */
- } cfs_mkdir;
-
-
-
- oouutt
-
- struct cfs_mkdir_out {
- ViceFid VFid;
- struct coda_vattr attr;
- } cfs_mkdir;
-
-
-
-
- DDeessccrriippttiioonn This call is similar to create but creates a directory.
- Only the mode field in the input parameters is used for creation.
- Upon successful creation, the attr returned contains the attributes of
- the new directory.
-
- EErrrroorrss As for create.
-
- NNOOTTEE The input parameter should be changed to mode instead of
- attributes.
-
- The attributes of the parent should be returned since the size and
- mtime changes.
-
- 0wpage
-
- 44..1100.. lliinnkk
-
-
- SSuummmmaarryy Create a link to an existing file.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_link_in {
- ViceFid sourceFid; /* cnode to link *to* */
- ViceFid destFid; /* Directory in which to place link */
- char *tname; /* Place holder for data. */
- } cfs_link;
-
-
-
- oouutt
- empty
-
- DDeessccrriippttiioonn This call creates a link to the sourceFid in the directory
- identified by destFid with name tname. The source must reside in the
- target's parent, i.e. the source must be have parent destFid, i.e. Coda
- does not support cross directory hard links. Only the return value is
- relevant. It indicates success or the type of failure.
-
- EErrrroorrss The usual errors can occur.0wpage
-
- 44..1111.. ssyymmlliinnkk
-
-
- SSuummmmaarryy create a symbolic link
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_symlink_in {
- ViceFid VFid; /* Directory to put symlink in */
- char *srcname;
- struct coda_vattr attr;
- char *tname;
- } cfs_symlink;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Create a symbolic link. The link is to be placed in the
- directory identified by VFid and named tname. It should point to the
- pathname srcname. The attributes of the newly created object are to
- be set to attr.
-
- EErrrroorrss
-
- NNOOTTEE The attributes of the target directory should be returned since
- its size changed.
-
- 0wpage
-
- 44..1122.. rreemmoovvee
-
-
- SSuummmmaarryy Remove a file
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_remove_in {
- ViceFid VFid;
- char *name; /* Place holder for data. */
- } cfs_remove;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Remove file named cfs_remove_in.name in directory
- identified by VFid.
-
- EErrrroorrss
-
- NNOOTTEE The attributes of the directory should be returned since its
- mtime and size may change.
-
- 0wpage
-
- 44..1133.. rrmmddiirr
-
-
- SSuummmmaarryy Remove a directory
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_rmdir_in {
- ViceFid VFid;
- char *name; /* Place holder for data. */
- } cfs_rmdir;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Remove the directory with name name from the directory
- identified by VFid.
-
- EErrrroorrss
-
- NNOOTTEE The attributes of the parent directory should be returned since
- its mtime and size may change.
-
- 0wpage
-
- 44..1144.. rreeaaddlliinnkk
-
-
- SSuummmmaarryy Read the value of a symbolic link.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_readlink_in {
- ViceFid VFid;
- } cfs_readlink;
-
-
-
- oouutt
-
- struct cfs_readlink_out {
- int count;
- caddr_t data; /* Place holder for data. */
- } cfs_readlink;
-
-
-
- DDeessccrriippttiioonn This routine reads the contents of symbolic link
- identified by VFid into the buffer data. The buffer data must be able
- to hold any name up to CFS_MAXNAMLEN (PATH or NAM??).
-
- EErrrroorrss No unusual errors.
-
- 0wpage
-
- 44..1155.. ooppeenn
-
-
- SSuummmmaarryy Open a file.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_open_in {
- ViceFid VFid;
- int flags;
- } cfs_open;
-
-
-
- oouutt
-
- struct cfs_open_out {
- dev_t dev;
- ino_t inode;
- } cfs_open;
-
-
-
- DDeessccrriippttiioonn This request asks Venus to place the file identified by
- VFid in its cache and to note that the calling process wishes to open
- it with flags as in open(2). The return value to the kernel differs
- for Unix and Windows systems. For Unix systems the Coda FS Driver is
- informed of the device and inode number of the container file in the
- fields dev and inode. For Windows the path of the container file is
- returned to the kernel.
- EErrrroorrss
-
- NNOOTTEE Currently the cfs_open_out structure is not properly adapted to
- deal with the Windows case. It might be best to implement two
- upcalls, one to open aiming at a container file name, the other at a
- container file inode.
-
- 0wpage
-
- 44..1166.. cclloossee
-
-
- SSuummmmaarryy Close a file, update it on the servers.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_close_in {
- ViceFid VFid;
- int flags;
- } cfs_close;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Close the file identified by VFid.
-
- EErrrroorrss
-
- NNOOTTEE The flags argument is bogus and not used. However, Venus' code
- has room to deal with an execp input field, probably this field should
- be used to inform Venus that the file was closed but is still memory
- mapped for execution. There are comments about fetching versus not
- fetching the data in Venus vproc_vfscalls. This seems silly. If a
- file is being closed, the data in the container file is to be the new
- data. Here again the execp flag might be in play to create confusion:
- currently Venus might think a file can be flushed from the cache when
- it is still memory mapped. This needs to be understood.
-
- 0wpage
-
- 44..1177.. iiooccttll
-
-
- SSuummmmaarryy Do an ioctl on a file. This includes the pioctl interface.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_ioctl_in {
- ViceFid VFid;
- int cmd;
- int len;
- int rwflag;
- char *data; /* Place holder for data. */
- } cfs_ioctl;
-
-
-
- oouutt
-
-
- struct cfs_ioctl_out {
- int len;
- caddr_t data; /* Place holder for data. */
- } cfs_ioctl;
-
-
-
- DDeessccrriippttiioonn Do an ioctl operation on a file. The command, len and
- data arguments are filled as usual. flags is not used by Venus.
-
- EErrrroorrss
-
- NNOOTTEE Another bogus parameter. flags is not used. What is the
- business about PREFETCHING in the Venus code?
-
-
- 0wpage
-
- 44..1188.. rreennaammee
-
-
- SSuummmmaarryy Rename a fid.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_rename_in {
- ViceFid sourceFid;
- char *srcname;
- ViceFid destFid;
- char *destname;
- } cfs_rename;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Rename the object with name srcname in directory
- sourceFid to destname in destFid. It is important that the names
- srcname and destname are 0 terminated strings. Strings in Unix
- kernels are not always null terminated.
-
- EErrrroorrss
-
- 0wpage
-
- 44..1199.. rreeaaddddiirr
-
-
- SSuummmmaarryy Read directory entries.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_readdir_in {
- ViceFid VFid;
- int count;
- int offset;
- } cfs_readdir;
-
-
-
-
- oouutt
-
- struct cfs_readdir_out {
- int size;
- caddr_t data; /* Place holder for data. */
- } cfs_readdir;
-
-
-
- DDeessccrriippttiioonn Read directory entries from VFid starting at offset and
- read at most count bytes. Returns the data in data and returns
- the size in size.
-
- EErrrroorrss
-
- NNOOTTEE This call is not used. Readdir operations exploit container
- files. We will re-evaluate this during the directory revamp which is
- about to take place.
-
- 0wpage
-
- 44..2200.. vvggeett
-
-
- SSuummmmaarryy instructs Venus to do an FSDB->Get.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_vget_in {
- ViceFid VFid;
- } cfs_vget;
-
-
-
- oouutt
-
- struct cfs_vget_out {
- ViceFid VFid;
- int vtype;
- } cfs_vget;
-
-
-
- DDeessccrriippttiioonn This upcall asks Venus to do a get operation on an fsobj
- labelled by VFid.
-
- EErrrroorrss
-
- NNOOTTEE This operation is not used. However, it is extremely useful
- since it can be used to deal with read/write memory mapped files.
- These can be "pinned" in the Venus cache using vget and released with
- inactive.
-
- 0wpage
-
- 44..2211.. ffssyynncc
-
-
- SSuummmmaarryy Tell Venus to update the RVM attributes of a file.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_fsync_in {
- ViceFid VFid;
- } cfs_fsync;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn Ask Venus to update RVM attributes of object VFid. This
- should be called as part of kernel level fsync type calls. The
- result indicates if the syncing was successful.
-
- EErrrroorrss
-
- NNOOTTEE Linux does not implement this call. It should.
-
- 0wpage
-
- 44..2222.. iinnaaccttiivvee
-
-
- SSuummmmaarryy Tell Venus a vnode is no longer in use.
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_inactive_in {
- ViceFid VFid;
- } cfs_inactive;
-
-
-
- oouutt
- none
-
- DDeessccrriippttiioonn This operation returns EOPNOTSUPP.
-
- EErrrroorrss
-
- NNOOTTEE This should perhaps be removed.
-
- 0wpage
-
- 44..2233.. rrddwwrr
-
-
- SSuummmmaarryy Read or write from a file
-
- AArrgguummeennttss
-
- iinn
-
- struct cfs_rdwr_in {
- ViceFid VFid;
- int rwflag;
- int count;
- int offset;
- int ioflag;
- caddr_t data; /* Place holder for data. */
- } cfs_rdwr;
-
-
-
-
- oouutt
-
- struct cfs_rdwr_out {
- int rwflag;
- int count;
- caddr_t data; /* Place holder for data. */
- } cfs_rdwr;
-
-
-
- DDeessccrriippttiioonn This upcall asks Venus to read or write from a file.
-
- EErrrroorrss
-
- NNOOTTEE It should be removed since it is against the Coda philosophy that
- read/write operations never reach Venus. I have been told the
- operation does not work. It is not currently used.
-
-
- 0wpage
-
- 44..2244.. ooddyymmoouunntt
-
-
- SSuummmmaarryy Allows mounting multiple Coda "filesystems" on one Unix mount
- point.
-
- AArrgguummeennttss
-
- iinn
-
- struct ody_mount_in {
- char *name; /* Place holder for data. */
- } ody_mount;
-
-
-
- oouutt
-
- struct ody_mount_out {
- ViceFid VFid;
- } ody_mount;
-
-
-
- DDeessccrriippttiioonn Asks Venus to return the rootfid of a Coda system named
- name. The fid is returned in VFid.
-
- EErrrroorrss
-
- NNOOTTEE This call was used by David for dynamic sets. It should be
- removed since it causes a jungle of pointers in the VFS mounting area.
- It is not used by Coda proper. Call is not implemented by Venus.
-
- 0wpage
-
- 44..2255.. ooddyy__llooookkuupp
-
-
- SSuummmmaarryy Looks up something.
-
- AArrgguummeennttss
-
- iinn irrelevant
-
-
- oouutt
- irrelevant
-
- DDeessccrriippttiioonn
-
- EErrrroorrss
-
- NNOOTTEE Gut it. Call is not implemented by Venus.
-
- 0wpage
-
- 44..2266.. ooddyy__eexxppaanndd
-
-
- SSuummmmaarryy expands something in a dynamic set.
-
- AArrgguummeennttss
-
- iinn irrelevant
-
- oouutt
- irrelevant
-
- DDeessccrriippttiioonn
-
- EErrrroorrss
-
- NNOOTTEE Gut it. Call is not implemented by Venus.
-
- 0wpage
-
- 44..2277.. pprreeffeettcchh
-
-
- SSuummmmaarryy Prefetch a dynamic set.
-
- AArrgguummeennttss
-
- iinn Not documented.
-
- oouutt
- Not documented.
-
- DDeessccrriippttiioonn Venus worker.cc has support for this call, although it is
- noted that it doesn't work. Not surprising, since the kernel does not
- have support for it. (ODY_PREFETCH is not a defined operation).
-
- EErrrroorrss
-
- NNOOTTEE Gut it. It isn't working and isn't used by Coda.
-
-
- 0wpage
-
- 44..2288.. ssiiggnnaall
-
-
- SSuummmmaarryy Send Venus a signal about an upcall.
-
- AArrgguummeennttss
-
- iinn none
-
- oouutt
- not applicable.
-
- DDeessccrriippttiioonn This is an out-of-band upcall to Venus to inform Venus
- that the calling process received a signal after Venus read the
- message from the input queue. Venus is supposed to clean up the
- operation.
-
- EErrrroorrss No reply is given.
-
- NNOOTTEE We need to better understand what Venus needs to clean up and if
- it is doing this correctly. Also we need to handle multiple upcall
- per system call situations correctly. It would be important to know
- what state changes in Venus take place after an upcall for which the
- kernel is responsible for notifying Venus to clean up (e.g. open
- definitely is such a state change, but many others are maybe not).
-
- 0wpage
-
- 55.. TThhee mmiinniiccaacchhee aanndd ddoowwnnccaallllss
-
-
- The Coda FS Driver can cache results of lookup and access upcalls, to
- limit the frequency of upcalls. Upcalls carry a price since a process
- context switch needs to take place. The counterpart of caching the
- information is that Venus will notify the FS Driver that cached
- entries must be flushed or renamed.
-
- The kernel code generally has to maintain a structure which links the
- internal file handles (called vnodes in BSD, inodes in Linux and
- FileHandles in Windows) with the ViceFid's which Venus maintains. The
- reason is that frequent translations back and forth are needed in
- order to make upcalls and use the results of upcalls. Such linking
- objects are called ccnnooddeess.
-
- The current minicache implementations have cache entries which record
- the following:
-
- 1. the name of the file
-
- 2. the cnode of the directory containing the object
-
- 3. a list of CodaCred's for which the lookup is permitted.
-
- 4. the cnode of the object
-
- The lookup call in the Coda FS Driver may request the cnode of the
- desired object from the cache, by passing its name, directory and the
- CodaCred's of the caller. The cache will return the cnode or indicate
- that it cannot be found. The Coda FS Driver must be careful to
- invalidate cache entries when it modifies or removes objects.
-
- When Venus obtains information that indicates that cache entries are
- no longer valid, it will make a downcall to the kernel. Downcalls are
- intercepted by the Coda FS Driver and lead to cache invalidations of
- the kind described below. The Coda FS Driver does not return an error
- unless the downcall data could not be read into kernel memory.
-
-
- 55..11.. IINNVVAALLIIDDAATTEE
-
-
- No information is available on this call.
-
-
- 55..22.. FFLLUUSSHH
-
-
-
- AArrgguummeennttss None
-
- SSuummmmaarryy Flush the name cache entirely.
-
- DDeessccrriippttiioonn Venus issues this call upon startup and when it dies. This
- is to prevent stale cache information being held. Some operating
- systems allow the kernel name cache to be switched off dynamically.
- When this is done, this downcall is made.
-
-
- 55..33.. PPUURRGGEEUUSSEERR
-
-
- AArrgguummeennttss
-
- struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */
- struct CodaCred cred;
- } cfs_purgeuser;
-
-
-
- DDeessccrriippttiioonn Remove all entries in the cache carrying the Cred. This
- call is issued when tokens for a user expire or are flushed.
-
-
- 55..44.. ZZAAPPFFIILLEE
-
-
- AArrgguummeennttss
-
- struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */
- ViceFid CodaFid;
- } cfs_zapfile;
-
-
-
- DDeessccrriippttiioonn Remove all entries which have the (dir vnode, name) pair.
- This is issued as a result of an invalidation of cached attributes of
- a vnode.
-
- NNOOTTEE Call is not named correctly in NetBSD and Mach. The minicache
- zapfile routine takes different arguments. Linux does not implement
- the invalidation of attributes correctly.
-
-
-
- 55..55.. ZZAAPPDDIIRR
-
-
- AArrgguummeennttss
-
- struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */
- ViceFid CodaFid;
- } cfs_zapdir;
-
-
-
- DDeessccrriippttiioonn Remove all entries in the cache lying in a directory
- CodaFid, and all children of this directory. This call is issued when
- Venus receives a callback on the directory.
-
-
- 55..66.. ZZAAPPVVNNOODDEE
-
-
-
- AArrgguummeennttss
-
- struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */
- struct CodaCred cred;
- ViceFid VFid;
- } cfs_zapvnode;
-
-
-
- DDeessccrriippttiioonn Remove all entries in the cache carrying the cred and VFid
- as in the arguments. This downcall is probably never issued.
-
-
- 55..77.. PPUURRGGEEFFIIDD
-
-
- SSuummmmaarryy
-
- AArrgguummeennttss
-
- struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */
- ViceFid CodaFid;
- } cfs_purgefid;
-
-
-
- DDeessccrriippttiioonn Flush the attribute for the file. If it is a dir (odd
- vnode), purge its children from the namecache and remove the file from the
- namecache.
-
-
-
- 55..88.. RREEPPLLAACCEE
-
-
- SSuummmmaarryy Replace the Fid's for a collection of names.
-
- AArrgguummeennttss
-
- struct cfs_replace_out { /* cfs_replace is a venus->kernel call */
- ViceFid NewFid;
- ViceFid OldFid;
- } cfs_replace;
-
-
-
- DDeessccrriippttiioonn This routine replaces a ViceFid in the name cache with
- another. It is added to allow Venus during reintegration to replace
- locally allocated temp fids while disconnected with global fids even
- when the reference counts on those fids are not zero.
-
- 0wpage
-
- 66.. IInniittiiaalliizzaattiioonn aanndd cclleeaannuupp
-
-
- This section gives brief hints as to desirable features for the Coda
- FS Driver at startup and upon shutdown or Venus failures. Before
- entering the discussion it is useful to repeat that the Coda FS Driver
- maintains the following data:
-
-
- 1. message queues
-
- 2. cnodes
-
- 3. name cache entries
-
- The name cache entries are entirely private to the driver, so they
- can easily be manipulated. The message queues will generally have
- clear points of initialization and destruction. The cnodes are
- much more delicate. User processes hold reference counts in Coda
- filesystems and it can be difficult to clean up the cnodes.
-
- It can expect requests through:
-
- 1. the message subsystem
-
- 2. the VFS layer
-
- 3. pioctl interface
-
- Currently the _p_i_o_c_t_l passes through the VFS for Coda so we can
- treat these similarly.
-
-
- 66..11.. RReeqquuiirreemmeennttss
-
-
- The following requirements should be accommodated:
-
- 1. The message queues should have open and close routines. On Unix
- the opening of the character devices are such routines.
-
- +o Before opening, no messages can be placed.
-
- +o Opening will remove any old messages still pending.
-
- +o Close will notify any sleeping processes that their upcall cannot
- be completed.
-
- +o Close will free all memory allocated by the message queues.
-
-
- 2. At open the namecache shall be initialized to empty state.
-
- 3. Before the message queues are open, all VFS operations will fail.
- Fortunately this can be achieved by making sure than mounting the
- Coda filesystem cannot succeed before opening.
-
- 4. After closing of the queues, no VFS operations can succeed. Here
- one needs to be careful, since a few operations (lookup,
- read/write, readdir) can proceed without upcalls. These must be
- explicitly blocked.
-
- 5. Upon closing the namecache shall be flushed and disabled.
-
- 6. All memory held by cnodes can be freed without relying on upcalls.
-
- 7. Unmounting the file system can be done without relying on upcalls.
-
- 8. Mounting the Coda filesystem should fail gracefully if Venus cannot
- get the rootfid or the attributes of the rootfid. The latter is
- best implemented by Venus fetching these objects before attempting
- to mount.
-
- NNOOTTEE NetBSD in particular but also Linux have not implemented the
- above requirements fully. For smooth operation this needs to be
- corrected.
-
-
-
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs.rst
similarity index 86%
rename from Documentation/filesystems/configfs/configfs.txt
rename to Documentation/filesystems/configfs.rst
index 16e606c..1d3d6f4 100644
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs.rst
@@ -1,5 +1,6 @@
-
-configfs - Userspace-driven kernel object configuration.
+=======================================================
+Configfs - Userspace-driven Kernel Object Configuration
+=======================================================
Joel Becker <joel.becker@oracle.com>
@@ -9,7 +10,8 @@
Joel Becker <joel.becker@oracle.com>
-[What is configfs?]
+What is configfs?
+=================
configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
@@ -35,10 +37,11 @@
Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other.
-[Using configfs]
+Using configfs
+==============
configfs can be compiled as a module or into the kernel. You can access
-it by doing
+it by doing::
mount -t configfs none /config
@@ -56,28 +59,29 @@
There are two types of configfs attributes:
* Normal attributes, which similar to sysfs attributes, are small ASCII text
-files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
-only one value per file should be used, and the same caveats from sysfs apply.
-Configfs expects write(2) to store the entire buffer at once. When writing to
-normal configfs attributes, userspace processes should first read the entire
-file, modify the portions they wish to change, and then write the entire
-buffer back.
+ files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
+ only one value per file should be used, and the same caveats from sysfs apply.
+ Configfs expects write(2) to store the entire buffer at once. When writing to
+ normal configfs attributes, userspace processes should first read the entire
+ file, modify the portions they wish to change, and then write the entire
+ buffer back.
* Binary attributes, which are somewhat similar to sysfs binary attributes,
-but with a few slight changes to semantics. The PAGE_SIZE limitation does not
-apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
-The write(2) calls from user space are buffered, and the attributes'
-write_bin_attribute method will be invoked on the final close, therefore it is
-imperative for user-space to check the return code of close(2) in order to
-verify that the operation finished successfully.
-To avoid a malicious user OOMing the kernel, there's a per-binary attribute
-maximum buffer value.
+ but with a few slight changes to semantics. The PAGE_SIZE limitation does not
+ apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
+ The write(2) calls from user space are buffered, and the attributes'
+ write_bin_attribute method will be invoked on the final close, therefore it is
+ imperative for user-space to check the return code of close(2) in order to
+ verify that the operation finished successfully.
+ To avoid a malicious user OOMing the kernel, there's a per-binary attribute
+ maximum buffer value.
When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2).
-[Configuring FakeNBD: an Example]
+Configuring FakeNBD: an Example
+===============================
Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
@@ -86,14 +90,14 @@
the driver about it. Here's where configfs comes in.
When the FakeNBD driver is loaded, it registers itself with configfs.
-readdir(3) sees this just fine:
+readdir(3) sees this just fine::
# ls /config
fakenbd
A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps
-it is a uuid or a disk name:
+it is a uuid or a disk name::
# mkdir /config/fakenbd/disk1
# ls /config/fakenbd/disk1
@@ -102,7 +106,7 @@
The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is
-read-only or read-write.
+read-only or read-write::
# echo 10.0.0.1 > /config/fakenbd/disk1/target
# echo /dev/sda1 > /config/fakenbd/disk1/device
@@ -111,7 +115,8 @@
That's it. That's all there is. Now the device is configured, via the
shell no less.
-[Coding With configfs]
+Coding With configfs
+====================
Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that
@@ -130,7 +135,10 @@
subsystem is also a config_group, and can do everything a config_group
can.
-[struct config_item]
+struct config_item
+==================
+
+::
struct config_item {
char *ci_name;
@@ -168,7 +176,10 @@
Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type.
-[struct config_item_type]
+struct config_item_type
+=======================
+
+::
struct configfs_item_operations {
void (*release)(struct config_item *);
@@ -192,7 +203,10 @@
method. This method is called when the config_item's reference count
reaches zero.
-[struct configfs_attribute]
+struct configfs_attribute
+=========================
+
+::
struct configfs_attribute {
char *ca_name;
@@ -212,9 +226,12 @@
If an attribute is readable and provides a ->show method, that method will
be called whenever userspace asks for a read(2) on the attribute. If an
attribute is writable and provides a ->store method, that method will be
-be called whenever userspace asks for a write(2) on the attribute.
+called whenever userspace asks for a write(2) on the attribute.
-[struct configfs_bin_attribute]
+struct configfs_bin_attribute
+=============================
+
+::
struct configfs_bin_attribute {
struct configfs_attribute cb_attr;
@@ -240,11 +257,12 @@
single read/write will occur; the attributes' need not concern itself
with it.
-[struct config_group]
+struct config_group
+===================
A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a
-child item.
+child item::
struct config_group {
struct config_item cg_item;
@@ -264,7 +282,7 @@
that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's
-config_item_type.
+config_item_type::
struct configfs_group_operations {
struct config_item *(*make_item)(struct config_group *group,
@@ -279,7 +297,8 @@
};
A group creates child items by providing the
-ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
+ct_group_ops->make_item() method. If provided, this method is called from
+mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item.
@@ -296,13 +315,14 @@
the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem.
-IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
-is called, configfs WILL remove the item from the filesystem tree
-(assuming that it has no children to keep it busy). The subsystem is
-responsible for responding to this. If the subsystem has references to
-the item in other threads, the memory is safe. It may take some time
-for the item to actually disappear from the subsystem's usage. But it
-is gone from configfs.
+Important:
+ drop_item() is void, and as such cannot fail. When rmdir(2)
+ is called, configfs WILL remove the item from the filesystem tree
+ (assuming that it has no children to keep it busy). The subsystem is
+ responsible for responding to this. If the subsystem has references to
+ the item in other threads, the memory is safe. It may take some time
+ for the item to actually disappear from the subsystem's usage. But it
+ is gone from configfs.
When drop_item() is called, the item's linkage has already been torn
down. It no longer has a reference on its parent and has no place in
@@ -319,10 +339,11 @@
called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty.
-[struct configfs_subsystem]
+struct configfs_subsystem
+=========================
A subsystem must register itself, usually at module_init time. This
-tells configfs to make the subsystem appear in the file tree.
+tells configfs to make the subsystem appear in the file tree::
struct configfs_subsystem {
struct config_group su_group;
@@ -332,17 +353,19 @@
int configfs_register_subsystem(struct configfs_subsystem *subsys);
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
- A subsystem consists of a toplevel config_group and a mutex.
+A subsystem consists of a toplevel config_group and a mutex.
The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have
initialized the mutex.
- When the register call returns, the subsystem is live, and it
+
+When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it.
-[An Example]
+An Example
+==========
The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in
@@ -350,7 +373,8 @@
and storing an attribute, and a simple group creating and destroying
these children.
-[Hierarchy Navigation and the Subsystem Mutex]
+Hierarchy Navigation and the Subsystem Mutex
+============================================
There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they
@@ -375,7 +399,8 @@
a subsystem to trust ci_parent and cg_children while they hold the
mutex.
-[Item Aggregation Via symlink(2)]
+Item Aggregation Via symlink(2)
+===============================
configfs provides a simple group via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation
@@ -403,7 +428,8 @@
can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs.
-[Automatically Created Subgroups]
+Automatically Created Subgroups
+===============================
A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much
@@ -433,7 +459,8 @@
rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children.
-[Dependent Subsystems]
+Dependent Subsystems
+====================
Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that
@@ -460,9 +487,11 @@
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
-[Committable Items]
+Committable Items
+=================
-NOTE: Committable items are currently unimplemented.
+Note:
+ Committable items are currently unimplemented.
Some config_items cannot have a valid initial state. That is, no
default values can be specified for the item's attributes such that the
@@ -504,5 +533,3 @@
shutdown, or "uncommitted". Again, this is done via rename(2), this
time from the "live" directory back to the "pending" one. The subsystem
is notified by the ct_group_ops->uncommit_object() method.
-
-
diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.rst
similarity index 88%
rename from Documentation/filesystems/cramfs.txt
rename to Documentation/filesystems/cramfs.rst
index 8e19a53..afbdbde 100644
--- a/Documentation/filesystems/cramfs.txt
+++ b/Documentation/filesystems/cramfs.rst
@@ -1,12 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
- Cramfs - cram a filesystem onto a small ROM
+===========================================
+Cramfs - cram a filesystem onto a small ROM
+===========================================
-cramfs is designed to be simple and small, and to compress things well.
+cramfs is designed to be simple and small, and to compress things well.
It uses the zlib routines to compress a file one page at a time, and
allows random page access. The meta-data is not compressed, but is
expressed in a very terse representation to make it use much less
-diskspace than traditional filesystems.
+diskspace than traditional filesystems.
You can't write to a cramfs filesystem (making it compressible and
compact also makes it _very_ hard to update on-the-fly), so you have to
@@ -28,9 +31,9 @@
Hard links are supported, but hard linked files
will still have a link count of 1 in the cramfs image.
-Cramfs directories have no `.' or `..' entries. Directories (like
+Cramfs directories have no ``.`` or ``..`` entries. Directories (like
every other file on cramfs) always have a link count of 1. (There's
-no need to use -noleaf in `find', btw.)
+no need to use -noleaf in ``find``, btw.)
No timestamps are stored in a cramfs, so these default to the epoch
(1970 GMT). Recently-accessed files may have updated timestamps, but
@@ -70,9 +73,9 @@
(Flash device in physical memory map). MTD partitions based on such devices
are fine too. Then that device should be specified with the "mtd:" prefix
as the mount device argument. For example, to mount the MTD device named
-"fs_partition" on the /mnt directory:
+"fs_partition" on the /mnt directory::
-$ mount -t cramfs mtd:fs_partition /mnt
+ $ mount -t cramfs mtd:fs_partition /mnt
To boot a kernel with this as root filesystem, suffice to specify
something like "root=mtd:fs_partition" on the kernel command line.
@@ -90,6 +93,7 @@
For /usr/share/magic
--------------------
+===== ======================= =======================
0 ulelong 0x28cd3d45 Linux cramfs offset 0
>4 ulelong x size %d
>8 ulelong x flags 0x%x
@@ -110,6 +114,7 @@
>552 ulelong x fsid.blocks %d
>556 ulelong x fsid.files %d
>560 string >\0 name "%.16s"
+===== ======================= =======================
Hacker Notes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 6797294..8fdb78f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,144 @@
If you have a block device which supports DAX, you can make a filesystem
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
+is different.
+
+Enabling DAX on ext2
+-----------------------------
+
+When mounting the filesystem, use the "-o dax" option on the command line or
+add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
+within the filesystem. It is equivalent to the '-o dax=always' behavior below.
+
+
+Enabling DAX on xfs and ext4
+----------------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
+ the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
+ about this access mode.
+
+ 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
+ files and directories. This advisory flag can be set or cleared at any
+ time, but doing so does not immediately affect the S_DAX state.
+
+ 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
+ be inherited by all regular files and subdirectories that are subsequently
+ created in this directory. Files and subdirectories that exist at the time
+ this flag is set or cleared on the parent directory are not modified by
+ this modification of the parent directory.
+
+ 4. There exist dax mount options which can override FS_XFLAG_DAX in the
+ setting of the S_DAX flag. Given underlying storage which supports DAX the
+ following hold:
+
+ "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
+
+ "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
+
+ "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
+
+ "-o dax" is a legacy option which is an alias for "dax=always".
+ This may be removed in the future so "-o dax=always" is
+ the preferred method for specifying this behavior.
+
+ NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
+ the same even when the filesystem is mounted with a dax option. However,
+ in-core inode state (S_DAX) will be overridden until the filesystem is
+ remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The S_DAX policy can be changed via:
+
+ a) Setting the parent directory FS_XFLAG_DAX as needed before files are
+ created
+
+ b) Setting the appropriate dax="foo" mount option
+
+ c) Changing the FS_XFLAG_DAX flag on existing regular files and
+ directories. This has runtime constraints and limitations that are
+ described in 6) below.
+
+ 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
+ the change in behaviour for existing regular files may not occur
+ immediately. If the change must take effect immediately, the administrator
+ needs to:
+
+ a) stop the application so there are no active references to the data set
+ the policy change will affect
+
+ b) evict the data set from kernel caches so it will be re-instantiated when
+ the application is restarted. This can be achieved by:
+
+ i. drop-caches
+ ii. a filesystem unmount and mount cycle
+ iii. a system reboot
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
+and the other is a volatile flag indicating the active state of the feature
+(S_DAX).
+
+FS_XFLAG_DAX is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
+(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
+
+New files and directories automatically inherit FS_XFLAG_DAX from
+their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io -c 'chattr +x' a
+mkdir -p a/b/c/d
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' c
+mkdir a/b/c/d
+
+ dax: c,d
+ no dax: a,b
+
+
+The current enabled state (S_DAX) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of FS_XFLAG_DAX and the filesystem's dax mount option.
+
+statx can be used to query S_DAX. NOTE that only regular files will ever have
+S_DAX set and therefore statx will never indicate that S_DAX is set on
+directories.
+
+Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+
Implementation Tips for Block Driver Writers
@@ -74,7 +210,7 @@
exposure of uninitialized data through mmap.
These filesystems may be used for inspiration:
-- ext2: see Documentation/filesystems/ext2.txt
+- ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/
- xfs: see Documentation/admin-guide/xfs.rst
@@ -94,7 +230,7 @@
redundancy in the following ways:
1. Delete the affected file, and restore from a backup (sysadmin route):
- This will free the file system blocks that were being used by the file,
+ This will free the filesystem blocks that were being used by the file,
and the next time they're allocated, they will be zeroed first, which
happens through the driver, and will clear bad sectors.
diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.rst
similarity index 70%
rename from Documentation/filesystems/debugfs.txt
rename to Documentation/filesystems/debugfs.rst
index 9e27c84..0f2292e 100644
--- a/Documentation/filesystems/debugfs.txt
+++ b/Documentation/filesystems/debugfs.rst
@@ -1,4 +1,11 @@
-Copyright 2009 Jonathan Corbet <corbet@lwn.net>
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=======
+DebugFS
+=======
+
+Copyright |copy| 2009 Jonathan Corbet <corbet@lwn.net>
Debugfs exists as a simple way for kernel developers to make information
available to user space. Unlike /proc, which is only meant for information
@@ -6,11 +13,11 @@
debugfs has no rules at all. Developers can put any information they want
there. The debugfs filesystem is also intended to not serve as a stable
ABI to user space; in theory, there are no stability constraints placed on
-files exported there. The real world is not always so simple, though [1];
+files exported there. The real world is not always so simple, though [1]_;
even debugfs interfaces are best designed with the idea that they will need
to be maintained forever.
-Debugfs is typically mounted with a command like:
+Debugfs is typically mounted with a command like::
mount -t debugfs none /sys/kernel/debug
@@ -23,7 +30,7 @@
Code using debugfs should include <linux/debugfs.h>. Then, the first order
of business will be to create at least one directory to hold a set of
-debugfs files:
+debugfs files::
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
@@ -36,7 +43,7 @@
indication that the kernel has been built without debugfs support and none
of the functions described below will work.
-The most general way to create a file within a debugfs directory is with:
+The most general way to create a file within a debugfs directory is with::
struct dentry *debugfs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
@@ -53,12 +60,12 @@
missing.
Create a file with an initial size, the following function can be used
-instead:
+instead::
- struct dentry *debugfs_create_file_size(const char *name, umode_t mode,
- struct dentry *parent, void *data,
- const struct file_operations *fops,
- loff_t file_size);
+ void debugfs_create_file_size(const char *name, umode_t mode,
+ struct dentry *parent, void *data,
+ const struct file_operations *fops,
+ loff_t file_size);
file_size is the initial file size. The other parameters are the same
as the function debugfs_create_file.
@@ -66,44 +73,52 @@
In a number of cases, the creation of a set of file operations is not
actually necessary; the debugfs code provides a number of helper functions
for simple situations. Files containing a single integer value can be
-created with any of:
+created with any of::
- struct dentry *debugfs_create_u8(const char *name, umode_t mode,
- struct dentry *parent, u8 *value);
- struct dentry *debugfs_create_u16(const char *name, umode_t mode,
- struct dentry *parent, u16 *value);
- struct dentry *debugfs_create_u32(const char *name, umode_t mode,
- struct dentry *parent, u32 *value);
- struct dentry *debugfs_create_u64(const char *name, umode_t mode,
- struct dentry *parent, u64 *value);
+ void debugfs_create_u8(const char *name, umode_t mode,
+ struct dentry *parent, u8 *value);
+ void debugfs_create_u16(const char *name, umode_t mode,
+ struct dentry *parent, u16 *value);
+ void debugfs_create_u32(const char *name, umode_t mode,
+ struct dentry *parent, u32 *value);
+ void debugfs_create_u64(const char *name, umode_t mode,
+ struct dentry *parent, u64 *value);
These files support both reading and writing the given value; if a specific
file should not be written to, simply set the mode bits accordingly. The
values in these files are in decimal; if hexadecimal is more appropriate,
-the following functions can be used instead:
+the following functions can be used instead::
- struct dentry *debugfs_create_x8(const char *name, umode_t mode,
- struct dentry *parent, u8 *value);
- struct dentry *debugfs_create_x16(const char *name, umode_t mode,
- struct dentry *parent, u16 *value);
- struct dentry *debugfs_create_x32(const char *name, umode_t mode,
- struct dentry *parent, u32 *value);
- struct dentry *debugfs_create_x64(const char *name, umode_t mode,
- struct dentry *parent, u64 *value);
+ void debugfs_create_x8(const char *name, umode_t mode,
+ struct dentry *parent, u8 *value);
+ void debugfs_create_x16(const char *name, umode_t mode,
+ struct dentry *parent, u16 *value);
+ void debugfs_create_x32(const char *name, umode_t mode,
+ struct dentry *parent, u32 *value);
+ void debugfs_create_x64(const char *name, umode_t mode,
+ struct dentry *parent, u64 *value);
These functions are useful as long as the developer knows the size of the
value to be exported. Some types can have different widths on different
-architectures, though, complicating the situation somewhat. There is a
-function meant to help out in one special case:
+architectures, though, complicating the situation somewhat. There are
+functions meant to help out in such special cases::
- struct dentry *debugfs_create_size_t(const char *name, umode_t mode,
- struct dentry *parent,
- size_t *value);
+ void debugfs_create_size_t(const char *name, umode_t mode,
+ struct dentry *parent, size_t *value);
As might be expected, this function will create a debugfs file to represent
a variable of type size_t.
-Boolean values can be placed in debugfs with:
+Similarly, there are helpers for variables of type unsigned long, in decimal
+and hexadecimal::
+
+ struct dentry *debugfs_create_ulong(const char *name, umode_t mode,
+ struct dentry *parent,
+ unsigned long *value);
+ void debugfs_create_xul(const char *name, umode_t mode,
+ struct dentry *parent, unsigned long *value);
+
+Boolean values can be placed in debugfs with::
struct dentry *debugfs_create_bool(const char *name, umode_t mode,
struct dentry *parent, bool *value);
@@ -112,16 +127,16 @@
N, followed by a newline. If written to, it will accept either upper- or
lower-case values, or 1 or 0. Any other input will be silently ignored.
-Also, atomic_t values can be placed in debugfs with:
+Also, atomic_t values can be placed in debugfs with::
- struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode,
- struct dentry *parent, atomic_t *value)
+ void debugfs_create_atomic_t(const char *name, umode_t mode,
+ struct dentry *parent, atomic_t *value)
A read of this file will get atomic_t values, and a write of this file
will set atomic_t values.
Another option is exporting a block of arbitrary binary data, with
-this structure and function:
+this structure and function::
struct debugfs_blob_wrapper {
void *data;
@@ -143,7 +158,7 @@
often during development, even if little such code reaches mainline.
Debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential
-file.
+file::
struct debugfs_reg32 {
char *name;
@@ -151,35 +166,40 @@
};
struct debugfs_regset32 {
- struct debugfs_reg32 *regs;
+ const struct debugfs_reg32 *regs;
int nregs;
void __iomem *base;
+ struct device *dev; /* Optional device for Runtime PM */
};
- struct dentry *debugfs_create_regset32(const char *name, umode_t mode,
- struct dentry *parent,
- struct debugfs_regset32 *regset);
+ debugfs_create_regset32(const char *name, umode_t mode,
+ struct dentry *parent,
+ struct debugfs_regset32 *regset);
- void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs,
+ void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs,
int nregs, void __iomem *base, char *prefix);
The "base" argument may be 0, but you may want to build the reg32 array
using __stringify, and a number of register names (macros) are actually
byte offsets over a base for the register block.
-If you want to dump an u32 array in debugfs, you can create file with:
+If you want to dump an u32 array in debugfs, you can create file with::
+
+ struct debugfs_u32_array {
+ u32 *array;
+ u32 n_elements;
+ };
void debugfs_create_u32_array(const char *name, umode_t mode,
struct dentry *parent,
- u32 *array, u32 elements);
+ struct debugfs_u32_array *array);
-The "array" argument provides data, and the "elements" argument is
-the number of elements in the array. Note: Once array is created its
-size can not be changed.
+The "array" argument wraps a pointer to the array's data and the number
+of its elements. Note: Once array is created its size can not be changed.
-There is a helper function to create device related seq_file:
+There is a helper function to create device related seq_file::
- struct dentry *debugfs_create_devm_seqfile(struct device *dev,
+ void debugfs_create_devm_seqfile(struct device *dev,
const char *name,
struct dentry *parent,
int (*read_fn)(struct seq_file *s,
@@ -189,14 +209,14 @@
the "read_fn" is a function pointer which to be called to print the
seq_file content.
-There are a couple of other directory-oriented helper functions:
+There are a couple of other directory-oriented helper functions::
- struct dentry *debugfs_rename(struct dentry *old_dir,
+ struct dentry *debugfs_rename(struct dentry *old_dir,
struct dentry *old_dentry,
- struct dentry *new_dir,
+ struct dentry *new_dir,
const char *new_name);
- struct dentry *debugfs_create_symlink(const char *name,
+ struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
@@ -211,7 +231,7 @@
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
-can be removed with:
+can be removed with::
void debugfs_remove(struct dentry *dentry);
@@ -221,7 +241,7 @@
Once upon a time, debugfs users were required to remember the dentry
pointer for every debugfs file they created so that all files could be
cleaned up. We live in more civilized times now, though, and debugfs users
-can call:
+can call::
void debugfs_remove_recursive(struct dentry *dentry);
@@ -229,5 +249,4 @@
top-level directory, the entire hierarchy below that directory will be
removed.
-Notes:
- [1] http://lwn.net/Articles/309298/
+.. [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/devpts.rst b/Documentation/filesystems/devpts.rst
new file mode 100644
index 0000000..a03248d
--- /dev/null
+++ b/Documentation/filesystems/devpts.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+The Devpts Filesystem
+=====================
+
+Each mount of the devpts filesystem is now distinct such that ptys
+and their indicies allocated in one mount are independent from ptys
+and their indicies in all other mounts.
+
+All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
+with permissions ``0000``.
+
+To retain backwards compatibility the a ptmx device node (aka any node
+created with ``mknod name c 5 2``) when opened will look for an instance
+of devpts under the name ``pts`` in the same directory as the ptmx device
+node.
+
+As an option instead of placing a ``/dev/ptmx`` device node at ``/dev/ptmx``
+it is possible to place a symlink to ``/dev/pts/ptmx`` at ``/dev/ptmx`` or
+to bind mount ``/dev/ptx/ptmx`` to ``/dev/ptmx``. If you opt for using
+the devpts filesystem in this manner devpts should be mounted with
+the ``ptmxmode=0666``, or ``chmod 0666 /dev/pts/ptmx`` should be called.
+
+Total count of pty pairs in all instances is limited by sysctls::
+
+ kernel.pty.max = 4096 - global limit
+ kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
+ kernel.pty.nr - current count of ptys
+
+Per-instance limit could be set by adding mount option ``max=<count>``.
+
+This feature was added in kernel 3.4 together with
+``sysctl kernel.pty.reserve``.
+
+In kernels older than 3.4 sysctl ``kernel.pty.max`` works as per-instance limit.
diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt
deleted file mode 100644
index 9f94fe2..0000000
--- a/Documentation/filesystems/devpts.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Each mount of the devpts filesystem is now distinct such that ptys
-and their indicies allocated in one mount are independent from ptys
-and their indicies in all other mounts.
-
-All mounts of the devpts filesystem now create a /dev/pts/ptmx node
-with permissions 0000.
-
-To retain backwards compatibility the a ptmx device node (aka any node
-created with "mknod name c 5 2") when opened will look for an instance
-of devpts under the name "pts" in the same directory as the ptmx device
-node.
-
-As an option instead of placing a /dev/ptmx device node at /dev/ptmx
-it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or
-to bind mount /dev/ptx/ptmx to /dev/ptmx. If you opt for using
-the devpts filesystem in this manner devpts should be mounted with
-the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.
-
-Total count of pty pairs in all instances is limited by sysctls:
-kernel.pty.max = 4096 - global limit
-kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
-kernel.pty.nr - current count of ptys
-
-Per-instance limit could be set by adding mount option "max=<count>".
-This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve.
-In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index de12016..504ba94 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -28,7 +28,7 @@
if the target already exists, lock it. If the source is a non-directory,
lock it. If we need to lock both, lock them in inode pointer order.
Then call the method. All locks are exclusive.
-NB: we might get away with locking the the source (and target in exchange
+NB: we might get away with locking the source (and target in exchange
case) shared.
5) link creation. Locking rules:
@@ -56,7 +56,7 @@
* call the method.
All ->i_rwsem are taken exclusive. Again, we might get away with locking
-the the source (and target in exchange case) shared.
+the source (and target in exchange case) shared.
The rules above obviously guarantee that all directories that are going to be
read, modified or removed by method will be locked by caller.
diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.rst
similarity index 86%
rename from Documentation/filesystems/dlmfs.txt
rename to Documentation/filesystems/dlmfs.rst
index fcf4d50..28dd41a 100644
--- a/Documentation/filesystems/dlmfs.txt
+++ b/Documentation/filesystems/dlmfs.rst
@@ -1,20 +1,25 @@
-dlmfs
-==================
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=====
+DLMFS
+=====
+
A minimal DLM userspace interface implemented via a virtual file
system.
dlmfs is built with OCFS2 as it requires most of its infrastructure.
-Project web page: http://ocfs2.wiki.kernel.org
-Tools web page: https://github.com/markfasheh/ocfs2-tools
-OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+:Project web page: http://ocfs2.wiki.kernel.org
+:Tools web page: https://github.com/markfasheh/ocfs2-tools
+:OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
-CREDITS
+Credits
=======
-Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds
+Some code taken from ramfs which is Copyright |copy| 2000 Linus Torvalds
and Transmeta Corp.
Mark Fasheh <mark.fasheh@oracle.com>
@@ -96,14 +101,19 @@
open(2) with O_CREAT to ensure the resource inode is created - dlmfs does
not automatically create inodes for existing lock resources.
+============ ===========================
Open Flag Lock Request Type
---------- -----------------
+============ ===========================
O_RDONLY Shared Read
O_RDWR Exclusive
+============ ===========================
+
+============ ===========================
Open Flag Resulting Locking Behavior
---------- --------------------------
+============ ===========================
O_NONBLOCK Trylock operation
+============ ===========================
You must provide exactly one of O_RDONLY or O_RDWR.
diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.rst
similarity index 88%
rename from Documentation/filesystems/dnotify.txt
rename to Documentation/filesystems/dnotify.rst
index 1515688..a28a1f9 100644
--- a/Documentation/filesystems/dnotify.txt
+++ b/Documentation/filesystems/dnotify.rst
@@ -1,5 +1,8 @@
- Linux Directory Notification
- ============================
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Linux Directory Notification
+============================
Stephen Rothwell <sfr@canb.auug.org.au>
@@ -12,6 +15,7 @@
The application decides which "events" it wants to be notified about.
The currently defined events are:
+ ========= =====================================================
DN_ACCESS A file in the directory was accessed (read)
DN_MODIFY A file in the directory was modified (write,truncate)
DN_CREATE A file was created in the directory
@@ -19,6 +23,7 @@
DN_RENAME A file in the directory was renamed
DN_ATTRIB A file in the directory had its attributes
changed (chmod,chown)
+ ========= =====================================================
Usually, the application must reregister after each notification, but
if DN_MULTISHOT is or'ed with the event mask, then the registration will
@@ -36,7 +41,7 @@
is often blocked, so it is better to use (at least) SIGRTMIN + 1.
Implementation expectations (features and bugs :-))
----------------------------
+---------------------------------------------------
The notification should work for any local access to files even if the
actual file system is on a remote server. This implies that remote
@@ -67,4 +72,4 @@
NOTE
----
Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
-See Documentation/filesystems/inotify.txt for more information on it.
+See Documentation/filesystems/inotify.rst for more information on it.
diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.rst
similarity index 62%
rename from Documentation/filesystems/ecryptfs.txt
rename to Documentation/filesystems/ecryptfs.rst
index 01d8a08..1f2edef 100644
--- a/Documentation/filesystems/ecryptfs.txt
+++ b/Documentation/filesystems/ecryptfs.rst
@@ -1,14 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
eCryptfs: A stacked cryptographic filesystem for Linux
+======================================================
eCryptfs is free software. Please see the file COPYING for details.
For documentation, please see the files in the doc/ subdirectory. For
building and installation instructions please see the INSTALL file.
-Maintainer: Phillip Hellewell
-Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
-Developers: Michael C. Thompson
- Kent Yoder
-Web Site: http://ecryptfs.sf.net
+:Maintainer: Phillip Hellewell
+:Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com>
+:Developers: Michael C. Thompson
+ Kent Yoder
+:Web Site: http://ecryptfs.sf.net
This software is currently undergoing development. Make sure to
maintain a backup copy of any data you write into eCryptfs.
@@ -19,34 +23,36 @@
http://sourceforge.net/projects/ecryptfs/
Userspace requirements include:
- - David Howells' userspace keyring headers and libraries (version
- 1.0 or higher), obtainable from
- http://people.redhat.com/~dhowells/keyutils/
- - Libgcrypt
+
+- David Howells' userspace keyring headers and libraries (version
+ 1.0 or higher), obtainable from
+ http://people.redhat.com/~dhowells/keyutils/
+- Libgcrypt
-NOTES
+.. note::
-In the beta/experimental releases of eCryptfs, when you upgrade
-eCryptfs, you should copy the files to an unencrypted location and
-then copy the files back into the new eCryptfs mount to migrate the
-files.
+ In the beta/experimental releases of eCryptfs, when you upgrade
+ eCryptfs, you should copy the files to an unencrypted location and
+ then copy the files back into the new eCryptfs mount to migrate the
+ files.
-MOUNT-WIDE PASSPHRASE
+Mount-wide Passphrase
+=====================
Create a new directory into which eCryptfs will write its encrypted
files (i.e., /root/crypt). Then, create the mount point directory
-(i.e., /mnt/crypt). Now it's time to mount eCryptfs:
+(i.e., /mnt/crypt). Now it's time to mount eCryptfs::
-mount -t ecryptfs /root/crypt /mnt/crypt
+ mount -t ecryptfs /root/crypt /mnt/crypt
You should be prompted for a passphrase and a salt (the salt may be
blank).
-Try writing a new file:
+Try writing a new file::
-echo "Hello, World" > /mnt/crypt/hello.txt
+ echo "Hello, World" > /mnt/crypt/hello.txt
The operation will complete. Notice that there is a new file in
/root/crypt that is at least 12288 bytes in size (depending on your
@@ -59,10 +65,13 @@
Then umount /mnt/crypt and mount again per the instructions given
above.
-cat /mnt/crypt/hello.txt
+::
+
+ cat /mnt/crypt/hello.txt
-NOTES
+Notes
+=====
eCryptfs version 0.1 should only be mounted on (1) empty directories
or (2) directories containing files only created by eCryptfs. If you
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
new file mode 100644
index 0000000..0551985
--- /dev/null
+++ b/Documentation/filesystems/efivarfs.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+efivarfs - a (U)EFI variable filesystem
+=======================================
+
+The efivarfs filesystem was created to address the shortcomings of
+using entries in sysfs to maintain EFI variables. The old sysfs EFI
+variables code only supported variables of up to 1024 bytes. This
+limitation existed in version 0.99 of the EFI specification, but was
+removed before any full releases. Since variables can now be larger
+than a single page, sysfs isn't the best interface for this.
+
+Variables can be created, deleted and modified with the efivarfs
+filesystem.
+
+efivarfs is typically mounted like this::
+
+ mount -t efivarfs none /sys/firmware/efi/efivars
+
+Due to the presence of numerous firmware bugs where removing non-standard
+UEFI variables causes the system firmware to fail to POST, efivarfs
+files that are not well-known standardized variables are created
+as immutable files. This doesn't prevent removal - "chattr -i" will work -
+but it does prevent this kind of failure from being accomplished
+accidentally.
+
+.. warning ::
+ When a content of an UEFI variable in /sys/firmware/efi/efivars is
+ displayed, for example using "hexdump", pay attention that the first
+ 4 bytes of the output represent the UEFI variable attributes,
+ in little-endian format.
+
+ Practically the output of each efivar is composed of:
+
+ +-----------------------------------+
+ |4_bytes_of_attributes + efivar_data|
+ +-----------------------------------+
+
+*See also:*
+
+- Documentation/admin-guide/acpi/ssdt-overlays.rst
+- Documentation/ABI/stable/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt
deleted file mode 100644
index 686a64b..0000000
--- a/Documentation/filesystems/efivarfs.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-
-efivarfs - a (U)EFI variable filesystem
-
-The efivarfs filesystem was created to address the shortcomings of
-using entries in sysfs to maintain EFI variables. The old sysfs EFI
-variables code only supported variables of up to 1024 bytes. This
-limitation existed in version 0.99 of the EFI specification, but was
-removed before any full releases. Since variables can now be larger
-than a single page, sysfs isn't the best interface for this.
-
-Variables can be created, deleted and modified with the efivarfs
-filesystem.
-
-efivarfs is typically mounted like this,
-
- mount -t efivarfs none /sys/firmware/efi/efivars
-
-Due to the presence of numerous firmware bugs where removing non-standard
-UEFI variables causes the system firmware to fail to POST, efivarfs
-files that are not well-known standardized variables are created
-as immutable files. This doesn't prevent removal - "chattr -i" will work -
-but it does prevent this kind of failure from being accomplished
-accidentally.
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
new file mode 100644
index 0000000..bf14517
--- /dev/null
+++ b/Documentation/filesystems/erofs.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+Enhanced Read-Only File System - EROFS
+======================================
+
+Overview
+========
+
+EROFS file-system stands for Enhanced Read-Only File System. Different
+from other read-only file systems, it aims to be designed for flexibility,
+scalability, but be kept simple and high performance.
+
+It is designed as a better filesystem solution for the following scenarios:
+
+ - read-only storage media or
+
+ - part of a fully trusted read-only solution, which means it needs to be
+ immutable and bit-for-bit identical to the official golden image for
+ their releases due to security and other considerations and
+
+ - hope to save some extra storage space with guaranteed end-to-end performance
+ by using reduced metadata and transparent file compression, especially
+ for those embedded devices with limited memory (ex, smartphone);
+
+Here is the main features of EROFS:
+
+ - Little endian on-disk design;
+
+ - Currently 4KB block size (nobh) and therefore maximum 16TB address space;
+
+ - Metadata & data could be mixed by design;
+
+ - 2 inode versions for different requirements:
+
+ ===================== ============ =====================================
+ compact (v1) extended (v2)
+ ===================== ============ =====================================
+ Inode metadata size 32 bytes 64 bytes
+ Max file size 4 GB 16 EB (also limited by max. vol size)
+ Max uids/gids 65536 4294967296
+ File change time no yes (64 + 32-bit timestamp)
+ Max hardlinks 65536 4294967296
+ Metadata reserved 4 bytes 14 bytes
+ ===================== ============ =====================================
+
+ - Support extended attributes (xattrs) as an option;
+
+ - Support xattr inline and tail-end data inline for all files;
+
+ - Support POSIX.1e ACLs by using xattrs;
+
+ - Support transparent file compression as an option:
+ LZ4 algorithm with 4 KB fixed-sized output compression for high performance.
+
+The following git tree provides the file system user-space tools under
+development (ex, formatting tool mkfs.erofs):
+
+- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
+
+Bugs and patches are welcome, please kindly help us and send to the following
+linux-erofs mailing list:
+
+- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
+
+Mount options
+=============
+
+=================== =========================================================
+(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled
+ by default if CONFIG_EROFS_FS_XATTR is selected.
+(no)acl Setup POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
+cache_strategy=%s Select a strategy for cached decompression from now on:
+
+ ========== =============================================
+ disabled In-place I/O decompression only;
+ readahead Cache the last incomplete compressed physical
+ cluster for further reading. It still does
+ in-place I/O decompression for the rest
+ compressed physical clusters;
+ readaround Cache the both ends of incomplete compressed
+ physical clusters for further reading.
+ It still does in-place I/O decompression
+ for the rest compressed physical clusters.
+ ========== =============================================
+=================== =========================================================
+
+On-disk details
+===============
+
+Summary
+-------
+Different from other read-only file systems, an EROFS volume is designed
+to be as simple as possible::
+
+ |-> aligned with the block size
+ ____________________________________________________________
+ | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
+ |_|__|_|_____|__________|_____|______|__________|_____|______|
+ 0 +1K
+
+All data areas should be aligned with the block size, but metadata areas
+may not. All metadatas can be now observed in two different spaces (views):
+
+ 1. Inode metadata space
+
+ Each valid inode should be aligned with an inode slot, which is a fixed
+ value (32 bytes) and designed to be kept in line with compact inode size.
+
+ Each inode can be directly found with the following formula:
+ inode offset = meta_blkaddr * block_size + 32 * nid
+
+ ::
+
+ |-> aligned with 8B
+ |-> followed closely
+ + meta_blkaddr blocks |-> another slot
+ _____________________________________________________________________
+ | ... | inode | xattrs | extents | data inline | ... | inode ...
+ |________|_______|(optional)|(optional)|__(optional)_|_____|__________
+ |-> aligned with the inode slot size
+ . .
+ . .
+ . .
+ . .
+ . .
+ . .
+ .____________________________________________________|-> aligned with 4B
+ | xattr_ibody_header | shared xattrs | inline xattrs |
+ |____________________|_______________|_______________|
+ |-> 12 bytes <-|->x * 4 bytes<-| .
+ . . .
+ . . .
+ . . .
+ ._______________________________.______________________.
+ | id | id | id | id | ... | id | ent | ... | ent| ... |
+ |____|____|____|____|______|____|_____|_____|____|_____|
+ |-> aligned with 4B
+ |-> aligned with 4B
+
+ Inode could be 32 or 64 bytes, which can be distinguished from a common
+ field which all inode versions have -- i_format::
+
+ __________________ __________________
+ | i_format | | i_format |
+ |__________________| |__________________|
+ | ... | | ... |
+ | | | |
+ |__________________| 32 bytes | |
+ | |
+ |__________________| 64 bytes
+
+ Xattrs, extents, data inline are followed by the corresponding inode with
+ proper alignment, and they could be optional for different data mappings.
+ _currently_ total 4 valid data mappings are supported:
+
+ == ====================================================================
+ 0 flat file data without data inline (no extent);
+ 1 fixed-sized output data compression (with non-compacted indexes);
+ 2 flat file data with tail packing data inline (no extent);
+ 3 fixed-sized output data compression (with compacted indexes, v5.3+).
+ == ====================================================================
+
+ The size of the optional xattrs is indicated by i_xattr_count in inode
+ header. Large xattrs or xattrs shared by many different files can be
+ stored in shared xattrs metadata rather than inlined right after inode.
+
+ 2. Shared xattrs metadata space
+
+ Shared xattrs space is similar to the above inode space, started with
+ a specific block indicated by xattr_blkaddr, organized one by one with
+ proper align.
+
+ Each share xattr can also be directly found by the following formula:
+ xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
+
+ ::
+
+ |-> aligned by 4 bytes
+ + xattr_blkaddr blocks |-> aligned with 4 bytes
+ _________________________________________________________________________
+ | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
+ |________|_____________|_____________|_____|______________|_______________
+
+Directories
+-----------
+All directories are now organized in a compact on-disk format. Note that
+each directory block is divided into index and name areas in order to support
+random file lookup, and all directory entries are _strictly_ recorded in
+alphabetical order in order to support improved prefix binary search
+algorithm (could refer to the related source code).
+
+::
+
+ ___________________________
+ / |
+ / ______________|________________
+ / / | nameoff1 | nameoffN-1
+ ____________.______________._______________v________________v__________
+ | dirent | dirent | ... | dirent | filename | filename | ... | filename |
+ |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
+ \ ^
+ \ | * could have
+ \ | trailing '\0'
+ \________________________| nameoff0
+
+ Directory block
+
+Note that apart from the offset of the first filename, nameoff0 also indicates
+the total number of directory entries in this block since it is no need to
+introduce another on-disk field at all.
+
+Compression
+-----------
+Currently, EROFS supports 4KB fixed-sized output transparent file compression,
+as illustrated below::
+
+ |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
+ clusterofs clusterofs clusterofs
+ | | | logical data
+ _________v_______________________________v_____________________v_______________
+ ... | . | | . | | . | ...
+ ____|____.________|_____________|________.____|_____________|__.__________|____
+ |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
+ size size size size size
+ . . . .
+ . . . .
+ . . . .
+ _______._____________._____________._____________._____________________
+ ... | | | | ... physical data
+ _______|_____________|_____________|_____________|_____________________
+ |-> cluster <-|-> cluster <-|-> cluster <-|
+ size size size
+
+Currently each on-disk physical cluster can contain 4KB (un)compressed data
+at most. For each logical cluster, there is a corresponding on-disk index to
+describe its cluster type, physical cluster address, etc.
+
+See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
diff --git a/Documentation/filesystems/erofs.txt b/Documentation/filesystems/erofs.txt
deleted file mode 100644
index b0c0853..0000000
--- a/Documentation/filesystems/erofs.txt
+++ /dev/null
@@ -1,210 +0,0 @@
-Overview
-========
-
-EROFS file-system stands for Enhanced Read-Only File System. Different
-from other read-only file systems, it aims to be designed for flexibility,
-scalability, but be kept simple and high performance.
-
-It is designed as a better filesystem solution for the following scenarios:
- - read-only storage media or
-
- - part of a fully trusted read-only solution, which means it needs to be
- immutable and bit-for-bit identical to the official golden image for
- their releases due to security and other considerations and
-
- - hope to save some extra storage space with guaranteed end-to-end performance
- by using reduced metadata and transparent file compression, especially
- for those embedded devices with limited memory (ex, smartphone);
-
-Here is the main features of EROFS:
- - Little endian on-disk design;
-
- - Currently 4KB block size (nobh) and therefore maximum 16TB address space;
-
- - Metadata & data could be mixed by design;
-
- - 2 inode versions for different requirements:
- v1 v2
- Inode metadata size: 32 bytes 64 bytes
- Max file size: 4 GB 16 EB (also limited by max. vol size)
- Max uids/gids: 65536 4294967296
- File creation time: no yes (64 + 32-bit timestamp)
- Max hardlinks: 65536 4294967296
- Metadata reserved: 4 bytes 14 bytes
-
- - Support extended attributes (xattrs) as an option;
-
- - Support xattr inline and tail-end data inline for all files;
-
- - Support POSIX.1e ACLs by using xattrs;
-
- - Support transparent file compression as an option:
- LZ4 algorithm with 4 KB fixed-output compression for high performance;
-
-The following git tree provides the file system user-space tools under
-development (ex, formatting tool mkfs.erofs):
->> git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
-
-Bugs and patches are welcome, please kindly help us and send to the following
-linux-erofs mailing list:
->> linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
-
-Mount options
-=============
-
-(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled
- by default if CONFIG_EROFS_FS_XATTR is selected.
-(no)acl Setup POSIX Access Control List. Note: acl is enabled
- by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
-cache_strategy=%s Select a strategy for cached decompression from now on:
- disabled: In-place I/O decompression only;
- readahead: Cache the last incomplete compressed physical
- cluster for further reading. It still does
- in-place I/O decompression for the rest
- compressed physical clusters;
- readaround: Cache the both ends of incomplete compressed
- physical clusters for further reading.
- It still does in-place I/O decompression
- for the rest compressed physical clusters.
-
-On-disk details
-===============
-
-Summary
--------
-Different from other read-only file systems, an EROFS volume is designed
-to be as simple as possible:
-
- |-> aligned with the block size
- ____________________________________________________________
- | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
- |_|__|_|_____|__________|_____|______|__________|_____|______|
- 0 +1K
-
-All data areas should be aligned with the block size, but metadata areas
-may not. All metadatas can be now observed in two different spaces (views):
- 1. Inode metadata space
- Each valid inode should be aligned with an inode slot, which is a fixed
- value (32 bytes) and designed to be kept in line with v1 inode size.
-
- Each inode can be directly found with the following formula:
- inode offset = meta_blkaddr * block_size + 32 * nid
-
- |-> aligned with 8B
- |-> followed closely
- + meta_blkaddr blocks |-> another slot
- _____________________________________________________________________
- | ... | inode | xattrs | extents | data inline | ... | inode ...
- |________|_______|(optional)|(optional)|__(optional)_|_____|__________
- |-> aligned with the inode slot size
- . .
- . .
- . .
- . .
- . .
- . .
- .____________________________________________________|-> aligned with 4B
- | xattr_ibody_header | shared xattrs | inline xattrs |
- |____________________|_______________|_______________|
- |-> 12 bytes <-|->x * 4 bytes<-| .
- . . .
- . . .
- . . .
- ._______________________________.______________________.
- | id | id | id | id | ... | id | ent | ... | ent| ... |
- |____|____|____|____|______|____|_____|_____|____|_____|
- |-> aligned with 4B
- |-> aligned with 4B
-
- Inode could be 32 or 64 bytes, which can be distinguished from a common
- field which all inode versions have -- i_advise:
-
- __________________ __________________
- | i_advise | | i_advise |
- |__________________| |__________________|
- | ... | | ... |
- | | | |
- |__________________| 32 bytes | |
- | |
- |__________________| 64 bytes
-
- Xattrs, extents, data inline are followed by the corresponding inode with
- proper alignes, and they could be optional for different data mappings,
- _currently_ there are totally 3 valid data mappings supported:
-
- 1) flat file data without data inline (no extent);
- 2) fixed-output size data compression (must have extents);
- 3) flat file data with tail-end data inline (no extent);
-
- The size of the optional xattrs is indicated by i_xattr_count in inode
- header. Large xattrs or xattrs shared by many different files can be
- stored in shared xattrs metadata rather than inlined right after inode.
-
- 2. Shared xattrs metadata space
- Shared xattrs space is similar to the above inode space, started with
- a specific block indicated by xattr_blkaddr, organized one by one with
- proper align.
-
- Each share xattr can also be directly found by the following formula:
- xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
-
- |-> aligned by 4 bytes
- + xattr_blkaddr blocks |-> aligned with 4 bytes
- _________________________________________________________________________
- | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
- |________|_____________|_____________|_____|______________|_______________
-
-Directories
------------
-All directories are now organized in a compact on-disk format. Note that
-each directory block is divided into index and name areas in order to support
-random file lookup, and all directory entries are _strictly_ recorded in
-alphabetical order in order to support improved prefix binary search
-algorithm (could refer to the related source code).
-
- ___________________________
- / |
- / ______________|________________
- / / | nameoff1 | nameoffN-1
- ____________.______________._______________v________________v__________
-| dirent | dirent | ... | dirent | filename | filename | ... | filename |
-|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
- \ ^
- \ | * could have
- \ | trailing '\0'
- \________________________| nameoff0
-
- Directory block
-
-Note that apart from the offset of the first filename, nameoff0 also indicates
-the total number of directory entries in this block since it is no need to
-introduce another on-disk field at all.
-
-Compression
------------
-Currently, EROFS supports 4KB fixed-output clustersize transparent file
-compression, as illustrated below:
-
- |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
- clusterofs clusterofs clusterofs
- | | | logical data
-_________v_______________________________v_____________________v_______________
-... | . | | . | | . | ...
-____|____.________|_____________|________.____|_____________|__.__________|____
- |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
- size size size size size
- . . . .
- . . . .
- . . . .
- _______._____________._____________._____________._____________________
- ... | | | | ... physical data
- _______|_____________|_____________|_____________|_____________________
- |-> cluster <-|-> cluster <-|-> cluster <-|
- size size size
-
-Currently each on-disk physical cluster can contain 4KB (un)compressed data
-at most. For each logical cluster, there is a corresponding on-disk index to
-describe its cluster type, physical cluster address, etc.
-
-See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
-
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.rst
similarity index 91%
rename from Documentation/filesystems/ext2.txt
rename to Documentation/filesystems/ext2.rst
index 94c2cf0..d83dbbb 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
The Second Extended Filesystem
==============================
@@ -14,8 +16,9 @@
Most defaults are determined by the filesystem superblock, and can be
set using tune2fs(8). Kernel-determined defaults are indicated by (*).
-bsddf (*) Makes `df' act like BSD.
-minixdf Makes `df' act like Minix.
+==================== === ================================================
+bsddf (*) Makes ``df`` act like BSD.
+minixdf Makes ``df`` act like Minix.
check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
@@ -62,6 +65,7 @@
grpquota Enable group disk quota support
(requires CONFIG_QUOTA).
+==================== === ================================================
noquota option ls silently ignored by ext2.
@@ -294,9 +298,9 @@
If you're exceptionally paranoid, there are 3 ways of making metadata
writes synchronous on ext2:
-per-file if you have the program source: use the O_SYNC flag to open()
-per-file if you don't have the source: use "chattr +S" on the file
-per-filesystem: add the "sync" option to mount (or in /etc/fstab)
+- per-file if you have the program source: use the O_SYNC flag to open()
+- per-file if you don't have the source: use "chattr +S" on the file
+- per-filesystem: add the "sync" option to mount (or in /etc/fstab)
the first and last are not ext2 specific but do force the metadata to
be written synchronously. See also Journaling below.
@@ -316,10 +320,12 @@
format and using a compatibility flag to signal the format change (at
the expense of some compatibility).
-Filesystem block size: 1kB 2kB 4kB 8kB
-
-File size limit: 16GB 256GB 2048GB 2048GB
-Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
+===================== ======= ======= ======= ========
+Filesystem block size 1kB 2kB 4kB 8kB
+===================== ======= ======= ======= ========
+File size limit 16GB 256GB 2048GB 2048GB
+Filesystem size limit 2047GB 8192GB 16384GB 32768GB
+===================== ======= ======= ======= ========
There is a 2.4 kernel limit of 2048GB for a single block device, so no
filesystem larger than that can be created at this time. There is also
@@ -370,19 +376,24 @@
References
==========
+======================= ===============================================
The kernel source file:/usr/src/linux/fs/ext2/
e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
Filesystem Resizing http://ext2resize.sourceforge.net/
-Compression (*) http://e2compr.sourceforge.net/
+Compression [1]_ http://e2compr.sourceforge.net/
+======================= ===============================================
Implementations for:
-Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
-Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2
-DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
-OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
-RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
-(*) no longer actively developed/supported (as of Apr 2001)
-(+) no longer actively developed/supported (as of Mar 2009)
+======================= ===========================================================
+Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
+Windows 95 [1]_ http://www.yipton.net/content.html#FSDEXT2
+DOS client [1]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+OS/2 [2]_ ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
+RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
+======================= ===========================================================
+
+.. [1] no longer actively developed/supported (as of Apr 2001)
+.. [2] no longer actively developed/supported (as of Mar 2009)
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.rst
similarity index 88%
rename from Documentation/filesystems/ext3.txt
rename to Documentation/filesystems/ext3.rst
index 58758fb..c06cec3 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.rst
@@ -1,4 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+===============
Ext3 Filesystem
===============
diff --git a/Documentation/filesystems/ext4/about.rst b/Documentation/filesystems/ext4/about.rst
index 0aadba0..cc76b57 100644
--- a/Documentation/filesystems/ext4/about.rst
+++ b/Documentation/filesystems/ext4/about.rst
@@ -39,6 +39,6 @@
Other References
----------------
-Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
+Also see https://www.nongnu.org/ext2-doc/ for quite a collection of
information about ext2/3. Here's another old reference:
http://wiki.osdev.org/Ext2
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee..849d5b1 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -28,6 +28,17 @@
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.
+In case of ``data=ordered`` mode, Ext4 also supports fast commits which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in fast commit space that is shared with JBD2.
+Once the fast commit area fills in or if fast commit is not possible
+or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
+A full commit invalidates all the fast commits that happened before
+it and thus it makes the fast commit area empty for further fast
+commits. This feature needs to be enabled at mkfs time.
+
The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is normal (but hidden) file within the filesystem. The file usually
@@ -245,6 +256,10 @@
- s\_padding2
-
* - 0x54
+ - \_\_be32
+ - s\_num\_fc\_blocks
+ - Number of fast commit blocks in the journal.
+ * - 0x58
- \_\_u32
- s\_padding[42]
-
@@ -299,6 +314,8 @@
- This journal uses v3 of the checksum on-disk format. This is the same as
v2, but the journal block tag size is fixed regardless of the size of
block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
+ * - 0x20
+ - Journal has fast commit blocks. (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT)
.. _jbd2_checksum_type:
@@ -609,3 +626,58 @@
- h\_commit\_nsec
- Nanoseconds component of the above timestamp.
+Fast commits
+~~~~~~~~~~~~
+
+Fast commit area is organized as a log of tag length values. Each TLV has
+a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
+of the entire field. It is followed by variable length tag specific value.
+Here is the list of supported tags and their meanings:
+
+.. list-table::
+ :widths: 8 20 20 32
+ :header-rows: 1
+
+ * - Tag
+ - Meaning
+ - Value struct
+ - Description
+ * - EXT4_FC_TAG_HEAD
+ - Fast commit area header
+ - ``struct ext4_fc_head``
+ - Stores the TID of the transaction after which these fast commits should
+ be applied.
+ * - EXT4_FC_TAG_ADD_RANGE
+ - Add extent to inode
+ - ``struct ext4_fc_add_range``
+ - Stores the inode number and extent to be added in this inode
+ * - EXT4_FC_TAG_DEL_RANGE
+ - Remove logical offsets to inode
+ - ``struct ext4_fc_del_range``
+ - Stores the inode number and the logical offset range that needs to be
+ removed
+ * - EXT4_FC_TAG_CREAT
+ - Create directory entry for a newly created file
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry of the
+ newly created file
+ * - EXT4_FC_TAG_LINK
+ - Link a directory entry to an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+ * - EXT4_FC_TAG_UNLINK
+ - Unlink a directory entry of an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+
+ * - EXT4_FC_TAG_PAD
+ - Padding (unused area)
+ - None
+ - Unused bytes in the fast commit area.
+
+ * - EXT4_FC_TAG_TAIL
+ - Mark the end of a fast commit
+ - ``struct ext4_fc_tail``
+ - Stores the TID of the commit, CRC of the fast commit of which this tag
+ represents the end of
+
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 93e55d7..2eb1ab2 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -596,6 +596,13 @@
- Sparse Super Block, v2. If this flag is set, the SB field s\_backup\_bgs
points to the two block groups that contain backup superblocks
(COMPAT\_SPARSE\_SUPER2).
+ * - 0x400
+ - Fast commits supported. Although fast commits blocks are
+ backward incompatible, fast commit blocks are not always
+ present in the journal. If fast commit blocks are present in
+ the journal, JBD2 incompat feature
+ (JBD2\_FEATURE\_INCOMPAT\_FAST\_COMMIT) gets
+ set (COMPAT\_FAST\_COMMIT).
.. _super_incompat:
diff --git a/Documentation/filesystems/ext4/verity.rst b/Documentation/filesystems/ext4/verity.rst
index 3e4c0ee..e99ff3f 100644
--- a/Documentation/filesystems/ext4/verity.rst
+++ b/Documentation/filesystems/ext4/verity.rst
@@ -39,3 +39,6 @@
Verity files cannot have blocks allocated past the end of the verity
metadata.
+
+Verity and DAX are not compatible and attempts to set both of these flags
+on a file will fail.
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
new file mode 100644
index 0000000..8c0fbdd
--- /dev/null
+++ b/Documentation/filesystems/f2fs.rst
@@ -0,0 +1,826 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+WHAT IS Flash-Friendly File System (F2FS)?
+==========================================
+
+NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
+been equipped on a variety systems ranging from mobile to server systems. Since
+they are known to have different characteristics from the conventional rotating
+disks, a file system, an upper layer to the storage device, should adapt to the
+changes from the sketch in the design level.
+
+F2FS is a file system exploiting NAND flash memory-based storage devices, which
+is based on Log-structured File System (LFS). The design has been focused on
+addressing the fundamental issues in LFS, which are snowball effect of wandering
+tree and high cleaning overhead.
+
+Since a NAND flash memory-based storage device shows different characteristic
+according to its internal geometry or flash memory management scheme, namely FTL,
+F2FS and its tools support various parameters not only for configuring on-disk
+layout, but also for selecting allocation and cleaning algorithms.
+
+The following git tree provides the file system formatting tool (mkfs.f2fs),
+a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
+
+- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
+
+For reporting bugs and sending patches, please use the following mailing list:
+
+- linux-f2fs-devel@lists.sourceforge.net
+
+Background and Design issues
+============================
+
+Log-structured File System (LFS)
+--------------------------------
+"A log-structured file system writes all modifications to disk sequentially in
+a log-like structure, thereby speeding up both file writing and crash recovery.
+The log is the only structure on disk; it contains indexing information so that
+files can be read back from the log efficiently. In order to maintain large free
+areas on disk for fast writing, we divide the log into segments and use a
+segment cleaner to compress the live information from heavily fragmented
+segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
+implementation of a log-structured file system", ACM Trans. Computer Systems
+10, 1, 26–52.
+
+Wandering Tree Problem
+----------------------
+In LFS, when a file data is updated and written to the end of log, its direct
+pointer block is updated due to the changed location. Then the indirect pointer
+block is also updated due to the direct pointer block update. In this manner,
+the upper index structures such as inode, inode map, and checkpoint block are
+also updated recursively. This problem is called as wandering tree problem [1],
+and in order to enhance the performance, it should eliminate or relax the update
+propagation as much as possible.
+
+[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/
+
+Cleaning Overhead
+-----------------
+Since LFS is based on out-of-place writes, it produces so many obsolete blocks
+scattered across the whole storage. In order to serve new empty log space, it
+needs to reclaim these obsolete blocks seamlessly to users. This job is called
+as a cleaning process.
+
+The process consists of three operations as follows.
+
+1. A victim segment is selected through referencing segment usage table.
+2. It loads parent index structures of all the data in the victim identified by
+ segment summary blocks.
+3. It checks the cross-reference between the data and its parent index structure.
+4. It moves valid data selectively.
+
+This cleaning job may cause unexpected long delays, so the most important goal
+is to hide the latencies to users. And also definitely, it should reduce the
+amount of valid data to be moved, and move them quickly as well.
+
+Key Features
+============
+
+Flash Awareness
+---------------
+- Enlarge the random write area for better performance, but provide the high
+ spatial locality
+- Align FS data structures to the operational units in FTL as best efforts
+
+Wandering Tree Problem
+----------------------
+- Use a term, “node”, that represents inodes as well as various pointer blocks
+- Introduce Node Address Table (NAT) containing the locations of all the “node”
+ blocks; this will cut off the update propagation.
+
+Cleaning Overhead
+-----------------
+- Support a background cleaning process
+- Support greedy and cost-benefit algorithms for victim selection policies
+- Support multi-head logs for static/dynamic hot and cold data separation
+- Introduce adaptive logging for efficient block allocation
+
+Mount Options
+=============
+
+
+======================== ============================================================
+background_gc=%s Turn on/off cleaning operations, namely garbage
+ collection, triggered in background when I/O subsystem is
+ idle. If background_gc=on, it will turn on the garbage
+ collection and if background_gc=off, garbage collection
+ will be turned off. If background_gc=sync, it will turn
+ on synchronous garbage collection running in background.
+ Default value for this option is on. So garbage
+ collection is on by default.
+disable_roll_forward Disable the roll-forward recovery routine
+norecovery Disable the roll-forward recovery routine, mounted read-
+ only (i.e., -o ro,disable_roll_forward)
+discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
+ enabled, f2fs will issue discard/TRIM commands when a
+ segment is cleaned.
+no_heap Disable heap-style segment allocation which finds free
+ segments for data from the beginning of main area, while
+ for node from the end of main area.
+nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
+ by default if CONFIG_F2FS_FS_XATTR is selected.
+noacl Disable POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
+active_logs=%u Support configuring the number of active logs. In the
+ current design, f2fs supports only 2, 4, and 6 logs.
+ Default number is 6.
+disable_ext_identify Disable the extension list configured by mkfs, so f2fs
+ is not aware of cold files such as media files.
+inline_xattr Enable the inline xattrs feature.
+noinline_xattr Disable the inline xattrs feature.
+inline_xattr_size=%u Support configuring inline xattr size, it depends on
+ flexible inline xattr feature.
+inline_data Enable the inline data feature: Newly created small (<~3.4k)
+ files can be written into inode block.
+inline_dentry Enable the inline dir feature: data in newly created
+ directory entries can be written into inode block. The
+ space of inode block which is used to store inline
+ dentries is limited to ~3.4k.
+noinline_dentry Disable the inline dentry feature.
+flush_merge Merge concurrent cache_flush commands as much as possible
+ to eliminate redundant command issues. If the underlying
+ device handles the cache_flush command relatively slowly,
+ recommend to enable this option.
+nobarrier This option can be used if underlying storage guarantees
+ its cached data should be written to the novolatile area.
+ If this option is set, no cache_flush commands are issued
+ but f2fs still guarantees the write ordering of all the
+ data writes.
+fastboot This option is used when a system wants to reduce mount
+ time as much as possible, even though normal performance
+ can be sacrificed.
+extent_cache Enable an extent cache based on rb-tree, it can cache
+ as many as extent which map between contiguous logical
+ address and physical address per inode, resulting in
+ increasing the cache hit ratio. Set by default.
+noextent_cache Disable an extent cache based on rb-tree explicitly, see
+ the above extent_cache mount option.
+noinline_data Disable the inline data feature, inline data feature is
+ enabled by default.
+data_flush Enable data flushing before checkpoint in order to
+ persist data of regular and symlink.
+reserve_root=%d Support configuring reserved space which is used for
+ allocation from a privileged user with specified uid or
+ gid, unit: 4KB, the default limit is 0.2% of user blocks.
+resuid=%d The user ID which may use the reserved blocks.
+resgid=%d The group ID which may use the reserved blocks.
+fault_injection=%d Enable fault injection in all supported types with
+ specified injection rate.
+fault_type=%d Support configuring fault injection type, should be
+ enabled with fault_injection option, fault type value
+ is shown below, it supports single or combined type.
+
+ =================== ===========
+ Type_Name Type_Value
+ =================== ===========
+ FAULT_KMALLOC 0x000000001
+ FAULT_KVMALLOC 0x000000002
+ FAULT_PAGE_ALLOC 0x000000004
+ FAULT_PAGE_GET 0x000000008
+ FAULT_ALLOC_BIO 0x000000010
+ FAULT_ALLOC_NID 0x000000020
+ FAULT_ORPHAN 0x000000040
+ FAULT_BLOCK 0x000000080
+ FAULT_DIR_DEPTH 0x000000100
+ FAULT_EVICT_INODE 0x000000200
+ FAULT_TRUNCATE 0x000000400
+ FAULT_READ_IO 0x000000800
+ FAULT_CHECKPOINT 0x000001000
+ FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
+ =================== ===========
+mode=%s Control block allocation mode which supports "adaptive"
+ and "lfs". In "lfs" mode, there should be no random
+ writes towards main area.
+io_bits=%u Set the bit size of write IO requests. It should be set
+ with "mode=lfs".
+usrquota Enable plain user disk quota accounting.
+grpquota Enable plain group disk quota accounting.
+prjquota Enable plain project quota accounting.
+usrjquota=<file> Appoint specified file and type during mount, so that quota
+grpjquota=<file> information can be properly updated during recovery flow,
+prjjquota=<file> <quota file>: must be in root directory;
+jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
+offusrjquota Turn off user journalled quota.
+offgrpjquota Turn off group journalled quota.
+offprjjquota Turn off project journalled quota.
+quota Enable plain user disk quota accounting.
+noquota Disable all plain disk quota option.
+whint_mode=%s Control which write hints are passed down to block
+ layer. This supports "off", "user-based", and
+ "fs-based". In "off" mode (default), f2fs does not pass
+ down hints. In "user-based" mode, f2fs tries to pass
+ down hints given by users. And in "fs-based" mode, f2fs
+ passes down hints with its policy.
+alloc_mode=%s Adjust block allocation policy, which supports "reuse"
+ and "default".
+fsync_mode=%s Control the policy of fsync. Currently supports "posix",
+ "strict", and "nobarrier". In "posix" mode, which is
+ default, fsync will follow POSIX semantics and does a
+ light operation to improve the filesystem performance.
+ In "strict" mode, fsync will be heavy and behaves in line
+ with xfs, ext4 and btrfs, where xfstest generic/342 will
+ pass, but the performance will regress. "nobarrier" is
+ based on "posix", but doesn't issue flush command for
+ non-atomic files likewise "nobarrier" mount option.
+test_dummy_encryption
+test_dummy_encryption=%s
+ Enable dummy encryption, which provides a fake fscrypt
+ context. The fake fscrypt context is used by xfstests.
+ The argument may be either "v1" or "v2", in order to
+ select the corresponding fscrypt policy version.
+checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
+ to reenable checkpointing. Is enabled by default. While
+ disabled, any unmounting or unexpected shutdowns will cause
+ the filesystem contents to appear as they did when the
+ filesystem was mounted with that option.
+ While mounting with checkpoint=disabled, the filesystem must
+ run garbage collection to ensure that all available space can
+ be used. If this takes too much time, the mount may return
+ EAGAIN. You may optionally add a value to indicate how much
+ of the disk you would be willing to temporarily give up to
+ avoid additional garbage collection. This can be given as a
+ number of blocks, or as a percent. For instance, mounting
+ with checkpoint=disable:100% would always succeed, but it may
+ hide up to all remaining free space. The actual space that
+ would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
+ This space is reclaimed once checkpoint=enable.
+compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
+ "lz4", "zstd" and "lzo-rle" algorithm.
+compress_log_size=%u Support configuring compress cluster size, the size will
+ be 4KB * (1 << %u), 16KB is minimum size, also it's
+ default size.
+compress_extension=%s Support adding specified extension, so that f2fs can enable
+ compression on those corresponding files, e.g. if all files
+ with '.ext' has high compression rate, we can set the '.ext'
+ on compression extension list and enable compression on
+ these file by default rather than to enable it via ioctl.
+ For other files, we can still enable compression via ioctl.
+ Note that, there is one reserved special extension '*', it
+ can be set to enable compression for all files.
+inlinecrypt When possible, encrypt/decrypt the contents of encrypted
+ files using the blk-crypto framework rather than
+ filesystem-layer encryption. This allows the use of
+ inline encryption hardware. The on-disk format is
+ unaffected. For more details, see
+ Documentation/block/inline-encryption.rst.
+atgc Enable age-threshold garbage collection, it provides high
+ effectiveness and efficiency on background GC.
+======================== ============================================================
+
+Debugfs Entries
+===============
+
+/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
+f2fs. Each file shows the whole f2fs information.
+
+/sys/kernel/debug/f2fs/status includes:
+
+ - major file system information managed by f2fs currently
+ - average SIT information about whole segments
+ - current memory footprint consumed by f2fs.
+
+Sysfs Entries
+=============
+
+Information about mounted f2fs file systems can be found in
+/sys/fs/f2fs. Each mounted filesystem will have a directory in
+/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda).
+The files in each per-device directory are shown in table below.
+
+Files in /sys/fs/f2fs/<devname>
+(see also Documentation/ABI/testing/sysfs-fs-f2fs)
+
+Usage
+=====
+
+1. Download userland tools and compile them.
+
+2. Skip, if f2fs was compiled statically inside kernel.
+ Otherwise, insert the f2fs.ko module::
+
+ # insmod f2fs.ko
+
+3. Create a directory to use when mounting::
+
+ # mkdir /mnt/f2fs
+
+4. Format the block device, and then mount as f2fs::
+
+ # mkfs.f2fs -l label /dev/block_device
+ # mount -t f2fs /dev/block_device /mnt/f2fs
+
+mkfs.f2fs
+---------
+The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem,
+which builds a basic on-disk layout.
+
+The quick options consist of:
+
+=============== ===========================================================
+``-l [label]`` Give a volume label, up to 512 unicode name.
+``-a [0 or 1]`` Split start location of each area for heap-based allocation.
+
+ 1 is set by default, which performs this.
+``-o [int]`` Set overprovision ratio in percent over volume size.
+
+ 5 is set by default.
+``-s [int]`` Set the number of segments per section.
+
+ 1 is set by default.
+``-z [int]`` Set the number of sections per zone.
+
+ 1 is set by default.
+``-e [str]`` Set basic extension list. e.g. "mp3,gif,mov"
+``-t [0 or 1]`` Disable discard command or not.
+
+ 1 is set by default, which conducts discard.
+=============== ===========================================================
+
+Note: please refer to the manpage of mkfs.f2fs(8) to get full option list.
+
+fsck.f2fs
+---------
+The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
+partition, which examines whether the filesystem metadata and user-made data
+are cross-referenced correctly or not.
+Note that, initial version of the tool does not fix any inconsistency.
+
+The quick options consist of::
+
+ -d debug level [default:0]
+
+Note: please refer to the manpage of fsck.f2fs(8) to get full option list.
+
+dump.f2fs
+---------
+The dump.f2fs shows the information of specific inode and dumps SSA and SIT to
+file. Each file is dump_ssa and dump_sit.
+
+The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem.
+It shows on-disk inode information recognized by a given inode number, and is
+able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
+./dump_sit respectively.
+
+The options consist of::
+
+ -d debug level [default:0]
+ -i inode no (hex)
+ -s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
+ -a [SSA dump segno from #1~#2 (decimal), for all 0~-1]
+
+Examples::
+
+ # dump.f2fs -i [ino] /dev/sdx
+ # dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
+ # dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
+
+Note: please refer to the manpage of dump.f2fs(8) to get full option list.
+
+sload.f2fs
+----------
+The sload.f2fs gives a way to insert files and directories in the exisiting disk
+image. This tool is useful when building f2fs images given compiled files.
+
+Note: please refer to the manpage of sload.f2fs(8) to get full option list.
+
+resize.f2fs
+-----------
+The resize.f2fs lets a user resize the f2fs-formatted disk image, while preserving
+all the files and directories stored in the image.
+
+Note: please refer to the manpage of resize.f2fs(8) to get full option list.
+
+defrag.f2fs
+-----------
+The defrag.f2fs can be used to defragment scattered written data as well as
+filesystem metadata across the disk. This can improve the write speed by giving
+more free consecutive space.
+
+Note: please refer to the manpage of defrag.f2fs(8) to get full option list.
+
+f2fs_io
+-------
+The f2fs_io is a simple tool to issue various filesystem APIs as well as
+f2fs-specific ones, which is very useful for QA tests.
+
+Note: please refer to the manpage of f2fs_io(8) to get full option list.
+
+Design
+======
+
+On-disk Layout
+--------------
+
+F2FS divides the whole volume into a number of segments, each of which is fixed
+to 2MB in size. A section is composed of consecutive segments, and a zone
+consists of a set of sections. By default, section and zone sizes are set to one
+segment size identically, but users can easily modify the sizes by mkfs.
+
+F2FS splits the entire volume into six areas, and all the areas except superblock
+consist of multiple segments as described below::
+
+ align with the zone size <-|
+ |-> align with the segment size
+ _________________________________________________________________________
+ | | | Segment | Node | Segment | |
+ | Superblock | Checkpoint | Info. | Address | Summary | Main |
+ | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | |
+ |____________|_____2______|______N______|______N______|______N_____|__N___|
+ . .
+ . .
+ . .
+ ._________________________________________.
+ |_Segment_|_..._|_Segment_|_..._|_Segment_|
+ . .
+ ._________._________
+ |_section_|__...__|_
+ . .
+ .________.
+ |__zone__|
+
+- Superblock (SB)
+ It is located at the beginning of the partition, and there exist two copies
+ to avoid file system crash. It contains basic partition information and some
+ default parameters of f2fs.
+
+- Checkpoint (CP)
+ It contains file system information, bitmaps for valid NAT/SIT sets, orphan
+ inode lists, and summary entries of current active segments.
+
+- Segment Information Table (SIT)
+ It contains segment information such as valid block count and bitmap for the
+ validity of all the blocks.
+
+- Node Address Table (NAT)
+ It is composed of a block address table for all the node blocks stored in
+ Main area.
+
+- Segment Summary Area (SSA)
+ It contains summary entries which contains the owner information of all the
+ data and node blocks stored in Main area.
+
+- Main Area
+ It contains file and directory data including their indices.
+
+In order to avoid misalignment between file system and flash-based storage, F2FS
+aligns the start block address of CP with the segment size. Also, it aligns the
+start block address of Main area with the zone size by reserving some segments
+in SSA area.
+
+Reference the following survey for additional technical details.
+https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
+
+File System Metadata Structure
+------------------------------
+
+F2FS adopts the checkpointing scheme to maintain file system consistency. At
+mount time, F2FS first tries to find the last valid checkpoint data by scanning
+CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
+One of them always indicates the last valid data, which is called as shadow copy
+mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
+
+For file system consistency, each CP points to which NAT and SIT copies are
+valid, as shown as below::
+
+ +--------+----------+---------+
+ | CP | SIT | NAT |
+ +--------+----------+---------+
+ . . . .
+ . . . .
+ . . . .
+ +-------+-------+--------+--------+--------+--------+
+ | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
+ +-------+-------+--------+--------+--------+--------+
+ | ^ ^
+ | | |
+ `----------------------------------------'
+
+Index Structure
+---------------
+
+The key data structure to manage the data locations is a "node". Similar to
+traditional file structures, F2FS has three types of node: inode, direct node,
+indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
+indices, two direct node pointers, two indirect node pointers, and one double
+indirect node pointer as described below. One direct node block contains 1018
+data blocks, and one indirect node block contains also 1018 node blocks. Thus,
+one inode block (i.e., a file) covers::
+
+ 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
+
+ Inode block (4KB)
+ |- data (923)
+ |- direct node (2)
+ | `- data (1018)
+ |- indirect node (2)
+ | `- direct node (1018)
+ | `- data (1018)
+ `- double indirect node (1)
+ `- indirect node (1018)
+ `- direct node (1018)
+ `- data (1018)
+
+Note that all the node blocks are mapped by NAT which means the location of
+each node is translated by the NAT table. In the consideration of the wandering
+tree problem, F2FS is able to cut off the propagation of node updates caused by
+leaf data writes.
+
+Directory Structure
+-------------------
+
+A directory entry occupies 11 bytes, which consists of the following attributes.
+
+- hash hash value of the file name
+- ino inode number
+- len the length of file name
+- type file type such as directory, symlink, etc
+
+A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
+used to represent whether each dentry is valid or not. A dentry block occupies
+4KB with the following composition.
+
+::
+
+ Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
+ dentries(11 * 214 bytes) + file name (8 * 214 bytes)
+
+ [Bucket]
+ +--------------------------------+
+ |dentry block 1 | dentry block 2 |
+ +--------------------------------+
+ . .
+ . .
+ . [Dentry Block Structure: 4KB] .
+ +--------+----------+----------+------------+
+ | bitmap | reserved | dentries | file names |
+ +--------+----------+----------+------------+
+ [Dentry Block: 4KB] . .
+ . .
+ . .
+ +------+------+-----+------+
+ | hash | ino | len | type |
+ +------+------+-----+------+
+ [Dentry Structure: 11 bytes]
+
+F2FS implements multi-level hash tables for directory structure. Each level has
+a hash table with dedicated number of hash buckets as shown below. Note that
+"A(2B)" means a bucket includes 2 data blocks.
+
+::
+
+ ----------------------
+ A : bucket
+ B : block
+ N : MAX_DIR_HASH_DEPTH
+ ----------------------
+
+ level #0 | A(2B)
+ |
+ level #1 | A(2B) - A(2B)
+ |
+ level #2 | A(2B) - A(2B) - A(2B) - A(2B)
+ . | . . . .
+ level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
+ . | . . . .
+ level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
+
+The number of blocks and buckets are determined by::
+
+ ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
+ # of blocks in level #n = |
+ `- 4, Otherwise
+
+ ,- 2^(n + dir_level),
+ | if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
+ # of buckets in level #n = |
+ `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
+ Otherwise
+
+When F2FS finds a file name in a directory, at first a hash value of the file
+name is calculated. Then, F2FS scans the hash table in level #0 to find the
+dentry consisting of the file name and its inode number. If not found, F2FS
+scans the next hash table in level #1. In this way, F2FS scans hash tables in
+each levels incrementally from 1 to N. In each level F2FS needs to scan only
+one bucket determined by the following equation, which shows O(log(# of files))
+complexity::
+
+ bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
+
+In the case of file creation, F2FS finds empty consecutive slots that cover the
+file name. F2FS searches the empty slots in the hash tables of whole levels from
+1 to N in the same way as the lookup operation.
+
+The following figure shows an example of two cases holding children::
+
+ --------------> Dir <--------------
+ | |
+ child child
+
+ child - child [hole] - child
+
+ child - child - child [hole] - [hole] - child
+
+ Case 1: Case 2:
+ Number of children = 6, Number of children = 3,
+ File size = 7 File size = 7
+
+Default Block Allocation
+------------------------
+
+At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
+and Hot/Warm/Cold data.
+
+- Hot node contains direct node blocks of directories.
+- Warm node contains direct node blocks except hot node blocks.
+- Cold node contains indirect node blocks
+- Hot data contains dentry blocks
+- Warm data contains data blocks except hot and cold data blocks
+- Cold data contains multimedia data or migrated data blocks
+
+LFS has two schemes for free space management: threaded log and copy-and-compac-
+tion. The copy-and-compaction scheme which is known as cleaning, is well-suited
+for devices showing very good sequential write performance, since free segments
+are served all the time for writing new data. However, it suffers from cleaning
+overhead under high utilization. Contrarily, the threaded log scheme suffers
+from random writes, but no cleaning process is needed. F2FS adopts a hybrid
+scheme where the copy-and-compaction scheme is adopted by default, but the
+policy is dynamically changed to the threaded log scheme according to the file
+system status.
+
+In order to align F2FS with underlying flash-based storage, F2FS allocates a
+segment in a unit of section. F2FS expects that the section size would be the
+same as the unit size of garbage collection in FTL. Furthermore, with respect
+to the mapping granularity in FTL, F2FS allocates each section of the active
+logs from different zones as much as possible, since FTL can write the data in
+the active logs into one allocation unit according to its mapping granularity.
+
+Cleaning process
+----------------
+
+F2FS does cleaning both on demand and in the background. On-demand cleaning is
+triggered when there are not enough free segments to serve VFS calls. Background
+cleaner is operated by a kernel thread, and triggers the cleaning job when the
+system is idle.
+
+F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
+In the greedy algorithm, F2FS selects a victim segment having the smallest number
+of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
+according to the segment age and the number of valid blocks in order to address
+log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
+algorithm for on-demand cleaner, while background cleaner adopts cost-benefit
+algorithm.
+
+In order to identify whether the data in the victim segment are valid or not,
+F2FS manages a bitmap. Each bit represents the validity of a block, and the
+bitmap is composed of a bit stream covering whole blocks in main area.
+
+Write-hint Policy
+-----------------
+
+1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
+
+2) whint_mode=user-based. F2FS tries to pass down hints given by
+users.
+
+===================== ======================== ===================
+User F2FS Block
+===================== ======================== ===================
+N/A META WRITE_LIFE_NOT_SET
+N/A HOT_NODE "
+N/A WARM_NODE "
+N/A COLD_NODE "
+ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
+extension list " "
+
+-- buffered io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE " "
+WRITE_LIFE_MEDIUM " "
+WRITE_LIFE_LONG " "
+
+-- direct io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE " WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG " WRITE_LIFE_LONG
+===================== ======================== ===================
+
+3) whint_mode=fs-based. F2FS passes down hints with its policy.
+
+===================== ======================== ===================
+User F2FS Block
+===================== ======================== ===================
+N/A META WRITE_LIFE_MEDIUM;
+N/A HOT_NODE WRITE_LIFE_NOT_SET
+N/A WARM_NODE "
+N/A COLD_NODE WRITE_LIFE_NONE
+ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
+extension list " "
+
+-- buffered io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG
+WRITE_LIFE_NONE " "
+WRITE_LIFE_MEDIUM " "
+WRITE_LIFE_LONG " "
+
+-- direct io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE " WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG " WRITE_LIFE_LONG
+===================== ======================== ===================
+
+Fallocate(2) Policy
+-------------------
+
+The default policy follows the below POSIX rule.
+
+Allocating disk space
+ The default operation (i.e., mode is zero) of fallocate() allocates
+ the disk space within the range specified by offset and len. The
+ file size (as reported by stat(2)) will be changed if offset+len is
+ greater than the file size. Any subregion within the range specified
+ by offset and len that did not contain data before the call will be
+ initialized to zero. This default behavior closely resembles the
+ behavior of the posix_fallocate(3) library function, and is intended
+ as a method of optimally implementing that function.
+
+However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
+fallocate(fd, DEFAULT_MODE), it allocates on-disk block addressess having
+zero or random data, which is useful to the below scenario where:
+
+ 1. create(fd)
+ 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
+ 3. fallocate(fd, 0, 0, size)
+ 4. address = fibmap(fd, offset)
+ 5. open(blkdev)
+ 6. write(blkdev, address)
+
+Compression implementation
+--------------------------
+
+- New term named cluster is defined as basic unit of compression, file can
+ be divided into multiple clusters logically. One cluster includes 4 << n
+ (n >= 0) logical pages, compression size is also cluster size, each of
+ cluster can be compressed or not.
+
+- In cluster metadata layout, one special block address is used to indicate
+ a cluster is a compressed one or normal one; for compressed cluster, following
+ metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs
+ stores data including compress header and compressed data.
+
+- In order to eliminate write amplification during overwrite, F2FS only
+ support compression on write-once file, data can be compressed only when
+ all logical blocks in cluster contain valid data and compress ratio of
+ cluster data is lower than specified threshold.
+
+- To enable compression on regular inode, there are three ways:
+
+ * chattr +c file
+ * chattr +c dir; touch dir/file
+ * mount w/ -o compress_extension=ext; touch file.ext
+
+Compress metadata layout::
+
+ [Dnode Structure]
+ +-----------------------------------------------+
+ | cluster 1 | cluster 2 | ......... | cluster N |
+ +-----------------------------------------------+
+ . . . .
+ . . . .
+ . Compressed Cluster . . Normal Cluster .
+ +----------+---------+---------+---------+ +---------+---------+---------+---------+
+ |compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+ +----------+---------+---------+---------+ +---------+---------+---------+---------+
+ . .
+ . .
+ . .
+ +-------------+-------------+----------+----------------------------+
+ | data length | data chksum | reserved | compressed data |
+ +-------------+-------------+----------+----------------------------+
+
+NVMe Zoned Namespace devices
+----------------------------
+
+- ZNS defines a per-zone capacity which can be equal or less than the
+ zone-size. Zone-capacity is the number of usable blocks in the zone.
+ F2FS checks if zone-capacity is less than zone-size, if it is, then any
+ segment which starts after the zone-capacity is marked as not-free in
+ the free segment bitmap at initial mount time. These segments are marked
+ as permanently used so they are not allocated for writes and
+ consequently are not needed to be garbage collected. In case the
+ zone-capacity is not aligned to default segment size(2MB), then a segment
+ can start before the zone-capacity and span across zone-capacity boundary.
+ Such spanning segments are also considered as usable segments. All blocks
+ past the zone-capacity are considered unusable in these segments.
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
deleted file mode 100644
index 7e19913..0000000
--- a/Documentation/filesystems/f2fs.txt
+++ /dev/null
@@ -1,839 +0,0 @@
-================================================================================
-WHAT IS Flash-Friendly File System (F2FS)?
-================================================================================
-
-NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have
-been equipped on a variety systems ranging from mobile to server systems. Since
-they are known to have different characteristics from the conventional rotating
-disks, a file system, an upper layer to the storage device, should adapt to the
-changes from the sketch in the design level.
-
-F2FS is a file system exploiting NAND flash memory-based storage devices, which
-is based on Log-structured File System (LFS). The design has been focused on
-addressing the fundamental issues in LFS, which are snowball effect of wandering
-tree and high cleaning overhead.
-
-Since a NAND flash memory-based storage device shows different characteristic
-according to its internal geometry or flash memory management scheme, namely FTL,
-F2FS and its tools support various parameters not only for configuring on-disk
-layout, but also for selecting allocation and cleaning algorithms.
-
-The following git tree provides the file system formatting tool (mkfs.f2fs),
-a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
->> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
-
-For reporting bugs and sending patches, please use the following mailing list:
->> linux-f2fs-devel@lists.sourceforge.net
-
-================================================================================
-BACKGROUND AND DESIGN ISSUES
-================================================================================
-
-Log-structured File System (LFS)
---------------------------------
-"A log-structured file system writes all modifications to disk sequentially in
-a log-like structure, thereby speeding up both file writing and crash recovery.
-The log is the only structure on disk; it contains indexing information so that
-files can be read back from the log efficiently. In order to maintain large free
-areas on disk for fast writing, we divide the log into segments and use a
-segment cleaner to compress the live information from heavily fragmented
-segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and
-implementation of a log-structured file system", ACM Trans. Computer Systems
-10, 1, 26–52.
-
-Wandering Tree Problem
-----------------------
-In LFS, when a file data is updated and written to the end of log, its direct
-pointer block is updated due to the changed location. Then the indirect pointer
-block is also updated due to the direct pointer block update. In this manner,
-the upper index structures such as inode, inode map, and checkpoint block are
-also updated recursively. This problem is called as wandering tree problem [1],
-and in order to enhance the performance, it should eliminate or relax the update
-propagation as much as possible.
-
-[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/
-
-Cleaning Overhead
------------------
-Since LFS is based on out-of-place writes, it produces so many obsolete blocks
-scattered across the whole storage. In order to serve new empty log space, it
-needs to reclaim these obsolete blocks seamlessly to users. This job is called
-as a cleaning process.
-
-The process consists of three operations as follows.
-1. A victim segment is selected through referencing segment usage table.
-2. It loads parent index structures of all the data in the victim identified by
- segment summary blocks.
-3. It checks the cross-reference between the data and its parent index structure.
-4. It moves valid data selectively.
-
-This cleaning job may cause unexpected long delays, so the most important goal
-is to hide the latencies to users. And also definitely, it should reduce the
-amount of valid data to be moved, and move them quickly as well.
-
-================================================================================
-KEY FEATURES
-================================================================================
-
-Flash Awareness
----------------
-- Enlarge the random write area for better performance, but provide the high
- spatial locality
-- Align FS data structures to the operational units in FTL as best efforts
-
-Wandering Tree Problem
-----------------------
-- Use a term, “node”, that represents inodes as well as various pointer blocks
-- Introduce Node Address Table (NAT) containing the locations of all the “node”
- blocks; this will cut off the update propagation.
-
-Cleaning Overhead
------------------
-- Support a background cleaning process
-- Support greedy and cost-benefit algorithms for victim selection policies
-- Support multi-head logs for static/dynamic hot and cold data separation
-- Introduce adaptive logging for efficient block allocation
-
-================================================================================
-MOUNT OPTIONS
-================================================================================
-
-background_gc=%s Turn on/off cleaning operations, namely garbage
- collection, triggered in background when I/O subsystem is
- idle. If background_gc=on, it will turn on the garbage
- collection and if background_gc=off, garbage collection
- will be turned off. If background_gc=sync, it will turn
- on synchronous garbage collection running in background.
- Default value for this option is on. So garbage
- collection is on by default.
-disable_roll_forward Disable the roll-forward recovery routine
-norecovery Disable the roll-forward recovery routine, mounted read-
- only (i.e., -o ro,disable_roll_forward)
-discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
- enabled, f2fs will issue discard/TRIM commands when a
- segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
-nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
- by default if CONFIG_F2FS_FS_XATTR is selected.
-noacl Disable POSIX Access Control List. Note: acl is enabled
- by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
-active_logs=%u Support configuring the number of active logs. In the
- current design, f2fs supports only 2, 4, and 6 logs.
- Default number is 6.
-disable_ext_identify Disable the extension list configured by mkfs, so f2fs
- does not aware of cold files such as media files.
-inline_xattr Enable the inline xattrs feature.
-noinline_xattr Disable the inline xattrs feature.
-inline_xattr_size=%u Support configuring inline xattr size, it depends on
- flexible inline xattr feature.
-inline_data Enable the inline data feature: New created small(<~3.4k)
- files can be written into inode block.
-inline_dentry Enable the inline dir feature: data in new created
- directory entries can be written into inode block. The
- space of inode block which is used to store inline
- dentries is limited to ~3.4k.
-noinline_dentry Disable the inline dentry feature.
-flush_merge Merge concurrent cache_flush commands as much as possible
- to eliminate redundant command issues. If the underlying
- device handles the cache_flush command relatively slowly,
- recommend to enable this option.
-nobarrier This option can be used if underlying storage guarantees
- its cached data should be written to the novolatile area.
- If this option is set, no cache_flush commands are issued
- but f2fs still guarantees the write ordering of all the
- data writes.
-fastboot This option is used when a system wants to reduce mount
- time as much as possible, even though normal performance
- can be sacrificed.
-extent_cache Enable an extent cache based on rb-tree, it can cache
- as many as extent which map between contiguous logical
- address and physical address per inode, resulting in
- increasing the cache hit ratio. Set by default.
-noextent_cache Disable an extent cache based on rb-tree explicitly, see
- the above extent_cache mount option.
-noinline_data Disable the inline data feature, inline data feature is
- enabled by default.
-data_flush Enable data flushing before checkpoint in order to
- persist data of regular and symlink.
-reserve_root=%d Support configuring reserved space which is used for
- allocation from a privileged user with specified uid or
- gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d The user ID which may use the reserved blocks.
-resgid=%d The group ID which may use the reserved blocks.
-fault_injection=%d Enable fault injection in all supported types with
- specified injection rate.
-fault_type=%d Support configuring fault injection type, should be
- enabled with fault_injection option, fault type value
- is shown below, it supports single or combined type.
- Type_Name Type_Value
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
-mode=%s Control block allocation mode which supports "adaptive"
- and "lfs". In "lfs" mode, there should be no random
- writes towards main area.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
-usrquota Enable plain user disk quota accounting.
-grpquota Enable plain group disk quota accounting.
-prjquota Enable plain project quota accounting.
-usrjquota=<file> Appoint specified file and type during mount, so that quota
-grpjquota=<file> information can be properly updated during recovery flow,
-prjjquota=<file> <quota file>: must be in root directory;
-jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
-offusrjquota Turn off user journelled quota.
-offgrpjquota Turn off group journelled quota.
-offprjjquota Turn off project journelled quota.
-quota Enable plain user disk quota accounting.
-noquota Disable all plain disk quota option.
-whint_mode=%s Control which write hints are passed down to block
- layer. This supports "off", "user-based", and
- "fs-based". In "off" mode (default), f2fs does not pass
- down hints. In "user-based" mode, f2fs tries to pass
- down hints given by users. And in "fs-based" mode, f2fs
- passes down hints with its policy.
-alloc_mode=%s Adjust block allocation policy, which supports "reuse"
- and "default".
-fsync_mode=%s Control the policy of fsync. Currently supports "posix",
- "strict", and "nobarrier". In "posix" mode, which is
- default, fsync will follow POSIX semantics and does a
- light operation to improve the filesystem performance.
- In "strict" mode, fsync will be heavy and behaves in line
- with xfs, ext4 and btrfs, where xfstest generic/342 will
- pass, but the performance will regress. "nobarrier" is
- based on "posix", but doesn't issue flush command for
- non-atomic files likewise "nobarrier" mount option.
-test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt
- context. The fake fscrypt context is used by xfstests.
-checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
- to reenable checkpointing. Is enabled by default. While
- disabled, any unmounting or unexpected shutdowns will cause
- the filesystem contents to appear as they did when the
- filesystem was mounted with that option.
- While mounting with checkpoint=disabled, the filesystem must
- run garbage collection to ensure that all available space can
- be used. If this takes too much time, the mount may return
- EAGAIN. You may optionally add a value to indicate how much
- of the disk you would be willing to temporarily give up to
- avoid additional garbage collection. This can be given as a
- number of blocks, or as a percent. For instance, mounting
- with checkpoint=disable:100% would always succeed, but it may
- hide up to all remaining free space. The actual space that
- would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
- This space is reclaimed once checkpoint=enable.
-
-================================================================================
-DEBUGFS ENTRIES
-================================================================================
-
-/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as
-f2fs. Each file shows the whole f2fs information.
-
-/sys/kernel/debug/f2fs/status includes:
- - major file system information managed by f2fs currently
- - average SIT information about whole segments
- - current memory footprint consumed by f2fs.
-
-================================================================================
-SYSFS ENTRIES
-================================================================================
-
-Information about mounted f2fs file systems can be found in
-/sys/fs/f2fs. Each mounted filesystem will have a directory in
-/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda).
-The files in each per-device directory are shown in table below.
-
-Files in /sys/fs/f2fs/<devname>
-(see also Documentation/ABI/testing/sysfs-fs-f2fs)
-..............................................................................
- File Content
-
- gc_urgent_sleep_time This parameter controls sleep time for gc_urgent.
- 500 ms is set by default. See above gc_urgent.
-
- gc_min_sleep_time This tuning parameter controls the minimum sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_max_sleep_time This tuning parameter controls the maximum sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_no_gc_sleep_time This tuning parameter controls the default sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_idle This parameter controls the selection of victim
- policy for garbage collection. Setting gc_idle = 0
- (default) will disable this option. Setting
- gc_idle = 1 will select the Cost Benefit approach
- & setting gc_idle = 2 will select the greedy approach.
-
- gc_urgent This parameter controls triggering background GCs
- urgently or not. Setting gc_urgent = 0 [default]
- makes back to default behavior, while if it is set
- to 1, background thread starts to do GC by given
- gc_urgent_sleep_time interval.
-
- reclaim_segments This parameter controls the number of prefree
- segments to be reclaimed. If the number of prefree
- segments is larger than the number of segments
- in the proportion to the percentage over total
- volume size, f2fs tries to conduct checkpoint to
- reclaim the prefree segments to free segments.
- By default, 5% over total # of segments.
-
- max_small_discards This parameter controls the number of discard
- commands that consist small blocks less than 2MB.
- The candidates to be discarded are cached until
- checkpoint is triggered, and issued during the
- checkpoint. By default, it is disabled with 0.
-
- discard_granularity This parameter controls the granularity of discard
- command size. It will issue discard commands iif
- the size is larger than given granularity. Its
- unit size is 4KB, and 4 (=16KB) is set by default.
- The maximum value is 128 (=512KB).
-
- reserved_blocks This parameter indicates the number of blocks that
- f2fs reserves internally for root.
-
- batched_trim_sections This parameter controls the number of sections
- to be trimmed out in batch mode when FITRIM
- conducts. 32 sections is set by default.
-
- ipu_policy This parameter controls the policy of in-place
- updates in f2fs. There are five policies:
- 0x01: F2FS_IPU_FORCE, 0x02: F2FS_IPU_SSR,
- 0x04: F2FS_IPU_UTIL, 0x08: F2FS_IPU_SSR_UTIL,
- 0x10: F2FS_IPU_FSYNC.
-
- min_ipu_util This parameter controls the threshold to trigger
- in-place-updates. The number indicates percentage
- of the filesystem utilization, and used by
- F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.
-
- min_fsync_blocks This parameter controls the threshold to trigger
- in-place-updates when F2FS_IPU_FSYNC mode is set.
- The number indicates the number of dirty pages
- when fsync needs to flush on its call path. If
- the number is less than this value, it triggers
- in-place-updates.
-
- min_seq_blocks This parameter controls the threshold to serialize
- write IOs issued by multiple threads in parallel.
-
- min_hot_blocks This parameter controls the threshold to allocate
- a hot data log for pending data blocks to write.
-
- min_ssr_sections This parameter adds the threshold when deciding
- SSR block allocation. If this is large, SSR mode
- will be enabled early.
-
- ram_thresh This parameter controls the memory footprint used
- by free nids and cached nat entries. By default,
- 10 is set, which indicates 10 MB / 1 GB RAM.
-
- ra_nid_pages When building free nids, F2FS reads NAT blocks
- ahead for speed up. Default is 0.
-
- dirty_nats_ratio Given dirty ratio of cached nat entries, F2FS
- determines flushing them in background.
-
- max_victim_search This parameter controls the number of trials to
- find a victim segment when conducting SSR and
- cleaning operations. The default value is 4096
- which covers 8GB block address range.
-
- migration_granularity For large-sized sections, F2FS can stop GC given
- this granularity instead of reclaiming entire
- section.
-
- dir_level This parameter controls the directory level to
- support large directory. If a directory has a
- number of files, it can reduce the file lookup
- latency by increasing this dir_level value.
- Otherwise, it needs to decrease this value to
- reduce the space overhead. The default value is 0.
-
- cp_interval F2FS tries to do checkpoint periodically, 60 secs
- by default.
-
- idle_interval F2FS detects system is idle, if there's no F2FS
- operations during given interval, 5 secs by
- default.
-
- discard_idle_interval F2FS detects the discard thread is idle, given
- time interval. Default is 5 secs.
-
- gc_idle_interval F2FS detects the GC thread is idle, given time
- interval. Default is 5 secs.
-
- umount_discard_timeout When unmounting the disk, F2FS waits for finishing
- queued discard commands which can take huge time.
- This gives time out for it, 5 secs by default.
-
- iostat_enable This controls to enable/disable iostat in F2FS.
-
- readdir_ra This enables/disabled readahead of inode blocks
- in readdir, and default is enabled.
-
- gc_pin_file_thresh This indicates how many GC can be failed for the
- pinned file. If it exceeds this, F2FS doesn't
- guarantee its pinning state. 2048 trials is set
- by default.
-
- extension_list This enables to change extension_list for hot/cold
- files in runtime.
-
- inject_rate This controls injection rate of arbitrary faults.
-
- inject_type This controls injection type of arbitrary faults.
-
- dirty_segments This shows # of dirty segments.
-
- lifetime_write_kbytes This shows # of data written to the disk.
-
- features This shows current features enabled on F2FS.
-
- current_reserved_blocks This shows # of blocks currently reserved.
-
- unusable If checkpoint=disable, this shows the number of
- blocks that are unusable.
- If checkpoint=enable it shows the number of blocks
- that would be unusable if checkpoint=disable were
- to be set.
-
-encoding This shows the encoding used for casefolding.
- If casefolding is not enabled, returns (none)
-
-================================================================================
-USAGE
-================================================================================
-
-1. Download userland tools and compile them.
-
-2. Skip, if f2fs was compiled statically inside kernel.
- Otherwise, insert the f2fs.ko module.
- # insmod f2fs.ko
-
-3. Create a directory trying to mount
- # mkdir /mnt/f2fs
-
-4. Format the block device, and then mount as f2fs
- # mkfs.f2fs -l label /dev/block_device
- # mount -t f2fs /dev/block_device /mnt/f2fs
-
-mkfs.f2fs
----------
-The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem,
-which builds a basic on-disk layout.
-
-The options consist of:
--l [label] : Give a volume label, up to 512 unicode name.
--a [0 or 1] : Split start location of each area for heap-based allocation.
- 1 is set by default, which performs this.
--o [int] : Set overprovision ratio in percent over volume size.
- 5 is set by default.
--s [int] : Set the number of segments per section.
- 1 is set by default.
--z [int] : Set the number of sections per zone.
- 1 is set by default.
--e [str] : Set basic extension list. e.g. "mp3,gif,mov"
--t [0 or 1] : Disable discard command or not.
- 1 is set by default, which conducts discard.
-
-fsck.f2fs
----------
-The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
-partition, which examines whether the filesystem metadata and user-made data
-are cross-referenced correctly or not.
-Note that, initial version of the tool does not fix any inconsistency.
-
-The options consist of:
- -d debug level [default:0]
-
-dump.f2fs
----------
-The dump.f2fs shows the information of specific inode and dumps SSA and SIT to
-file. Each file is dump_ssa and dump_sit.
-
-The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem.
-It shows on-disk inode information recognized by a given inode number, and is
-able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and
-./dump_sit respectively.
-
-The options consist of:
- -d debug level [default:0]
- -i inode no (hex)
- -s [SIT dump segno from #1~#2 (decimal), for all 0~-1]
- -a [SSA dump segno from #1~#2 (decimal), for all 0~-1]
-
-Examples:
-# dump.f2fs -i [ino] /dev/sdx
-# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
-# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
-
-================================================================================
-DESIGN
-================================================================================
-
-On-disk Layout
---------------
-
-F2FS divides the whole volume into a number of segments, each of which is fixed
-to 2MB in size. A section is composed of consecutive segments, and a zone
-consists of a set of sections. By default, section and zone sizes are set to one
-segment size identically, but users can easily modify the sizes by mkfs.
-
-F2FS splits the entire volume into six areas, and all the areas except superblock
-consists of multiple segments as described below.
-
- align with the zone size <-|
- |-> align with the segment size
- _________________________________________________________________________
- | | | Segment | Node | Segment | |
- | Superblock | Checkpoint | Info. | Address | Summary | Main |
- | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | |
- |____________|_____2______|______N______|______N______|______N_____|__N___|
- . .
- . .
- . .
- ._________________________________________.
- |_Segment_|_..._|_Segment_|_..._|_Segment_|
- . .
- ._________._________
- |_section_|__...__|_
- . .
- .________.
- |__zone__|
-
-- Superblock (SB)
- : It is located at the beginning of the partition, and there exist two copies
- to avoid file system crash. It contains basic partition information and some
- default parameters of f2fs.
-
-- Checkpoint (CP)
- : It contains file system information, bitmaps for valid NAT/SIT sets, orphan
- inode lists, and summary entries of current active segments.
-
-- Segment Information Table (SIT)
- : It contains segment information such as valid block count and bitmap for the
- validity of all the blocks.
-
-- Node Address Table (NAT)
- : It is composed of a block address table for all the node blocks stored in
- Main area.
-
-- Segment Summary Area (SSA)
- : It contains summary entries which contains the owner information of all the
- data and node blocks stored in Main area.
-
-- Main Area
- : It contains file and directory data including their indices.
-
-In order to avoid misalignment between file system and flash-based storage, F2FS
-aligns the start block address of CP with the segment size. Also, it aligns the
-start block address of Main area with the zone size by reserving some segments
-in SSA area.
-
-Reference the following survey for additional technical details.
-https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey
-
-File System Metadata Structure
-------------------------------
-
-F2FS adopts the checkpointing scheme to maintain file system consistency. At
-mount time, F2FS first tries to find the last valid checkpoint data by scanning
-CP area. In order to reduce the scanning time, F2FS uses only two copies of CP.
-One of them always indicates the last valid data, which is called as shadow copy
-mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism.
-
-For file system consistency, each CP points to which NAT and SIT copies are
-valid, as shown as below.
-
- +--------+----------+---------+
- | CP | SIT | NAT |
- +--------+----------+---------+
- . . . .
- . . . .
- . . . .
- +-------+-------+--------+--------+--------+--------+
- | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 |
- +-------+-------+--------+--------+--------+--------+
- | ^ ^
- | | |
- `----------------------------------------'
-
-Index Structure
----------------
-
-The key data structure to manage the data locations is a "node". Similar to
-traditional file structures, F2FS has three types of node: inode, direct node,
-indirect node. F2FS assigns 4KB to an inode block which contains 923 data block
-indices, two direct node pointers, two indirect node pointers, and one double
-indirect node pointer as described below. One direct node block contains 1018
-data blocks, and one indirect node block contains also 1018 node blocks. Thus,
-one inode block (i.e., a file) covers:
-
- 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB.
-
- Inode block (4KB)
- |- data (923)
- |- direct node (2)
- | `- data (1018)
- |- indirect node (2)
- | `- direct node (1018)
- | `- data (1018)
- `- double indirect node (1)
- `- indirect node (1018)
- `- direct node (1018)
- `- data (1018)
-
-Note that, all the node blocks are mapped by NAT which means the location of
-each node is translated by the NAT table. In the consideration of the wandering
-tree problem, F2FS is able to cut off the propagation of node updates caused by
-leaf data writes.
-
-Directory Structure
--------------------
-
-A directory entry occupies 11 bytes, which consists of the following attributes.
-
-- hash hash value of the file name
-- ino inode number
-- len the length of file name
-- type file type such as directory, symlink, etc
-
-A dentry block consists of 214 dentry slots and file names. Therein a bitmap is
-used to represent whether each dentry is valid or not. A dentry block occupies
-4KB with the following composition.
-
- Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) +
- dentries(11 * 214 bytes) + file name (8 * 214 bytes)
-
- [Bucket]
- +--------------------------------+
- |dentry block 1 | dentry block 2 |
- +--------------------------------+
- . .
- . .
- . [Dentry Block Structure: 4KB] .
- +--------+----------+----------+------------+
- | bitmap | reserved | dentries | file names |
- +--------+----------+----------+------------+
- [Dentry Block: 4KB] . .
- . .
- . .
- +------+------+-----+------+
- | hash | ino | len | type |
- +------+------+-----+------+
- [Dentry Structure: 11 bytes]
-
-F2FS implements multi-level hash tables for directory structure. Each level has
-a hash table with dedicated number of hash buckets as shown below. Note that
-"A(2B)" means a bucket includes 2 data blocks.
-
-----------------------
-A : bucket
-B : block
-N : MAX_DIR_HASH_DEPTH
-----------------------
-
-level #0 | A(2B)
- |
-level #1 | A(2B) - A(2B)
- |
-level #2 | A(2B) - A(2B) - A(2B) - A(2B)
- . | . . . .
-level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
- . | . . . .
-level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
-
-The number of blocks and buckets are determined by,
-
- ,- 2, if n < MAX_DIR_HASH_DEPTH / 2,
- # of blocks in level #n = |
- `- 4, Otherwise
-
- ,- 2^(n + dir_level),
- | if n + dir_level < MAX_DIR_HASH_DEPTH / 2,
- # of buckets in level #n = |
- `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1),
- Otherwise
-
-When F2FS finds a file name in a directory, at first a hash value of the file
-name is calculated. Then, F2FS scans the hash table in level #0 to find the
-dentry consisting of the file name and its inode number. If not found, F2FS
-scans the next hash table in level #1. In this way, F2FS scans hash tables in
-each levels incrementally from 1 to N. In each levels F2FS needs to scan only
-one bucket determined by the following equation, which shows O(log(# of files))
-complexity.
-
- bucket number to scan in level #n = (hash value) % (# of buckets in level #n)
-
-In the case of file creation, F2FS finds empty consecutive slots that cover the
-file name. F2FS searches the empty slots in the hash tables of whole levels from
-1 to N in the same way as the lookup operation.
-
-The following figure shows an example of two cases holding children.
- --------------> Dir <--------------
- | |
- child child
-
- child - child [hole] - child
-
- child - child - child [hole] - [hole] - child
-
- Case 1: Case 2:
- Number of children = 6, Number of children = 3,
- File size = 7 File size = 7
-
-Default Block Allocation
-------------------------
-
-At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node
-and Hot/Warm/Cold data.
-
-- Hot node contains direct node blocks of directories.
-- Warm node contains direct node blocks except hot node blocks.
-- Cold node contains indirect node blocks
-- Hot data contains dentry blocks
-- Warm data contains data blocks except hot and cold data blocks
-- Cold data contains multimedia data or migrated data blocks
-
-LFS has two schemes for free space management: threaded log and copy-and-compac-
-tion. The copy-and-compaction scheme which is known as cleaning, is well-suited
-for devices showing very good sequential write performance, since free segments
-are served all the time for writing new data. However, it suffers from cleaning
-overhead under high utilization. Contrarily, the threaded log scheme suffers
-from random writes, but no cleaning process is needed. F2FS adopts a hybrid
-scheme where the copy-and-compaction scheme is adopted by default, but the
-policy is dynamically changed to the threaded log scheme according to the file
-system status.
-
-In order to align F2FS with underlying flash-based storage, F2FS allocates a
-segment in a unit of section. F2FS expects that the section size would be the
-same as the unit size of garbage collection in FTL. Furthermore, with respect
-to the mapping granularity in FTL, F2FS allocates each section of the active
-logs from different zones as much as possible, since FTL can write the data in
-the active logs into one allocation unit according to its mapping granularity.
-
-Cleaning process
-----------------
-
-F2FS does cleaning both on demand and in the background. On-demand cleaning is
-triggered when there are not enough free segments to serve VFS calls. Background
-cleaner is operated by a kernel thread, and triggers the cleaning job when the
-system is idle.
-
-F2FS supports two victim selection policies: greedy and cost-benefit algorithms.
-In the greedy algorithm, F2FS selects a victim segment having the smallest number
-of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment
-according to the segment age and the number of valid blocks in order to address
-log block thrashing problem in the greedy algorithm. F2FS adopts the greedy
-algorithm for on-demand cleaner, while background cleaner adopts cost-benefit
-algorithm.
-
-In order to identify whether the data in the victim segment are valid or not,
-F2FS manages a bitmap. Each bit represents the validity of a block, and the
-bitmap is composed of a bit stream covering whole blocks in main area.
-
-Write-hint Policy
------------------
-
-1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
-
-2) whint_mode=user-based. F2FS tries to pass down hints given by
-users.
-
-User F2FS Block
----- ---- -----
- META WRITE_LIFE_NOT_SET
- HOT_NODE "
- WARM_NODE "
- COLD_NODE "
-*ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
-*extension list " "
-
--- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
-
--- direct io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " WRITE_LIFE_NONE
-WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
-WRITE_LIFE_LONG " WRITE_LIFE_LONG
-
-3) whint_mode=fs-based. F2FS passes down hints with its policy.
-
-User F2FS Block
----- ---- -----
- META WRITE_LIFE_MEDIUM;
- HOT_NODE WRITE_LIFE_NOT_SET
- WARM_NODE "
- COLD_NODE WRITE_LIFE_NONE
-ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
-extension list " "
-
--- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
-
--- direct io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " WRITE_LIFE_NONE
-WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
-WRITE_LIFE_LONG " WRITE_LIFE_LONG
-
-Fallocate(2) Policy
--------------------
-
-The default policy follows the below posix rule.
-
-Allocating disk space
- The default operation (i.e., mode is zero) of fallocate() allocates
- the disk space within the range specified by offset and len. The
- file size (as reported by stat(2)) will be changed if offset+len is
- greater than the file size. Any subregion within the range specified
- by offset and len that did not contain data before the call will be
- initialized to zero. This default behavior closely resembles the
- behavior of the posix_fallocate(3) library function, and is intended
- as a method of optimally implementing that function.
-
-However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
-fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having
-zero or random data, which is useful to the below scenario where:
- 1. create(fd)
- 2. ioctl(fd, F2FS_IOC_SET_PIN_FILE)
- 3. fallocate(fd, 0, 0, size)
- 4. address = fibmap(fd, offset)
- 5. open(blkdev)
- 6. write(blkdev, address)
diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.rst
similarity index 67%
rename from Documentation/filesystems/fiemap.txt
rename to Documentation/filesystems/fiemap.rst
index f6d9c99..93fc96f 100644
--- a/Documentation/filesystems/fiemap.txt
+++ b/Documentation/filesystems/fiemap.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
============
Fiemap Ioctl
============
@@ -10,9 +12,9 @@
Request Basics
--------------
-A fiemap request is encoded within struct fiemap:
+A fiemap request is encoded within struct fiemap::
-struct fiemap {
+ struct fiemap {
__u64 fm_start; /* logical offset (inclusive) at
* which to start mapping (in) */
__u64 fm_length; /* logical length of mapping which
@@ -23,7 +25,7 @@
__u32 fm_extent_count; /* size of fm_extents array (in) */
__u32 fm_reserved;
struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
-};
+ };
fm_start, and fm_length specify the logical range within the file
@@ -51,12 +53,12 @@
The following flags can be set in fm_flags:
-* FIEMAP_FLAG_SYNC
-If this flag is set, the kernel will sync the file before mapping extents.
+FIEMAP_FLAG_SYNC
+ If this flag is set, the kernel will sync the file before mapping extents.
-* FIEMAP_FLAG_XATTR
-If this flag is set, the extents returned will describe the inodes
-extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_XATTR
+ If this flag is set, the extents returned will describe the inodes
+ extended attribute lookup tree, instead of its data tree.
Extent Mapping
@@ -75,18 +77,18 @@
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents.
+returned in fm_extents::
-struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
-};
+ struct fiemap_extent {
+ __u64 fe_logical; /* logical offset in bytes for the start of
+ * the extent */
+ __u64 fe_physical; /* physical offset in bytes for the start
+ * of the extent */
+ __u64 fe_length; /* length in bytes for the extent */
+ __u64 fe_reserved64[2];
+ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
+ __u32 fe_reserved[3];
+ };
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical
@@ -114,24 +116,27 @@
data. Note that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
-* FIEMAP_EXTENT_LAST
-This is the last extent in the file. A mapping attempt past this
-extent will return nothing.
+FIEMAP_EXTENT_LAST
+ This is generally the last extent in the file. A mapping attempt past
+ this extent may return nothing. Some implementations set this flag to
+ indicate this extent is the last one in the range queried by the user
+ (via fiemap->fm_length).
-* FIEMAP_EXTENT_UNKNOWN
-The location of this extent is currently unknown. This may indicate
-the data is stored on an inaccessible volume or that no storage has
-been allocated for the file yet.
+FIEMAP_EXTENT_UNKNOWN
+ The location of this extent is currently unknown. This may indicate
+ the data is stored on an inaccessible volume or that no storage has
+ been allocated for the file yet.
-* FIEMAP_EXTENT_DELALLOC
- - This will also set FIEMAP_EXTENT_UNKNOWN.
-Delayed allocation - while there is data for this extent, its
-physical location has not been allocated yet.
+FIEMAP_EXTENT_DELALLOC
+ This will also set FIEMAP_EXTENT_UNKNOWN.
-* FIEMAP_EXTENT_ENCODED
-This extent does not consist of plain filesystem blocks but is
-encoded (e.g. encrypted or compressed). Reading the data in this
-extent via I/O to the block device will have undefined results.
+ Delayed allocation - while there is data for this extent, its
+ physical location has not been allocated yet.
+
+FIEMAP_EXTENT_ENCODED
+ This extent does not consist of plain filesystem blocks but is
+ encoded (e.g. encrypted or compressed). Reading the data in this
+ extent via I/O to the block device will have undefined results.
Note that it is *always* undefined to try to update the data
in-place by writing to the indicated location without the
@@ -143,32 +148,32 @@
clear; user applications must not try reading or writing to the
filesystem via the block device under any other circumstances.
-* FIEMAP_EXTENT_DATA_ENCRYPTED
- - This will also set FIEMAP_EXTENT_ENCODED
-The data in this extent has been encrypted by the file system.
+FIEMAP_EXTENT_DATA_ENCRYPTED
+ This will also set FIEMAP_EXTENT_ENCODED
+ The data in this extent has been encrypted by the file system.
-* FIEMAP_EXTENT_NOT_ALIGNED
-Extent offsets and length are not guaranteed to be block aligned.
+FIEMAP_EXTENT_NOT_ALIGNED
+ Extent offsets and length are not guaranteed to be block aligned.
-* FIEMAP_EXTENT_DATA_INLINE
+FIEMAP_EXTENT_DATA_INLINE
This will also set FIEMAP_EXTENT_NOT_ALIGNED
-Data is located within a meta data block.
+ Data is located within a meta data block.
-* FIEMAP_EXTENT_DATA_TAIL
+FIEMAP_EXTENT_DATA_TAIL
This will also set FIEMAP_EXTENT_NOT_ALIGNED
-Data is packed into a block with data from other files.
+ Data is packed into a block with data from other files.
-* FIEMAP_EXTENT_UNWRITTEN
-Unwritten extent - the extent is allocated but its data has not been
-initialized. This indicates the extent's data will be all zero if read
-through the filesystem but the contents are undefined if read directly from
-the device.
+FIEMAP_EXTENT_UNWRITTEN
+ Unwritten extent - the extent is allocated but its data has not been
+ initialized. This indicates the extent's data will be all zero if read
+ through the filesystem but the contents are undefined if read directly from
+ the device.
-* FIEMAP_EXTENT_MERGED
-This will be set when a file does not support extents, i.e., it uses a block
-based addressing scheme. Since returning an extent for each block back to
-userspace would be highly inefficient, the kernel will try to merge most
-adjacent blocks into 'extents'.
+FIEMAP_EXTENT_MERGED
+ This will be set when a file does not support extents, i.e., it uses a block
+ based addressing scheme. Since returning an extent for each block back to
+ userspace would be highly inefficient, the kernel will try to merge most
+ adjacent blocks into 'extents'.
VFS -> File System Implementation
@@ -177,23 +182,23 @@
File systems wishing to support fiemap must implement a ->fiemap callback on
their inode_operations structure. The fs ->fiemap call is responsible for
defining its set of supported fiemap flags, and calling a helper function on
-each discovered extent:
+each discovered extent::
-struct inode_operations {
+ struct inode_operations {
...
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request:
+fiemap request::
-struct fiemap_extent_info {
+ struct fiemap_extent_info {
unsigned int fi_flags; /* Flags as passed from user */
unsigned int fi_extents_mapped; /* Number of mapped extents */
unsigned int fi_extents_max; /* Size of fiemap_extent array */
struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
-};
+ };
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
@@ -201,23 +206,25 @@
Flag checking should be done at the beginning of the ->fiemap callback via the
-fiemap_check_flags() helper:
+fiemap_prep() helper::
-int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+ int fiemap_prep(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 *len, u32 supported_flags);
The struct fieinfo should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via fs_flags. If
-fiemap_check_flags finds invalid user flags, it will place the bad values in
+fiemap_prep finds invalid user flags, it will place the bad values in
fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from
-fiemap_check_flags(), it should immediately exit, returning that error back to
-ioctl_fiemap().
+fiemap_prep(), it should immediately exit, returning that error back to
+ioctl_fiemap(). Additionally the range is validate against the supported
+maximum file size.
For each extent in the request range, the file system should call
-the helper function, fiemap_fill_next_extent():
+the helper function, fiemap_fill_next_extent()::
-int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
- u64 phys, u64 len, u32 flags, u32 dev);
+ int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+ u64 phys, u64 len, u32 flags, u32 dev);
fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.rst
similarity index 95%
rename from Documentation/filesystems/files.txt
rename to Documentation/filesystems/files.rst
index 46dfc6b..cbf8e57 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
File management in the Linux kernel
------------------------------------
+===================================
This document describes how locking for files (struct file)
and file descriptor table (struct files) works.
@@ -34,7 +37,7 @@
the fdtable structure -
1. All references to the fdtable must be done through
- the files_fdtable() macro :
+ the files_fdtable() macro::
struct fdtable *fdt;
@@ -61,7 +64,8 @@
4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
take care of barrier requirements due to lock-free lookup.
- An example :
+
+ An example::
struct file *file;
@@ -77,7 +81,7 @@
of the fd (fget()/fget_light()) are lock-free, it is possible
that look-up may race with the last put() operation on the
file structure. This is avoided using atomic_long_inc_not_zero()
- on ->f_count :
+ on ->f_count::
rcu_read_lock();
file = fcheck_files(files, fd);
@@ -106,7 +110,8 @@
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
- For example :
+
+ For example::
spin_lock(&files->file_lock);
fd = locate_fd(files, file, start);
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 8a0700a..936fae0 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -176,11 +176,11 @@
Each encrypted directory tree is protected by a *master key*. Master
keys can be up to 64 bytes long, and must be at least as long as the
-greater of the key length needed by the contents and filenames
-encryption modes being used. For example, if AES-256-XTS is used for
-contents encryption, the master key must be 64 bytes (512 bits). Note
-that the XTS mode is defined to require a key twice as long as that
-required by the underlying block cipher.
+greater of the security strength of the contents and filenames
+encryption modes being used. For example, if any AES-256 mode is
+used, the master key must be at least 256 bits, i.e. 32 bytes. A
+stricter requirement applies if the key is used by a v1 encryption
+policy and AES-256-XTS is used; such keys must be 64 bytes.
To "unlock" an encrypted directory tree, userspace must provide the
appropriate master key. There can be any number of master keys, each
@@ -234,8 +234,8 @@
entropy from the master key. HKDF is also standardized and widely
used by other software, whereas the AES-128-ECB based KDF is ad-hoc.
-Per-file keys
--------------
+Per-file encryption keys
+------------------------
Since each master key can protect many files, it is necessary to
"tweak" the encryption of each file so that the same plaintext in two
@@ -256,13 +256,8 @@
the master keys may be wrapped in userspace, e.g. as is done by the
`fscrypt <https://github.com/google/fscrypt>`_ tool.
-Including the inode number in the IVs was considered. However, it was
-rejected as it would have prevented ext4 filesystems from being
-resized, and by itself still wouldn't have been sufficient to prevent
-the same key from being directly reused for both XTS and CTS-CBC.
-
-DIRECT_KEY and per-mode keys
-----------------------------
+DIRECT_KEY policies
+-------------------
The Adiantum encryption mode (see `Encryption modes and usage`_) is
suitable for both contents and filenames encryption, and it accepts
@@ -273,9 +268,9 @@
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
this by setting FSCRYPT_POLICY_FLAG_DIRECT_KEY in the fscrypt policy,
-per-file keys are not used. Instead, whenever any data (contents or
-filenames) is encrypted, the file's 16-byte nonce is included in the
-IV. Moreover:
+per-file encryption keys are not used. Instead, whenever any data
+(contents or filenames) is encrypted, the file's 16-byte nonce is
+included in the IV. Moreover:
- For v1 encryption policies, the encryption is done directly with the
master key. Because of this, users **must not** use the same master
@@ -285,6 +280,35 @@
key derived using the KDF. Users may use the same master key for
other v2 encryption policies.
+IV_INO_LBLK_64 policies
+-----------------------
+
+When FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64 is set in the fscrypt policy,
+the encryption keys are derived from the master key, encryption mode
+number, and filesystem UUID. This normally results in all files
+protected by the same master key sharing a single contents encryption
+key and a single filenames encryption key. To still encrypt different
+files' data differently, inode numbers are included in the IVs.
+Consequently, shrinking the filesystem may not be allowed.
+
+This format is optimized for use with inline encryption hardware
+compliant with the UFS standard, which supports only 64 IV bits per
+I/O request and may have only a small number of keyslots.
+
+IV_INO_LBLK_32 policies
+-----------------------
+
+IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
+IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
+SipHash key is derived from the master key) and added to the file
+logical block number mod 2^32 to produce a 32-bit IV.
+
+This format is optimized for use with inline encryption hardware
+compliant with the eMMC v5.2 standard, which supports only 32 IV bits
+per I/O request and may have only a small number of keyslots. This
+format results in some level of IV reuse, so it should only be used
+when necessary due to hardware limitations.
+
Key identifiers
---------------
@@ -292,6 +316,16 @@
identifier" is also derived using the KDF. This value is stored in
the clear, since it is needed to reliably identify the key itself.
+Dirhash keys
+------------
+
+For directories that are indexed using a secret-keyed dirhash over the
+plaintext filenames, the KDF is also used to derive a 128-bit
+SipHash-2-4 key per directory in order to hash filenames. This works
+just like deriving a per-file encryption key, except that a different
+KDF context is used. Currently, only casefolded ("case-insensitive")
+encrypted directories use this style of hashing.
+
Encryption modes and usage
==========================
@@ -308,17 +342,18 @@
AES-128-CBC was added only for low-powered embedded devices with
crypto accelerators such as CAAM or CESA that do not support XTS. To
-use AES-128-CBC, CONFIG_CRYPTO_SHA256 (or another SHA-256
-implementation) must be enabled so that ESSIV can be used.
+use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or
+another SHA-256 implementation) must be enabled so that ESSIV can be
+used.
Adiantum is a (primarily) stream cipher-based mode that is fast even
on CPUs without dedicated crypto instructions. It's also a true
wide-block mode, unlike XTS. It can also eliminate the need to derive
-per-file keys. However, it depends on the security of two primitives,
-XChaCha12 and AES-256, rather than just one. See the paper
-"Adiantum: length-preserving encryption for entry-level processors"
-(https://eprint.iacr.org/2018/720.pdf) for more details. To use
-Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
+per-file encryption keys. However, it depends on the security of two
+primitives, XChaCha12 and AES-256, rather than just one. See the
+paper "Adiantum: length-preserving encryption for entry-level
+processors" (https://eprint.iacr.org/2018/720.pdf) for more details.
+To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
implementations of ChaCha and NHPoly1305 should be enabled, e.g.
CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
@@ -331,8 +366,8 @@
-------------------
For file contents, each filesystem block is encrypted independently.
-Currently, only the case where the filesystem block size is equal to
-the system's page size (usually 4096 bytes) is supported.
+Starting from Linux kernel 5.5, encryption of filesystems with block
+size less than system's page size is supported.
Each block's IV is set to the logical block number within the file as
a little endian number, except that:
@@ -341,10 +376,20 @@
is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
of the file's data encryption key.
-- In the "direct key" configuration (FSCRYPT_POLICY_FLAG_DIRECT_KEY
- set in the fscrypt_policy), the file's nonce is also appended to the
- IV. Currently this is only allowed with the Adiantum encryption
- mode.
+- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
+ Currently this is only allowed with the Adiantum encryption mode.
+
+- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
+ to 32 bits and is placed in bits 0-31 of the IV. The inode number
+ (which is also limited to 32 bits) is placed in bits 32-63.
+
+- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
+ to 32 bits and is placed in bits 0-31 of the IV. The inode number
+ is then hashed and added mod 2^32.
+
+Note that because file logical block numbers are included in the IVs,
+filesystems must enforce that blocks are never shifted around within
+encrypted files, e.g. via "collapse range" or "insert range".
Filenames encryption
--------------------
@@ -354,10 +399,10 @@
filenames of up to 255 bytes, the same IV is used for every filename
in a directory.
-However, each encrypted directory still uses a unique key; or
-alternatively (for the "direct key" configuration) has the file's
-nonce included in the IVs. Thus, IV reuse is limited to within a
-single directory.
+However, each encrypted directory still uses a unique key, or
+alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
+inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
+Thus, IV reuse is limited to within a single directory.
With CTS-CBC, the IV reuse means that when the plaintext filenames
share a common prefix at least as long as the cipher block size (16
@@ -391,9 +436,9 @@
The FS_IOC_SET_ENCRYPTION_POLICY ioctl sets an encryption policy on an
empty directory or verifies that a directory or regular file already
-has the specified encryption policy. It takes in a pointer to a
-:c:type:`struct fscrypt_policy_v1` or a :c:type:`struct
-fscrypt_policy_v2`, defined as follows::
+has the specified encryption policy. It takes in a pointer to
+struct fscrypt_policy_v1 or struct fscrypt_policy_v2, defined as
+follows::
#define FSCRYPT_POLICY_V1 0
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
@@ -419,11 +464,11 @@
This structure must be initialized as follows:
-- ``version`` must be FSCRYPT_POLICY_V1 (0) if the struct is
- :c:type:`fscrypt_policy_v1` or FSCRYPT_POLICY_V2 (2) if the struct
- is :c:type:`fscrypt_policy_v2`. (Note: we refer to the original
- policy version as "v1", though its version code is really 0.) For
- new encrypted directories, use v2 policies.
+- ``version`` must be FSCRYPT_POLICY_V1 (0) if
+ struct fscrypt_policy_v1 is used or FSCRYPT_POLICY_V2 (2) if
+ struct fscrypt_policy_v2 is used. (Note: we refer to the original
+ policy version as "v1", though its version code is really 0.)
+ For new encrypted directories, use v2 policies.
- ``contents_encryption_mode`` and ``filenames_encryption_mode`` must
be set to constants from ``<linux/fscrypt.h>`` which identify the
@@ -431,12 +476,22 @@
(1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS
(4) for ``filenames_encryption_mode``.
-- ``flags`` must contain a value from ``<linux/fscrypt.h>`` which
- identifies the amount of NUL-padding to use when encrypting
- filenames. If unsure, use FSCRYPT_POLICY_FLAGS_PAD_32 (0x3).
- Additionally, if the encryption modes are both
- FSCRYPT_MODE_ADIANTUM, this can contain
- FSCRYPT_POLICY_FLAG_DIRECT_KEY; see `DIRECT_KEY and per-mode keys`_.
+- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
+
+ - FSCRYPT_POLICY_FLAGS_PAD_*: The amount of NUL padding to use when
+ encrypting filenames. If unsure, use FSCRYPT_POLICY_FLAGS_PAD_32
+ (0x3).
+ - FSCRYPT_POLICY_FLAG_DIRECT_KEY: See `DIRECT_KEY policies`_.
+ - FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64: See `IV_INO_LBLK_64
+ policies`_.
+ - FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32: See `IV_INO_LBLK_32
+ policies`_.
+
+ v1 encryption policies only support the PAD_* and DIRECT_KEY flags.
+ The other flags are only supported by v2 encryption policies.
+
+ The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
+ mutually exclusive.
- For v2 encryption policies, ``__reserved`` must be zeroed.
@@ -453,9 +508,9 @@
replaced with ``master_key_identifier``, which is longer and cannot
be arbitrarily chosen. Instead, the key must first be added using
`FS_IOC_ADD_ENCRYPTION_KEY`_. Then, the ``key_spec.u.identifier``
- the kernel returned in the :c:type:`struct fscrypt_add_key_arg` must
- be used as the ``master_key_identifier`` in the :c:type:`struct
- fscrypt_policy_v2`.
+ the kernel returned in the struct fscrypt_add_key_arg must
+ be used as the ``master_key_identifier`` in
+ struct fscrypt_policy_v2.
If the file is not yet encrypted, then FS_IOC_SET_ENCRYPTION_POLICY
verifies that the file is an empty directory. If so, the specified
@@ -493,7 +548,9 @@
- ``EEXIST``: the file is already encrypted with an encryption policy
different from the one specified
- ``EINVAL``: an invalid encryption policy was specified (invalid
- version, mode(s), or flags; or reserved bits were set)
+ version, mode(s), or flags; or reserved bits were set); or a v1
+ encryption policy was specified but the directory has the casefold
+ flag enabled (casefolding is incompatible with v1 policies).
- ``ENOKEY``: a v2 encryption policy was specified, but the key with
the specified ``master_key_identifier`` has not been added, nor does
the process have the CAP_FOWNER capability in the initial user
@@ -533,7 +590,7 @@
The FS_IOC_GET_ENCRYPTION_POLICY_EX ioctl retrieves the encryption
policy, if any, for a directory or regular file. No additional
permissions are required beyond the ability to open the file. It
-takes in a pointer to a :c:type:`struct fscrypt_get_policy_ex_arg`,
+takes in a pointer to struct fscrypt_get_policy_ex_arg,
defined as follows::
struct fscrypt_get_policy_ex_arg {
@@ -580,9 +637,8 @@
encryption policy, if any, for a directory or regular file. However,
unlike `FS_IOC_GET_ENCRYPTION_POLICY_EX`_,
FS_IOC_GET_ENCRYPTION_POLICY only supports the original policy
-version. It takes in a pointer directly to a :c:type:`struct
-fscrypt_policy_v1` rather than a :c:type:`struct
-fscrypt_get_policy_ex_arg`.
+version. It takes in a pointer directly to struct fscrypt_policy_v1
+rather than struct fscrypt_get_policy_ex_arg.
The error codes for FS_IOC_GET_ENCRYPTION_POLICY are the same as those
for FS_IOC_GET_ENCRYPTION_POLICY_EX, except that
@@ -601,6 +657,17 @@
FS_IOC_GET_ENCRYPTION_PWSALT is deprecated. Instead, prefer to
generate and manage any needed salt(s) in userspace.
+Getting a file's encryption nonce
+---------------------------------
+
+Since Linux v5.7, the ioctl FS_IOC_GET_ENCRYPTION_NONCE is supported.
+On encrypted files and directories it gets the inode's 16-byte nonce.
+On unencrypted files and directories, it fails with ENODATA.
+
+This ioctl can be useful for automated tests which verify that the
+encryption is being done correctly. It is not needed for normal use
+of fscrypt.
+
Adding keys
-----------
@@ -612,13 +679,13 @@
encrypted using that key appear "unlocked", i.e. in plaintext form.
It can be executed on any file or directory on the target filesystem,
but using the filesystem's root directory is recommended. It takes in
-a pointer to a :c:type:`struct fscrypt_add_key_arg`, defined as
-follows::
+a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_add_key_arg {
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
- __u32 __reserved[9];
+ __u32 key_id;
+ __u32 __reserved[8];
__u8 raw[];
};
@@ -635,17 +702,22 @@
} u;
};
-:c:type:`struct fscrypt_add_key_arg` must be zeroed, then initialized
+ struct fscrypt_provisioning_key_payload {
+ __u32 type;
+ __u32 __reserved;
+ __u8 raw[];
+ };
+
+struct fscrypt_add_key_arg must be zeroed, then initialized
as follows:
- If the key is being added for use by v1 encryption policies, then
``key_spec.type`` must contain FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR, and
``key_spec.u.descriptor`` must contain the descriptor of the key
being added, corresponding to the value in the
- ``master_key_descriptor`` field of :c:type:`struct
- fscrypt_policy_v1`. To add this type of key, the calling process
- must have the CAP_SYS_ADMIN capability in the initial user
- namespace.
+ ``master_key_descriptor`` field of struct fscrypt_policy_v1.
+ To add this type of key, the calling process must have the
+ CAP_SYS_ADMIN capability in the initial user namespace.
Alternatively, if the key is being added for use by v2 encryption
policies, then ``key_spec.type`` must contain
@@ -657,9 +729,27 @@
``Documentation/security/keys/core.rst``).
- ``raw_size`` must be the size of the ``raw`` key provided, in bytes.
+ Alternatively, if ``key_id`` is nonzero, this field must be 0, since
+ in that case the size is implied by the specified Linux keyring key.
+
+- ``key_id`` is 0 if the raw key is given directly in the ``raw``
+ field. Otherwise ``key_id`` is the ID of a Linux keyring key of
+ type "fscrypt-provisioning" whose payload is
+ struct fscrypt_provisioning_key_payload whose ``raw`` field contains
+ the raw key and whose ``type`` field matches ``key_spec.type``.
+ Since ``raw`` is variable-length, the total size of this key's
+ payload must be ``sizeof(struct fscrypt_provisioning_key_payload)``
+ plus the raw key size. The process must have Search permission on
+ this key.
+
+ Most users should leave this 0 and specify the raw key directly.
+ The support for specifying a Linux keyring key is intended mainly to
+ allow re-adding keys after a filesystem is unmounted and re-mounted,
+ without having to store the raw keys in userspace memory.
- ``raw`` is a variable-length field which must contain the actual
- key, ``raw_size`` bytes long.
+ key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
+ nonzero, then this field is unused.
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -681,11 +771,16 @@
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace
+ user namespace; or the raw key was specified by Linux key ID but the
+ process lacks Search permission on the key.
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
+- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
+ key has the wrong type
+- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
+ exists with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
@@ -763,8 +858,8 @@
encryption key from the filesystem, and possibly removes the key
itself. It can be executed on any file or directory on the target
filesystem, but using the filesystem's root directory is recommended.
-It takes in a pointer to a :c:type:`struct fscrypt_remove_key_arg`,
-defined as follows::
+It takes in a pointer to struct fscrypt_remove_key_arg, defined
+as follows::
struct fscrypt_remove_key_arg {
struct fscrypt_key_specifier key_spec;
@@ -859,8 +954,8 @@
The FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl retrieves the status of a
master encryption key. It can be executed on any file or directory on
the target filesystem, but using the filesystem's root directory is
-recommended. It takes in a pointer to a :c:type:`struct
-fscrypt_get_key_status_arg`, defined as follows::
+recommended. It takes in a pointer to
+struct fscrypt_get_key_status_arg, defined as follows::
struct fscrypt_get_key_status_arg {
/* input */
@@ -955,9 +1050,9 @@
- Direct I/O is not supported on encrypted files. Attempts to use
direct I/O on such files will fall back to buffered I/O.
-- The fallocate operations FALLOC_FL_COLLAPSE_RANGE,
- FALLOC_FL_INSERT_RANGE, and FALLOC_FL_ZERO_RANGE are not supported
- on encrypted files and will fail with EOPNOTSUPP.
+- The fallocate operations FALLOC_FL_COLLAPSE_RANGE and
+ FALLOC_FL_INSERT_RANGE are not supported on encrypted files and will
+ fail with EOPNOTSUPP.
- Online defragmentation of encrypted files is not supported. The
EXT4_IOC_MOVE_EXT and F2FS_IOC_MOVE_RANGE ioctls will fail with
@@ -1051,17 +1146,17 @@
Encryption context
------------------
-An encryption policy is represented on-disk by a :c:type:`struct
-fscrypt_context_v1` or a :c:type:`struct fscrypt_context_v2`. It is
-up to individual filesystems to decide where to store it, but normally
-it would be stored in a hidden extended attribute. It should *not* be
+An encryption policy is represented on-disk by
+struct fscrypt_context_v1 or struct fscrypt_context_v2. It is up to
+individual filesystems to decide where to store it, but normally it
+would be stored in a hidden extended attribute. It should *not* be
exposed by the xattr-related system calls such as getxattr() and
setxattr() because of the special semantics of the encryption xattr.
(In particular, there would be much confusion if an encryption policy
were to be added to or removed from anything other than an empty
directory.) These structs are defined as follows::
- #define FS_KEY_DERIVATION_NONCE_SIZE 16
+ #define FSCRYPT_FILE_NONCE_SIZE 16
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
struct fscrypt_context_v1 {
@@ -1070,7 +1165,7 @@
u8 filenames_encryption_mode;
u8 flags;
u8 master_key_descriptor[FSCRYPT_KEY_DESCRIPTOR_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
#define FSCRYPT_KEY_IDENTIFIER_SIZE 16
@@ -1081,15 +1176,15 @@
u8 flags;
u8 __reserved[4];
u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
The context structs contain the same information as the corresponding
policy structs (see `Setting an encryption policy`_), except that the
context structs also contain a nonce. The nonce is randomly generated
by the kernel and is used as KDF input or as a tweak to cause
-different files to be encrypted differently; see `Per-file keys`_ and
-`DIRECT_KEY and per-mode keys`_.
+different files to be encrypted differently; see `Per-file encryption
+keys`_ and `DIRECT_KEY policies`_.
Data path changes
-----------------
@@ -1107,6 +1202,18 @@
buffers regardless of encryption. Other filesystems, such as ext4 and
F2FS, have to allocate bounce pages specially for encryption.
+Fscrypt is also able to use inline encryption hardware instead of the
+kernel crypto API for en/decryption of file contents. When possible,
+and if directed to do so (by specifying the 'inlinecrypt' mount option
+for an ext4/F2FS filesystem), it adds encryption contexts to bios and
+uses blk-crypto to perform the en/decryption instead of making use of
+the above read/write path changes. Of course, even if directed to
+make use of inline encryption, fscrypt will only be able to do so if
+either hardware inline encryption support is available for the
+selected encryption algorithm or CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK
+is selected. If neither is the case, fscrypt will fall back to using
+the above mentioned read/write path changes for en/decryption.
+
Filename hashing and encoding
-----------------------------
@@ -1140,8 +1247,8 @@
filesystem-specific hash(es) needed for directory lookups. This
allows the filesystem to still, with a high degree of confidence, map
the filename given in ->lookup() back to a particular directory entry
-that was previously listed by readdir(). See :c:type:`struct
-fscrypt_digested_name` in the source for more details.
+that was previously listed by readdir(). See
+struct fscrypt_nokey_name in the source for more details.
Note that the precise way that filenames are presented to userspace
without the key is subject to change in the future. It is only meant
@@ -1153,11 +1260,14 @@
To test fscrypt, use xfstests, which is Linux's de facto standard
filesystem test suite. First, run all the tests in the "encrypt"
-group on the relevant filesystem(s). For example, to test ext4 and
+group on the relevant filesystem(s). One can also run the tests
+with the 'inlinecrypt' mount option to test the implementation for
+inline encryption support. For example, to test ext4 and
f2fs encryption using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
kvm-xfstests -c ext4,f2fs -g encrypt
+ kvm-xfstests -c ext4,f2fs -g encrypt -m inlinecrypt
UBIFS encryption can also be tested this way, but it should be done in
a separate command, and it takes some time for kvm-xfstests to set up
@@ -1179,6 +1289,7 @@
kvm-xfstests, use the "encrypt" filesystem configuration::
kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
Because this runs many more tests than "-g encrypt" does, it takes
much longer to run; so also consider using `gce-xfstests
@@ -1186,3 +1297,4 @@
instead of kvm-xfstests::
gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 42a0b6d..895e971 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -84,7 +84,7 @@
--------------------
The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file. It takes
-in a pointer to a :c:type:`struct fsverity_enable_arg`, defined as
+in a pointer to a struct fsverity_enable_arg, defined as
follows::
struct fsverity_enable_arg {
@@ -226,6 +226,14 @@
The verity flag is not settable via FS_IOC_SETFLAGS. You must use
FS_IOC_ENABLE_VERITY instead, since parameters must be provided.
+statx
+-----
+
+Since Linux v5.5, the statx() system call sets STATX_ATTR_VERITY if
+the file has fs-verity enabled. This can perform better than
+FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
+opening the file, and opening verity files can be expensive.
+
Accessing verity files
======================
@@ -398,7 +406,7 @@
ext4
----
-ext4 supports fs-verity since Linux TODO and e2fsprogs v1.45.2.
+ext4 supports fs-verity since Linux v5.4 and e2fsprogs v1.45.2.
To create verity files on an ext4 filesystem, the filesystem must have
been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
@@ -434,7 +442,7 @@
f2fs
----
-f2fs supports fs-verity since Linux TODO and f2fs-tools v1.11.0.
+f2fs supports fs-verity since Linux v5.4 and f2fs-tools v1.11.0.
To create verity files on an f2fs filesystem, the filesystem must have
been formatted with ``-O verity``.
@@ -651,7 +659,7 @@
retrofit existing filesystems with new consistency mechanisms.
Data journalling is available on ext4, but is very slow.
- - Rebuilding the the Merkle tree after every write, which would be
+ - Rebuilding the Merkle tree after every write, which would be
extremely inefficient. Alternatively, a different authenticated
dictionary structure such as an "authenticated skiplist" could
be used. However, this would be far more complex.
diff --git a/Documentation/filesystems/fuse-io.txt b/Documentation/filesystems/fuse-io.rst
similarity index 95%
rename from Documentation/filesystems/fuse-io.txt
rename to Documentation/filesystems/fuse-io.rst
index 07b8f73..255a368 100644
--- a/Documentation/filesystems/fuse-io.txt
+++ b/Documentation/filesystems/fuse-io.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Fuse I/O Modes
+==============
+
Fuse supports the following I/O modes:
- direct-io
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.rst
similarity index 80%
rename from Documentation/filesystems/fuse.txt
rename to Documentation/filesystems/fuse.rst
index 13af4a4..8120c3c 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.rst
@@ -1,41 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+FUSE
+====
+
Definitions
-~~~~~~~~~~~
+===========
Userspace filesystem:
-
A filesystem in which data and metadata are provided by an ordinary
userspace process. The filesystem can be accessed normally through
the kernel interface.
Filesystem daemon:
-
The process(es) providing the data and metadata of the filesystem.
Non-privileged mount (or user mount):
-
A userspace filesystem mounted by a non-privileged (non-root) user.
The filesystem daemon is running with the privileges of the mounting
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Filesystem connection:
-
A connection between the filesystem daemon and the kernel. The
connection exists until either the daemon dies, or the filesystem is
umounted. Note that detaching (or lazy umounting) the filesystem
- does _not_ break the connection, in this case it will exist until
+ does *not* break the connection, in this case it will exist until
the last reference to the filesystem is released.
Mount owner:
-
The user who does the mounting.
User:
-
The user who is performing filesystem operations.
What is FUSE?
-~~~~~~~~~~~~~
+=============
FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
@@ -46,50 +46,41 @@
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
-The userspace library and utilities are available from the FUSE
-homepage:
-
- http://fuse.sourceforge.net/
+The userspace library and utilities are available from the
+`FUSE homepage: <https://github.com/libfuse/>`_
Filesystem type
-~~~~~~~~~~~~~~~
+===============
The filesystem type given to mount(2) can be one of the following:
-'fuse'
+ fuse
+ This is the usual way to mount a FUSE filesystem. The first
+ argument of the mount system call may contain an arbitrary string,
+ which is not interpreted by the kernel.
- This is the usual way to mount a FUSE filesystem. The first
- argument of the mount system call may contain an arbitrary string,
- which is not interpreted by the kernel.
-
-'fuseblk'
-
- The filesystem is block device based. The first argument of the
- mount system call is interpreted as the name of the device.
+ fuseblk
+ The filesystem is block device based. The first argument of the
+ mount system call is interpreted as the name of the device.
Mount options
-~~~~~~~~~~~~~
+=============
-'fd=N'
-
+fd=N
The file descriptor to use for communication between the userspace
filesystem and the kernel. The file descriptor must have been
obtained by opening the FUSE device ('/dev/fuse').
-'rootmode=M'
-
+rootmode=M
The file mode of the filesystem's root in octal representation.
-'user_id=N'
-
+user_id=N
The numeric user id of the mount owner.
-'group_id=N'
-
+group_id=N
The numeric group id of the mount owner.
-'default_permissions'
-
+default_permissions
By default FUSE doesn't check file access permissions, the
filesystem is free to implement its access policy or leave it to
the underlying file access mechanism (e.g. in case of network
@@ -97,28 +88,25 @@
access based on file mode. It is usually useful together with the
'allow_other' mount option.
-'allow_other'
-
+allow_other
This option overrides the security measure restricting file access
to the user mounting the filesystem. This option is by default only
allowed to root, but this restriction can be removed with a
(userspace) configuration option.
-'max_read=N'
-
+max_read=N
With this option the maximum size of read operations can be set.
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
-'blksize=N'
-
+blksize=N
Set the block size for the filesystem. The default is 512. This
option is only valid for 'fuseblk' type mounts.
Control filesystem
-~~~~~~~~~~~~~~~~~~
+==================
-There's a control filesystem for FUSE, which can be mounted by:
+There's a control filesystem for FUSE, which can be mounted by::
mount -t fusectl none /sys/fs/fuse/connections
@@ -130,53 +118,51 @@
For each connection the following files exist within this directory:
- 'waiting'
+ waiting
+ The number of requests which are waiting to be transferred to
+ userspace or being processed by the filesystem daemon. If there is
+ no filesystem activity and 'waiting' is non-zero, then the
+ filesystem is hung or deadlocked.
- The number of requests which are waiting to be transferred to
- userspace or being processed by the filesystem daemon. If there is
- no filesystem activity and 'waiting' is non-zero, then the
- filesystem is hung or deadlocked.
-
- 'abort'
-
- Writing anything into this file will abort the filesystem
- connection. This means that all waiting requests will be aborted an
- error returned for all aborted and new requests.
+ abort
+ Writing anything into this file will abort the filesystem
+ connection. This means that all waiting requests will be aborted an
+ error returned for all aborted and new requests.
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+##################################
If a process issuing a FUSE filesystem request is interrupted, the
following will happen:
- 1) If the request is not yet sent to userspace AND the signal is
+ - If the request is not yet sent to userspace AND the signal is
fatal (SIGKILL or unhandled fatal signal), then the request is
dequeued and returns immediately.
- 2) If the request is not yet sent to userspace AND the signal is not
- fatal, then an 'interrupted' flag is set for the request. When
+ - If the request is not yet sent to userspace AND the signal is not
+ fatal, then an interrupted flag is set for the request. When
the request has been successfully transferred to userspace and
this flag is set, an INTERRUPT request is queued.
- 3) If the request is already sent to userspace, then an INTERRUPT
+ - If the request is already sent to userspace, then an INTERRUPT
request is queued.
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.
The userspace filesystem may ignore the INTERRUPT requests entirely,
-or may honor them by sending a reply to the _original_ request, with
+or may honor them by sending a reply to the *original* request, with
the error set to EINTR.
It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:
- 1) The INTERRUPT request is processed before the original request is
+ 1. The INTERRUPT request is processed before the original request is
processed
- 2) The INTERRUPT request is processed after the original request has
+ 2. The INTERRUPT request is processed after the original request has
been answered
If the filesystem cannot find the original request, it should wait for
@@ -186,7 +172,7 @@
reply will be ignored.
Aborting a filesystem connection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+================================
It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:
@@ -216,7 +202,7 @@
powerful method, always works.
How do non-privileged mounts work?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+==================================
Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.
@@ -235,15 +221,13 @@
other users' or the super user's processes
How are requirements fulfilled?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+===============================
A) The mount owner could gain elevated privileges by either:
- 1) creating a filesystem containing a device file, then opening
- this device
+ 1. creating a filesystem containing a device file, then opening this device
- 2) creating a filesystem containing a suid or sgid application,
- then executing this application
+ 2. creating a filesystem containing a suid or sgid application, then executing this application
The solution is not to allow opening device files and ignore
setuid and setgid bits when executing programs. To ensure this
@@ -275,16 +259,16 @@
of other users' processes.
i) It can slow down or indefinitely delay the execution of a
- filesystem operation creating a DoS against the user or the
- whole system. For example a suid application locking a
- system file, and then accessing a file on the mount owner's
- filesystem could be stopped, and thus causing the system
- file to be locked forever.
+ filesystem operation creating a DoS against the user or the
+ whole system. For example a suid application locking a
+ system file, and then accessing a file on the mount owner's
+ filesystem could be stopped, and thus causing the system
+ file to be locked forever.
ii) It can present files or directories of unlimited length, or
- directory structures of unlimited depth, possibly causing a
- system process to eat up diskspace, memory or other
- resources, again causing DoS.
+ directory structures of unlimited depth, possibly causing a
+ system process to eat up diskspace, memory or other
+ resources, again causing *DoS*.
The solution to this as well as B) is not to allow processes
to access the filesystem, which could otherwise not be
@@ -294,28 +278,27 @@
ptrace can be used to check if a process is allowed to access
the filesystem or not.
- Note that the ptrace check is not strictly necessary to
+ Note that the *ptrace* check is not strictly necessary to
prevent B/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
- filesystem, since SIGSTOP can be used to get a similar effect.
+ filesystem, since *SIGSTOP* can be used to get a similar effect.
I think these limitations are unacceptable?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+===========================================
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
-mounts, it can relax the last limitation with a "user_allow_other"
+mounts, it can relax the last limitation with a 'user_allow_other'
config option. If this config option is set, the mounting user can
-add the "allow_other" mount option which disables the check for other
+add the 'allow_other' mount option which disables the check for other
users' processes.
Kernel - userspace interface
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+============================
The following diagram shows how a filesystem operation (in this
-example unlink) is performed in FUSE.
+example unlink) is performed in FUSE. ::
-NOTE: everything in this description is greatly simplified
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
@@ -357,12 +340,13 @@
| <fuse_unlink() |
| <sys_unlink() |
+.. note:: Everything in the description above is greatly simplified
+
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
-Scenario 1 - Simple deadlock
------------------------------
+**Scenario 1 - Simple deadlock**::
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
@@ -379,12 +363,12 @@
The solution for this is to allow the filesystem to be aborted.
-Scenario 2 - Tricky deadlock
-----------------------------
+**Scenario 2 - Tricky deadlock**
+
This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
-but is caused by a pagefault.
+but is caused by a pagefault. ::
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
| |
@@ -410,7 +394,7 @@
| | [lock page]
| | * DEADLOCK *
-Solution is basically the same as above.
+The solution is basically the same as above.
An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.rst
similarity index 63%
rename from Documentation/filesystems/gfs2-glocks.txt
rename to Documentation/filesystems/gfs2-glocks.rst
index 7059623..d14f230 100644
--- a/Documentation/filesystems/gfs2-glocks.txt
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -1,5 +1,8 @@
- Glock internal locking rules
- ------------------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Glock internal locking rules
+============================
This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
@@ -24,24 +27,28 @@
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:
-Glock mode | DLM lock mode
-------------------------------
- UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
- SH | PR (Protected read)
- DF | CW (Concurrent write)
- EX | EX (Exclusive)
+========== ====== =====================================================
+Glock mode DLM lock mode
+========== ====== =====================================================
+ UN IV/NL Unlocked (no DLM lock associated with glock) or NL
+ SH PR (Protected read)
+ DF CW (Concurrent write)
+ EX EX (Exclusive)
+========== ====== =====================================================
Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
---------------------------------------------------------------------------
- UN | No | No | No | No
- SH | Yes | Yes | No | No
- DF | No | Yes | No | No
- EX | Yes | Yes | Yes | Yes
+========== ========== ============== ========== ==============
+Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
+========== ========== ============== ========== ==============
+ UN No No No No
+ SH Yes Yes No No
+ DF No Yes No No
+ EX Yes Yes Yes Yes
+========== ========== ============== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -49,21 +56,23 @@
Table of glock operations and per type constants:
-Field | Purpose
-----------------------------------------------------------------------------
-go_xmote_th | Called before remote state change (e.g. to sync dirty data)
-go_xmote_bh | Called after remote state change (e.g. to refill cache)
-go_inval | Called if remote state change requires invalidating the cache
-go_demote_ok | Returns boolean value of whether its ok to demote a glock
- | (e.g. checks timeout, and that there is no cached data)
-go_lock | Called for the first local holder of a lock
-go_unlock | Called on the final local unlock of a lock
-go_dump | Called to print content of object for debugfs file, or on
- | error to dump glock to the log.
-go_type | The type of the glock, LM_TYPE_.....
-go_callback | Called if the DLM sends a callback to drop this lock
-go_flags | GLOF_ASPACE is set, if the glock has an address space
- | associated with it
+============= =============================================================
+Field Purpose
+============= =============================================================
+go_xmote_th Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh Called after remote state change (e.g. to refill cache)
+go_inval Called if remote state change requires invalidating the cache
+go_demote_ok Returns boolean value of whether its ok to demote a glock
+ (e.g. checks timeout, and that there is no cached data)
+go_lock Called for the first local holder of a lock
+go_unlock Called on the final local unlock of a lock
+go_dump Called to print content of object for debugfs file, or on
+ error to dump glock to the log.
+go_type The type of the glock, ``LM_TYPE_*``
+go_callback Called if the DLM sends a callback to drop this lock
+go_flags GLOF_ASPACE is set, if the glock has an address space
+ associated with it
+============= =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
@@ -82,21 +91,25 @@
Locking rules for glock operations:
-Operation | GLF_LOCK bit lock held | gl_lockref.lock spinlock held
--------------------------------------------------------------------------
-go_xmote_th | Yes | No
-go_xmote_bh | Yes | No
-go_inval | Yes | No
-go_demote_ok | Sometimes | Yes
-go_lock | Yes | No
-go_unlock | Yes | No
-go_dump | Sometimes | Yes
-go_callback | Sometimes (N/A) | Yes
+============= ====================== =============================
+Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
+============= ====================== =============================
+go_xmote_th Yes No
+go_xmote_bh Yes No
+go_inval Yes No
+go_demote_ok Sometimes Yes
+go_lock Yes No
+go_unlock Yes No
+go_dump Sometimes Yes
+go_callback Sometimes (N/A) Yes
+============= ====================== =============================
-N.B. Operations must not drop either the bit lock or the spinlock
-if its held on entry. go_dump and do_demote_ok must never block.
-Note that go_dump will only be called if the glock's state
-indicates that it is caching uptodate data.
+.. Note::
+
+ Operations must not drop either the bit lock or the spinlock
+ if its held on entry. go_dump and do_demote_ok must never block.
+ Note that go_dump will only be called if the glock's state
+ indicates that it is caching uptodate data.
Glock locking order within GFS2:
@@ -104,7 +117,7 @@
2. Rename glock (for rename only)
3. Inode glock(s)
(Parents before children, inodes at "same level" with same parent in
- lock number order)
+ lock number order)
4. Rgrp glock(s) (for (de)allocation operations)
5. Transaction glock (via gfs2_trans_begin) for non-read operations
6. i_rw_mutex (if required)
@@ -117,8 +130,8 @@
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.
- Glock Statistics
- ------------------
+Glock Statistics
+----------------
The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
@@ -173,8 +186,8 @@
1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
-allocation (to base it on lock wait time, rather than blindly
-using a "try lock")
+ allocation (to base it on lock wait time, rather than blindly
+ using a "try lock")
Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
@@ -195,10 +208,13 @@
measuring system, but I hope this is as accurate as we
can reasonably make it.
-Per sb stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/sbstats
-Per glock stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/glstats
+Per sb stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/sbstats
+
+Per glock stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/glstats
Assuming that debugfs is mounted on /sys/kernel/debug and also
that <fsname> is replaced with the name of the gfs2 filesystem
@@ -206,14 +222,16 @@
The abbreviations used in the output as are follows:
-srtt - Smoothed round trip time for non-blocking dlm requests
-srttvar - Variance estimate for srtt
-srttb - Smoothed round trip time for (potentially) blocking dlm requests
-srttvarb - Variance estimate for srttb
-sirt - Smoothed inter-request time (for dlm requests)
-sirtvar - Variance estimate for sirt
-dlm - Number of dlm requests made (dcnt in glstats file)
-queue - Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
+srtt Smoothed round trip time for non blocking dlm requests
+srttvar Variance estimate for srtt
+srttb Smoothed round trip time for (potentially) blocking dlm requests
+srttvarb Variance estimate for srttb
+sirt Smoothed inter request time (for dlm requests)
+sirtvar Variance estimate for sirt
+dlm Number of dlm requests made (dcnt in glstats file)
+queue Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
@@ -224,9 +242,12 @@
for the glock in question, along with some addition information on each dlm
reply that is received:
-status - The status of the dlm request
-flags - The dlm request flags
-tdiff - The time taken by this specific request
+====== =======================================
+status The status of the dlm request
+flags The dlm request flags
+tdiff The time taken by this specific request
+====== =======================================
+
(remaining fields as per above list)
diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.rst
similarity index 92%
rename from Documentation/filesystems/gfs2-uevents.txt
rename to Documentation/filesystems/gfs2-uevents.rst
index 19a19eb..f162a2c 100644
--- a/Documentation/filesystems/gfs2-uevents.txt
+++ b/Documentation/filesystems/gfs2-uevents.rst
@@ -1,14 +1,18 @@
- uevents and GFS2
- ==================
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+uevents and GFS2
+================
During the lifetime of a GFS2 mount, a number of uevents are generated.
This document explains what the events are and what they are used
for (by gfs_controld in gfs2-utils).
A list of GFS2 uevents
------------------------
+======================
1. ADD
+------
The ADD event occurs at mount time. It will always be the first
uevent generated by the newly created filesystem. If the mount
@@ -21,6 +25,7 @@
of the filesystem respectively.
2. ONLINE
+---------
The ONLINE uevent is generated after a successful mount or remount. It
has the same environment variables as the ADD uevent. The ONLINE
@@ -29,6 +34,7 @@
be generated by older kernels.
3. CHANGE
+---------
The CHANGE uevent is used in two places. One is when reporting the
successful mount of the filesystem by the first node (FIRSTMOUNT=Done).
@@ -52,6 +58,7 @@
uevent for a successful mount or remount.
4. OFFLINE
+----------
The OFFLINE uevent is only generated due to filesystem errors and is used
as part of the "withdraw" mechanism. Currently this doesn't give any
@@ -59,6 +66,7 @@
be fixed.
5. REMOVE
+---------
The REMOVE uevent is generated at the end of an unsuccessful mount
or at the end of a umount of the filesystem. All REMOVE uevents will
@@ -68,9 +76,10 @@
Information common to all GFS2 uevents (uevent environment variables)
-----------------------------------------------------------------------
+=====================================================================
1. LOCKTABLE=
+--------------
The LOCKTABLE is a string, as supplied on the mount command
line (locktable=) or via fstab. It is used as a filesystem label
@@ -78,6 +87,7 @@
able to join the cluster.
2. LOCKPROTO=
+-------------
The LOCKPROTO is a string, and its value depends on what is set
on the mount command line, or via fstab. It will be either
@@ -85,12 +95,14 @@
may be supported.
3. JOURNALID=
+-------------
If a journal is in use by the filesystem (journals are not
assigned for spectator mounts) then this will give the
numeric journal id in all GFS2 uevents.
4. UUID=
+--------
With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID
into the filesystem superblock. If it exists, this will
diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.rst
similarity index 76%
rename from Documentation/filesystems/gfs2.txt
rename to Documentation/filesystems/gfs2.rst
index cc4f230..8d1ab58 100644
--- a/Documentation/filesystems/gfs2.txt
+++ b/Documentation/filesystems/gfs2.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
Global File System
-------------------
+==================
https://fedorahosted.org/cluster/wiki/HomePage
@@ -14,16 +17,18 @@
GFS uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are:
- lock_nolock -- allows gfs to be used as a local file system
+ lock_nolock
+ - allows gfs to be used as a local file system
- lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking
- The dlm is found at linux/fs/dlm/
+ lock_dlm
+ - uses a distributed lock manager (dlm) for inter-node locking.
+ The dlm is found at linux/fs/dlm/
Lock_dlm depends on user space cluster management systems found
at the URL above.
To use gfs as a local file system, no external clustering systems are
-needed, simply:
+needed, simply::
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
@@ -37,9 +42,12 @@
is pretty close.
The following man pages can be found at the URL above:
+
+ ============ =============================================
fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem
- gfs2_convert to convert a gfs filesystem to gfs2 in-place
+ gfs2_convert to convert a gfs filesystem to gfs2 in-place
mkfs.gfs2 to make a filesystem
+ ============ =============================================
diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.rst
similarity index 78%
rename from Documentation/filesystems/hfs.txt
rename to Documentation/filesystems/hfs.rst
index d096df6..776015c 100644
--- a/Documentation/filesystems/hfs.txt
+++ b/Documentation/filesystems/hfs.rst
@@ -1,11 +1,16 @@
-Note: This filesystem doesn't have a maintainer.
+.. SPDX-License-Identifier: GPL-2.0
+==================================
Macintosh HFS Filesystem for Linux
==================================
-HFS stands for ``Hierarchical File System'' and is the filesystem used
+
+.. Note:: This filesystem doesn't have a maintainer.
+
+
+HFS stands for ``Hierarchical File System`` and is the filesystem used
by the Mac Plus and all later Macintosh models. Earlier Macintosh
-models used MFS (``Macintosh File System''), which is not supported,
+models used MFS (``Macintosh File System``), which is not supported,
MacOS 8.1 and newer support a filesystem called HFS+ that's similar to
HFS but is extended in various areas. Use the hfsplus filesystem driver
to access such filesystems from Linux.
@@ -49,29 +54,29 @@
HFS is not a UNIX filesystem, thus it does not have the usual features you'd
expect:
- o You can't modify the set-uid, set-gid, sticky or executable bits or the uid
+ * You can't modify the set-uid, set-gid, sticky or executable bits or the uid
and gid of files.
- o You can't create hard- or symlinks, device files, sockets or FIFOs.
+ * You can't create hard- or symlinks, device files, sockets or FIFOs.
HFS does on the other have the concepts of multiple forks per file. These
non-standard forks are represented as hidden additional files in the normal
filesystems namespace which is kind of a cludge and makes the semantics for
the a little strange:
- o You can't create, delete or rename resource forks of files or the
+ * You can't create, delete or rename resource forks of files or the
Finder's metadata.
- o They are however created (with default values), deleted and renamed
+ * They are however created (with default values), deleted and renamed
along with the corresponding data fork or directory.
- o Copying files to a different filesystem will loose those attributes
+ * Copying files to a different filesystem will loose those attributes
that are essential for MacOS to work.
Creating HFS filesystems
-===================================
+========================
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystem. See
-<http://www.mars.org/home/rob/proj/hfs/> for details.
+<https://www.mars.org/home/rob/proj/hfs/> for details.
Credits
diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.rst
similarity index 95%
rename from Documentation/filesystems/hfsplus.txt
rename to Documentation/filesystems/hfsplus.rst
index 59f7569..f02f4f5 100644
--- a/Documentation/filesystems/hfsplus.txt
+++ b/Documentation/filesystems/hfsplus.rst
@@ -1,4 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+======================================
Macintosh HFSPlus Filesystem for Linux
======================================
diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.rst
similarity index 66%
rename from Documentation/filesystems/hpfs.txt
rename to Documentation/filesystems/hpfs.rst
index 74630bd..7e0dd2f 100644
--- a/Documentation/filesystems/hpfs.txt
+++ b/Documentation/filesystems/hpfs.rst
@@ -1,13 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
Read/Write HPFS 2.09
+====================
+
1998-2004, Mikulas Patocka
-email: mikulas@artax.karlin.mff.cuni.cz
-homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+:email: mikulas@artax.karlin.mff.cuni.cz
+:homepage: https://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
-CREDITS:
+Credits
+=======
Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file
is taken from it
+
Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993)
+
Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion
Mount options
@@ -50,6 +58,7 @@
File names
+==========
As in OS/2, filenames are case insensitive. However, shell thinks that names
are case sensitive, so for example when you create a file FOO, you can use
@@ -64,6 +73,7 @@
Extended attributes
+===================
On HPFS partitions, OS/2 can associate to each file a special information called
extended attributes. Extended attributes are pairs of (key,value) where key is
@@ -88,6 +98,7 @@
Symlinks
+========
You can do symlinks on HPFS partition, symlinks are achieved by setting extended
attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and
@@ -101,6 +112,7 @@
Codepages
+=========
HPFS can contain several uppercasing tables for several codepages and each
file has a pointer to codepage its name is in. However OS/2 was created in
@@ -128,6 +140,7 @@
Known bugs
+==========
HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client
should work. If you have OS/2 server, use only read-only mode. I don't know how
@@ -152,7 +165,8 @@
to delete other files that are leaf (probability that the file is non-leaf is
about 1/50) or to truncate file first to make some space.
You encounter this problem only if you have many directories so that
-preallocated directory band is full i.e.
+preallocated directory band is full i.e.::
+
number_of_directories / size_of_filesystem_in_mb > 4.
You can't delete open directories.
@@ -174,6 +188,7 @@
What does "unbalanced tree" message mean?
+=========================================
Old versions of this driver created sometimes unbalanced dnode trees. OS/2
chkdsk doesn't scream if the tree is unbalanced (and sometimes creates
@@ -187,6 +202,7 @@
Bugs in OS/2
+============
When you have two (or more) lost directories pointing each to other, chkdsk
locks up when repairing filesystem.
@@ -199,98 +215,139 @@
marks them as short (and writes "minor fs error corrected"). This bug is not in
HPFS386.
-Codepage bugs described above.
+Codepage bugs described above
+=============================
If you don't install fixpacks, there are many, many more...
History
+=======
-0.90 First public release
-0.91 Fixed bug that caused shooting to memory when write_inode was called on
- open inode (rarely happened)
-0.92 Fixed a little memory leak in freeing directory inodes
-0.93 Fixed bug that locked up the machine when there were too many filenames
- with first 15 characters same
- Fixed write_file to zero file when writing behind file end
-0.94 Fixed a little memory leak when trying to delete busy file or directory
-0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
-1.90 First version for 2.1.1xx kernels
-1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
- Fixed a race-condition when write_inode is called while deleting file
- Fixed a bug that could possibly happen (with very low probability) when
- using 0xff in filenames
- Rewritten locking to avoid race-conditions
- Mount option 'eas' now works
- Fsync no longer returns error
- Files beginning with '.' are marked hidden
- Remount support added
- Alloc is not so slow when filesystem becomes full
- Atimes are no more updated because it slows down operation
- Code cleanup (removed all commented debug prints)
-1.92 Corrected a bug when sync was called just before closing file
-1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
- works with previous versions
- Fixed a possible problem with disks > 64G (but I don't have one, so I can't
- test it)
- Fixed a file overflow at 2G
- Added new option 'timeshift'
- Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
- read-only mode
- Fixed a bug that slowed down alloc and prevented allocating 100% space
- (this bug was not destructive)
-1.94 Added workaround for one bug in Linux
- Fixed one buffer leak
- Fixed some incompatibilities with large extended attributes (but it's still
- not 100% ok, I have no info on it and OS/2 doesn't want to create them)
- Rewritten allocation
- Fixed a bug with i_blocks (du sometimes didn't display correct values)
- Directories have no longer archive attribute set (some programs don't like
- it)
- Fixed a bug that it set badly one flag in large anode tree (it was not
- destructive)
-1.95 Fixed one buffer leak, that could happen on corrupted filesystem
- Fixed one bug in allocation in 1.94
-1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
- error sometimes when opening directories in PMSHELL)
- Fixed a possible bitmap race
- Fixed possible problem on large disks
- You can now delete open files
- Fixed a nondestructive race in rename
-1.97 Support for HPFS v3 (on large partitions)
- Fixed a bug that it didn't allow creation of files > 128M (it should be 2G)
+====== =========================================================================
+0.90 First public release
+0.91 Fixed bug that caused shooting to memory when write_inode was called on
+ open inode (rarely happened)
+0.92 Fixed a little memory leak in freeing directory inodes
+0.93 Fixed bug that locked up the machine when there were too many filenames
+ with first 15 characters same
+ Fixed write_file to zero file when writing behind file end
+0.94 Fixed a little memory leak when trying to delete busy file or directory
+0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files
+1.90 First version for 2.1.1xx kernels
+1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk
+ Fixed a race-condition when write_inode is called while deleting file
+ Fixed a bug that could possibly happen (with very low probability) when
+ using 0xff in filenames.
+
+ Rewritten locking to avoid race-conditions
+
+ Mount option 'eas' now works
+
+ Fsync no longer returns error
+
+ Files beginning with '.' are marked hidden
+
+ Remount support added
+
+ Alloc is not so slow when filesystem becomes full
+
+ Atimes are no more updated because it slows down operation
+
+ Code cleanup (removed all commented debug prints)
+1.92 Corrected a bug when sync was called just before closing file
+1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it
+ works with previous versions
+
+ Fixed a possible problem with disks > 64G (but I don't have one, so I can't
+ test it)
+
+ Fixed a file overflow at 2G
+
+ Added new option 'timeshift'
+
+ Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in
+ read-only mode
+
+ Fixed a bug that slowed down alloc and prevented allocating 100% space
+ (this bug was not destructive)
+1.94 Added workaround for one bug in Linux
+
+ Fixed one buffer leak
+
+ Fixed some incompatibilities with large extended attributes (but it's still
+ not 100% ok, I have no info on it and OS/2 doesn't want to create them)
+
+ Rewritten allocation
+
+ Fixed a bug with i_blocks (du sometimes didn't display correct values)
+
+ Directories have no longer archive attribute set (some programs don't like
+ it)
+
+ Fixed a bug that it set badly one flag in large anode tree (it was not
+ destructive)
+1.95 Fixed one buffer leak, that could happen on corrupted filesystem
+
+ Fixed one bug in allocation in 1.94
+1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported
+ error sometimes when opening directories in PMSHELL)
+
+ Fixed a possible bitmap race
+
+ Fixed possible problem on large disks
+
+ You can now delete open files
+
+ Fixed a nondestructive race in rename
+1.97 Support for HPFS v3 (on large partitions)
+
+ ZFixed a bug that it didn't allow creation of files > 128M
+ (it should be 2G)
1.97.1 Changed names of global symbols
+
Fixed a bug when chmoding or chowning root directory
-1.98 Fixed a deadlock when using old_readdir
- Better directory handling; workaround for "unbalanced tree" bug in OS/2
-1.99 Corrected a possible problem when there's not enough space while deleting
- file
- Now it tries to truncate the file if there's not enough space when deleting
- Removed a lot of redundant code
-2.00 Fixed a bug in rename (it was there since 1.96)
- Better anti-fragmentation strategy
-2.01 Fixed problem with directory listing over NFS
- Directory lseek now checks for proper parameters
- Fixed race-condition in buffer code - it is in all filesystems in Linux;
- when reading device (cat /dev/hda) while creating files on it, files
- could be damaged
-2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
- end of partition
-2.03 Char, block devices and pipes are correctly created
- Fixed non-crashing race in unlink (Alexander Viro)
- Now it works with Japanese version of OS/2
-2.04 Fixed error when ftruncate used to extend file
-2.05 Fixed crash when got mount parameters without =
- Fixed crash when allocation of anode failed due to full disk
- Fixed some crashes when block io or inode allocation failed
-2.06 Fixed some crash on corrupted disk structures
- Better allocation strategy
- Reschedule points added so that it doesn't lock CPU long time
- It should work in read-only mode on Warp Server
-2.07 More fixes for Warp Server. Now it really works
-2.08 Creating new files is not so slow on large disks
- An attempt to sync deleted file does not generate filesystem error
-2.09 Fixed error on extremely fragmented files
+1.98 Fixed a deadlock when using old_readdir
+ Better directory handling; workaround for "unbalanced tree" bug in OS/2
+1.99 Corrected a possible problem when there's not enough space while deleting
+ file
+ Now it tries to truncate the file if there's not enough space when
+ deleting
- vim: set textwidth=80:
+ Removed a lot of redundant code
+2.00 Fixed a bug in rename (it was there since 1.96)
+ Better anti-fragmentation strategy
+2.01 Fixed problem with directory listing over NFS
+
+ Directory lseek now checks for proper parameters
+
+ Fixed race-condition in buffer code - it is in all filesystems in Linux;
+ when reading device (cat /dev/hda) while creating files on it, files
+ could be damaged
+2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond
+ end of partition
+2.03 Char, block devices and pipes are correctly created
+
+ Fixed non-crashing race in unlink (Alexander Viro)
+
+ Now it works with Japanese version of OS/2
+2.04 Fixed error when ftruncate used to extend file
+2.05 Fixed crash when got mount parameters without =
+
+ Fixed crash when allocation of anode failed due to full disk
+
+ Fixed some crashes when block io or inode allocation failed
+2.06 Fixed some crash on corrupted disk structures
+
+ Better allocation strategy
+
+ Reschedule points added so that it doesn't lock CPU long time
+
+ It should work in read-only mode on Warp Server
+2.07 More fixes for Warp Server. Now it really works
+2.08 Creating new files is not so slow on large disks
+
+ An attempt to sync deleted file does not generate filesystem error
+2.09 Fixed error on extremely fragmented files
+====== =========================================================================
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 2c3a9f7..98f59a8 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -1,3 +1,5 @@
+.. _filesystems_index:
+
===============================
Filesystems in the Linux kernel
===============================
@@ -22,6 +24,20 @@
splice
locking
directory-locking
+ devpts
+ dnotify
+ fiemap
+ files
+ locks
+ mandatory-locking
+ mount_api
+ quota
+ seq_file
+ sharedsubtree
+
+ automount-support
+
+ caching/index
porting
@@ -46,4 +62,61 @@
.. toctree::
:maxdepth: 2
+ 9p
+ adfs
+ affs
+ afs
+ autofs
+ autofs-mount-control
+ befs
+ bfs
+ btrfs
+ cifs/cifsroot
+ ceph
+ coda
+ configfs
+ cramfs
+ debugfs
+ dlmfs
+ ecryptfs
+ efivarfs
+ erofs
+ ext2
+ ext3
+ f2fs
+ gfs2
+ gfs2-uevents
+ gfs2-glocks
+ hfs
+ hfsplus
+ hpfs
+ fuse
+ fuse-io
+ inotify
+ isofs
+ nilfs2
+ nfs/index
+ ntfs
+ ocfs2
+ ocfs2-online-filecheck
+ omfs
+ orangefs
+ overlayfs
+ proc
+ qnx6
+ ramfs-rootfs-initramfs
+ relay
+ romfs
+ spufs/index
+ squashfs
+ sysfs
+ sysv-fs
+ tmpfs
+ ubifs
+ ubifs-authentication.rst
+ udf
virtiofs
+ vfat
+ xfs-delayed-logging-design
+ xfs-self-describing-metadata
+ zonefs
diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.rst
similarity index 82%
rename from Documentation/filesystems/inotify.txt
rename to Documentation/filesystems/inotify.rst
index 51f61db..7f7ef8a 100644
--- a/Documentation/filesystems/inotify.txt
+++ b/Documentation/filesystems/inotify.rst
@@ -1,27 +1,36 @@
- inotify
- a powerful yet simple file change notification system
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================================
+Inotify - A Powerful yet Simple File Change Notification System
+===============================================================
Document started 15 Mar 2005 by Robert Love <rml@novell.com>
+
Document updated 4 Jan 2015 by Zhang Zhen <zhenzhang.zhang@huawei.com>
- --Deleted obsoleted interface, just refer to manpages for user interface.
+
+ - Deleted obsoleted interface, just refer to manpages for user interface.
(i) Rationale
-Q: What is the design decision behind not tying the watch to the open fd of
+Q:
+ What is the design decision behind not tying the watch to the open fd of
the watched object?
-A: Watches are associated with an open inotify device, not an open file.
+A:
+ Watches are associated with an open inotify device, not an open file.
This solves the primary problem with dnotify: keeping the file open pins
the file and thus, worse, pins the mount. Dnotify is therefore infeasible
for use on a desktop system with removable media as the media cannot be
unmounted. Watching a file should not require that it be open.
-Q: What is the design decision behind using an-fd-per-instance as opposed to
+Q:
+ What is the design decision behind using an-fd-per-instance as opposed to
an fd-per-watch?
-A: An fd-per-watch quickly consumes more file descriptors than are allowed,
+A:
+ An fd-per-watch quickly consumes more file descriptors than are allowed,
more fd's than are feasible to manage, and more fd's than are optimally
select()-able. Yes, root can bump the per-process fd limit and yes, users
can use epoll, but requiring both is a silly and extraneous requirement.
@@ -29,8 +38,8 @@
spaces is thus sensible. The current design is what user-space developers
want: Users initialize inotify, once, and add n watches, requiring but one
fd and no twiddling with fd limits. Initializing an inotify instance two
- thousand times is silly. If we can implement user-space's preferences
- cleanly--and we can, the idr layer makes stuff like this trivial--then we
+ thousand times is silly. If we can implement user-space's preferences
+ cleanly--and we can, the idr layer makes stuff like this trivial--then we
should.
There are other good arguments. With a single fd, there is a single
@@ -65,9 +74,11 @@
need not be a one-fd-per-process mapping; it is one-fd-per-queue and a
process can easily want more than one queue.
-Q: Why the system call approach?
+Q:
+ Why the system call approach?
-A: The poor user-space interface is the second biggest problem with dnotify.
+A:
+ The poor user-space interface is the second biggest problem with dnotify.
Signals are a terrible, terrible interface for file notification. Or for
anything, for that matter. The ideal solution, from all perspectives, is a
file descriptor-based one that allows basic file I/O and poll/select.
diff --git a/Documentation/filesystems/isofs.rst b/Documentation/filesystems/isofs.rst
new file mode 100644
index 0000000..08fd469
--- /dev/null
+++ b/Documentation/filesystems/isofs.rst
@@ -0,0 +1,64 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+ISO9660 Filesystem
+==================
+
+Mount options that are the same as for msdos and vfat partitions.
+
+ ========= ========================================================
+ gid=nnn All files in the partition will be in group nnn.
+ uid=nnn All files in the partition will be owned by user id nnn.
+ umask=nnn The permission mask (see umask(1)) for the partition.
+ ========= ========================================================
+
+Mount options that are the same as vfat partitions. These are only useful
+when using discs encoded using Microsoft's Joliet extensions.
+
+ ============== =============================================================
+ iocharset=name Character set to use for converting from Unicode to
+ ASCII. Joliet filenames are stored in Unicode format, but
+ Unix for the most part doesn't know how to deal with Unicode.
+ There is also an option of doing UTF-8 translations with the
+ utf8 option.
+ utf8 Encode Unicode names in UTF-8 format. Default is no.
+ ============== =============================================================
+
+Mount options unique to the isofs filesystem.
+
+ ================= ============================================================
+ block=512 Set the block size for the disk to 512 bytes
+ block=1024 Set the block size for the disk to 1024 bytes
+ block=2048 Set the block size for the disk to 2048 bytes
+ check=relaxed Matches filenames with different cases
+ check=strict Matches only filenames with the exact same case
+ cruft Try to handle badly formatted CDs.
+ map=off Do not map non-Rock Ridge filenames to lower case
+ map=normal Map non-Rock Ridge filenames to lower case
+ map=acorn As map=normal but also apply Acorn extensions if present
+ mode=xxx Sets the permissions on files to xxx unless Rock Ridge
+ extensions set the permissions otherwise
+ dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
+ extensions set the permissions otherwise
+ overriderockperm Set permissions on files and directories according to
+ 'mode' and 'dmode' even though Rock Ridge extensions are
+ present.
+ nojoliet Ignore Joliet extensions if they are present.
+ norock Ignore Rock Ridge extensions if they are present.
+ hide Completely strip hidden files from the file system.
+ showassoc Show files marked with the 'associated' bit
+ unhide Deprecated; showing hidden files is now default;
+ If given, it is a synonym for 'showassoc' which will
+ recreate previous unhide behavior
+ session=x Select number of session on multisession CD
+ sbsector=xxx Session begins from sector xxx
+ ================= ============================================================
+
+Recommended documents about ISO 9660 standard are located at:
+
+- http://www.y-adagio.com/
+- ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
+
+Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
+identical with ISO 9660.", so it is a valid and gratis substitute of the
+official ISO specification.
diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt
deleted file mode 100644
index ba0a933..0000000
--- a/Documentation/filesystems/isofs.txt
+++ /dev/null
@@ -1,48 +0,0 @@
-Mount options that are the same as for msdos and vfat partitions.
-
- gid=nnn All files in the partition will be in group nnn.
- uid=nnn All files in the partition will be owned by user id nnn.
- umask=nnn The permission mask (see umask(1)) for the partition.
-
-Mount options that are the same as vfat partitions. These are only useful
-when using discs encoded using Microsoft's Joliet extensions.
- iocharset=name Character set to use for converting from Unicode to
- ASCII. Joliet filenames are stored in Unicode format, but
- Unix for the most part doesn't know how to deal with Unicode.
- There is also an option of doing UTF-8 translations with the
- utf8 option.
- utf8 Encode Unicode names in UTF-8 format. Default is no.
-
-Mount options unique to the isofs filesystem.
- block=512 Set the block size for the disk to 512 bytes
- block=1024 Set the block size for the disk to 1024 bytes
- block=2048 Set the block size for the disk to 2048 bytes
- check=relaxed Matches filenames with different cases
- check=strict Matches only filenames with the exact same case
- cruft Try to handle badly formatted CDs.
- map=off Do not map non-Rock Ridge filenames to lower case
- map=normal Map non-Rock Ridge filenames to lower case
- map=acorn As map=normal but also apply Acorn extensions if present
- mode=xxx Sets the permissions on files to xxx unless Rock Ridge
- extensions set the permissions otherwise
- dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge
- extensions set the permissions otherwise
- overriderockperm Set permissions on files and directories according to
- 'mode' and 'dmode' even though Rock Ridge extensions are
- present.
- nojoliet Ignore Joliet extensions if they are present.
- norock Ignore Rock Ridge extensions if they are present.
- hide Completely strip hidden files from the file system.
- showassoc Show files marked with the 'associated' bit
- unhide Deprecated; showing hidden files is now default;
- If given, it is a synonym for 'showassoc' which will
- recreate previous unhide behavior
- session=x Select number of session on multisession CD
- sbsector=xxx Session begins from sector xxx
-
-Recommended documents about ISO 9660 standard are located at:
-http://www.y-adagio.com/
-ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf
-Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically
-identical with ISO 9660.", so it is a valid and gratis substitute of the
-official ISO specification.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b3..e18f90f 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -10,27 +10,27 @@
The journalling layer is easy to use. You need to first of all create a
journal_t data structure. There are two calls to do this dependent on
how you decide to allocate the physical media on which the journal
-resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
-filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
+resides. The jbd2_journal_init_inode() call is for journals stored in
+filesystem inodes, or the jbd2_journal_init_dev() call can be used
for journal stored on a raw device (in a continuous range of blocks). A
journal_t is a typedef for a struct pointer, so when you are finally
-finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
+finished make sure you call jbd2_journal_destroy() on it to free up
any used kernel memory.
Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
was already allocated and initialized properly by the userspace tools.
-When loading the journal you must call :c:func:`jbd2_journal_load` to process
+When loading the journal you must call jbd2_journal_load() to process
journal contents. If the client file system detects the journal contents
does not need to be processed (or even need not have valid contents), it
-may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
-calling :c:func:`jbd2_journal_load`.
+may call jbd2_journal_wipe() to clear the journal contents before
+calling jbd2_journal_load().
Note that jbd2_journal_wipe(..,0) calls
-:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
-transactions in the journal and similarly :c:func:`jbd2_journal_load` will
-call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
-:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
+jbd2_journal_skip_recovery() for you if it detects any outstanding
+transactions in the journal and similarly jbd2_journal_load() will
+call jbd2_journal_recover() if necessary. I would advise reading
+ext4_load_journal() in fs/ext4/super.c for examples on this stage.
Now you can go ahead and start modifying the underlying filesystem.
Almost.
@@ -39,57 +39,57 @@
by wrapping them into transactions. Additionally you also need to wrap
the modification of each of the buffers with calls to the journal layer,
so it knows what the modifications you are actually making are. To do
-this use :c:func:`jbd2_journal_start` which returns a transaction handle.
+this use jbd2_journal_start() which returns a transaction handle.
-:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
+jbd2_journal_start() and its counterpart jbd2_journal_stop(),
which indicates the end of a transaction are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
-:c:func:`jbd2_journal_stop` the same number of times as
-:c:func:`jbd2_journal_start` before the transaction is completed (or more
+jbd2_journal_stop() the same number of times as
+jbd2_journal_start() before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
-need to call :c:func:`jbd2_journal_get_create_access()` /
-:c:func:`jbd2_journal_get_write_access()` /
-:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
+need to call jbd2_journal_get_create_access() /
+jbd2_journal_get_write_access() /
+jbd2_journal_get_undo_access() as appropriate, this allows the
journalling layer to copy the unmodified
data if it needs to. After all the buffer may be part of a previously
uncommitted transaction. At this point you are at last ready to modify a
buffer, and once you are have done so you need to call
-:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+jbd2_journal_dirty_metadata(). Or if you've asked for access to a
buffer you now know is now longer required to be pushed back on the
-device you can call :c:func:`jbd2_journal_forget` in much the same way as you
-might have used :c:func:`bforget` in the past.
+device you can call jbd2_journal_forget() in much the same way as you
+might have used bforget() in the past.
-A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.
-Then at umount time , in your :c:func:`put_super` you can then call
-:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+Then at umount time , in your put_super() you can then call
+jbd2_journal_destroy() to clean up your in-core journal object.
Unfortunately there a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time, remember nothing commits
-until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+until the outermost jbd2_journal_stop(). This means you must complete
the transaction at the end of each file/inode/address etc. operation you
perform, so that the journalling system isn't re-entered on another
journal. Since transactions can't be nested/batched across differing
journals, and another filesystem other than yours (say ext4) may be
modified in a later syscall.
-The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param) - when it blocks it merely(!) needs to wait
for transactions to complete and be committed from other tasks, so
-essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
-deadlocks you must treat :c:func:`jbd2_journal_start` /
-:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+essentially we are waiting for jbd2_journal_stop(). So to avoid
+deadlocks you must treat jbd2_journal_start() /
+jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules to prevent
-deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
-behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
-easily as on :c:func:`jbd2_journal_start`.
+deadlocks. Note that jbd2_journal_extend() has similar blocking
+behaviour to jbd2_journal_start() so you can deadlock here just as
+easily as on jbd2_journal_start().
Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
@@ -116,8 +116,8 @@
that need processing when the transaction commits.
JBD2 also provides a way to block all transaction updates via
-:c:func:`jbd2_journal_lock_updates()` /
-:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+jbd2_journal_lock_updates() /
+jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.
::
@@ -132,6 +132,37 @@
if you allow unprivileged userspace to trigger codepaths containing
these calls.
+Fast commits
+~~~~~~~~~~~~
+
+JBD2 to also allows you to perform file-system specific delta commits known as
+fast commits. In order to use fast commits, you will need to set following
+callbacks that perform correspodning work:
+
+`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
+fast commit.
+
+`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
+blocks.
+
+File system is free to perform fast commits as and when it wants as long as it
+gets permission from JBD2 to do so by calling the function
+:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
+file system should tell JBD2 about it by calling
+:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+commit immediately after stopping the fast commit it can do so by calling
+:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
+fails for some reason and the only way to guarantee consistency is for JBD2 to
+perform the full traditional commit.
+
+JBD2 helper functions to manage fast commit buffers. File system can use
+:c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate
+and wait on IO completion of fast commit buffers.
+
+Currently, only Ext4 implements fast commits. For details of its implementation
+of fast commits, please refer to the top level comments in
+fs/ext4/fast_commit.c.
+
Summary
~~~~~~~
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index fc3a070..c0f2c75 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -105,7 +105,7 @@
listxattr: no
fiemap: no
update_time: no
-atomic_open: exclusive
+atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
============ =============================================
@@ -239,6 +239,7 @@
int (*readpage)(struct file *, struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
+ void (*readahead)(struct readahead_control *);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +272,8 @@
readpage: yes, unlocks
writepages:
set_page_dirty no
-readpages:
+readahead: yes, unlocks
+readpages: no
write_begin: locks the page exclusive
write_end: yes, unlocks exclusive
bmap:
@@ -295,6 +297,8 @@
->readpage() unlocks the page, either synchronously or via I/O
completion.
+->readahead() unlocks the pages that I/O is attempted on like ->readpage().
+
->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.
@@ -425,17 +429,19 @@
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
int (*lm_change)(struct file_lock **, int);
+ bool (*lm_breaker_owns_lease)(struct file_lock *);
locking rules:
-========== ============= ================= =========
+====================== ============= ================= =========
ops inode->i_lock blocked_lock_lock may block
-========== ============= ================= =========
+====================== ============= ================= =========
lm_notify: yes yes no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no
-========== ============= ================= =========
+lm_breaker_owns_lease: no no no
+====================== ============= ================= =========
buffer_head
===========
@@ -461,7 +467,6 @@
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t, void **,
unsigned long *);
- int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
@@ -477,16 +482,12 @@
ioctl: no
compat_ioctl: no
direct_access: no
-media_changed: no
unlock_native_capacity: no
revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
======================= ===================
-media_changed, unlock_native_capacity and revalidate_disk are called only from
-check_disk_change().
-
swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.
@@ -610,9 +611,9 @@
locking rules:
-============= ======== ===========================
-ops mmap_sem PageLocked(page)
-============= ======== ===========================
+============= ========= ===========================
+ops mmap_lock PageLocked(page)
+============= ========= ===========================
open: yes
close: yes
fault: yes can return with page locked
@@ -620,7 +621,7 @@
page_mkwrite: yes can return with page locked
pfn_mkwrite: yes
access: yes
-============= ======== ===========================
+============= ========= ===========================
->fault() is called when a previously not present pte is about
to be faulted in. The filesystem must find and return the page associated
diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.rst
similarity index 90%
rename from Documentation/filesystems/locks.txt
rename to Documentation/filesystems/locks.rst
index 5368690..c5ae858 100644
--- a/Documentation/filesystems/locks.txt
+++ b/Documentation/filesystems/locks.rst
@@ -1,4 +1,8 @@
- File Locking Release Notes
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+File Locking Release Notes
+==========================
Andy Walker <andy@lysaker.kvaerner.no>
@@ -6,7 +10,7 @@
1. What's New?
---------------
+==============
1.1 Broken Flock Emulation
--------------------------
@@ -25,7 +29,7 @@
---------------------------
1.2.1 Typical Problems - Sendmail
----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Because sendmail was unable to use the old flock() emulation, many sendmail
installations use fcntl() instead of flock(). This is true of Slackware 3.0
for example. This gave rise to some other subtle problems if sendmail was
@@ -37,7 +41,7 @@
1.2.2 The Solution
-------------------
+^^^^^^^^^^^^^^^^^^
The solution I have chosen, after much experimentation and discussion,
is to make flock() and fcntl() locks oblivious to each other. Both can
exists, and neither will have any effect on the other.
@@ -54,7 +58,7 @@
---------------------------------------
Mandatory locking, as described in
-'Documentation/filesystems/mandatory-locking.txt' was prior to this release a
+'Documentation/filesystems/mandatory-locking.rst' was prior to this release a
general configuration option that was valid for all mounted filesystems. This
had a number of inherent dangers, not the least of which was the ability to
freeze an NFS server by asking it to read a file for which a mandatory lock
diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.rst
similarity index 90%
rename from Documentation/filesystems/mandatory-locking.txt
rename to Documentation/filesystems/mandatory-locking.rst
index a251ca3..9ce7354 100644
--- a/Documentation/filesystems/mandatory-locking.txt
+++ b/Documentation/filesystems/mandatory-locking.rst
@@ -1,8 +1,13 @@
- Mandatory File Locking For The Linux Operating System
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Mandatory File Locking For The Linux Operating System
+=====================================================
Andy Walker <andy@lysaker.kvaerner.no>
15 April 1996
+
(Updated September 2007)
0. Why you should avoid mandatory locking
@@ -53,15 +58,17 @@
as candidates for mandatory locking, and using the existing fcntl()/lockf()
interface for applying locks just as if they were normal, advisory locks.
-Note 1: In saying "file" in the paragraphs above I am actually not telling
-the whole truth. System V locking is based on fcntl(). The granularity of
-fcntl() is such that it allows the locking of byte ranges in files, in addition
-to entire files, so the mandatory locking rules also have byte level
-granularity.
+.. Note::
-Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite
-borrowing the fcntl() locking scheme from System V. The mandatory locking
-scheme is defined by the System V Interface Definition (SVID) Version 3.
+ 1. In saying "file" in the paragraphs above I am actually not telling
+ the whole truth. System V locking is based on fcntl(). The granularity of
+ fcntl() is such that it allows the locking of byte ranges in files, in
+ addition to entire files, so the mandatory locking rules also have byte
+ level granularity.
+
+ 2. POSIX.1 does not specify any scheme for mandatory locking, despite
+ borrowing the fcntl() locking scheme from System V. The mandatory locking
+ scheme is defined by the System V Interface Definition (SVID) Version 3.
2. Marking a file for mandatory locking
---------------------------------------
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.rst
similarity index 78%
rename from Documentation/filesystems/mount_api.txt
rename to Documentation/filesystems/mount_api.rst
index 00ff0cf..d7f53d6 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.rst
@@ -1,8 +1,10 @@
- ====================
- FILESYSTEM MOUNT API
- ====================
+.. SPDX-License-Identifier: GPL-2.0
-CONTENTS
+====================
+Filesystem Mount API
+====================
+
+.. CONTENTS
(1) Overview.
@@ -21,8 +23,7 @@
(8) Parameter helper functions.
-========
-OVERVIEW
+Overview
========
The creation of new mounts is now to be done in a multistep process:
@@ -43,7 +44,7 @@
(7) Destroy the context.
-To support this, the file_system_type struct gains two new fields:
+To support this, the file_system_type struct gains two new fields::
int (*init_fs_context)(struct fs_context *fc);
const struct fs_parameter_description *parameters;
@@ -57,12 +58,11 @@
that the namespaces may be adjusted first.
-======================
-THE FILESYSTEM CONTEXT
+The Filesystem context
======================
The creation and reconfiguration of a superblock is governed by a filesystem
-context. This is represented by the fs_context structure:
+context. This is represented by the fs_context structure::
struct fs_context {
const struct fs_context_operations *ops;
@@ -86,78 +86,106 @@
The fs_context fields are as follows:
- (*) const struct fs_context_operations *ops
+ * ::
+
+ const struct fs_context_operations *ops
These are operations that can be done on a filesystem context (see
below). This must be set by the ->init_fs_context() file_system_type
operation.
- (*) struct file_system_type *fs_type
+ * ::
+
+ struct file_system_type *fs_type
A pointer to the file_system_type of the filesystem that is being
constructed or reconfigured. This retains a reference on the type owner.
- (*) void *fs_private
+ * ::
+
+ void *fs_private
A pointer to the file system's private data. This is where the filesystem
will need to store any options it parses.
- (*) struct dentry *root
+ * ::
+
+ struct dentry *root
A pointer to the root of the mountable tree (and indirectly, the
superblock thereof). This is filled in by the ->get_tree() op. If this
is set, an active reference on root->d_sb must also be held.
- (*) struct user_namespace *user_ns
- (*) struct net *net_ns
+ * ::
+
+ struct user_namespace *user_ns
+ struct net *net_ns
There are a subset of the namespaces in use by the invoking process. They
retain references on each namespace. The subscribed namespaces may be
replaced by the filesystem to reflect other sources, such as the parent
mount superblock on an automount.
- (*) const struct cred *cred
+ * ::
+
+ const struct cred *cred
The mounter's credentials. This retains a reference on the credentials.
- (*) char *source
+ * ::
+
+ char *source
This specifies the source. It may be a block device (e.g. /dev/sda1) or
something more exotic, such as the "host:/path" that NFS desires.
- (*) char *subtype
+ * ::
+
+ char *subtype
This is a string to be added to the type displayed in /proc/mounts to
qualify it (used by FUSE). This is available for the filesystem to set if
desired.
- (*) void *security
+ * ::
+
+ void *security
A place for the LSMs to hang their security data for the superblock. The
relevant security operations are described below.
- (*) void *s_fs_info
+ * ::
+
+ void *s_fs_info
The proposed s_fs_info for a new superblock, set in the superblock by
sget_fc(). This can be used to distinguish superblocks.
- (*) unsigned int sb_flags
- (*) unsigned int sb_flags_mask
+ * ::
+
+ unsigned int sb_flags
+ unsigned int sb_flags_mask
Which bits SB_* flags are to be set/cleared in super_block::s_flags.
- (*) unsigned int s_iflags
+ * ::
+
+ unsigned int s_iflags
These will be bitwise-OR'd with s->s_iflags when a superblock is created.
- (*) enum fs_context_purpose
+ * ::
+
+ enum fs_context_purpose
This indicates the purpose for which the context is intended. The
available values are:
- FS_CONTEXT_FOR_MOUNT, -- New superblock for explicit mount
- FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
- FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+ ========================== ======================================
+ FS_CONTEXT_FOR_MOUNT, New superblock for explicit mount
+ FS_CONTEXT_FOR_SUBMOUNT New automatic submount of extant mount
+ FS_CONTEXT_FOR_RECONFIGURE Change an existing mount
+ ========================== ======================================
The mount context is created by calling vfs_new_fs_context() or
vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
@@ -176,17 +204,16 @@
module.
-=================================
-THE FILESYSTEM CONTEXT OPERATIONS
+The Filesystem Context Operations
=================================
-The filesystem context points to a table of operations:
+The filesystem context points to a table of operations::
struct fs_context_operations {
void (*free)(struct fs_context *fc);
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ struct fs_parameter *param);
int (*parse_monolithic)(struct fs_context *fc, void *data);
int (*get_tree)(struct fs_context *fc);
int (*reconfigure)(struct fs_context *fc);
@@ -195,24 +222,32 @@
These operations are invoked by the various stages of the mount procedure to
manage the filesystem context. They are as follows:
- (*) void (*free)(struct fs_context *fc);
+ * ::
+
+ void (*free)(struct fs_context *fc);
Called to clean up the filesystem-specific part of the filesystem context
when the context is destroyed. It should be aware that parts of the
context may have been removed and NULL'd out by ->get_tree().
- (*) int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
+ * ::
+
+ int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
Called when a filesystem context has been duplicated to duplicate the
filesystem-private data. An error may be returned to indicate failure to
do this.
- [!] Note that even if this fails, put_fs_context() will be called
+ .. Warning::
+
+ Note that even if this fails, put_fs_context() will be called
immediately thereafter, so ->dup() *must* make the
filesystem-private data safe for ->free().
- (*) int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ * ::
+
+ int (*parse_param)(struct fs_context *fc,
+ struct fs_parameter *param);
Called when a parameter is being added to the filesystem context. param
points to the key name and maybe a value object. VFS-specific options
@@ -224,7 +259,9 @@
If successful, 0 should be returned or a negative error code otherwise.
- (*) int (*parse_monolithic)(struct fs_context *fc, void *data);
+ * ::
+
+ int (*parse_monolithic)(struct fs_context *fc, void *data);
Called when the mount(2) system call is invoked to pass the entire data
page in one go. If this is expected to be just a list of "key[=val]"
@@ -236,7 +273,9 @@
finds it's the standard key-val list then it may pass it off to
generic_parse_monolithic().
- (*) int (*get_tree)(struct fs_context *fc);
+ * ::
+
+ int (*get_tree)(struct fs_context *fc);
Called to get or create the mountable root and superblock, using the
information stored in the filesystem context (reconfiguration goes via a
@@ -249,7 +288,9 @@
The phase on a userspace-driven context will be set to only allow this to
be called once on any particular context.
- (*) int (*reconfigure)(struct fs_context *fc);
+ * ::
+
+ int (*reconfigure)(struct fs_context *fc);
Called to effect reconfiguration of a superblock using information stored
in the filesystem context. It may detach any resources it desires from
@@ -259,19 +300,20 @@
On success it should return 0. In the case of an error, it should return
a negative error code.
- [NOTE] reconfigure is intended as a replacement for remount_fs.
+ .. Note:: reconfigure is intended as a replacement for remount_fs.
-===========================
-FILESYSTEM CONTEXT SECURITY
+Filesystem context Security
===========================
The filesystem context contains a security pointer that the LSMs can use for
building up a security context for the superblock to be mounted. There are a
number of operations used by the new mount code for this purpose:
- (*) int security_fs_context_alloc(struct fs_context *fc,
- struct dentry *reference);
+ * ::
+
+ int security_fs_context_alloc(struct fs_context *fc,
+ struct dentry *reference);
Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. It should return 0 on success or a negative error
@@ -283,22 +325,28 @@
non-NULL in the case of a submount (FS_CONTEXT_FOR_SUBMOUNT) in which case
it indicates the automount point.
- (*) int security_fs_context_dup(struct fs_context *fc,
- struct fs_context *src_fc);
+ * ::
+
+ int security_fs_context_dup(struct fs_context *fc,
+ struct fs_context *src_fc);
Called to initialise fc->security (which is preset to NULL) and allocate
any resources needed. The original filesystem context is pointed to by
src_fc and may be used for reference. It should return 0 on success or a
negative error code on failure.
- (*) void security_fs_context_free(struct fs_context *fc);
+ * ::
+
+ void security_fs_context_free(struct fs_context *fc);
Called to clean up anything attached to fc->security. Note that the
contents may have been transferred to a superblock and the pointer cleared
during get_tree.
- (*) int security_fs_context_parse_param(struct fs_context *fc,
- struct fs_parameter *param);
+ * ::
+
+ int security_fs_context_parse_param(struct fs_context *fc,
+ struct fs_parameter *param);
Called for each mount parameter, including the source. The arguments are
as for the ->parse_param() method. It should return 0 to indicate that
@@ -310,7 +358,9 @@
(provided the value pointer is NULL'd out). If it is stolen, 1 must be
returned to prevent it being passed to the filesystem.
- (*) int security_fs_context_validate(struct fs_context *fc);
+ * ::
+
+ int security_fs_context_validate(struct fs_context *fc);
Called after all the options have been parsed to validate the collection
as a whole and to do any necessary allocation so that
@@ -320,36 +370,43 @@
In the case of reconfiguration, the target superblock will be accessible
via fc->root.
- (*) int security_sb_get_tree(struct fs_context *fc);
+ * ::
+
+ int security_sb_get_tree(struct fs_context *fc);
Called during the mount procedure to verify that the specified superblock
is allowed to be mounted and to transfer the security data there. It
should return 0 or a negative error code.
- (*) void security_sb_reconfigure(struct fs_context *fc);
+ * ::
+
+ void security_sb_reconfigure(struct fs_context *fc);
Called to apply any reconfiguration to an LSM's context. It must not
fail. Error checking and resource allocation must be done in advance by
the parameter parsing and validation hooks.
- (*) int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
- unsigned int mnt_flags);
+ * ::
+
+ int security_sb_mountpoint(struct fs_context *fc,
+ struct path *mountpoint,
+ unsigned int mnt_flags);
Called during the mount procedure to verify that the root dentry attached
to the context is permitted to be attached to the specified mountpoint.
It should return 0 on success or a negative error code on failure.
-==========================
-VFS FILESYSTEM CONTEXT API
+VFS Filesystem context API
==========================
There are four operations for creating a filesystem context and one for
destroying a context:
- (*) struct fs_context *fs_context_for_mount(
- struct file_system_type *fs_type,
- unsigned int sb_flags);
+ * ::
+
+ struct fs_context *fs_context_for_mount(struct file_system_type *fs_type,
+ unsigned int sb_flags);
Allocate a filesystem context for the purpose of setting up a new mount,
whether that be with a new superblock or sharing an existing one. This
@@ -359,7 +416,9 @@
fs_type specifies the filesystem type that will manage the context and
sb_flags presets the superblock flags stored therein.
- (*) struct fs_context *fs_context_for_reconfigure(
+ * ::
+
+ struct fs_context *fs_context_for_reconfigure(
struct dentry *dentry,
unsigned int sb_flags,
unsigned int sb_flags_mask);
@@ -369,7 +428,9 @@
configured. sb_flags and sb_flags_mask indicate which superblock flags
need changing and to what.
- (*) struct fs_context *fs_context_for_submount(
+ * ::
+
+ struct fs_context *fs_context_for_submount(
struct file_system_type *fs_type,
struct dentry *reference);
@@ -382,7 +443,9 @@
Note that it's not a requirement that the reference dentry be of the same
filesystem type as fs_type.
- (*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
+ * ::
+
+ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
Duplicate a filesystem context, copying any options noted and duplicating
or additionally referencing any resources held therein. This is available
@@ -392,14 +455,18 @@
The purpose in the new context is inherited from the old one.
- (*) void put_fs_context(struct fs_context *fc);
+ * ::
+
+ void put_fs_context(struct fs_context *fc);
Destroy a filesystem context, releasing any resources it holds. This
calls the ->free() operation. This is intended to be called by anyone who
created a filesystem context.
- [!] filesystem contexts are not refcounted, so this causes unconditional
- destruction.
+ .. Warning::
+
+ filesystem contexts are not refcounted, so this causes unconditional
+ destruction.
In all the above operations, apart from the put op, the return is a mount
context pointer or a negative error code.
@@ -407,10 +474,12 @@
For the remaining operations, if an error occurs, a negative error code will be
returned.
- (*) int vfs_parse_fs_param(struct fs_context *fc,
- struct fs_parameter *param);
+ * ::
- Supply a single mount parameter to the filesystem context. This include
+ int vfs_parse_fs_param(struct fs_context *fc,
+ struct fs_parameter *param);
+
+ Supply a single mount parameter to the filesystem context. This includes
the specification of the source/device which is specified as the "source"
parameter (which may be specified multiple times if the filesystem
supports that).
@@ -423,54 +492,64 @@
The parameter value is typed and can be one of:
- fs_value_is_flag, Parameter not given a value.
- fs_value_is_string, Value is a string
- fs_value_is_blob, Value is a binary blob
- fs_value_is_filename, Value is a filename* + dirfd
- fs_value_is_filename_empty, Value is a filename* + dirfd + AT_EMPTY_PATH
- fs_value_is_file, Value is an open file (file*)
+ ==================== =============================
+ fs_value_is_flag Parameter not given a value
+ fs_value_is_string Value is a string
+ fs_value_is_blob Value is a binary blob
+ fs_value_is_filename Value is a filename* + dirfd
+ fs_value_is_file Value is an open file (file*)
+ ==================== =============================
If there is a value, that value is stored in a union in the struct in one
of param->{string,blob,name,file}. Note that the function may steal and
clear the pointer, but then becomes responsible for disposing of the
object.
- (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
- const char *value, size_t v_size);
+ * ::
+
+ int vfs_parse_fs_string(struct fs_context *fc, const char *key,
+ const char *value, size_t v_size);
A wrapper around vfs_parse_fs_param() that copies the value string it is
passed.
- (*) int generic_parse_monolithic(struct fs_context *fc, void *data);
+ * ::
+
+ int generic_parse_monolithic(struct fs_context *fc, void *data);
Parse a sys_mount() data page, assuming the form to be a text list
consisting of key[=val] options separated by commas. Each item in the
list is passed to vfs_mount_option(). This is the default when the
->parse_monolithic() method is NULL.
- (*) int vfs_get_tree(struct fs_context *fc);
+ * ::
+
+ int vfs_get_tree(struct fs_context *fc);
Get or create the mountable root and superblock, using the parameters in
the filesystem context to select/configure the superblock. This invokes
the ->get_tree() method.
- (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+ * ::
+
+ struct vfsmount *vfs_create_mount(struct fs_context *fc);
Create a mount given the parameters in the specified filesystem context.
Note that this does not attach the mount to anything.
-===========================
-SUPERBLOCK CREATION HELPERS
+Superblock Creation Helpers
===========================
A number of VFS helpers are available for use by filesystems for the creation
or looking up of superblocks.
- (*) struct super_block *
- sget_fc(struct fs_context *fc,
- int (*test)(struct super_block *sb, struct fs_context *fc),
- int (*set)(struct super_block *sb, struct fs_context *fc));
+ * ::
+
+ struct super_block *
+ sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *sb, struct fs_context *fc),
+ int (*set)(struct super_block *sb, struct fs_context *fc));
This is the core routine. If test is non-NULL, it searches for an
existing superblock matching the criteria held in the fs_context, using
@@ -483,10 +562,12 @@
The following helpers all wrap sget_fc():
- (*) int vfs_get_super(struct fs_context *fc,
- enum vfs_get_super_keying keying,
- int (*fill_super)(struct super_block *sb,
- struct fs_context *fc))
+ * ::
+
+ int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
This creates/looks up a deviceless superblock. The keying indicates how
many superblocks of this type may exist and in what manner they may be
@@ -511,20 +592,18 @@
one.
-=====================
-PARAMETER DESCRIPTION
+Parameter Description
=====================
Parameters are described using structures defined in linux/fs_parser.h.
-There's a core description struct that links everything together:
+There's a core description struct that links everything together::
struct fs_parameter_description {
- const char name[16];
const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums;
};
-For example:
+For example::
enum {
Opt_autocell,
@@ -535,22 +614,18 @@
};
static const struct fs_parameter_description afs_fs_parameters = {
- .name = "kAFS",
.specs = afs_param_specs,
.enums = afs_param_enums,
};
The members are as follows:
- (1) const char name[16];
+ (1) ::
- The name to be used in error messages generated by the parse helper
- functions.
-
- (2) const struct fs_parameter_specification *specs;
+ const struct fs_parameter_specification *specs;
Table of parameter specifications, terminated with a null entry, where the
- entries are of type:
+ entries are of type::
struct fs_parameter_spec {
const char *name;
@@ -566,6 +641,7 @@
The 'type' field indicates the desired value type and must be one of:
+ ======================= ======================= =====================
TYPE NAME EXPECTED VALUE RESULT IN
======================= ======================= =====================
fs_param_is_flag No value n/a
@@ -581,19 +657,23 @@
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ ======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
to match any string value against "0", "1", "no", "yes", "false", "true".
Each parameter can also be qualified with 'flags':
+ ======================= ================================================
fs_param_v_optional The value is optional
fs_param_neg_with_no result->negated set if key is prefixed with "no"
fs_param_neg_with_empty result->negated set if value is ""
fs_param_deprecated The parameter is deprecated.
+ ======================= ================================================
These are wrapped with a number of convenience wrappers:
+ ======================= ===============================================
MACRO SPECIFIES
======================= ===============================================
fsparam_flag() fs_param_is_flag
@@ -610,9 +690,10 @@
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ ======================= ===============================================
all of which take two arguments, name string and option number - for
- example:
+ example::
static const struct fs_parameter_spec afs_param_specs[] = {
fsparam_flag ("autocell", Opt_autocell),
@@ -626,10 +707,12 @@
of arguments to specify the type and the flags for anything that doesn't
match one of the above macros.
- (6) const struct fs_parameter_enum *enums;
+ (2) ::
+
+ const struct fs_parameter_enum *enums;
Table of enum value names to integer mappings, terminated with a null
- entry. This is of type:
+ entry. This is of type::
struct fs_parameter_enum {
u8 opt;
@@ -638,7 +721,7 @@
};
Where the array is an unsorted list of { parameter ID, name }-keyed
- elements that indicate the value to map to, e.g.:
+ elements that indicate the value to map to, e.g.::
static const struct fs_parameter_enum afs_param_enums[] = {
{ Opt_bar, "x", 1},
@@ -656,18 +739,19 @@
userspace using the fsinfo() syscall.
-==========================
-PARAMETER HELPER FUNCTIONS
+Parameter Helper Functions
==========================
A number of helper functions are provided to help a filesystem or an LSM
process the parameters it is given.
- (*) int lookup_constant(const struct constant_table tbl[],
- const char *name, int not_found);
+ * ::
+
+ int lookup_constant(const struct constant_table tbl[],
+ const char *name, int not_found);
Look up a constant by name in a table of name -> integer mappings. The
- table is an array of elements of the following type:
+ table is an array of elements of the following type::
struct constant_table {
const char *name;
@@ -677,9 +761,11 @@
If a match is found, the corresponding value is returned. If a match
isn't found, the not_found value is returned instead.
- (*) bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
+ * ::
+
+ bool validate_constant_table(const struct constant_table *tbl,
+ size_t tbl_size,
+ int low, int high, int special);
Validate a constant table. Checks that all the elements are appropriately
ordered, that there are no duplicates and that the values are between low
@@ -690,16 +776,20 @@
If all is good, true is returned. If the table is invalid, errors are
logged to dmesg and false is returned.
- (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+ * ::
+
+ bool fs_validate_description(const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
log errors to dmesg if validation fails.
- (*) int fs_parse(struct fs_context *fc,
- const struct fs_parameter_description *desc,
- struct fs_parameter *param,
- struct fs_parse_result *result);
+ * ::
+
+ int fs_parse(struct fs_context *fc,
+ const struct fs_parameter_description *desc,
+ struct fs_parameter *param,
+ struct fs_parse_result *result);
This is the main interpreter of parameters. It uses the parameter
description to look up a parameter by key name and to convert that to an
@@ -719,14 +809,16 @@
parameter is matched, but the value is erroneous, -EINVAL will be
returned; otherwise the parameter's option number will be returned.
- (*) int fs_lookup_param(struct fs_context *fc,
- struct fs_parameter *value,
- bool want_bdev,
- struct path *_path);
+ * ::
+
+ int fs_lookup_param(struct fs_context *fc,
+ struct fs_parameter *value,
+ bool want_bdev,
+ struct path *_path);
This takes a parameter that carries a string or filename type and attempts
to do a path lookup on it. If the parameter expects a blockdev, a check
is made that the inode actually represents one.
- Returns 0 if successful and *_path will be set; returns a negative error
- code if not.
+ Returns 0 if successful and ``*_path`` will be set; returns a negative
+ error code if not.
diff --git a/Documentation/filesystems/nfs/fault_injection.txt b/Documentation/filesystems/nfs/fault_injection.txt
deleted file mode 100644
index f3a5b0a..0000000
--- a/Documentation/filesystems/nfs/fault_injection.txt
+++ /dev/null
@@ -1,69 +0,0 @@
-
-Fault Injection
-===============
-Fault injection is a method for forcing errors that may not normally occur, or
-may be difficult to reproduce. Forcing these errors in a controlled environment
-can help the developer find and fix bugs before their code is shipped in a
-production system. Injecting an error on the Linux NFS server will allow us to
-observe how the client reacts and if it manages to recover its state correctly.
-
-NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this
-feature.
-
-
-Using Fault Injection
-=====================
-On the client, mount the fault injection server through NFS v4.0+ and do some
-work over NFS (open files, take locks, ...).
-
-On the server, mount the debugfs filesystem to <debug_dir> and ls
-<debug_dir>/nfsd. This will show a list of files that will be used for
-injecting faults on the NFS server. As root, write a number n to the file
-corresponding to the action you want the server to take. The server will then
-process the first n items it finds. So if you want to forget 5 locks, echo '5'
-to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget
-all corresponding items. A log message will be created containing the number
-of items forgotten (check dmesg).
-
-Go back to work on the client and check if the client recovered from the error
-correctly.
-
-
-Available Faults
-================
-forget_clients:
- The NFS server keeps a list of clients that have placed a mount call. If
- this list is cleared, the server will have no knowledge of who the client
- is, forcing the client to reauthenticate with the server.
-
-forget_openowners:
- The NFS server keeps a list of what files are currently opened and who
- they were opened by. Clearing this list will force the client to reopen
- its files.
-
-forget_locks:
- The NFS server keeps a list of what files are currently locked in the VFS.
- Clearing this list will force the client to reclaim its locks (files are
- unlocked through the VFS as they are cleared from this list).
-
-forget_delegations:
- A delegation is used to assure the client that a file, or part of a file,
- has not changed since the delegation was awarded. Clearing this list will
- force the client to reacquire its delegation before accessing the file
- again.
-
-recall_delegations:
- Delegations can be recalled by the server when another client attempts to
- access a file. This test will notify the client that its delegation has
- been revoked, forcing the client to reacquire the delegation before using
- the file again.
-
-
-tools/nfs/inject_faults.sh script
-=================================
-This script has been created to ease the fault injection process. This script
-will detect the mounted debugfs directory and write to the files located there
-based on the arguments passed by the user. For example, running
-`inject_faults.sh forget_locks 1` as root will instruct the server to forget
-one lock. Running `inject_faults forget_locks` will instruct the server to
-forgetall locks.
diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt
deleted file mode 100644
index b86831a..0000000
--- a/Documentation/filesystems/nfs/idmapper.txt
+++ /dev/null
@@ -1,75 +0,0 @@
-
-=========
-ID Mapper
-=========
-Id mapper is used by NFS to translate user and group ids into names, and to
-translate user and group names into ids. Part of this translation involves
-performing an upcall to userspace to request the information. There are two
-ways NFS could obtain this information: placing a call to /sbin/request-key
-or by placing a call to the rpc.idmap daemon.
-
-NFS will attempt to call /sbin/request-key first. If this succeeds, the
-result will be cached using the generic request-key cache. This call should
-only fail if /etc/request-key.conf is not configured for the id_resolver key
-type, see the "Configuring" section below if you wish to use the request-key
-method.
-
-If the call to /sbin/request-key fails (if /etc/request-key.conf is not
-configured with the id_resolver key type), then the idmapper will ask the
-legacy rpc.idmap daemon for the id mapping. This result will be stored
-in a custom NFS idmap cache.
-
-
-===========
-Configuring
-===========
-The file /etc/request-key.conf will need to be modified so /sbin/request-key can
-direct the upcall. The following line should be added:
-
-#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
-#====== ======= =============== =============== ===============================
-create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
-
-This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
-The last parameter, 600, defines how many seconds into the future the key will
-expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
-is not specified, nfs.idmap will default to 600 seconds.
-
-id mapper uses for key descriptions:
- uid: Find the UID for the given user
- gid: Find the GID for the given group
- user: Find the user name for the given UID
- group: Find the group name for the given GID
-
-You can handle any of these individually, rather than using the generic upcall
-program. If you would like to use your own program for a uid lookup then you
-would edit your request-key.conf so it look similar to this:
-
-#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
-#====== ======= =============== =============== ===============================
-create id_resolver uid:* * /some/other/program %k %d 600
-create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
-
-Notice that the new line was added above the line for the generic program.
-request-key will find the first matching line and corresponding program. In
-this case, /some/other/program will handle all uid lookups and
-/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
-
-See <file:Documentation/security/keys/request-key.rst> for more information
-about the request-key function.
-
-
-=========
-nfs.idmap
-=========
-nfs.idmap is designed to be called by request-key, and should not be run "by
-hand". This program takes two arguments, a serialized key and a key
-description. The serialized key is first converted into a key_serial_t, and
-then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
-
-The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
-determines the correct function to call by looking at the first part of the
-description string. For example, a uid lookup description will appear as
-"uid:user@domain".
-
-nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
new file mode 100644
index 0000000..6580562
--- /dev/null
+++ b/Documentation/filesystems/nfs/index.rst
@@ -0,0 +1,13 @@
+===============================
+NFS
+===============================
+
+
+.. toctree::
+ :maxdepth: 1
+
+ pnfs
+ rpc-cache
+ rpc-server-gss
+ nfs41-server
+ knfsd-stats
diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.rst
similarity index 94%
rename from Documentation/filesystems/nfs/knfsd-stats.txt
rename to Documentation/filesystems/nfs/knfsd-stats.rst
index 1a5d821..80bcf13 100644
--- a/Documentation/filesystems/nfs/knfsd-stats.txt
+++ b/Documentation/filesystems/nfs/knfsd-stats.rst
@@ -1,7 +1,9 @@
-
+============================
Kernel NFS Server Statistics
============================
+:Authors: Greg Banks <gnb@sgi.com> - 26 Mar 2009
+
This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
@@ -18,7 +20,7 @@
separated by whitespace.
/proc/fs/nfsd/pool_stats
-------------------------
+========================
This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).
@@ -109,15 +111,12 @@
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
-counted, but can be inferred from the other counters thus:
+counted, but can be inferred from the other counters thus::
-packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
+ packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
More
-----
+====
+
Descriptions of the other statistics file should go here.
-
-
-Greg Banks <gnb@sgi.com>
-26 Mar 2009
diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt
deleted file mode 100644
index 22dc0dd..0000000
--- a/Documentation/filesystems/nfs/nfs-rdma.txt
+++ /dev/null
@@ -1,274 +0,0 @@
-################################################################################
-# #
-# NFS/RDMA README #
-# #
-################################################################################
-
- Author: NetApp and Open Grid Computing
- Date: May 29, 2008
-
-Table of Contents
-~~~~~~~~~~~~~~~~~
- - Overview
- - Getting Help
- - Installation
- - Check RDMA and NFS Setup
- - NFS/RDMA Setup
-
-Overview
-~~~~~~~~
-
- This document describes how to install and setup the Linux NFS/RDMA client
- and server software.
-
- The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
- was first included in the following release, Linux 2.6.25.
-
- In our testing, we have obtained excellent performance results (full 10Gbit
- wire bandwidth at minimal client CPU) under many workloads. The code passes
- the full Connectathon test suite and operates over both Infiniband and iWARP
- RDMA adapters.
-
-Getting Help
-~~~~~~~~~~~~
-
- If you get stuck, you can ask questions on the
-
- nfs-rdma-devel@lists.sourceforge.net
-
- mailing list.
-
-Installation
-~~~~~~~~~~~~
-
- These instructions are a step by step guide to building a machine for
- use with NFS/RDMA.
-
- - Install an RDMA device
-
- Any device supported by the drivers in drivers/infiniband/hw is acceptable.
-
- Testing has been performed using several Mellanox-based IB cards, the
- Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
-
- - Install a Linux distribution and tools
-
- The first kernel release to contain both the NFS/RDMA client and server was
- Linux 2.6.25 Therefore, a distribution compatible with this and subsequent
- Linux kernel release should be installed.
-
- The procedures described in this document have been tested with
- distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
-
- - Install nfs-utils-1.1.2 or greater on the client
-
- An NFS/RDMA mount point can be obtained by using the mount.nfs command in
- nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
- version with support for NFS/RDMA mounts, but for various reasons we
- recommend using nfs-utils-1.1.2 or greater). To see which version of
- mount.nfs you are using, type:
-
- $ /sbin/mount.nfs -V
-
- If the version is less than 1.1.2 or the command does not exist,
- you should install the latest version of nfs-utils.
-
- Download the latest package from:
-
- http://www.kernel.org/pub/linux/utils/nfs
-
- Uncompress the package and follow the installation instructions.
-
- If you will not need the idmapper and gssd executables (you do not need
- these to create an NFS/RDMA enabled mount command), the installation
- process can be simplified by disabling these features when running
- configure:
-
- $ ./configure --disable-gss --disable-nfsv4
-
- To build nfs-utils you will need the tcp_wrappers package installed. For
- more information on this see the package's README and INSTALL files.
-
- After building the nfs-utils package, there will be a mount.nfs binary in
- the utils/mount directory. This binary can be used to initiate NFS v2, v3,
- or v4 mounts. To initiate a v4 mount, the binary must be called
- mount.nfs4. The standard technique is to create a symlink called
- mount.nfs4 to mount.nfs.
-
- This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
-
- $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
-
- In this location, mount.nfs will be invoked automatically for NFS mounts
- by the system mount command.
-
- NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
- on the NFS client machine. You do not need this specific version of
- nfs-utils on the server. Furthermore, only the mount.nfs command from
- nfs-utils-1.1.2 is needed on the client.
-
- - Install a Linux kernel with NFS/RDMA
-
- The NFS/RDMA client and server are both included in the mainline Linux
- kernel version 2.6.25 and later. This and other versions of the Linux
- kernel can be found at:
-
- https://www.kernel.org/pub/linux/kernel/
-
- Download the sources and place them in an appropriate location.
-
- - Configure the RDMA stack
-
- Make sure your kernel configuration has RDMA support enabled. Under
- Device Drivers -> InfiniBand support, update the kernel configuration
- to enable InfiniBand support [NOTE: the option name is misleading. Enabling
- InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
-
- Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
- iWARP adapter support (amso, cxgb3, etc.).
-
- If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
-
- - Configure the NFS client and server
-
- Your kernel configuration must also have NFS file system support and/or
- NFS server support enabled. These and other NFS related configuration
- options can be found under File Systems -> Network File Systems.
-
- - Build, install, reboot
-
- The NFS/RDMA code will be enabled automatically if NFS and RDMA
- are turned on. The NFS/RDMA client and server are configured via the hidden
- SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
- value of SUNRPC_XPRT_RDMA will be:
-
- - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client
- and server will not be built
- - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M,
- in this case the NFS/RDMA client and server will be built as modules
- - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client
- and server will be built into the kernel
-
- Therefore, if you have followed the steps above and turned no NFS and RDMA,
- the NFS/RDMA client and server will be built.
-
- Build a new kernel, install it, boot it.
-
-Check RDMA and NFS Setup
-~~~~~~~~~~~~~~~~~~~~~~~~
-
- Before configuring the NFS/RDMA software, it is a good idea to test
- your new kernel to ensure that the kernel is working correctly.
- In particular, it is a good idea to verify that the RDMA stack
- is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
- is working properly.
-
- - Check RDMA Setup
-
- If you built the RDMA components as modules, load them at
- this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
- card:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
-
- If you are using InfiniBand, make sure there is a Subnet Manager (SM)
- running on the network. If your IB switch has an embedded SM, you can
- use it. Otherwise, you will need to run an SM, such as OpenSM, on one
- of your end nodes.
-
- If an SM is running on your network, you should see the following:
-
- $ cat /sys/class/infiniband/driverX/ports/1/state
- 4: ACTIVE
-
- where driverX is mthca0, ipath5, ehca3, etc.
-
- To further test the InfiniBand software stack, use IPoIB (this
- assumes you have two IB hosts named host1 and host2):
-
- host1$ ip link set dev ib0 up
- host1$ ip address add dev ib0 a.b.c.x
- host2$ ip link set dev ib0 up
- host2$ ip address add dev ib0 a.b.c.y
- host1$ ping a.b.c.y
- host2$ ping a.b.c.x
-
- For other device types, follow the appropriate procedures.
-
- - Check NFS Setup
-
- For the NFS components enabled above (client and/or server),
- test their functionality over standard Ethernet using TCP/IP or UDP/IP.
-
-NFS/RDMA Setup
-~~~~~~~~~~~~~~
-
- We recommend that you use two machines, one to act as the client and
- one to act as the server.
-
- One time configuration:
-
- - On the server system, configure the /etc/exports file and
- start the NFS/RDMA server.
-
- Exports entries with the following formats have been tested:
-
- /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
- /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
-
- The IP address(es) is(are) the client's IPoIB address for an InfiniBand
- HCA or the client's iWARP address(es) for an RNIC.
-
- NOTE: The "insecure" option must be used because the NFS/RDMA client does
- not use a reserved port.
-
- Each time a machine boots:
-
- - Load and configure the RDMA drivers
-
- For InfiniBand using a Mellanox adapter:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
- $ ip li set dev ib0 up
- $ ip addr add dev ib0 a.b.c.d
-
- NOTE: use unique addresses for the client and server
-
- - Start the NFS server
-
- If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
- kernel config), load the RDMA transport module:
-
- $ modprobe svcrdma
-
- Regardless of how the server was built (module or built-in), start the
- server:
-
- $ /etc/init.d/nfs start
-
- or
-
- $ service nfs start
-
- Instruct the server to listen on the RDMA transport:
-
- $ echo rdma 20049 > /proc/fs/nfsd/portlist
-
- - On the client system
-
- If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
- kernel config), load the RDMA client module:
-
- $ modprobe xprtrdma.ko
-
- Regardless of how the client was built (module or built-in), use this
- command to mount the NFS/RDMA server:
-
- $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
-
- To verify that the mount is using RDMA, run "cat /proc/mounts" and check
- the "proto" field for the given mount.
-
- Congratulations! You're using NFS/RDMA!
diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt
deleted file mode 100644
index f2571c8..0000000
--- a/Documentation/filesystems/nfs/nfs.txt
+++ /dev/null
@@ -1,136 +0,0 @@
-
-The NFS client
-==============
-
-The NFS version 2 protocol was first documented in RFC1094 (March 1989).
-Since then two more major releases of NFS have been published, with NFSv3
-being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
-2003).
-
-The Linux NFS client currently supports all the above published versions,
-and work is in progress on adding support for minor version 1 of the NFSv4
-protocol.
-
-The purpose of this document is to provide information on some of the
-special features of the NFS client that can be configured by system
-administrators.
-
-
-The nfs4_unique_id parameter
-============================
-
-NFSv4 requires clients to identify themselves to servers with a unique
-string. File open and lock state shared between one client and one server
-is associated with this identity. To support robust NFSv4 state recovery
-and transparent state migration, this identity string must not change
-across client reboots.
-
-Without any other intervention, the Linux client uses a string that contains
-the local system's node name. System administrators, however, often do not
-take care to ensure that node names are fully qualified and do not change
-over the lifetime of a client system. Node names can have other
-administrative requirements that require particular behavior that does not
-work well as part of an nfs_client_id4 string.
-
-The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
-used instead of a system's node name when an NFS client identifies itself to
-a server. Thus, if the system's node name is not unique, or it changes, its
-nfs.nfs4_unique_id stays the same, preventing collision with other clients
-or loss of state during NFS reboot recovery or transparent state migration.
-
-The nfs.nfs4_unique_id string is typically a UUID, though it can contain
-anything that is believed to be unique across all NFS clients. An
-nfs4_unique_id string should be chosen when a client system is installed,
-just as a system's root file system gets a fresh UUID in its label at
-install time.
-
-The string should remain fixed for the lifetime of the client. It can be
-changed safely if care is taken that the client shuts down cleanly and all
-outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
-
-This string can be stored in an NFS client's grub.conf, or it can be provided
-via a net boot facility such as PXE. It may also be specified as an nfs.ko
-module parameter. Specifying a uniquifier string is not support for NFS
-clients running in containers.
-
-
-The DNS resolver
-================
-
-NFSv4 allows for one server to refer the NFS client to data that has been
-migrated onto another server by means of the special "fs_locations"
-attribute. See
- http://tools.ietf.org/html/rfc3530#section-6
-and
- http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
-
-The fs_locations information can take the form of either an ip address and
-a path, or a DNS hostname and a path. The latter requires the NFS client to
-do a DNS lookup in order to mount the new volume, and hence the need for an
-upcall to allow userland to provide this service.
-
-Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
-/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
-
- (1) The process checks the dns_resolve cache to see if it contains a
- valid entry. If so, it returns that entry and exits.
-
- (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
- (may be changed using the 'nfs.cache_getent' kernel boot parameter)
- is run, with two arguments:
- - the cache name, "dns_resolve"
- - the hostname to resolve
-
- (3) After looking up the corresponding ip address, the helper script
- writes the result into the rpc_pipefs pseudo-file
- '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
- in the following (text) format:
-
- "<ip address> <hostname> <ttl>\n"
-
- Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6
- (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
- <hostname> is identical to the second argument of the helper
- script, and <ttl> is the 'time to live' of this cache entry (in
- units of seconds).
-
- Note: If <ip address> is invalid, say the string "0", then a negative
- entry is created, which will cause the kernel to treat the hostname
- as having no valid DNS translation.
-
-
-
-
-A basic sample /sbin/nfs_cache_getent
-=====================================
-
-#!/bin/bash
-#
-ttl=600
-#
-cut=/usr/bin/cut
-getent=/usr/bin/getent
-rpc_pipefs=/var/lib/nfs/rpc_pipefs
-#
-die()
-{
- echo "Usage: $0 cache_name entry_name"
- exit 1
-}
-
-[ $# -lt 2 ] && die
-cachename="$1"
-cache_path=${rpc_pipefs}/cache/${cachename}/channel
-
-case "${cachename}" in
- dns_resolve)
- name="$2"
- result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
- [ -z "${result}" ] && result="0"
- ;;
- *)
- die
- ;;
-esac
-echo "${result} ${name} ${ttl}" >${cache_path}
-
diff --git a/Documentation/filesystems/nfs/nfs41-server.rst b/Documentation/filesystems/nfs/nfs41-server.rst
new file mode 100644
index 0000000..16b5f02
--- /dev/null
+++ b/Documentation/filesystems/nfs/nfs41-server.rst
@@ -0,0 +1,256 @@
+=============================
+NFSv4.1 Server Implementation
+=============================
+
+Server support for minorversion 1 can be controlled using the
+/proc/fs/nfsd/versions control file. The string output returned
+by reading this file will contain either "+4.1" or "-4.1"
+correspondingly.
+
+Currently, server support for minorversion 1 is enabled by default.
+It can be disabled at run time by writing the string "-4.1" to
+the /proc/fs/nfsd/versions control file. Note that to write this
+control file, the nfsd service must be taken down. You can use rpc.nfsd
+for this; see rpc.nfsd(8).
+
+(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
+"-4", respectively. Therefore, code meant to work on both new and old
+kernels must turn 4.1 on or off *before* turning support for version 4
+on or off; rpc.nfsd does this correctly.)
+
+The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
+on RFC 5661.
+
+From the many new features in NFSv4.1 the current implementation
+focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
+"exactly once" semantics and better control and throttling of the
+resources allocated for each client.
+
+The table below, taken from the NFSv4.1 document, lists
+the operations that are mandatory to implement (REQ), optional
+(OPT), and NFSv4.0 operations that are required not to implement (MNI)
+in minor version 1. The first column indicates the operations that
+are not supported yet by the linux server implementation.
+
+The OPTIONAL features identified and their abbreviations are as follows:
+
+- **pNFS** Parallel NFS
+- **FDELG** File Delegations
+- **DDELG** Directory Delegations
+
+The following abbreviations indicate the linux server implementation status.
+
+- **I** Implemented NFSv4.1 operations.
+- **NS** Not Supported.
+- **NS\*** Unimplemented optional feature.
+
+Operations
+==========
+
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition |
++=======================+======================+=====================+===========================+================+
+| | ACCESS | REQ | | Section 18.1 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | CLOSE | REQ | | Section 18.2 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | COMMIT | REQ | | Section 18.3 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | CREATE | REQ | | Section 18.4 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | CREATE_SESSION | REQ | | Section 18.36 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | DELEGRETURN | OPT | FDELG, | Section 18.6 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | DDELG, pNFS | |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | (REQ) | |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | DESTROY_CLIENTID | REQ | | Section 18.50 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | DESTROY_SESSION | REQ | | Section 18.37 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | EXCHANGE_ID | REQ | | Section 18.35 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | FREE_STATEID | REQ | | Section 18.38 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | GETATTR | REQ | | Section 18.7 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | GETFH | REQ | | Section 18.8 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LINK | OPT | | Section 18.9 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCK | REQ | | Section 18.10 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCKT | REQ | | Section 18.11 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOCKU | REQ | | Section 18.12 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOOKUP | REQ | | Section 18.13 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | LOOKUPP | REQ | | Section 18.14 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | NVERIFY | REQ | | Section 18.15 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN | REQ | | Section 18.16 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | OPENATTR | OPT | | Section 18.17 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN_CONFIRM | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | OPEN_DOWNGRADE | REQ | | Section 18.18 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTFH | REQ | | Section 18.19 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTPUBFH | REQ | | Section 18.20 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | PUTROOTFH | REQ | | Section 18.21 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READ | REQ | | Section 18.22 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READDIR | REQ | | Section 18.23 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | READLINK | OPT | | Section 18.24 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RECLAIM_COMPLETE | REQ | | Section 18.51 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RELEASE_LOCKOWNER | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | REMOVE | REQ | | Section 18.25 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RENAME | REQ | | Section 18.26 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RENEW | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | RESTOREFH | REQ | | Section 18.27 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SAVEFH | REQ | | Section 18.28 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SECINFO | REQ | | Section 18.29 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | | | layout (REQ) | Section 13.12 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | SEQUENCE | REQ | | Section 18.46 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETATTR | REQ | | Section 18.30 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETCLIENTID | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | SETCLIENTID_CONFIRM | MNI | | N/A |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS | SET_SSV | REQ | | Section 18.47 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| I | TEST_STATEID | REQ | | Section 18.48 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | VERIFY | REQ | | Section 18.31 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| NS* | WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+| | WRITE | REQ | | Section 18.32 |
++-----------------------+----------------------+---------------------+---------------------------+----------------+
+
+
+Callback Operations
+===================
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| Implementation status | Operation | REQ,REC, OPT or NMI | Feature (REQ, REC or OPT) | Definition |
++=======================+=========================+=====================+===========================+===============+
+| | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_NOTIFY_LOCK | OPT | | Section 20.11 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | CB_RECALL | OPT | FDELG, | Section 20.2 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| NS* | CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | DDELG, pNFS | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+| | | | (REQ) | |
++-----------------------+-------------------------+---------------------+---------------------------+---------------+
+
+
+Implementation notes:
+=====================
+
+SSV:
+ The spec claims this is mandatory, but we don't actually know of any
+ implementations, so we're ignoring it for now. The server returns
+ NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
+
+GSS on the backchannel:
+ Again, theoretically required but not widely implemented (in
+ particular, the current Linux client doesn't request it). We return
+ NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
+
+DELEGPURGE:
+ mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
+ CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
+ persist across client reboots). Thus we need not implement this for
+ now.
+
+EXCHANGE_ID:
+ implementation ids are ignored
+
+CREATE_SESSION:
+ backchannel attributes are ignored
+
+SEQUENCE:
+ no support for dynamic slot table renegotiation (optional)
+
+Nonstandard compound limitations:
+ No support for a sessions fore channel RPC compound that requires both a
+ ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
+ fail to live up to the promise we made in CREATE_SESSION fore channel
+ negotiation.
+
+See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt
deleted file mode 100644
index 682a59f..0000000
--- a/Documentation/filesystems/nfs/nfs41-server.txt
+++ /dev/null
@@ -1,173 +0,0 @@
-NFSv4.1 Server Implementation
-
-Server support for minorversion 1 can be controlled using the
-/proc/fs/nfsd/versions control file. The string output returned
-by reading this file will contain either "+4.1" or "-4.1"
-correspondingly.
-
-Currently, server support for minorversion 1 is enabled by default.
-It can be disabled at run time by writing the string "-4.1" to
-the /proc/fs/nfsd/versions control file. Note that to write this
-control file, the nfsd service must be taken down. You can use rpc.nfsd
-for this; see rpc.nfsd(8).
-
-(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
-"-4", respectively. Therefore, code meant to work on both new and old
-kernels must turn 4.1 on or off *before* turning support for version 4
-on or off; rpc.nfsd does this correctly.)
-
-The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
-on RFC 5661.
-
-From the many new features in NFSv4.1 the current implementation
-focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
-"exactly once" semantics and better control and throttling of the
-resources allocated for each client.
-
-The table below, taken from the NFSv4.1 document, lists
-the operations that are mandatory to implement (REQ), optional
-(OPT), and NFSv4.0 operations that are required not to implement (MNI)
-in minor version 1. The first column indicates the operations that
-are not supported yet by the linux server implementation.
-
-The OPTIONAL features identified and their abbreviations are as follows:
- pNFS Parallel NFS
- FDELG File Delegations
- DDELG Directory Delegations
-
-The following abbreviations indicate the linux server implementation status.
- I Implemented NFSv4.1 operations.
- NS Not Supported.
- NS* Unimplemented optional feature.
-
-Operations
-
- +----------------------+------------+--------------+----------------+
- | Operation | REQ, REC, | Feature | Definition |
- | | OPT, or | (REQ, REC, | |
- | | MNI | or OPT) | |
- +----------------------+------------+--------------+----------------+
- | ACCESS | REQ | | Section 18.1 |
-I | BACKCHANNEL_CTL | REQ | | Section 18.33 |
-I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
- | CLOSE | REQ | | Section 18.2 |
- | COMMIT | REQ | | Section 18.3 |
- | CREATE | REQ | | Section 18.4 |
-I | CREATE_SESSION | REQ | | Section 18.36 |
-NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
- | DELEGRETURN | OPT | FDELG, | Section 18.6 |
- | | | DDELG, pNFS | |
- | | | (REQ) | |
-I | DESTROY_CLIENTID | REQ | | Section 18.50 |
-I | DESTROY_SESSION | REQ | | Section 18.37 |
-I | EXCHANGE_ID | REQ | | Section 18.35 |
-I | FREE_STATEID | REQ | | Section 18.38 |
- | GETATTR | REQ | | Section 18.7 |
-I | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
-NS*| GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
- | GETFH | REQ | | Section 18.8 |
-NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
-I | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
-I | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
-I | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
- | LINK | OPT | | Section 18.9 |
- | LOCK | REQ | | Section 18.10 |
- | LOCKT | REQ | | Section 18.11 |
- | LOCKU | REQ | | Section 18.12 |
- | LOOKUP | REQ | | Section 18.13 |
- | LOOKUPP | REQ | | Section 18.14 |
- | NVERIFY | REQ | | Section 18.15 |
- | OPEN | REQ | | Section 18.16 |
-NS*| OPENATTR | OPT | | Section 18.17 |
- | OPEN_CONFIRM | MNI | | N/A |
- | OPEN_DOWNGRADE | REQ | | Section 18.18 |
- | PUTFH | REQ | | Section 18.19 |
- | PUTPUBFH | REQ | | Section 18.20 |
- | PUTROOTFH | REQ | | Section 18.21 |
- | READ | REQ | | Section 18.22 |
- | READDIR | REQ | | Section 18.23 |
- | READLINK | OPT | | Section 18.24 |
- | RECLAIM_COMPLETE | REQ | | Section 18.51 |
- | RELEASE_LOCKOWNER | MNI | | N/A |
- | REMOVE | REQ | | Section 18.25 |
- | RENAME | REQ | | Section 18.26 |
- | RENEW | MNI | | N/A |
- | RESTOREFH | REQ | | Section 18.27 |
- | SAVEFH | REQ | | Section 18.28 |
- | SECINFO | REQ | | Section 18.29 |
-I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
- | | | layout (REQ) | Section 13.12 |
-I | SEQUENCE | REQ | | Section 18.46 |
- | SETATTR | REQ | | Section 18.30 |
- | SETCLIENTID | MNI | | N/A |
- | SETCLIENTID_CONFIRM | MNI | | N/A |
-NS | SET_SSV | REQ | | Section 18.47 |
-I | TEST_STATEID | REQ | | Section 18.48 |
- | VERIFY | REQ | | Section 18.31 |
-NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
- | WRITE | REQ | | Section 18.32 |
-
-Callback Operations
-
- +-------------------------+-----------+-------------+---------------+
- | Operation | REQ, REC, | Feature | Definition |
- | | OPT, or | (REQ, REC, | |
- | | MNI | or OPT) | |
- +-------------------------+-----------+-------------+---------------+
- | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
-I | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
-NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
-NS*| CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
-NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
-NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
- | CB_RECALL | OPT | FDELG, | Section 20.2 |
- | | | DDELG, pNFS | |
- | | | (REQ) | |
-NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
- | | | DDELG, pNFS | |
- | | | (REQ) | |
-NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
-NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
- | | | (REQ) | |
-I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
- | | | DDELG, pNFS | |
- | | | (REQ) | |
-NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
- | | | DDELG, pNFS | |
- | | | (REQ) | |
- +-------------------------+-----------+-------------+---------------+
-
-Implementation notes:
-
-SSV:
-* The spec claims this is mandatory, but we don't actually know of any
- implementations, so we're ignoring it for now. The server returns
- NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof.
-
-GSS on the backchannel:
-* Again, theoretically required but not widely implemented (in
- particular, the current Linux client doesn't request it). We return
- NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION.
-
-DELEGPURGE:
-* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
- CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
- persist across client reboots). Thus we need not implement this for
- now.
-
-EXCHANGE_ID:
-* implementation ids are ignored
-
-CREATE_SESSION:
-* backchannel attributes are ignored
-
-SEQUENCE:
-* no support for dynamic slot table renegotiation (optional)
-
-Nonstandard compound limitations:
-* No support for a sessions fore channel RPC compound that requires both a
- ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
- fail to live up to the promise we made in CREATE_SESSION fore channel
- negotiation.
-
-See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues.
diff --git a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt
deleted file mode 100644
index 56a96fb..0000000
--- a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-Administrative interfaces for nfsd
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Note that normally these interfaces are used only by the utilities in
-nfs-utils.
-
-nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem,
-which is normally mounted at /proc/fs/nfsd/.
-
-The server is always started by the first write of a nonzero value to
-nfsd/threads.
-
-Before doing that, NFSD can be told which sockets to listen on by
-writing to nfsd/portlist; that write may be:
-
- - an ascii-encoded file descriptor, which should refer to a
- bound (and listening, for tcp) socket, or
- - "transportname port", where transportname is currently either
- "udp", "tcp", or "rdma".
-
-If nfsd is started without doing any of these, then it will create one
-udp and one tcp listener at port 2049 (see nfsd_init_socks).
-
-On startup, nfsd and lockd grace periods start.
-
-nfsd is shut down by a write of 0 to nfsd/threads. All locks and state
-are thrown away at that point.
-
-Between startup and shutdown, the number of threads may be adjusted up
-or down by additional writes to nfsd/threads or by writes to
-nfsd/pool_threads.
-
-For more detail about files under nfsd/ and what they control, see
-fs/nfsd/nfsctl.c; most of them have detailed comments.
-
-Implementation notes
-^^^^^^^^^^^^^^^^^^^^
-
-Note that the rpc server requires the caller to serialize addition and
-removal of listening sockets, and startup and shutdown of the server.
-For nfsd this is done using nfsd_mutex.
diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
deleted file mode 100644
index ae43324..0000000
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ /dev/null
@@ -1,355 +0,0 @@
-Mounting the root filesystem via NFS (nfsroot)
-===============================================
-
-Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
-Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
-Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
-Updated 2006 by Horms <horms@verge.net.au>
-Updated 2018 by Chris Novakovic <chris@chrisn.me.uk>
-
-
-
-In order to use a diskless system, such as an X-terminal or printer server
-for example, it is necessary for the root filesystem to be present on a
-non-disk device. This may be an initramfs (see Documentation/filesystems/
-ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/admin-guide/initrd.rst) or a
-filesystem mounted via NFS. The following text describes on how to use NFS
-for the root filesystem. For the rest of this text 'client' means the
-diskless system, and 'server' means the NFS server.
-
-
-
-
-1.) Enabling nfsroot capabilities
- -----------------------------
-
-In order to use nfsroot, NFS client support needs to be selected as
-built-in during configuration. Once this has been selected, the nfsroot
-option will become available, which should also be selected.
-
-In the networking options, kernel level autoconfiguration can be selected,
-along with the types of autoconfiguration to support. Selecting all of
-DHCP, BOOTP and RARP is safe.
-
-
-
-
-2.) Kernel command line
- -------------------
-
-When the kernel has been loaded by a boot loader (see below) it needs to be
-told what root fs device to use. And in the case of nfsroot, where to find
-both the server and the name of the directory on the server to mount as root.
-This can be established using the following kernel command line parameters:
-
-
-root=/dev/nfs
-
- This is necessary to enable the pseudo-NFS-device. Note that it's not a
- real device but just a synonym to tell the kernel to use NFS instead of
- a real device.
-
-
-nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
-
- If the `nfsroot' parameter is NOT given on the command line,
- the default "/tftpboot/%s" will be used.
-
- <server-ip> Specifies the IP address of the NFS server.
- The default address is determined by the `ip' parameter
- (see below). This parameter allows the use of different
- servers for IP autoconfiguration and NFS.
-
- <root-dir> Name of the directory on the server to mount as root.
- If there is a "%s" token in the string, it will be
- replaced by the ASCII-representation of the client's
- IP address.
-
- <nfs-options> Standard NFS options. All options are separated by commas.
- The following defaults are used:
- port = as given by server portmap daemon
- rsize = 4096
- wsize = 4096
- timeo = 7
- retrans = 3
- acregmin = 3
- acregmax = 60
- acdirmin = 30
- acdirmax = 60
- flags = hard, nointr, noposix, cto, ac
-
-
-ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:
- <dns0-ip>:<dns1-ip>:<ntp0-ip>
-
- This parameter tells the kernel how to configure IP addresses of devices
- and also how to set up the IP routing table. It was originally called
- `nfsaddrs', but now the boot-time IP configuration works independently of
- NFS, so it was renamed to `ip' and the old name remained as an alias for
- compatibility reasons.
-
- If this parameter is missing from the kernel command line, all fields are
- assumed to be empty, and the defaults mentioned below apply. In general
- this means that the kernel tries to configure everything using
- autoconfiguration.
-
- The <autoconf> parameter can appear alone as the value to the `ip'
- parameter (without all the ':' characters before). If the value is
- "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
- autoconfiguration will take place. The most common way to use this
- is "ip=dhcp".
-
- <client-ip> IP address of the client.
-
- Default: Determined using autoconfiguration.
-
- <server-ip> IP address of the NFS server. If RARP is used to determine
- the client address and this parameter is NOT empty only
- replies from the specified server are accepted.
-
- Only required for NFS root. That is autoconfiguration
- will not be triggered if it is missing and NFS root is not
- in operation.
-
- Value is exported to /proc/net/pnp with the prefix "bootserver "
- (see below).
-
- Default: Determined using autoconfiguration.
- The address of the autoconfiguration server is used.
-
- <gw-ip> IP address of a gateway if the server is on a different subnet.
-
- Default: Determined using autoconfiguration.
-
- <netmask> Netmask for local network interface. If unspecified
- the netmask is derived from the client IP address assuming
- classful addressing.
-
- Default: Determined using autoconfiguration.
-
- <hostname> Name of the client. If a '.' character is present, anything
- before the first '.' is used as the client's hostname, and anything
- after it is used as its NIS domain name. May be supplied by
- autoconfiguration, but its absence will not trigger autoconfiguration.
- If specified and DHCP is used, the user-provided hostname (and NIS
- domain name, if present) will be carried in the DHCP request; this
- may cause a DNS record to be created or updated for the client.
-
- Default: Client IP address is used in ASCII notation.
-
- <device> Name of network device to use.
-
- Default: If the host only has one device, it is used.
- Otherwise the device is determined using
- autoconfiguration. This is done by sending
- autoconfiguration requests out of all devices,
- and using the device that received the first reply.
-
- <autoconf> Method to use for autoconfiguration. In the case of options
- which specify multiple autoconfiguration protocols,
- requests are sent using all protocols, and the first one
- to reply is used.
-
- Only autoconfiguration protocols that have been compiled
- into the kernel will be used, regardless of the value of
- this option.
-
- off or none: don't use autoconfiguration
- (do static IP assignment instead)
- on or any: use any protocol available in the kernel
- (default)
- dhcp: use DHCP
- bootp: use BOOTP
- rarp: use RARP
- both: use both BOOTP and RARP but not DHCP
- (old option kept for backwards compatibility)
-
- if dhcp is used, the client identifier can be used by following
- format "ip=dhcp,client-id-type,client-id-value"
-
- Default: any
-
- <dns0-ip> IP address of primary nameserver.
- Value is exported to /proc/net/pnp with the prefix "nameserver "
- (see below).
-
- Default: None if not using autoconfiguration; determined
- automatically if using autoconfiguration.
-
- <dns1-ip> IP address of secondary nameserver.
- See <dns0-ip>.
-
- <ntp0-ip> IP address of a Network Time Protocol (NTP) server.
- Value is exported to /proc/net/ipconfig/ntp_servers, but is
- otherwise unused (see below).
-
- Default: None if not using autoconfiguration; determined
- automatically if using autoconfiguration.
-
- After configuration (whether manual or automatic) is complete, two files
- are created in the following format; lines are omitted if their respective
- value is empty following configuration:
-
- - /proc/net/pnp:
-
- #PROTO: <DHCP|BOOTP|RARP|MANUAL> (depending on configuration method)
- domain <dns-domain> (if autoconfigured, the DNS domain)
- nameserver <dns0-ip> (primary name server IP)
- nameserver <dns1-ip> (secondary name server IP)
- nameserver <dns2-ip> (tertiary name server IP)
- bootserver <server-ip> (NFS server IP)
-
- - /proc/net/ipconfig/ntp_servers:
-
- <ntp0-ip> (NTP server IP)
- <ntp1-ip> (NTP server IP)
- <ntp2-ip> (NTP server IP)
-
- <dns-domain> and <dns2-ip> (in /proc/net/pnp) and <ntp1-ip> and <ntp2-ip>
- (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration;
- they cannot be specified as part of the "ip=" kernel command line parameter.
-
- Because the "domain" and "nameserver" options are recognised by DNS
- resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
- that use an NFS root filesystem.
-
- Note that the kernel will not synchronise the system time with any NTP
- servers it discovers; this is the responsibility of a user space process
- (e.g. an initrd/initramfs script that passes the IP addresses listed in
- /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real
- root filesystem if it is on NFS).
-
-
-nfsrootdebug
-
- This parameter enables debugging messages to appear in the kernel
- log at boot time so that administrators can verify that the correct
- NFS mount options, server address, and root path are passed to the
- NFS client.
-
-
-rdinit=<executable file>
-
- To specify which file contains the program that starts system
- initialization, administrators can use this command line parameter.
- The default value of this parameter is "/init". If the specified
- file exists and the kernel can execute it, root filesystem related
- kernel command line parameters, including `nfsroot=', are ignored.
-
- A description of the process of mounting the root file system can be
- found in:
-
- Documentation/driver-api/early-userspace/early_userspace_support.rst
-
-
-
-
-3.) Boot Loader
- ----------
-
-To get the kernel into memory different approaches can be used.
-They depend on various facilities being available:
-
-
-3.1) Booting from a floppy using syslinux
-
- When building kernels, an easy way to create a boot floppy that uses
- syslinux is to use the zdisk or bzdisk make targets which use zimage
- and bzimage images respectively. Both targets accept the
- FDARGS parameter which can be used to set the kernel command line.
-
- e.g.
- make bzdisk FDARGS="root=/dev/nfs"
-
- Note that the user running this command will need to have
- access to the floppy drive device, /dev/fd0
-
- For more information on syslinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
- N.B: Previously it was possible to write a kernel directly to
- a floppy using dd, configure the boot device using rdev, and
- boot using the resulting floppy. Linux no longer supports this
- method of booting.
-
-3.2) Booting from a cdrom using isolinux
-
- When building kernels, an easy way to create a bootable cdrom that
- uses isolinux is to use the isoimage target which uses a bzimage
- image. Like zdisk and bzdisk, this target accepts the FDARGS
- parameter which can be used to set the kernel command line.
-
- e.g.
- make isoimage FDARGS="root=/dev/nfs"
-
- The resulting iso image will be arch/<ARCH>/boot/image.iso
- This can be written to a cdrom using a variety of tools including
- cdrecord.
-
- e.g.
- cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso
-
- For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
-3.2) Using LILO
- When using LILO all the necessary command line parameters may be
- specified using the 'append=' directive in the LILO configuration
- file.
-
- However, to use the 'root=' directive you also need to create
- a dummy root device, which may be removed after LILO is run.
-
- mknod /dev/boot255 c 0 255
-
- For information on configuring LILO, please refer to its documentation.
-
-3.3) Using GRUB
- When using GRUB, kernel parameter are simply appended after the kernel
- specification: kernel <kernel> <parameters>
-
-3.4) Using loadlin
- loadlin may be used to boot Linux from a DOS command prompt without
- requiring a local hard disk to mount as root. This has not been
- thoroughly tested by the authors of this document, but in general
- it should be possible configure the kernel command line similarly
- to the configuration of LILO.
-
- Please refer to the loadlin documentation for further information.
-
-3.5) Using a boot ROM
- This is probably the most elegant way of booting a diskless client.
- With a boot ROM the kernel is loaded using the TFTP protocol. The
- authors of this document are not aware of any no commercial boot
- ROMs that support booting Linux over the network. However, there
- are two free implementations of a boot ROM, netboot-nfs and
- etherboot, both of which are available on sunsite.unc.edu, and both
- of which contain everything you need to boot a diskless Linux client.
-
-3.6) Using pxelinux
- Pxelinux may be used to boot linux using the PXE boot loader
- which is present on many modern network cards.
-
- When using pxelinux, the kernel image is specified using
- "kernel <relative-path-below /tftpboot>". The nfsroot parameters
- are passed to the kernel by adding them to the "append" line.
- It is common to use serial console in conjunction with pxeliunx,
- see Documentation/admin-guide/serial-console.rst for more information.
-
- For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
-
-
-
-4.) Credits
- -------
-
- The nfsroot code in the kernel and the RARP support have been written
- by Gero Kuhlmann <gero@gkminix.han.de>.
-
- The rest of the IP layer autoconfiguration code has been written
- by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
-
- In order to write the initial version of nfsroot I would like to thank
- Jens-Uwe Mager <jum@anubis.han.de> for his help.
diff --git a/Documentation/filesystems/nfs/pnfs-block-server.txt b/Documentation/filesystems/nfs/pnfs-block-server.txt
deleted file mode 100644
index 2143673..0000000
--- a/Documentation/filesystems/nfs/pnfs-block-server.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-pNFS block layout server user guide
-
-The Linux NFS server now supports the pNFS block layout extension. In this
-case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition
-to handling all the metadata access to the NFS export also hands out layouts
-to the clients to directly access the underlying block devices that are
-shared with the client.
-
-To use pNFS block layouts with with the Linux NFS server the exported file
-system needs to support the pNFS block layouts (currently just XFS), and the
-file system must sit on shared storage (typically iSCSI) that is accessible
-to the clients in addition to the MDS. As of now the file system needs to
-sit directly on the exported volume, striping or concatenation of
-volumes on the MDS and clients is not supported yet.
-
-On the server, pNFS block volume support is automatically if the file system
-support it. On the client make sure the kernel has the CONFIG_PNFS_BLOCK
-option enabled, the blkmapd daemon from nfs-utils is running, and the
-file system is mounted using the NFSv4.1 protocol version (mount -o vers=4.1).
-
-If the nfsd server needs to fence a non-responding client it calls
-/sbin/nfsd-recall-failed with the first argument set to the IP address of
-the client, and the second argument set to the device node without the /dev
-prefix for the file system to be fenced. Below is an example file that shows
-how to translate the device into a serial number from SCSI EVPD 0x80:
-
-cat > /sbin/nfsd-recall-failed << EOF
-#!/bin/sh
-
-CLIENT="$1"
-DEV="/dev/$2"
-EVPD=`sg_inq --page=0x80 ${DEV} | \
- grep "Unit serial number:" | \
- awk -F ': ' '{print $2}'`
-
-echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log
-EOF
diff --git a/Documentation/filesystems/nfs/pnfs-scsi-server.txt b/Documentation/filesystems/nfs/pnfs-scsi-server.txt
deleted file mode 100644
index 5bef726..0000000
--- a/Documentation/filesystems/nfs/pnfs-scsi-server.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-
-pNFS SCSI layout server user guide
-==================================
-
-This document describes support for pNFS SCSI layouts in the Linux NFS server.
-With pNFS SCSI layouts, the NFS server acts as Metadata Server (MDS) for pNFS,
-which in addition to handling all the metadata access to the NFS export,
-also hands out layouts to the clients so that they can directly access the
-underlying SCSI LUNs that are shared with the client.
-
-To use pNFS SCSI layouts with with the Linux NFS server, the exported file
-system needs to support the pNFS SCSI layouts (currently just XFS), and the
-file system must sit on a SCSI LUN that is accessible to the clients in
-addition to the MDS. As of now the file system needs to sit directly on the
-exported LUN, striping or concatenation of LUNs on the MDS and clients
-is not supported yet.
-
-On a server built with CONFIG_NFSD_SCSI, the pNFS SCSI volume support is
-automatically enabled if the file system is exported using the "pnfs"
-option and the underlying SCSI device support persistent reservations.
-On the client make sure the kernel has the CONFIG_PNFS_BLOCK option
-enabled, and the file system is mounted using the NFSv4.1 protocol
-version (mount -o vers=4.1).
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.rst
similarity index 87%
rename from Documentation/filesystems/nfs/pnfs.txt
rename to Documentation/filesystems/nfs/pnfs.rst
index 80dc0bd..7c470ec 100644
--- a/Documentation/filesystems/nfs/pnfs.txt
+++ b/Documentation/filesystems/nfs/pnfs.rst
@@ -1,15 +1,17 @@
-Reference counting in pnfs:
+==========================
+Reference counting in pnfs
==========================
The are several inter-related caches. We have layouts which can
reference multiple devices, each of which can reference multiple data servers.
Each data server can be referenced by multiple devices. Each device
-can be referenced by multiple layouts. To keep all of this straight,
+can be referenced by multiple layouts. To keep all of this straight,
we need to reference count.
struct pnfs_layout_hdr
-----------------------
+======================
+
The on-the-wire command LAYOUTGET corresponds to struct
pnfs_layout_segment, usually referred to by the variable name lseg.
Each nfs_inode may hold a pointer to a cache of these layout
@@ -25,7 +27,8 @@
keeps it in the list.
deviceid_cache
---------------
+==============
+
lsegs reference device ids, which are resolved per nfs_client and
layout driver type. The device ids are held in a RCU cache (struct
nfs4_deviceid_cache). The cache itself is referenced across each
@@ -38,24 +41,26 @@
deviceid's per filesystem, and multiple filesystems per nfs_client.
The hash code is copied from the nfsd code base. A discussion of
-hashing and variations of this algorithm can be found at:
-http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809
+hashing and variations of this algorithm can be found `here.
+<http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809>`_
data server cache
------------------
+=================
+
file driver devices refer to data servers, which are kept in a module
level cache. Its reference is held over the lifetime of the deviceid
pointing to it.
lseg
-----
+====
+
lseg maintains an extra reference corresponding to the NFS_LSEG_VALID
bit which holds it in the pnfs_layout_hdr's list. When the final lseg
is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED
bit is set, preventing any new lsegs from being added.
layout drivers
---------------
+==============
PNFS utilizes what is called layout drivers. The STD defines 4 basic
layout types: "files", "objects", "blocks", and "flexfiles". For each
@@ -68,6 +73,6 @@
Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
blocks-layout setup
--------------------
+===================
TODO: Document the setup needs of the blocks layout driver
diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.rst
similarity index 66%
rename from Documentation/filesystems/nfs/rpc-cache.txt
rename to Documentation/filesystems/nfs/rpc-cache.rst
index c4dac82..bb164ee 100644
--- a/Documentation/filesystems/nfs/rpc-cache.txt
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -1,9 +1,14 @@
- This document gives a brief introduction to the caching
+=========
+RPC Cache
+=========
+
+This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.
-CACHES
+Caches
======
+
The caching replaces the old exports table and allows for
a wide variety of values to be caches.
@@ -12,6 +17,7 @@
of common code for managing these caches.
Examples of caches that are likely to be needed are:
+
- mapping from IP address to client name
- mapping from client name and filesystem to export options
- mapping from UID to list of GIDs, to work around NFS's limitation
@@ -21,6 +27,7 @@
- mapping from network identify to public key for crypto authentication.
The common code handles such things as:
+
- general cache lookup with correct locking
- supporting 'NEGATIVE' as well as positive entries
- allowing an EXPIRED time on cache items, and removing
@@ -35,60 +42,66 @@
Creating a Cache
----------------
-1/ A cache needs a datum to store. This is in the form of a
- structure definition that must contain a
- struct cache_head
+- A cache needs a datum to store. This is in the form of a
+ structure definition that must contain a struct cache_head
as an element, usually the first.
It will also contain a key and some content.
Each cache element is reference counted and contains
expiry and update times for use in cache management.
-2/ A cache needs a "cache_detail" structure that
+- A cache needs a "cache_detail" structure that
describes the cache. This stores the hash table, some
parameters for cache management, and some operations detailing how
to work with particular cache items.
- The operations requires are:
- struct cache_head *alloc(void)
- This simply allocates appropriate memory and returns
- a pointer to the cache_detail embedded within the
- structure
- void cache_put(struct kref *)
- This is called when the last reference to an item is
- dropped. The pointer passed is to the 'ref' field
- in the cache_head. cache_put should release any
- references create by 'cache_init' and, if CACHE_VALID
- is set, any references created by cache_update.
- It should then release the memory allocated by
- 'alloc'.
- int match(struct cache_head *orig, struct cache_head *new)
- test if the keys in the two structures match. Return
- 1 if they do, 0 if they don't.
- void init(struct cache_head *orig, struct cache_head *new)
- Set the 'key' fields in 'new' from 'orig'. This may
- include taking references to shared objects.
- void update(struct cache_head *orig, struct cache_head *new)
- Set the 'content' fileds in 'new' from 'orig'.
- int cache_show(struct seq_file *m, struct cache_detail *cd,
- struct cache_head *h)
- Optional. Used to provide a /proc file that lists the
- contents of a cache. This should show one item,
- usually on just one line.
- int cache_request(struct cache_detail *cd, struct cache_head *h,
- char **bpp, int *blen)
- Format a request to be send to user-space for an item
- to be instantiated. *bpp is a buffer of size *blen.
- bpp should be moved forward over the encoded message,
- and *blen should be reduced to show how much free
- space remains. Return 0 on success or <0 if not
- enough room or other problem.
- int cache_parse(struct cache_detail *cd, char *buf, int len)
- A message from user space has arrived to fill out a
- cache entry. It is in 'buf' of length 'len'.
- cache_parse should parse this, find the item in the
- cache with sunrpc_cache_lookup_rcu, and update the item
- with sunrpc_cache_update.
+
+ The operations are:
+
+ struct cache_head \*alloc(void)
+ This simply allocates appropriate memory and returns
+ a pointer to the cache_detail embedded within the
+ structure
+
+ void cache_put(struct kref \*)
+ This is called when the last reference to an item is
+ dropped. The pointer passed is to the 'ref' field
+ in the cache_head. cache_put should release any
+ references create by 'cache_init' and, if CACHE_VALID
+ is set, any references created by cache_update.
+ It should then release the memory allocated by
+ 'alloc'.
+
+ int match(struct cache_head \*orig, struct cache_head \*new)
+ test if the keys in the two structures match. Return
+ 1 if they do, 0 if they don't.
+
+ void init(struct cache_head \*orig, struct cache_head \*new)
+ Set the 'key' fields in 'new' from 'orig'. This may
+ include taking references to shared objects.
+
+ void update(struct cache_head \*orig, struct cache_head \*new)
+ Set the 'content' fileds in 'new' from 'orig'.
+
+ int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
+ Optional. Used to provide a /proc file that lists the
+ contents of a cache. This should show one item,
+ usually on just one line.
+
+ int cache_request(struct cache_detail \*cd, struct cache_head \*h, char \*\*bpp, int \*blen)
+ Format a request to be send to user-space for an item
+ to be instantiated. \*bpp is a buffer of size \*blen.
+ bpp should be moved forward over the encoded message,
+ and \*blen should be reduced to show how much free
+ space remains. Return 0 on success or <0 if not
+ enough room or other problem.
+
+ int cache_parse(struct cache_detail \*cd, char \*buf, int len)
+ A message from user space has arrived to fill out a
+ cache entry. It is in 'buf' of length 'len'.
+ cache_parse should parse this, find the item in the
+ cache with sunrpc_cache_lookup_rcu, and update the item
+ with sunrpc_cache_update.
-3/ A cache needs to be registered using cache_register(). This
+- A cache needs to be registered using cache_register(). This
includes it on a list of caches that will be regularly
cleaned to discard old data.
@@ -107,7 +120,7 @@
call is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid;
-cache_check can be passed a "struct cache_req *". This structure is
+cache_check can be passed a "struct cache_req\*". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not uptodate, but the is reason to
@@ -139,9 +152,11 @@
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
+
- a key
- an expiry time
- a content.
+
with the intention that an item in the cache with the give key
should be create or updated to have the given content, and the
expiry time should be set on that item.
@@ -156,7 +171,8 @@
select or poll for read will block waiting for another request to be
added.
-Thus a user-space helper is likely to:
+Thus a user-space helper is likely to::
+
open the channel.
select for readable
read a request
@@ -175,12 +191,13 @@
takes a cache item and encodes a request into the buffer
provided.
-Note: If a cache has no active readers on the channel, and has had not
-active readers for more than 60 seconds, further requests will not be
-added to the channel but instead all lookups that do not find a valid
-entry will fail. This is partly for backward compatibility: The
-previous nfs exports table was deemed to be authoritative and a
-failed lookup meant a definite 'no'.
+.. note::
+ If a cache has no active readers on the channel, and has had not
+ active readers for more than 60 seconds, further requests will not be
+ added to the channel but instead all lookups that do not find a valid
+ entry will fail. This is partly for backward compatibility: The
+ previous nfs exports table was deemed to be authoritative and a
+ failed lookup meant a definite 'no'.
request/response format
-----------------------
@@ -193,10 +210,11 @@
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
much be quoted. two mechanisms are available:
-1/ If a field begins '\x' then it must contain an even number of
+
+- If a field begins '\x' then it must contain an even number of
hex digits, and pairs of these digits provide the bytes in the
field.
-2/ otherwise a \ in the field must be followed by 3 octal digits
+- otherwise a \ in the field must be followed by 3 octal digits
which give the code for a byte. Other characters are treated
as them selves. At the very least, space, newline, nul, and
'\' must be quoted in this way.
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.rst
similarity index 86%
rename from Documentation/filesystems/nfs/rpc-server-gss.txt
rename to Documentation/filesystems/nfs/rpc-server-gss.rst
index 310bbba..ccaea9e 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.txt
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -1,4 +1,4 @@
-
+=========================================
rpcsec_gss support for kernel RPC servers
=========================================
@@ -9,14 +9,16 @@
purposes of authentication.)
RPCGSS is specified in a few IETF documents:
- - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
-and there is a 3rd version being proposed:
- - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
- (At draft n. 02 at the time of writing)
+
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
+
+There is a third version that we don't currently implement:
+
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
Background
-----------
+==========
The RPCGSS Authentication method describes a way to perform GSSAPI
Authentication for NFS. Although GSSAPI is itself completely mechanism
@@ -29,6 +31,7 @@
GSSAPI is a complex library, and implementing it completely in kernel is
unwarranted. However GSSAPI operations are fundementally separable in 2
parts:
+
- initial context establishment
- integrity/privacy protection (signing and encrypting of individual
packets)
@@ -41,7 +44,7 @@
need upcalls to request userspace to perform context establishment.
NFS Server Legacy Upcall Mechanism
-----------------------------------
+==================================
The classic upcall mechanism uses a custom text based upcall mechanism
to talk to a custom daemon called rpc.svcgssd that is provide by the
@@ -62,21 +65,20 @@
back to the kernel (4KiB).
NFS Server New RPC Upcall Mechanism
------------------------------------
+===================================
The newer upcall mechanism uses RPC over a unix socket to a daemon
called gss-proxy, implemented by a userspace program called Gssproxy.
-The gss_proxy RPC protocol is currently documented here:
-
- https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation
+The gss_proxy RPC protocol is currently documented `here
+<https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation>`_.
This upcall mechanism uses the kernel rpc client and connects to the gssproxy
userspace program over a regular unix socket. The gssproxy protocol does not
suffer from the size limitations of the legacy protocol.
Negotiating Upcall Mechanisms
------------------------------
+=============================
To provide backward compatibility, the kernel defaults to using the
legacy mechanism. To switch to the new mechanism, gss-proxy must bind
diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.rst
similarity index 89%
rename from Documentation/filesystems/nilfs2.txt
rename to Documentation/filesystems/nilfs2.rst
index f2f3f85..6c49f04 100644
--- a/Documentation/filesystems/nilfs2.txt
+++ b/Documentation/filesystems/nilfs2.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
NILFS2
-------
+======
NILFS2 is a log-structured file system (LFS) supporting continuous
snapshotting. In addition to versioning capability of the entire file
@@ -25,9 +28,9 @@
cleaner or garbage collector) are required. Details on the tools are
described in the man pages included in the package.
-Project web page: https://nilfs.sourceforge.io/
-Download page: https://nilfs.sourceforge.io/en/download.html
-List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
+:Project web page: https://nilfs.sourceforge.io/
+:Download page: https://nilfs.sourceforge.io/en/download.html
+:List info: http://vger.kernel.org/vger-lists.html#linux-nilfs
Caveats
=======
@@ -47,6 +50,7 @@
NILFS2 supports the following mount options:
(*) == default
+======================= =======================================================
barrier(*) This enables/disables the use of write barriers. This
nobarrier requires an IO stack which can support barriers, and
if nilfs gets an error on a barrier write, it will
@@ -79,6 +83,7 @@
nodiscard(*) The discard/TRIM commands are sent to the underlying
block device when blocks are freed. This is useful
for SSD devices and sparse/thinly-provisioned LUNs.
+======================= =======================================================
Ioctls
======
@@ -87,9 +92,11 @@
through the system call interfaces. The list of all NILFS2 specific ioctls are
shown in the table below.
-Table of NILFS2 specific ioctls
-..............................................................................
+Table of NILFS2 specific ioctls:
+
+ ============================== ===============================================
Ioctl Description
+ ============================== ===============================================
NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between
checkpoint and snapshot state. This ioctl is
used in chcp and mkcp utilities.
@@ -142,11 +149,12 @@
NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and
upper limit of segments in bytes. This ioctl
is used by nilfs_resize utility.
+ ============================== ===============================================
NILFS2 usage
============
-To use nilfs2 as a local file system, simply:
+To use nilfs2 as a local file system, simply::
# mkfs -t nilfs2 /dev/block_device
# mount -t nilfs2 /dev/block_device /dir
@@ -157,18 +165,20 @@
Checkpoints and snapshots are managed by the following commands.
Their manpages are included in the nilfs-utils package above.
+ ==== ===========================================================
lscp list checkpoints or snapshots.
mkcp make a checkpoint or a snapshot.
chcp change an existing checkpoint to a snapshot or vice versa.
rmcp invalidate specified checkpoint(s).
+ ==== ===========================================================
-To mount a snapshot,
+To mount a snapshot::
# mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir
where <cno> is the checkpoint number of the snapshot.
-To unmount the NILFS2 mount point or snapshot, simply:
+To unmount the NILFS2 mount point or snapshot, simply::
# umount /dir
@@ -181,7 +191,7 @@
A nilfs2 volume is equally divided into a number of segments except
for the super block (SB) and segment #0. A segment is the container
of logs. Each log is composed of summary information blocks, payload
-blocks, and an optional super root block (SR):
+blocks, and an optional super root block (SR)::
______________________________________________________
| |SB| | Segment | Segment | Segment | ... | Segment | |
@@ -200,7 +210,7 @@
|_blocks__|_________________|__|
The payload blocks are organized per file, and each file consists of
-data blocks and B-tree node blocks:
+data blocks and B-tree node blocks::
|<--- File-A --->|<--- File-B --->|
_______________________________________________________________
@@ -213,7 +223,7 @@
The organization of the blocks is recorded in the summary information
blocks, which contains a header structure (nilfs_segment_summary), per
-file structures (nilfs_finfo), and per block structures (nilfs_binfo):
+file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
_________________________________________________________________________
| Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |...
@@ -223,7 +233,7 @@
The logs include regular files, directory files, symbolic link files
and several meta data files. The mata data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
-the following meta data files:
+the following meta data files::
1) Inode file (ifile) -- Stores on-disk inodes
2) Checkpoint file (cpfile) -- Stores checkpoints
@@ -232,7 +242,7 @@
(DAT) block numbers. This file serves to
make on-disk blocks relocatable.
-The following figure shows a typical organization of the logs:
+The following figure shows a typical organization of the logs::
_________________________________________________________________________
| Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR|
@@ -250,7 +260,7 @@
of regular files, directories, symlinks and other special files, are
included in the ifile. The inode of ifile itself is included in the
corresponding checkpoint entry in the cpfile. Thus, the hierarchy
-among NILFS2 files can be depicted as follows:
+among NILFS2 files can be depicted as follows::
Super block (SB)
|
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.rst
similarity index 85%
rename from Documentation/filesystems/ntfs.txt
rename to Documentation/filesystems/ntfs.rst
index 553f10d..5bb093a 100644
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.rst
@@ -1,19 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
The Linux NTFS filesystem driver
================================
-Table of contents
-=================
+.. Table of contents
-- Overview
-- Web site
-- Features
-- Supported mount options
-- Known bugs and (mis-)features
-- Using NTFS volume and stripe sets
- - The Device-Mapper driver
- - The Software RAID / MD driver
- - Limitations when using the MD driver
+ - Overview
+ - Web site
+ - Features
+ - Supported mount options
+ - Known bugs and (mis-)features
+ - Using NTFS volume and stripe sets
+ - The Device-Mapper driver
+ - The Software RAID / MD driver
+ - Limitations when using the MD driver
Overview
@@ -66,8 +68,10 @@
partition by creating a large file while in Windows and then loopback
mounting the file while in Linux and creating a Linux filesystem on it that
is used to install Linux on it.
-- A comparison of the two drivers using:
+- A comparison of the two drivers using::
+
time find . -type f -exec md5sum "{}" \;
+
run three times in sequence with each driver (after a reboot) on a 1.4GiB
NTFS partition, showed the new driver to be 20% faster in total time elapsed
(from 9:43 minutes on average down to 7:53). The time spent in user space
@@ -104,6 +108,7 @@
mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
following mount options:
+======================= =======================================================
iocharset=name Deprecated option. Still supported but please use
nls=name in the future. See description for nls=name.
@@ -175,16 +180,22 @@
errors=opt What to do when critical filesystem errors are found.
Following values can be used for "opt":
- continue: DEFAULT, try to clean-up as much as
+
+ ======== =========================================
+ continue DEFAULT, try to clean-up as much as
possible, e.g. marking a corrupt inode as
bad so it is no longer accessed, and then
continue.
- recover: At present only supported is recovery of
+ recover At present only supported is recovery of
the boot sector from the backup copy.
If read-only mount, the recovery is done
in memory only and not written to disk.
- Note that the options are additive, i.e. specifying:
+ ======== =========================================
+
+ Note that the options are additive, i.e. specifying::
+
errors=continue,errors=recover
+
means the driver will attempt to recover and if that
fails it will clean-up as much as possible and
continue.
@@ -202,12 +213,18 @@
In general use the default. If you have a lot of small
files then use a higher value. The values have the
following meaning:
+
+ ===== =================================
Value MFT zone size (% of volume size)
+ ===== =================================
1 12.5%
2 25%
3 37.5%
4 50%
+ ===== =================================
+
Note this option is irrelevant for read-only mounts.
+======================= =======================================================
Known bugs and (mis-)features
@@ -252,18 +269,18 @@
components and their sizes in sectors, i.e. multiples of 512-byte blocks.
For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
-example if one of your partitions is /dev/hda2 you would do:
+example if one of your partitions is /dev/hda2 you would do::
-$ fdisk -ul /dev/hda
+ $ fdisk -ul /dev/hda
-Disk /dev/hda: 81.9 GB, 81964302336 bytes
-255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
-Units = sectors of 1 * 512 = 512 bytes
+ Disk /dev/hda: 81.9 GB, 81964302336 bytes
+ 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
+ Units = sectors of 1 * 512 = 512 bytes
- Device Boot Start End Blocks Id System
- /dev/hda1 * 63 4209029 2104483+ 83 Linux
- /dev/hda2 4209030 37768814 16779892+ 86 NTFS
- /dev/hda3 37768815 46170809 4200997+ 83 Linux
+ Device Boot Start End Blocks Id System
+ /dev/hda1 * 63 4209029 2104483+ 83 Linux
+ /dev/hda2 4209030 37768814 16779892+ 86 NTFS
+ /dev/hda3 37768815 46170809 4200997+ 83 Linux
And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
33559785 sectors.
@@ -271,15 +288,17 @@
For Win2k and later dynamic disks, you can for example use the ldminfo utility
which is part of the Linux LDM tools (the latest version at the time of
writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
+
http://www.linux-ntfs.org/
+
Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
able to compile this yourself easily so use the binary version!
-Then you would use ldminfo in dump mode to obtain the necessary information:
+Then you would use ldminfo in dump mode to obtain the necessary information::
-$ ./ldminfo --dump /dev/hda
+ $ ./ldminfo --dump /dev/hda
This would dump the LDM database found on /dev/hda which describes all of your
dynamic disks and all the volumes on them. At the bottom you will see the
@@ -305,42 +324,36 @@
Assuming you know all your devices and their sizes things are easy.
For a linear raid the table would look like this (note all values are in
-512-byte sectors):
+512-byte sectors)::
---- cut here ---
-# Offset into Size of this Raid type Device Start sector
-# volume device of device
-0 1028161 linear /dev/hda1 0
-1028161 3903762 linear /dev/hdb2 0
-4931923 2103211 linear /dev/hdc1 0
---- cut here ---
+ # Offset into Size of this Raid type Device Start sector
+ # volume device of device
+ 0 1028161 linear /dev/hda1 0
+ 1028161 3903762 linear /dev/hdb2 0
+ 4931923 2103211 linear /dev/hdc1 0
For a striped volume, i.e. raid level 0, you will need to know the chunk size
you used when creating the volume. Windows uses 64kiB as the default, so it
will probably be this unless you changes the defaults when creating the array.
For a raid level 0 the table would look like this (note all values are in
-512-byte sectors):
+512-byte sectors)::
---- cut here ---
-# Offset Size Raid Number Chunk 1st Start 2nd Start
-# into of the type of size Device in Device in
-# volume volume stripes device device
-0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
---- cut here ---
+ # Offset Size Raid Number Chunk 1st Start 2nd Start
+ # into of the type of size Device in Device in
+ # volume volume stripes device device
+ 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
If there are more than two devices, just add each of them to the end of the
line.
Finally, for a mirrored volume, i.e. raid level 1, the table would look like
-this (note all values are in 512-byte sectors):
+this (note all values are in 512-byte sectors)::
---- cut here ---
-# Ofs Size Raid Log Number Region Should Number Source Start Target Start
-# in of the type type of log size sync? of Device in Device in
-# vol volume params mirrors Device Device
-0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
---- cut here ---
+ # Ofs Size Raid Log Number Region Should Number Source Start Target Start
+ # in of the type type of log size sync? of Device in Device in
+ # vol volume params mirrors Device Device
+ 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
If you are mirroring to multiple devices you can specify further targets at the
end of the line.
@@ -353,17 +366,17 @@
them.
Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
-and hand it over to dmsetup to work with, like so:
+and hand it over to dmsetup to work with, like so::
-$ dmsetup create myvolume1 /etc/ntfsvolume1
+ $ dmsetup create myvolume1 /etc/ntfsvolume1
You can obviously replace "myvolume1" with whatever name you like.
If it all worked, you will now have the device /dev/device-mapper/myvolume1
which you can then just use as an argument to the mount command as usual to
-mount the ntfs volume. For example:
+mount the ntfs volume. For example::
-$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
+ $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
(You need to create the directory /mnt/myvol1 first and of course you can use
anything you like instead of /mnt/myvol1 as long as it is an existing
@@ -395,18 +408,18 @@
"chunk-size 64k" option for each raid-disk, too.
For example, if you have a stripe set consisting of two partitions /dev/hda5
-and /dev/hdb1 your /etc/raidtab would look like this:
+and /dev/hdb1 your /etc/raidtab would look like this::
-raiddev /dev/md0
- raid-level 0
- nr-raid-disks 2
- nr-spare-disks 0
- persistent-superblock 0
- chunk-size 64k
- device /dev/hda5
- raid-disk 0
- device /dev/hdb1
- raid-disk 1
+ raiddev /dev/md0
+ raid-level 0
+ nr-raid-disks 2
+ nr-spare-disks 0
+ persistent-superblock 0
+ chunk-size 64k
+ device /dev/hda5
+ raid-disk 0
+ device /dev/hdb1
+ raid-disk 1
For linear raid, just change the raid-level above to "raid-level linear", for
mirrors, change it to "raid-level 1", and for stripe sets with parity, change
@@ -427,7 +440,9 @@
raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
Then just use the mount command as usual to mount the ntfs volume using for
-example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
+example::
+
+ mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
It is advisable to do the mount read-only to see if the md volume has been
setup correctly to avoid the possibility of causing damage to the data on the
diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.rst
similarity index 77%
rename from Documentation/filesystems/ocfs2-online-filecheck.txt
rename to Documentation/filesystems/ocfs2-online-filecheck.rst
index 139fab1..2257bb5 100644
--- a/Documentation/filesystems/ocfs2-online-filecheck.txt
+++ b/Documentation/filesystems/ocfs2-online-filecheck.rst
@@ -1,5 +1,8 @@
- OCFS2 online file check
- -----------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+OCFS2 file system - online file check
+=====================================
This document will describe OCFS2 online file check feature.
@@ -40,7 +43,7 @@
by the inode number which caused the error. This inode number would be the
input to check/fix the file.
-There is a sysfs directory for each OCFS2 file system mounting:
+There is a sysfs directory for each OCFS2 file system mounting::
/sys/fs/ocfs2/<devname>/filecheck
@@ -50,34 +53,36 @@
fixed. Currently, three operations are supported, which includes checking
inode, fixing inode and setting the size of result record history.
-1. If you want to know what error exactly happened to <inode> before fixing, do
+1. If you want to know what error exactly happened to <inode> before fixing, do::
- # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
- # cat /sys/fs/ocfs2/<devname>/filecheck/check
+ # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
+ # cat /sys/fs/ocfs2/<devname>/filecheck/check
-The output is like this:
- INO DONE ERROR
-39502 1 GENERATION
+The output is like this::
-<INO> lists the inode numbers.
-<DONE> indicates whether the operation has been finished.
-<ERROR> says what kind of errors was found. For the detailed error numbers,
-please refer to the file linux/fs/ocfs2/filecheck.h.
+ INO DONE ERROR
+ 39502 1 GENERATION
-2. If you determine to fix this inode, do
+ <INO> lists the inode numbers.
+ <DONE> indicates whether the operation has been finished.
+ <ERROR> says what kind of errors was found. For the detailed error numbers,
+ please refer to the file linux/fs/ocfs2/filecheck.h.
- # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
- # cat /sys/fs/ocfs2/<devname>/filecheck/fix
+2. If you determine to fix this inode, do::
-The output is like this:
- INO DONE ERROR
-39502 1 SUCCESS
+ # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
+ # cat /sys/fs/ocfs2/<devname>/filecheck/fix
+
+The output is like this:::
+
+ INO DONE ERROR
+ 39502 1 SUCCESS
This time, the <ERROR> column indicates whether this fix is successful or not.
3. The record cache is used to store the history of check/fix results. It's
default size is 10, and can be adjust between the range of 10 ~ 100. You can
-adjust the size like this:
+adjust the size like this::
# echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.rst
similarity index 86%
rename from Documentation/filesystems/ocfs2.txt
rename to Documentation/filesystems/ocfs2.rst
index 4c49e54..42ca9a3 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.rst
@@ -1,5 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
OCFS2 filesystem
-==================
+================
+
OCFS2 is a general purpose extent based shared disk cluster file
system with many similarities to ext3. It supports 64 bit inode
numbers, and has automatically extending metadata groups which may
@@ -10,26 +14,30 @@
Project web page: http://ocfs2.wiki.kernel.org
Tools git tree: https://github.com/markfasheh/ocfs2-tools
-OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/
All code copyright 2005 Oracle except when otherwise noted.
-CREDITS:
+Credits
+=======
+
Lots of code taken from ext3 and other projects.
Authors in alphabetical order:
-Joel Becker <joel.becker@oracle.com>
-Zach Brown <zach.brown@oracle.com>
-Mark Fasheh <mfasheh@suse.com>
-Kurt Hackel <kurt.hackel@oracle.com>
-Tao Ma <tao.ma@oracle.com>
-Sunil Mushran <sunil.mushran@oracle.com>
-Manish Singh <manish.singh@oracle.com>
-Tiger Yang <tiger.yang@oracle.com>
+
+- Joel Becker <joel.becker@oracle.com>
+- Zach Brown <zach.brown@oracle.com>
+- Mark Fasheh <mfasheh@suse.com>
+- Kurt Hackel <kurt.hackel@oracle.com>
+- Tao Ma <tao.ma@oracle.com>
+- Sunil Mushran <sunil.mushran@oracle.com>
+- Manish Singh <manish.singh@oracle.com>
+- Tiger Yang <tiger.yang@oracle.com>
Caveats
=======
Features which OCFS2 does not support yet:
+
- Directory change notification (F_NOTIFY)
- Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease)
@@ -37,8 +45,10 @@
=============
OCFS2 supports the following mount options:
+
(*) == default
+======================= ========================================================
barrier=1 This enables/disables barriers. barrier=0 disables it,
barrier=1 enables it.
errors=remount-ro(*) Remount the filesystem read-only on an error.
@@ -104,3 +114,4 @@
for descriptor blocks. If enabled older kernels cannot
mount the device. This will enable 'journal_checksum'
internally.
+======================= ========================================================
diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst
new file mode 100644
index 0000000..a104c25
--- /dev/null
+++ b/Documentation/filesystems/omfs.rst
@@ -0,0 +1,112 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Optimized MPEG Filesystem (OMFS)
+================================
+
+Overview
+========
+
+OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
+and Rio Karma MP3 player. The filesystem is extent-based, utilizing
+block sizes from 2k to 8k, with hash-based directories. This
+filesystem driver may be used to read and write disks from these
+devices.
+
+Note, it is not recommended that this FS be used in place of a general
+filesystem for your own streaming media device. Native Linux filesystems
+will likely perform better.
+
+More information is available at:
+
+ http://linux-karma.sf.net/
+
+Various utilities, including mkomfs and omfsck, are included with
+omfsprogs, available at:
+
+ https://bobcopeland.com/karma/
+
+Instructions are included in its README.
+
+Options
+=======
+
+OMFS supports the following mount-time options:
+
+ ============ ========================================
+ uid=n make all files owned by specified user
+ gid=n make all files owned by specified group
+ umask=xxx set permission umask to xxx
+ fmask=xxx set umask to xxx for files
+ dmask=xxx set umask to xxx for directories
+ ============ ========================================
+
+Disk format
+===========
+
+OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
+group consists of super block information, file metadata, directory structures,
+and extents. Each sysblock has a header containing CRCs of the entire
+sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
+have a smaller size than a data block, but since they are both addressed by the
+same 64-bit block number, any remaining space in the smaller sysblock is
+unused.
+
+Sysblock header information::
+
+ struct omfs_header {
+ __be64 h_self; /* FS block where this is located */
+ __be32 h_body_size; /* size of useful data after header */
+ __be16 h_crc; /* crc-ccitt of body_size bytes */
+ char h_fill1[2];
+ u8 h_version; /* version, always 1 */
+ char h_type; /* OMFS_INODE_X */
+ u8 h_magic; /* OMFS_IMAGIC */
+ u8 h_check_xor; /* XOR of header bytes before this */
+ __be32 h_fill2;
+ };
+
+Files and directories are both represented by omfs_inode::
+
+ struct omfs_inode {
+ struct omfs_header i_head; /* header */
+ __be64 i_parent; /* parent containing this inode */
+ __be64 i_sibling; /* next inode in hash bucket */
+ __be64 i_ctime; /* ctime, in milliseconds */
+ char i_fill1[35];
+ char i_type; /* OMFS_[DIR,FILE] */
+ __be32 i_fill2;
+ char i_fill3[64];
+ char i_name[OMFS_NAMELEN]; /* filename */
+ __be64 i_size; /* size of file, in bytes */
+ };
+
+Directories in OMFS are implemented as a large hash table. Filenames are
+hashed then prepended into the bucket list beginning at OMFS_DIR_START.
+Lookup requires hashing the filename, then seeking across i_sibling pointers
+until a match is found on i_name. Empty buckets are represented by block
+pointers with all-1s (~0).
+
+A file is an omfs_inode structure followed by an extent table beginning at
+OMFS_EXTENT_START::
+
+ struct omfs_extent_entry {
+ __be64 e_cluster; /* start location of a set of blocks */
+ __be64 e_blocks; /* number of blocks after e_cluster */
+ };
+
+ struct omfs_extent {
+ __be64 e_next; /* next extent table location */
+ __be32 e_extent_count; /* total # extents in this table */
+ __be32 e_fill;
+ struct omfs_extent_entry e_entry; /* start of extent entries */
+ };
+
+Each extent holds the block offset followed by number of blocks allocated to
+the extent. The final extent in each table is a terminator with e_cluster
+being ~0 and e_blocks being ones'-complement of the total number of blocks
+in the table.
+
+If this table overflows, a continuation inode is written and pointed to by
+e_next. These have a header but lack the rest of the inode structure.
+
diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt
deleted file mode 100644
index 1d0d41f..0000000
--- a/Documentation/filesystems/omfs.txt
+++ /dev/null
@@ -1,106 +0,0 @@
-Optimized MPEG Filesystem (OMFS)
-
-Overview
-========
-
-OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR
-and Rio Karma MP3 player. The filesystem is extent-based, utilizing
-block sizes from 2k to 8k, with hash-based directories. This
-filesystem driver may be used to read and write disks from these
-devices.
-
-Note, it is not recommended that this FS be used in place of a general
-filesystem for your own streaming media device. Native Linux filesystems
-will likely perform better.
-
-More information is available at:
-
- http://linux-karma.sf.net/
-
-Various utilities, including mkomfs and omfsck, are included with
-omfsprogs, available at:
-
- http://bobcopeland.com/karma/
-
-Instructions are included in its README.
-
-Options
-=======
-
-OMFS supports the following mount-time options:
-
- uid=n - make all files owned by specified user
- gid=n - make all files owned by specified group
- umask=xxx - set permission umask to xxx
- fmask=xxx - set umask to xxx for files
- dmask=xxx - set umask to xxx for directories
-
-Disk format
-===========
-
-OMFS discriminates between "sysblocks" and normal data blocks. The sysblock
-group consists of super block information, file metadata, directory structures,
-and extents. Each sysblock has a header containing CRCs of the entire
-sysblock, and may be mirrored in successive blocks on the disk. A sysblock may
-have a smaller size than a data block, but since they are both addressed by the
-same 64-bit block number, any remaining space in the smaller sysblock is
-unused.
-
-Sysblock header information:
-
-struct omfs_header {
- __be64 h_self; /* FS block where this is located */
- __be32 h_body_size; /* size of useful data after header */
- __be16 h_crc; /* crc-ccitt of body_size bytes */
- char h_fill1[2];
- u8 h_version; /* version, always 1 */
- char h_type; /* OMFS_INODE_X */
- u8 h_magic; /* OMFS_IMAGIC */
- u8 h_check_xor; /* XOR of header bytes before this */
- __be32 h_fill2;
-};
-
-Files and directories are both represented by omfs_inode:
-
-struct omfs_inode {
- struct omfs_header i_head; /* header */
- __be64 i_parent; /* parent containing this inode */
- __be64 i_sibling; /* next inode in hash bucket */
- __be64 i_ctime; /* ctime, in milliseconds */
- char i_fill1[35];
- char i_type; /* OMFS_[DIR,FILE] */
- __be32 i_fill2;
- char i_fill3[64];
- char i_name[OMFS_NAMELEN]; /* filename */
- __be64 i_size; /* size of file, in bytes */
-};
-
-Directories in OMFS are implemented as a large hash table. Filenames are
-hashed then prepended into the bucket list beginning at OMFS_DIR_START.
-Lookup requires hashing the filename, then seeking across i_sibling pointers
-until a match is found on i_name. Empty buckets are represented by block
-pointers with all-1s (~0).
-
-A file is an omfs_inode structure followed by an extent table beginning at
-OMFS_EXTENT_START:
-
-struct omfs_extent_entry {
- __be64 e_cluster; /* start location of a set of blocks */
- __be64 e_blocks; /* number of blocks after e_cluster */
-};
-
-struct omfs_extent {
- __be64 e_next; /* next extent table location */
- __be32 e_extent_count; /* total # extents in this table */
- __be32 e_fill;
- struct omfs_extent_entry e_entry; /* start of extent entries */
-};
-
-Each extent holds the block offset followed by number of blocks allocated to
-the extent. The final extent in each table is a terminator with e_cluster
-being ~0 and e_blocks being ones'-complement of the total number of blocks
-in the table.
-
-If this table overflows, a continuation inode is written and pointed to by
-e_next. These have a header but lack the rest of the inode structure.
-
diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.rst
similarity index 82%
rename from Documentation/filesystems/orangefs.txt
rename to Documentation/filesystems/orangefs.rst
index f4ba949..463e376 100644
--- a/Documentation/filesystems/orangefs.txt
+++ b/Documentation/filesystems/orangefs.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
ORANGEFS
========
@@ -21,43 +24,33 @@
* Stateless
-MAILING LIST ARCHIVES
+Mailing List Archives
=====================
http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/
-MAILING LIST SUBMISSIONS
+Mailing List Submissions
========================
devel@lists.orangefs.org
-DOCUMENTATION
+Documentation
=============
http://www.orangefs.org/documentation/
-
-USERSPACE FILESYSTEM SOURCE
-===========================
-
-http://www.orangefs.org/download
-
-Orangefs versions prior to 2.9.3 would not be compatible with the
-upstream version of the kernel client.
-
-
-RUNNING ORANGEFS ON A SINGLE SERVER
+Running ORANGEFS On a Single Server
===================================
OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.
-On Fedora, install orangefs and orangefs-server.
+On Fedora, install orangefs and orangefs-server::
-dnf -y install orangefs orangefs-server
+ dnf -y install orangefs orangefs-server
There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
@@ -70,29 +63,37 @@
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.
-Create the filesystem.
+Create the filesystem::
-pvfs2-server -f /etc/orangefs/orangefs.conf
+ pvfs2-server -f /etc/orangefs/orangefs.conf
-Start the server.
+Start the server::
-systemctl start orangefs-server
+ systemctl start orangefs-server
-Test the server.
+Test the server::
-pvfs2-ping -m /pvfsmnt
+ pvfs2-ping -m /pvfsmnt
Start the client. The module must be compiled in or loaded before this
-point.
+point::
-systemctl start orangefs-client
+ systemctl start orangefs-client
-Mount the filesystem.
+Mount the filesystem::
-mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+ mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+
+Userspace Filesystem Source
+===========================
+
+http://www.orangefs.org/download
+
+Orangefs versions prior to 2.9.3 would not be compatible with the
+upstream version of the kernel client.
-BUILDING ORANGEFS ON A SINGLE SERVER
+Building ORANGEFS on a Single Server
====================================
Where OrangeFS cannot be installed from distribution packages, it may be
@@ -102,49 +103,55 @@
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by
default, we will probably be changing the default to LMDB soon.
-./configure --prefix=/opt/ofs --with-db-backend=lmdb
+::
-make
+ ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint
-make install
+ make
-Create an orangefs config file.
+ make install
-/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
+Create an orangefs config file by running pvfs2-genconfig and
+specifying a target config file. Pvfs2-genconfig will prompt you
+through. Generally it works fine to take the defaults, but you
+should use your server's hostname, rather than "localhost" when
+it comes to that question::
-Create an /etc/pvfs2tab file.
+ /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf
-echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
- /etc/pvfs2tab
+Create an /etc/pvfs2tab file (localhost is fine)::
-Create the mount point you specified in the tab file if needed.
+ echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
+ /etc/pvfs2tab
-mkdir /pvfsmnt
+Create the mount point you specified in the tab file if needed::
-Bootstrap the server.
+ mkdir /pvfsmnt
-/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
+Bootstrap the server::
-Start the server.
+ /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf
-/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf
+Start the server::
+
+ /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf
Now the server should be running. Pvfs2-ls is a simple
-test to verify that the server is running.
+test to verify that the server is running::
-/opt/ofs/bin/pvfs2-ls /pvfsmnt
+ /opt/ofs/bin/pvfs2-ls /pvfsmnt
If stuff seems to be working, load the kernel module and
-turn on the client core.
+turn on the client core::
-/opt/ofs/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core
+ /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core
-Mount your filesystem.
+Mount your filesystem::
-mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt
+ mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt
-RUNNING XFSTESTS
+Running xfstests
================
It is useful to use a scratch filesystem with xfstests. This can be
@@ -159,21 +166,23 @@
This change should be made before creating the filesystem.
-pvfs2-server -f /etc/orangefs/orangefs.conf
+::
-To run xfstests, create /etc/xfsqa.config.
+ pvfs2-server -f /etc/orangefs/orangefs.conf
-TEST_DIR=/orangefs
-TEST_DEV=tcp://localhost:3334/orangefs
-SCRATCH_MNT=/scratch
-SCRATCH_DEV=tcp://localhost:3334/scratch
+To run xfstests, create /etc/xfsqa.config::
-Then xfstests can be run
+ TEST_DIR=/orangefs
+ TEST_DEV=tcp://localhost:3334/orangefs
+ SCRATCH_MNT=/scratch
+ SCRATCH_DEV=tcp://localhost:3334/scratch
-./check -pvfs2
+Then xfstests can be run::
+
+ ./check -pvfs2
-OPTIONS
+Options
=======
The following mount options are accepted:
@@ -193,32 +202,32 @@
Distributed locking is being worked on for the future.
-DEBUGGING
+Debugging
=========
If you want the debug (GOSSIP) statements in a particular
-source file (inode.c for example) go to syslog:
+source file (inode.c for example) go to syslog::
echo inode > /sys/kernel/debug/orangefs/kernel-debug
-No debugging (the default):
+No debugging (the default)::
echo none > /sys/kernel/debug/orangefs/kernel-debug
-Debugging from several source files:
+Debugging from several source files::
echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug
-All debugging:
+All debugging::
echo all > /sys/kernel/debug/orangefs/kernel-debug
-Get a list of all debugging keywords:
+Get a list of all debugging keywords::
cat /sys/kernel/debug/orangefs/debug-help
-PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE
+Protocol between Kernel Module and Userspace
============================================
Orangefs is a user space filesystem and an associated kernel module.
@@ -234,7 +243,8 @@
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.
-THE BUFMAP:
+The Bufmap
+----------
At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers, one is used for IO and one is used for readdir
@@ -250,7 +260,8 @@
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:
- * refcnt - a reference counter
+ * refcnt
+ - a reference counter
* desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
partition size, which represents the filesystem's block size and
is used for s_blocksize in super blocks.
@@ -259,17 +270,19 @@
* desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
* total_size - the total size of the IO buffer.
* page_count - the number of 4096 byte pages in the IO buffer.
- * page_array - a pointer to page_count * (sizeof(struct page*)) bytes
+ * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
of kcalloced memory. This memory is used as an array of pointers
to each of the pages in the IO buffer through a call to get_user_pages.
- * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc))
+ * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
bytes of kcalloced memory. This memory is further intialized:
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
structure. user_desc->ptr points to the IO buffer.
- pages_per_desc = bufmap->desc_size / PAGE_SIZE
- offset = 0
+ ::
+
+ pages_per_desc = bufmap->desc_size / PAGE_SIZE
+ offset = 0
bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
bufmap->desc_array[0].array_count = pages_per_desc = 1024
@@ -293,7 +306,8 @@
* readdir_index_lock - a spinlock to protect readdir_index_array during
update.
-OPERATIONS:
+Operations
+----------
The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
@@ -308,13 +322,19 @@
Ops are stateful:
- * unknown - op was just initialized
- * waiting - op is on request_list (upward bound)
- * inprogr - op is in progress (waiting for downcall)
- * serviced - op has matching downcall; ok
- * purged - op has to start a timer since client-core
+ * unknown
+ - op was just initialized
+ * waiting
+ - op is on request_list (upward bound)
+ * inprogr
+ - op is in progress (waiting for downcall)
+ * serviced
+ - op has matching downcall; ok
+ * purged
+ - op has to start a timer since client-core
exited uncleanly before servicing op
- * given up - submitter has given up waiting for it
+ * given up
+ - submitter has given up waiting for it
When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
@@ -389,10 +409,15 @@
response type.
The several members outside of the union are:
- - int32_t type - type of operation.
- - int32_t status - return code for the operation.
- - int64_t trailer_size - 0 unless readdir operation.
- - char *trailer_buf - initialized to NULL, used during readdir operations.
+
+ ``int32_t type``
+ - type of operation.
+ ``int32_t status``
+ - return code for the operation.
+ ``int64_t trailer_size``
+ - 0 unless readdir operation.
+ ``char *trailer_buf``
+ - initialized to NULL, used during readdir operations.
The appropriate member inside the union is filled out for any
particular response.
@@ -449,18 +474,20 @@
made by the kernel side.
A buffer_list containing:
+
- a pointer to the prepared response to the request from the
kernel (struct pvfs2_downcall_t).
- and also, in the case of a readdir request, a pointer to a
buffer containing descriptors for the objects in the target
directory.
+
... is sent to the function (PINT_dev_write_list) which performs
the writev.
PINT_dev_write_list has a local iovec array: struct iovec io_array[10];
The first four elements of io_array are initialized like this for all
-responses:
+responses::
io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
io_array[0].iov_len = sizeof(int32_t)
@@ -475,7 +502,7 @@
of global variable vfs_request (vfs_request_t)
io_array[3].iov_len = sizeof(pvfs2_downcall_t)
-Readdir responses initialize the fifth element io_array like this:
+Readdir responses initialize the fifth element io_array like this::
io_array[4].iov_base = contents of member trailer_buf (char *)
from out_downcall member of global variable
@@ -517,13 +544,13 @@
hence the motivation to use the dentry when possible.
The timeout values d_time and getattr_time are jiffy based, and the
-code is designed to avoid the jiffy-wrap problem:
+code is designed to avoid the jiffy-wrap problem::
-"In general, if the clock may have wrapped around more than once, there
-is no way to tell how much time has elapsed. However, if the times t1
-and t2 are known to be fairly close, we can reliably compute the
-difference in a way that takes into account the possibility that the
-clock may have wrapped between times."
+ "In general, if the clock may have wrapped around more than once, there
+ is no way to tell how much time has elapsed. However, if the times t1
+ and t2 are known to be fairly close, we can reliably compute the
+ difference in a way that takes into account the possibility that the
+ clock may have wrapped between times."
- from course notes by instructor Andy Wang
+from course notes by instructor Andy Wang
diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.rst
similarity index 79%
rename from Documentation/filesystems/overlayfs.txt
rename to Documentation/filesystems/overlayfs.rst
index 845d689..137afeb 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
Written by: Neil Brown
Please see MAINTAINERS file for where to send questions.
@@ -38,13 +40,46 @@
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
identifier from the real object st_ino and an underlying fsid index.
+
If all underlying filesystems support NFS file handles and export file
handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
will use the high inode number bits for fsid. Even when the underlying
filesystem uses 64bit inode numbers, users can still enable the "xino"
feature with the "-o xino=on" overlay mount option. That is useful for the
case of underlying filesystems like xfs and tmpfs, which use 64bit inode
-numbers, but are very unlikely to use the high inode number bit.
+numbers, but are very unlikely to use the high inode number bits. In case
+the underlying inode number does overflow into the high xino bits, overlay
+filesystem will fall back to the non xino behavior for that inode.
+
+The following table summarizes what can be expected in different overlay
+configurations.
+
+Inode properties
+````````````````
+
++--------------+------------+------------+-----------------+----------------+
+|Configuration | Persistent | Uniform | st_ino == d_ino | d_ino == i_ino |
+| | st_ino | st_dev | | [*] |
++==============+=====+======+=====+======+========+========+========+=======+
+| | dir | !dir | dir | !dir | dir + !dir | dir | !dir |
++--------------+-----+------+-----+------+--------+--------+--------+-------+
+| All layers | Y | Y | Y | Y | Y | Y | Y | Y |
+| on same fs | | | | | | | | |
++--------------+-----+------+-----+------+--------+--------+--------+-------+
+| Layers not | N | Y | Y | N | N | Y | N | Y |
+| on same fs, | | | | | | | | |
+| xino=off | | | | | | | | |
++--------------+-----+------+-----+------+--------+--------+--------+-------+
+| xino=on/auto | Y | Y | Y | Y | Y | Y | Y | Y |
+| | | | | | | | | |
++--------------+-----+------+-----+------+--------+--------+--------+-------+
+| xino=on/auto,| N | Y | Y | N | N | Y | N | Y |
+| ino overflow | | | | | | | | |
++--------------+-----+------+-----+------+--------+--------+--------+-------+
+
+[*] nfsd v3 readdirplus verifies d_ino == i_ino. i_ino is exposed via several
+/proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
+file descriptor.
Upper and Lower
@@ -181,7 +216,7 @@
worried about backward compatibility with kernels that have the redirect_dir
feature and follow redirects even if turned off.
-Module options (can also be changed through /sys/module/overlay/parameters/*):
+Module options (can also be changed through /sys/module/overlay/parameters/):
- "redirect_dir=BOOL":
See OVERLAY_FS_REDIRECT_DIR kernel config option above.
@@ -246,10 +281,54 @@
rename or unlink will of course be noticed and handled).
+Permission model
+----------------
+
+Permission checking in the overlay filesystem follows these principles:
+
+ 1) permission check SHOULD return the same result before and after copy up
+
+ 2) task creating the overlay mount MUST NOT gain additional privileges
+
+ 3) non-mounting task MAY gain additional privileges through the overlay,
+ compared to direct access on underlying lower or upper filesystems
+
+This is achieved by performing two permission checks on each access
+
+ a) check if current task is allowed access based on local DAC (owner,
+ group, mode and posix acl), as well as MAC checks
+
+ b) check if mounting task would be allowed real operation on lower or
+ upper layer based on underlying filesystem permissions, again including
+ MAC checks
+
+Check (a) ensures consistency (1) since owner, group, mode and posix acls
+are copied up. On the other hand it can result in server enforced
+permissions (used by NFS, for example) being ignored (3).
+
+Check (b) ensures that no task gains permissions to underlying layers that
+the mounting task does not have (2). This also means that it is possible
+to create setups where the consistency rule (1) does not hold; normally,
+however, the mounting task will have sufficient privileges to perform all
+operations.
+
+Another way to demonstrate this model is drawing parallels between
+
+ mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged
+
+and
+
+ cp -a /lower /upper
+ mount --bind /upper /merged
+
+The resulting access permissions should be the same. The difference is in
+the time of copy (on-demand vs. up-front).
+
+
Multiple lower layers
---------------------
-Multiple lower layers can now be given using the the colon (":") as a
+Multiple lower layers can now be given using the colon (":") as a
separator character between the directory names. For example:
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
@@ -263,7 +342,7 @@
Metadata only copy up
---------------------
+---------------------
When metadata only copy up feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
@@ -286,10 +365,10 @@
"trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers like from a pen drive.
-Note: redirect_dir={off|nofollow|follow(*)} conflicts with metacopy=on, and
-results in an error.
+Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options
+conflict with metacopy=on, and will result in an error.
-(*) redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
+[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.
Sharing and copying layers
@@ -381,7 +460,8 @@
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
overlay filesystem and the value of st_ino for directory objects may not be
-persistent and could change even while the overlay filesystem is mounted.
+persistent and could change even while the overlay filesystem is mounted, as
+summarized in the `Inode properties`_ table above.
Changes to underlying filesystems
@@ -480,6 +560,36 @@
verified on mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.
+Note: the mount options index=off,nfs_export=on are conflicting for a
+read-write mount and will result in an error.
+
+
+Volatile mount
+--------------
+
+This is enabled with the "volatile" mount option. Volatile mounts are not
+guaranteed to survive a crash. It is strongly recommended that volatile
+mounts are only used if data written to the overlay can be recreated
+without significant effort.
+
+The advantage of mounting with the "volatile" option is that all forms of
+sync calls to the upper filesystem are omitted.
+
+In order to avoid a giving a false sense of safety, the syncfs (and fsync)
+semantics of volatile mounts are slightly different than that of the rest of
+VFS. If any writeback error occurs on the upperdir's filesystem after a
+volatile mount takes place, all sync functions will return an error. Once this
+condition is reached, the filesystem will not recover, and every subsequent sync
+call will return an error, even if the upperdir has not experience a new error
+since the last sync call.
+
+When overlay is mounted with "volatile" option, the directory
+"$workdir/work/incompat/volatile" is created. During next mount, overlay
+checks for this directory and refuses to mount if present. This is a strong
+indicator that user should throw away upper and work directories and create
+fresh one. In very limited cases where the user knows that the system has
+not crashed and contents of upperdir are intact, The "volatile" directory
+can be removed.
Testsuite
---------
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b..c482e16 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -13,6 +13,7 @@
including:
- per-directory parallel name lookup.
+- ``openat2()`` resolution restriction flags.
Introduction to pathname lookup
===============================
@@ -42,15 +43,15 @@
non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
-from some other location specified by a file descriptor given to a
-"``XXXat``" system call such as `openat() <openat_>`_.
+from some other location specified by a file descriptor given to
+"``*at()``" system calls such as `openat() <openat_>`_.
.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx``at``" system calls
+generally forbidden in POSIX, but some of those "``*at()``" system calls
in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
can execute it by calling `execveat() <execveat_>`_ passing
@@ -68,17 +69,17 @@
exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
-.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "``/``". According to POSIX_
+ending in "``/``". According to POSIX_:
- A pathname that contains at least one non- <slash> character and
- that ends with one or more trailing <slash> characters shall not
+ A pathname that contains at least one non-<slash> character and
+ that ends with one or more trailing <slash> characters shall not
be resolved successfully unless the last pathname component before
the trailing <slash> characters names an existing directory or a
directory entry that is to be created for a directory immediately
@@ -228,13 +229,20 @@
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (``d_lookup()``) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does *not* try to prevent this
from happening, but only to detect when it happens.
``rename_lock`` is a seqlock that is updated whenever any dentry is
renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
+``rename_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``rename_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
inode->i_rwsem
~~~~~~~~~~~~~~
@@ -348,6 +356,13 @@
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
+``mount_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``mount_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
RCU
~~~
@@ -361,7 +376,7 @@
Bringing it together with ``struct nameidata``
----------------------------------------------
-.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
in a ``struct nameidata``, "namei" being the traditional name - dating
@@ -383,17 +398,14 @@
``struct qstr last``
~~~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ ``nul`` terminated)
+This is a string together with a length (i.e. *not* ``nul`` terminated)
that is the "next" component in the pathname.
``int last_type``
~~~~~~~~~~~~~~~~~
-This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
-``LAST_BIND``. The ``last`` field is only valid if the type is
-``LAST_NORM``. ``LAST_BIND`` is used when following a symlink and no
-components of the symlink have been processed yet. Others should be
-fairly self-explanatory.
+This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT`` or ``LAST_DOTDOT``.
+The ``last`` field is only valid if the type is ``LAST_NORM``.
``struct path root``
~~~~~~~~~~~~~~~~~~~~
@@ -405,6 +417,10 @@
only one root is in effect for the entire path walk, even if it races
with a ``chroot()`` system call.
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
The root is needed when either of two conditions holds: (1) either the
pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
component is being handled, since "``..``" from the root must always stay
@@ -639,8 +655,8 @@
clearly seen in functions like ``filename_lookup()``,
``filename_parentat()``, ``filename_mountpoint()``,
``do_filp_open()``, and ``do_file_open_root()``. These five
-correspond roughly to the four ``path_``* functions we met earlier,
-each of which calls ``link_path_walk()``. The ``path_*`` functions are
+correspond roughly to the four ``path_*()`` functions we met earlier,
+each of which calls ``link_path_walk()``. The ``path_*()`` functions are
called using different mode flags until a mode is found which works.
They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
that fails with the error ``ECHILD`` they are called again with no
@@ -704,7 +720,7 @@
variables, then ``read_seqcount_retry()`` is called to confirm the two
are consistent, and only then is ``->d_compare()`` called. When
standard filename comparison is used, ``dentry_cmp()`` is called
-instead. Notably it does _not_ use ``read_seqcount_retry()``, but
+instead. Notably it does *not* use ``read_seqcount_retry()``, but
instead has a large comment explaining why the consistency guarantee
isn't necessary. A subsequent ``read_seqcount_retry()`` will be
sufficient to catch any problem that could occur at this point.
@@ -912,7 +928,7 @@
sedate approach.
The emphasis here is "try quickly and check". It should probably be
-"try quickly _and carefully,_ then check". The fact that checking is
+"try quickly *and carefully*, then check". The fact that checking is
needed is a reminder that the system is dynamic and only a limited
number of things are safe at all. The most likely cause of errors in
this whole process is assuming something is safe when in reality it
@@ -1149,7 +1165,7 @@
the stack frame discarded.
The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
$ ls -l /proc/self/fd/1
lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1249,7 +1265,7 @@
and looking up a symlink on the way to some other destination can
update the atime on that symlink.
-.. _clearest statement: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
+.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
It is not clear why this is the case; POSIX has little to say on the
subject. The `clearest statement`_ is that, if a particular implementation
@@ -1286,7 +1302,9 @@
A suitable way to wrap up this tour of pathname walking is to list
the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
-component, others reflect the current state of the pathname lookup.
+component, others reflect the current state of the pathname lookup, and some
+apply restrictions to all path components encountered in the path lookup.
+
And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
@@ -1310,13 +1328,48 @@
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
+Resolution-restriction flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to allow userspace to protect itself against certain race conditions
+and attack scenarios involving changing path components, a series of flags are
+available which apply restrictions to all path components encountered during
+path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
+
+``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
+This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
+relates to restricting the following of trailing symlinks.
+
+``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
+ensure that they return errors from ``nd_jump_link()``, because that is how
+``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
+
+``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
+bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
+lookup is determined by the first mountpoint the path lookup reaches --
+absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
+with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
+``vfsmount`` of the path is unchanged.
+
+``LOOKUP_BENEATH`` blocks any path components which resolve outside the
+starting point of the resolution. This is done by blocking ``nd_jump_root()``
+as well as blocking ".." if it would jump outside the starting point.
+``rename_lock`` and ``mount_lock`` are used to detect attacks against the
+resolution of "..". Magic-links are also blocked.
+
+``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to
+the starting point, and ".." at the starting point will act as a no-op. As with
+``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
+attacks against ".." resolution. Magic-links are also blocked.
+
Final-component flags
~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 9b8930f..1aa7ce0 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -375,7 +375,7 @@
Papers and other documentation on dcache locking
================================================
-1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124).
+1. Scaling dcache with RCU (https://linuxjournal.com/article.php?sid=7124).
2. http://lse.sourceforge.net/locking/dcache/dcache.html
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 26c0939..867036a 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -858,3 +858,10 @@
[should've been added in 2016] stale comment in finish_open() nonwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
+
+---
+
+**mandatory**
+
+clone_private_mount() returns a longterm mount now, so the proper destructor of
+its result is kern_unmount() or kern_unmount_array().
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.rst
similarity index 60%
rename from Documentation/filesystems/proc.txt
rename to Documentation/filesystems/proc.rst
index 99ca040..533c79e 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.rst
@@ -1,19 +1,20 @@
-------------------------------------------------------------------------------
- T H E /proc F I L E S Y S T E M
-------------------------------------------------------------------------------
-/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999
- Bodo Bauer <bb@ricochet.net>
+.. SPDX-License-Identifier: GPL-2.0
-2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000
-move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009
-------------------------------------------------------------------------------
-Version 1.3 Kernel version 2.2.12
- Kernel version 2.4.0-test11-pre4
-------------------------------------------------------------------------------
-fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
+====================
+The /proc Filesystem
+====================
-Table of Contents
------------------
+===================== ======================================= ================
+/proc/sys Terrehon Bowden <terrehon@pacbell.net>, October 7 1999
+ Bodo Bauer <bb@ricochet.net>
+2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000
+move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009
+fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
+===================== ======================================= ================
+
+
+
+.. Table of Contents
0 Preface
0.1 Introduction/Credits
@@ -50,9 +51,10 @@
4 Configuring procfs
4.1 Mount options
-------------------------------------------------------------------------------
+ 5 Filesystem behavior
+
Preface
-------------------------------------------------------------------------------
+=======
0.1 Introduction/Credits
------------------------
@@ -95,20 +97,18 @@
complaining about how you screwed up your system because of incorrect
documentation, we won't feel responsible...
-------------------------------------------------------------------------------
-CHAPTER 1: COLLECTING SYSTEM INFORMATION
-------------------------------------------------------------------------------
+Chapter 1: Collecting System Information
+========================================
-------------------------------------------------------------------------------
In This Chapter
-------------------------------------------------------------------------------
+---------------
* Investigating the properties of the pseudo file system /proc and its
ability to provide information on the running Linux system
* Examining /proc's structure
* Uncovering various information about the kernel and the processes running
on the system
-------------------------------------------------------------------------------
+------------------------------------------------------------------------------
The proc file system acts as an interface to internal data structures in the
kernel. It can be used to obtain information about the system and to change
@@ -123,10 +123,10 @@
The directory /proc contains (among other things) one subdirectory for each
process running on the system, which is named after the process ID (PID).
-The link self points to the process reading the file system. Each process
+The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
-Note that an open a file descriptor to /proc/<pid> or to any of its
+Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
open /proc/<pid> file descriptors corresponding to dead processes
@@ -134,9 +134,11 @@
also assigned the process ID <pid>. Instead, operations on these FDs
usually fail with ESRCH.
-Table 1-1: Process specific entries in /proc
-..............................................................................
+.. table:: Table 1-1: Process specific entries in /proc
+
+ ============= ===============================================================
File Content
+ ============= ===============================================================
clear_refs Clears page referenced bits shown in smaps output
cmdline Command line arguments
cpu Current and last cpu in which it was executed (2.4)(smp)
@@ -160,10 +162,10 @@
can be derived from smaps, but is faster and more convenient
numa_maps An extension based on maps, showing the memory locality and
binding policy as well as mem usage (in pages) of each mapping.
-..............................................................................
+ ============= ===============================================================
For example, to get the status information of a process, all you have to do is
-read the file /proc/PID/status:
+read the file /proc/PID/status::
>cat /proc/self/status
Name: cat
@@ -218,18 +220,21 @@
The statm file contains more detailed information about the process
memory usage. Its seven fields are explained in Table 1-3. The stat file
-contains details information about the process itself. Its fields are
+contains detailed information about the process itself. Its fields are
explained in Table 1-4.
(for SMP CONFIG users)
+
For making accounting scalable, RSS related information are handled in an
asynchronous manner and the value may not be very precise. To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
It's slow but very precise.
-Table 1-2: Contents of the status files (as of 4.19)
-..............................................................................
+.. table:: Table 1-2: Contents of the status files (as of 4.19)
+
+ ========================== ===================================================
Field Content
+ ========================== ===================================================
Name filename of the executable
Umask file mode creation mask
State state (R is running, S is sleeping, D is sleeping
@@ -254,7 +259,8 @@
VmPin pinned memory size
VmHWM peak resident set size ("high water mark")
VmRSS size of memory portions. It contains the three
- following parts (VmRSS = RssAnon + RssFile + RssShmem)
+ following parts
+ (VmRSS = RssAnon + RssFile + RssShmem)
RssAnon size of resident anonymous memory
RssFile size of resident file mappings
RssShmem size of resident shmem memory (includes SysV shm,
@@ -292,27 +298,32 @@
Mems_allowed_list Same as previous, but in "list format"
voluntary_ctxt_switches number of voluntary context switches
nonvoluntary_ctxt_switches number of non voluntary context switches
-..............................................................................
+ ========================== ===================================================
-Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-..............................................................................
+
+.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
+
+ ======== =============================== ==============================
Field Content
+ ======== =============================== ==============================
size total program size (pages) (same as VmSize in status)
resident size of memory portions (pages) (same as VmRSS in status)
shared number of pages that are shared (i.e. backed by a file, same
as RssFile+RssShmem in status)
trs number of pages that are 'code' (not including libs; broken,
- includes data segment)
+ includes data segment)
lrs number of pages of library (always 0 on 2.6)
drs number of pages of data/stack (including libs; broken,
- includes library text)
+ includes library text)
dt number of dirty pages (always 0 on 2.6)
-..............................................................................
+ ======== =============================== ==============================
-Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
-..............................................................................
- Field Content
+.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+
+ ============= ===============================================================
+ Field Content
+ ============= ===============================================================
pid process id
tcomm filename of the executable
state state (R is running, S is sleeping, D is sleeping in an
@@ -348,7 +359,8 @@
blocked bitmap of blocked signals
sigign bitmap of ignored signals
sigcatch bitmap of caught signals
- 0 (place holder, used to be the wchan address, use /proc/PID/wchan instead)
+ 0 (place holder, used to be the wchan address,
+ use /proc/PID/wchan instead)
0 (place holder)
0 (place holder)
exit_signal signal to send to parent thread on exit
@@ -365,39 +377,40 @@
arg_end address below which program command line is placed
env_start address above which program environment is placed
env_end address below which program environment is placed
- exit_code the thread's exit_code in the form reported by the waitpid system call
-..............................................................................
+ exit_code the thread's exit_code in the form reported by the waitpid
+ system call
+ ============= ===============================================================
The /proc/PID/maps file contains the currently mapped memory regions and
their access permissions.
-The format is:
+The format is::
-address perms offset dev inode pathname
+ address perms offset dev inode pathname
-08048000-08049000 r-xp 00000000 03:00 8312 /opt/test
-08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
-0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
-a7cb1000-a7cb2000 ---p 00000000 00:00 0
-a7cb2000-a7eb2000 rw-p 00000000 00:00 0
-a7eb2000-a7eb3000 ---p 00000000 00:00 0
-a7eb3000-a7ed5000 rw-p 00000000 00:00 0
-a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
-a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6
-a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6
-a800b000-a800e000 rw-p 00000000 00:00 0
-a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
-a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
-a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
-a8024000-a8027000 rw-p 00000000 00:00 0
-a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
-a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
-a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
-aff35000-aff4a000 rw-p 00000000 00:00 0 [stack]
-ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
+ 08048000-08049000 r-xp 00000000 03:00 8312 /opt/test
+ 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
+ 0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
+ a7cb1000-a7cb2000 ---p 00000000 00:00 0
+ a7cb2000-a7eb2000 rw-p 00000000 00:00 0
+ a7eb2000-a7eb3000 ---p 00000000 00:00 0
+ a7eb3000-a7ed5000 rw-p 00000000 00:00 0
+ a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
+ a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6
+ a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6
+ a800b000-a800e000 rw-p 00000000 00:00 0
+ a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
+ a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
+ a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
+ a8024000-a8027000 rw-p 00000000 00:00 0
+ a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
+ a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
+ a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
+ aff35000-aff4a000 rw-p 00000000 00:00 0 [stack]
+ ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
where "address" is the address space in the process that it occupies, "perms"
-is a set of permissions:
+is a set of permissions::
r = read
w = write
@@ -411,42 +424,44 @@
The "pathname" shows the name associated file for this mapping. If the mapping
is not associated with a file:
- [heap] = the heap of the program
- [stack] = the stack of the main process
- [vdso] = the "virtual dynamic shared object",
+ ======= ====================================
+ [heap] the heap of the program
+ [stack] the stack of the main process
+ [vdso] the "virtual dynamic shared object",
the kernel system call handler
+ ======= ====================================
or if empty, the mapping is anonymous.
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
-Memory Area, or VMA) there is a series of lines such as the following:
+Memory Area, or VMA) there is a series of lines such as the following::
-08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash
+ 08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash
-Size: 1084 kB
-KernelPageSize: 4 kB
-MMUPageSize: 4 kB
-Rss: 892 kB
-Pss: 374 kB
-Shared_Clean: 892 kB
-Shared_Dirty: 0 kB
-Private_Clean: 0 kB
-Private_Dirty: 0 kB
-Referenced: 892 kB
-Anonymous: 0 kB
-LazyFree: 0 kB
-AnonHugePages: 0 kB
-ShmemPmdMapped: 0 kB
-Shared_Hugetlb: 0 kB
-Private_Hugetlb: 0 kB
-Swap: 0 kB
-SwapPss: 0 kB
-KernelPageSize: 4 kB
-MMUPageSize: 4 kB
-Locked: 0 kB
-THPeligible: 0
-VmFlags: rd ex mr mw me dw
+ Size: 1084 kB
+ KernelPageSize: 4 kB
+ MMUPageSize: 4 kB
+ Rss: 892 kB
+ Pss: 374 kB
+ Shared_Clean: 892 kB
+ Shared_Dirty: 0 kB
+ Private_Clean: 0 kB
+ Private_Dirty: 0 kB
+ Referenced: 892 kB
+ Anonymous: 0 kB
+ LazyFree: 0 kB
+ AnonHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ Shared_Hugetlb: 0 kB
+ Private_Hugetlb: 0 kB
+ Swap: 0 kB
+ SwapPss: 0 kB
+ KernelPageSize: 4 kB
+ MMUPageSize: 4 kB
+ Locked: 0 kB
+ THPeligible: 0
+ VmFlags: rd ex mr mw me dw
The first of these lines shows the same information as is displayed for the
mapping in /proc/PID/maps. Following lines show the size of the mapping
@@ -461,26 +476,35 @@
in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other
process, its PSS will be 1500.
+
Note that even a page which is part of a MAP_SHARED mapping, but has only
a single pte mapped, i.e. is currently used by only one process, is accounted
as private and not as shared.
+
"Referenced" indicates the amount of memory currently marked as referenced or
accessed.
+
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
+
"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(). It's freed in memory
pressure if the memory is clean. Please note that the printed value might
be lower than the real value due to optimizations used in the current
implementation. If this is not desirable please file a bug report.
+
"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+
"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
huge pages.
+
"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
+
"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
+
For shmem mappings, "Swap" includes also the size of the mapped (and not
replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
@@ -489,36 +513,40 @@
"THPeligible" indicates whether the mapping is eligible for allocating THP
pages - 1 if true, 0 otherwise. It just shows the current status.
-"VmFlags" field deserves a separate description. This member represents the kernel
-flags associated with the particular virtual memory area in two letter encoded
-manner. The codes are the following:
- rd - readable
- wr - writeable
- ex - executable
- sh - shared
- mr - may read
- mw - may write
- me - may execute
- ms - may share
- gd - stack segment growns down
- pf - pure PFN range
- dw - disabled write to the mapped file
- lo - pages are locked in memory
- io - memory mapped I/O area
- sr - sequential read advise provided
- rr - random read advise provided
- dc - do not copy area on fork
- de - do not expand area on remapping
- ac - area is accountable
- nr - swap space is not reserved for the area
- ht - area uses huge tlb pages
- ar - architecture specific flag
- dd - do not include area into core dump
- sd - soft-dirty flag
- mm - mixed map area
- hg - huge page advise flag
- nh - no-huge page advise flag
- mg - mergable advise flag
+"VmFlags" field deserves a separate description. This member represents the
+kernel flags associated with the particular virtual memory area in two letter
+encoded manner. The codes are the following:
+
+ == =======================================
+ rd readable
+ wr writeable
+ ex executable
+ sh shared
+ mr may read
+ mw may write
+ me may execute
+ ms may share
+ gd stack segment growns down
+ pf pure PFN range
+ dw disabled write to the mapped file
+ lo pages are locked in memory
+ io memory mapped I/O area
+ sr sequential read advise provided
+ rr random read advise provided
+ dc do not copy area on fork
+ de do not expand area on remapping
+ ac area is accountable
+ nr swap space is not reserved for the area
+ ht area uses huge tlb pages
+ ar architecture specific flag
+ dd do not include area into core dump
+ sd soft dirty flag
+ mm mixed map area
+ hg huge page advise flag
+ nh no huge page advise flag
+ mg mergable advise flag
+ bt arm64 BTI guarded page
+ == =======================================
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
@@ -531,6 +559,7 @@
Note: reading /proc/PID/maps or /proc/PID/smaps is inherently racy (consistent
output can be achieved only in the single read call).
+
This typically manifests when doing partial reads of these files while the
memory map is being modified. Despite the races, we do provide the following
guarantees:
@@ -544,9 +573,9 @@
but their values are the sums of the corresponding values for all mappings of
the process. Additionally, it contains these fields:
-Pss_Anon
-Pss_File
-Pss_Shmem
+- Pss_Anon
+- Pss_File
+- Pss_Shmem
They represent the proportional shares of anonymous, file, and shmem pages, as
described for smaps above. These fields are omitted in smaps since each
@@ -558,20 +587,25 @@
bits on both physical and virtual pages associated with a process, and the
soft-dirty bit on pte (see Documentation/admin-guide/mm/soft-dirty.rst
for details).
-To clear the bits for all the pages associated with the process
+To clear the bits for all the pages associated with the process::
+
> echo 1 > /proc/PID/clear_refs
-To clear the bits for the anonymous pages associated with the process
+To clear the bits for the anonymous pages associated with the process::
+
> echo 2 > /proc/PID/clear_refs
-To clear the bits for the file mapped pages associated with the process
+To clear the bits for the file mapped pages associated with the process::
+
> echo 3 > /proc/PID/clear_refs
-To clear the soft-dirty bit
+To clear the soft-dirty bit::
+
> echo 4 > /proc/PID/clear_refs
To reset the peak resident set size ("high water mark") to the process's
-current value:
+current value::
+
> echo 5 > /proc/PID/clear_refs
Any other value written to /proc/PID/clear_refs will have no effect.
@@ -584,30 +618,33 @@
The /proc/pid/numa_maps is an extension based on maps, showing the memory
locality and binding policy, as well as the memory usage (in pages) of
each mapping. The output follows a general format where mapping details get
-summarized separated by blank spaces, one mapping per each file line:
+summarized separated by blank spaces, one mapping per each file line::
-address policy mapping details
+ address policy mapping details
-00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4
-00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4
-3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4
-320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
-3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
-3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4
-3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4
-320698b000 default file=/lib64/libc-2.12.so
-3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4
-3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
-3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4
-7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4
-7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4
-7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048
-7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4
-7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4
+ 00400000 default file=/usr/local/bin/app mapped=1 active=0 N3=1 kernelpagesize_kB=4
+ 00600000 default file=/usr/local/bin/app anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ 3206000000 default file=/lib64/ld-2.12.so mapped=26 mapmax=6 N0=24 N3=2 kernelpagesize_kB=4
+ 320621f000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ 3206220000 default file=/lib64/ld-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ 3206221000 default anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ 3206800000 default file=/lib64/libc-2.12.so mapped=59 mapmax=21 active=55 N0=41 N3=18 kernelpagesize_kB=4
+ 320698b000 default file=/lib64/libc-2.12.so
+ 3206b8a000 default file=/lib64/libc-2.12.so anon=2 dirty=2 N3=2 kernelpagesize_kB=4
+ 3206b8e000 default file=/lib64/libc-2.12.so anon=1 dirty=1 N3=1 kernelpagesize_kB=4
+ 3206b8f000 default anon=3 dirty=3 active=1 N3=3 kernelpagesize_kB=4
+ 7f4dc10a2000 default anon=3 dirty=3 N3=3 kernelpagesize_kB=4
+ 7f4dc10b4000 default anon=2 dirty=2 active=1 N3=2 kernelpagesize_kB=4
+ 7f4dc1200000 default file=/anon_hugepage\040(deleted) huge anon=1 dirty=1 N3=1 kernelpagesize_kB=2048
+ 7fff335f0000 default stack anon=3 dirty=3 N3=3 kernelpagesize_kB=4
+ 7fff3369d000 default mapped=1 mapmax=35 active=0 N3=1 kernelpagesize_kB=4
Where:
+
"address" is the starting address for the mapping;
+
"policy" reports the NUMA memory policy set for the mapping (see Documentation/admin-guide/mm/numa_memory_policy.rst);
+
"mapping details" summarizes mapping data such as mapping type, page usage counters,
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
@@ -621,81 +658,83 @@
system. It depends on the kernel configuration and the loaded modules, which
files are there, and which are missing.
-Table 1-5: Kernel info in /proc
-..............................................................................
- File Content
- apm Advanced power management info
- buddyinfo Kernel memory allocator information (see text) (2.5)
- bus Directory containing bus specific information
- cmdline Kernel command line
- cpuinfo Info about the CPU
- devices Available devices (block and character)
- dma Used DMS channels
- filesystems Supported filesystems
- driver Various drivers grouped here, currently rtc (2.4)
- execdomains Execdomains, related to security (2.4)
- fb Frame Buffer devices (2.4)
- fs File system parameters, currently nfs/exports (2.4)
- ide Directory containing info about the IDE subsystem
- interrupts Interrupt usage
- iomem Memory map (2.4)
- ioports I/O port usage
- irq Masks for irq to cpu affinity (2.4)(smp?)
- isapnp ISA PnP (Plug&Play) Info (2.4)
- kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
- kmsg Kernel messages
- ksyms Kernel symbol table
- loadavg Load average of last 1, 5 & 15 minutes
- locks Kernel locks
- meminfo Memory info
- misc Miscellaneous
- modules List of loaded modules
- mounts Mounted filesystems
- net Networking info (see text)
+.. table:: Table 1-5: Kernel info in /proc
+
+ ============ ===============================================================
+ File Content
+ ============ ===============================================================
+ apm Advanced power management info
+ buddyinfo Kernel memory allocator information (see text) (2.5)
+ bus Directory containing bus specific information
+ cmdline Kernel command line
+ cpuinfo Info about the CPU
+ devices Available devices (block and character)
+ dma Used DMS channels
+ filesystems Supported filesystems
+ driver Various drivers grouped here, currently rtc (2.4)
+ execdomains Execdomains, related to security (2.4)
+ fb Frame Buffer devices (2.4)
+ fs File system parameters, currently nfs/exports (2.4)
+ ide Directory containing info about the IDE subsystem
+ interrupts Interrupt usage
+ iomem Memory map (2.4)
+ ioports I/O port usage
+ irq Masks for irq to cpu affinity (2.4)(smp?)
+ isapnp ISA PnP (Plug&Play) Info (2.4)
+ kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
+ kmsg Kernel messages
+ ksyms Kernel symbol table
+ loadavg Load average of last 1, 5 & 15 minutes
+ locks Kernel locks
+ meminfo Memory info
+ misc Miscellaneous
+ modules List of loaded modules
+ mounts Mounted filesystems
+ net Networking info (see text)
pagetypeinfo Additional page allocator information (see text) (2.5)
- partitions Table of partitions known to the system
- pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
- decoupled by lspci (2.4)
- rtc Real time clock
- scsi SCSI info (see text)
- slabinfo Slab pool info
- softirqs softirq usage
- stat Overall statistics
- swaps Swap space utilization
- sys See chapter 2
- sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
- tty Info of tty drivers
- uptime Wall clock since boot, combined idle time of all cpus
- version Kernel version
- video bttv info of video resources (2.4)
- vmallocinfo Show vmalloced areas
-..............................................................................
+ partitions Table of partitions known to the system
+ pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
+ decoupled by lspci (2.4)
+ rtc Real time clock
+ scsi SCSI info (see text)
+ slabinfo Slab pool info
+ softirqs softirq usage
+ stat Overall statistics
+ swaps Swap space utilization
+ sys See chapter 2
+ sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
+ tty Info of tty drivers
+ uptime Wall clock since boot, combined idle time of all cpus
+ version Kernel version
+ video bttv info of video resources (2.4)
+ vmallocinfo Show vmalloced areas
+ ============ ===============================================================
You can, for example, check which interrupts are currently in use and what
-they are used for by looking in the file /proc/interrupts:
+they are used for by looking in the file /proc/interrupts::
- > cat /proc/interrupts
- CPU0
- 0: 8728810 XT-PIC timer
- 1: 895 XT-PIC keyboard
- 2: 0 XT-PIC cascade
- 3: 531695 XT-PIC aha152x
- 4: 2014133 XT-PIC serial
- 5: 44401 XT-PIC pcnet_cs
- 8: 2 XT-PIC rtc
- 11: 8 XT-PIC i82365
- 12: 182918 XT-PIC PS/2 Mouse
- 13: 1 XT-PIC fpu
- 14: 1232265 XT-PIC ide0
- 15: 7 XT-PIC ide1
- NMI: 0
+ > cat /proc/interrupts
+ CPU0
+ 0: 8728810 XT-PIC timer
+ 1: 895 XT-PIC keyboard
+ 2: 0 XT-PIC cascade
+ 3: 531695 XT-PIC aha152x
+ 4: 2014133 XT-PIC serial
+ 5: 44401 XT-PIC pcnet_cs
+ 8: 2 XT-PIC rtc
+ 11: 8 XT-PIC i82365
+ 12: 182918 XT-PIC PS/2 Mouse
+ 13: 1 XT-PIC fpu
+ 14: 1232265 XT-PIC ide0
+ 15: 7 XT-PIC ide1
+ NMI: 0
In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the
-output of a SMP machine):
+output of a SMP machine)::
- > cat /proc/interrupts
+ > cat /proc/interrupts
- CPU0 CPU1
+ CPU0 CPU1
0: 1243498 1214548 IO-APIC-edge timer
1: 8949 8958 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
@@ -708,8 +747,8 @@
15: 2183 2415 IO-APIC-edge ide1
17: 30564 30414 IO-APIC-level eth0
18: 177 164 IO-APIC-level bttv
- NMI: 2457961 2457959
- LOC: 2457882 2457881
+ NMI: 2457961 2457959
+ LOC: 2457882 2457881
ERR: 2155
NMI is incremented in this case because every timer interrupt generates a NMI
@@ -726,21 +765,25 @@
/proc/interrupts to display every IRQ vector in use by the system, not
just those considered 'most important'. The new vectors are:
- THR -- interrupt raised when a machine check threshold counter
+THR
+ interrupt raised when a machine check threshold counter
(typically counting ECC corrected errors of memory or cache) exceeds
a configurable threshold. Only available on some systems.
- TRM -- a thermal event interrupt occurs when a temperature threshold
+TRM
+ a thermal event interrupt occurs when a temperature threshold
has been exceeded for the CPU. This interrupt may also be generated
when the temperature drops back to normal.
- SPU -- a spurious interrupt is some interrupt that was raised then lowered
+SPU
+ a spurious interrupt is some interrupt that was raised then lowered
by some IO device before it could be fully processed by the APIC. Hence
the APIC sees the interrupt but does not know what device it came from.
For this case the APIC will generate the interrupt with a IRQ vector
of 0xff. This might also be generated by chipset bugs.
- RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are
+RES, CAL, TLB
+ rescheduling, call and TLB flush interrupts are
sent from one CPU to another per the needs of the OS. Typically,
their statistics are used by kernel developers and interested users to
determine the occurrence of interrupts of the given type.
@@ -751,12 +794,13 @@
i386 and x86_64 platforms support the new IRQ vector displays.
Of some interest is the introduction of the /proc/irq directory to 2.4.
-It could be used to set IRQ to CPU affinity, this means that you can "hook" an
+It could be used to set IRQ to CPU affinity. This means that you can "hook" an
IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
prof_cpu_mask.
-For example
+For example::
+
> ls /proc/irq/
0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
1 11 13 15 17 19 3 5 7 9 default_smp_affinity
@@ -764,20 +808,20 @@
smp_affinity
smp_affinity is a bitmask, in which you can specify which CPUs can handle the
-IRQ, you can set it by doing:
+IRQ. You can set it by doing::
> echo 1 > /proc/irq/10/smp_affinity
This means that only the first CPU will handle the IRQ, but you can also echo
5 which means that only the first and third CPU can handle the IRQ.
-The contents of each smp_affinity file is the same by default:
+The contents of each smp_affinity file is the same by default::
> cat /proc/irq/0/smp_affinity
ffffffff
There is an alternate interface, smp_affinity_list which allows specifying
-a cpu range instead of a bitmask:
+a CPU range instead of a bitmask::
> cat /proc/irq/0/smp_affinity_list
1024-1031
@@ -791,7 +835,7 @@
include information about any possible driver locality preference.
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
-profiler. Default value is ffffffff (all cpus if there are only 32 of them).
+profiler. Default value is ffffffff (all CPUs if there are only 32 of them).
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
between all the CPUs which are allowed to handle it. As usual the kernel has
@@ -810,50 +854,50 @@
Commonly used objects have their own slab pool (such as network buffers,
directory cache, and so on).
-..............................................................................
+::
-> cat /proc/buddyinfo
+ > cat /proc/buddyinfo
-Node 0, zone DMA 0 4 5 4 4 3 ...
-Node 0, zone Normal 1 0 0 1 101 8 ...
-Node 0, zone HighMem 2 0 0 1 1 0 ...
+ Node 0, zone DMA 0 4 5 4 4 3 ...
+ Node 0, zone Normal 1 0 0 1 101 8 ...
+ Node 0, zone HighMem 2 0 0 1 1 0 ...
External fragmentation is a problem under some workloads, and buddyinfo is a
-useful tool for helping diagnose these problems. Buddyinfo will give you a
+useful tool for helping diagnose these problems. Buddyinfo will give you a
clue as to how big an area you can safely allocate, or why a previous
allocation failed.
-Each column represents the number of pages of a certain order which are
-available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
-ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
-available in ZONE_NORMAL, etc...
+Each column represents the number of pages of a certain order which are
+available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
+ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
+available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo::
-> cat /proc/pagetypeinfo
-Page block order: 9
-Pages per block: 512
+ > cat /proc/pagetypeinfo
+ Page block order: 9
+ Pages per block: 512
-Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
-Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
-Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
-Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
-Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
-Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
-Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
-Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
-Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
-Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
-Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+ Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
+ Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
+ Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
+ Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
+ Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
+ Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+ Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
+ Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
+ Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
+ Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
+ Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
-Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
-Node 0, zone DMA 2 0 5 1 0
-Node 0, zone DMA32 41 6 967 2 0
+ Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
+ Node 0, zone DMA 2 0 5 1 0
+ Node 0, zone DMA32 41 6 967 2 0
Fragmentation avoidance in the kernel works by grouping pages of different
migrate types into the same contiguous regions of memory called page blocks.
-A page block is typically the size of the default hugepage size e.g. 2MB on
+A page block is typically the size of the default hugepage size, e.g. 2MB on
X86-64. By keeping pages grouped based on their ability to move, the kernel
can reclaim pages within a page block to satisfy a high-order allocation.
@@ -870,59 +914,63 @@
also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
-..............................................................................
-meminfo:
+meminfo
+~~~~~~~
Provides information about distribution and utilization of memory. This
varies by architecture and compile options. The following is from a
16GB PIII, which has highmem enabled. You may not have all of these fields.
-> cat /proc/meminfo
+::
-MemTotal: 16344972 kB
-MemFree: 13634064 kB
-MemAvailable: 14836172 kB
-Buffers: 3656 kB
-Cached: 1195708 kB
-SwapCached: 0 kB
-Active: 891636 kB
-Inactive: 1077224 kB
-HighTotal: 15597528 kB
-HighFree: 13629632 kB
-LowTotal: 747444 kB
-LowFree: 4432 kB
-SwapTotal: 0 kB
-SwapFree: 0 kB
-Dirty: 968 kB
-Writeback: 0 kB
-AnonPages: 861800 kB
-Mapped: 280372 kB
-Shmem: 644 kB
-KReclaimable: 168048 kB
-Slab: 284364 kB
-SReclaimable: 159856 kB
-SUnreclaim: 124508 kB
-PageTables: 24448 kB
-NFS_Unstable: 0 kB
-Bounce: 0 kB
-WritebackTmp: 0 kB
-CommitLimit: 7669796 kB
-Committed_AS: 100056 kB
-VmallocTotal: 112216 kB
-VmallocUsed: 428 kB
-VmallocChunk: 111088 kB
-Percpu: 62080 kB
-HardwareCorrupted: 0 kB
-AnonHugePages: 49152 kB
-ShmemHugePages: 0 kB
-ShmemPmdMapped: 0 kB
+ > cat /proc/meminfo
+ MemTotal: 16344972 kB
+ MemFree: 13634064 kB
+ MemAvailable: 14836172 kB
+ Buffers: 3656 kB
+ Cached: 1195708 kB
+ SwapCached: 0 kB
+ Active: 891636 kB
+ Inactive: 1077224 kB
+ HighTotal: 15597528 kB
+ HighFree: 13629632 kB
+ LowTotal: 747444 kB
+ LowFree: 4432 kB
+ SwapTotal: 0 kB
+ SwapFree: 0 kB
+ Dirty: 968 kB
+ Writeback: 0 kB
+ AnonPages: 861800 kB
+ Mapped: 280372 kB
+ Shmem: 644 kB
+ KReclaimable: 168048 kB
+ Slab: 284364 kB
+ SReclaimable: 159856 kB
+ SUnreclaim: 124508 kB
+ PageTables: 24448 kB
+ NFS_Unstable: 0 kB
+ Bounce: 0 kB
+ WritebackTmp: 0 kB
+ CommitLimit: 7669796 kB
+ Committed_AS: 100056 kB
+ VmallocTotal: 112216 kB
+ VmallocUsed: 428 kB
+ VmallocChunk: 111088 kB
+ Percpu: 62080 kB
+ HardwareCorrupted: 0 kB
+ AnonHugePages: 49152 kB
+ ShmemHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
- MemTotal: Total usable ram (i.e. physical ram minus a few reserved
+MemTotal
+ Total usable RAM (i.e. physical RAM minus a few reserved
bits and the kernel binary code)
- MemFree: The sum of LowFree+HighFree
-MemAvailable: An estimate of how much memory is available for starting new
+MemFree
+ The sum of LowFree+HighFree
+MemAvailable
+ An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
SReclaimable, the size of the file LRU lists, and the low
watermarks in each zone.
@@ -930,69 +978,99 @@
page cache to function well, and that not all reclaimable
slab will be reclaimable, due to items being in use. The
impact of those factors will vary from system to system.
- Buffers: Relatively temporary storage for raw disk blocks
+Buffers
+ Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
- Cached: in-memory cache for files read from the disk (the
+Cached
+ in-memory cache for files read from the disk (the
pagecache). Doesn't include SwapCached
- SwapCached: Memory that once was swapped out, is swapped back in but
+SwapCached
+ Memory that once was swapped out, is swapped back in but
still also is in the swapfile (if memory is needed it
doesn't need to be swapped out AGAIN because it is already
in the swapfile. This saves I/O)
- Active: Memory that has been used more recently and usually not
+Active
+ Memory that has been used more recently and usually not
reclaimed unless absolutely necessary.
- Inactive: Memory which has been less recently used. It is more
+Inactive
+ Memory which has been less recently used. It is more
eligible to be reclaimed for other purposes
- HighTotal:
- HighFree: Highmem is all memory above ~860MB of physical memory
+HighTotal, HighFree
+ Highmem is all memory above ~860MB of physical memory.
Highmem areas are for use by userspace programs, or
for the pagecache. The kernel must use tricks to access
this memory, making it slower to access than lowmem.
- LowTotal:
- LowFree: Lowmem is memory which can be used for everything that
+LowTotal, LowFree
+ Lowmem is memory which can be used for everything that
highmem can be used for, but it is also available for the
kernel's use for its own data structures. Among many
other things, it is where everything from the Slab is
allocated. Bad things happen when you're out of lowmem.
- SwapTotal: total amount of swap space available
- SwapFree: Memory which has been evicted from RAM, and is temporarily
+SwapTotal
+ total amount of swap space available
+SwapFree
+ Memory which has been evicted from RAM, and is temporarily
on the disk
- Dirty: Memory which is waiting to get written back to the disk
- Writeback: Memory which is actively being written back to the disk
- AnonPages: Non-file backed pages mapped into userspace page tables
-HardwareCorrupted: The amount of RAM/memory in KB, the kernel identifies as
+Dirty
+ Memory which is waiting to get written back to the disk
+Writeback
+ Memory which is actively being written back to the disk
+AnonPages
+ Non-file backed pages mapped into userspace page tables
+HardwareCorrupted
+ The amount of RAM/memory in KB, the kernel identifies as
corrupted.
-AnonHugePages: Non-file backed huge pages mapped into userspace page tables
- Mapped: files which have been mmaped, such as libraries
- Shmem: Total memory used by shared memory (shmem) and tmpfs
-ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
+AnonHugePages
+ Non-file backed huge pages mapped into userspace page tables
+Mapped
+ files which have been mmaped, such as libraries
+Shmem
+ Total memory used by shared memory (shmem) and tmpfs
+ShmemHugePages
+ Memory used by shared memory (shmem) and tmpfs allocated
with huge pages
-ShmemPmdMapped: Shared memory mapped into userspace with huge pages
-KReclaimable: Kernel allocations that the kernel will attempt to reclaim
+ShmemPmdMapped
+ Shared memory mapped into userspace with huge pages
+KReclaimable
+ Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
direct allocations with a shrinker.
- Slab: in-kernel data structures cache
-SReclaimable: Part of Slab, that might be reclaimed, such as caches
- SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
- PageTables: amount of memory dedicated to the lowest level of page
+Slab
+ in-kernel data structures cache
+SReclaimable
+ Part of Slab, that might be reclaimed, such as caches
+SUnreclaim
+ Part of Slab, that cannot be reclaimed on memory pressure
+PageTables
+ amount of memory dedicated to the lowest level of page
tables.
-NFS_Unstable: NFS pages sent to the server, but not yet committed to stable
- storage
- Bounce: Memory used for block device "bounce buffers"
-WritebackTmp: Memory used by FUSE for temporary writeback buffers
- CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'),
+NFS_Unstable
+ Always zero. Previous counted pages which had been written to
+ the server, but has not been committed to stable storage.
+Bounce
+ Memory used for block device "bounce buffers"
+WritebackTmp
+ Memory used by FUSE for temporary writeback buffers
+CommitLimit
+ Based on the overcommit ratio ('vm.overcommit_ratio'),
this is the total amount of memory currently available to
be allocated on the system. This limit is only adhered to
if strict overcommit accounting is enabled (mode 2 in
'vm.overcommit_memory').
- The CommitLimit is calculated with the following formula:
- CommitLimit = ([total RAM pages] - [total huge TLB pages]) *
- overcommit_ratio / 100 + [total swap pages]
+
+ The CommitLimit is calculated with the following formula::
+
+ CommitLimit = ([total RAM pages] - [total huge TLB pages]) *
+ overcommit_ratio / 100 + [total swap pages]
+
For example, on a system with 1G of physical RAM and 7G
of swap with a `vm.overcommit_ratio` of 30 it would
yield a CommitLimit of 7.3G.
+
For more details, see the memory overcommit documentation
in vm/overcommit-accounting.
-Committed_AS: The amount of memory presently allocated on the system.
+Committed_AS
+ The amount of memory presently allocated on the system.
The committed memory is a sum of all of the memory which
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
@@ -1000,26 +1078,30 @@
using 1G. This 1G is memory which has been "committed" to
by the VM and can be used at any time by the allocating
application. With strict overcommit enabled on the system
- (mode 2 in 'vm.overcommit_memory'),allocations which would
+ (mode 2 in 'vm.overcommit_memory'), allocations which would
exceed the CommitLimit (detailed above) will not be permitted.
This is useful if one needs to guarantee that processes will
not fail due to lack of memory once that memory has been
successfully allocated.
-VmallocTotal: total size of vmalloc memory area
- VmallocUsed: amount of vmalloc area which is used
-VmallocChunk: largest contiguous block of vmalloc area which is free
- Percpu: Memory allocated to the percpu allocator used to back percpu
+VmallocTotal
+ total size of vmalloc memory area
+VmallocUsed
+ amount of vmalloc area which is used
+VmallocChunk
+ largest contiguous block of vmalloc area which is free
+Percpu
+ Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
-..............................................................................
-
-vmallocinfo:
+vmallocinfo
+~~~~~~~~~~~
Provides information about vmalloced/vmaped areas. One line per area,
containing the virtual address range of the area, size in bytes,
caller information of the creator, and optional information depending
-on the kind of area :
+on the kind of area:
+ ========== ===================================================
pages=nr number of pages
phys=addr if a physical address was specified
ioremap I/O mapping (ioremap() and friends)
@@ -1029,49 +1111,54 @@
vpages buffer for pages pointers was vmalloced (huge area)
N<node>=nr (Only on NUMA kernels)
Number of pages allocated on memory node <node>
+ ========== ===================================================
-> cat /proc/vmallocinfo
-0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ...
- /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
-0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ...
- /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
-0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f...
- phys=7fee8000 ioremap
-0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f...
- phys=7fee7000 ioremap
-0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210
-0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ...
- /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
-0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ...
- pages=2 vmalloc N1=2
-0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ...
- /0x130 [x_tables] pages=4 vmalloc N0=4
-0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ...
- pages=14 vmalloc N2=14
-0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ...
- pages=4 vmalloc N1=4
-0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ...
- pages=2 vmalloc N1=2
-0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ...
- pages=10 vmalloc N0=10
+::
-..............................................................................
+ > cat /proc/vmallocinfo
+ 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ...
+ /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
+ 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ...
+ /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
+ 0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f...
+ phys=7fee8000 ioremap
+ 0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f...
+ phys=7fee7000 ioremap
+ 0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210
+ 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ...
+ /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
+ 0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ...
+ pages=2 vmalloc N1=2
+ 0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ...
+ /0x130 [x_tables] pages=4 vmalloc N0=4
+ 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ...
+ pages=14 vmalloc N2=14
+ 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ...
+ pages=4 vmalloc N1=4
+ 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ...
+ pages=2 vmalloc N1=2
+ 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ...
+ pages=10 vmalloc N0=10
-softirqs:
-Provides counts of softirq handlers serviced since boot time, for each cpu.
+softirqs
+~~~~~~~~
-> cat /proc/softirqs
- CPU0 CPU1 CPU2 CPU3
- HI: 0 0 0 0
- TIMER: 27166 27120 27097 27034
- NET_TX: 0 0 0 17
- NET_RX: 42 0 0 39
- BLOCK: 0 0 107 1121
- TASKLET: 0 0 0 290
- SCHED: 27035 26983 26971 26746
- HRTIMER: 0 0 0 0
- RCU: 1678 1769 2178 2250
+Provides counts of softirq handlers serviced since boot time, for each CPU.
+
+::
+
+ > cat /proc/softirqs
+ CPU0 CPU1 CPU2 CPU3
+ HI: 0 0 0 0
+ TIMER: 27166 27120 27097 27034
+ NET_TX: 0 0 0 17
+ NET_RX: 42 0 0 39
+ BLOCK: 0 0 107 1121
+ TASKLET: 0 0 0 290
+ SCHED: 27035 26983 26971 26746
+ HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250
1.3 IDE devices in /proc/ide
@@ -1082,8 +1169,8 @@
file drivers and a link for each IDE device, pointing to the device directory
in the controller specific subtree.
-The file drivers contains general information about the drivers used for the
-IDE devices:
+The file 'drivers' contains general information about the drivers used for the
+IDE devices::
> cat /proc/ide/drivers
ide-cdrom version 4.53
@@ -1094,57 +1181,61 @@
directories contains the files shown in table 1-6.
-Table 1-6: IDE controller info in /proc/ide/ide?
-..............................................................................
- File Content
- channel IDE channel (0 or 1)
- config Configuration (only for PCI/IDE bridge)
- mate Mate name
- model Type/Chipset of IDE controller
-..............................................................................
+.. table:: Table 1-6: IDE controller info in /proc/ide/ide?
+
+ ======= =======================================
+ File Content
+ ======= =======================================
+ channel IDE channel (0 or 1)
+ config Configuration (only for PCI/IDE bridge)
+ mate Mate name
+ model Type/Chipset of IDE controller
+ ======= =======================================
Each device connected to a controller has a separate subdirectory in the
controllers directory. The files listed in table 1-7 are contained in these
directories.
-Table 1-7: IDE device information
-..............................................................................
- File Content
- cache The cache
- capacity Capacity of the medium (in 512Byte blocks)
- driver driver and version
- geometry physical and logical geometry
- identify device identify block
- media media type
- model device identifier
- settings device setup
- smart_thresholds IDE disk management thresholds
- smart_values IDE disk management values
-..............................................................................
+.. table:: Table 1-7: IDE device information
-The most interesting file is settings. This file contains a nice overview of
-the drive parameters:
+ ================ ==========================================
+ File Content
+ ================ ==========================================
+ cache The cache
+ capacity Capacity of the medium (in 512Byte blocks)
+ driver driver and version
+ geometry physical and logical geometry
+ identify device identify block
+ media media type
+ model device identifier
+ settings device setup
+ smart_thresholds IDE disk management thresholds
+ smart_values IDE disk management values
+ ================ ==========================================
- # cat /proc/ide/ide0/hda/settings
- name value min max mode
- ---- ----- --- --- ----
- bios_cyl 526 0 65535 rw
- bios_head 255 0 255 rw
- bios_sect 63 0 63 rw
- breada_readahead 4 0 127 rw
- bswap 0 0 1 r
- file_readahead 72 0 2097151 rw
- io_32bit 0 0 3 rw
- keepsettings 0 0 1 rw
- max_kb_per_request 122 1 127 rw
- multcount 0 0 8 rw
- nice1 1 0 1 rw
- nowerr 0 0 1 rw
- pio_mode write-only 0 255 w
- slow 0 0 1 rw
- unmaskirq 0 0 1 rw
- using_dma 0 0 1 rw
+The most interesting file is ``settings``. This file contains a nice
+overview of the drive parameters::
+
+ # cat /proc/ide/ide0/hda/settings
+ name value min max mode
+ ---- ----- --- --- ----
+ bios_cyl 526 0 65535 rw
+ bios_head 255 0 255 rw
+ bios_sect 63 0 63 rw
+ breada_readahead 4 0 127 rw
+ bswap 0 0 1 r
+ file_readahead 72 0 2097151 rw
+ io_32bit 0 0 3 rw
+ keepsettings 0 0 1 rw
+ max_kb_per_request 122 1 127 rw
+ multcount 0 0 8 rw
+ nice1 1 0 1 rw
+ nowerr 0 0 1 rw
+ pio_mode write-only 0 255 w
+ slow 0 0 1 rw
+ unmaskirq 0 0 1 rw
+ using_dma 0 0 1 rw
1.4 Networking info in /proc/net
@@ -1155,67 +1246,70 @@
support this. Table 1-9 lists the files and their meaning.
-Table 1-8: IPv6 info in /proc/net
-..............................................................................
- File Content
- udp6 UDP sockets (IPv6)
- tcp6 TCP sockets (IPv6)
- raw6 Raw device statistics (IPv6)
- igmp6 IP multicast addresses, which this host joined (IPv6)
- if_inet6 List of IPv6 interface addresses
- ipv6_route Kernel routing table for IPv6
- rt6_stats Global IPv6 routing tables statistics
- sockstat6 Socket statistics (IPv6)
- snmp6 Snmp data (IPv6)
-..............................................................................
+.. table:: Table 1-8: IPv6 info in /proc/net
+ ========== =====================================================
+ File Content
+ ========== =====================================================
+ udp6 UDP sockets (IPv6)
+ tcp6 TCP sockets (IPv6)
+ raw6 Raw device statistics (IPv6)
+ igmp6 IP multicast addresses, which this host joined (IPv6)
+ if_inet6 List of IPv6 interface addresses
+ ipv6_route Kernel routing table for IPv6
+ rt6_stats Global IPv6 routing tables statistics
+ sockstat6 Socket statistics (IPv6)
+ snmp6 Snmp data (IPv6)
+ ========== =====================================================
-Table 1-9: Network info in /proc/net
-..............................................................................
- File Content
- arp Kernel ARP table
- dev network devices with statistics
+.. table:: Table 1-9: Network info in /proc/net
+
+ ============= ================================================================
+ File Content
+ ============= ================================================================
+ arp Kernel ARP table
+ dev network devices with statistics
dev_mcast the Layer2 multicast groups a device is listening too
(interface index, label, number of references, number of bound
- addresses).
- dev_stat network device status
- ip_fwchains Firewall chain linkage
- ip_fwnames Firewall chain names
- ip_masq Directory containing the masquerading tables
- ip_masquerade Major masquerading table
- netstat Network statistics
- raw raw device statistics
- route Kernel routing table
- rpc Directory containing rpc info
- rt_cache Routing cache
- snmp SNMP data
- sockstat Socket statistics
- tcp TCP sockets
- udp UDP sockets
- unix UNIX domain sockets
- wireless Wireless interface data (Wavelan etc)
- igmp IP multicast addresses, which this host joined
- psched Global packet scheduler parameters.
- netlink List of PF_NETLINK sockets
- ip_mr_vifs List of multicast virtual interfaces
- ip_mr_cache List of multicast routing cache
-..............................................................................
+ addresses).
+ dev_stat network device status
+ ip_fwchains Firewall chain linkage
+ ip_fwnames Firewall chain names
+ ip_masq Directory containing the masquerading tables
+ ip_masquerade Major masquerading table
+ netstat Network statistics
+ raw raw device statistics
+ route Kernel routing table
+ rpc Directory containing rpc info
+ rt_cache Routing cache
+ snmp SNMP data
+ sockstat Socket statistics
+ tcp TCP sockets
+ udp UDP sockets
+ unix UNIX domain sockets
+ wireless Wireless interface data (Wavelan etc)
+ igmp IP multicast addresses, which this host joined
+ psched Global packet scheduler parameters.
+ netlink List of PF_NETLINK sockets
+ ip_mr_vifs List of multicast virtual interfaces
+ ip_mr_cache List of multicast routing cache
+ ============= ================================================================
You can use this information to see which network devices are available in
-your system and how much traffic was routed over those devices:
+your system and how much traffic was routed over those devices::
- > cat /proc/net/dev
- Inter-|Receive |[...
- face |bytes packets errs drop fifo frame compressed multicast|[...
- lo: 908188 5596 0 0 0 0 0 0 [...
- ppp0:15475140 20721 410 0 0 410 0 0 [...
- eth0: 614530 7085 0 0 0 0 0 1 [...
-
- ...] Transmit
- ...] bytes packets errs drop fifo colls carrier compressed
- ...] 908188 5596 0 0 0 0 0 0
- ...] 1375103 17405 0 0 0 0 0 0
- ...] 1703981 5535 0 0 0 3 0 0
+ > cat /proc/net/dev
+ Inter-|Receive |[...
+ face |bytes packets errs drop fifo frame compressed multicast|[...
+ lo: 908188 5596 0 0 0 0 0 0 [...
+ ppp0:15475140 20721 410 0 0 410 0 0 [...
+ eth0: 614530 7085 0 0 0 0 0 1 [...
+
+ ...] Transmit
+ ...] bytes packets errs drop fifo colls carrier compressed
+ ...] 908188 5596 0 0 0 0 0 0
+ ...] 1375103 17405 0 0 0 0 0 0
+ ...] 1703981 5535 0 0 0 3 0 0
In addition, each Channel Bond interface has its own directory. For
example, the bond0 device will have a directory called /proc/net/bond0/.
@@ -1228,62 +1322,62 @@
If you have a SCSI host adapter in your system, you'll find a subdirectory
named after the driver for this adapter in /proc/scsi. You'll also see a list
-of all recognized SCSI devices in /proc/scsi:
+of all recognized SCSI devices in /proc/scsi::
- >cat /proc/scsi/scsi
- Attached devices:
- Host: scsi0 Channel: 00 Id: 00 Lun: 00
- Vendor: IBM Model: DGHS09U Rev: 03E0
- Type: Direct-Access ANSI SCSI revision: 03
- Host: scsi0 Channel: 00 Id: 06 Lun: 00
- Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04
- Type: CD-ROM ANSI SCSI revision: 02
+ >cat /proc/scsi/scsi
+ Attached devices:
+ Host: scsi0 Channel: 00 Id: 00 Lun: 00
+ Vendor: IBM Model: DGHS09U Rev: 03E0
+ Type: Direct-Access ANSI SCSI revision: 03
+ Host: scsi0 Channel: 00 Id: 06 Lun: 00
+ Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04
+ Type: CD-ROM ANSI SCSI revision: 02
The directory named after the driver has one file for each adapter found in
the system. These files contain information about the controller, including
the used IRQ and the IO address range. The amount of information shown is
dependent on the adapter you use. The example shows the output for an Adaptec
-AHA-2940 SCSI adapter:
+AHA-2940 SCSI adapter::
- > cat /proc/scsi/aic7xxx/0
-
- Adaptec AIC7xxx driver version: 5.1.19/3.2.4
- Compile Options:
- TCQ Enabled By Default : Disabled
- AIC7XXX_PROC_STATS : Disabled
- AIC7XXX_RESET_DELAY : 5
- Adapter Configuration:
- SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
- Ultra Wide Controller
- PCI MMAPed I/O Base: 0xeb001000
- Adapter SEEPROM Config: SEEPROM found and used.
- Adaptec SCSI BIOS: Enabled
- IRQ: 10
- SCBs: Active 0, Max Active 2,
- Allocated 15, HW 16, Page 255
- Interrupts: 160328
- BIOS Control Word: 0x18b6
- Adapter Control Word: 0x005b
- Extended Translation: Enabled
- Disconnect Enable Flags: 0xffff
- Ultra Enable Flags: 0x0001
- Tag Queue Enable Flags: 0x0000
- Ordered Queue Tag Flags: 0x0000
- Default Tag Queue Depth: 8
- Tagged Queue By Device array for aic7xxx host instance 0:
- {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
- Actual queue depth per device for aic7xxx host instance 0:
- {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
- Statistics:
- (scsi0:0:0:0)
- Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
- Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
- Total transfers 160151 (74577 reads and 85574 writes)
- (scsi0:0:6:0)
- Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
- Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
- Total transfers 0 (0 reads and 0 writes)
+ > cat /proc/scsi/aic7xxx/0
+
+ Adaptec AIC7xxx driver version: 5.1.19/3.2.4
+ Compile Options:
+ TCQ Enabled By Default : Disabled
+ AIC7XXX_PROC_STATS : Disabled
+ AIC7XXX_RESET_DELAY : 5
+ Adapter Configuration:
+ SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter
+ Ultra Wide Controller
+ PCI MMAPed I/O Base: 0xeb001000
+ Adapter SEEPROM Config: SEEPROM found and used.
+ Adaptec SCSI BIOS: Enabled
+ IRQ: 10
+ SCBs: Active 0, Max Active 2,
+ Allocated 15, HW 16, Page 255
+ Interrupts: 160328
+ BIOS Control Word: 0x18b6
+ Adapter Control Word: 0x005b
+ Extended Translation: Enabled
+ Disconnect Enable Flags: 0xffff
+ Ultra Enable Flags: 0x0001
+ Tag Queue Enable Flags: 0x0000
+ Ordered Queue Tag Flags: 0x0000
+ Default Tag Queue Depth: 8
+ Tagged Queue By Device array for aic7xxx host instance 0:
+ {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255}
+ Actual queue depth per device for aic7xxx host instance 0:
+ {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
+ Statistics:
+ (scsi0:0:0:0)
+ Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8
+ Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0)
+ Total transfers 160151 (74577 reads and 85574 writes)
+ (scsi0:0:6:0)
+ Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15
+ Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0)
+ Total transfers 0 (0 reads and 0 writes)
1.6 Parallel port info in /proc/parport
@@ -1296,50 +1390,54 @@
These directories contain the four files shown in Table 1-10.
-Table 1-10: Files in /proc/parport
-..............................................................................
- File Content
- autoprobe Any IEEE-1284 device ID information that has been acquired.
+.. table:: Table 1-10: Files in /proc/parport
+
+ ========= ====================================================================
+ File Content
+ ========= ====================================================================
+ autoprobe Any IEEE-1284 device ID information that has been acquired.
devices list of the device drivers using that port. A + will appear by the
name of the device currently using the port (it might not appear
- against any).
- hardware Parallel port's base address, IRQ line and DMA channel.
+ against any).
+ hardware Parallel port's base address, IRQ line and DMA channel.
irq IRQ that parport is using for that port. This is in a separate
file to allow you to alter it by writing a new value in (IRQ
- number or none).
-..............................................................................
+ number or none).
+ ========= ====================================================================
1.7 TTY info in /proc/tty
-------------------------
Information about the available and actually used tty's can be found in the
-directory /proc/tty.You'll find entries for drivers and line disciplines in
+directory /proc/tty. You'll find entries for drivers and line disciplines in
this directory, as shown in Table 1-11.
-Table 1-11: Files in /proc/tty
-..............................................................................
- File Content
- drivers list of drivers and their usage
- ldiscs registered line disciplines
- driver/serial usage statistic and status of single tty lines
-..............................................................................
+.. table:: Table 1-11: Files in /proc/tty
+
+ ============= ==============================================
+ File Content
+ ============= ==============================================
+ drivers list of drivers and their usage
+ ldiscs registered line disciplines
+ driver/serial usage statistic and status of single tty lines
+ ============= ==============================================
To see which tty's are currently in use, you can simply look into the file
-/proc/tty/drivers:
+/proc/tty/drivers::
- > cat /proc/tty/drivers
- pty_slave /dev/pts 136 0-255 pty:slave
- pty_master /dev/ptm 128 0-255 pty:master
- pty_slave /dev/ttyp 3 0-255 pty:slave
- pty_master /dev/pty 2 0-255 pty:master
- serial /dev/cua 5 64-67 serial:callout
- serial /dev/ttyS 4 64-67 serial
- /dev/tty0 /dev/tty0 4 0 system:vtmaster
- /dev/ptmx /dev/ptmx 5 2 system
- /dev/console /dev/console 5 1 system:console
- /dev/tty /dev/tty 5 0 system:/dev/tty
- unknown /dev/tty 4 1-63 console
+ > cat /proc/tty/drivers
+ pty_slave /dev/pts 136 0-255 pty:slave
+ pty_master /dev/ptm 128 0-255 pty:master
+ pty_slave /dev/ttyp 3 0-255 pty:slave
+ pty_master /dev/pty 2 0-255 pty:master
+ serial /dev/cua 5 64-67 serial:callout
+ serial /dev/ttyS 4 64-67 serial
+ /dev/tty0 /dev/tty0 4 0 system:vtmaster
+ /dev/ptmx /dev/ptmx 5 2 system
+ /dev/console /dev/console 5 1 system:console
+ /dev/tty /dev/tty 5 0 system:/dev/tty
+ unknown /dev/tty 4 1-63 console
1.8 Miscellaneous kernel statistics in /proc/stat
@@ -1347,7 +1445,7 @@
Various pieces of information about kernel activity are available in the
/proc/stat file. All of the numbers reported in this file are aggregates
-since the system first booted. For a quick look, simply cat the file:
+since the system first booted. For a quick look, simply cat the file::
> cat /proc/stat
cpu 2255 34 2290 22625563 6290 127 456 0 0 0
@@ -1372,13 +1470,15 @@
- idle: twiddling thumbs
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
are several problems:
- 1. Cpu will not wait for I/O to complete, iowait is the time that a task is
- waiting for I/O to complete. When cpu goes into idle state for
- outstanding task io, another task will be scheduled on this CPU.
+
+ 1. CPU will not wait for I/O to complete, iowait is the time that a task is
+ waiting for I/O to complete. When CPU goes into idle state for
+ outstanding task I/O, another task will be scheduled on this CPU.
2. In a multi-core CPU, the task waiting for I/O to complete is not running
on any CPU, so the iowait of each CPU is difficult to calculate.
3. The value of iowait field in /proc/stat will decrease in certain
conditions.
+
So, the iowait is not reliable by reading from /proc/stat.
- irq: servicing interrupts
- softirq: servicing softirqs
@@ -1422,18 +1522,19 @@
/proc/fs/ext4/dm-0). The files in each per-device directory are shown
in Table 1-12, below.
-Table 1-12: Files in /proc/fs/ext4/<devname>
-..............................................................................
- File Content
- mb_groups details of multiblock allocator buddy cache of free blocks
-..............................................................................
+.. table:: Table 1-12: Files in /proc/fs/ext4/<devname>
-2.0 /proc/consoles
-------------------
+ ============== ==========================================================
+ File Content
+ mb_groups details of multiblock allocator buddy cache of free blocks
+ ============== ==========================================================
+
+1.10 /proc/consoles
+-------------------
Shows registered system console lines.
To see which character device lines are currently used for the system console
-/dev/console, you may simply look into the file /proc/consoles:
+/dev/console, you may simply look into the file /proc/consoles::
> cat /proc/consoles
tty0 -WU (ECp) 4:7
@@ -1441,41 +1542,45 @@
The columns are:
- device name of the device
- operations R = can do read operations
- W = can do write operations
- U = can do unblank
- flags E = it is enabled
- C = it is preferred console
- B = it is primary boot console
- p = it is used for printk buffer
- b = it is not a TTY but a Braille device
- a = it is safe to use when cpu is offline
- major:minor major and minor number of the device separated by a colon
++--------------------+-------------------------------------------------------+
+| device | name of the device |
++====================+=======================================================+
+| operations | * R = can do read operations |
+| | * W = can do write operations |
+| | * U = can do unblank |
++--------------------+-------------------------------------------------------+
+| flags | * E = it is enabled |
+| | * C = it is preferred console |
+| | * B = it is primary boot console |
+| | * p = it is used for printk buffer |
+| | * b = it is not a TTY but a Braille device |
+| | * a = it is safe to use when cpu is offline |
++--------------------+-------------------------------------------------------+
+| major:minor | major and minor number of the device separated by a |
+| | colon |
++--------------------+-------------------------------------------------------+
-------------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------------
+-------
+
The /proc file system serves information about the running system. It not only
allows access to process data but also allows you to request the kernel status
by reading files in the hierarchy.
The directory structure of /proc reflects the types of information and makes
it easy, if not obvious, where to look for specific data.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-CHAPTER 2: MODIFYING SYSTEM PARAMETERS
-------------------------------------------------------------------------------
+Chapter 2: Modifying System Parameters
+======================================
-------------------------------------------------------------------------------
In This Chapter
-------------------------------------------------------------------------------
+---------------
+
* Modifying kernel parameters by writing into files found in /proc/sys
* Exploring the files which modify certain parameters
* Review of the /proc/sys file tree
-------------------------------------------------------------------------------
+------------------------------------------------------------------------------
A very interesting part of /proc is the directory /proc/sys. This is not only
a source of information, it also allows you to change parameters within the
@@ -1485,10 +1590,9 @@
everything works the way you want it to. You may have no alternative but to
reboot the machine once an error has been made.
-To change a value, simply echo the new value into the file. An example is
-given below in the section on the file system data. You need to be root to do
-this. You can create your own boot script to perform this every time your
-system boots.
+To change a value, simply echo the new value into the file.
+You need to be root to do this. You can create your own boot script
+to perform this every time your system boots.
The files in /proc/sys can be used to fine tune and monitor miscellaneous and
general things in the operation of the Linux kernel. Since some of the files
@@ -1503,25 +1607,24 @@
Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these
entries.
-------------------------------------------------------------------------------
Summary
-------------------------------------------------------------------------------
+-------
+
Certain aspects of kernel behavior can be modified at runtime, without the
need to recompile the kernel, or even to reboot the system. The files in the
/proc/sys tree can not only be read, but also modified. You can use the echo
command to write value into these files, thereby changing the default settings
of the kernel.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-CHAPTER 3: PER-PROCESS PARAMETERS
-------------------------------------------------------------------------------
+
+Chapter 3: Per-process Parameters
+=================================
3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
--------------------------------------------------------------------------------
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory (oom) conditions.
The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted. The
@@ -1530,9 +1633,6 @@
For example, if a task is using all allowed memory, its badness score will be
1000. If it is using half of its allowed memory, its score will be 500.
-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
The amount of "allowed" memory depends on the context in which the oom killer
was called. If it is due to the memory assigned to the allocating task's cpuset
being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1568,57 +1668,57 @@
value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
requires CAP_SYS_RESOURCE.
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
-This file can be used to check the current score used by the oom-killer is for
+This file can be used to check the current score used by the oom-killer for
any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
+Please note that the exported value includes oom_score_adj so it is
+effectively in range [0,2000].
+
3.3 /proc/<pid>/io - Display the IO accounting fields
-------------------------------------------------------
-This file contains IO statistics for each running process
+This file contains IO statistics for each running process.
Example
--------
+~~~~~~~
-test:/tmp # dd if=/dev/zero of=/tmp/test.dat &
-[1] 3828
+::
-test:/tmp # cat /proc/3828/io
-rchar: 323934931
-wchar: 323929600
-syscr: 632687
-syscw: 632675
-read_bytes: 0
-write_bytes: 323932160
-cancelled_write_bytes: 0
+ test:/tmp # dd if=/dev/zero of=/tmp/test.dat &
+ [1] 3828
+
+ test:/tmp # cat /proc/3828/io
+ rchar: 323934931
+ wchar: 323929600
+ syscr: 632687
+ syscw: 632675
+ read_bytes: 0
+ write_bytes: 323932160
+ cancelled_write_bytes: 0
Description
------------
+~~~~~~~~~~~
rchar
------
+^^^^^
I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
-pagecache)
+pagecache).
wchar
------
+^^^^^
I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written
@@ -1626,7 +1726,7 @@
syscr
------
+^^^^^
I/O counter: read syscalls
Attempt to count the number of read I/O operations, i.e. syscalls like read()
@@ -1634,7 +1734,7 @@
syscw
------
+^^^^^
I/O counter: write syscalls
Attempt to count the number of write I/O operations, i.e. syscalls like
@@ -1642,7 +1742,7 @@
read_bytes
-----------
+^^^^^^^^^^
I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to
@@ -1652,7 +1752,7 @@
write_bytes
------------
+^^^^^^^^^^^
I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to
@@ -1660,7 +1760,7 @@
cancelled_write_bytes
----------------------
+^^^^^^^^^^^^^^^^^^^^^
The big inaccuracy here is truncate. If a process writes 1MB to a file and
then deletes the file, it will in fact perform no writeout. But it will have
@@ -1673,12 +1773,11 @@
that.
-Note
-----
+.. Note::
-At its current implementation state, this is a bit racy on 32-bit machines: if
-process A reads process B's /proc/pid/io while process B is updating one of
-those 64-bit counters, process A could see an intermediate result.
+ At its current implementation state, this is a bit racy on 32-bit machines:
+ if process A reads process B's /proc/pid/io while process B is updating one
+ of those 64-bit counters, process A could see an intermediate result.
More information about this can be found within the taskstats documentation in
@@ -1698,12 +1797,13 @@
corresponding memory type are dumped, otherwise they are not dumped.
The following 9 memory types are supported:
+
- (bit 0) anonymous private memory
- (bit 1) anonymous shared memory
- (bit 2) file-backed private memory
- (bit 3) file-backed shared memory
- (bit 4) ELF header pages in file-backed private memory areas (it is
- effective only if the bit 2 is cleared)
+ effective only if the bit 2 is cleared)
- (bit 5) hugetlb private memory
- (bit 6) hugetlb shared memory
- (bit 7) DAX private memory
@@ -1719,13 +1819,13 @@
segments, ELF header pages and hugetlb private memory are dumped.
If you don't want to dump all shared memory segments attached to pid 1234,
-write 0x31 to the process's proc file.
+write 0x31 to the process's proc file::
$ echo 0x31 > /proc/1234/coredump_filter
When a new process is created, the process inherits the bitmask status from its
parent. It is useful to set up coredump_filter before the program runs.
-For example:
+For example::
$ echo 0x7 > /proc/self/coredump_filter
$ ./some_program
@@ -1733,44 +1833,46 @@
3.5 /proc/<pid>/mountinfo - Information about mounts
--------------------------------------------------------
-This file contains lines of the form:
+This file contains lines of the form::
-36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
-(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
+ 36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
+ (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
-(1) mount ID: unique identifier of the mount (may be reused after umount)
-(2) parent ID: ID of parent (or of self for the top of the mount tree)
-(3) major:minor: value of st_dev for files on filesystem
-(4) root: root of the mount within the filesystem
-(5) mount point: mount point relative to the process's root
-(6) mount options: per mount options
-(7) optional fields: zero or more fields of the form "tag[:value]"
-(8) separator: marks the end of the optional fields
-(9) filesystem type: name of filesystem of the form "type[.subtype]"
-(10) mount source: filesystem specific information or "none"
-(11) super options: per super block options
+ (1) mount ID: unique identifier of the mount (may be reused after umount)
+ (2) parent ID: ID of parent (or of self for the top of the mount tree)
+ (3) major:minor: value of st_dev for files on filesystem
+ (4) root: root of the mount within the filesystem
+ (5) mount point: mount point relative to the process's root
+ (6) mount options: per mount options
+ (7) optional fields: zero or more fields of the form "tag[:value]"
+ (8) separator: marks the end of the optional fields
+ (9) filesystem type: name of filesystem of the form "type[.subtype]"
+ (10) mount source: filesystem specific information or "none"
+ (11) super options: per super block options
Parsers should ignore all unrecognised optional fields. Currently the
possible optional fields are:
-shared:X mount is shared in peer group X
-master:X mount is slave to peer group X
-propagate_from:X mount is slave and receives propagation from peer group X (*)
-unbindable mount is unbindable
+================ ==============================================================
+shared:X mount is shared in peer group X
+master:X mount is slave to peer group X
+propagate_from:X mount is slave and receives propagation from peer group X [#]_
+unbindable mount is unbindable
+================ ==============================================================
-(*) X is the closest dominant peer group under the process's root. If
-X is the immediate master of the mount, or if there's no dominant peer
-group under the same root, then only the "master:X" field is present
-and not the "propagate_from:X" field.
+.. [#] X is the closest dominant peer group under the process's root. If
+ X is the immediate master of the mount, or if there's no dominant peer
+ group under the same root, then only the "master:X" field is present
+ and not the "propagate_from:X" field.
For more information on mount propagation see:
- Documentation/filesystems/sharedsubtree.txt
+ Documentation/filesystems/sharedsubtree.rst
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
--------------------------------------------------------
-These files provide a method to access a tasks comm value. It also allows for
+These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
@@ -1783,98 +1885,107 @@
of a task pointed by <pid>/<tid> pair. The format is a space separated
stream of pids.
-Note the "first level" here -- if a child has own children they will
-not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+Note the "first level" here -- if a child has its own children they will
+not be listed here; one needs to read /proc/<children-pid>/task/<tid>/children
to obtain the descendants.
Since this interface is intended to be fast and cheap it doesn't
guarantee to provide precise results and some children might be
skipped, especially if they've exited right after we printed their
-pids, so one need to either stop or freeze processes being inspected
+pids, so one needs to either stop or freeze processes being inspected
if precise results are needed.
3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
---------------------------------------------------------------
This file provides information associated with an opened file. The regular
-files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos'
+files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos'
represents the current offset of the opened file in decimal form [see lseek(2)
for details], 'flags' denotes the octal O_xxx mask the file has been
created with [see open(2) for details] and 'mnt_id' represents mount ID of
the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
for details].
-A typical output is
+A typical output is::
pos: 0
flags: 0100002
mnt_id: 19
-All locks associated with a file descriptor are shown in its fdinfo too.
+All locks associated with a file descriptor are shown in its fdinfo too::
-lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF
+ lock: 1: FLOCK ADVISORY WRITE 359 00:13:11691 0 EOF
The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags
pair provide additional information particular to the objects they represent.
- Eventfd files
- ~~~~~~~~~~~~~
+Eventfd files
+~~~~~~~~~~~~~
+
+::
+
pos: 0
flags: 04002
mnt_id: 9
eventfd-count: 5a
- where 'eventfd-count' is hex value of a counter.
+where 'eventfd-count' is hex value of a counter.
- Signalfd files
- ~~~~~~~~~~~~~~
+Signalfd files
+~~~~~~~~~~~~~~
+
+::
+
pos: 0
flags: 04002
mnt_id: 9
sigmask: 0000000000000200
- where 'sigmask' is hex value of the signal mask associated
- with a file.
+where 'sigmask' is hex value of the signal mask associated
+with a file.
- Epoll files
- ~~~~~~~~~~~
+Epoll files
+~~~~~~~~~~~
+
+::
+
pos: 0
flags: 02
mnt_id: 9
tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
- where 'tfd' is a target file descriptor number in decimal form,
- 'events' is events mask being watched and the 'data' is data
- associated with a target [see epoll(7) for more details].
+where 'tfd' is a target file descriptor number in decimal form,
+'events' is events mask being watched and the 'data' is data
+associated with a target [see epoll(7) for more details].
- The 'pos' is current offset of the target file in decimal form
- [see lseek(2)], 'ino' and 'sdev' are inode and device numbers
- where target file resides, all in hex format.
+The 'pos' is current offset of the target file in decimal form
+[see lseek(2)], 'ino' and 'sdev' are inode and device numbers
+where target file resides, all in hex format.
- Fsnotify files
- ~~~~~~~~~~~~~~
- For inotify files the format is the following
+Fsnotify files
+~~~~~~~~~~~~~~
+For inotify files the format is the following::
pos: 0
flags: 02000000
inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
- where 'wd' is a watch descriptor in decimal form, ie a target file
- descriptor number, 'ino' and 'sdev' are inode and device where the
- target file resides and the 'mask' is the mask of events, all in hex
- form [see inotify(7) for more details].
+where 'wd' is a watch descriptor in decimal form, i.e. a target file
+descriptor number, 'ino' and 'sdev' are inode and device where the
+target file resides and the 'mask' is the mask of events, all in hex
+form [see inotify(7) for more details].
- If the kernel was built with exportfs support, the path to the target
- file is encoded as a file handle. The file handle is provided by three
- fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
- format.
+If the kernel was built with exportfs support, the path to the target
+file is encoded as a file handle. The file handle is provided by three
+fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
+format.
- If the kernel is built without exportfs support the file handle won't be
- printed out.
+If the kernel is built without exportfs support the file handle won't be
+printed out.
- If there is no inotify mark attached yet the 'inotify' line will be omitted.
+If there is no inotify mark attached yet the 'inotify' line will be omitted.
- For fanotify files the format is
+For fanotify files the format is::
pos: 0
flags: 02
@@ -1883,20 +1994,22 @@
fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
- where fanotify 'flags' and 'event-flags' are values used in fanotify_init
- call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
- flags associated with mark which are tracked separately from events
- mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
- mask and 'ignored_mask' is the mask of events which are to be ignored.
- All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
- does provide information about flags and mask used in fanotify_mark
- call [see fsnotify manpage for details].
+where fanotify 'flags' and 'event-flags' are values used in fanotify_init
+call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
+flags associated with mark which are tracked separately from events
+mask. 'ino' and 'sdev' are target inode and device, 'mask' is the events
+mask and 'ignored_mask' is the mask of events which are to be ignored.
+All are in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
+provide information about flags and mask used in fanotify_mark
+call [see fsnotify manpage for details].
- While the first three lines are mandatory and always printed, the rest is
- optional and may be omitted if no marks created yet.
+While the first three lines are mandatory and always printed, the rest is
+optional and may be omitted if no marks created yet.
- Timerfd files
- ~~~~~~~~~~~~~
+Timerfd files
+~~~~~~~~~~~~~
+
+::
pos: 0
flags: 02
@@ -1907,18 +2020,18 @@
it_value: (0, 49406829)
it_interval: (1, 0)
- where 'clockid' is the clock type and 'ticks' is the number of the timer expirations
- that have occurred [see timerfd_create(2) for details]. 'settime flags' are
- flags in octal form been used to setup the timer [see timerfd_settime(2) for
- details]. 'it_value' is remaining time until the timer exiration.
- 'it_interval' is the interval for the timer. Note the timer might be set up
- with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value'
- still exhibits timer's remaining time.
+where 'clockid' is the clock type and 'ticks' is the number of the timer expirations
+that have occurred [see timerfd_create(2) for details]. 'settime flags' are
+flags in octal form been used to setup the timer [see timerfd_settime(2) for
+details]. 'it_value' is remaining time until the timer expiration.
+'it_interval' is the interval for the timer. Note the timer might be set up
+with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value'
+still exhibits timer's remaining time.
3.9 /proc/<pid>/map_files - Information about memory mapped files
---------------------------------------------------------------------
This directory contains symbolic links which represent memory mapped files
-the process is maintaining. Example output:
+the process is maintaining. Example output::
| lr-------- 1 root root 64 Jan 27 11:24 333c600000-333c620000 -> /usr/lib64/ld-2.18.so
| lr-------- 1 root root 64 Jan 27 11:24 333c81f000-333c820000 -> /usr/lib64/ld-2.18.so
@@ -1940,13 +2053,13 @@
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
---------------------------------------------------------
This file provides the value of the task's timerslack value in nanoseconds.
-This value specifies a amount of time that normal timers may be deferred
+This value specifies an amount of time that normal timers may be deferred
in order to coalesce timers and avoid unnecessary wakeups.
-This allows a task's interactivity vs power consumption trade off to be
+This allows a task's interactivity vs power consumption tradeoff to be
adjusted.
-Writing 0 to the file will set the tasks timerslack to the default value.
+Writing 0 to the file will set the task's timerslack to the default value.
Valid values are from 0 - ULLONG_MAX
@@ -1976,17 +2089,22 @@
architecture specific status of the task.
Example
--------
+~~~~~~~
+
+::
+
$ cat /proc/6753/arch_status
AVX512_elapsed_ms: 8
Description
------------
+~~~~~~~~~~~
-x86 specific entries:
----------------------
- AVX512_elapsed_ms:
- ------------------
+x86 specific entries
+~~~~~~~~~~~~~~~~~~~~~
+
+AVX512_elapsed_ms
+^^^^^^^^^^^^^^^^^^
+
If AVX512 is supported on the machine, this entry shows the milliseconds
elapsed since the last time AVX512 usage was recorded. The recording
happens on a best effort basis when a task is scheduled out. This means
@@ -2010,38 +2128,89 @@
the task is unlikely an AVX512 user, but depends on the workload and the
scheduling scenario, it also could be a false negative mentioned above.
-------------------------------------------------------------------------------
-Configuring procfs
-------------------------------------------------------------------------------
+Chapter 4: Configuring procfs
+=============================
4.1 Mount options
---------------------
The following mount options are supported:
+ ========= ========================================================
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
+ subset= Show only the specified subset of procfs.
+ ========= ========================================================
-hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
-(default).
+hidepid=off or hidepid=0 means classic mode - everybody may access all
+/proc/<pid>/ directories (default).
-hidepid=1 means users may not access any /proc/<pid>/ directories but their
-own. Sensitive files like cmdline, sched*, status are now protected against
-other users. This makes it impossible to learn whether any user runs
-specific program (given the program doesn't reveal itself by its behaviour).
-As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
-poorly written programs passing sensitive information via program arguments are
-now protected against local eavesdroppers.
+hidepid=noaccess or hidepid=1 means users may not access any /proc/<pid>/
+directories but their own. Sensitive files like cmdline, sched*, status are now
+protected against other users. This makes it impossible to learn whether any
+user runs specific program (given the program doesn't reveal itself by its
+behaviour). As an additional bonus, as /proc/<pid>/cmdline is unaccessible for
+other users, poorly written programs passing sensitive information via program
+arguments are now protected against local eavesdroppers.
-hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
-users. It doesn't mean that it hides a fact whether a process with a specific
-pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
-but it hides process' uid and gid, which may be learned by stat()'ing
-/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering
-information about running processes, whether some daemon runs with elevated
-privileges, whether other user runs some sensitive program, whether other users
-run any program at all, etc.
+hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
+fully invisible to other users. It doesn't mean that it hides a fact whether a
+process with a specific pid value exists (it can be learned by other means, e.g.
+by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
+gathering information about running processes, whether some daemon runs with
+elevated privileges, whether other user runs some sensitive program, whether
+other users run any program at all, etc.
+
+hidepid=ptraceable or hidepid=4 means that procfs should only contain
+/proc/<pid>/ directories that the caller can ptrace.
gid= defines a group authorized to learn processes information otherwise
prohibited by hidepid=. If you use some daemon like identd which needs to learn
information about processes information, just add identd to this group.
+
+subset=pid hides all top level files and directories in the procfs that
+are not related to tasks.
+
+Chapter 5: Filesystem behavior
+==============================
+
+Originally, before the advent of pid namepsace, procfs was a global file
+system. It means that there was only one procfs instance in the system.
+
+When pid namespace was added, a separate procfs instance was mounted in
+each pid namespace. So, procfs mount options are global among all
+mountpoints within the same namespace::
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
+
+ # strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+ mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+ +++ exited with 0 +++
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+
+and only after remounting procfs mount options will change at all
+mountpoints::
+
+ # mount -o remount,hidepid=1 -t proc proc /tmp/proc
+
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=1 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+
+This behavior is different from the behavior of other filesystems.
+
+The new procfs behavior is more like other filesystems. Each procfs mount
+creates a new procfs instance. Mount options affect own procfs instance.
+It means that it became possible to have several procfs instances
+displaying tasks with different filtering options in one pid namespace::
+
+ # mount -o hidepid=invisible -t proc proc /proc
+ # mount -o hidepid=noaccess -t proc proc /tmp/proc
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=invisible 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.rst
similarity index 98%
rename from Documentation/filesystems/qnx6.txt
rename to Documentation/filesystems/qnx6.rst
index 48ea68f..fd13433 100644
--- a/Documentation/filesystems/qnx6.txt
+++ b/Documentation/filesystems/qnx6.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
The QNX6 Filesystem
===================
@@ -14,10 +17,12 @@
qnx6fs shares many properties with traditional Unix filesystems. It has the
concepts of blocks, inodes and directories.
+
On QNX it is possible to create little endian and big endian qnx6 filesystems.
This feature makes it possible to create and use a different endianness fs
for the target (QNX is used on quite a range of embedded systems) platform
running on a different endianness.
+
The Linux driver handles endianness transparently. (LE and BE)
Blocks
@@ -26,6 +31,7 @@
The space in the device or file is split up into blocks. These are a fixed
size of 512, 1024, 2048 or 4096, which is decided when the filesystem is
created.
+
Blockpointers are 32bit, so the maximum space that can be addressed is
2^32 * 4096 bytes or 16TB
@@ -50,6 +56,7 @@
data and the addressing levels in that specific tree.
If the level value is 0, up to 16 direct blocks can be addressed by each
node.
+
Level 1 adds an additional indirect addressing level where each indirect
addressing block holds up to blocksize / 4 bytes pointers to data blocks.
Level 2 adds an additional indirect addressing block level (so, already up
@@ -57,11 +64,13 @@
Unused block pointers are always set to ~0 - regardless of root node,
indirect addressing blocks or inodes.
+
Data leaves are always on the lowest level. So no data is stored on upper
tree levels.
The first Superblock is located at 0x2000. (0x2000 is the bootblock size)
The Audi MMI 3G first superblock directly starts at byte 0.
+
Second superblock position can either be calculated from the superblock
information (total number of filesystem blocks) or by taking the highest
device address, zeroing the last 3 bytes and then subtracting 0x1000 from
@@ -84,6 +93,7 @@
There are also pointers to the first 16 blocks, if the object data can be
addressed with 16 direct blocks.
+
For more than 16 blocks an indirect addressing in form of another tree is
used. (scheme is the same as the one used for the superblock root nodes)
@@ -96,13 +106,18 @@
A directory is a filesystem object and has an inode just like a file.
It is a specially formatted file containing records which associate each
name with an inode number.
+
'.' inode number points to the directory inode
+
'..' inode number points to the parent directory inode
+
Eeach filename record additionally got a filename length field.
One special case are long filenames or subdirectory names.
+
These got set a filename length field of 0xff in the corresponding directory
record plus the longfile inode number also stored in that record.
+
With that longfilename inode number, the longfilename tree can be walked
starting with the superblock longfilename root node pointers.
@@ -111,6 +126,7 @@
Symbolic links are also filesystem objects with inodes. They got a specific
bit in the inode mode field identifying them as symbolic link.
+
The directory entry file inode pointer points to the target file inode.
Hard links got an inode, a directory entry, but a specific mode bit set,
@@ -126,9 +142,11 @@
Long filenames are stored in a separate addressing tree. The staring point
is the longfilename root node in the active superblock.
+
Each data block (tree leaves) holds one long filename. That filename is
limited to 510 bytes. The first two starting bytes are used as length field
for the actual filename.
+
If that structure shall fit for all allowed blocksizes, it is clear why there
is a limit of 510 bytes for the actual filename stored.
@@ -138,6 +156,7 @@
The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap
root node in the superblock and each bit in the bitmap represents one
filesystem block.
+
The first block is block 0, which starts 0x1000 after superblock start.
So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical
address at which block 0 is located.
@@ -149,11 +168,14 @@
------------------
The bitmap itself is divided into three parts.
+
First the system area, that is split into two halves.
+
Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
+
Each superblock got it's own half of the system area. So superblock #1
always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
@@ -163,7 +185,7 @@
The rational behind that is that a write request can work on a new snapshot
(system area of the inactive - resp. lower serial numbered superblock) while
-at the same time there is still a complete stable filesystem structer in the
+at the same time there is still a complete stable filesystem structure in the
other half of the system area.
When finished with writing (a sync write is completed, the maximum sync leap
diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.rst
similarity index 71%
rename from Documentation/filesystems/quota.txt
rename to Documentation/filesystems/quota.rst
index 32874b0..abd4303 100644
--- a/Documentation/filesystems/quota.txt
+++ b/Documentation/filesystems/quota.rst
@@ -1,4 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+===============
Quota subsystem
===============
@@ -16,7 +18,7 @@
filesystem.
For more details about quota design, see the documentation in quota-tools package
-(http://sourceforge.net/projects/linuxquota).
+(https://sourceforge.net/projects/linuxquota).
Quota netlink interface
=======================
@@ -29,16 +31,17 @@
and processed accordingly.
The interface uses generic netlink framework (see
-http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more
-details about this layer). The name of the quota generic netlink interface
-is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>.
-Since the quota netlink protocol is not namespace aware, quota netlink messages
-are sent only in initial network namespace.
+https://lwn.net/Articles/208755/ and http://www.infradead.org/~tgr/libnl/ for
+more details about this layer). The name of the quota generic netlink interface
+is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. Since
+the quota netlink protocol is not namespace aware, quota netlink messages are
+sent only in initial network namespace.
Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
This command is used to send a notification about any of the above mentioned
events. Each message has six attributes. These are (type of the argument is
in parentheses):
+
QUOTA_NL_A_QTYPE (u32)
- type of quota being exceeded (one of USRQUOTA, GRPQUOTA)
QUOTA_NL_A_EXCESS_ID (u64)
@@ -48,20 +51,34 @@
- UID of a user who caused the event
QUOTA_NL_A_WARNING (u32)
- what kind of limit is exceeded:
- QUOTA_NL_IHARDWARN - inode hardlimit
- QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer
- than given grace period
- QUOTA_NL_ISOFTWARN - inode softlimit
- QUOTA_NL_BHARDWARN - space (block) hardlimit
- QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded
- longer than given grace period.
- QUOTA_NL_BSOFTWARN - space (block) softlimit
+
+ QUOTA_NL_IHARDWARN
+ inode hardlimit
+ QUOTA_NL_ISOFTLONGWARN
+ inode softlimit is exceeded longer
+ than given grace period
+ QUOTA_NL_ISOFTWARN
+ inode softlimit
+ QUOTA_NL_BHARDWARN
+ space (block) hardlimit
+ QUOTA_NL_BSOFTLONGWARN
+ space (block) softlimit is exceeded
+ longer than given grace period.
+ QUOTA_NL_BSOFTWARN
+ space (block) softlimit
+
- four warnings are also defined for the event when user stops
exceeding some limit:
- QUOTA_NL_IHARDBELOW - inode hardlimit
- QUOTA_NL_ISOFTBELOW - inode softlimit
- QUOTA_NL_BHARDBELOW - space (block) hardlimit
- QUOTA_NL_BSOFTBELOW - space (block) softlimit
+
+ QUOTA_NL_IHARDBELOW
+ inode hardlimit
+ QUOTA_NL_ISOFTBELOW
+ inode softlimit
+ QUOTA_NL_BHARDBELOW
+ space (block) hardlimit
+ QUOTA_NL_BSOFTBELOW
+ space (block) softlimit
+
QUOTA_NL_A_DEV_MAJOR (u32)
- major number of a device with the affected filesystem
QUOTA_NL_A_DEV_MINOR (u32)
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
similarity index 89%
rename from Documentation/filesystems/ramfs-rootfs-initramfs.txt
rename to Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 97d42cc..1649606 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -1,5 +1,11 @@
-ramfs, rootfs and initramfs
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Ramfs, rootfs and initramfs
+===========================
+
October 17, 2005
+
Rob Landley <rob@landley.net>
=============================
@@ -65,7 +71,7 @@
A ramfs derivative called tmpfs was created to add size limits, and the ability
to write the data to swap space. Normal users can be allowed write access to
-tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information.
+tmpfs mounts. See Documentation/filesystems/tmpfs.rst for more information.
What is rootfs?
---------------
@@ -99,14 +105,14 @@
All this differs from the old initrd in several ways:
- The old initrd was always a separate file, while the initramfs archive is
- linked into the linux kernel image. (The directory linux-*/usr is devoted
- to generating this archive during the build.)
+ linked into the linux kernel image. (The directory ``linux-*/usr`` is
+ devoted to generating this archive during the build.)
- The old initrd file was a gzipped filesystem image (in some file format,
such as ext2, that needed a driver built into the kernel), while the new
initramfs archive is a gzipped cpio archive (like tar only simpler,
- see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The
- kernel's cpio extraction code is not only extremely small, it's also
+ see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst).
+ The kernel's cpio extraction code is not only extremely small, it's also
__init text and data that can be discarded during the boot process.
- The program run by the old initrd (which was called /initrd, not /init) did
@@ -139,7 +145,7 @@
initramfs archive, which will automatically be incorporated into the
resulting binary. This option can point to an existing gzipped cpio
archive, a directory containing files to be archived, or a text file
-specification such as the following example:
+specification such as the following example::
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
@@ -164,7 +170,7 @@
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
-usr/gen_initramfs_list.sh), and proceeds to package up that directory
+usr/gen_initramfs.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
@@ -175,12 +181,12 @@
(instead of a config file or directory).
The following command line can extract a cpio image (either by the above script
-or by the kernel build) back into its component files:
+or by the kernel build) back into its component files::
cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames
The following shell script can create a prebuilt cpio archive you can
-use in place of the above config file:
+use in place of the above config file::
#!/bin/sh
@@ -202,14 +208,17 @@
exit 1
fi
-Note: The cpio man page contains some bad advice that will break your initramfs
-archive if you follow it. It says "A typical way to generate the list
-of filenames is with the find command; you should give find the -depth option
-to minimize problems with permissions on directories that are unwritable or not
-searchable." Don't do this when creating initramfs.cpio.gz images, it won't
-work. The Linux kernel cpio extractor won't create files in a directory that
-doesn't exist, so the directory entries must go before the files that go in
-those directories. The above script gets them in the right order.
+.. Note::
+
+ The cpio man page contains some bad advice that will break your initramfs
+ archive if you follow it. It says "A typical way to generate the list
+ of filenames is with the find command; you should give find the -depth
+ option to minimize problems with permissions on directories that are
+ unwritable or not searchable." Don't do this when creating
+ initramfs.cpio.gz images, it won't work. The Linux kernel cpio extractor
+ won't create files in a directory that doesn't exist, so the directory
+ entries must go before the files that go in those directories.
+ The above script gets them in the right order.
External initramfs images:
--------------------------
@@ -236,15 +245,16 @@
If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
-http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
-http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
-http://www.linuxfromscratch.org/lfs/view/stable/
-The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+- https://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+- https://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+- http://www.linuxfromscratch.org/lfs/view/stable/
+
+The "klibc" package (https://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
-I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+I use uClibc (https://www.uclibc.org) and busybox (https://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.3 release.)
@@ -255,7 +265,7 @@
A good first step is to get initramfs to run a statically linked "hello world"
program as init, and test it under an emulator like qemu (www.qemu.org) or
-User Mode Linux, like so:
+User Mode Linux, like so::
cat > hello.c << EOF
#include <stdio.h>
@@ -326,8 +336,8 @@
explained his reasoning:
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
- http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
+ - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html
+ - http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html
and, most importantly, designed and implemented the initramfs code.
diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.rst
similarity index 90%
rename from Documentation/filesystems/relay.txt
rename to Documentation/filesystems/relay.rst
index cd709a9..04ad083 100644
--- a/Documentation/filesystems/relay.txt
+++ b/Documentation/filesystems/relay.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
relay interface (formerly relayfs)
==================================
@@ -108,6 +111,7 @@
access to relay channel buffer data. Here are the file operations
that are available and some comments regarding their behavior:
+=========== ============================================================
open() enables user to open an _existing_ channel buffer.
mmap() results in channel buffer being mapped into the caller's
@@ -136,13 +140,16 @@
close() decrements the channel buffer's refcount. When the refcount
reaches 0, i.e. when no process or kernel client has the
buffer open, the channel buffer is freed.
+=========== ============================================================
In order for a user application to make use of relay files, the
-host filesystem must be mounted. For example,
+host filesystem must be mounted. For example::
mount -t debugfs debugfs /sys/kernel/debug
-NOTE: the host filesystem doesn't need to be mounted for kernel
+.. Note::
+
+ the host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@@ -154,7 +161,7 @@
Here's a summary of the API the relay interface provides to in-kernel clients:
TBD(curr. line MT:/API/)
- channel management functions:
+ channel management functions::
relay_open(base_filename, parent, subbuf_size, n_subbufs,
callbacks, private_data)
@@ -162,17 +169,17 @@
relay_flush(chan)
relay_reset(chan)
- channel management typically called on instigation of userspace:
+ channel management typically called on instigation of userspace::
relay_subbufs_consumed(chan, cpu, subbufs_consumed)
- write functions:
+ write functions::
relay_write(chan, data, length)
__relay_write(chan, data, length)
relay_reserve(chan, length)
- callbacks:
+ callbacks::
subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
buf_mapped(buf, filp)
@@ -180,7 +187,7 @@
create_buf_file(filename, parent, mode, buf, is_global)
remove_buf_file(dentry)
- helper functions:
+ helper functions::
relay_buf_full(buf)
subbuf_start_reserve(buf, length)
@@ -215,41 +222,41 @@
relay_close().
Here are some typical definitions for these callbacks, in this case
-using debugfs:
+using debugfs::
-/*
- * create_buf_file() callback. Creates relay file in debugfs.
- */
-static struct dentry *create_buf_file_handler(const char *filename,
- struct dentry *parent,
- umode_t mode,
- struct rchan_buf *buf,
- int *is_global)
-{
- return debugfs_create_file(filename, mode, parent, buf,
- &relay_file_operations);
-}
+ /*
+ * create_buf_file() callback. Creates relay file in debugfs.
+ */
+ static struct dentry *create_buf_file_handler(const char *filename,
+ struct dentry *parent,
+ umode_t mode,
+ struct rchan_buf *buf,
+ int *is_global)
+ {
+ return debugfs_create_file(filename, mode, parent, buf,
+ &relay_file_operations);
+ }
-/*
- * remove_buf_file() callback. Removes relay file from debugfs.
- */
-static int remove_buf_file_handler(struct dentry *dentry)
-{
- debugfs_remove(dentry);
+ /*
+ * remove_buf_file() callback. Removes relay file from debugfs.
+ */
+ static int remove_buf_file_handler(struct dentry *dentry)
+ {
+ debugfs_remove(dentry);
- return 0;
-}
+ return 0;
+ }
-/*
- * relay interface callbacks
- */
-static struct rchan_callbacks relay_callbacks =
-{
- .create_buf_file = create_buf_file_handler,
- .remove_buf_file = remove_buf_file_handler,
-};
+ /*
+ * relay interface callbacks
+ */
+ static struct rchan_callbacks relay_callbacks =
+ {
+ .create_buf_file = create_buf_file_handler,
+ .remove_buf_file = remove_buf_file_handler,
+ };
-And an example relay_open() invocation using them:
+And an example relay_open() invocation using them::
chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL);
@@ -339,23 +346,23 @@
To implement 'no-overwrite' mode, the userspace client would provide
an implementation of the subbuf_start() callback something like the
-following:
+following::
-static int subbuf_start(struct rchan_buf *buf,
- void *subbuf,
- void *prev_subbuf,
- unsigned int prev_padding)
-{
- if (prev_subbuf)
- *((unsigned *)prev_subbuf) = prev_padding;
+ static int subbuf_start(struct rchan_buf *buf,
+ void *subbuf,
+ void *prev_subbuf,
+ unsigned int prev_padding)
+ {
+ if (prev_subbuf)
+ *((unsigned *)prev_subbuf) = prev_padding;
- if (relay_buf_full(buf))
- return 0;
+ if (relay_buf_full(buf))
+ return 0;
- subbuf_start_reserve(buf, sizeof(unsigned int));
+ subbuf_start_reserve(buf, sizeof(unsigned int));
- return 1;
-}
+ return 1;
+ }
If the current buffer is full, i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
@@ -370,20 +377,20 @@
buffer switch can continue.
The implementation of the subbuf_start() callback for 'overwrite' mode
-would be very similar:
+would be very similar::
-static int subbuf_start(struct rchan_buf *buf,
- void *subbuf,
- void *prev_subbuf,
- size_t prev_padding)
-{
- if (prev_subbuf)
- *((unsigned *)prev_subbuf) = prev_padding;
+ static int subbuf_start(struct rchan_buf *buf,
+ void *subbuf,
+ void *prev_subbuf,
+ size_t prev_padding)
+ {
+ if (prev_subbuf)
+ *((unsigned *)prev_subbuf) = prev_padding;
- subbuf_start_reserve(buf, sizeof(unsigned int));
+ subbuf_start_reserve(buf, sizeof(unsigned int));
- return 1;
-}
+ return 1;
+ }
In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.rst
similarity index 86%
rename from Documentation/filesystems/romfs.txt
rename to Documentation/filesystems/romfs.rst
index e2b07cc..465b11e 100644
--- a/Documentation/filesystems/romfs.txt
+++ b/Documentation/filesystems/romfs.rst
@@ -1,4 +1,8 @@
-ROMFS - ROM FILE SYSTEM
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+ROMFS - ROM File System
+=======================
This is a quite dumb, read only filesystem, mainly for initial RAM
disks of installation disks. It has grown up by the need of having
@@ -51,9 +55,9 @@
bytes. This is quite rare however, since most file names are longer
than 3 bytes, and shorter than 15 bytes.
-The layout of the filesystem is the following:
+The layout of the filesystem is the following::
-offset content
+ offset content
+---+---+---+---+
0 | - | r | o | m | \
@@ -84,9 +88,9 @@
reliable, it does not require any tables, and it is very simple.
The following bytes are now part of the file system; each file header
-must begin on a 16 byte boundary.
+must begin on a 16 byte boundary::
-offset content
+ offset content
+---+---+---+---+
0 | next filehdr|X| The offset of the next file header
@@ -114,7 +118,9 @@
intended use. The mapping of the 8 possible values to file types is
the following:
+== =============== ============================================
mapping spec.info means
+== =============== ============================================
0 hard link link destination [file header]
1 directory first file's header
2 regular file unused, must be zero [MBZ]
@@ -123,6 +129,7 @@
5 char device - " -
6 socket unused, MBZ
7 fifo unused, MBZ
+== =============== ============================================
Note that hard links are specifically marked in this filesystem, but
they will behave as you can expect (i.e. share the inode number).
@@ -158,24 +165,24 @@
Pending issues:
- Permissions and owner information are pretty essential features of a
-Un*x like system, but romfs does not provide the full possibilities.
-I have never found this limiting, but others might.
+ Un*x like system, but romfs does not provide the full possibilities.
+ I have never found this limiting, but others might.
- The file system is read only, so it can be very small, but in case
-one would want to write _anything_ to a file system, he still needs
-a writable file system, thus negating the size advantages. Possible
-solutions: implement write access as a compile-time option, or a new,
-similarly small writable filesystem for RAM disks.
+ one would want to write _anything_ to a file system, he still needs
+ a writable file system, thus negating the size advantages. Possible
+ solutions: implement write access as a compile-time option, or a new,
+ similarly small writable filesystem for RAM disks.
- Since the files are only required to have alignment on a 16 byte
-boundary, it is currently possibly suboptimal to read or execute files
-from the filesystem. It might be resolved by reordering file data to
-have most of it (i.e. except the start and the end) laying at "natural"
-boundaries, thus it would be possible to directly map a big portion of
-the file contents to the mm subsystem.
+ boundary, it is currently possibly suboptimal to read or execute files
+ from the filesystem. It might be resolved by reordering file data to
+ have most of it (i.e. except the start and the end) laying at "natural"
+ boundaries, thus it would be possible to directly map a big portion of
+ the file contents to the mm subsystem.
- Compression might be an useful feature, but memory is quite a
-limiting factor in my eyes.
+ limiting factor in my eyes.
- Where it is used?
@@ -183,4 +190,5 @@
Have fun,
+
Janos Farkas <chexum@shadow.banki.hu>
diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.rst
similarity index 85%
rename from Documentation/filesystems/seq_file.txt
rename to Documentation/filesystems/seq_file.rst
index 7cf7143..a672608 100644
--- a/Documentation/filesystems/seq_file.txt
+++ b/Documentation/filesystems/seq_file.rst
@@ -1,8 +1,13 @@
-The seq_file interface
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+The seq_file Interface
+======================
Copyright 2003 Jonathan Corbet <corbet@lwn.net>
+
This file is originally from the LWN.net Driver Porting series at
- http://lwn.net/Articles/driver-porting/
+ https://lwn.net/Articles/driver-porting/
There are numerous ways for a device driver (or other kernel component) to
@@ -43,7 +48,7 @@
read, simply produces a set of increasing integer values, one per line. The
sequence will continue until the user loses patience and finds something
better to do. The file is seekable, in that one can do something like the
-following:
+following::
dd if=/proc/sequence of=out1 count=1
dd if=/proc/sequence skip=1 of=out2 count=1
@@ -52,19 +57,21 @@
result. Yes, it is a thoroughly useless module, but the point is to show
how the mechanism works without getting lost in other details. (Those
wanting to see the full source for this module can find it at
-http://lwn.net/Articles/22359/).
+https://lwn.net/Articles/22359/).
Deprecated create_proc_entry
+============================
Note that the above article uses create_proc_entry which was removed in
-kernel 3.10. Current versions require the following update
+kernel 3.10. Current versions require the following update::
-- entry = create_proc_entry("sequence", 0, NULL);
-- if (entry)
-- entry->proc_fops = &ct_file_ops;
-+ entry = proc_create("sequence", 0, NULL, &ct_file_ops);
+ - entry = create_proc_entry("sequence", 0, NULL);
+ - if (entry)
+ - entry->proc_fops = &ct_file_ops;
+ + entry = proc_create("sequence", 0, NULL, &ct_file_ops);
The iterator interface
+======================
Modules implementing a virtual file with seq_file must implement an
iterator object that allows stepping through the data of interest
@@ -99,7 +106,7 @@
the most recent pos used in the previous session.
For our simple sequence example,
-the start() function looks like:
+the start() function looks like::
static void *ct_seq_start(struct seq_file *s, loff_t *pos)
{
@@ -122,14 +129,16 @@
called SEQ_START_TOKEN; it can be used if you wish to instruct your
show() function (described below) to print a header at the top of the
output. SEQ_START_TOKEN should only be used if the offset is zero,
-however.
+however. SEQ_START_TOKEN has no special meaning to the core seq_file
+code. It is provided as a convenience for a start() funciton to
+communicate with the next() and show() functions.
The next function to implement is called, amazingly, next(); its job is to
move the iterator forward to the next position in the sequence. The
example module can simply increment the position by one; more useful
modules will do what is needed to step through some data structure. The
next() function returns a new iterator, or NULL if the sequence is
-complete. Here's the example version:
+complete. Here's the example version::
static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
@@ -138,13 +147,29 @@
return spos;
}
+The next() function should set ``*pos`` to a value that start() can use
+to find the new location in the sequence. When the iterator is being
+stored in the private data area, rather than being reinitialized on each
+start(), it might seem sufficient to simply set ``*pos`` to any non-zero
+value (zero always tells start() to restart the sequence). This is not
+sufficient due to historical problems.
+
+Historically, many next() functions have *not* updated ``*pos`` at
+end-of-file. If the value is then used by start() to initialise the
+iterator, this can result in corner cases where the last entry in the
+sequence is reported twice in the file. In order to discourage this bug
+from being resurrected, the core seq_file code now produces a warning if
+a next() function does not change the value of ``*pos``. Consequently a
+next() function *must* change the value of ``*pos``, and of course must
+set it to a non-zero value.
+
The stop() function closes a session; its job, of course, is to clean
up. If dynamic memory is allocated for the iterator, stop() is the
place to free it; if a lock was taken by start(), stop() must release
-that lock. The value that *pos was set to by the last next() call
+that lock. The value that ``*pos`` was set to by the last next() call
before stop() is remembered, and used for the first start() call of
the next session unless lseek() has been called on the file; in that
-case next start() will be asked to start at position zero.
+case next start() will be asked to start at position zero::
static void ct_seq_stop(struct seq_file *s, void *v)
{
@@ -152,7 +177,7 @@
}
Finally, the show() function should format the object currently pointed to
-by the iterator for output. The example module's show() function is:
+by the iterator for output. The example module's show() function is::
static int ct_seq_show(struct seq_file *s, void *v)
{
@@ -169,7 +194,7 @@
We will look at seq_printf() in a moment. But first, the definition of the
seq_file iterator is finished by creating a seq_operations structure with
-the four functions we have just defined:
+the four functions we have just defined::
static const struct seq_operations ct_seq_ops = {
.start = ct_seq_start,
@@ -200,6 +225,7 @@
Formatted output
+================
The seq_file code manages positioning within the output created by the
iterator and getting it into the user's buffer. But, for that to work, that
@@ -209,7 +235,7 @@
Most code will simply use seq_printf(), which works pretty much like
printk(), but which requires the seq_file pointer as an argument.
-For straight character output, the following functions may be used:
+For straight character output, the following functions may be used::
seq_putc(struct seq_file *m, char c);
seq_puts(struct seq_file *m, const char *s);
@@ -219,7 +245,7 @@
expect. seq_escape() is like seq_puts(), except that any character in s
which is in the string esc will be represented in octal form in the output.
-There are also a pair of functions for printing filenames:
+There are also a pair of functions for printing filenames::
int seq_path(struct seq_file *m, const struct path *path,
const char *esc);
@@ -232,8 +258,10 @@
root is desired, it can be used with seq_path_root(). If it turns out that
path cannot be reached from root, seq_path_root() returns SEQ_SKIP.
-A function producing complicated output may want to check
+A function producing complicated output may want to check::
+
bool seq_has_overflowed(struct seq_file *m);
+
and avoid further seq_<output> calls if true is returned.
A true return from seq_has_overflowed means that the seq_file buffer will
@@ -242,6 +270,7 @@
Making it all work
+==================
So far, we have a nice set of functions which can produce output within the
seq_file system, but we have not yet turned them into a file that a user
@@ -250,7 +279,7 @@
file. The seq_file interface provides a set of canned operations which do
most of the work. The virtual file author still must implement the open()
method, however, to hook everything up. The open function is often a single
-line, as in the example module:
+line, as in the example module::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -269,7 +298,7 @@
There is also a wrapper function to seq_open() called seq_open_private(). It
kmallocs a zero filled block of memory and stores a pointer to it in the
private field of the seq_file structure, returning 0 on success. The
-block size is specified in a third parameter to the function, e.g.:
+block size is specified in a third parameter to the function, e.g.::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -279,7 +308,7 @@
There is also a variant function, __seq_open_private(), which is functionally
identical except that, if successful, it returns the pointer to the allocated
-memory block, allowing further initialisation e.g.:
+memory block, allowing further initialisation e.g.::
static int ct_open(struct inode *inode, struct file *file)
{
@@ -301,7 +330,7 @@
The other operations of interest - read(), llseek(), and release() - are
all implemented by the seq_file code itself. So a virtual file's
-file_operations structure will look like:
+file_operations structure will look like::
static const struct file_operations ct_file_ops = {
.owner = THIS_MODULE,
@@ -315,7 +344,7 @@
seq_file private field to kfree() before releasing the structure.
The final step is the creation of the /proc file itself. In the example
-code, that is done in the initialization code in the usual way:
+code, that is done in the initialization code in the usual way::
static int ct_init(void)
{
@@ -331,9 +360,10 @@
seq_list
+========
If your file will be iterating through a linked list, you may find these
-routines useful:
+routines useful::
struct list_head *seq_list_start(struct list_head *head,
loff_t pos);
@@ -344,15 +374,16 @@
These helpers will interpret pos as a position within the list and iterate
accordingly. Your start() and next() functions need only invoke the
-seq_list_* helpers with a pointer to the appropriate list_head structure.
+``seq_list_*`` helpers with a pointer to the appropriate list_head structure.
The extra-simple version
+========================
For extremely simple virtual files, there is an even easier interface. A
module can define only the show() function, which should create all the
output that the virtual file will contain. The file's open() method then
-calls:
+calls::
int single_open(struct file *file,
int (*show)(struct seq_file *m, void *p),
diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.rst
similarity index 72%
rename from Documentation/filesystems/sharedsubtree.txt
rename to Documentation/filesystems/sharedsubtree.rst
index 8ccfbd5..d833953 100644
--- a/Documentation/filesystems/sharedsubtree.txt
+++ b/Documentation/filesystems/sharedsubtree.rst
@@ -1,7 +1,10 @@
-Shared Subtrees
----------------
+.. SPDX-License-Identifier: GPL-2.0
-Contents:
+===============
+Shared Subtrees
+===============
+
+.. Contents:
1) Overview
2) Features
3) Setting mount states
@@ -41,31 +44,38 @@
Here is an example:
- Let's say /mnt has a mount that is shared.
- mount --make-shared /mnt
+ Let's say /mnt has a mount that is shared::
+
+ mount --make-shared /mnt
Note: mount(8) command now supports the --make-shared flag,
so the sample 'smount' program is no longer needed and has been
removed.
- # mount --bind /mnt /tmp
+ ::
+
+ # mount --bind /mnt /tmp
+
The above command replicates the mount at /mnt to the mountpoint /tmp
and the contents of both the mounts remain identical.
- #ls /mnt
- a b c
+ ::
- #ls /tmp
- a b c
+ #ls /mnt
+ a b c
- Now let's say we mount a device at /tmp/a
- # mount /dev/sd0 /tmp/a
+ #ls /tmp
+ a b c
- #ls /tmp/a
- t1 t2 t3
+ Now let's say we mount a device at /tmp/a::
- #ls /mnt/a
- t1 t2 t3
+ # mount /dev/sd0 /tmp/a
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ #ls /mnt/a
+ t1 t2 t3
Note that the mount has propagated to the mount at /mnt as well.
@@ -123,14 +133,15 @@
2d) A unbindable mount is a unbindable private mount
- let's say we have a mount at /mnt and we make it unbindable
+ let's say we have a mount at /mnt and we make it unbindable::
- # mount --make-unbindable /mnt
+ # mount --make-unbindable /mnt
- Let's try to bind mount this mount somewhere else.
- # mount --bind /mnt /tmp
- mount: wrong fs type, bad option, bad superblock on /mnt,
- or too many mounted file systems
+ Let's try to bind mount this mount somewhere else::
+
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
Binding a unbindable mount is a invalid operation.
@@ -138,12 +149,12 @@
3) Setting mount states
The mount command (util-linux package) can be used to set mount
- states:
+ states::
- mount --make-shared mountpoint
- mount --make-slave mountpoint
- mount --make-private mountpoint
- mount --make-unbindable mountpoint
+ mount --make-shared mountpoint
+ mount --make-slave mountpoint
+ mount --make-private mountpoint
+ mount --make-unbindable mountpoint
4) Use cases
@@ -154,9 +165,10 @@
Solution:
- The system administrator can make the mount at /cdrom shared
- mount --bind /cdrom /cdrom
- mount --make-shared /cdrom
+ The system administrator can make the mount at /cdrom shared::
+
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
Now any process that clones off a new namespace will have a
mount at /cdrom which is a replica of the same mount in the
@@ -172,14 +184,14 @@
Solution:
To begin with, the administrator can mark the entire mount tree
- as shareable.
+ as shareable::
- mount --make-rshared /
+ mount --make-rshared /
A new process can clone off a new namespace. And mark some part
- of its namespace as slave
+ of its namespace as slave::
- mount --make-rslave /myprivatetree
+ mount --make-rslave /myprivatetree
Hence forth any mounts within the /myprivatetree done by the
process will not show up in any other namespace. However mounts
@@ -206,13 +218,13 @@
versions of the file depending on the path used to access that
file.
- An example is:
+ An example is::
- mount --make-shared /
- mount --rbind / /view/v1
- mount --rbind / /view/v2
- mount --rbind / /view/v3
- mount --rbind / /view/v4
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
and if /usr has a versioning filesystem mounted, then that
mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
@@ -224,8 +236,8 @@
filesystem is being requested and return the corresponding
inode.
-5) Detailed semantics:
--------------------
+5) Detailed semantics
+---------------------
The section below explains the detailed semantics of
bind, rbind, move, mount, umount and clone-namespace operations.
@@ -235,6 +247,7 @@
5a) Mount states
A given mount can be in one of the following states
+
1) shared
2) slave
3) shared and slave
@@ -252,7 +265,8 @@
A 'shared mount' is defined as a vfsmount that belongs to a
'peer group'.
- For example:
+ For example::
+
mount --make-shared /mnt
mount --bind /mnt /tmp
@@ -270,7 +284,7 @@
A slave mount as the name implies has a master mount from which
mount/unmount events are received. Events do not propagate from
the slave mount to the master. Only a shared mount can be made
- a slave by executing the following command
+ a slave by executing the following command::
mount --make-slave mount
@@ -290,8 +304,10 @@
peer group.
Only a slave vfsmount can be made as 'shared and slave' by
- either executing the following command
+ either executing the following command::
+
mount --make-shared mount
+
or by moving the slave vfsmount under a shared vfsmount.
(4) Private mount
@@ -307,30 +323,32 @@
State diagram:
+
The state diagram below explains the state transition of a mount,
- in response to various commands.
- ------------------------------------------------------------------------
- | |make-shared | make-slave | make-private |make-unbindab|
- --------------|------------|--------------|--------------|-------------|
- |shared |shared |*slave/private| private | unbindable |
- | | | | | |
- |-------------|------------|--------------|--------------|-------------|
- |slave |shared | **slave | private | unbindable |
- | |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |shared |shared | slave | private | unbindable |
- |and slave |and slave | | | |
- |-------------|------------|--------------|--------------|-------------|
- |private |shared | **private | private | unbindable |
- |-------------|------------|--------------|--------------|-------------|
- |unbindable |shared |**unbindable | private | unbindable |
- ------------------------------------------------------------------------
+ in response to various commands::
- * if the shared mount is the only mount in its peer group, making it
- slave, makes it private automatically. Note that there is no master to
- which it can be slaved to.
+ -----------------------------------------------------------------------
+ | |make-shared | make-slave | make-private |make-unbindab|
+ --------------|------------|--------------|--------------|-------------|
+ |shared |shared |*slave/private| private | unbindable |
+ | | | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |slave |shared | **slave | private | unbindable |
+ | |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |shared |shared | slave | private | unbindable |
+ |and slave |and slave | | | |
+ |-------------|------------|--------------|--------------|-------------|
+ |private |shared | **private | private | unbindable |
+ |-------------|------------|--------------|--------------|-------------|
+ |unbindable |shared |**unbindable | private | unbindable |
+ ------------------------------------------------------------------------
- ** slaving a non-shared mount has no effect on the mount.
+ * if the shared mount is the only mount in its peer group, making it
+ slave, makes it private automatically. Note that there is no master to
+ which it can be slaved to.
+
+ ** slaving a non-shared mount has no effect on the mount.
Apart from the commands listed below, the 'move' operation also changes
the state of a mount depending on type of the destination mount. Its
@@ -338,31 +356,32 @@
5b) Bind semantics
- Consider the following command
+ Consider the following command::
- mount --bind A/a B/b
+ mount --bind A/a B/b
where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
is the destination mount and 'b' is the dentry in the destination mount.
The outcome depends on the type of mount of 'A' and 'B'. The table
- below contains quick reference.
- ---------------------------------------------------------------------------
- | BIND MOUNT OPERATION |
- |**************************************************************************
- |source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared | shared & slave | invalid |
- | | | | | |
- |non-shared| shared | private | slave | invalid |
- ***************************************************************************
+ below contains quick reference::
+
+ --------------------------------------------------------------------------
+ | BIND MOUNT OPERATION |
+ |************************************************************************|
+ |source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |************************************************************************|
+ | shared | shared | shared | shared & slave | invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | invalid |
+ **************************************************************************
Details:
- 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+ 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
@@ -371,7 +390,7 @@
'B'. And finally the peer-group of 'C' is merged with the peer group
of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+ 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
are created and mounted at the dentry 'b' on all mounts where 'B'
@@ -379,7 +398,7 @@
'C', 'C1', .., 'Cn' with exactly the same configuration as the
propagation tree for 'B'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
'C3' ... are created and mounted at the dentry 'b' on all mounts where
@@ -389,19 +408,19 @@
is made the slave of mount 'Z'. In other words, mount 'C' is in the
state 'slave and shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
+ 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
invalid operation.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. A new mount 'C' which is clone of 'A', is created.
Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
peer-group of 'A'.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
new mount 'C' which is a clone of 'A' is created. Its root dentry is
'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
@@ -409,7 +428,7 @@
mount/unmount on 'A' do not propagate anywhere else. Similarly
mount/unmount on 'C' do not propagate anywhere else.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
+ 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
invalid operation. A unbindable mount cannot be bind mounted.
5c) Rbind semantics
@@ -422,7 +441,9 @@
then the subtree under the unbindable mount is pruned in the new
location.
- eg: let's say we have the following mount tree.
+ eg:
+
+ let's say we have the following mount tree::
A
/ \
@@ -430,12 +451,12 @@
/ \ / \
D E F G
- Let's say all the mount except the mount C in the tree are
- of a type other than unbindable.
+ Let's say all the mount except the mount C in the tree are
+ of a type other than unbindable.
- If this tree is rbound to say Z
+ If this tree is rbound to say Z
- We will have the following tree at the new location.
+ We will have the following tree at the new location::
Z
|
@@ -457,24 +478,26 @@
the dentry in the destination mount.
The outcome depends on the type of the mount of 'A' and 'B'. The table
- below is a quick reference.
- ---------------------------------------------------------------------------
- | MOVE MOUNT OPERATION |
- |**************************************************************************
- | source(A)->| shared | private | slave | unbindable |
- | dest(B) | | | | |
- | | | | | | |
- | v | | | | |
- |**************************************************************************
- | shared | shared | shared |shared and slave| invalid |
- | | | | | |
- |non-shared| shared | private | slave | unbindable |
- ***************************************************************************
- NOTE: moving a mount residing under a shared mount is invalid.
+ below is a quick reference::
+
+ ---------------------------------------------------------------------------
+ | MOVE MOUNT OPERATION |
+ |**************************************************************************
+ | source(A)->| shared | private | slave | unbindable |
+ | dest(B) | | | | |
+ | | | | | | |
+ | v | | | | |
+ |**************************************************************************
+ | shared | shared | shared |shared and slave| invalid |
+ | | | | | |
+ |non-shared| shared | private | slave | unbindable |
+ ***************************************************************************
+
+ .. Note:: moving a mount residing under a shared mount is invalid.
Details follow:
- 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. A new propagation tree is created in the
@@ -483,7 +506,7 @@
propagation tree is appended to the already existing propagation tree
of 'A'.
- 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
are created and mounted at dentry 'b' on all mounts that receive
propagation from mount 'B'. The mount 'A' becomes a shared mount and a
@@ -491,7 +514,7 @@
'B'. This new propagation tree contains all the new mounts 'A1',
'A2'... 'An'.
- 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
receive propagation from mount 'B'. A new propagation tree is created
@@ -501,32 +524,32 @@
'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
becomes 'shared'.
- 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
+ 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
is invalid. Because mounting anything on the shared mount 'B' can
create new mounts that get mounted on the mounts that receive
propagation from 'B'. And since the mount 'A' is unbindable, cloning
it to mount at other mountpoints is not possible.
- 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
+ 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
- 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
+ 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
shared mount.
- 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
continues to be a slave mount of mount 'Z'.
- 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
+ 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
unbindable mount.
5e) Mount semantics
- Consider the following command
+ Consider the following command::
- mount device B/b
+ mount device B/b
'B' is the destination mount and 'b' is the dentry in the destination
mount.
@@ -537,9 +560,9 @@
5f) Unmount semantics
- Consider the following command
+ Consider the following command::
- umount A
+ umount A
where 'A' is a mount mounted on mount 'B' at dentry 'b'.
@@ -592,10 +615,12 @@
A. What is the result of the following command sequence?
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mount --bind /mnt /tmp
- mount --move /tmp /mnt/1
+ ::
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
Should they all be identical? or should /mnt and /mnt/1 be
@@ -604,23 +629,27 @@
B. What is the result of the following command sequence?
- mount --make-rshared /
- mkdir -p /v/1
- mount --rbind / /v/1
+ ::
+
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
what should be the content of /v/1/v/1 be?
C. What is the result of the following command sequence?
- mount --bind /mnt /mnt
- mount --make-shared /mnt
- mkdir -p /mnt/1/2/3 /mnt/1/test
- mount --bind /mnt/1 /tmp
- mount --make-slave /mnt
- mount --make-shared /mnt
- mount --bind /mnt/1/2 /tmp1
- mount --make-slave /mnt
+ ::
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
At this point we have the first mount at /tmp and
its root dentry is 1. Let's call this mount 'A'
@@ -668,7 +697,8 @@
step 1:
let's say the root tree has just two directories with
- one vfsmount.
+ one vfsmount::
+
root
/ \
tmp usr
@@ -676,14 +706,17 @@
And we want to replicate the tree at multiple
mountpoints under /root/tmp
- step2:
- mount --make-shared /root
+ step 2:
+ ::
- mkdir -p /tmp/m1
- mount --rbind /root /tmp/m1
+ mount --make-shared /root
- the new tree now looks like this:
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this::
root
/ \
@@ -697,11 +730,13 @@
it has two vfsmounts
- step3:
+ step 3:
+ ::
+
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -724,6 +759,7 @@
it has 6 vfsmounts
step 4:
+ ::
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
@@ -740,7 +776,8 @@
step 1:
let's say the root tree has just two directories with
- one vfsmount.
+ one vfsmount::
+
root
/ \
tmp usr
@@ -748,17 +785,20 @@
How do we set up the same tree at multiple locations under
/root/tmp
- step2:
- mount --bind /root/tmp /root/tmp
+ step 2:
+ ::
- mount --make-rshared /root
- mount --make-unbindable /root/tmp
- mkdir -p /tmp/m1
+ mount --bind /root/tmp /root/tmp
- mount --rbind /root /tmp/m1
+ mount --make-rshared /root
+ mount --make-unbindable /root/tmp
- the new tree now looks like this:
+ mkdir -p /tmp/m1
+
+ mount --rbind /root /tmp/m1
+
+ the new tree now looks like this::
root
/ \
@@ -768,11 +808,13 @@
/ \
tmp usr
- step3:
+ step 3:
+ ::
+
mkdir -p /tmp/m2
mount --rbind /root /tmp/m2
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -782,12 +824,13 @@
/ \ / \
tmp usr tmp usr
- step4:
+ step 4:
+ ::
mkdir -p /tmp/m3
mount --rbind /root /tmp/m3
- the new tree now looks like this:
+ the new tree now looks like this::
root
/ \
@@ -801,25 +844,31 @@
8A) Datastructure
- 4 new fields are introduced to struct vfsmount
- ->mnt_share
- ->mnt_slave_list
- ->mnt_slave
- ->mnt_master
+ 4 new fields are introduced to struct vfsmount:
- ->mnt_share links together all the mount to/from which this vfsmount
+ * ->mnt_share
+ * ->mnt_slave_list
+ * ->mnt_slave
+ * ->mnt_master
+
+ ->mnt_share
+ links together all the mount to/from which this vfsmount
send/receives propagation events.
- ->mnt_slave_list links all the mounts to which this vfsmount propagates
+ ->mnt_slave_list
+ links all the mounts to which this vfsmount propagates
to.
- ->mnt_slave links together all the slaves that its master vfsmount
+ ->mnt_slave
+ links together all the slaves that its master vfsmount
propagates to.
- ->mnt_master points to the master vfsmount from which this vfsmount
+ ->mnt_master
+ points to the master vfsmount from which this vfsmount
receives propagation.
- ->mnt_flags takes two more flags to indicate the propagation status of
+ ->mnt_flags
+ takes two more flags to indicate the propagation status of
the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
replicated.
@@ -842,7 +891,7 @@
A example propagation tree looks as shown in the figure below.
[ NOTE: Though it looks like a forest, if we consider all the shared
- mounts as a conceptual entity called 'pnode', it becomes a tree]
+ mounts as a conceptual entity called 'pnode', it becomes a tree]::
A <--> B <--> C <---> D
@@ -864,14 +913,19 @@
A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
E's ->mnt_share links with ->mnt_share of K
- 'E', 'K', 'F', 'G' have their ->mnt_master point to struct
- vfsmount of 'A'
+
+ 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
+
'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
+
K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
+
J and K's ->mnt_master points to struct vfsmount of C
+
and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
+
'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
@@ -903,6 +957,7 @@
Prepare phase:
for each mount in the source tree:
+
a) Create the necessary number of mount trees to
be attached to each of the mounts that receive
propagation from the destination mount.
@@ -929,11 +984,12 @@
Abort phase
delete all the newly created trees.
- NOTE: all the propagation related functionality resides in the file
- pnode.c
+ .. Note::
+ all the propagation related functionality resides in the file pnode.c
------------------------------------------------------------------------
version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
+
version 0.2 (Incorporated comments from Al Viro)
diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt
deleted file mode 100644
index eb9e3aa..0000000
--- a/Documentation/filesystems/spufs.txt
+++ /dev/null
@@ -1,521 +0,0 @@
-SPUFS(2) Linux Programmer's Manual SPUFS(2)
-
-
-
-NAME
- spufs - the SPU file system
-
-
-DESCRIPTION
- The SPU file system is used on PowerPC machines that implement the Cell
- Broadband Engine Architecture in order to access Synergistic Processor
- Units (SPUs).
-
- The file system provides a name space similar to posix shared memory or
- message queues. Users that have write permissions on the file system
- can use spu_create(2) to establish SPU contexts in the spufs root.
-
- Every SPU context is represented by a directory containing a predefined
- set of files. These files can be used for manipulating the state of the
- logical SPU. Users can change permissions on those files, but not actu-
- ally add or remove files.
-
-
-MOUNT OPTIONS
- uid=<uid>
- set the user owning the mount point, the default is 0 (root).
-
- gid=<gid>
- set the group owning the mount point, the default is 0 (root).
-
-
-FILES
- The files in spufs mostly follow the standard behavior for regular sys-
- tem calls like read(2) or write(2), but often support only a subset of
- the operations supported on regular file systems. This list details the
- supported operations and the deviations from the behaviour in the
- respective man pages.
-
- All files that support the read(2) operation also support readv(2) and
- all files that support the write(2) operation also support writev(2).
- All files support the access(2) and stat(2) family of operations, but
- only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
- contain reliable information.
-
- All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
- tions, but will not be able to grant permissions that contradict the
- possible operations, e.g. read access on the wbox file.
-
- The current set of files is:
-
-
- /mem
- the contents of the local storage memory of the SPU. This can be
- accessed like a regular shared memory file and contains both code and
- data in the address space of the SPU. The possible operations on an
- open mem file are:
-
- read(2), pread(2), write(2), pwrite(2), lseek(2)
- These operate as documented, with the exception that seek(2),
- write(2) and pwrite(2) are not supported beyond the end of the
- file. The file size is the size of the local storage of the SPU,
- which normally is 256 kilobytes.
-
- mmap(2)
- Mapping mem into the process address space gives access to the
- SPU local storage within the process address space. Only
- MAP_SHARED mappings are allowed.
-
-
- /mbox
- The first SPU to CPU communication mailbox. This file is read-only and
- can be read in units of 32 bits. The file can only be used in non-
- blocking mode and it even poll() will not block on it. The possible
- operations on an open mbox file are:
-
- read(2)
- If a count smaller than four is requested, read returns -1 and
- sets errno to EINVAL. If there is no data available in the mail
- box, the return value is set to -1 and errno becomes EAGAIN.
- When data has been read successfully, four bytes are placed in
- the data buffer and the value four is returned.
-
-
- /ibox
- The second SPU to CPU communication mailbox. This file is similar to
- the first mailbox file, but can be read in blocking I/O mode, and the
- poll family of system calls can be used to wait for it. The possible
- operations on an open ibox file are:
-
- read(2)
- If a count smaller than four is requested, read returns -1 and
- sets errno to EINVAL. If there is no data available in the mail
- box and the file descriptor has been opened with O_NONBLOCK, the
- return value is set to -1 and errno becomes EAGAIN.
-
- If there is no data available in the mail box and the file
- descriptor has been opened without O_NONBLOCK, the call will
- block until the SPU writes to its interrupt mailbox channel.
- When data has been read successfully, four bytes are placed in
- the data buffer and the value four is returned.
-
- poll(2)
- Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
- data is available for reading.
-
-
- /wbox
- The CPU to SPU communation mailbox. It is write-only and can be written
- in units of 32 bits. If the mailbox is full, write() will block and
- poll can be used to wait for it becoming empty again. The possible
- operations on an open wbox file are: write(2) If a count smaller than
- four is requested, write returns -1 and sets errno to EINVAL. If there
- is no space available in the mail box and the file descriptor has been
- opened with O_NONBLOCK, the return value is set to -1 and errno becomes
- EAGAIN.
-
- If there is no space available in the mail box and the file descriptor
- has been opened without O_NONBLOCK, the call will block until the SPU
- reads from its PPE mailbox channel. When data has been read success-
- fully, four bytes are placed in the data buffer and the value four is
- returned.
-
- poll(2)
- Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever
- space is available for writing.
-
-
- /mbox_stat
- /ibox_stat
- /wbox_stat
- Read-only files that contain the length of the current queue, i.e. how
- many words can be read from mbox or ibox or how many words can be
- written to wbox without blocking. The files can be read only in 4-byte
- units and return a big-endian binary integer number. The possible
- operations on an open *box_stat file are:
-
- read(2)
- If a count smaller than four is requested, read returns -1 and
- sets errno to EINVAL. Otherwise, a four byte value is placed in
- the data buffer, containing the number of elements that can be
- read from (for mbox_stat and ibox_stat) or written to (for
- wbox_stat) the respective mail box without blocking or resulting
- in EAGAIN.
-
-
- /npc
- /decr
- /decr_status
- /spu_tag_mask
- /event_mask
- /srr0
- Internal registers of the SPU. The representation is an ASCII string
- with the numeric value of the next instruction to be executed. These
- can be used in read/write mode for debugging, but normal operation of
- programs should not rely on them because access to any of them except
- npc requires an SPU context save and is therefore very inefficient.
-
- The contents of these files are:
-
- npc Next Program Counter
-
- decr SPU Decrementer
-
- decr_status Decrementer Status
-
- spu_tag_mask MFC tag mask for SPU DMA
-
- event_mask Event mask for SPU interrupts
-
- srr0 Interrupt Return address register
-
-
- The possible operations on an open npc, decr, decr_status,
- spu_tag_mask, event_mask or srr0 file are:
-
- read(2)
- When the count supplied to the read call is shorter than the
- required length for the pointer value plus a newline character,
- subsequent reads from the same file descriptor will result in
- completing the string, regardless of changes to the register by
- a running SPU task. When a complete string has been read, all
- subsequent read operations will return zero bytes and a new file
- descriptor needs to be opened to read the value again.
-
- write(2)
- A write operation on the file results in setting the register to
- the value given in the string. The string is parsed from the
- beginning to the first non-numeric character or the end of the
- buffer. Subsequent writes to the same file descriptor overwrite
- the previous setting.
-
-
- /fpcr
- This file gives access to the Floating Point Status and Control Regis-
- ter as a four byte long file. The operations on the fpcr file are:
-
- read(2)
- If a count smaller than four is requested, read returns -1 and
- sets errno to EINVAL. Otherwise, a four byte value is placed in
- the data buffer, containing the current value of the fpcr regis-
- ter.
-
- write(2)
- If a count smaller than four is requested, write returns -1 and
- sets errno to EINVAL. Otherwise, a four byte value is copied
- from the data buffer, updating the value of the fpcr register.
-
-
- /signal1
- /signal2
- The two signal notification channels of an SPU. These are read-write
- files that operate on a 32 bit word. Writing to one of these files
- triggers an interrupt on the SPU. The value written to the signal
- files can be read from the SPU through a channel read or from host user
- space through the file. After the value has been read by the SPU, it
- is reset to zero. The possible operations on an open signal1 or sig-
- nal2 file are:
-
- read(2)
- If a count smaller than four is requested, read returns -1 and
- sets errno to EINVAL. Otherwise, a four byte value is placed in
- the data buffer, containing the current value of the specified
- signal notification register.
-
- write(2)
- If a count smaller than four is requested, write returns -1 and
- sets errno to EINVAL. Otherwise, a four byte value is copied
- from the data buffer, updating the value of the specified signal
- notification register. The signal notification register will
- either be replaced with the input data or will be updated to the
- bitwise OR or the old value and the input data, depending on the
- contents of the signal1_type, or signal2_type respectively,
- file.
-
-
- /signal1_type
- /signal2_type
- These two files change the behavior of the signal1 and signal2 notifi-
- cation files. The contain a numerical ASCII string which is read as
- either "1" or "0". In mode 0 (overwrite), the hardware replaces the
- contents of the signal channel with the data that is written to it. in
- mode 1 (logical OR), the hardware accumulates the bits that are subse-
- quently written to it. The possible operations on an open signal1_type
- or signal2_type file are:
-
- read(2)
- When the count supplied to the read call is shorter than the
- required length for the digit plus a newline character, subse-
- quent reads from the same file descriptor will result in com-
- pleting the string. When a complete string has been read, all
- subsequent read operations will return zero bytes and a new file
- descriptor needs to be opened to read the value again.
-
- write(2)
- A write operation on the file results in setting the register to
- the value given in the string. The string is parsed from the
- beginning to the first non-numeric character or the end of the
- buffer. Subsequent writes to the same file descriptor overwrite
- the previous setting.
-
-
-EXAMPLES
- /etc/fstab entry
- none /spu spufs gid=spu 0 0
-
-
-AUTHORS
- Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
- Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
-
-SEE ALSO
- capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPUFS(2)
-
-------------------------------------------------------------------------------
-
-SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
-
-
-
-NAME
- spu_run - execute an spu context
-
-
-SYNOPSIS
- #include <sys/spu.h>
-
- int spu_run(int fd, unsigned int *npc, unsigned int *event);
-
-DESCRIPTION
- The spu_run system call is used on PowerPC machines that implement the
- Cell Broadband Engine Architecture in order to access Synergistic Pro-
- cessor Units (SPUs). It uses the fd that was returned from spu_cre-
- ate(2) to address a specific SPU context. When the context gets sched-
- uled to a physical SPU, it starts execution at the instruction pointer
- passed in npc.
-
- Execution of SPU code happens synchronously, meaning that spu_run does
- not return while the SPU is still running. If there is a need to exe-
- cute SPU code in parallel with other code on either the main CPU or
- other SPUs, you need to create a new thread of execution first, e.g.
- using the pthread_create(3) call.
-
- When spu_run returns, the current value of the SPU instruction pointer
- is written back to npc, so you can call spu_run again without updating
- the pointers.
-
- event can be a NULL pointer or point to an extended status code that
- gets filled when spu_run returns. It can be one of the following con-
- stants:
-
- SPE_EVENT_DMA_ALIGNMENT
- A DMA alignment error
-
- SPE_EVENT_SPE_DATA_SEGMENT
- A DMA segmentation error
-
- SPE_EVENT_SPE_DATA_STORAGE
- A DMA storage error
-
- If NULL is passed as the event argument, these errors will result in a
- signal delivered to the calling process.
-
-RETURN VALUE
- spu_run returns the value of the spu_status register or -1 to indicate
- an error and set errno to one of the error codes listed below. The
- spu_status register value contains a bit mask of status codes and
- optionally a 14 bit code returned from the stop-and-signal instruction
- on the SPU. The bit masks for the status codes are:
-
- 0x02 SPU was stopped by stop-and-signal.
-
- 0x04 SPU was stopped by halt.
-
- 0x08 SPU is waiting for a channel.
-
- 0x10 SPU is in single-step mode.
-
- 0x20 SPU has tried to execute an invalid instruction.
-
- 0x40 SPU has tried to access an invalid channel.
-
- 0x3fff0000
- The bits masked with this value contain the code returned from
- stop-and-signal.
-
- There are always one or more of the lower eight bits set or an error
- code is returned from spu_run.
-
-ERRORS
- EAGAIN or EWOULDBLOCK
- fd is in non-blocking mode and spu_run would block.
-
- EBADF fd is not a valid file descriptor.
-
- EFAULT npc is not a valid pointer or status is neither NULL nor a valid
- pointer.
-
- EINTR A signal occurred while spu_run was in progress. The npc value
- has been updated to the new program counter value if necessary.
-
- EINVAL fd is not a file descriptor returned from spu_create(2).
-
- ENOMEM Insufficient memory was available to handle a page fault result-
- ing from an MFC direct memory access.
-
- ENOSYS the functionality is not provided by the current system, because
- either the hardware does not provide SPUs or the spufs module is
- not loaded.
-
-
-NOTES
- spu_run is meant to be used from libraries that implement a more
- abstract interface to SPUs, not to be used from regular applications.
- See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
- ommended libraries.
-
-
-CONFORMING TO
- This call is Linux specific and only implemented by the ppc64 architec-
- ture. Programs using this system call are not portable.
-
-
-BUGS
- The code does not yet fully implement all features lined out here.
-
-
-AUTHOR
- Arnd Bergmann <arndb@de.ibm.com>
-
-SEE ALSO
- capabilities(7), close(2), spu_create(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPU_RUN(2)
-
-------------------------------------------------------------------------------
-
-SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
-
-
-
-NAME
- spu_create - create a new spu context
-
-
-SYNOPSIS
- #include <sys/types.h>
- #include <sys/spu.h>
-
- int spu_create(const char *pathname, int flags, mode_t mode);
-
-DESCRIPTION
- The spu_create system call is used on PowerPC machines that implement
- the Cell Broadband Engine Architecture in order to access Synergistic
- Processor Units (SPUs). It creates a new logical context for an SPU in
- pathname and returns a handle to associated with it. pathname must
- point to a non-existing directory in the mount point of the SPU file
- system (spufs). When spu_create is successful, a directory gets cre-
- ated on pathname and it is populated with files.
-
- The returned file handle can only be passed to spu_run(2) or closed,
- other operations are not defined on it. When it is closed, all associ-
- ated directory entries in spufs are removed. When the last file handle
- pointing either inside of the context directory or to this file
- descriptor is closed, the logical SPU context is destroyed.
-
- The parameter flags can be zero or any bitwise or'd combination of the
- following constants:
-
- SPU_RAWIO
- Allow mapping of some of the hardware registers of the SPU into
- user space. This flag requires the CAP_SYS_RAWIO capability, see
- capabilities(7).
-
- The mode parameter specifies the permissions used for creating the new
- directory in spufs. mode is modified with the user's umask(2) value
- and then used for both the directory and the files contained in it. The
- file permissions mask out some more bits of mode because they typically
- support only read or write access. See stat(2) for a full list of the
- possible mode values.
-
-
-RETURN VALUE
- spu_create returns a new file descriptor. It may return -1 to indicate
- an error condition and set errno to one of the error codes listed
- below.
-
-
-ERRORS
- EACCES
- The current user does not have write access on the spufs mount
- point.
-
- EEXIST An SPU context already exists at the given path name.
-
- EFAULT pathname is not a valid string pointer in the current address
- space.
-
- EINVAL pathname is not a directory in the spufs mount point.
-
- ELOOP Too many symlinks were found while resolving pathname.
-
- EMFILE The process has reached its maximum open file limit.
-
- ENAMETOOLONG
- pathname was too long.
-
- ENFILE The system has reached the global open file limit.
-
- ENOENT Part of pathname could not be resolved.
-
- ENOMEM The kernel could not allocate all resources required.
-
- ENOSPC There are not enough SPU resources available to create a new
- context or the user specific limit for the number of SPU con-
- texts has been reached.
-
- ENOSYS the functionality is not provided by the current system, because
- either the hardware does not provide SPUs or the spufs module is
- not loaded.
-
- ENOTDIR
- A part of pathname is not a directory.
-
-
-
-NOTES
- spu_create is meant to be used from libraries that implement a more
- abstract interface to SPUs, not to be used from regular applications.
- See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
- ommended libraries.
-
-
-FILES
- pathname must point to a location beneath the mount point of spufs. By
- convention, it gets mounted in /spu.
-
-
-CONFORMING TO
- This call is Linux specific and only implemented by the ppc64 architec-
- ture. Programs using this system call are not portable.
-
-
-BUGS
- The code does not yet fully implement all features lined out here.
-
-
-AUTHOR
- Arnd Bergmann <arndb@de.ibm.com>
-
-SEE ALSO
- capabilities(7), close(2), spu_run(2), spufs(7)
-
-
-
-Linux 2005-09-28 SPU_CREATE(2)
diff --git a/Documentation/filesystems/spufs/index.rst b/Documentation/filesystems/spufs/index.rst
new file mode 100644
index 0000000..5ed4a84
--- /dev/null
+++ b/Documentation/filesystems/spufs/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+SPU Filesystem
+==============
+
+
+.. toctree::
+ :maxdepth: 1
+
+ spufs
+ spu_create
+ spu_run
diff --git a/Documentation/filesystems/spufs/spu_create.rst b/Documentation/filesystems/spufs/spu_create.rst
new file mode 100644
index 0000000..83108c0
--- /dev/null
+++ b/Documentation/filesystems/spufs/spu_create.rst
@@ -0,0 +1,131 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+spu_create
+==========
+
+Name
+====
+ spu_create - create a new spu context
+
+
+Synopsis
+========
+
+ ::
+
+ #include <sys/types.h>
+ #include <sys/spu.h>
+
+ int spu_create(const char *pathname, int flags, mode_t mode);
+
+Description
+===========
+ The spu_create system call is used on PowerPC machines that implement
+ the Cell Broadband Engine Architecture in order to access Synergistic
+ Processor Units (SPUs). It creates a new logical context for an SPU in
+ pathname and returns a handle to associated with it. pathname must
+ point to a non-existing directory in the mount point of the SPU file
+ system (spufs). When spu_create is successful, a directory gets cre-
+ ated on pathname and it is populated with files.
+
+ The returned file handle can only be passed to spu_run(2) or closed,
+ other operations are not defined on it. When it is closed, all associ-
+ ated directory entries in spufs are removed. When the last file handle
+ pointing either inside of the context directory or to this file
+ descriptor is closed, the logical SPU context is destroyed.
+
+ The parameter flags can be zero or any bitwise or'd combination of the
+ following constants:
+
+ SPU_RAWIO
+ Allow mapping of some of the hardware registers of the SPU into
+ user space. This flag requires the CAP_SYS_RAWIO capability, see
+ capabilities(7).
+
+ The mode parameter specifies the permissions used for creating the new
+ directory in spufs. mode is modified with the user's umask(2) value
+ and then used for both the directory and the files contained in it. The
+ file permissions mask out some more bits of mode because they typically
+ support only read or write access. See stat(2) for a full list of the
+ possible mode values.
+
+
+Return Value
+============
+ spu_create returns a new file descriptor. It may return -1 to indicate
+ an error condition and set errno to one of the error codes listed
+ below.
+
+
+Errors
+======
+ EACCES
+ The current user does not have write access on the spufs mount
+ point.
+
+ EEXIST An SPU context already exists at the given path name.
+
+ EFAULT pathname is not a valid string pointer in the current address
+ space.
+
+ EINVAL pathname is not a directory in the spufs mount point.
+
+ ELOOP Too many symlinks were found while resolving pathname.
+
+ EMFILE The process has reached its maximum open file limit.
+
+ ENAMETOOLONG
+ pathname was too long.
+
+ ENFILE The system has reached the global open file limit.
+
+ ENOENT Part of pathname could not be resolved.
+
+ ENOMEM The kernel could not allocate all resources required.
+
+ ENOSPC There are not enough SPU resources available to create a new
+ context or the user specific limit for the number of SPU con-
+ texts has been reached.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+ ENOTDIR
+ A part of pathname is not a directory.
+
+
+
+Notes
+=====
+ spu_create is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+Files
+=====
+ pathname must point to a location beneath the mount point of spufs. By
+ convention, it gets mounted in /spu.
+
+
+Conforming to
+=============
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+Bugs
+====
+ The code does not yet fully implement all features lined out here.
+
+
+Author
+======
+ Arnd Bergmann <arndb@de.ibm.com>
+
+See Also
+========
+ capabilities(7), close(2), spu_run(2), spufs(7)
diff --git a/Documentation/filesystems/spufs/spu_run.rst b/Documentation/filesystems/spufs/spu_run.rst
new file mode 100644
index 0000000..7fdb1c3
--- /dev/null
+++ b/Documentation/filesystems/spufs/spu_run.rst
@@ -0,0 +1,138 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======
+spu_run
+=======
+
+
+Name
+====
+ spu_run - execute an spu context
+
+
+Synopsis
+========
+
+ ::
+
+ #include <sys/spu.h>
+
+ int spu_run(int fd, unsigned int *npc, unsigned int *event);
+
+Description
+===========
+ The spu_run system call is used on PowerPC machines that implement the
+ Cell Broadband Engine Architecture in order to access Synergistic Pro-
+ cessor Units (SPUs). It uses the fd that was returned from spu_cre-
+ ate(2) to address a specific SPU context. When the context gets sched-
+ uled to a physical SPU, it starts execution at the instruction pointer
+ passed in npc.
+
+ Execution of SPU code happens synchronously, meaning that spu_run does
+ not return while the SPU is still running. If there is a need to exe-
+ cute SPU code in parallel with other code on either the main CPU or
+ other SPUs, you need to create a new thread of execution first, e.g.
+ using the pthread_create(3) call.
+
+ When spu_run returns, the current value of the SPU instruction pointer
+ is written back to npc, so you can call spu_run again without updating
+ the pointers.
+
+ event can be a NULL pointer or point to an extended status code that
+ gets filled when spu_run returns. It can be one of the following con-
+ stants:
+
+ SPE_EVENT_DMA_ALIGNMENT
+ A DMA alignment error
+
+ SPE_EVENT_SPE_DATA_SEGMENT
+ A DMA segmentation error
+
+ SPE_EVENT_SPE_DATA_STORAGE
+ A DMA storage error
+
+ If NULL is passed as the event argument, these errors will result in a
+ signal delivered to the calling process.
+
+Return Value
+============
+ spu_run returns the value of the spu_status register or -1 to indicate
+ an error and set errno to one of the error codes listed below. The
+ spu_status register value contains a bit mask of status codes and
+ optionally a 14 bit code returned from the stop-and-signal instruction
+ on the SPU. The bit masks for the status codes are:
+
+ 0x02
+ SPU was stopped by stop-and-signal.
+
+ 0x04
+ SPU was stopped by halt.
+
+ 0x08
+ SPU is waiting for a channel.
+
+ 0x10
+ SPU is in single-step mode.
+
+ 0x20
+ SPU has tried to execute an invalid instruction.
+
+ 0x40
+ SPU has tried to access an invalid channel.
+
+ 0x3fff0000
+ The bits masked with this value contain the code returned from
+ stop-and-signal.
+
+ There are always one or more of the lower eight bits set or an error
+ code is returned from spu_run.
+
+Errors
+======
+ EAGAIN or EWOULDBLOCK
+ fd is in non-blocking mode and spu_run would block.
+
+ EBADF fd is not a valid file descriptor.
+
+ EFAULT npc is not a valid pointer or status is neither NULL nor a valid
+ pointer.
+
+ EINTR A signal occurred while spu_run was in progress. The npc value
+ has been updated to the new program counter value if necessary.
+
+ EINVAL fd is not a file descriptor returned from spu_create(2).
+
+ ENOMEM Insufficient memory was available to handle a page fault result-
+ ing from an MFC direct memory access.
+
+ ENOSYS the functionality is not provided by the current system, because
+ either the hardware does not provide SPUs or the spufs module is
+ not loaded.
+
+
+Notes
+=====
+ spu_run is meant to be used from libraries that implement a more
+ abstract interface to SPUs, not to be used from regular applications.
+ See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
+ ommended libraries.
+
+
+Conforming to
+=============
+ This call is Linux specific and only implemented by the ppc64 architec-
+ ture. Programs using this system call are not portable.
+
+
+Bugs
+====
+ The code does not yet fully implement all features lined out here.
+
+
+Author
+======
+ Arnd Bergmann <arndb@de.ibm.com>
+
+See Also
+========
+ capabilities(7), close(2), spu_create(2), spufs(7)
diff --git a/Documentation/filesystems/spufs/spufs.rst b/Documentation/filesystems/spufs/spufs.rst
new file mode 100644
index 0000000..8a42859
--- /dev/null
+++ b/Documentation/filesystems/spufs/spufs.rst
@@ -0,0 +1,273 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+spufs
+=====
+
+Name
+====
+
+ spufs - the SPU file system
+
+
+Description
+===========
+
+ The SPU file system is used on PowerPC machines that implement the Cell
+ Broadband Engine Architecture in order to access Synergistic Processor
+ Units (SPUs).
+
+ The file system provides a name space similar to posix shared memory or
+ message queues. Users that have write permissions on the file system
+ can use spu_create(2) to establish SPU contexts in the spufs root.
+
+ Every SPU context is represented by a directory containing a predefined
+ set of files. These files can be used for manipulating the state of the
+ logical SPU. Users can change permissions on those files, but not actu-
+ ally add or remove files.
+
+
+Mount Options
+=============
+
+ uid=<uid>
+ set the user owning the mount point, the default is 0 (root).
+
+ gid=<gid>
+ set the group owning the mount point, the default is 0 (root).
+
+
+Files
+=====
+
+ The files in spufs mostly follow the standard behavior for regular sys-
+ tem calls like read(2) or write(2), but often support only a subset of
+ the operations supported on regular file systems. This list details the
+ supported operations and the deviations from the behaviour in the
+ respective man pages.
+
+ All files that support the read(2) operation also support readv(2) and
+ all files that support the write(2) operation also support writev(2).
+ All files support the access(2) and stat(2) family of operations, but
+ only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
+ contain reliable information.
+
+ All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
+ tions, but will not be able to grant permissions that contradict the
+ possible operations, e.g. read access on the wbox file.
+
+ The current set of files is:
+
+
+ /mem
+ the contents of the local storage memory of the SPU. This can be
+ accessed like a regular shared memory file and contains both code and
+ data in the address space of the SPU. The possible operations on an
+ open mem file are:
+
+ read(2), pread(2), write(2), pwrite(2), lseek(2)
+ These operate as documented, with the exception that seek(2),
+ write(2) and pwrite(2) are not supported beyond the end of the
+ file. The file size is the size of the local storage of the SPU,
+ which normally is 256 kilobytes.
+
+ mmap(2)
+ Mapping mem into the process address space gives access to the
+ SPU local storage within the process address space. Only
+ MAP_SHARED mappings are allowed.
+
+
+ /mbox
+ The first SPU to CPU communication mailbox. This file is read-only and
+ can be read in units of 32 bits. The file can only be used in non-
+ blocking mode and it even poll() will not block on it. The possible
+ operations on an open mbox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box, the return value is set to -1 and errno becomes EAGAIN.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+
+ /ibox
+ The second SPU to CPU communication mailbox. This file is similar to
+ the first mailbox file, but can be read in blocking I/O mode, and the
+ poll family of system calls can be used to wait for it. The possible
+ operations on an open ibox file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. If there is no data available in the mail
+ box and the file descriptor has been opened with O_NONBLOCK, the
+ return value is set to -1 and errno becomes EAGAIN.
+
+ If there is no data available in the mail box and the file
+ descriptor has been opened without O_NONBLOCK, the call will
+ block until the SPU writes to its interrupt mailbox channel.
+ When data has been read successfully, four bytes are placed in
+ the data buffer and the value four is returned.
+
+ poll(2)
+ Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
+ data is available for reading.
+
+
+ /wbox
+ The CPU to SPU communation mailbox. It is write-only and can be written
+ in units of 32 bits. If the mailbox is full, write() will block and
+ poll can be used to wait for it becoming empty again. The possible
+ operations on an open wbox file are: write(2) If a count smaller than
+ four is requested, write returns -1 and sets errno to EINVAL. If there
+ is no space available in the mail box and the file descriptor has been
+ opened with O_NONBLOCK, the return value is set to -1 and errno becomes
+ EAGAIN.
+
+ If there is no space available in the mail box and the file descriptor
+ has been opened without O_NONBLOCK, the call will block until the SPU
+ reads from its PPE mailbox channel. When data has been read success-
+ fully, four bytes are placed in the data buffer and the value four is
+ returned.
+
+ poll(2)
+ Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever
+ space is available for writing.
+
+
+ /mbox_stat, /ibox_stat, /wbox_stat
+ Read-only files that contain the length of the current queue, i.e. how
+ many words can be read from mbox or ibox or how many words can be
+ written to wbox without blocking. The files can be read only in 4-byte
+ units and return a big-endian binary integer number. The possible
+ operations on an open ``*box_stat`` file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the number of elements that can be
+ read from (for mbox_stat and ibox_stat) or written to (for
+ wbox_stat) the respective mail box without blocking or resulting
+ in EAGAIN.
+
+
+ /npc, /decr, /decr_status, /spu_tag_mask, /event_mask, /srr0
+ Internal registers of the SPU. The representation is an ASCII string
+ with the numeric value of the next instruction to be executed. These
+ can be used in read/write mode for debugging, but normal operation of
+ programs should not rely on them because access to any of them except
+ npc requires an SPU context save and is therefore very inefficient.
+
+ The contents of these files are:
+
+ =================== ===================================
+ npc Next Program Counter
+ decr SPU Decrementer
+ decr_status Decrementer Status
+ spu_tag_mask MFC tag mask for SPU DMA
+ event_mask Event mask for SPU interrupts
+ srr0 Interrupt Return address register
+ =================== ===================================
+
+
+ The possible operations on an open npc, decr, decr_status,
+ spu_tag_mask, event_mask or srr0 file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+ required length for the pointer value plus a newline character,
+ subsequent reads from the same file descriptor will result in
+ completing the string, regardless of changes to the register by
+ a running SPU task. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+ /fpcr
+ This file gives access to the Floating Point Status and Control Regis-
+ ter as a four byte long file. The operations on the fpcr file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the fpcr regis-
+ ter.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the fpcr register.
+
+
+ /signal1, /signal2
+ The two signal notification channels of an SPU. These are read-write
+ files that operate on a 32 bit word. Writing to one of these files
+ triggers an interrupt on the SPU. The value written to the signal
+ files can be read from the SPU through a channel read or from host user
+ space through the file. After the value has been read by the SPU, it
+ is reset to zero. The possible operations on an open signal1 or sig-
+ nal2 file are:
+
+ read(2)
+ If a count smaller than four is requested, read returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is placed in
+ the data buffer, containing the current value of the specified
+ signal notification register.
+
+ write(2)
+ If a count smaller than four is requested, write returns -1 and
+ sets errno to EINVAL. Otherwise, a four byte value is copied
+ from the data buffer, updating the value of the specified signal
+ notification register. The signal notification register will
+ either be replaced with the input data or will be updated to the
+ bitwise OR or the old value and the input data, depending on the
+ contents of the signal1_type, or signal2_type respectively,
+ file.
+
+
+ /signal1_type, /signal2_type
+ These two files change the behavior of the signal1 and signal2 notifi-
+ cation files. The contain a numerical ASCII string which is read as
+ either "1" or "0". In mode 0 (overwrite), the hardware replaces the
+ contents of the signal channel with the data that is written to it. in
+ mode 1 (logical OR), the hardware accumulates the bits that are subse-
+ quently written to it. The possible operations on an open signal1_type
+ or signal2_type file are:
+
+ read(2)
+ When the count supplied to the read call is shorter than the
+ required length for the digit plus a newline character, subse-
+ quent reads from the same file descriptor will result in com-
+ pleting the string. When a complete string has been read, all
+ subsequent read operations will return zero bytes and a new file
+ descriptor needs to be opened to read the value again.
+
+ write(2)
+ A write operation on the file results in setting the register to
+ the value given in the string. The string is parsed from the
+ beginning to the first non-numeric character or the end of the
+ buffer. Subsequent writes to the same file descriptor overwrite
+ the previous setting.
+
+
+Examples
+========
+ /etc/fstab entry
+ none /spu spufs gid=spu 0 0
+
+
+Authors
+=======
+ Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
+ Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
+
+See Also
+========
+ capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.rst
similarity index 90%
rename from Documentation/filesystems/squashfs.txt
rename to Documentation/filesystems/squashfs.rst
index e5274f8..df42106 100644
--- a/Documentation/filesystems/squashfs.txt
+++ b/Documentation/filesystems/squashfs.rst
@@ -1,7 +1,11 @@
-SQUASHFS 4.0 FILESYSTEM
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Squashfs 4.0 Filesystem
=======================
Squashfs is a compressed read-only filesystem for Linux.
+
It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
@@ -15,31 +19,33 @@
Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org
-1. FILESYSTEM FEATURES
+1. Filesystem Features
----------------------
Squashfs filesystem features versus Cramfs:
+============================== ========= ==========
Squashfs Cramfs
-
-Max filesystem size: 2^64 256 MiB
-Max file size: ~ 2 TiB 16 MiB
-Max files: unlimited unlimited
-Max directories: unlimited unlimited
-Max entries per directory: unlimited unlimited
-Max block size: 1 MiB 4 KiB
-Metadata compression: yes no
-Directory indexes: yes no
-Sparse file support: yes no
-Tail-end packing (fragments): yes no
-Exportable (NFS etc.): yes no
-Hard link support: yes no
-"." and ".." in readdir: yes no
-Real inode numbers: yes no
-32-bit uids/gids: yes no
-File creation time: yes no
-Xattr support: yes no
-ACL support: no no
+============================== ========= ==========
+Max filesystem size 2^64 256 MiB
+Max file size ~ 2 TiB 16 MiB
+Max files unlimited unlimited
+Max directories unlimited unlimited
+Max entries per directory unlimited unlimited
+Max block size 1 MiB 4 KiB
+Metadata compression yes no
+Directory indexes yes no
+Sparse file support yes no
+Tail-end packing (fragments) yes no
+Exportable (NFS etc.) yes no
+Hard link support yes no
+"." and ".." in readdir yes no
+Real inode numbers yes no
+32-bit uids/gids yes no
+File creation time yes no
+Xattr support yes no
+ACL support no no
+============================== ========= ==========
Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
@@ -47,7 +53,7 @@
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).
-2. USING SQUASHFS
+2. Using Squashfs
-----------------
As squashfs is a read-only filesystem, the mksquashfs program must be used to
@@ -58,11 +64,11 @@
The squashfs-tools development tree is now located on kernel.org
git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
-3. SQUASHFS FILESYSTEM DESIGN
+3. Squashfs Filesystem Design
-----------------------------
A squashfs filesystem consists of a maximum of nine parts, packed together on a
-byte alignment:
+byte alignment::
---------------
| superblock |
@@ -229,15 +235,15 @@
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.
-4. TODOS AND OUTSTANDING ISSUES
+4. TODOs and Outstanding Issues
-------------------------------
-4.1 Todo list
+4.1 TODO list
-------------
Implement ACL support.
-4.2 Squashfs internal cache
+4.2 Squashfs Internal Cache
---------------------------
Blocks in Squashfs are compressed. To avoid repeatedly decompressing
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
deleted file mode 100644
index 06f1d64..0000000
--- a/Documentation/filesystems/sysfs-pci.txt
+++ /dev/null
@@ -1,131 +0,0 @@
-Accessing PCI device resources through sysfs
---------------------------------------------
-
-sysfs, usually mounted at /sys, provides access to PCI resources on platforms
-that support it. For example, a given bus might look like this:
-
- /sys/devices/pci0000:17
- |-- 0000:17:00.0
- | |-- class
- | |-- config
- | |-- device
- | |-- enable
- | |-- irq
- | |-- local_cpus
- | |-- remove
- | |-- resource
- | |-- resource0
- | |-- resource1
- | |-- resource2
- | |-- revision
- | |-- rom
- | |-- subsystem_device
- | |-- subsystem_vendor
- | `-- vendor
- `-- ...
-
-The topmost element describes the PCI domain and bus number. In this case,
-the domain number is 0000 and the bus number is 17 (both values are in hex).
-This bus contains a single function device in slot 0. The domain and bus
-numbers are reproduced for convenience. Under the device directory are several
-files, each with their own function.
-
- file function
- ---- --------
- class PCI class (ascii, ro)
- config PCI config space (binary, rw)
- device PCI device (ascii, ro)
- enable Whether the device is enabled (ascii, rw)
- irq IRQ number (ascii, ro)
- local_cpus nearby CPU mask (cpumask, ro)
- remove remove device from kernel's list (ascii, wo)
- resource PCI resource host addresses (ascii, ro)
- resource0..N PCI resource N, if present (binary, mmap, rw[1])
- resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
- revision PCI revision (ascii, ro)
- rom PCI ROM resource, if present (binary, ro)
- subsystem_device PCI subsystem device (ascii, ro)
- subsystem_vendor PCI subsystem vendor (ascii, ro)
- vendor PCI vendor (ascii, ro)
-
- ro - read only file
- rw - file is readable and writable
- wo - write only file
- mmap - file is mmapable
- ascii - file contains ascii text
- binary - file contains binary data
- cpumask - file contains a cpumask type
-
-[1] rw for RESOURCE_IO (I/O port) regions only
-
-The read only files are informational, writes to them will be ignored, with
-the exception of the 'rom' file. Writable files can be used to perform
-actions on the device (e.g. changing config space, detaching a device).
-mmapable files are available via an mmap of the file at offset 0 and can be
-used to do actual device programming from userspace. Note that some platforms
-don't support mmapping of certain resources, so be sure to check the return
-value from any attempted mmap. The most notable of these are I/O port
-resources, which also provide read/write access.
-
-The 'enable' file provides a counter that indicates how many times the device
-has been enabled. If the 'enable' file currently returns '4', and a '1' is
-echoed into it, it will then return '5'. Echoing a '0' into it will decrease
-the count. Even when it returns to 0, though, some of the initialisation
-may not be reversed.
-
-The 'rom' file is special in that it provides read-only access to the device's
-ROM file, if available. It's disabled by default, however, so applications
-should write the string "1" to the file to enable it before attempting a read
-call, and disable it following the access by writing "0" to the file. Note
-that the device must be enabled for a rom read to return data successfully.
-In the event a driver is not bound to the device, it can be enabled using the
-'enable' file, documented above.
-
-The 'remove' file is used to remove the PCI device, by writing a non-zero
-integer to the file. This does not involve any kind of hot-plug functionality,
-e.g. powering off the device. The device is removed from the kernel's list of
-PCI devices, the sysfs directory for it is removed, and the device will be
-removed from any drivers attached to it. Removal of PCI root buses is
-disallowed.
-
-Accessing legacy resources through sysfs
-----------------------------------------
-
-Legacy I/O port and ISA memory resources are also provided in sysfs if the
-underlying platform supports them. They're located in the PCI class hierarchy,
-e.g.
-
- /sys/class/pci_bus/0000:17/
- |-- bridge -> ../../../devices/pci0000:17
- |-- cpuaffinity
- |-- legacy_io
- `-- legacy_mem
-
-The legacy_io file is a read/write file that can be used by applications to
-do legacy port I/O. The application should open the file, seek to the desired
-port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
-file should be mmapped with an offset corresponding to the memory offset
-desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
-simply dereference the returned pointer (after checking for errors of course)
-to access legacy memory space.
-
-Supporting PCI access on new platforms
---------------------------------------
-
-In order to support PCI resource mapping as described above, Linux platform
-code should ideally define ARCH_GENERIC_PCI_MMAP_RESOURCE and use the generic
-implementation of that functionality. To support the historical interface of
-mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP.
-
-Alternatively, platforms which set HAVE_PCI_MMAP may provide their own
-implementation of pci_mmap_page_range() instead of defining
-ARCH_GENERIC_PCI_MMAP_RESOURCE.
-
-Platforms which support write-combining maps of PCI resources must define
-arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
-write-combining is permitted. Platforms which support maps of I/O resources
-define arch_can_pci_mmap_io() similarly.
-
-Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
-wishing to support legacy functionality should define it and provide
-pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
deleted file mode 100644
index c7c8e64..0000000
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ /dev/null
@@ -1,42 +0,0 @@
-Sysfs tagging
--------------
-
-(Taken almost verbatim from Eric Biederman's netns tagging patch
-commit msg)
-
-The problem. Network devices show up in sysfs and with the network
-namespace active multiple devices with the same name can show up in
-the same directory, ouch!
-
-To avoid that problem and allow existing applications in network
-namespaces to see the same interface that is currently presented in
-sysfs, sysfs now has tagging directory support.
-
-By using the network namespace pointers as tags to separate out the
-the sysfs directory entries we ensure that we don't have conflicts
-in the directories and applications only see a limited set of
-the network devices.
-
-Each sysfs directory entry may be tagged with a namespace via the
-void *ns member of its kernfs_node. If a directory entry is tagged,
-then kernfs_node->flags will have a flag between KOBJ_NS_TYPE_NONE
-and KOBJ_NS_TYPES, and ns will point to the namespace to which it
-belongs.
-
-Each sysfs superblock's kernfs_super_info contains an array void
-*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace
-kobj_nstype first mounts sysfs, a new superblock is created. It
-will be differentiated from other sysfs mounts by having its
-s_fs_info->ns[kobj_nstype] set to the new namespace. Note that
-through bind mounting and mounts propagation, a task can easily view
-the contents of other namespaces' sysfs mounts. Therefore, when a
-namespace exits, it will call kobj_ns_exit() to invalidate any
-kernfs_node->ns pointers pointing to it.
-
-Users of this interface:
-- define a type in the kobj_ns_type enumeration.
-- call kobj_ns_type_register() with its kobj_ns_type_operations which has
- - current_ns() which returns current's namespace
- - netlink_ns() which returns a socket's namespace
- - initial_ns() which returns the initial namesapce
-- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
new file mode 100644
index 0000000..004d490
--- /dev/null
+++ b/Documentation/filesystems/sysfs.rst
@@ -0,0 +1,415 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+sysfs - _The_ filesystem for exporting kernel objects
+=====================================================
+
+Patrick Mochel <mochel@osdl.org>
+
+Mike Murphy <mamurph@cs.clemson.edu>
+
+:Revised: 16 August 2011
+:Original: 10 January 2003
+
+
+What it is:
+~~~~~~~~~~~
+
+sysfs is a ram-based filesystem initially based on ramfs. It provides
+a means to export kernel data structures, their attributes, and the
+linkages between them to userspace.
+
+sysfs is tied inherently to the kobject infrastructure. Please read
+Documentation/core-api/kobject.rst for more information concerning the kobject
+interface.
+
+
+Using sysfs
+~~~~~~~~~~~
+
+sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
+it by doing::
+
+ mount -t sysfs sysfs /sys
+
+
+Directory Creation
+~~~~~~~~~~~~~~~~~~
+
+For every kobject that is registered with the system, a directory is
+created for it in sysfs. That directory is created as a subdirectory
+of the kobject's parent, expressing internal object hierarchies to
+userspace. Top-level directories in sysfs represent the common
+ancestors of object hierarchies; i.e. the subsystems the objects
+belong to.
+
+Sysfs internally stores a pointer to the kobject that implements a
+directory in the kernfs_node object associated with the directory. In
+the past this kobject pointer has been used by sysfs to do reference
+counting directly on the kobject whenever the file is opened or closed.
+With the current sysfs implementation the kobject reference count is
+only modified directly by the function sysfs_schedule_callback().
+
+
+Attributes
+~~~~~~~~~~
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+Attributes should be ASCII text files, preferably with only one value
+per file. It is noted that it may not be efficient to contain only one
+value per file, so it is socially acceptable to express an array of
+values of the same type.
+
+Mixing types, expressing multiple lines of data, and doing fancy
+formatting of data is heavily frowned upon. Doing these things may get
+you publicly humiliated and your code rewritten without notice.
+
+
+An attribute definition is simply::
+
+ struct attribute {
+ char * name;
+ struct module *owner;
+ umode_t mode;
+ };
+
+
+ int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
+ void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
+
+
+A bare attribute contains no means to read or write the value of the
+attribute. Subsystems are encouraged to define their own attribute
+structure and wrapper functions for adding and removing attributes for
+a specific object type.
+
+For example, the driver model defines struct device_attribute like::
+
+ struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+ char *buf);
+ ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+ };
+
+ int device_create_file(struct device *, const struct device_attribute *);
+ void device_remove_file(struct device *, const struct device_attribute *);
+
+It also defines this helper for defining device attributes::
+
+ #define DEVICE_ATTR(_name, _mode, _show, _store) \
+ struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
+
+For example, declaring::
+
+ static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
+
+is equivalent to doing::
+
+ static struct device_attribute dev_attr_foo = {
+ .attr = {
+ .name = "foo",
+ .mode = S_IWUSR | S_IRUGO,
+ },
+ .show = show_foo,
+ .store = store_foo,
+ };
+
+Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
+considered a bad idea." so trying to set a sysfs file writable for
+everyone will fail reverting to RO mode for "Others".
+
+For the common cases sysfs.h provides convenience macros to make
+defining attributes easier as well as making code more concise and
+readable. The above case could be shortened to:
+
+static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
+
+the list of helpers available to define your wrapper function is:
+
+__ATTR_RO(name):
+ assumes default name_show and mode 0444
+__ATTR_WO(name):
+ assumes a name_store only and is restricted to mode
+ 0200 that is root write access only.
+__ATTR_RO_MODE(name, mode):
+ fore more restrictive RO access currently
+ only use case is the EFI System Resource Table
+ (see drivers/firmware/efi/esrt.c)
+__ATTR_RW(name):
+ assumes default name_show, name_store and setting
+ mode to 0644.
+__ATTR_NULL:
+ which sets the name to NULL and is used as end of list
+ indicator (see: kernel/workqueue.c)
+
+Subsystem-Specific Callbacks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a subsystem defines a new attribute type, it must implement a
+set of sysfs operations for forwarding read and write calls to the
+show and store methods of the attribute owners::
+
+ struct sysfs_ops {
+ ssize_t (*show)(struct kobject *, struct attribute *, char *);
+ ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
+ };
+
+[ Subsystems should have already defined a struct kobj_type as a
+descriptor for this type, which is where the sysfs_ops pointer is
+stored. See the kobject documentation for more information. ]
+
+When a file is read or written, sysfs calls the appropriate method
+for the type. The method then translates the generic struct kobject
+and struct attribute pointers to the appropriate pointer types, and
+calls the associated methods.
+
+
+To illustrate::
+
+ #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
+
+ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+ {
+ struct device_attribute *dev_attr = to_dev_attr(attr);
+ struct device *dev = kobj_to_dev(kobj);
+ ssize_t ret = -EIO;
+
+ if (dev_attr->show)
+ ret = dev_attr->show(dev, dev_attr, buf);
+ if (ret >= (ssize_t)PAGE_SIZE) {
+ printk("dev_attr_show: %pS returned bad count\n",
+ dev_attr->show);
+ }
+ return ret;
+ }
+
+
+
+Reading/Writing Attribute Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To read or write attributes, show() or store() methods must be
+specified when declaring the attribute. The method types should be as
+simple as those defined for device attributes::
+
+ ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
+ ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+
+IOW, they should take only an object, an attribute, and a buffer as parameters.
+
+
+sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
+method. Sysfs will call the method exactly once for each read or
+write. This forces the following behavior on the method
+implementations:
+
+- On read(2), the show() method should fill the entire buffer.
+ Recall that an attribute should only be exporting one value, or an
+ array of similar values, so this shouldn't be that expensive.
+
+ This allows userspace to do partial reads and forward seeks
+ arbitrarily over the entire file at will. If userspace seeks back to
+ zero or does a pread(2) with an offset of '0' the show() method will
+ be called again, rearmed, to fill the buffer.
+
+- On write(2), sysfs expects the entire buffer to be passed during the
+ first write. Sysfs then passes the entire buffer to the store() method.
+ A terminating null is added after the data on stores. This makes
+ functions like sysfs_streq() safe to use.
+
+ When writing sysfs files, userspace processes should first read the
+ entire file, modify the values it wishes to change, then write the
+ entire buffer back.
+
+ Attribute method implementations should operate on an identical
+ buffer when reading and writing values.
+
+Other notes:
+
+- Writing causes the show() method to be rearmed regardless of current
+ file position.
+
+- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+ is 4096.
+
+- show() methods should return the number of bytes printed into the
+ buffer.
+
+- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
+ the value to be returned to user space.
+
+- store() should return the number of bytes used from the buffer. If the
+ entire buffer has been used, just return the count argument.
+
+- show() or store() can always return errors. If a bad value comes
+ through, be sure to return an error.
+
+- The object passed to the methods will be pinned in memory via sysfs
+ referencing counting its embedded object. However, the physical
+ entity (e.g. device) the object represents may not be present. Be
+ sure to have a way to check this, if necessary.
+
+
+A very simple (and naive) implementation of a device attribute is::
+
+ static ssize_t show_name(struct device *dev, struct device_attribute *attr,
+ char *buf)
+ {
+ return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+ }
+
+ static ssize_t store_name(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+ {
+ snprintf(dev->name, sizeof(dev->name), "%.*s",
+ (int)min(count, sizeof(dev->name) - 1), buf);
+ return count;
+ }
+
+ static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
+
+
+(Note that the real implementation doesn't allow userspace to set the
+name for a device.)
+
+
+Top Level Directory Layout
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sysfs directory arrangement exposes the relationship of kernel
+data structures.
+
+The top level sysfs directory looks like::
+
+ block/
+ bus/
+ class/
+ dev/
+ devices/
+ firmware/
+ net/
+ fs/
+
+devices/ contains a filesystem representation of the device tree. It maps
+directly to the internal kernel device tree, which is a hierarchy of
+struct device.
+
+bus/ contains flat directory layout of the various bus types in the
+kernel. Each bus's directory contains two subdirectories::
+
+ devices/
+ drivers/
+
+devices/ contains symlinks for each device discovered in the system
+that point to the device's directory under root/.
+
+drivers/ contains a directory for each device driver that is loaded
+for devices on that particular bus (this assumes that drivers do not
+span multiple bus types).
+
+fs/ contains a directory for some filesystems. Currently each
+filesystem wanting to export attributes must create its own hierarchy
+below fs/ (see ./fuse.txt for an example).
+
+dev/ contains two directories char/ and block/. Inside these two
+directories there are symlinks named <major>:<minor>. These symlinks
+point to the sysfs directory for the given device. /sys/dev provides a
+quick way to lookup the sysfs interface for a device from the result of
+a stat(2) operation.
+
+More information can driver-model specific features can be found in
+Documentation/driver-api/driver-model/.
+
+
+TODO: Finish this section.
+
+
+Current Interfaces
+~~~~~~~~~~~~~~~~~~
+
+The following interface layers currently exist in sysfs:
+
+
+devices (include/linux/device.h)
+--------------------------------
+Structure::
+
+ struct device_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+ char *buf);
+ ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count);
+ };
+
+Declaring::
+
+ DEVICE_ATTR(_name, _mode, _show, _store);
+
+Creation/Removal::
+
+ int device_create_file(struct device *dev, const struct device_attribute * attr);
+ void device_remove_file(struct device *dev, const struct device_attribute * attr);
+
+
+bus drivers (include/linux/device.h)
+------------------------------------
+Structure::
+
+ struct bus_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct bus_type *, char * buf);
+ ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
+ };
+
+Declaring::
+
+ static BUS_ATTR_RW(name);
+ static BUS_ATTR_RO(name);
+ static BUS_ATTR_WO(name);
+
+Creation/Removal::
+
+ int bus_create_file(struct bus_type *, struct bus_attribute *);
+ void bus_remove_file(struct bus_type *, struct bus_attribute *);
+
+
+device drivers (include/linux/device.h)
+---------------------------------------
+
+Structure::
+
+ struct driver_attribute {
+ struct attribute attr;
+ ssize_t (*show)(struct device_driver *, char * buf);
+ ssize_t (*store)(struct device_driver *, const char * buf,
+ size_t count);
+ };
+
+Declaring::
+
+ DRIVER_ATTR_RO(_name)
+ DRIVER_ATTR_RW(_name)
+
+Creation/Removal::
+
+ int driver_create_file(struct device_driver *, const struct driver_attribute *);
+ void driver_remove_file(struct device_driver *, const struct driver_attribute *);
+
+
+Documentation
+~~~~~~~~~~~~~
+
+The sysfs directory structure and the attributes in each directory define an
+ABI between the kernel and user space. As for any ABI, it is important that
+this ABI is stable and properly documented. All new sysfs attributes must be
+documented in Documentation/ABI. See also Documentation/ABI/README for more
+information.
diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt
deleted file mode 100644
index 33ec0a0..0000000
--- a/Documentation/filesystems/sysfs.txt
+++ /dev/null
@@ -1,406 +0,0 @@
-
-sysfs - _The_ filesystem for exporting kernel objects.
-
-Patrick Mochel <mochel@osdl.org>
-Mike Murphy <mamurph@cs.clemson.edu>
-
-Revised: 16 August 2011
-Original: 10 January 2003
-
-
-What it is:
-~~~~~~~~~~~
-
-sysfs is a ram-based filesystem initially based on ramfs. It provides
-a means to export kernel data structures, their attributes, and the
-linkages between them to userspace.
-
-sysfs is tied inherently to the kobject infrastructure. Please read
-Documentation/kobject.txt for more information concerning the kobject
-interface.
-
-
-Using sysfs
-~~~~~~~~~~~
-
-sysfs is always compiled in if CONFIG_SYSFS is defined. You can access
-it by doing:
-
- mount -t sysfs sysfs /sys
-
-
-Directory Creation
-~~~~~~~~~~~~~~~~~~
-
-For every kobject that is registered with the system, a directory is
-created for it in sysfs. That directory is created as a subdirectory
-of the kobject's parent, expressing internal object hierarchies to
-userspace. Top-level directories in sysfs represent the common
-ancestors of object hierarchies; i.e. the subsystems the objects
-belong to.
-
-Sysfs internally stores a pointer to the kobject that implements a
-directory in the kernfs_node object associated with the directory. In
-the past this kobject pointer has been used by sysfs to do reference
-counting directly on the kobject whenever the file is opened or closed.
-With the current sysfs implementation the kobject reference count is
-only modified directly by the function sysfs_schedule_callback().
-
-
-Attributes
-~~~~~~~~~~
-
-Attributes can be exported for kobjects in the form of regular files in
-the filesystem. Sysfs forwards file I/O operations to methods defined
-for the attributes, providing a means to read and write kernel
-attributes.
-
-Attributes should be ASCII text files, preferably with only one value
-per file. It is noted that it may not be efficient to contain only one
-value per file, so it is socially acceptable to express an array of
-values of the same type.
-
-Mixing types, expressing multiple lines of data, and doing fancy
-formatting of data is heavily frowned upon. Doing these things may get
-you publicly humiliated and your code rewritten without notice.
-
-
-An attribute definition is simply:
-
-struct attribute {
- char * name;
- struct module *owner;
- umode_t mode;
-};
-
-
-int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
-void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
-
-
-A bare attribute contains no means to read or write the value of the
-attribute. Subsystems are encouraged to define their own attribute
-structure and wrapper functions for adding and removing attributes for
-a specific object type.
-
-For example, the driver model defines struct device_attribute like:
-
-struct device_attribute {
- struct attribute attr;
- ssize_t (*show)(struct device *dev, struct device_attribute *attr,
- char *buf);
- ssize_t (*store)(struct device *dev, struct device_attribute *attr,
- const char *buf, size_t count);
-};
-
-int device_create_file(struct device *, const struct device_attribute *);
-void device_remove_file(struct device *, const struct device_attribute *);
-
-It also defines this helper for defining device attributes:
-
-#define DEVICE_ATTR(_name, _mode, _show, _store) \
-struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store)
-
-For example, declaring
-
-static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);
-
-is equivalent to doing:
-
-static struct device_attribute dev_attr_foo = {
- .attr = {
- .name = "foo",
- .mode = S_IWUSR | S_IRUGO,
- },
- .show = show_foo,
- .store = store_foo,
-};
-
-Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally
-considered a bad idea." so trying to set a sysfs file writable for
-everyone will fail reverting to RO mode for "Others".
-
-For the common cases sysfs.h provides convenience macros to make
-defining attributes easier as well as making code more concise and
-readable. The above case could be shortened to:
-
-static struct device_attribute dev_attr_foo = __ATTR_RW(foo);
-
-the list of helpers available to define your wrapper function is:
-__ATTR_RO(name): assumes default name_show and mode 0444
-__ATTR_WO(name): assumes a name_store only and is restricted to mode
- 0200 that is root write access only.
-__ATTR_RO_MODE(name, mode): fore more restrictive RO access currently
- only use case is the EFI System Resource Table
- (see drivers/firmware/efi/esrt.c)
-__ATTR_RW(name): assumes default name_show, name_store and setting
- mode to 0644.
-__ATTR_NULL: which sets the name to NULL and is used as end of list
- indicator (see: kernel/workqueue.c)
-
-Subsystem-Specific Callbacks
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When a subsystem defines a new attribute type, it must implement a
-set of sysfs operations for forwarding read and write calls to the
-show and store methods of the attribute owners.
-
-struct sysfs_ops {
- ssize_t (*show)(struct kobject *, struct attribute *, char *);
- ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
-};
-
-[ Subsystems should have already defined a struct kobj_type as a
-descriptor for this type, which is where the sysfs_ops pointer is
-stored. See the kobject documentation for more information. ]
-
-When a file is read or written, sysfs calls the appropriate method
-for the type. The method then translates the generic struct kobject
-and struct attribute pointers to the appropriate pointer types, and
-calls the associated methods.
-
-
-To illustrate:
-
-#define to_dev(obj) container_of(obj, struct device, kobj)
-#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
-
-static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct device_attribute *dev_attr = to_dev_attr(attr);
- struct device *dev = to_dev(kobj);
- ssize_t ret = -EIO;
-
- if (dev_attr->show)
- ret = dev_attr->show(dev, dev_attr, buf);
- if (ret >= (ssize_t)PAGE_SIZE) {
- printk("dev_attr_show: %pS returned bad count\n",
- dev_attr->show);
- }
- return ret;
-}
-
-
-
-Reading/Writing Attribute Data
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To read or write attributes, show() or store() methods must be
-specified when declaring the attribute. The method types should be as
-simple as those defined for device attributes:
-
-ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf);
-ssize_t (*store)(struct device *dev, struct device_attribute *attr,
- const char *buf, size_t count);
-
-IOW, they should take only an object, an attribute, and a buffer as parameters.
-
-
-sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
-method. Sysfs will call the method exactly once for each read or
-write. This forces the following behavior on the method
-implementations:
-
-- On read(2), the show() method should fill the entire buffer.
- Recall that an attribute should only be exporting one value, or an
- array of similar values, so this shouldn't be that expensive.
-
- This allows userspace to do partial reads and forward seeks
- arbitrarily over the entire file at will. If userspace seeks back to
- zero or does a pread(2) with an offset of '0' the show() method will
- be called again, rearmed, to fill the buffer.
-
-- On write(2), sysfs expects the entire buffer to be passed during the
- first write. Sysfs then passes the entire buffer to the store() method.
- A terminating null is added after the data on stores. This makes
- functions like sysfs_streq() safe to use.
-
- When writing sysfs files, userspace processes should first read the
- entire file, modify the values it wishes to change, then write the
- entire buffer back.
-
- Attribute method implementations should operate on an identical
- buffer when reading and writing values.
-
-Other notes:
-
-- Writing causes the show() method to be rearmed regardless of current
- file position.
-
-- The buffer will always be PAGE_SIZE bytes in length. On i386, this
- is 4096.
-
-- show() methods should return the number of bytes printed into the
- buffer.
-
-- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
- the value to be returned to user space.
-
-- store() should return the number of bytes used from the buffer. If the
- entire buffer has been used, just return the count argument.
-
-- show() or store() can always return errors. If a bad value comes
- through, be sure to return an error.
-
-- The object passed to the methods will be pinned in memory via sysfs
- referencing counting its embedded object. However, the physical
- entity (e.g. device) the object represents may not be present. Be
- sure to have a way to check this, if necessary.
-
-
-A very simple (and naive) implementation of a device attribute is:
-
-static ssize_t show_name(struct device *dev, struct device_attribute *attr,
- char *buf)
-{
- return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
-}
-
-static ssize_t store_name(struct device *dev, struct device_attribute *attr,
- const char *buf, size_t count)
-{
- snprintf(dev->name, sizeof(dev->name), "%.*s",
- (int)min(count, sizeof(dev->name) - 1), buf);
- return count;
-}
-
-static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);
-
-
-(Note that the real implementation doesn't allow userspace to set the
-name for a device.)
-
-
-Top Level Directory Layout
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The sysfs directory arrangement exposes the relationship of kernel
-data structures.
-
-The top level sysfs directory looks like:
-
-block/
-bus/
-class/
-dev/
-devices/
-firmware/
-net/
-fs/
-
-devices/ contains a filesystem representation of the device tree. It maps
-directly to the internal kernel device tree, which is a hierarchy of
-struct device.
-
-bus/ contains flat directory layout of the various bus types in the
-kernel. Each bus's directory contains two subdirectories:
-
- devices/
- drivers/
-
-devices/ contains symlinks for each device discovered in the system
-that point to the device's directory under root/.
-
-drivers/ contains a directory for each device driver that is loaded
-for devices on that particular bus (this assumes that drivers do not
-span multiple bus types).
-
-fs/ contains a directory for some filesystems. Currently each
-filesystem wanting to export attributes must create its own hierarchy
-below fs/ (see ./fuse.txt for an example).
-
-dev/ contains two directories char/ and block/. Inside these two
-directories there are symlinks named <major>:<minor>. These symlinks
-point to the sysfs directory for the given device. /sys/dev provides a
-quick way to lookup the sysfs interface for a device from the result of
-a stat(2) operation.
-
-More information can driver-model specific features can be found in
-Documentation/driver-api/driver-model/.
-
-
-TODO: Finish this section.
-
-
-Current Interfaces
-~~~~~~~~~~~~~~~~~~
-
-The following interface layers currently exist in sysfs:
-
-
-- devices (include/linux/device.h)
-----------------------------------
-Structure:
-
-struct device_attribute {
- struct attribute attr;
- ssize_t (*show)(struct device *dev, struct device_attribute *attr,
- char *buf);
- ssize_t (*store)(struct device *dev, struct device_attribute *attr,
- const char *buf, size_t count);
-};
-
-Declaring:
-
-DEVICE_ATTR(_name, _mode, _show, _store);
-
-Creation/Removal:
-
-int device_create_file(struct device *dev, const struct device_attribute * attr);
-void device_remove_file(struct device *dev, const struct device_attribute * attr);
-
-
-- bus drivers (include/linux/device.h)
---------------------------------------
-Structure:
-
-struct bus_attribute {
- struct attribute attr;
- ssize_t (*show)(struct bus_type *, char * buf);
- ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
-};
-
-Declaring:
-
-static BUS_ATTR_RW(name);
-static BUS_ATTR_RO(name);
-static BUS_ATTR_WO(name);
-
-Creation/Removal:
-
-int bus_create_file(struct bus_type *, struct bus_attribute *);
-void bus_remove_file(struct bus_type *, struct bus_attribute *);
-
-
-- device drivers (include/linux/device.h)
------------------------------------------
-
-Structure:
-
-struct driver_attribute {
- struct attribute attr;
- ssize_t (*show)(struct device_driver *, char * buf);
- ssize_t (*store)(struct device_driver *, const char * buf,
- size_t count);
-};
-
-Declaring:
-
-DRIVER_ATTR_RO(_name)
-DRIVER_ATTR_RW(_name)
-
-Creation/Removal:
-
-int driver_create_file(struct device_driver *, const struct driver_attribute *);
-void driver_remove_file(struct device_driver *, const struct driver_attribute *);
-
-
-Documentation
-~~~~~~~~~~~~~
-
-The sysfs directory structure and the attributes in each directory define an
-ABI between the kernel and user space. As for any ABI, it is important that
-this ABI is stable and properly documented. All new sysfs attributes must be
-documented in Documentation/ABI. See also Documentation/ABI/README for more
-information.
diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.rst
similarity index 73%
rename from Documentation/filesystems/sysv-fs.txt
rename to Documentation/filesystems/sysv-fs.rst
index 253b50d..89e4091 100644
--- a/Documentation/filesystems/sysv-fs.txt
+++ b/Documentation/filesystems/sysv-fs.rst
@@ -1,25 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+SystemV Filesystem
+==================
+
It implements all of
- Xenix FS,
- SystemV/386 FS,
- Coherent FS.
To install:
+
* Answer the 'System V and Coherent filesystem support' question with 'y'
when configuring the kernel.
-* To mount a disk or a partition, use
+* To mount a disk or a partition, use::
+
mount [-r] -t sysv device mountpoint
- The file system type names
+
+ The file system type names::
+
-t sysv
-t xenix
-t coherent
+
may be used interchangeably, but the last two will eventually disappear.
Bugs in the present implementation:
+
- Coherent FS:
+
- The "free list interleave" n:m is currently ignored.
- Only file systems with no filesystem name and no pack name are recognized.
- (See Coherent "man mkfs" for a description of these features.)
+ (See Coherent "man mkfs" for a description of these features.)
+
- SystemV Release 2 FS:
+
The superblock is only searched in the blocks 9, 15, 18, which
corresponds to the beginning of track 1 on floppy disks. No support
for this FS on hard disk yet.
@@ -28,12 +43,14 @@
These filesystems are rather similar. Here is a comparison with Minix FS:
* Linux fdisk reports on partitions
+
- Minix FS 0x81 Linux/Minix
- Xenix FS ??
- SystemV FS ??
- Coherent FS 0x08 AIX bootable
* Size of a block or zone (data allocation unit on disk)
+
- Minix FS 1024
- Xenix FS 1024 (also 512 ??)
- SystemV FS 1024 (also 512 and 2048)
@@ -45,37 +62,51 @@
all the block numbers (including the super block) are offset by one track.
* Byte ordering of "short" (16 bit entities) on disk:
+
- Minix FS little endian 0 1
- Xenix FS little endian 0 1
- SystemV FS little endian 0 1
- Coherent FS little endian 0 1
+
Of course, this affects only the file system, not the data of files on it!
* Byte ordering of "long" (32 bit entities) on disk:
+
- Minix FS little endian 0 1 2 3
- Xenix FS little endian 0 1 2 3
- SystemV FS little endian 0 1 2 3
- Coherent FS PDP-11 2 3 0 1
+
Of course, this affects only the file system, not the data of files on it!
* Inode on disk: "short", 0 means non-existent, the root dir ino is:
- - Minix FS 1
- - Xenix FS, SystemV FS, Coherent FS 2
+
+ ================================= ==
+ Minix FS 1
+ Xenix FS, SystemV FS, Coherent FS 2
+ ================================= ==
* Maximum number of hard links to a file:
- - Minix FS 250
- - Xenix FS ??
- - SystemV FS ??
- - Coherent FS >=10000
+
+ =========== =========
+ Minix FS 250
+ Xenix FS ??
+ SystemV FS ??
+ Coherent FS >=10000
+ =========== =========
* Free inode management:
- - Minix FS a bitmap
+
+ - Minix FS
+ a bitmap
- Xenix FS, SystemV FS, Coherent FS
There is a cache of a certain number of free inodes in the super-block.
When it is exhausted, new free inodes are found using a linear search.
* Free block management:
- - Minix FS a bitmap
+
+ - Minix FS
+ a bitmap
- Xenix FS, SystemV FS, Coherent FS
Free blocks are organized in a "free list". Maybe a misleading term,
since it is not true that every free block contains a pointer to
@@ -86,13 +117,18 @@
0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
* Super-block location:
- - Minix FS block 1 = bytes 1024..2047
- - Xenix FS block 1 = bytes 1024..2047
- - SystemV FS bytes 512..1023
- - Coherent FS block 1 = bytes 512..1023
+
+ =========== ==========================
+ Minix FS block 1 = bytes 1024..2047
+ Xenix FS block 1 = bytes 1024..2047
+ SystemV FS bytes 512..1023
+ Coherent FS block 1 = bytes 512..1023
+ =========== ==========================
* Super-block layout:
- - Minix FS
+
+ - Minix FS::
+
unsigned short s_ninodes;
unsigned short s_nzones;
unsigned short s_imap_blocks;
@@ -101,7 +137,9 @@
unsigned short s_log_zone_size;
unsigned long s_max_size;
unsigned short s_magic;
- - Xenix FS, SystemV FS, Coherent FS
+
+ - Xenix FS, SystemV FS, Coherent FS::
+
unsigned short s_firstdatazone;
unsigned long s_nzones;
unsigned short s_fzone_count;
@@ -120,23 +158,33 @@
unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
char s_fname[6];
char s_fpack[6];
+
then they differ considerably:
- Xenix FS
+
+ Xenix FS::
+
char s_clean;
char s_fill[371];
long s_magic;
long s_type;
- SystemV FS
+
+ SystemV FS::
+
long s_fill[12 or 14];
long s_state;
long s_magic;
long s_type;
- Coherent FS
+
+ Coherent FS::
+
unsigned long s_unique;
+
Note that Coherent FS has no magic.
* Inode layout:
- - Minix FS
+
+ - Minix FS::
+
unsigned short i_mode;
unsigned short i_uid;
unsigned long i_size;
@@ -144,7 +192,9 @@
unsigned char i_gid;
unsigned char i_nlinks;
unsigned short i_zone[7+1+1];
- - Xenix FS, SystemV FS, Coherent FS
+
+ - Xenix FS, SystemV FS, Coherent FS::
+
unsigned short i_mode;
unsigned short i_nlink;
unsigned short i_uid;
@@ -155,38 +205,55 @@
unsigned long i_mtime;
unsigned long i_ctime;
-* Regular file data blocks are organized as
- - Minix FS
- 7 direct blocks
- 1 indirect block (pointers to blocks)
- 1 double-indirect block (pointer to pointers to blocks)
- - Xenix FS, SystemV FS, Coherent FS
- 10 direct blocks
- 1 indirect block (pointers to blocks)
- 1 double-indirect block (pointer to pointers to blocks)
- 1 triple-indirect block (pointer to pointers to pointers to blocks)
-* Inode size, inodes per block
- - Minix FS 32 32
- - Xenix FS 64 16
- - SystemV FS 64 16
- - Coherent FS 64 8
+* Regular file data blocks are organized as
+
+ - Minix FS:
+
+ - 7 direct blocks
+ - 1 indirect block (pointers to blocks)
+ - 1 double-indirect block (pointer to pointers to blocks)
+
+ - Xenix FS, SystemV FS, Coherent FS:
+
+ - 10 direct blocks
+ - 1 indirect block (pointers to blocks)
+ - 1 double-indirect block (pointer to pointers to blocks)
+ - 1 triple-indirect block (pointer to pointers to pointers to blocks)
+
+
+ =========== ========== ================
+ Inode size inodes per block
+ =========== ========== ================
+ Minix FS 32 32
+ Xenix FS 64 16
+ SystemV FS 64 16
+ Coherent FS 64 8
+ =========== ========== ================
* Directory entry on disk
- - Minix FS
+
+ - Minix FS::
+
unsigned short inode;
char name[14/30];
- - Xenix FS, SystemV FS, Coherent FS
+
+ - Xenix FS, SystemV FS, Coherent FS::
+
unsigned short inode;
char name[14];
-* Dir entry size, dir entries per block
- - Minix FS 16/32 64/32
- - Xenix FS 16 64
- - SystemV FS 16 64
- - Coherent FS 16 32
+ =========== ============== =====================
+ Dir entry size dir entries per block
+ =========== ============== =====================
+ Minix FS 16/32 64/32
+ Xenix FS 16 64
+ SystemV FS 16 64
+ Coherent FS 16 32
+ =========== ============== =====================
* How to implement symbolic links such that the host fsck doesn't scream:
+
- Minix FS normal
- Xenix FS kludge: as regular files with chmod 1000
- SystemV FS ??
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.rst
similarity index 77%
rename from Documentation/filesystems/tmpfs.txt
rename to Documentation/filesystems/tmpfs.rst
index 5ecbc03..c44f8b1 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+Tmpfs
+=====
+
Tmpfs is a file system which keeps all files in virtual memory.
@@ -14,7 +20,7 @@
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
+cannot swap and you do not have the possibility to resize them.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
@@ -26,7 +32,7 @@
1) There is always a kernel internal mount which you will not see at
all. This is used for shared anonymous mappings and SYSV shared
- memory.
+ memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
set, the user visible part of tmpfs is not build. But the internal
@@ -34,7 +40,7 @@
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
POSIX shared memory (shm_open, shm_unlink). Adding the following
- line to /etc/fstab should take care of this:
+ line to /etc/fstab should take care of this::
tmpfs /dev/shm tmpfs defaults 0 0
@@ -56,15 +62,17 @@
tmpfs has three mount options for sizing:
-size: The limit of allocated bytes for this tmpfs instance. The
+========= ============================================================
+size The limit of allocated bytes for this tmpfs instance. The
default is half of your physical RAM without swap. If you
oversize your tmpfs instances the machine will deadlock
since the OOM handler will not be able to free that memory.
-nr_blocks: The same as size, but in blocks of PAGE_SIZE.
-nr_inodes: The maximum number of inodes for this instance. The default
+nr_blocks The same as size, but in blocks of PAGE_SIZE.
+nr_inodes The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
+========= ============================================================
These parameters accept a suffix k, m or g for kilo, mega and giga and
can be changed on remount. The size parameter also accepts a suffix %
@@ -82,6 +90,7 @@
all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...'
+======================== ==============================================
mpol=default use the process allocation policy
(see set_mempolicy(2))
mpol=prefer:Node prefers to allocate memory from the given Node
@@ -89,6 +98,7 @@
mpol=interleave prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn
mpol=local prefers to allocate memory from the local node
+======================== ==============================================
NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
@@ -98,9 +108,9 @@
use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints
-[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed
-below. If the resulting NodeLists is the empty set, the effective memory
-policy for the file will revert to "default" policy.
+[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags,
+listed below. If the resulting NodeLists is the empty set, the effective
+memory policy for the file will revert to "default" policy.
NUMA memory allocation policies have optional flags that can be used in
conjunction with their modes. These optional flags can be specified
@@ -109,6 +119,8 @@
all available memory allocation policy mode flags and their effect on
memory policy.
+::
+
=static is equivalent to MPOL_F_STATIC_NODES
=relative is equivalent to MPOL_F_RELATIVE_NODES
@@ -128,22 +140,42 @@
To specify the initial root directory you can use the following mount
options:
-mode: The permissions as an octal number
-uid: The user id
-gid: The group id
+==== ==================================
+mode The permissions as an octal number
+uid The user id
+gid The group id
+==== ==================================
These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+======= ========================
+inode64 Use 64-bit inode numbers
+inode32 Use 32-bit inode numbers
+======= ========================
+
+On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
+On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the
+possibility of multiple files with the same inode number on a single device;
+but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
+if a long-lived tmpfs is accessed by 32-bit applications so ancient that
+opening a file larger than 2GiB fails with EINVAL.
+
+
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
-Author:
+:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
-Updated:
+:Updated:
Hugh Dickins, 4 June 2007
-Updated:
+:Updated:
KOSAKI Motohiro, 16 Mar 2010
+:Updated:
+ Chris Down, 13 July 2020
diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst
index 6a9584f..5210aed 100644
--- a/Documentation/filesystems/ubifs-authentication.rst
+++ b/Documentation/filesystems/ubifs-authentication.rst
@@ -1,9 +1,13 @@
-:orphan:
+.. SPDX-License-Identifier: GPL-2.0
.. UBIFS Authentication
.. sigma star gmbh
.. 2018
+============================
+UBIFS Authentication Support
+============================
+
Introduction
============
@@ -92,11 +96,11 @@
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types
-of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file
-contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes.
-Almost all types of nodes share a common header (`ubifs_ch`) containing basic
+of nodes. Eg. data nodes (``struct ubifs_data_node``) which store chunks of file
+contents or inode nodes (``struct ubifs_ino_node``) which represent VFS inodes.
+Almost all types of nodes share a common header (``ubifs_ch``) containing basic
information like node type, node length, a sequence number, etc. (see
-`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT
+``fs/ubifs/ubifs-media.h`` in kernel source). Exceptions are entries of the LPT
and some less important node types like padding nodes which are used to pad
unusable content at the end of LEBs.
@@ -431,9 +435,9 @@
References
==========
-[CRYPTSETUP2] http://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
+[CRYPTSETUP2] https://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
-[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
+[DMC-CBC-ATTACK] https://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.rst
similarity index 91%
rename from Documentation/filesystems/ubifs.txt
rename to Documentation/filesystems/ubifs.rst
index acc8044..e6ee997 100644
--- a/Documentation/filesystems/ubifs.txt
+++ b/Documentation/filesystems/ubifs.rst
@@ -1,5 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+UBI File System
+===============
+
Introduction
-=============
+============
UBIFS file-system stands for UBI File System. UBI stands for "Unsorted
Block Images". UBIFS is a flash file system, which means it is designed
@@ -79,6 +85,7 @@
(*) == default.
+==================== =======================================================
bulk_read read more in one go to take advantage of flash
media that read faster sequentially
no_bulk_read (*) do not bulk-read
@@ -98,6 +105,7 @@
auth_hash_name= The hash algorithm used for authentication. Used for
both hashing and for creating HMACs. Typical values
include "sha256" or "sha512"
+==================== =======================================================
Quick usage instructions
@@ -107,12 +115,14 @@
where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is
UBI volume name.
-Mount volume 0 on UBI device 0 to /mnt/ubifs:
-$ mount -t ubifs ubi0_0 /mnt/ubifs
+Mount volume 0 on UBI device 0 to /mnt/ubifs::
+
+ $ mount -t ubifs ubi0_0 /mnt/ubifs
Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume
-name):
-$ mount -t ubifs ubi0:rootfs /mnt/ubifs
+name)::
+
+ $ mount -t ubifs ubi0:rootfs /mnt/ubifs
The following is an example of the kernel boot arguments to attach mtd0
to UBI and mount volume "rootfs":
@@ -122,5 +132,6 @@
==========
UBIFS documentation and FAQ/HOWTO at the MTD web site:
-http://www.linux-mtd.infradead.org/doc/ubifs.html
-http://www.linux-mtd.infradead.org/faq/ubifs.html
+
+- http://www.linux-mtd.infradead.org/doc/ubifs.html
+- http://www.linux-mtd.infradead.org/faq/ubifs.html
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.rst
similarity index 83%
rename from Documentation/filesystems/udf.txt
rename to Documentation/filesystems/udf.rst
index e2f2faf..f9489dd 100644
--- a/Documentation/filesystems/udf.txt
+++ b/Documentation/filesystems/udf.rst
@@ -1,6 +1,8 @@
-*
-* Documentation/filesystems/udf.txt
-*
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+UDF file system
+===============
If you encounter problems with reading UDF discs using this driver,
please report them according to MAINTAINERS file.
@@ -18,8 +20,10 @@
by drive firmware.
-------------------------------------------------------------------------------
+
The following mount options are supported:
+ =========== ======================================
gid= Set the default group.
umask= Set the default umask.
mode= Set the default file permissions.
@@ -34,6 +38,7 @@
longad Use long ad's (default)
nostrict Unset strict conformance
iocharset= Set the NLS character set
+ =========== ======================================
The uid= and gid= options need a bit more explaining. They will accept a
decimal numeric value and all inodes on that mount will then appear as
@@ -47,13 +52,17 @@
The remaining are for debugging and disaster recovery:
- novrs Skip volume sequence recognition
+ ===== ================================
+ novrs Skip volume sequence recognition
+ ===== ================================
The following expect a offset from 0.
+ ========== =================================================
session= Set the CDROM session (default= last session)
anchor= Override standard anchor location. (default= 256)
lastblock= Set the last block of the filesystem/
+ ========== =================================================
-------------------------------------------------------------------------------
@@ -62,5 +71,5 @@
https://github.com/pali/udftools
Documentation on UDF and ECMA 167 is available FREE from:
- http://www.osta.org/
- http://www.ecma-international.org/
+ - http://www.osta.org/
+ - https://www.ecma-international.org/
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
new file mode 100644
index 0000000..e85d74e
--- /dev/null
+++ b/Documentation/filesystems/vfat.rst
@@ -0,0 +1,387 @@
+====
+VFAT
+====
+
+USING VFAT
+==========
+
+To use the vfat filesystem, use the filesystem type 'vfat'. i.e.::
+
+ mount -t vfat /dev/fd0 /mnt
+
+
+No special partition formatter is required,
+'mkdosfs' will work fine if you want to format from within Linux.
+
+VFAT MOUNT OPTIONS
+==================
+
+**uid=###**
+ Set the owner of all files on this filesystem.
+ The default is the uid of current process.
+
+**gid=###**
+ Set the group of all files on this filesystem.
+ The default is the gid of current process.
+
+**umask=###**
+ The permission mask (for files and directories, see *umask(1)*).
+ The default is the umask of current process.
+
+**dmask=###**
+ The permission mask for the directory.
+ The default is the umask of current process.
+
+**fmask=###**
+ The permission mask for files.
+ The default is the umask of current process.
+
+**allow_utime=###**
+ This option controls the permission check of mtime/atime.
+
+ **-20**: If current process is in group of file's group ID,
+ you can change timestamp.
+
+ **-2**: Other users can change timestamp.
+
+ The default is set from dmask option. If the directory is
+ writable, utime(2) is also allowed. i.e. ~dmask & 022.
+
+ Normally utime(2) checks current process is owner of
+ the file, or it has CAP_FOWNER capability. But FAT
+ filesystem doesn't have uid/gid on disk, so normal
+ check is too unflexible. With this option you can
+ relax it.
+
+**codepage=###**
+ Sets the codepage number for converting to shortname
+ characters on FAT filesystem.
+ By default, FAT_DEFAULT_CODEPAGE setting is used.
+
+**iocharset=<name>**
+ Character set to use for converting between the
+ encoding is used for user visible filename and 16 bit
+ Unicode characters. Long filenames are stored on disk
+ in Unicode format, but Unix for the most part doesn't
+ know how to deal with Unicode.
+ By default, FAT_DEFAULT_IOCHARSET setting is used.
+
+ There is also an option of doing UTF-8 translations
+ with the utf8 option.
+
+.. note:: ``iocharset=utf8`` is not recommended. If unsure, you should consider
+ the utf8 option instead.
+
+**utf8=<bool>**
+ UTF-8 is the filesystem safe version of Unicode that
+ is used by the console. It can be enabled or disabled
+ for the filesystem with this option.
+ If 'uni_xlate' gets set, UTF-8 gets disabled.
+ By default, FAT_DEFAULT_UTF8 setting is used.
+
+**uni_xlate=<bool>**
+ Translate unhandled Unicode characters to special
+ escaped sequences. This would let you backup and
+ restore filenames that are created with any Unicode
+ characters. Until Linux supports Unicode for real,
+ this gives you an alternative. Without this option,
+ a '?' is used when no translation is possible. The
+ escape character is ':' because it is otherwise
+ illegal on the vfat filesystem. The escape sequence
+ that gets used is ':' and the four digits of hexadecimal
+ unicode.
+
+**nonumtail=<bool>**
+ When creating 8.3 aliases, normally the alias will
+ end in '~1' or tilde followed by some number. If this
+ option is set, then if the filename is
+ "longfilename.txt" and "longfile.txt" does not
+ currently exist in the directory, longfile.txt will
+ be the short alias instead of longfi~1.txt.
+
+**usefree**
+ Use the "free clusters" value stored on FSINFO. It will
+ be used to determine number of free clusters without
+ scanning disk. But it's not used by default, because
+ recent Windows don't update it correctly in some
+ case. If you are sure the "free clusters" on FSINFO is
+ correct, by this option you can avoid scanning disk.
+
+**quiet**
+ Stops printing certain warning messages.
+
+**check=s|r|n**
+ Case sensitivity checking setting.
+
+ **s**: strict, case sensitive
+
+ **r**: relaxed, case insensitive
+
+ **n**: normal, default setting, currently case insensitive
+
+**nocase**
+ This was deprecated for vfat. Use ``shortname=win95`` instead.
+
+**shortname=lower|win95|winnt|mixed**
+ Shortname display/create setting.
+
+ **lower**: convert to lowercase for display,
+ emulate the Windows 95 rule for create.
+
+ **win95**: emulate the Windows 95 rule for display/create.
+
+ **winnt**: emulate the Windows NT rule for display/create.
+
+ **mixed**: emulate the Windows NT rule for display,
+ emulate the Windows 95 rule for create.
+
+ Default setting is `mixed`.
+
+**tz=UTC**
+ Interpret timestamps as UTC rather than local time.
+ This option disables the conversion of timestamps
+ between local time (as used by Windows on FAT) and UTC
+ (which Linux uses internally). This is particularly
+ useful when mounting devices (like digital cameras)
+ that are set to UTC in order to avoid the pitfalls of
+ local time.
+
+**time_offset=minutes**
+ Set offset for conversion of timestamps from local time
+ used by FAT to UTC. I.e. <minutes> minutes will be subtracted
+ from each timestamp to convert it to UTC used internally by
+ Linux. This is useful when time zone set in ``sys_tz`` is
+ not the time zone used by the filesystem. Note that this
+ option still does not provide correct time stamps in all
+ cases in presence of DST - time stamps in a different DST
+ setting will be off by one hour.
+
+**showexec**
+ If set, the execute permission bits of the file will be
+ allowed only if the extension part of the name is .EXE,
+ .COM, or .BAT. Not set by default.
+
+**debug**
+ Can be set, but unused by the current implementation.
+
+**sys_immutable**
+ If set, ATTR_SYS attribute on FAT is handled as
+ IMMUTABLE flag on Linux. Not set by default.
+
+**flush**
+ If set, the filesystem will try to flush to disk more
+ early than normal. Not set by default.
+
+**rodir**
+ FAT has the ATTR_RO (read-only) attribute. On Windows,
+ the ATTR_RO of the directory will just be ignored,
+ and is used only by applications as a flag (e.g. it's set
+ for the customized folder).
+
+ If you want to use ATTR_RO as read-only flag even for
+ the directory, set this option.
+
+**errors=panic|continue|remount-ro**
+ specify FAT behavior on critical errors: panic, continue
+ without doing anything or remount the partition in
+ read-only mode (default behavior).
+
+**discard**
+ If set, issues discard/TRIM commands to the block
+ device when blocks are freed. This is useful for SSD devices
+ and sparse/thinly-provisoned LUNs.
+
+**nfs=stale_rw|nostale_ro**
+ Enable this only if you want to export the FAT filesystem
+ over NFS.
+
+ **stale_rw**: This option maintains an index (cache) of directory
+ *inodes* by *i_logstart* which is used by the nfs-related code to
+ improve look-ups. Full file operations (read/write) over NFS is
+ supported but with cache eviction at NFS server, this could
+ result in ESTALE issues.
+
+ **nostale_ro**: This option bases the *inode* number and filehandle
+ on the on-disk location of a file in the MS-DOS directory entry.
+ This ensures that ESTALE will not be returned after a file is
+ evicted from the inode cache. However, it means that operations
+ such as rename, create and unlink could cause filehandles that
+ previously pointed at one file to point at a different file,
+ potentially causing data corruption. For this reason, this
+ option also mounts the filesystem readonly.
+
+ To maintain backward compatibility, ``'-o nfs'`` is also accepted,
+ defaulting to "stale_rw".
+
+**dos1xfloppy <bool>: 0,1,yes,no,true,false**
+ If set, use a fallback default BIOS Parameter Block
+ configuration, determined by backing device size. These static
+ parameters match defaults assumed by DOS 1.x for 160 kiB,
+ 180 kiB, 320 kiB, and 360 kiB floppies and floppy images.
+
+
+
+LIMITATION
+==========
+
+The fallocated region of file is discarded at umount/evict time
+when using fallocate with FALLOC_FL_KEEP_SIZE.
+So, User should assume that fallocated region can be discarded at
+last close if there is memory pressure resulting in eviction of
+the inode from the memory. As a result, for any dependency on
+the fallocated region, user should make sure to recheck fallocate
+after reopening the file.
+
+TODO
+====
+Need to get rid of the raw scanning stuff. Instead, always use
+a get next directory entry approach. The only thing left that uses
+raw scanning is the directory renaming code.
+
+
+POSSIBLE PROBLEMS
+=================
+
+- vfat_valid_longname does not properly checked reserved names.
+- When a volume name is the same as a directory name in the root
+ directory of the filesystem, the directory name sometimes shows
+ up as an empty file.
+- autoconv option does not work correctly.
+
+
+TEST SUITE
+==========
+If you plan to make any modifications to the vfat filesystem, please
+get the test suite that comes with the vfat distribution at
+
+`<http://web.archive.org/web/*/http://bmrc.berkeley.edu/people/chaffee/vfat.html>`_
+
+This tests quite a few parts of the vfat filesystem and additional
+tests for new features or untested features would be appreciated.
+
+NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
+=============================================
+This documentation was provided by Galen C. Hunt gchunt@cs.rochester.edu and
+lightly annotated by Gordon Chaffee.
+
+This document presents a very rough, technical overview of my
+knowledge of the extended FAT file system used in Windows NT 3.5 and
+Windows 95. I don't guarantee that any of the following is correct,
+but it appears to be so.
+
+The extended FAT file system is almost identical to the FAT
+file system used in DOS versions up to and including *6.223410239847*
+:-). The significant change has been the addition of long file names.
+These names support up to 255 characters including spaces and lower
+case characters as opposed to the traditional 8.3 short names.
+
+Here is the description of the traditional FAT entry in the current
+Windows 95 filesystem::
+
+ struct directory { // Short 8.3 names
+ unsigned char name[8]; // file name
+ unsigned char ext[3]; // file extension
+ unsigned char attr; // attribute byte
+ unsigned char lcase; // Case for base and extension
+ unsigned char ctime_ms; // Creation time, milliseconds
+ unsigned char ctime[2]; // Creation time
+ unsigned char cdate[2]; // Creation date
+ unsigned char adate[2]; // Last access date
+ unsigned char reserved[2]; // reserved values (ignored)
+ unsigned char time[2]; // time stamp
+ unsigned char date[2]; // date stamp
+ unsigned char start[2]; // starting cluster number
+ unsigned char size[4]; // size of the file
+ };
+
+
+The lcase field specifies if the base and/or the extension of an 8.3
+name should be capitalized. This field does not seem to be used by
+Windows 95 but it is used by Windows NT. The case of filenames is not
+completely compatible from Windows NT to Windows 95. It is not completely
+compatible in the reverse direction, however. Filenames that fit in
+the 8.3 namespace and are written on Windows NT to be lowercase will
+show up as uppercase on Windows 95.
+
+.. note:: Note that the ``start`` and ``size`` values are actually little
+ endian integer values. The descriptions of the fields in this
+ structure are public knowledge and can be found elsewhere.
+
+With the extended FAT system, Microsoft has inserted extra
+directory entries for any files with extended names. (Any name which
+legally fits within the old 8.3 encoding scheme does not have extra
+entries.) I call these extra entries slots. Basically, a slot is a
+specially formatted directory entry which holds up to 13 characters of
+a file's extended name. Think of slots as additional labeling for the
+directory entry of the file to which they correspond. Microsoft
+prefers to refer to the 8.3 entry for a file as its alias and the
+extended slot directory entries as the file name.
+
+The C structure for a slot directory entry follows::
+
+ struct slot { // Up to 13 characters of a long name
+ unsigned char id; // sequence number for slot
+ unsigned char name0_4[10]; // first 5 characters in name
+ unsigned char attr; // attribute byte
+ unsigned char reserved; // always 0
+ unsigned char alias_checksum; // checksum for 8.3 alias
+ unsigned char name5_10[12]; // 6 more characters in name
+ unsigned char start[2]; // starting cluster number
+ unsigned char name11_12[4]; // last 2 characters in name
+ };
+
+
+If the layout of the slots looks a little odd, it's only
+because of Microsoft's efforts to maintain compatibility with old
+software. The slots must be disguised to prevent old software from
+panicking. To this end, a number of measures are taken:
+
+ 1) The attribute byte for a slot directory entry is always set
+ to 0x0f. This corresponds to an old directory entry with
+ attributes of "hidden", "system", "read-only", and "volume
+ label". Most old software will ignore any directory
+ entries with the "volume label" bit set. Real volume label
+ entries don't have the other three bits set.
+
+ 2) The starting cluster is always set to 0, an impossible
+ value for a DOS file.
+
+Because the extended FAT system is backward compatible, it is
+possible for old software to modify directory entries. Measures must
+be taken to ensure the validity of slots. An extended FAT system can
+verify that a slot does in fact belong to an 8.3 directory entry by
+the following:
+
+ 1) Positioning. Slots for a file always immediately proceed
+ their corresponding 8.3 directory entry. In addition, each
+ slot has an id which marks its order in the extended file
+ name. Here is a very abbreviated view of an 8.3 directory
+ entry and its corresponding long name slots for the file
+ "My Big File.Extension which is long"::
+
+ <proceeding files...>
+ <slot #3, id = 0x43, characters = "h is long">
+ <slot #2, id = 0x02, characters = "xtension whic">
+ <slot #1, id = 0x01, characters = "My Big File.E">
+ <directory entry, name = "MYBIGFIL.EXT">
+
+
+ .. note:: Note that the slots are stored from last to first. Slots
+ are numbered from 1 to N. The Nth slot is ``or'ed`` with
+ 0x40 to mark it as the last one.
+
+ 2) Checksum. Each slot has an alias_checksum value. The
+ checksum is calculated from the 8.3 name using the
+ following algorithm::
+
+ for (sum = i = 0; i < 11; i++) {
+ sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]
+ }
+
+
+ 3) If there is free space in the final slot, a Unicode ``NULL (0x0000)``
+ is stored after the final character. After that, all unused
+ characters in the final slot are set to Unicode 0xFFFF.
+
+Finally, note that the extended name is stored in Unicode. Each Unicode
+character takes either two or four bytes, UTF-16LE encoded.
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
deleted file mode 100644
index 9103129..0000000
--- a/Documentation/filesystems/vfat.txt
+++ /dev/null
@@ -1,347 +0,0 @@
-USING VFAT
-----------------------------------------------------------------------
-To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
- mount -t vfat /dev/fd0 /mnt
-
-No special partition formatter is required. mkdosfs will work fine
-if you want to format from within Linux.
-
-VFAT MOUNT OPTIONS
-----------------------------------------------------------------------
-uid=### -- Set the owner of all files on this filesystem.
- The default is the uid of current process.
-
-gid=### -- Set the group of all files on this filesystem.
- The default is the gid of current process.
-
-umask=### -- The permission mask (for files and directories, see umask(1)).
- The default is the umask of current process.
-
-dmask=### -- The permission mask for the directory.
- The default is the umask of current process.
-
-fmask=### -- The permission mask for files.
- The default is the umask of current process.
-
-allow_utime=### -- This option controls the permission check of mtime/atime.
-
- 20 - If current process is in group of file's group ID,
- you can change timestamp.
- 2 - Other users can change timestamp.
-
- The default is set from `dmask' option. (If the directory is
- writable, utime(2) is also allowed. I.e. ~dmask & 022)
-
- Normally utime(2) checks current process is owner of
- the file, or it has CAP_FOWNER capability. But FAT
- filesystem doesn't have uid/gid on disk, so normal
- check is too unflexible. With this option you can
- relax it.
-
-codepage=### -- Sets the codepage number for converting to shortname
- characters on FAT filesystem.
- By default, FAT_DEFAULT_CODEPAGE setting is used.
-
-iocharset=<name> -- Character set to use for converting between the
- encoding is used for user visible filename and 16 bit
- Unicode characters. Long filenames are stored on disk
- in Unicode format, but Unix for the most part doesn't
- know how to deal with Unicode.
- By default, FAT_DEFAULT_IOCHARSET setting is used.
-
- There is also an option of doing UTF-8 translations
- with the utf8 option.
-
- NOTE: "iocharset=utf8" is not recommended. If unsure,
- you should consider the following option instead.
-
-utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
- is used by the console. It can be enabled or disabled
- for the filesystem with this option.
- If 'uni_xlate' gets set, UTF-8 gets disabled.
- By default, FAT_DEFAULT_UTF8 setting is used.
-
-uni_xlate=<bool> -- Translate unhandled Unicode characters to special
- escaped sequences. This would let you backup and
- restore filenames that are created with any Unicode
- characters. Until Linux supports Unicode for real,
- this gives you an alternative. Without this option,
- a '?' is used when no translation is possible. The
- escape character is ':' because it is otherwise
- illegal on the vfat filesystem. The escape sequence
- that gets used is ':' and the four digits of hexadecimal
- unicode.
-
-nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
- end in '~1' or tilde followed by some number. If this
- option is set, then if the filename is
- "longfilename.txt" and "longfile.txt" does not
- currently exist in the directory, 'longfile.txt' will
- be the short alias instead of 'longfi~1.txt'.
-
-usefree -- Use the "free clusters" value stored on FSINFO. It'll
- be used to determine number of free clusters without
- scanning disk. But it's not used by default, because
- recent Windows don't update it correctly in some
- case. If you are sure the "free clusters" on FSINFO is
- correct, by this option you can avoid scanning disk.
-
-quiet -- Stops printing certain warning messages.
-
-check=s|r|n -- Case sensitivity checking setting.
- s: strict, case sensitive
- r: relaxed, case insensitive
- n: normal, default setting, currently case insensitive
-
-nocase -- This was deprecated for vfat. Use shortname=win95 instead.
-
-shortname=lower|win95|winnt|mixed
- -- Shortname display/create setting.
- lower: convert to lowercase for display,
- emulate the Windows 95 rule for create.
- win95: emulate the Windows 95 rule for display/create.
- winnt: emulate the Windows NT rule for display/create.
- mixed: emulate the Windows NT rule for display,
- emulate the Windows 95 rule for create.
- Default setting is `mixed'.
-
-tz=UTC -- Interpret timestamps as UTC rather than local time.
- This option disables the conversion of timestamps
- between local time (as used by Windows on FAT) and UTC
- (which Linux uses internally). This is particularly
- useful when mounting devices (like digital cameras)
- that are set to UTC in order to avoid the pitfalls of
- local time.
-time_offset=minutes
- -- Set offset for conversion of timestamps from local time
- used by FAT to UTC. I.e. <minutes> minutes will be subtracted
- from each timestamp to convert it to UTC used internally by
- Linux. This is useful when time zone set in sys_tz is
- not the time zone used by the filesystem. Note that this
- option still does not provide correct time stamps in all
- cases in presence of DST - time stamps in a different DST
- setting will be off by one hour.
-
-showexec -- If set, the execute permission bits of the file will be
- allowed only if the extension part of the name is .EXE,
- .COM, or .BAT. Not set by default.
-
-debug -- Can be set, but unused by the current implementation.
-
-sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as
- IMMUTABLE flag on Linux. Not set by default.
-
-flush -- If set, the filesystem will try to flush to disk more
- early than normal. Not set by default.
-
-rodir -- FAT has the ATTR_RO (read-only) attribute. On Windows,
- the ATTR_RO of the directory will just be ignored,
- and is used only by applications as a flag (e.g. it's set
- for the customized folder).
-
- If you want to use ATTR_RO as read-only flag even for
- the directory, set this option.
-
-errors=panic|continue|remount-ro
- -- specify FAT behavior on critical errors: panic, continue
- without doing anything or remount the partition in
- read-only mode (default behavior).
-
-discard -- If set, issues discard/TRIM commands to the block
- device when blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisoned LUNs.
-
-nfs=stale_rw|nostale_ro
- Enable this only if you want to export the FAT filesystem
- over NFS.
-
- stale_rw: This option maintains an index (cache) of directory
- inodes by i_logstart which is used by the nfs-related code to
- improve look-ups. Full file operations (read/write) over NFS is
- supported but with cache eviction at NFS server, this could
- result in ESTALE issues.
-
- nostale_ro: This option bases the inode number and filehandle
- on the on-disk location of a file in the MS-DOS directory entry.
- This ensures that ESTALE will not be returned after a file is
- evicted from the inode cache. However, it means that operations
- such as rename, create and unlink could cause filehandles that
- previously pointed at one file to point at a different file,
- potentially causing data corruption. For this reason, this
- option also mounts the filesystem readonly.
-
- To maintain backward compatibility, '-o nfs' is also accepted,
- defaulting to stale_rw
-
-dos1xfloppy -- If set, use a fallback default BIOS Parameter Block
- configuration, determined by backing device size. These static
- parameters match defaults assumed by DOS 1.x for 160 kiB,
- 180 kiB, 320 kiB, and 360 kiB floppies and floppy images.
-
-
-<bool>: 0,1,yes,no,true,false
-
-LIMITATION
----------------------------------------------------------------------
-* The fallocated region of file is discarded at umount/evict time
- when using fallocate with FALLOC_FL_KEEP_SIZE.
- So, User should assume that fallocated region can be discarded at
- last close if there is memory pressure resulting in eviction of
- the inode from the memory. As a result, for any dependency on
- the fallocated region, user should make sure to recheck fallocate
- after reopening the file.
-
-TODO
-----------------------------------------------------------------------
-* Need to get rid of the raw scanning stuff. Instead, always use
- a get next directory entry approach. The only thing left that uses
- raw scanning is the directory renaming code.
-
-
-POSSIBLE PROBLEMS
-----------------------------------------------------------------------
-* vfat_valid_longname does not properly checked reserved names.
-* When a volume name is the same as a directory name in the root
- directory of the filesystem, the directory name sometimes shows
- up as an empty file.
-* autoconv option does not work correctly.
-
-BUG REPORTS
-----------------------------------------------------------------------
-If you have trouble with the VFAT filesystem, mail bug reports to
-chaffee@bmrc.cs.berkeley.edu. Please specify the filename
-and the operation that gave you trouble.
-
-TEST SUITE
-----------------------------------------------------------------------
-If you plan to make any modifications to the vfat filesystem, please
-get the test suite that comes with the vfat distribution at
-
- http://web.archive.org/web/*/http://bmrc.berkeley.edu/
- people/chaffee/vfat.html
-
-This tests quite a few parts of the vfat filesystem and additional
-tests for new features or untested features would be appreciated.
-
-NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
-----------------------------------------------------------------------
-(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
- and lightly annotated by Gordon Chaffee).
-
-This document presents a very rough, technical overview of my
-knowledge of the extended FAT file system used in Windows NT 3.5 and
-Windows 95. I don't guarantee that any of the following is correct,
-but it appears to be so.
-
-The extended FAT file system is almost identical to the FAT
-file system used in DOS versions up to and including 6.223410239847
-:-). The significant change has been the addition of long file names.
-These names support up to 255 characters including spaces and lower
-case characters as opposed to the traditional 8.3 short names.
-
-Here is the description of the traditional FAT entry in the current
-Windows 95 filesystem:
-
- struct directory { // Short 8.3 names
- unsigned char name[8]; // file name
- unsigned char ext[3]; // file extension
- unsigned char attr; // attribute byte
- unsigned char lcase; // Case for base and extension
- unsigned char ctime_ms; // Creation time, milliseconds
- unsigned char ctime[2]; // Creation time
- unsigned char cdate[2]; // Creation date
- unsigned char adate[2]; // Last access date
- unsigned char reserved[2]; // reserved values (ignored)
- unsigned char time[2]; // time stamp
- unsigned char date[2]; // date stamp
- unsigned char start[2]; // starting cluster number
- unsigned char size[4]; // size of the file
- };
-
-The lcase field specifies if the base and/or the extension of an 8.3
-name should be capitalized. This field does not seem to be used by
-Windows 95 but it is used by Windows NT. The case of filenames is not
-completely compatible from Windows NT to Windows 95. It is not completely
-compatible in the reverse direction, however. Filenames that fit in
-the 8.3 namespace and are written on Windows NT to be lowercase will
-show up as uppercase on Windows 95.
-
-Note that the "start" and "size" values are actually little
-endian integer values. The descriptions of the fields in this
-structure are public knowledge and can be found elsewhere.
-
-With the extended FAT system, Microsoft has inserted extra
-directory entries for any files with extended names. (Any name which
-legally fits within the old 8.3 encoding scheme does not have extra
-entries.) I call these extra entries slots. Basically, a slot is a
-specially formatted directory entry which holds up to 13 characters of
-a file's extended name. Think of slots as additional labeling for the
-directory entry of the file to which they correspond. Microsoft
-prefers to refer to the 8.3 entry for a file as its alias and the
-extended slot directory entries as the file name.
-
-The C structure for a slot directory entry follows:
-
- struct slot { // Up to 13 characters of a long name
- unsigned char id; // sequence number for slot
- unsigned char name0_4[10]; // first 5 characters in name
- unsigned char attr; // attribute byte
- unsigned char reserved; // always 0
- unsigned char alias_checksum; // checksum for 8.3 alias
- unsigned char name5_10[12]; // 6 more characters in name
- unsigned char start[2]; // starting cluster number
- unsigned char name11_12[4]; // last 2 characters in name
- };
-
-If the layout of the slots looks a little odd, it's only
-because of Microsoft's efforts to maintain compatibility with old
-software. The slots must be disguised to prevent old software from
-panicking. To this end, a number of measures are taken:
-
- 1) The attribute byte for a slot directory entry is always set
- to 0x0f. This corresponds to an old directory entry with
- attributes of "hidden", "system", "read-only", and "volume
- label". Most old software will ignore any directory
- entries with the "volume label" bit set. Real volume label
- entries don't have the other three bits set.
-
- 2) The starting cluster is always set to 0, an impossible
- value for a DOS file.
-
-Because the extended FAT system is backward compatible, it is
-possible for old software to modify directory entries. Measures must
-be taken to ensure the validity of slots. An extended FAT system can
-verify that a slot does in fact belong to an 8.3 directory entry by
-the following:
-
- 1) Positioning. Slots for a file always immediately proceed
- their corresponding 8.3 directory entry. In addition, each
- slot has an id which marks its order in the extended file
- name. Here is a very abbreviated view of an 8.3 directory
- entry and its corresponding long name slots for the file
- "My Big File.Extension which is long":
-
- <proceeding files...>
- <slot #3, id = 0x43, characters = "h is long">
- <slot #2, id = 0x02, characters = "xtension whic">
- <slot #1, id = 0x01, characters = "My Big File.E">
- <directory entry, name = "MYBIGFIL.EXT">
-
- Note that the slots are stored from last to first. Slots
- are numbered from 1 to N. The Nth slot is or'ed with 0x40
- to mark it as the last one.
-
- 2) Checksum. Each slot has an "alias_checksum" value. The
- checksum is calculated from the 8.3 name using the
- following algorithm:
-
- for (sum = i = 0; i < 11; i++) {
- sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]
- }
-
- 3) If there is free space in the final slot, a Unicode NULL (0x0000)
- is stored after the final character. After that, all unused
- characters in the final slot are set to Unicode 0xFFFF.
-
-Finally, note that the extended name is stored in Unicode. Each Unicode
-character takes either two or four bytes, UTF-16LE encoded.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09d..ca52c82 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -392,7 +392,7 @@
``set``
Called by the VFS to set the value of a particular extended
attribute. When the new value is NULL, called to remove a
- particular extended attribute. This method is called by the the
+ particular extended attribute. This method is called by the
setxattr(2) and removexattr(2) system calls.
When none of the xattr handlers of a filesystem match the specified
@@ -652,7 +652,7 @@
PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the the writepage and writepages operations some
+operations. This gives the writepage and writepages operations some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
return information back to the caller about the result of a writepage or
@@ -706,6 +706,7 @@
int (*readpage)(struct file *, struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
+ void (*readahead)(struct readahead_control *);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
@@ -765,9 +766,9 @@
``writepages``
called by the VM to write out pages associated with the
- address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then
+ address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
the writeback_control will specify a range of pages that must be
- written out. If it is WBC_SYNC_NONE, then a nr_to_write is
+ written out. If it is WB_SYNC_NONE, then a nr_to_write is
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
@@ -781,12 +782,26 @@
If defined, it should set the PageDirty flag, and the
PAGECACHE_TAG_DIRTY tag in the radix tree.
+``readahead``
+ Called by the VM to read pages associated with the address_space
+ object. The pages are consecutive in the page cache and are
+ locked. The implementation should decrement the page refcount
+ after starting I/O on each page. Usually the page will be
+ unlocked by the I/O completion handler. If the filesystem decides
+ to stop attempting I/O before reaching the end of the readahead
+ window, it can simply return. The caller will decrement the page
+ refcount and unlock the remaining pages for you. Set PageUptodate
+ if the I/O completes successfully. Setting PageError on any page
+ will be ignored; simply unlock the page if an I/O error occurs.
+
``readpages``
called by the VM to read pages associated with the address_space
object. This is essentially just a vector version of readpage.
Instead of just one page, several pages are requested.
readpages is only used for read-ahead, so read errors are
ignored. If anything goes wrong, feel free to give up.
+ This interface is deprecated and will be removed by the end of
+ 2020; implement readahead instead.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -1101,7 +1116,7 @@
before any bytes were remapped. The remap_flags parameter
accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the
implementation must only remap if the requested file ranges have
- identical contents. If REMAP_CAN_SHORTEN is set, the caller is
+ identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason).
@@ -1416,13 +1431,13 @@
version.)
Creating Linux virtual filesystems. 2002
- <http://lwn.net/Articles/13325/>
+ <https://lwn.net/Articles/13325/>
The Linux Virtual File-system Layer by Neil Brown. 1999
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
A tour of the Linux VFS by Michael K. Johnson. 1996
- <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
+ <https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
A small trail through the Linux kernel by Andries Brouwer. 2001
- <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
+ <https://www.win.tue.nl/~aeb/linux/vfs/trail.html>
diff --git a/Documentation/filesystems/virtiofs.rst b/Documentation/filesystems/virtiofs.rst
index 4f338e3..fd4d248 100644
--- a/Documentation/filesystems/virtiofs.rst
+++ b/Documentation/filesystems/virtiofs.rst
@@ -1,5 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _virtiofs_index:
+
===================================================
virtiofs: virtio-fs host<->guest shared file system
===================================================
@@ -37,6 +39,20 @@
Please see https://virtio-fs.gitlab.io/ for details on how to configure QEMU
and the virtiofsd daemon.
+Mount options
+-------------
+
+virtiofs supports general VFS mount options, for example, remount,
+ro, rw, context, etc. It also supports FUSE mount options.
+
+atime behavior
+^^^^^^^^^^^^^^
+
+The atime-related mount options, for example, noatime, strictatime,
+are ignored. The atime behavior for virtiofs is the same as the
+underlying filesystem of the directory that has been exported
+on the host.
+
Internals
=========
Since the virtio-fs device uses the FUSE protocol for file system requests, the
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.rst
similarity index 96%
rename from Documentation/filesystems/xfs-delayed-logging-design.txt
rename to Documentation/filesystems/xfs-delayed-logging-design.rst
index 9a6dd28..464405d 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.txt
+++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
@@ -1,8 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
XFS Delayed Logging Design
---------------------------
+==========================
Introduction to Re-logging in XFS
----------------------------------
+=================================
XFS logging is a combination of logical and physical logging. Some objects,
such as inodes and dquots, are logged in logical format where the details
@@ -25,7 +28,7 @@
That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
of transactions, their contents and the log sequence number (LSN) of the
-transaction:
+transaction::
Transaction Contents LSN
A A X
@@ -85,7 +88,7 @@
bound.
Delayed Logging: Concepts
--------------------------
+=========================
The key thing to note about the asynchronous logging combined with the
relogging technique XFS uses is that we can be relogging changed objects
@@ -154,9 +157,10 @@
6. No performance regressions for synchronous transaction workloads.
Delayed Logging: Design
------------------------
+=======================
Storing Changes
+---------------
The problem with accumulating changes at a logical level (i.e. just using the
existing log item dirty region tracking) is that when it comes to writing the
@@ -194,30 +198,30 @@
formatting method and the delayed logging formatting can be seen in the
diagram below.
-Current format log vector:
+Current format log vector::
-Object +---------------------------------------------+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Object +---------------------------------------------+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
-After formatting:
+After formatting::
-Log Buffer +-V1-+-V2-+----V3----+
+ Log Buffer +-V1-+-V2-+----V3----+
-Delayed logging vector:
+Delayed logging vector::
-Object +---------------------------------------------+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Object +---------------------------------------------+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
-After formatting:
+After formatting::
-Memory Buffer +-V1-+-V2-+----V3----+
-Vector 1 +----+
-Vector 2 +----+
-Vector 3 +----------+
+ Memory Buffer +-V1-+-V2-+----V3----+
+ Vector 1 +----+
+ Vector 2 +----+
+ Vector 3 +----------+
The memory buffer and associated vector need to be passed as a single object,
but still need to be associated with the parent object so if the object is
@@ -242,6 +246,7 @@
Tracking Changes
+----------------
Now that we can record transactional changes in memory in a form that allows
them to be used without limitations, we need to be able to track and accumulate
@@ -278,6 +283,7 @@
Delayed Logging: Checkpoints
+----------------------------
When we have a log synchronisation event, commonly known as a "log force",
all the items in the CIL must be written into the log via the log buffers.
@@ -341,7 +347,7 @@
detached from the log items. That is, when the CIL is flushed the memory
buffer and log vector attached to each log item needs to be attached to the
checkpoint context so that the log item can be released. In diagrammatic form,
-the CIL would look like this before the flush:
+the CIL would look like this before the flush::
CIL Head
|
@@ -362,7 +368,7 @@
-> vector array
And after the flush the CIL head is empty, and the checkpoint context log
-vector list would look like:
+vector list would look like::
Checkpoint Context
|
@@ -411,6 +417,7 @@
is in the dev tree....
Delayed Logging: Checkpoint Sequencing
+--------------------------------------
One of the key aspects of the XFS transaction subsystem is that it tags
committed transactions with the log sequence number of the transaction commit.
@@ -474,6 +481,7 @@
behaves the same regardless of whether delayed logging is being used or not.
Delayed Logging: Checkpoint Log Space Accounting
+------------------------------------------------
The big issue for a checkpoint transaction is the log space reservation for the
transaction. We don't know how big a checkpoint transaction is going to be
@@ -491,7 +499,7 @@
of log vectors in the transaction).
An example of the differences would be logging directory changes versus logging
-inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
+inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
there are lots of transactions that only contain an inode core and an inode log
format structure. That is, two vectors totaling roughly 150 bytes. If we modify
10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
@@ -565,6 +573,7 @@
Delayed Logging: Log Item Pinning
+---------------------------------
Currently log items are pinned during transaction commit while the items are
still locked. This happens just after the items are formatted, though it could
@@ -605,6 +614,7 @@
lock to guarantee that we pin the items correctly.
Delayed Logging: Concurrent Scalability
+---------------------------------------
A fundamental requirement for the CIL is that accesses through transaction
commits must scale to many concurrent commits. The current transaction commit
@@ -683,8 +693,9 @@
Lifecycle Changes
+-----------------
-The existing log item life cycle is as follows:
+The existing log item life cycle is as follows::
1. Transaction allocate
2. Transaction reserve
@@ -729,7 +740,7 @@
and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
are entered and completed is the object considered clean.
-With delayed logging, there are new steps inserted into the life cycle:
+With delayed logging, there are new steps inserted into the life cycle::
1. Transaction allocate
2. Transaction reserve
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.rst
similarity index 81%
rename from Documentation/filesystems/xfs-self-describing-metadata.txt
rename to Documentation/filesystems/xfs-self-describing-metadata.rst
index 8db0121..b79dbf3 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.txt
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -1,8 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
XFS Self Describing Metadata
-----------------------------
+============================
Introduction
-------------
+============
The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalabilty of the
@@ -34,7 +37,7 @@
Self Describing Metadata
-------------------------
+========================
One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
@@ -142,7 +145,7 @@
detected.
Runtime Validation
-------------------
+==================
Validation of self-describing metadata takes place at runtime in two places:
@@ -183,18 +186,18 @@
error for the higher layers to catch.
Structures
-----------
+==========
-A typical on-disk structure needs to contain the following information:
+A typical on-disk structure needs to contain the following information::
-struct xfs_ondisk_hdr {
- __be32 magic; /* magic number */
- __be32 crc; /* CRC, not logged */
- uuid_t uuid; /* filesystem identifier */
- __be64 owner; /* parent object */
- __be64 blkno; /* location on disk */
- __be64 lsn; /* last modification in log, not logged */
-};
+ struct xfs_ondisk_hdr {
+ __be32 magic; /* magic number */
+ __be32 crc; /* CRC, not logged */
+ uuid_t uuid; /* filesystem identifier */
+ __be64 owner; /* parent object */
+ __be64 blkno; /* location on disk */
+ __be64 lsn; /* last modification in log, not logged */
+ };
Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
@@ -214,24 +217,24 @@
well. hence the additional metadata headers change the overall format
of the metadata.
-A typical buffer read verifier is structured as follows:
+A typical buffer read verifier is structured as follows::
-#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
+ #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc)
-static void
-xfs_foo_read_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
+ static void
+ xfs_foo_read_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
- if ((xfs_sb_version_hascrc(&mp->m_sb) &&
- !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
- XFS_FOO_CRC_OFF)) ||
- !xfs_foo_verify(bp)) {
- XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
- xfs_buf_ioerror(bp, EFSCORRUPTED);
- }
-}
+ if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+ !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+ XFS_FOO_CRC_OFF)) ||
+ !xfs_foo_verify(bp)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+ xfs_buf_ioerror(bp, EFSCORRUPTED);
+ }
+ }
The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock of the feature bit, and then if the CRC verifies OK
@@ -239,83 +242,83 @@
The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
-the case it can't, the code is structured as follows:
+the case it can't, the code is structured as follows::
-static bool
-xfs_foo_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
+ static bool
+ xfs_foo_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
- if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
- return false;
+ if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+ return false;
- if (!xfs_sb_version_hascrc(&mp->m_sb)) {
- if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
- return false;
- if (bp->b_bn != be64_to_cpu(hdr->blkno))
- return false;
- if (hdr->owner == 0)
- return false;
- }
+ if (!xfs_sb_version_hascrc(&mp->m_sb)) {
+ if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+ return false;
+ if (bp->b_bn != be64_to_cpu(hdr->blkno))
+ return false;
+ if (hdr->owner == 0)
+ return false;
+ }
- /* object specific verification checks here */
+ /* object specific verification checks here */
- return true;
-}
+ return true;
+ }
If there are different magic numbers for the different formats, the verifier
-will look like:
+will look like::
-static bool
-xfs_foo_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
+ static bool
+ xfs_foo_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
- if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
- if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
- return false;
- if (bp->b_bn != be64_to_cpu(hdr->blkno))
- return false;
- if (hdr->owner == 0)
- return false;
- } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
- return false;
+ if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+ if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+ return false;
+ if (bp->b_bn != be64_to_cpu(hdr->blkno))
+ return false;
+ if (hdr->owner == 0)
+ return false;
+ } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+ return false;
- /* object specific verification checks here */
+ /* object specific verification checks here */
- return true;
-}
+ return true;
+ }
Write verifiers are very similar to the read verifiers, they just do things in
-the opposite order to the read verifiers. A typical write verifier:
+the opposite order to the read verifiers. A typical write verifier::
-static void
-xfs_foo_write_verify(
- struct xfs_buf *bp)
-{
- struct xfs_mount *mp = bp->b_mount;
- struct xfs_buf_log_item *bip = bp->b_fspriv;
+ static void
+ xfs_foo_write_verify(
+ struct xfs_buf *bp)
+ {
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_buf_log_item *bip = bp->b_fspriv;
- if (!xfs_foo_verify(bp)) {
- XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
- xfs_buf_ioerror(bp, EFSCORRUPTED);
- return;
- }
+ if (!xfs_foo_verify(bp)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+ xfs_buf_ioerror(bp, EFSCORRUPTED);
+ return;
+ }
- if (!xfs_sb_version_hascrc(&mp->m_sb))
- return;
+ if (!xfs_sb_version_hascrc(&mp->m_sb))
+ return;
- if (bip) {
- struct xfs_ondisk_hdr *hdr = bp->b_addr;
- hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
- }
- xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
-}
+ if (bip) {
+ struct xfs_ondisk_hdr *hdr = bp->b_addr;
+ hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+ }
+ xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+ }
This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
@@ -324,7 +327,7 @@
metadata. Once this is done, we can issue the IO.
Inodes and Dquots
------------------
+=================
Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
@@ -337,14 +340,13 @@
The structure of the verifiers and the identifiers checks is very similar to the
buffer code described above. The only difference is where they are called. For
-example, inode read verification is done in xfs_iread() when the inode is first
-read out of the buffer and the struct xfs_inode is instantiated. The inode is
-already extensively verified during writeback in xfs_iflush_int, so the only
-addition here is to add the LSN and CRC to the inode as it is copied back into
-the buffer.
+example, inode read verification is done in xfs_inode_from_disk() when the inode
+is first read out of the buffer and the struct xfs_inode is instantiated. The
+inode is already extensively verified during writeback in xfs_iflush_int, so the
+only addition here is to add the LSN and CRC to the inode as it is copied back
+into the buffer.
XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.
-
diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst
new file mode 100644
index 0000000..6b213fe
--- /dev/null
+++ b/Documentation/filesystems/zonefs.rst
@@ -0,0 +1,437 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================
+ZoneFS - Zone filesystem for Zoned block devices
+================================================
+
+Introduction
+============
+
+zonefs is a very simple file system exposing each zone of a zoned block device
+as a file. Unlike a regular POSIX-compliant file system with native zoned block
+device support (e.g. f2fs), zonefs does not hide the sequential write
+constraint of zoned block devices to the user. Files representing sequential
+write zones of the device must be written sequentially starting from the end
+of the file (append only writes).
+
+As such, zonefs is in essence closer to a raw block device access interface
+than to a full-featured POSIX file system. The goal of zonefs is to simplify
+the implementation of zoned block device support in applications by replacing
+raw block device file accesses with a richer file API, avoiding relying on
+direct block device file ioctls which may be more obscure to developers. One
+example of this approach is the implementation of LSM (log-structured merge)
+tree structures (such as used in RocksDB and LevelDB) on zoned block devices
+by allowing SSTables to be stored in a zone file similarly to a regular file
+system rather than as a range of sectors of the entire disk. The introduction
+of the higher level construct "one file is one zone" can help reducing the
+amount of changes needed in the application as well as introducing support for
+different application programming languages.
+
+Zoned block devices
+-------------------
+
+Zoned storage devices belong to a class of storage devices with an address
+space that is divided into zones. A zone is a group of consecutive LBAs and all
+zones are contiguous (there are no LBA gaps). Zones may have different types.
+
+* Conventional zones: there are no access constraints to LBAs belonging to
+ conventional zones. Any read or write access can be executed, similarly to a
+ regular block device.
+* Sequential zones: these zones accept random reads but must be written
+ sequentially. Each sequential zone has a write pointer maintained by the
+ device that keeps track of the mandatory start LBA position of the next write
+ to the device. As a result of this write constraint, LBAs in a sequential zone
+ cannot be overwritten. Sequential zones must first be erased using a special
+ command (zone reset) before rewriting.
+
+Zoned storage devices can be implemented using various recording and media
+technologies. The most common form of zoned storage today uses the SCSI Zoned
+Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
+Magnetic Recording (SMR) HDDs.
+
+Solid State Disks (SSD) storage devices can also implement a zoned interface
+to, for instance, reduce internal write amplification due to garbage collection.
+The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
+committee aiming at adding a zoned storage interface to the NVMe protocol.
+
+Zonefs Overview
+===============
+
+Zonefs exposes the zones of a zoned block device as files. The files
+representing zones are grouped by zone type, which are themselves represented
+by sub-directories. This file structure is built entirely using zone information
+provided by the device and so does not require any complex on-disk metadata
+structure.
+
+On-disk metadata
+----------------
+
+zonefs on-disk metadata is reduced to an immutable super block which
+persistently stores a magic number and optional feature flags and values. On
+mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
+and populates the mount point with a static file tree solely based on this
+information. File sizes come from the device zone type and write pointer
+position managed by the device itself.
+
+The super block is always written on disk at sector 0. The first zone of the
+device storing the super block is never exposed as a zone file by zonefs. If
+the zone containing the super block is a sequential zone, the mkzonefs format
+tool always "finishes" the zone, that is, it transitions the zone to a full
+state to make it read-only, preventing any data write.
+
+Zone type sub-directories
+-------------------------
+
+Files representing zones of the same type are grouped together under the same
+sub-directory automatically created on mount.
+
+For conventional zones, the sub-directory "cnv" is used. This directory is
+however created if and only if the device has usable conventional zones. If
+the device only has a single conventional zone at sector 0, the zone will not
+be exposed as a file as it will be used to store the zonefs super block. For
+such devices, the "cnv" sub-directory will not be created.
+
+For sequential write zones, the sub-directory "seq" is used.
+
+These two directories are the only directories that exist in zonefs. Users
+cannot create other directories and cannot rename nor delete the "cnv" and
+"seq" sub-directories.
+
+The size of the directories indicated by the st_size field of struct stat,
+obtained with the stat() or fstat() system calls, indicates the number of files
+existing under the directory.
+
+Zone files
+----------
+
+Zone files are named using the number of the zone they represent within the set
+of zones of a particular type. That is, both the "cnv" and "seq" directories
+contain files named "0", "1", "2", ... The file numbers also represent
+increasing zone start sector on the device.
+
+All read and write operations to zone files are not allowed beyond the file
+maximum size, that is, beyond the zone capacity. Any access exceeding the zone
+capacity is failed with the -EFBIG error.
+
+Creating, deleting, renaming or modifying any attribute of files and
+sub-directories is not allowed.
+
+The number of blocks of a file as reported by stat() and fstat() indicates the
+capacity of the zone file, or in other words, the maximum file size.
+
+Conventional zone files
+-----------------------
+
+The size of conventional zone files is fixed to the size of the zone they
+represent. Conventional zone files cannot be truncated.
+
+These files can be randomly read and written using any type of I/O operation:
+buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
+constraint for these files beyond the file size limit mentioned above.
+
+Sequential zone files
+---------------------
+
+The size of sequential zone files grouped in the "seq" sub-directory represents
+the file's zone write pointer position relative to the zone start sector.
+
+Sequential zone files can only be written sequentially, starting from the file
+end, that is, write operations can only be append writes. Zonefs makes no
+attempt at accepting random writes and will fail any write request that has a
+start offset not corresponding to the end of the file, or to the end of the last
+write issued and still in-flight (for asynchronous I/O operations).
+
+Since dirty page writeback by the page cache does not guarantee a sequential
+write pattern, zonefs prevents buffered writes and writeable shared mappings
+on sequential files. Only direct I/O writes are accepted for these files.
+zonefs relies on the sequential delivery of write I/O requests to the device
+implemented by the block layer elevator. An elevator implementing the sequential
+write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
+must be used. This type of elevator (e.g. mq-deadline) is set by default
+for zoned block devices on device initialization.
+
+There are no restrictions on the type of I/O used for read operations in
+sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
+all accepted.
+
+Truncating sequential zone files is allowed only down to 0, in which case, the
+zone is reset to rewind the file zone write pointer position to the start of
+the zone, or up to the zone capacity, in which case the file's zone is
+transitioned to the FULL state (finish zone operation).
+
+Format options
+--------------
+
+Several optional features of zonefs can be enabled at format time.
+
+* Conventional zone aggregation: ranges of contiguous conventional zones can be
+ aggregated into a single larger file instead of the default one file per zone.
+* File ownership: The owner UID and GID of zone files is by default 0 (root)
+ but can be changed to any valid UID/GID.
+* File access permissions: the default 640 access permissions can be changed.
+
+IO error handling
+-----------------
+
+Zoned block devices may fail I/O requests for reasons similar to regular block
+devices, e.g. due to bad sectors. However, in addition to such known I/O
+failure pattern, the standards governing zoned block devices behavior define
+additional conditions that result in I/O errors.
+
+* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
+ While the data already written in the zone is still readable, the zone can
+ no longer be written. No user action on the zone (zone management command or
+ read/write access) can change the zone condition back to a normal read/write
+ state. While the reasons for the device to transition a zone to read-only
+ state are not defined by the standards, a typical cause for such transition
+ would be a defective write head on an HDD (all zones under this head are
+ changed to read-only).
+
+* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
+ An offline zone cannot be read nor written. No user action can transition an
+ offline zone back to an operational good state. Similarly to zone read-only
+ transitions, the reasons for a drive to transition a zone to the offline
+ condition are undefined. A typical cause would be a defective read-write head
+ on an HDD causing all zones on the platter under the broken head to be
+ inaccessible.
+
+* Unaligned write errors: These errors result from the host issuing write
+ requests with a start sector that does not correspond to a zone write pointer
+ position when the write request is executed by the device. Even though zonefs
+ enforces sequential file write for sequential zones, unaligned write errors
+ may still happen in the case of a partial failure of a very large direct I/O
+ operation split into multiple BIOs/requests or asynchronous I/O operations.
+ If one of the write request within the set of sequential write requests
+ issued to the device fails, all write requests queued after it will
+ become unaligned and fail.
+
+* Delayed write errors: similarly to regular block devices, if the device side
+ write cache is enabled, write errors may occur in ranges of previously
+ completed writes when the device write cache is flushed, e.g. on fsync().
+ Similarly to the previous immediate unaligned write error case, delayed write
+ errors can propagate through a stream of cached sequential data for a zone
+ causing all data to be dropped after the sector that caused the error.
+
+All I/O errors detected by zonefs are notified to the user with an error code
+return for the system call that triggered or detected the error. The recovery
+actions taken by zonefs in response to I/O errors depend on the I/O type (read
+vs write) and on the reason for the error (bad sector, unaligned writes or zone
+condition change).
+
+* For read I/O errors, zonefs does not execute any particular recovery action,
+ but only if the file zone is still in a good condition and there is no
+ inconsistency between the file inode size and its zone write pointer position.
+ If a problem is detected, I/O error recovery is executed (see below table).
+
+* For write I/O errors, zonefs I/O error recovery is always executed.
+
+* A zone condition change to read-only or offline also always triggers zonefs
+ I/O error recovery.
+
+Zonefs minimal I/O error recovery may change a file size and file access
+permissions.
+
+* File size changes:
+ Immediate or delayed write errors in a sequential zone file may cause the file
+ inode size to be inconsistent with the amount of data successfully written in
+ the file zone. For instance, the partial failure of a multi-BIO large write
+ operation will cause the zone write pointer to advance partially, even though
+ the entire write operation will be reported as failed to the user. In such
+ case, the file inode size must be advanced to reflect the zone write pointer
+ change and eventually allow the user to restart writing at the end of the
+ file.
+ A file size may also be reduced to reflect a delayed write error detected on
+ fsync(): in this case, the amount of data effectively written in the zone may
+ be less than originally indicated by the file inode size. After such I/O
+ error, zonefs always fixes the file inode size to reflect the amount of data
+ persistently stored in the file zone.
+
+* Access permission changes:
+ A zone condition change to read-only is indicated with a change in the file
+ access permissions to render the file read-only. This disables changes to the
+ file attributes and data modification. For offline zones, all permissions
+ (read and write) to the file are disabled.
+
+Further action taken by zonefs I/O error recovery can be controlled by the user
+with the "errors=xxx" mount option. The table below summarizes the result of
+zonefs I/O error processing depending on the mount option and on the zone
+conditions::
+
+ +--------------+-----------+-----------------------------------------+
+ | | | Post error state |
+ | "errors=xxx" | device | access permissions |
+ | mount | zone | file file device zone |
+ | option | condition | size read write read write |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes no yes yes |
+ | remount-ro | read-only | as is yes no yes no |
+ | (default) | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes no yes yes |
+ | zone-ro | read-only | as is yes no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | 0 no no yes yes |
+ | zone-offline | read-only | 0 no no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes yes yes yes |
+ | repair | read-only | as is yes no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+
+Further notes:
+
+* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
+ error processing if no errors mount option is specified.
+* With the "errors=remount-ro" mount option, the change of the file access
+ permissions to read-only applies to all files. The file system is remounted
+ read-only.
+* Access permission and file size changes due to the device transitioning zones
+ to the offline condition are permanent. Remounting or reformatting the device
+ with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
+ state.
+* File access permission changes to read-only due to the device transitioning
+ zones to the read-only condition are permanent. Remounting or reformatting
+ the device will not re-enable file write access.
+* File access permission changes implied by the remount-ro, zone-ro and
+ zone-offline mount options are temporary for zones in a good condition.
+ Unmounting and remounting the file system will restore the previous default
+ (format time values) access rights to the files affected.
+* The repair mount option triggers only the minimal set of I/O error recovery
+ actions, that is, file size fixes for zones in a good condition. Zones
+ indicated as being read-only or offline by the device still imply changes to
+ the zone file access permissions as noted in the table above.
+
+Mount options
+-------------
+
+zonefs define the "errors=<behavior>" mount option to allow the user to specify
+zonefs behavior in response to I/O errors, inode size inconsistencies or zone
+condition changes. The defined behaviors are as follow:
+
+* remount-ro (default)
+* zone-ro
+* zone-offline
+* repair
+
+The run-time I/O error actions defined for each behavior are detailed in the
+previous section. Mount time I/O errors will cause the mount operation to fail.
+The handling of read-only zones also differs between mount-time and run-time.
+If a read-only zone is found at mount time, the zone is always treated in the
+same manner as offline zones, that is, all accesses are disabled and the zone
+file size set to 0. This is necessary as the write pointer of read-only zones
+is defined as invalib by the ZBC and ZAC standards, making it impossible to
+discover the amount of data that has been written to the zone. In the case of a
+read-only zone discovered at run-time, as indicated in the previous section.
+The size of the zone file is left unchanged from its last updated value.
+
+A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
+the number of zones that can be active, that is, zones that are in the
+implicit open, explicit open or closed conditions. This potential limitation
+translates into a risk for applications to see write IO errors due to this
+limit being exceeded if the zone of a file is not already active when a write
+request is issued by the user.
+
+To avoid these potential errors, the "explicit-open" mount option forces zones
+to be made active using an open zone command when a file is opened for writing
+for the first time. If the zone open command succeeds, the application is then
+guaranteed that write requests can be processed. Conversely, the
+"explicit-open" mount option will result in a zone close command being issued
+to the device on the last close() of a zone file if the zone is not full nor
+empty.
+
+Zonefs User Space Tools
+=======================
+
+The mkzonefs tool is used to format zoned block devices for use with zonefs.
+This tool is available on Github at:
+
+https://github.com/damien-lemoal/zonefs-tools
+
+zonefs-tools also includes a test suite which can be run against any zoned
+block device, including null_blk block device created with zoned mode.
+
+Examples
+--------
+
+The following formats a 15TB host-managed SMR HDD with 256 MB zones
+with the conventional zones aggregation feature enabled::
+
+ # mkzonefs -o aggr_cnv /dev/sdX
+ # mount -t zonefs /dev/sdX /mnt
+ # ls -l /mnt/
+ total 0
+ dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
+ dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
+
+The size of the zone files sub-directories indicate the number of files
+existing for each type of zones. In this example, there is only one
+conventional zone file (all conventional zones are aggregated under a single
+file)::
+
+ # ls -l /mnt/cnv
+ total 137101312
+ -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
+
+This aggregated conventional zone file can be used as a regular file::
+
+ # mkfs.ext4 /mnt/cnv/0
+ # mount -o loop /mnt/cnv/0 /data
+
+The "seq" sub-directory grouping files for sequential write zones has in this
+example 55356 zones::
+
+ # ls -lv /mnt/seq
+ total 14511243264
+ -rw-r----- 1 root root 0 Nov 25 13:23 0
+ -rw-r----- 1 root root 0 Nov 25 13:23 1
+ -rw-r----- 1 root root 0 Nov 25 13:23 2
+ ...
+ -rw-r----- 1 root root 0 Nov 25 13:23 55354
+ -rw-r----- 1 root root 0 Nov 25 13:23 55355
+
+For sequential write zone files, the file size changes as data is appended at
+the end of the file, similarly to any regular file system::
+
+ # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
+ 1+0 records in
+ 1+0 records out
+ 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
+
+ # ls -l /mnt/seq/0
+ -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
+
+The written file can be truncated to the zone size, preventing any further
+write operation::
+
+ # truncate -s 268435456 /mnt/seq/0
+ # ls -l /mnt/seq/0
+ -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
+
+Truncation to 0 size allows freeing the file zone storage space and restart
+append-writes to the file::
+
+ # truncate -s 0 /mnt/seq/0
+ # ls -l /mnt/seq/0
+ -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
+
+Since files are statically mapped to zones on the disk, the number of blocks
+of a file as reported by stat() and fstat() indicates the capacity of the file
+zone::
+
+ # stat /mnt/seq/0
+ File: /mnt/seq/0
+ Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
+ Device: 870h/2160d Inode: 50431 Links: 1
+ Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
+ Access: 2019-11-25 13:23:57.048971997 +0900
+ Modify: 2019-11-25 13:52:25.553805765 +0900
+ Change: 2019-11-25 13:52:25.553805765 +0900
+ Birth: -
+
+The number of blocks of the file ("Blocks") in units of 512B blocks gives the
+maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
+capacity in this example. Of note is that the "IO block" field always
+indicates the minimum I/O size for writes and corresponds to the device
+physical sector size.