Update Linux to v5.4.2
Change-Id: Idf6911045d9d382da2cfe01b1edff026404ac8fd
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst
index 3207076..716ad9b 100644
--- a/Documentation/admin-guide/LSM/LoadPin.rst
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -19,3 +19,13 @@
created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
a mutable filesystem means pinning is mutable too, but having the
sysctl allows for easy testing on systems with a mutable filesystem.)
+
+It's also possible to exclude specific file types from LoadPin using the
+kernel command line option "``loadpin.exclude``". By default, all files are
+included, but they can be excluded with a kernel command line option such
+as "``loadpin.exclude=kernel-module,kexec-image``". This allows using
+different mechanisms such as ``CONFIG_MODULE_SIG`` and
+``CONFIG_KEXEC_VERIFY_SIG`` to verify kernel modules and the kernel image
+while still using LoadPin to protect the integrity of other files the kernel
+loads. The full list of valid file types can be found in
+``kernel_read_file_str``, defined in ``include/linux/fs.h``.
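Note on the ``loadpin.exclude`` option added above: its value is a comma-separated list of file types. The following is a minimal userspace sketch, not kernel code, of how such a value could be picked out of a kernel command line string. The helper name `parse_loadpin_exclude` is hypothetical, the helper does not validate names against `kernel_read_file_str`, and how the kernel treats a repeated option is not covered by the text, so the sketch simply collects every occurrence.

```python
from typing import List


def parse_loadpin_exclude(cmdline: str) -> List[str]:
    """Return the file types named by loadpin.exclude= in a command line string."""
    excluded: List[str] = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only tokens of the form loadpin.exclude=<list> are of interest.
        if key == "loadpin.exclude" and sep:
            excluded.extend(t for t in value.split(",") if t)
    return excluded


example = "root=/dev/sda1 loadpin.exclude=kernel-module,kexec-image quiet"
print(parse_loadpin_exclude(example))
```

On a running system the same helper could be fed the contents of /proc/cmdline to see which file types were excluded at boot.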
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst
index f722c9b..520a1c2 100644
--- a/Documentation/admin-guide/LSM/SELinux.rst
+++ b/Documentation/admin-guide/LSM/SELinux.rst
@@ -6,7 +6,7 @@
to use the distro-provided policies, or install the
latest reference policy release from
- http://oss.tresys.com/projects/refpolicy
+ https://github.com/SELinuxProject/refpolicy
However, if you want to install a dummy policy for
testing, you can do using ``mdp`` provided under
diff --git a/Documentation/admin-guide/LSM/SafeSetID.rst b/Documentation/admin-guide/LSM/SafeSetID.rst
new file mode 100644
index 0000000..212434e
--- /dev/null
+++ b/Documentation/admin-guide/LSM/SafeSetID.rst
@@ -0,0 +1,107 @@
+=========
+SafeSetID
+=========
+SafeSetID is an LSM module that gates the setid family of syscalls to restrict
+UID/GID transitions from a given UID/GID to only those approved by a
+system-wide whitelist. These restrictions also prohibit the given UIDs/GIDs
+from obtaining auxiliary privileges associated with CAP_SET{U/G}ID, such as
+allowing a user to set up user namespace UID mappings.
+
+
+Background
+==========
+In the absence of file capabilities, processes spawned on a Linux system that
+need to switch to a different user must be spawned with CAP_SETUID privileges.
+CAP_SETUID is granted to programs running as root or those running as a non-root
+user that have been explicitly given the CAP_SETUID runtime capability. It is
+often preferable to use Linux runtime capabilities rather than file
+capabilities, because using file capabilities to run a program with elevated
+privileges opens up possible security holes: any user with access to the
+file can exec() that program to gain the elevated privileges.
+
+While it is possible to implement a tree of processes by giving full
+CAP_SET{U/G}ID capabilities, this is often at odds with the goals of running a
+tree of processes under non-root user(s) in the first place. Specifically,
+since CAP_SETUID allows changing to any user on the system, including the root
+user, it is an overpowered capability for what is needed in this scenario,
+especially since programs often only call setuid() to drop privileges to a
+lesser-privileged user -- not elevate privileges. Unfortunately, there is no
+generally feasible way in Linux to restrict the potential UIDs that a user can
+switch to through setuid() beyond allowing a switch to any user on the system.
+This SafeSetID LSM seeks to provide a solution for restricting setid
+capabilities in such a way.
+
+The main use case for this LSM is to allow a non-root program to transition to
+other untrusted uids without full blown CAP_SETUID capabilities. The non-root
+program would still need CAP_SETUID to do any kind of transition, but the
+additional restrictions imposed by this LSM would mean it is a "safer" version
+of CAP_SETUID since the non-root program cannot take advantage of CAP_SETUID to
+do any unapproved actions (e.g. setuid to uid 0 or create/enter new user
+namespace). The higher level goal is to allow for uid-based sandboxing of system
+services without having to give out CAP_SETUID all over the place just so that
+non-root programs can drop to even-lesser-privileged uids. This is especially
+relevant when one non-root daemon on the system should be allowed to spawn other
+processes as different uids, but it's undesirable to give the daemon a
+basically-root-equivalent CAP_SETUID.
+
+
+Other Approaches Considered
+===========================
+
+Solve this problem in userspace
+-------------------------------
+For candidate applications that would like to have restricted setid capabilities
+as implemented in this LSM, an alternative option would be to simply take away
+setid capabilities from the application completely and refactor the process
+spawning semantics in the application (e.g. by using a privileged helper program
+to do process spawning and UID/GID transitions). Unfortunately, there are a
+number of semantics around process spawning that would be affected by this, such
+as fork() calls where the program doesn't immediately call exec() after the
+fork(), parent processes specifying custom environment variables or command line
+args for spawned child processes, or inheritance of file handles across a
+fork()/exec(). Because of this, a solution that uses a privileged helper in
+userspace would likely be less appealing to incorporate into existing projects
+that rely on certain process-spawning semantics in Linux.
+
+Use user namespaces
+-------------------
+Another possible approach would be to run a given process tree in its own user
+namespace and give programs in the tree setid capabilities. In this way,
+programs in the tree could change to any desired UID/GID in the context of their
+own user namespace, and only approved UIDs/GIDs could be mapped back to the
+initial system user namespace, effectively preventing privilege escalation.
+Unfortunately, it is not generally feasible to use user namespaces in isolation,
+without pairing them with other namespace types, which is not always an option.
+Linux checks for capabilities based off of the user namespace that "owns" some
+entity. For example, Linux has the notion that network namespaces are owned by
+the user namespace in which they were created. A consequence of this is that
+capability checks for access to a given network namespace are done by checking
+whether a task has the given capability in the context of the user namespace
+that owns the network namespace -- not necessarily the user namespace under
+which the given task runs. Therefore spawning a process in a new user namespace
+effectively prevents it from accessing the network namespace owned by the
+initial namespace. This is a deal-breaker for any application that expects to
+retain the CAP_NET_ADMIN capability for the purpose of adjusting network
+configurations. Using user namespaces in isolation causes problems regarding
+other system interactions, including use of pid namespaces and device creation.
+
+Use an existing LSM
+-------------------
+None of the other in-tree LSMs have the capability to gate setid transitions, or
+even employ the security_task_fix_setuid hook at all. SELinux says of that hook:
+"Since setuid only affects the current process, and since the SELinux controls
+are not based on the Linux identity attributes, SELinux does not need to control
+this operation."
+
+
+Directions for use
+==================
+This LSM hooks the setid syscalls to make sure transitions are allowed if an
+applicable restriction policy is in place. Policies are configured through
+securityfs by writing to the safesetid/add_whitelist_policy and
+safesetid/flush_whitelist_policies files at the location where securityfs is
+mounted. The format for adding a policy is '<UID>:<UID>', using literal
+numbers, such as '123:456'. To flush the policies, any write to the file is
+sufficient. Again, configuring a policy for a UID will prevent that UID from
+obtaining auxiliary setid privileges, such as allowing a user to set up user
+namespace UID mappings.
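Note on the SafeSetID "Directions for use" section above: a policy entry is two literal UIDs joined by a colon, written to `safesetid/add_whitelist_policy` under the securityfs mount point. The sketch below shows one way to build and write such an entry. Everything named here is an assumption for illustration: the helper names are hypothetical, `/sys/kernel/security` is only the conventional securityfs mount point, and the reading of the first UID as the source and the second as the permitted target is not spelled out in the text above.

```python
import os

# Assumption: securityfs is mounted at its conventional location.
SECURITYFS = "/sys/kernel/security"


def format_policy(src_uid: int, dst_uid: int) -> str:
    """Build a '<UID>:<UID>' entry from two literal, non-negative UIDs."""
    for uid in (src_uid, dst_uid):
        if not isinstance(uid, int) or uid < 0:
            raise ValueError(f"not a valid UID: {uid!r}")
    return f"{src_uid}:{dst_uid}"


def add_policy(src_uid: int, dst_uid: int, root: str = SECURITYFS) -> None:
    """Write one entry to safesetid/add_whitelist_policy (needs privilege)."""
    path = os.path.join(root, "safesetid", "add_whitelist_policy")
    with open(path, "w") as f:
        f.write(format_policy(src_uid, dst_uid))


print(format_policy(123, 456))
```

Validating the UIDs before writing keeps malformed strings away from the securityfs file; whether the kernel would reject them itself is not stated above.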
diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst
index 6a5826a..6d44f4f 100644
--- a/Documentation/admin-guide/LSM/Smack.rst
+++ b/Documentation/admin-guide/LSM/Smack.rst
@@ -818,6 +818,10 @@
specifies a label to which all labels set on the
filesystem must have read access. Not yet enforced.
+ smackfstransmute=label:
+ behaves exactly like smackfsroot except that it also
+ sets the transmute flag on the root of the mount
+
These mount options apply to all file system types.
Smack auditing
diff --git a/Documentation/admin-guide/LSM/Yama.rst b/Documentation/admin-guide/LSM/Yama.rst
index 13468ea..d0a060d 100644
--- a/Documentation/admin-guide/LSM/Yama.rst
+++ b/Documentation/admin-guide/LSM/Yama.rst
@@ -64,8 +64,8 @@
Using ``PTRACE_TRACEME`` is unchanged.
2 - admin-only attach:
- only processes with ``CAP_SYS_PTRACE`` may use ptrace
- with ``PTRACE_ATTACH``, or through children calling ``PTRACE_TRACEME``.
+ only processes with ``CAP_SYS_PTRACE`` may use ptrace, either with
+ ``PTRACE_ATTACH`` or through children calling ``PTRACE_TRACEME``.
3 - no attach:
no processes may use ptrace with ``PTRACE_ATTACH`` nor via
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
index c980dfe..a6ba95f 100644
--- a/Documentation/admin-guide/LSM/index.rst
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -17,9 +17,8 @@
specific changes to system operation when these tweaks are not available
in the core functionality of Linux itself.
-Without a specific LSM built into the kernel, the default LSM will be the
-Linux capabilities system. Most LSMs choose to extend the capabilities
-system, building their checks on top of the defined capability hooks.
+The Linux capabilities modules will always be included. This may be
+followed by any number of "minor" modules and at most one "major" module.
For more details on capabilities, see ``capabilities(7)`` in the Linux
man-pages project.
@@ -30,6 +29,14 @@
be first, followed by any "minor" modules (e.g. Yama) and then
the one "major" module (e.g. SELinux) if there is one configured.
+Process attributes associated with "major" security modules should
+be accessed and maintained using the special files in ``/proc/.../attr``.
+A security module may maintain a module specific subdirectory there,
+named after the module. ``/proc/.../attr/smack`` is provided by the Smack
+security module and contains all its special files. The files directly
+in ``/proc/.../attr`` remain as legacy interfaces for modules that provide
+subdirectories.
+
.. toctree::
:maxdepth: 1
@@ -39,3 +46,4 @@
Smack
tomoyo
Yama
+ SafeSetID
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index 15ea785..cc6151f 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -1,9 +1,9 @@
.. _readme:
-Linux kernel release 4.x <http://kernel.org/>
+Linux kernel release 5.x <http://kernel.org/>
=============================================
-These are the release notes for Linux version 4. Read them carefully,
+These are the release notes for Linux version 5. Read them carefully,
as they tell you what this is all about, explain how to install the
kernel, and what to do if something goes wrong.
@@ -51,8 +51,7 @@
- There are various README files in the Documentation/ subdirectory:
these typically contain kernel-specific installation notes for some
- drivers for example. See Documentation/00-INDEX for a list of what
- is contained in each file. Please read the
+ drivers for example. Please read the
:ref:`Documentation/process/changes.rst <changes>` file, as it
contains information about the problems, which may result by upgrading
your kernel.
@@ -64,7 +63,7 @@
directory where you have permissions (e.g. your home directory) and
unpack it::
- xz -cd linux-4.X.tar.xz | tar xvf -
+ xz -cd linux-5.x.tar.xz | tar xvf -
Replace "X" with the version number of the latest kernel.
@@ -73,26 +72,26 @@
files. They should match the library, and not get messed up by
whatever the kernel-du-jour happens to be.
- - You can also upgrade between 4.x releases by patching. Patches are
+ - You can also upgrade between 5.x releases by patching. Patches are
distributed in the xz format. To install by patching, get all the
newer patch files, enter the top level directory of the kernel source
- (linux-4.X) and execute::
+ (linux-5.x) and execute::
- xz -cd ../patch-4.x.xz | patch -p1
+ xz -cd ../patch-5.x.xz | patch -p1
- Replace "x" for all versions bigger than the version "X" of your current
+ Replace "x" for all versions bigger than the version "x" of your current
source tree, **in_order**, and you should be ok. You may want to remove
the backup files (some-file-name~ or some-file-name.orig), and make sure
that there are no failed patches (some-file-name# or some-file-name.rej).
If there are, either you or I have made a mistake.
- Unlike patches for the 4.x kernels, patches for the 4.x.y kernels
+ Unlike patches for the 5.x kernels, patches for the 5.x.y kernels
(also known as the -stable kernels) are not incremental but instead apply
- directly to the base 4.x kernel. For example, if your base kernel is 4.0
- and you want to apply the 4.0.3 patch, you must not first apply the 4.0.1
- and 4.0.2 patches. Similarly, if you are running kernel version 4.0.2 and
- want to jump to 4.0.3, you must first reverse the 4.0.2 patch (that is,
- patch -R) **before** applying the 4.0.3 patch. You can read more on this in
+ directly to the base 5.x kernel. For example, if your base kernel is 5.0
+ and you want to apply the 5.0.3 patch, you must not first apply the 5.0.1
+ and 5.0.2 patches. Similarly, if you are running kernel version 5.0.2 and
+ want to jump to 5.0.3, you must first reverse the 5.0.2 patch (that is,
+ patch -R) **before** applying the 5.0.3 patch. You can read more on this in
:ref:`Documentation/process/applying-patches.rst <applying_patches>`.
Alternatively, the script patch-kernel can be used to automate this
@@ -115,7 +114,7 @@
Software requirements
---------------------
- Compiling and running the 4.x kernels requires up-to-date
+ Compiling and running the 5.x kernels requires up-to-date
versions of various software packages. Consult
:ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers
required and how to get updates for these packages. Beware that using
@@ -133,12 +132,12 @@
place for the output files (including .config).
Example::
- kernel source code: /usr/src/linux-4.X
+ kernel source code: /usr/src/linux-5.x
build directory: /home/name/build/kernel
To configure and build the kernel, use::
- cd /usr/src/linux-4.X
+ cd /usr/src/linux-5.x
make O=/home/name/build/kernel menuconfig
make O=/home/name/build/kernel
sudo make O=/home/name/build/kernel modules_install install
@@ -228,7 +227,7 @@
"make tinyconfig" Configure the tiniest possible kernel.
You can find more information on using the Linux kernel config tools
- in Documentation/kbuild/kconfig.txt.
+ in Documentation/kbuild/kconfig.rst.
- NOTES on ``make config``:
@@ -252,7 +251,7 @@
Compiling the kernel
--------------------
- - Make sure you have at least gcc 3.2 available.
+ - Make sure you have at least gcc 4.6 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
Please note that you can still run a.out user programs with this kernel.
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst
new file mode 100644
index 0000000..a4b99af
--- /dev/null
+++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Collaborative Processor Performance Control (CPPC)
+==================================================
+
+CPPC
+====
+
+CPPC, defined in the ACPI spec, describes a mechanism for the OS to manage the
+performance of a logical processor on a contiguous and abstract performance
+scale. CPPC exposes a set of registers to describe the abstract performance
+scale, to request performance levels and to measure per-cpu delivered performance.
+
+For more details on CPPC please refer to the ACPI specification at:
+
+http://uefi.org/specifications
+
+Some of the CPPC registers are exposed via sysfs under::
+
+ /sys/devices/system/cpu/cpuX/acpi_cppc/
+
+for each cpu X::
+
+ $ ls -lR /sys/devices/system/cpu/cpu0/acpi_cppc/
+ /sys/devices/system/cpu/cpu0/acpi_cppc/:
+ total 0
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 feedback_ctrs
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 highest_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_freq
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_nonlinear_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 lowest_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_freq
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 nominal_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 reference_perf
+ -r--r--r-- 1 root root 65536 Mar 5 19:38 wraparound_time
+
+* highest_perf : Highest performance of this processor (abstract scale).
+* nominal_perf : Highest sustained performance of this processor
+ (abstract scale).
+* lowest_nonlinear_perf : Lowest performance of this processor with nonlinear
+ power savings (abstract scale).
+* lowest_perf : Lowest performance of this processor (abstract scale).
+
+* lowest_freq : CPU frequency corresponding to lowest_perf (in MHz).
+* nominal_freq : CPU frequency corresponding to nominal_perf (in MHz).
+ The above frequencies should only be used to report processor performance in
+ frequency instead of abstract scale. These values should not be used for any
+ functional decisions.
+
+* feedback_ctrs : Includes both reference and delivered performance counters.
+ Reference counter ticks up proportional to processor's reference performance.
+ Delivered counter ticks up proportional to processor's delivered performance.
+* wraparound_time: Minimum time for the feedback counters to wrap around
+ (seconds).
+* reference_perf : Performance level at which reference performance counter
+ accumulates (abstract scale).
+
+
+Computing Average Delivered Performance
+=======================================
+
+The steps below compute the average delivered performance by taking two
+snapshots of the feedback counters, at times T1 and T2.
+
+ T1: Read feedback_ctrs as fbc_t1
+ Wait or run some workload
+
+ T2: Read feedback_ctrs as fbc_t2
+
+::
+
+ delivered_counter_delta = fbc_t2[del] - fbc_t1[del]
+ reference_counter_delta = fbc_t2[ref] - fbc_t1[ref]
+
+ delivered_perf = (reference_perf x delivered_counter_delta) / reference_counter_delta
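Note on the CPPC "Computing Average Delivered Performance" section above: the formula can be sketched in a few lines of Python. The function names are hypothetical, and the snapshot text format `ref:<n> del:<n>` is an assumption about what `feedback_ctrs` returns, so check it against the actual sysfs output. The sketch also assumes both snapshots are taken within `wraparound_time`, so the counters have not wrapped.

```python
import re
from typing import Dict


def parse_feedback_ctrs(text: str) -> Dict[str, int]:
    """Parse one snapshot, assumed to look like 'ref:1000 del:900'."""
    fields = dict(re.findall(r"(\w+):(\d+)", text))
    return {"ref": int(fields["ref"]), "del": int(fields["del"])}


def delivered_perf(reference_perf: int,
                   fbc_t1: Dict[str, int],
                   fbc_t2: Dict[str, int]) -> float:
    """Average delivered performance between two snapshots (abstract scale)."""
    delivered_delta = fbc_t2["del"] - fbc_t1["del"]
    reference_delta = fbc_t2["ref"] - fbc_t1["ref"]
    if reference_delta <= 0:
        # No progress or a counter wrap: the ratio would be meaningless.
        raise ValueError("reference counter did not advance")
    return reference_perf * delivered_delta / reference_delta


t1 = parse_feedback_ctrs("ref:1000 del:900")
t2 = parse_feedback_ctrs("ref:3000 del:2500")
print(delivered_perf(100, t1, t2))  # 100 * 1600 / 2000 = 80.0
```

The worked numbers are made up: with a reference_perf of 100, a delivered delta of 1600 over a reference delta of 2000 gives an average delivered performance of 80 on the abstract scale.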
diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst
new file mode 100644
index 0000000..50bd7f1
--- /dev/null
+++ b/Documentation/admin-guide/acpi/dsdt-override.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Overriding DSDT
+===============
+
+Linux supports a method of overriding the BIOS DSDT:
+
+CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
+
+When to use this method is described in detail on the
+Linux/ACPI home page:
+https://01.org/linux-acpi/documentation/overriding-dsdt
diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
new file mode 100644
index 0000000..4d13eee
--- /dev/null
+++ b/Documentation/admin-guide/acpi/index.rst
@@ -0,0 +1,14 @@
+============
+ACPI Support
+============
+
+Here we document in detail how to interact with various mechanisms in
+the Linux ACPI support.
+
+.. toctree::
+ :maxdepth: 1
+
+ initrd_table_override
+ dsdt-override
+ ssdt-overlays
+ cppc_sysfs
diff --git a/Documentation/admin-guide/acpi/initrd_table_override.rst b/Documentation/admin-guide/acpi/initrd_table_override.rst
new file mode 100644
index 0000000..cbd7682
--- /dev/null
+++ b/Documentation/admin-guide/acpi/initrd_table_override.rst
@@ -0,0 +1,115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Upgrading ACPI tables via initrd
+================================
+
+What is this about
+==================
+
+If the ACPI_TABLE_UPGRADE compile option is true, it is possible to
+upgrade the ACPI execution environment that is defined by the ACPI tables
+by upgrading the ACPI tables provided by the BIOS with an instrumented,
+modified, more recent version, or by installing brand new ACPI tables.
+
+When building initrd with kernel in a single image, option
+ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD should also be true for this
+feature to work.
+
+For a full list of ACPI tables that can be upgraded/installed, take a look
+at the char `*table_sigs[MAX_ACPI_SIGNATURE];` definition in
+drivers/acpi/tables.c.
+
+All ACPI tables that iasl (Intel's ACPI compiler and disassembler) knows about
+should be overridable, except:
+
+ - ACPI_SIG_RSDP (has a signature of 6 bytes)
+ - ACPI_SIG_FACS (does not have an ordinary ACPI table header)
+
+Both could get implemented as well.
+
+
+What is this for
+================
+
+Complain to your platform/BIOS vendor if you find a bug which is so severe
+that a workaround is not accepted in the Linux kernel. In the meantime, this
+facility allows you to upgrade the buggy tables before your platform/BIOS
+vendor releases an upgraded BIOS binary.
+
+This facility can be used by platform/BIOS vendors to provide a Linux
+compatible environment without modifying the underlying platform firmware.
+
+This facility also provides a powerful feature to easily debug and test
+ACPI BIOS table compatibility with the Linux kernel by modifying old
+platform provided ACPI tables or inserting new ACPI tables.
+
+It can and should be enabled in any kernel because there is no functional
+change with non-instrumented initrds.
+
+
+How does it work
+================
+::
+
+ # Extract the machine's ACPI tables:
+ cd /tmp
+ acpidump >acpidump
+ acpixtract -a acpidump
+ # Disassemble, modify and recompile them:
+ iasl -d *.dat
+ # For example add this statement into a _PRT (PCI Routing Table) function
+ # of the DSDT:
+ Store("HELLO WORLD", debug)
+ # And increase the OEM Revision. For example, before modification:
+ DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000000)
+ # After modification:
+ DefinitionBlock ("DSDT.aml", "DSDT", 2, "INTEL ", "TEMPLATE", 0x00000001)
+ iasl -sa dsdt.dsl
+ # Add the raw ACPI tables to an uncompressed cpio archive.
+ # They must be put into a /kernel/firmware/acpi directory inside the cpio
+ # archive. Note that if the table put here matches a platform table
+ # (similar Table Signature, and similar OEMID, and similar OEM Table ID)
+ # with a more recent OEM Revision, the platform table will be upgraded by
+ # this table. If the table put here doesn't match a platform table
+ # (dissimilar Table Signature, or dissimilar OEMID, or dissimilar OEM Table
+ # ID), this table will be appended.
+ mkdir -p kernel/firmware/acpi
+ cp dsdt.aml kernel/firmware/acpi
+ # A maximum of "NR_ACPI_INITRD_TABLES (64)" tables are currently allowed
+ # (see osl.c):
+ iasl -sa facp.dsl
+ iasl -sa ssdt1.dsl
+ cp facp.aml kernel/firmware/acpi
+ cp ssdt1.aml kernel/firmware/acpi
+ # The uncompressed cpio archive must be the first. Other, typically
+ # compressed cpio archives, must be concatenated on top of the uncompressed
+ # one. Following command creates the uncompressed cpio archive and
+ # concatenates the original initrd on top:
+ find kernel | cpio -H newc --create > /boot/instrumented_initrd
+ cat /boot/initrd >>/boot/instrumented_initrd
+ # reboot with increased acpi debug level, e.g. boot params:
+ acpi.debug_level=0x2 acpi.debug_layer=0xFFFFFFFF
+ # and check your syslog:
+ [ 1.268089] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
+ [ 1.272091] [ACPI Debug] String [0x0B] "HELLO WORLD"
+
+iasl is able to disassemble and recompile many different ACPI tables,
+including static ones.
+
+
+Where to retrieve userspace tools
+=================================
+
+iasl and acpixtract are part of Intel's ACPICA project:
+http://acpica.org/
+
+and should be packaged by distributions (for example in the acpica package
+on SUSE).
+
+acpidump can be found in Len Brown's pmtools:
+ftp://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools/acpidump
+
+This tool is also part of the acpica package on SUSE.
+Alternatively, the ACPI tables in use can be retrieved via sysfs in recent kernels:
+/sys/firmware/acpi/tables
diff --git a/Documentation/admin-guide/acpi/ssdt-overlays.rst b/Documentation/admin-guide/acpi/ssdt-overlays.rst
new file mode 100644
index 0000000..da37455
--- /dev/null
+++ b/Documentation/admin-guide/acpi/ssdt-overlays.rst
@@ -0,0 +1,180 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+SSDT Overlays
+=============
+
+In order to support ACPI open-ended hardware configurations (e.g. development
+boards) we need a way to augment the ACPI configuration provided by the firmware
+image. A common example is connecting sensors on I2C / SPI buses on development
+boards.
+
+Although this can be accomplished by creating a kernel platform driver or
+recompiling the firmware image with updated ACPI tables, neither is practical:
+the former proliferates board specific kernel code while the latter requires
+access to firmware tools which are often not publicly available.
+
+Because ACPI supports external references in AML code, a more practical
+way to augment firmware ACPI configuration is by dynamically loading
+user defined SSDT tables that contain the board specific information.
+
+For example, to enumerate a Bosch BMA222E accelerometer on the I2C bus of the
+Minnowboard MAX development board exposed via the LSE connector [1], the
+following ASL code can be used::
+
+ DefinitionBlock ("minnowmax.aml", "SSDT", 1, "Vendor", "Accel", 0x00000003)
+ {
+ External (\_SB.I2C6, DeviceObj)
+
+ Scope (\_SB.I2C6)
+ {
+ Device (STAC)
+ {
+ Name (_ADR, Zero)
+ Name (_HID, "BMA222E")
+
+ Method (_CRS, 0, Serialized)
+ {
+ Name (RBUF, ResourceTemplate ()
+ {
+ I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
+ AddressingMode7Bit, "\\_SB.I2C6", 0x00,
+ ResourceConsumer, ,)
+ GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
+ "\\_SB.GPO2", 0x00, ResourceConsumer, , )
+ { // Pin list
+ 0
+ }
+ })
+ Return (RBUF)
+ }
+ }
+ }
+ }
+
+which can then be compiled to AML binary format::
+
+ $ iasl minnowmax.asl
+
+ Intel ACPI Component Architecture
+ ASL Optimizing Compiler version 20140214-64 [Mar 29 2014]
+ Copyright (c) 2000 - 2014 Intel Corporation
+
+ ASL Input: minnowmax.asl - 30 lines, 614 bytes, 7 keywords
+ AML Output: minnowmax.aml - 165 bytes, 6 named objects, 1 executable opcodes
+
+[1] http://wiki.minnowboard.org/MinnowBoard_MAX#Low_Speed_Expansion_Connector_.28Top.29
+
+The resulting AML code can then be loaded by the kernel using one of the methods
+below.
+
+Loading ACPI SSDTs from initrd
+==============================
+
+This option allows loading of user defined SSDTs from initrd and it is useful
+when the system does not support EFI or when there is not enough EFI storage.
+
+It works in a similar way to the initrd based ACPI table override/upgrade: SSDT
+AML code must be placed in the first, uncompressed, initrd under the
+"kernel/firmware/acpi" path. Multiple files can be used and this will result
+in loading multiple tables. Only SSDT and OEM tables are allowed. See
+initrd_table_override.rst for more details.
+
+Here is an example::
+
+ # Add the raw ACPI tables to an uncompressed cpio archive.
+ # They must be put into a /kernel/firmware/acpi directory inside the
+ # cpio archive.
+ # The uncompressed cpio archive must be the first.
+ # Other, typically compressed cpio archives, must be
+ # concatenated on top of the uncompressed one.
+ mkdir -p kernel/firmware/acpi
+ cp ssdt.aml kernel/firmware/acpi
+
+ # Create the uncompressed cpio archive and concatenate the original initrd
+ # on top:
+ find kernel | cpio -H newc --create > /boot/instrumented_initrd
+ cat /boot/initrd >>/boot/instrumented_initrd
+
+Loading ACPI SSDTs from EFI variables
+=====================================
+
+This is the preferred method, when EFI is supported on the platform, because it
+allows a persistent, OS independent way of storing the user defined SSDTs. There
+is also work underway to implement EFI support for loading user defined SSDTs
+and using this method will make it easier to convert to the EFI loading
+mechanism when it arrives.
+
+In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
+parameter can be used. The argument for the option is the variable name to
+use. If there are multiple variables with the same name but with different
+vendor GUIDs, all of them will be loaded.
+
+In order to store the AML code in an EFI variable the efivarfs filesystem can be
+used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
+recent distributions.
+
+Creating a new file in /sys/firmware/efi/efivars will automatically create a new
+EFI variable. Updating a file in /sys/firmware/efi/efivars will update the EFI
+variable. Please note that the file name needs to be specially formatted as
+"Name-GUID" and that the first 4 bytes in the file (little-endian format)
+represent the attributes of the EFI variable (see EFI_VARIABLE_MASK in
+include/linux/efi.h). Writing to the file must also be done with one write
+operation.
+
+For example, you can use the following shell script to create/update an EFI
+variable with the content from a given file::
+
+ #!/bin/sh -e
+
+ while ! [ -z "$1" ]; do
+ case "$1" in
+ "-f") filename="$2"; shift;;
+ "-g") guid="$2"; shift;;
+ *) name="$1";;
+ esac
+ shift
+ done
+
+ usage()
+ {
+ echo "Syntax: ${0##*/} -f filename [ -g guid ] name"
+ exit 1
+ }
+
+ [ -n "$name" -a -f "$filename" ] || usage
+
+ EFIVARFS="/sys/firmware/efi/efivars"
+
+ [ -d "$EFIVARFS" ] || exit 2
+
+ if stat -tf $EFIVARFS | grep -q -v de5e81e4; then
+ mount -t efivarfs none $EFIVARFS
+ fi
+
+ # try to pick up an existing GUID
+ [ -n "$guid" ] || guid=$(find "$EFIVARFS" -name "$name-*" | head -n1 | cut -f2- -d-)
+
+ # use a randomly generated GUID
+ [ -n "$guid" ] || guid="$(cat /proc/sys/kernel/random/uuid)"
+
+ # efivarfs expects all of the data in one write
+ tmp=$(mktemp)
+ /bin/echo -ne "\007\000\000\000" | cat - $filename > $tmp
+ dd if=$tmp of="$EFIVARFS/$name-$guid" bs=$(stat -c %s $tmp)
+ rm $tmp
+
+Loading ACPI SSDTs from configfs
+================================
+
+This option allows loading of user defined SSDTs from userspace via the configfs
+interface. The CONFIG_ACPI_CONFIGFS option must be selected and configfs must be
+mounted. In the following examples, we assume that configfs has been mounted in
+/config.
+
+New tables can be loaded by creating new directories in /config/acpi/table/ and
+writing the SSDT AML code to the aml attribute::
+
+ cd /config/acpi/table
+ mkdir my_ssdt
+ cat ~/ssdt.aml > my_ssdt/aml
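Note on the "Loading ACPI SSDTs from EFI variables" section of ssdt-overlays.rst above: the two constraints it describes are a 4-byte little-endian attribute prefix and a single write operation. The sketch below shows the same payload layout that the shell script produces with its `\007\000\000\000` prefix. The function names are hypothetical. The three attribute constants use the standard UEFI names and values, and 0x7 is their combination. The helper does not mount efivarfs or pick a GUID.

```python
import os
import struct

# Standard UEFI variable attribute bits; their OR is the 0x7 used above.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4
DEFAULT_ATTRS = (EFI_VARIABLE_NON_VOLATILE
                 | EFI_VARIABLE_BOOTSERVICE_ACCESS
                 | EFI_VARIABLE_RUNTIME_ACCESS)


def efivar_payload(aml: bytes, attributes: int = DEFAULT_ATTRS) -> bytes:
    """Prefix the AML bytes with the 4-byte little-endian attribute word."""
    return struct.pack("<I", attributes) + aml


def write_efivar(efivarfs_dir: str, name: str, guid: str, aml: bytes) -> None:
    """Create or update '<name>-<guid>' with exactly one write() call."""
    path = os.path.join(efivarfs_dir, f"{name}-{guid}")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, efivar_payload(aml))
    finally:
        os.close(fd)


print(efivar_payload(b"AML").hex())
```

Using `os.write` on a raw file descriptor, rather than a buffered file object, keeps the attribute word and the AML data in one write, which is what the text says efivarfs expects.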
diff --git a/Documentation/admin-guide/aoe/aoe.rst b/Documentation/admin-guide/aoe/aoe.rst
new file mode 100644
index 0000000..a05e751
--- /dev/null
+++ b/Documentation/admin-guide/aoe/aoe.rst
@@ -0,0 +1,150 @@
+Introduction
+============
+
+ATA over Ethernet is a network protocol that provides simple access to
+block storage on the LAN.
+
+ http://support.coraid.com/documents/AoEr11.txt
+
+The EtherDrive (R) HOWTO for 2.6 and 3.x kernels is found at ...
+
+ http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO.html
+
+It has many tips and hints! Please see, especially, recommended
+tunings for virtual memory:
+
+ http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19
+
+The aoetools are userland programs that are designed to work with this
+driver. The aoetools are on sourceforge.
+
+ http://aoetools.sourceforge.net/
+
+The scripts in this Documentation/admin-guide/aoe directory are intended to
+document the use of the driver and are not necessary if you install
+the aoetools.
+
+
+Creating Device Nodes
+=====================
+
+ Users of udev should find the block device nodes created
+ automatically, but to create all the necessary device nodes, use the
+ udev configuration rules provided in udev.txt (in this directory).
+
+ There is a udev-install.sh script that shows how to install these
+ rules on your system.
+
+ There is also an autoload script that shows how to edit
+ /etc/modprobe.d/aoe.conf to ensure that the aoe module is loaded when
+ necessary. Preloading the aoe module is preferable to autoloading,
+ however, because AoE discovery takes a few seconds. It can be
+ confusing when an AoE device is not present the first time a
+ command is run but appears a second later.
+
+Using Device Nodes
+==================
+
+ "cat /dev/etherd/err" blocks, waiting for error diagnostic output,
+ such as retransmitted packets.
+
+ "echo eth2 eth4 > /dev/etherd/interfaces" tells the aoe driver to
+ limit ATA over Ethernet traffic to eth2 and eth4. AoE traffic from
+ untrusted networks should be ignored as a matter of security. See
+ also the aoe_iflist driver option described below.
+
+ "echo > /dev/etherd/discover" tells the driver to find out what AoE
+ devices are available.
+
+ In the future these character devices may disappear and be replaced
+ by sysfs counterparts. Using the commands in aoetools insulates
+ users from these implementation details.
+
+ The block devices are named like this::
+
+ e{shelf}.{slot}
+ e{shelf}.{slot}p{part}
+
+ ... so that "e0.2" is the third blade from the left (slot 2) in the
+ first shelf (shelf address zero). That's the whole disk. The first
+ partition on that disk would be "e0.2p1".
+
+Using sysfs
+===========
+
+ Each aoe block device in /sys/block has the extra attributes of
+ state, mac, and netif. The state attribute is "up" when the device
+ is ready for I/O and "down" if detected but unusable. The
+ "down,closewait" state shows that the device is still open and
+ cannot come up again until it has been closed.
+
+ The mac attribute is the ethernet address of the remote AoE device.
+ The netif attribute is the network interface on the localhost
+ through which we are communicating with the remote AoE device.
+
+ There is a script in this directory that formats this information in
+ a convenient way. Users with aoetools should use the aoe-stat
+ command::
+
+ root@makki root# sh Documentation/admin-guide/aoe/status.sh
+ e10.0 eth3 up
+ e10.1 eth3 up
+ e10.2 eth3 up
+ e10.3 eth3 up
+ e10.4 eth3 up
+ e10.5 eth3 up
+ e10.6 eth3 up
+ e10.7 eth3 up
+ e10.8 eth3 up
+ e10.9 eth3 up
+ e4.0 eth1 up
+ e4.1 eth1 up
+ e4.2 eth1 up
+ e4.3 eth1 up
+ e4.4 eth1 up
+ e4.5 eth1 up
+ e4.6 eth1 up
+ e4.7 eth1 up
+ e4.8 eth1 up
+ e4.9 eth1 up
+
+ Use /sys/module/aoe/parameters/aoe_iflist (or better, the driver
+ option discussed below) instead of /dev/etherd/interfaces to limit
+ AoE traffic to the network interfaces in the given
+ whitespace-separated list. Unlike the old character device, the
+ sysfs entry can be read from as well as written to.
+
+ It's helpful to trigger discovery after setting the list of allowed
+ interfaces. The aoetools package provides an aoe-discover script
+ for this purpose. You can also directly use the
+ /dev/etherd/discover special file described above.
+
+Driver Options
+==============
+
+ There is a boot option for the built-in aoe driver and a
+ corresponding module parameter, aoe_iflist. Without this option,
+ all network interfaces may be used for ATA over Ethernet. Here is a
+ usage example for the module parameter::
+
+ modprobe aoe aoe_iflist="eth1 eth3"
+
+ The aoe_deadsecs module parameter determines the maximum number of
+ seconds that the driver will wait for an AoE device to provide a
+ response to an AoE command. After aoe_deadsecs seconds have
+ elapsed, the AoE device will be marked as "down". A value of zero
+ is supported for testing purposes and makes the aoe driver keep
+ trying AoE commands forever.
+
+ The aoe_maxout module parameter has a default of 128. This is the
+ maximum number of unresponded packets that will be sent to an AoE
+ target at one time.
+
+ The aoe_dyndevs module parameter defaults to 1, meaning that the
+ driver will assign a block device minor number to a discovered AoE
+ target based on the order of its discovery. With dynamic minor
+ device numbers in use, a greater range of AoE shelf and slot
+ addresses can be supported. Users with udev will never have to
+ think about minor numbers. Using aoe_dyndevs=0 allows device nodes
+ to be pre-created using a static minor-number scheme with the
+ aoe-mkshelf script in the aoetools.
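The e{shelf}.{slot}p{part} naming described under "Using Device Nodes" in aoe.rst can be taken apart with plain POSIX parameter expansion. The following is a minimal sketch for illustration; the function name aoe_parse_name is hypothetical and is not provided by the aoe driver or the aoetools.

```shell
#!/bin/sh
# aoe_parse_name NAME
# Split an AoE block device name of the form e{shelf}.{slot} or
# e{shelf}.{slot}p{part} into its components.
# Hypothetical helper for illustration; not shipped with aoetools.
aoe_parse_name() {
	name=$1
	# Loose sanity check only: "e", a digit, a dot, a digit.
	case $name in
	e[0-9]*.[0-9]*) ;;
	*) echo "not an AoE device name: $name" 1>&2; return 1 ;;
	esac
	rest=${name#e}          # drop the leading "e"
	shelf=${rest%%.*}       # text before the first "."
	rest=${rest#*.}         # text after the first "."
	case $rest in
	*p*) slot=${rest%%p*}; part=${rest#*p} ;;
	*)   slot=$rest; part= ;;
	esac
	echo "shelf=$shelf slot=$slot part=${part:-none}"
}

aoe_parse_name e0.2      # whole disk: shelf 0, slot 2
aoe_parse_name e0.2p1    # first partition on that disk
```

For the two names used in the text, this prints "shelf=0 slot=2 part=none" and "shelf=0 slot=2 part=1".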
diff --git a/Documentation/admin-guide/aoe/autoload.sh b/Documentation/admin-guide/aoe/autoload.sh
new file mode 100644
index 0000000..815dff4
--- /dev/null
+++ b/Documentation/admin-guide/aoe/autoload.sh
@@ -0,0 +1,17 @@
+#!/bin/sh
+# set aoe to autoload by installing the
+# aliases in /etc/modprobe.d/
+
+f=/etc/modprobe.d/aoe.conf
+
+if test ! -r $f || test ! -w $f; then
+ echo "cannot configure $f for module autoloading" 1>&2
+ exit 1
+fi
+
+grep major-152 $f >/dev/null
+if [ $? = 1 ]; then
+ echo alias block-major-152 aoe >> $f
+ echo alias char-major-152 aoe >> $f
+fi
+
diff --git a/Documentation/admin-guide/aoe/examples.rst b/Documentation/admin-guide/aoe/examples.rst
new file mode 100644
index 0000000..91f3198
--- /dev/null
+++ b/Documentation/admin-guide/aoe/examples.rst
@@ -0,0 +1,23 @@
+Example of udev rules
+---------------------
+
+ .. include:: udev.txt
+ :literal:
+
+Example of udev install rules script
+------------------------------------
+
+ .. literalinclude:: udev-install.sh
+ :language: shell
+
+Example script to get status
+----------------------------
+
+ .. literalinclude:: status.sh
+ :language: shell
+
+Example of AoE autoload script
+------------------------------
+
+ .. literalinclude:: autoload.sh
+ :language: shell
diff --git a/Documentation/admin-guide/aoe/index.rst b/Documentation/admin-guide/aoe/index.rst
new file mode 100644
index 0000000..d71c5df
--- /dev/null
+++ b/Documentation/admin-guide/aoe/index.rst
@@ -0,0 +1,17 @@
+=======================
+ATA over Ethernet (AoE)
+=======================
+
+.. toctree::
+ :maxdepth: 1
+
+ aoe
+ todo
+ examples
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/aoe/status.sh b/Documentation/admin-guide/aoe/status.sh
new file mode 100644
index 0000000..eeec7ba
--- /dev/null
+++ b/Documentation/admin-guide/aoe/status.sh
@@ -0,0 +1,30 @@
+#! /bin/sh
+# collate and present sysfs information about AoE storage
+#
+# A more complete version of this script is aoe-stat, in the
+# aoetools.
+
+set -e
+format="%8s\t%8s\t%8s\n"
+me=`basename $0`
+sysd=${sysfs_dir:-/sys}
+
+# printf "$format" device mac netif state
+
+# Suse 9.1 Pro doesn't put /sys in /etc/mtab
+#test -z "`mount | grep sysfs`" && {
+test ! -d "$sysd/block" && {
+ echo "$me Error: sysfs is not mounted" 1>&2
+ exit 1
+}
+
+for d in `ls -d $sysd/block/etherd* 2>/dev/null | grep -v p` end; do
+ # maybe ls comes up empty, so we use "end"
+ test $d = end && continue
+
+ dev=`echo "$d" | sed 's/.*!//'`
+ printf "$format" \
+ "$dev" \
+ "`cat \"$d/netif\"`" \
+ "`cat \"$d/state\"`"
+done | sort
diff --git a/Documentation/admin-guide/aoe/todo.rst b/Documentation/admin-guide/aoe/todo.rst
new file mode 100644
index 0000000..dea8db5
--- /dev/null
+++ b/Documentation/admin-guide/aoe/todo.rst
@@ -0,0 +1,17 @@
+TODO
+====
+
+There is a potential for deadlock when allocating a struct sk_buff for
+data that needs to be written out to aoe storage. If the data is
+being written from a dirty page in order to free that page, and if
+there are no other pages available, then deadlock may occur when a
+free page is needed for the sk_buff allocation. This situation has
+not been observed, but it would be nice to eliminate any potential for
+deadlock under memory pressure.
+
+Because ATA over Ethernet is not fragmented by the kernel's IP code,
+the destructor member of the struct sk_buff is available to the aoe
+driver. By using a mempool for allocating all but the first few
+sk_buffs, and by registering a destructor, we should be able to
+efficiently allocate sk_buffs without introducing any potential for
+deadlock.
diff --git a/Documentation/admin-guide/aoe/udev-install.sh b/Documentation/admin-guide/aoe/udev-install.sh
new file mode 100644
index 0000000..15e86f5
--- /dev/null
+++ b/Documentation/admin-guide/aoe/udev-install.sh
@@ -0,0 +1,33 @@
+# install the aoe-specific udev rules from udev.txt into
+# the system's udev configuration
+#
+
+me="`basename $0`"
+
+# find udev.conf, often /etc/udev/udev.conf
+# (or environment can specify where to find udev.conf)
+#
+if test -z "$conf"; then
+ if test -r /etc/udev/udev.conf; then
+ conf=/etc/udev/udev.conf
+ else
+ conf="`find /etc -type f -name udev.conf 2> /dev/null`"
+ if test -z "$conf" || test ! -r "$conf"; then
+ echo "$me Error: no udev.conf found" 1>&2
+ exit 1
+ fi
+ fi
+fi
+
+# find the directory where udev rules are stored, often
+# /etc/udev/rules.d
+#
+rules_d="`sed -n '/^udev_rules=/{ s!udev_rules=!!; s!\"!!g; p; }' $conf`"
+if test -z "$rules_d" ; then
+ rules_d=/etc/udev/rules.d
+fi
+if test ! -d "$rules_d"; then
+ echo "$me Error: cannot find udev rules directory" 1>&2
+ exit 1
+fi
+sh -xc "cp `dirname $0`/udev.txt $rules_d/60-aoe.rules"
diff --git a/Documentation/admin-guide/aoe/udev.txt b/Documentation/admin-guide/aoe/udev.txt
new file mode 100644
index 0000000..5fb7564
--- /dev/null
+++ b/Documentation/admin-guide/aoe/udev.txt
@@ -0,0 +1,26 @@
+# These rules tell udev what device nodes to create for aoe support.
+# They may be installed along the following lines. Check the section
+# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
+# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
+#
+# ecashin@makki ~$ su
+# Password:
+# bash# find /etc -type f -name udev.conf
+# /etc/udev/udev.conf
+# bash# grep udev_rules= /etc/udev/udev.conf
+# udev_rules="/etc/udev/rules.d/"
+# bash# ls /etc/udev/rules.d/
+# 10-wacom.rules 50-udev.rules
+# bash# cp /path/to/linux/Documentation/admin-guide/aoe/udev.txt \
+# /etc/udev/rules.d/60-aoe.rules
+#
+
+# aoe char devices
+SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
+SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
+SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
+SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
+SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"
+
+# aoe block devices
+KERNEL=="etherd*", GROUP="disk"
diff --git a/Documentation/admin-guide/auxdisplay/cfag12864b.rst b/Documentation/admin-guide/auxdisplay/cfag12864b.rst
new file mode 100644
index 0000000..18c2865
--- /dev/null
+++ b/Documentation/admin-guide/auxdisplay/cfag12864b.rst
@@ -0,0 +1,98 @@
+===================================
+cfag12864b LCD Driver Documentation
+===================================
+
+:License: GPLv2
+:Author & Maintainer: Miguel Ojeda Sandonis
+:Date: 2006-10-27
+
+
+
+.. INDEX
+
+ 1. DRIVER INFORMATION
+ 2. DEVICE INFORMATION
+ 3. WIRING
+ 4. USERSPACE PROGRAMMING
+
+1. Driver Information
+---------------------
+
+This driver supports a cfag12864b LCD.
+
+
+2. Device Information
+---------------------
+
+:Manufacturer: Crystalfontz
+:Device Name: Crystalfontz 12864b LCD Series
+:Device Code: cfag12864b
+:Webpage: http://www.crystalfontz.com
+:Device Webpage: http://www.crystalfontz.com/products/12864b/
+:Type: LCD (Liquid Crystal Display)
+:Width: 128
+:Height: 64
+:Colors: 2 (B/W)
+:Controller: ks0108
+:Controllers: 2
+:Pages: 8 each controller
+:Addresses: 64 each page
+:Data size: 1 byte each address
+:Memory size: 2 * 8 * 64 * 1 = 1024 bytes = 1 Kbyte
+
+
+3. Wiring
+---------
+
+The cfag12864b LCD Series doesn't have official wiring.
+
+The common wiring is done to the parallel port as shown::
+
+ Parallel Port cfag12864b
+
+ Name Pin# Pin# Name
+
+ Strobe ( 1)------------------------------(17) Enable
+ Data 0 ( 2)------------------------------( 4) Data 0
+ Data 1 ( 3)------------------------------( 5) Data 1
+ Data 2 ( 4)------------------------------( 6) Data 2
+ Data 3 ( 5)------------------------------( 7) Data 3
+ Data 4 ( 6)------------------------------( 8) Data 4
+ Data 5 ( 7)------------------------------( 9) Data 5
+ Data 6 ( 8)------------------------------(10) Data 6
+ Data 7 ( 9)------------------------------(11) Data 7
+ (10) [+5v]---( 1) Vdd
+ (11) [GND]---( 2) Ground
+ (12) [+5v]---(14) Reset
+ (13) [GND]---(15) Read / Write
+ Line (14)------------------------------(13) Controller Select 1
+ (15)
+ Init (16)------------------------------(12) Controller Select 2
+ Select (17)------------------------------(16) Data / Instruction
+ Ground (18)---[GND] [+5v]---(19) LED +
+ Ground (19)---[GND]
+ Ground (20)---[GND] E A Values:
+ Ground (21)---[GND] [GND]---[P1]---(18) Vee - R = Resistor = 22 ohm
+ Ground (22)---[GND] | - P1 = Preset = 10 Kohm
+ Ground (23)---[GND] ---- S ------( 3) V0 - P2 = Preset = 1 Kohm
+ Ground (24)---[GND] | |
+ Ground (25)---[GND] [GND]---[P2]---[R]---(20) LED -
+
+
+4. Userspace Programming
+------------------------
+
+The cfag12864bfb describes a framebuffer device (/dev/fbX).
+
+It has a size of 1024 bytes = 1 Kbyte.
+Each bit represents one pixel. If the bit is high, the pixel will
+turn on. If the bit is low, the pixel will turn off.
+
+You can use the framebuffer as a file (fopen, fwrite, fclose...),
+although the LCD won't get updated until the next refresh time arrives.
+
+Also, you can mmap the framebuffer: open & mmap, munmap & close...
+which is the best option for most uses.
+
+Check samples/auxdisplay/cfag12864b-example.c
+for a complete, working userspace program with usage examples.
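The memory-size arithmetic in the Device Information section of cfag12864b.rst can be cross-checked in shell. This is only a sketch of the arithmetic, with illustrative variable names: the controller view (2 controllers, 8 pages, 64 addresses, 1 byte each) and the pixel view (128 x 64 pixels at one bit per pixel) should both give the 1024 bytes quoted for the framebuffer.

```shell
#!/bin/sh
# Geometry taken from the field list in cfag12864b.rst.
controllers=2
pages=8
addresses=64
bytes_per_address=1
width=128
height=64

# Size as seen from the two ks0108 controllers.
by_controller=$((controllers * pages * addresses * bytes_per_address))
# Size as seen from the framebuffer: one bit per pixel, 8 pixels per byte.
by_pixels=$((width * height / 8))

echo "controller view: $by_controller bytes"
echo "pixel view:      $by_pixels bytes"
```

A buffer of this size is what a userspace program writes to, or maps from, the /dev/fbX node.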
diff --git a/Documentation/admin-guide/auxdisplay/index.rst b/Documentation/admin-guide/auxdisplay/index.rst
new file mode 100644
index 0000000..e466f05
--- /dev/null
+++ b/Documentation/admin-guide/auxdisplay/index.rst
@@ -0,0 +1,16 @@
+=========================
+Auxiliary Display Support
+=========================
+
+.. toctree::
+ :maxdepth: 1
+
+ ks0108.rst
+ cfag12864b.rst
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/auxdisplay/ks0108.rst b/Documentation/admin-guide/auxdisplay/ks0108.rst
new file mode 100644
index 0000000..c0b7faf
--- /dev/null
+++ b/Documentation/admin-guide/auxdisplay/ks0108.rst
@@ -0,0 +1,50 @@
+==========================================
+ks0108 LCD Controller Driver Documentation
+==========================================
+
+:License: GPLv2
+:Author & Maintainer: Miguel Ojeda Sandonis
+:Date: 2006-10-27
+
+
+
+.. INDEX
+
+ 1. DRIVER INFORMATION
+ 2. DEVICE INFORMATION
+ 3. WIRING
+
+
+1. Driver Information
+---------------------
+
+This driver supports the ks0108 LCD controller.
+
+
+2. Device Information
+---------------------
+
+:Manufacturer: Samsung
+:Device Name: KS0108 LCD Controller
+:Device Code: ks0108
+:Webpage: -
+:Device Webpage: -
+:Type: LCD Controller (Liquid Crystal Display Controller)
+:Width: 64
+:Height: 64
+:Colors: 2 (B/W)
+:Pages: 8
+:Addresses: 64 each page
+:Data size: 1 byte each address
+:Memory size: 8 * 64 * 1 = 512 bytes
+
+
+3. Wiring
+---------
+
+The driver supports data parallel port wiring.
+
+If you aren't building LCD related hardware, you should check
+your LCD specific wiring information in the same folder.
+
+For example, check Documentation/admin-guide/auxdisplay/cfag12864b.rst
diff --git a/Documentation/admin-guide/binderfs.rst b/Documentation/admin-guide/binderfs.rst
new file mode 100644
index 0000000..c009671
--- /dev/null
+++ b/Documentation/admin-guide/binderfs.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+The Android binderfs Filesystem
+===============================
+
+Android binderfs is a filesystem for the Android binder IPC mechanism. It
+allows binder devices to be added and removed dynamically at runtime. Binder
+devices located in a new binderfs instance are independent of binder devices
+located in other binderfs instances. Mounting a new binderfs instance makes it
+possible to get a set of private binder devices.
+
+Mounting binderfs
+-----------------
+
+Android binderfs can be mounted with::
+
+ mkdir /dev/binderfs
+ mount -t binder binder /dev/binderfs
+
+at which point a new instance of binderfs will show up at ``/dev/binderfs``.
+In a fresh instance of binderfs no binder devices will be present. There will
+only be a ``binder-control`` device which serves as the request handler for
+binderfs. Mounting another binderfs instance at a different location will
+create a new and separate instance from all other binderfs mounts. This is
+identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android
+binderfs filesystem can be mounted in user namespaces.
+
+Options
+-------
+max
+ binderfs instances can be mounted with a limit on the number of binder
+ devices that can be allocated. The ``max=<count>`` mount option serves as
+ a per-instance limit. If ``max=<count>`` is set then only ``<count>`` number
+ of binder devices can be allocated in this binderfs instance.
+
+Allocating binder Devices
+-------------------------
+
+.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html
+
+To allocate a new binder device in a binderfs instance a request needs to be
+sent through the ``binder-control`` device node. A request is sent in the form
+of an `ioctl() <ioctl_>`_.
+
+What a program needs to do is to open the ``binder-control`` device node and
+send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to
+tell the kernel which name the new binder device should get. By default a name
+can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating
+zero byte.
+
+Once the request is made via an `ioctl() <ioctl_>`_ passing a ``struct
+binderfs_device`` with the name to the kernel, it will allocate a new binder
+device and return the major and minor number of the new device in the struct.
+(This is necessary because binderfs allocates a major device number
+dynamically.) After the `ioctl() <ioctl_>`_ returns, there will be a new
+binder device located under /dev/binderfs with the chosen name.
+
+Deleting binder Devices
+-----------------------
+
+.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html
+.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html
+
+Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
+that the `rm <rm_>`_ tool can be used to delete them. Note that the
+``binder-control`` device cannot be deleted since this would make the binderfs
+instance unusable. The ``binder-control`` device will be deleted when the
+binderfs instance is unmounted and all references to it have been dropped.
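As a sketch of the deletion rules described in binderfs.rst, the shell function below removes a binder device with rm but refuses to touch binder-control. The function name binderfs_rm and its arguments are hypothetical, and the mount point is passed in explicitly instead of being assumed. Allocating devices is not shown because it requires the BINDER_CTL_ADD ioctl, which cannot be issued from plain shell.

```shell
#!/bin/sh
# binderfs_rm MOUNTPOINT NAME
# Delete the binder device NAME from the binderfs instance mounted at
# MOUNTPOINT. Hypothetical helper for illustration only.
binderfs_rm() {
	mnt=$1
	dev=$2
	if [ "$dev" = "binder-control" ]; then
		# binder-control cannot be deleted; fail early with a
		# clear message instead of relying on the rm error.
		echo "refusing to delete binder-control" 1>&2
		return 1
	fi
	if [ ! -e "$mnt/$dev" ]; then
		echo "no such binder device: $mnt/$dev" 1>&2
		return 1
	fi
	rm -- "$mnt/$dev"
}

# Example (assumes binderfs is mounted at /dev/binderfs and a device
# named my-binder was allocated earlier):
#   binderfs_rm /dev/binderfs my-binder
```

Because deletion is an ordinary unlink, nothing binderfs-specific is needed beyond the guard for the control device.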
diff --git a/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
new file mode 100644
index 0000000..f87cfa0
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/DRBD-8.3-data-packets.svg
@@ -0,0 +1,588 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+<svg
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ version="1.0"
+ width="210mm"
+ height="297mm"
+ viewBox="0 0 21000 29700"
+ id="svg2"
+ style="fill-rule:evenodd">
+ <defs
+ id="defs4" />
+ <g
+ id="Default"
+ style="visibility:visible">
+ <desc
+ id="desc180">Master slide</desc>
+ </g>
+ <path
+ d="M 11999,8601 L 11899,8301 L 12099,8301 L 11999,8601 z"
+ id="path193"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 11999,7801 L 11999,8361"
+ id="path197"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 7999,10401 L 7899,10101 L 8099,10101 L 7999,10401 z"
+ id="path209"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 7999,9601 L 7999,10161"
+ id="path213"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 11999,7801 L 11685,7840 L 11724,7644 L 11999,7801 z"
+ id="path225"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 7999,7001 L 11764,7754"
+ id="path229"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <g
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-1244.4792,1416.5139)"
+ id="g245"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text247">
+ <tspan
+ x="9139 9368 9579 9808 9986 10075 10252 10481 10659 10837 10909"
+ y="9284"
+ id="tspan249">RSDataReply</tspan>
+ </text>
+ </g>
+ <path
+ d="M 7999,9601 L 8281,9458 L 8311,9655 L 7999,9601 z"
+ id="path259"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 11999,9001 L 8236,9565"
+ id="path263"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <g
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,1620.9382,-1639.4947)"
+ id="g279"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text281">
+ <tspan
+ x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
+ y="7023"
+ id="tspan283">CsumRSRequest</tspan>
+ </text>
+ </g>
+ <text
+ id="text297"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+ y="5707"
+ id="tspan299">w_make_resync_request()</tspan>
+ </text>
+ <text
+ id="text313"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+ y="7806"
+ id="tspan315">receive_DataRequest()</tspan>
+ </text>
+ <text
+ id="text329"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+ y="8606"
+ id="tspan331">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text345"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
+ y="9007"
+ id="tspan347">w_e_end_csum_rs_req()</tspan>
+ </text>
+ <text
+ id="text361"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
+ y="9507"
+ id="tspan363">receive_RSDataReply()</tspan>
+ </text>
+ <text
+ id="text377"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
+ y="10407"
+ id="tspan379">drbd_endio_write_sec()</tspan>
+ </text>
+ <text
+ id="text393"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
+ y="10907"
+ id="tspan395">e_end_resync_block()</tspan>
+ </text>
+ <path
+ d="M 11999,11601 L 11685,11640 L 11724,11444 L 11999,11601 z"
+ id="path405"
+ style="fill:#000080;visibility:visible" />
+ <path
+ d="M 7999,10801 L 11764,11554"
+ id="path409"
+ style="fill:none;stroke:#000080;visibility:visible" />
+ <g
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,2434.7562,-1674.649)"
+ id="g425"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text427">
+ <tspan
+ x="9320 9621 9726 9798 9887 10065 10277 10438"
+ y="10943"
+ id="tspan429">WriteAck</tspan>
+ </text>
+ </g>
+ <text
+ id="text443"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
+ y="11559"
+ id="tspan445">got_BlockAck()</tspan>
+ </text>
+ <text
+ id="text459"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14302 14540 14658 14777 14870 15107 15225 15437 15649 15886"
+ y="4877"
+ id="tspan461">Checksum based Resync, case not in sync</tspan>
+ </text>
+ <text
+ id="text475"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="6961 7266 7571 7854 8159 8299 8536 8654 8891 9010 9247 9484 9603 9840 9958 10077 10170 10407"
+ y="2806"
+ id="tspan477">DRBD-8.3 data flow</tspan>
+ </text>
+ <text
+ id="text491"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
+ y="7005"
+ id="tspan493">w_e_send_csum()</tspan>
+ </text>
+ <path
+ d="M 11999,17601 L 11899,17301 L 12099,17301 L 11999,17601 z"
+ id="path503"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 11999,16801 L 11999,17361"
+ id="path507"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 11999,16801 L 11685,16840 L 11724,16644 L 11999,16801 z"
+ id="path519"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 7999,16001 L 11764,16754"
+ id="path523"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <g
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-2539.5806,1529.3491)"
+ id="g539"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text541">
+ <tspan
+ x="9269 9498 9709 9798 9959 10048 10226 10437 10598 10776"
+ y="18265"
+ id="tspan543">RSIsInSync</tspan>
+ </text>
+ </g>
+ <path
+ d="M 7999,18601 L 8281,18458 L 8311,18655 L 7999,18601 z"
+ id="path553"
+ style="fill:#000080;visibility:visible" />
+ <path
+ d="M 11999,18001 L 8236,18565"
+ id="path557"
+ style="fill:none;stroke:#000080;visibility:visible" />
+ <g
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,3461.4027,-1449.3012)"
+ id="g573"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text575">
+ <tspan
+ x="8743 8972 9132 9310 9573 9801 10013 10242 10419 10597 10775 10953 11114"
+ y="16023"
+ id="tspan577">CsumRSRequest</tspan>
+ </text>
+ </g>
+ <text
+ id="text591"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+ y="16806"
+ id="tspan593">receive_DataRequest()</tspan>
+ </text>
+ <text
+ id="text607"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+ y="17606"
+ id="tspan609">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text623"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13825 13986 14164 14426 14604 14710 14871 15049 15154 15332 15510 15616"
+ y="18007"
+ id="tspan625">w_e_end_csum_rs_req()</tspan>
+ </text>
+ <text
+ id="text639"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5735 5913 6091 6180 6357 6446 6607 6696 6874 7085 7246 7424 7585 7691"
+ y="18507"
+ id="tspan641">got_IsInSync()</tspan>
+ </text>
+ <text
+ id="text655"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="7999 8304 8541 8778 8990 9201 9413 9650 10001 10120 10357 10594 10806 11043 11280 11398 11703 11940 12152 12364 12601 12812 12931 13049 13261 13498 13710 13947 14065 14159 14396 14514 14726 14937 15175"
+ y="13877"
+ id="tspan657">Checksum based Resync, case in sync</tspan>
+ </text>
+ <path
+ d="M 12000,24601 L 11900,24301 L 12100,24301 L 12000,24601 z"
+ id="path667"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 12000,23801 L 12000,24361"
+ id="path671"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 8000,26401 L 7900,26101 L 8100,26101 L 8000,26401 z"
+ id="path683"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,25601 L 8000,26161"
+ id="path687"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 12000,23801 L 11686,23840 L 11725,23644 L 12000,23801 z"
+ id="path699"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,23001 L 11765,23754"
+ id="path703"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <g
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,-3543.8452,1630.5143)"
+ id="g719"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text721">
+ <tspan
+ x="9464 9710 9921 10150 10328 10505 10577"
+ y="25236"
+ id="tspan723">OVReply</tspan>
+ </text>
+ </g>
+ <path
+ d="M 8000,25601 L 8282,25458 L 8312,25655 L 8000,25601 z"
+ id="path733"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 12000,25001 L 8237,25565"
+ id="path737"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <g
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,4918.2801,-1381.2128)"
+ id="g753"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text755">
+ <tspan
+ x="9142 9388 9599 9828 10006 10183 10361 10539 10700"
+ y="23106"
+ id="tspan757">OVRequest</tspan>
+ </text>
+ </g>
+ <text
+ id="text771"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13656 13868 14097 14274 14452 14630 14808 14969 15058 15163"
+ y="23806"
+ id="tspan773">receive_OVRequest()</tspan>
+ </text>
+ <text
+ id="text787"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
+ y="24606"
+ id="tspan789">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text803"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14004 14182 14288 14465 14643 14749"
+ y="25007"
+ id="tspan805">w_e_end_ov_req()</tspan>
+ </text>
+ <text
+ id="text819"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5101 5207 5385 5546 5723 5795 5956 6134 6312 6557 6769 6998 7175 7353 7425 7586 7692"
+ y="25507"
+ id="tspan821">receive_OVReply()</tspan>
+ </text>
+ <text
+ id="text835"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+ y="26407"
+ id="tspan837">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text851"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4902 5131 5308 5486 5664 5842 6020 6197 6375 6553 6714 6892 6998 7175 7353 7425 7586 7692"
+ y="26907"
+ id="tspan853">w_e_end_ov_reply()</tspan>
+ </text>
+ <path
+ d="M 12000,27601 L 11686,27640 L 11725,27444 L 12000,27601 z"
+ id="path863"
+ style="fill:#000080;visibility:visible" />
+ <path
+ d="M 8000,26801 L 11765,27554"
+ id="path867"
+ style="fill:none;stroke:#000080;visibility:visible" />
+ <g
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,5704.1907,-1328.312)"
+ id="g883"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <text
+ id="text885">
+ <tspan
+ x="9279 9525 9736 9965 10143 10303 10481 10553"
+ y="26935"
+ id="tspan887">OVResult</tspan>
+ </text>
+ </g>
+ <text
+ id="text901"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12378 12556 12645 12822 13068 13280 13508 13686 13847 14025 14097 14185 14291"
+ y="27559"
+ id="tspan903">got_OVResult()</tspan>
+ </text>
+ <text
+ id="text917"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="8000 8330 8567 8660 8754 8991 9228 9346 9558 9795 9935 10028 10146"
+ y="21877"
+ id="tspan919">Online verify</tspan>
+ </text>
+ <text
+ id="text933"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4641 4870 5047 5310 5488 5649 5826 6004 6182 6343 6521 6626 6804 6982 7160 7338 7499 7587 7693"
+ y="23005"
+ id="tspan935">w_make_ov_request()</tspan>
+ </text>
+ <path
+ d="M 8000,6500 L 7900,6200 L 8100,6200 L 8000,6500 z"
+ id="path945"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,5700 L 8000,6260"
+ id="path949"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 3900,5500 L 3700,5500 L 3700,11000 L 3900,11000"
+ id="path961"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 3900,14500 L 3700,14500 L 3700,18600 L 3900,18600"
+ id="path973"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 3900,22800 L 3700,22800 L 3700,26900 L 3900,26900"
+ id="path985"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <text
+ id="text1001"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+ y="6506"
+ id="tspan1003">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text1017"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+ y="14708"
+ id="tspan1019">w_make_resync_request()</tspan>
+ </text>
+ <text
+ id="text1033"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5190 5419 5596 5774 5952 6113 6291 6468 6646 6824 6985 7146 7324 7586 7692"
+ y="16006"
+ id="tspan1035">w_e_send_csum()</tspan>
+ </text>
+ <path
+ d="M 8000,15501 L 7900,15201 L 8100,15201 L 8000,15501 z"
+ id="path1045"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,14701 L 8000,15261"
+ id="path1049"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ id="text1065"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4492 4670 4776 4953 5131 5309 5487 5665 5842 5914 6092 6270 6376 6554 6731 6909 7087 7248 7426 7587 7692"
+ y="15507"
+ id="tspan1067">drbd_endio_read_sec()</tspan>
+ </text>
+ <path
+ d="M 16100,9000 L 16300,9000 L 16300,7500 L 16100,7500"
+ id="path1077"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 16100,18000 L 16300,18000 L 16300,16500 L 16100,16500"
+ id="path1089"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 16100,25000 L 16300,25000 L 16300,23500 L 16100,23500"
+ id="path1101"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <text
+ id="text1117"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
+ y="5402"
+ id="tspan1119">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1133"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="2027 2133 2294 2472 2649 2827 3005 3077 3255 3432 3504 3682 3788"
+ y="14402"
+ id="tspan1135">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1149"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="2026 2132 2293 2471 2648 2826 3004 3076 3254 3431 3503 3681 3787"
+ y="22602"
+ id="tspan1151">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1165"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="1426 1532 1693 1871 2031 2209 2472 2649 2721 2899 2988 3166 3344 3416 3593 3699"
+ y="11302"
+ id="tspan1167">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1181"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
+ y="18931"
+ id="tspan1183">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1197"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="1526 1632 1793 1971 2131 2309 2572 2749 2821 2999 3088 3266 3444 3516 3693 3799"
+ y="27231"
+ id="tspan1199">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1213"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16126 16232 16393 16571 16748 16926 17104 17176 17354 17531 17603 17781 17887"
+ y="7402"
+ id="tspan1215">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1229"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
+ y="16331"
+ id="tspan1231">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1245"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16127 16233 16394 16572 16749 16927 17105 17177 17355 17532 17604 17782 17888"
+ y="23302"
+ id="tspan1247">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1261"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+ y="9302"
+ id="tspan1263">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1277"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+ y="18331"
+ id="tspan1279">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1293"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16126 16232 16393 16571 16731 16909 17172 17349 17421 17599 17688 17866 18044 18116 18293 18399"
+ y="25302"
+ id="tspan1295">rs_complete_io()</tspan>
+ </text>
+</svg>
diff --git a/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg b/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
new file mode 100644
index 0000000..48a1e21
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/DRBD-data-packets.svg
@@ -0,0 +1,459 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+<svg
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ version="1.0"
+ width="210mm"
+ height="297mm"
+ viewBox="0 0 21000 29700"
+ id="svg2"
+ style="fill-rule:evenodd">
+ <defs
+ id="defs4" />
+ <g
+ id="Default"
+ style="visibility:visible">
+ <desc
+ id="desc176">Master slide</desc>
+ </g>
+ <path
+ d="M 11999,19601 L 11899,19301 L 12099,19301 L 11999,19601 z"
+ id="path189"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 11999,18801 L 11999,19361"
+ id="path193"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 7999,21401 L 7899,21101 L 8099,21101 L 7999,21401 z"
+ id="path205"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 7999,20601 L 7999,21161"
+ id="path209"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 11999,18801 L 11685,18840 L 11724,18644 L 11999,18801 z"
+ id="path221"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 7999,18001 L 11764,18754"
+ id="path225"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ x="-3023.845"
+ y="1106.8124"
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+ id="text243"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="6115.1553 6344.1553 6555.1553 6784.1553 6962.1553 7051.1553 7228.1553 7457.1553 7635.1553 7813.1553 7885.1553"
+ y="21390.812"
+ id="tspan245">RSDataReply</tspan>
+ </text>
+ <path
+ d="M 7999,20601 L 8281,20458 L 8311,20655 L 7999,20601 z"
+ id="path255"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 11999,20001 L 8236,20565"
+ id="path259"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ x="3502.5356"
+ y="-2184.6621"
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+ id="text277"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12321.536 12550.536 12761.536 12990.536 13168.536 13257.536 13434.536 13663.536 13841.536 14019.536 14196.536 14374.536 14535.536"
+ y="15854.338"
+ id="tspan279">RSDataRequest</tspan>
+ </text>
+ <text
+ id="text293"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4034 4263 4440 4703 4881 5042 5219 5397 5503 5681 5842 6003 6180 6341 6519 6625 6803 6980 7158 7336 7497 7586 7692"
+ y="17807"
+ id="tspan295">w_make_resync_request()</tspan>
+ </text>
+ <text
+ id="text309"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12305 12483 12644 12821 12893 13054 13232 13410 13638 13816 13905 14083 14311 14489 14667 14845 15023 15184 15272 15378"
+ y="18806"
+ id="tspan311">receive_DataRequest()</tspan>
+ </text>
+ <text
+ id="text325"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12377 12483 12660 12838 13016 13194 13372 13549 13621 13799 13977 14083 14261 14438 14616 14794 14955 15133 15294 15399"
+ y="19606"
+ id="tspan327">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text341"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12191 12420 12597 12775 12953 13131 13309 13486 13664 13770 13931 14109 14287 14375 14553 14731 14837 15015 15192 15298"
+ y="20007"
+ id="tspan343">w_e_end_rsdata_req()</tspan>
+ </text>
+ <text
+ id="text357"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4444 4550 4728 4889 5066 5138 5299 5477 5655 5883 6095 6324 6501 6590 6768 6997 7175 7352 7424 7585 7691"
+ y="20507"
+ id="tspan359">receive_RSDataReply()</tspan>
+ </text>
+ <text
+ id="text373"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4457 4635 4741 4918 5096 5274 5452 5630 5807 5879 6057 6235 6464 6569 6641 6730 6908 7086 7247 7425 7585 7691"
+ y="21407"
+ id="tspan375">drbd_endio_write_sec()</tspan>
+ </text>
+ <text
+ id="text389"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4647 4825 5003 5180 5358 5536 5714 5820 5997 6158 6319 6497 6658 6836 7013 7085 7263 7424 7585 7691"
+ y="21907"
+ id="tspan391">e_end_resync_block()</tspan>
+ </text>
+ <path
+ d="M 11999,22601 L 11685,22640 L 11724,22444 L 11999,22601 z"
+ id="path401"
+ style="fill:#000080;visibility:visible" />
+ <path
+ d="M 7999,21801 L 11764,22554"
+ id="path405"
+ style="fill:none;stroke:#000080;visibility:visible" />
+ <text
+ x="4290.3008"
+ y="-2369.6162"
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+ id="text423"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="13610.301 13911.301 14016.301 14088.301 14177.301 14355.301 14567.301 14728.301"
+ y="19573.385"
+ id="tspan425">WriteAck</tspan>
+ </text>
+ <text
+ id="text439"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12199 12377 12555 12644 12821 13033 13105 13283 13444 13604 13816 13977 14138 14244"
+ y="22559"
+ id="tspan441">got_BlockAck()</tspan>
+ </text>
+ <text
+ id="text455"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="7999 8304 8541 8753 8964 9201 9413 9531 9769 9862 10099 10310 10522 10734 10852 10971 11208 11348 11585 11822"
+ y="16877"
+ id="tspan457">Resync blocks, 4-32K</tspan>
+ </text>
+ <path
+ d="M 12000,7601 L 11900,7301 L 12100,7301 L 12000,7601 z"
+ id="path467"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 12000,6801 L 12000,7361"
+ id="path471"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 12000,6801 L 11686,6840 L 11725,6644 L 12000,6801 z"
+ id="path483"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,6001 L 11765,6754"
+ id="path487"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ x="-1288.1796"
+ y="1279.7666"
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+ id="text505"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="8174.8208 8475.8203 8580.8203 8652.8203 8741.8203 8919.8203 9131.8203 9292.8203"
+ y="9516.7666"
+ id="tspan507">WriteAck</tspan>
+ </text>
+ <path
+ d="M 8000,8601 L 8282,8458 L 8312,8655 L 8000,8601 z"
+ id="path517"
+ style="fill:#000080;visibility:visible" />
+ <path
+ d="M 12000,8001 L 8237,8565"
+ id="path521"
+ style="fill:none;stroke:#000080;visibility:visible" />
+ <text
+ x="1065.6655"
+ y="-2097.7664"
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+ id="text539"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="10682.666 10911.666 11088.666 11177.666"
+ y="4107.2339"
+ id="tspan541">Data</tspan>
+ </text>
+ <text
+ id="text555"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
+ y="5505"
+ id="tspan557">drbd_make_request()</tspan>
+ </text>
+ <text
+ id="text571"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14190"
+ y="6806"
+ id="tspan573">receive_Data()</tspan>
+ </text>
+ <text
+ id="text587"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14207 14312 14384 14473 14651 14829 14990 15168 15328 15434"
+ y="7606"
+ id="tspan589">drbd_endio_write_sec()</tspan>
+ </text>
+ <text
+ id="text603"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12192 12370 12548 12725 12903 13081 13259 13437 13509 13686 13847 14008 14114"
+ y="8007"
+ id="tspan605">e_end_block()</tspan>
+ </text>
+ <text
+ id="text619"
+ style="font-size:318px;font-weight:400;fill:#000080;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5647 5825 6003 6092 6269 6481 6553 6731 6892 7052 7264 7425 7586 7692"
+ y="8606"
+ id="tspan621">got_BlockAck()</tspan>
+ </text>
+ <text
+ id="text635"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="8000 8305 8542 8779 9016 9109 9346 9486 9604 9956 10049 10189 10328 10565 10705 10942 11179 11298 11603 11742 11835 11954 12191 12310 12428 12665 12902 13139 13279 13516 13753"
+ y="4877"
+ id="tspan637">Regular mirrored write, 512-32K</tspan>
+ </text>
+ <text
+ id="text651"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5381 5610 5787 5948 6126 6304 6482 6659 6837 7015 7087 7265 7426 7587 7692"
+ y="6003"
+ id="tspan653">w_send_dblock()</tspan>
+ </text>
+ <path
+ d="M 8000,6800 L 7900,6500 L 8100,6500 L 8000,6800 z"
+ id="path663"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,6000 L 8000,6560"
+ id="path667"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ id="text683"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4602 4780 4886 5063 5241 5419 5597 5775 5952 6024 6202 6380 6609 6714 6786 6875 7053 7231 7409 7515 7587 7692"
+ y="6905"
+ id="tspan685">drbd_endio_write_pri()</tspan>
+ </text>
+ <path
+ d="M 12000,13602 L 11900,13302 L 12100,13302 L 12000,13602 z"
+ id="path695"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 12000,12802 L 12000,13362"
+ id="path699"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <path
+ d="M 12000,12802 L 11686,12841 L 11725,12645 L 12000,12802 z"
+ id="path711"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 8000,12002 L 11765,12755"
+ id="path715"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ x="-2155.5266"
+ y="1201.5964"
+ transform="matrix(0.9895258,-0.1443562,0.1443562,0.9895258,0,0)"
+ id="text733"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="7202.4736 7431.4736 7608.4736 7697.4736 7875.4736 8104.4736 8282.4736 8459.4736 8531.4736"
+ y="15454.597"
+ id="tspan735">DataReply</tspan>
+ </text>
+ <path
+ d="M 8000,14602 L 8282,14459 L 8312,14656 L 8000,14602 z"
+ id="path745"
+ style="fill:#008000;visibility:visible" />
+ <path
+ d="M 12000,14002 L 8237,14566"
+ id="path749"
+ style="fill:none;stroke:#008000;visibility:visible" />
+ <text
+ x="2280.3804"
+ y="-2103.2141"
+ transform="matrix(0.9788674,0.2044961,-0.2044961,0.9788674,0,0)"
+ id="text767"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="11316.381 11545.381 11722.381 11811.381 11989.381 12218.381 12396.381 12573.381 12751.381 12929.381 13090.381"
+ y="9981.7861"
+ id="tspan769">DataRequest</tspan>
+ </text>
+ <text
+ id="text783"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="4746 4924 5030 5207 5385 5563 5826 6003 6164 6342 6520 6626 6803 6981 7159 7337 7498 7587 7692"
+ y="11506"
+ id="tspan785">drbd_make_request()</tspan>
+ </text>
+ <text
+ id="text799"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12306 12484 12645 12822 12894 13055 13233 13411 13639 13817 13906 14084 14312 14490 14668 14846 15024 15185 15273 15379"
+ y="12807"
+ id="tspan801">receive_DataRequest()</tspan>
+ </text>
+ <text
+ id="text815"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12200 12378 12484 12661 12839 13017 13195 13373 13550 13622 13800 13978 14084 14262 14439 14617 14795 14956 15134 15295 15400"
+ y="13607"
+ id="tspan817">drbd_endio_read_sec()</tspan>
+ </text>
+ <text
+ id="text831"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="12192 12421 12598 12776 12954 13132 13310 13487 13665 13843 14021 14110 14288 14465 14571 14749 14927 15033"
+ y="14008"
+ id="tspan833">w_e_end_data_req()</tspan>
+ </text>
+ <g
+ id="g835"
+ style="visibility:visible">
+ <desc
+ id="desc837">Drawing</desc>
+ <text
+ id="text847"
+ style="font-size:318px;font-weight:400;fill:#008000;font-family:Helvetica embedded">
+ <tspan
+ x="4885 4991 5169 5330 5507 5579 5740 5918 6096 6324 6502 6591 6769 6997 7175 7353 7425 7586 7692"
+ y="14607"
+ id="tspan849">receive_DataReply()</tspan>
+ </text>
+ </g>
+ <text
+ id="text863"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="8000 8305 8398 8610 8821 8914 9151 9363 9575 9693 9833 10070 10307 10544 10663 10781 11018 11255 11493 11632 11869 12106"
+ y="10878"
+ id="tspan865">Diskless read, 512-32K</tspan>
+ </text>
+ <text
+ id="text879"
+ style="font-size:318px;font-weight:400;fill:#008000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="5029 5258 5435 5596 5774 5952 6130 6307 6413 6591 6769 6947 7125 7230 7408 7586 7692"
+ y="12004"
+ id="tspan881">w_send_read_req()</tspan>
+ </text>
+ <text
+ id="text895"
+ style="font-size:423px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="6961 7266 7571 7854 8159 8278 8515 8633 8870 9107 9226 9463 9581 9700 9793 10030"
+ y="2806"
+ id="tspan897">DRBD 8 data flow</tspan>
+ </text>
+ <path
+ d="M 3900,5300 L 3700,5300 L 3700,7000 L 3900,7000"
+ id="path907"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 3900,17600 L 3700,17600 L 3700,22000 L 3900,22000"
+ id="path919"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <path
+ d="M 16100,20000 L 16300,20000 L 16300,18500 L 16100,18500"
+ id="path931"
+ style="fill:none;stroke:#000000;visibility:visible" />
+ <text
+ id="text947"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="2126 2304 2376 2554 2731 2909 3087 3159 3337 3515 3587 3764 3870"
+ y="5202"
+ id="tspan949">al_begin_io()</tspan>
+ </text>
+ <text
+ id="text963"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="1632 1810 1882 2060 2220 2398 2661 2839 2910 3088 3177 3355 3533 3605 3783 3888"
+ y="7331"
+ id="tspan965">al_complete_io()</tspan>
+ </text>
+ <text
+ id="text979"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="2126 2232 2393 2571 2748 2926 3104 3176 3354 3531 3603 3781 3887"
+ y="17431"
+ id="tspan981">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text995"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="1626 1732 1893 2071 2231 2409 2672 2849 2921 3099 3188 3366 3544 3616 3793 3899"
+ y="22331"
+ id="tspan997">rs_complete_io()</tspan>
+ </text>
+ <text
+ id="text1011"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16027 16133 16294 16472 16649 16827 17005 17077 17255 17432 17504 17682 17788"
+ y="18402"
+ id="tspan1013">rs_begin_io()</tspan>
+ </text>
+ <text
+ id="text1027"
+ style="font-size:318px;font-weight:400;fill:#000000;visibility:visible;font-family:Helvetica embedded">
+ <tspan
+ x="16115 16221 16382 16560 16720 16898 17161 17338 17410 17588 17677 17855 18033 18105 18282 18388"
+ y="20331"
+ id="tspan1029">rs_complete_io()</tspan>
+ </text>
+</svg>
diff --git a/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot b/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
new file mode 100644
index 0000000..025e8cf
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/conn-states-8.dot
@@ -0,0 +1,18 @@
+digraph conn_states {
+ StandAlone -> WFConnection [ label = "ioctl_set_net()" ]
+ WFConnection -> Unconnected [ label = "unable to bind()" ]
+ WFConnection -> WFReportParams [ label = "in connect() after accept" ]
+ WFReportParams -> StandAlone [ label = "checks in receive_param()" ]
+ WFReportParams -> Connected [ label = "in receive_param()" ]
+ WFReportParams -> WFBitMapS [ label = "sync_handshake()" ]
+ WFReportParams -> WFBitMapT [ label = "sync_handshake()" ]
+ WFBitMapS -> SyncSource [ label = "receive_bitmap()" ]
+ WFBitMapT -> SyncTarget [ label = "receive_bitmap()" ]
+ SyncSource -> Connected
+ SyncTarget -> Connected
+ SyncSource -> PausedSyncS
+ SyncTarget -> PausedSyncT
+ PausedSyncS -> SyncSource
+ PausedSyncT -> SyncTarget
+ Connected -> WFConnection [ label = "* on network error" ]
+}
diff --git a/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst b/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
new file mode 100644
index 0000000..66036b9
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/data-structure-v9.rst
@@ -0,0 +1,42 @@
+================================
+Kernel data structure for DRBD-9
+================================
+
+This describes the in-kernel data structure for DRBD-9. Starting with
+Linux v3.14 we are reorganizing DRBD to use this data structure.
+
+Basic Data Structure
+====================
+
+A node has a number of DRBD resources. Each such resource has a number of
+devices (aka volumes) and connections to other nodes ("peer nodes"). Each DRBD
+device is represented by a block device locally.
+
+The DRBD objects are interconnected to form a matrix as depicted below; a
+drbd_peer_device object sits at each intersection between a drbd_device and a
+drbd_connection::
+
+ /--------------+---------------+.....+---------------\
+ | resource | device | | device |
+ +--------------+---------------+.....+---------------+
+ | connection | peer_device | | peer_device |
+ +--------------+---------------+.....+---------------+
+ : : : : :
+ : : : : :
+ +--------------+---------------+.....+---------------+
+ | connection | peer_device | | peer_device |
+ \--------------+---------------+.....+---------------/
+
+In this table, horizontally, devices can be accessed from resources by their
+volume number. Likewise, peer_devices can be accessed from connections by
+their volume number. Objects in the vertical direction are connected by doubly
+linked lists. There are back pointers from peer_devices to their connections
+and devices, and from connections and devices to their resource.
+
+All resources are in the drbd_resources doubly linked list. In addition, all
+devices can be accessed by their minor device number via the drbd_devices idr.
+
+The drbd_resource, drbd_connection, and drbd_device objects are reference
+counted. The peer_device objects only serve to establish the links between
+devices and connections; their lifetime is determined by the lifetime of the
+device and connection which they reference.
diff --git a/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot b/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
new file mode 100644
index 0000000..d06cfb4
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/disk-states-8.dot
@@ -0,0 +1,16 @@
+digraph disk_states {
+ Diskless -> Inconsistent [ label = "ioctl_set_disk()" ]
+ Diskless -> Consistent [ label = "ioctl_set_disk()" ]
+ Diskless -> Outdated [ label = "ioctl_set_disk()" ]
+ Consistent -> Outdated [ label = "receive_param()" ]
+ Consistent -> UpToDate [ label = "receive_param()" ]
+ Consistent -> Inconsistent [ label = "start resync" ]
+ Outdated -> Inconsistent [ label = "start resync" ]
+ UpToDate -> Inconsistent [ label = "ioctl_replicate" ]
+ Inconsistent -> UpToDate [ label = "resync completed" ]
+ Consistent -> Failed [ label = "io completion error" ]
+ Outdated -> Failed [ label = "io completion error" ]
+ UpToDate -> Failed [ label = "io completion error" ]
+ Inconsistent -> Failed [ label = "io completion error" ]
+ Failed -> Diskless [ label = "sending notify to peer" ]
+}
diff --git a/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot b/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
new file mode 100644
index 0000000..6d9cf0a
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/drbd-connection-state-overview.dot
@@ -0,0 +1,85 @@
+// vim: set sw=2 sts=2 :
+digraph {
+ rankdir=BT
+ bgcolor=white
+
+ node [shape=plaintext]
+ node [fontcolor=black]
+
+ StandAlone [ style=filled,fillcolor=gray,label=StandAlone ]
+
+ node [fontcolor=lightgray]
+
+ Unconnected [ label=Unconnected ]
+
+ CommTrouble [ shape=record,
+ label="{communication loss|{Timeout|BrokenPipe|NetworkFailure}}" ]
+
+ node [fontcolor=gray]
+
+ subgraph cluster_try_connect {
+ label="try to connect, handshake"
+ rank=max
+ WFConnection [ label=WFConnection ]
+ WFReportParams [ label=WFReportParams ]
+ }
+
+ TearDown [ label=TearDown ]
+
+ Connected [ label=Connected,style=filled,fillcolor=green,fontcolor=black ]
+
+ node [fontcolor=lightblue]
+
+ StartingSyncS [ label=StartingSyncS ]
+ StartingSyncT [ label=StartingSyncT ]
+
+ subgraph cluster_bitmap_exchange {
+ node [fontcolor=red]
+ fontcolor=red
+ label="new application (WRITE?) requests blocked\lwhile bitmap is exchanged"
+
+ WFBitMapT [ label=WFBitMapT ]
+ WFSyncUUID [ label=WFSyncUUID ]
+ WFBitMapS [ label=WFBitMapS ]
+ }
+
+ node [fontcolor=blue]
+
+ cluster_resync [ shape=record,label="{<any>resynchronisation process running\l'concurrent' application requests allowed|{{<T>PausedSyncT\nSyncTarget}|{<S>PausedSyncS\nSyncSource}}}" ]
+
+ node [shape=box,fontcolor=black]
+
+ // drbdadm [label="drbdadm connect"]
+ // handshake [label="drbd_connect()\ndrbd_do_handshake\ndrbd_sync_handshake() etc."]
+ // comm_error [label="communication trouble"]
+
+ //
+ // edges
+ // --------------------------------------
+
+ StandAlone -> Unconnected [ label="drbdadm connect" ]
+ Unconnected -> StandAlone [ label="drbdadm disconnect\lor serious communication trouble" ]
+ Unconnected -> WFConnection [ label="receiver thread is started" ]
+ WFConnection -> WFReportParams [ headlabel="accept()\land/or \lconnect()\l" ]
+
+ WFReportParams -> StandAlone [ label="during handshake\lpeers do not agree\labout something essential" ]
+ WFReportParams -> Connected [ label="data identical\lno sync needed",color=green,fontcolor=green ]
+
+ WFReportParams -> WFBitMapS
+ WFReportParams -> WFBitMapT
+ WFBitMapT -> WFSyncUUID [minlen=0.1,constraint=false]
+
+ WFBitMapS -> cluster_resync:S
+ WFSyncUUID -> cluster_resync:T
+
+ edge [color=green]
+ cluster_resync:any -> Connected [ label="resync done",fontcolor=green ]
+
+ edge [color=red]
+ WFReportParams -> CommTrouble
+ Connected -> CommTrouble
+ cluster_resync:any -> CommTrouble
+ edge [color=black]
+ CommTrouble -> Unconnected [label="receiver thread is stopped" ]
+
+}
diff --git a/Documentation/admin-guide/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst
new file mode 100644
index 0000000..bd9a490
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/figures.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. The files included here are intended to help understand the implementation
+
+Data flows that relate some functions, and write packets
+========================================================
+
+.. kernel-figure:: DRBD-8.3-data-packets.svg
+ :alt: DRBD-8.3-data-packets.svg
+ :align: center
+
+.. kernel-figure:: DRBD-data-packets.svg
+ :alt: DRBD-data-packets.svg
+ :align: center
+
+
+Sub graphs of DRBD's state transitions
+======================================
+
+.. kernel-figure:: conn-states-8.dot
+ :alt: conn-states-8.dot
+ :align: center
+
+.. kernel-figure:: disk-states-8.dot
+ :alt: disk-states-8.dot
+ :align: center
+
+.. kernel-figure:: node-states-8.dot
+ :alt: node-states-8.dot
+ :align: center
diff --git a/Documentation/admin-guide/blockdev/drbd/index.rst b/Documentation/admin-guide/blockdev/drbd/index.rst
new file mode 100644
index 0000000..68ecd5c
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/index.rst
@@ -0,0 +1,19 @@
+==========================================
+Distributed Replicated Block Device - DRBD
+==========================================
+
+Description
+===========
+
+ DRBD is a shared-nothing, synchronously replicated block device. It
+ is designed to serve as a building block for high availability
+ clusters and, in this context, is a "drop-in" replacement for shared
+ storage. Simplistically, you could see it as a network RAID 1.
+
+ Please visit http://www.drbd.org to find out more.
+
+.. toctree::
+ :maxdepth: 1
+
+ data-structure-v9
+ figures
diff --git a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
new file mode 100644
index 0000000..bfa54e1
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
@@ -0,0 +1,13 @@
+digraph node_states {
+ Secondary -> Primary [ label = "ioctl_set_state()" ]
+ Primary -> Secondary [ label = "ioctl_set_state()" ]
+}
+
+digraph peer_states {
+ Secondary -> Primary [ label = "recv state packet" ]
+ Primary -> Secondary [ label = "recv state packet" ]
+ Primary -> Unknown [ label = "connection lost" ]
+ Secondary -> Unknown [ label = "connection lost" ]
+ Unknown -> Primary [ label = "connected" ]
+ Unknown -> Secondary [ label = "connected" ]
+}
diff --git a/Documentation/admin-guide/blockdev/floppy.rst b/Documentation/admin-guide/blockdev/floppy.rst
new file mode 100644
index 0000000..4a8f31c
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/floppy.rst
@@ -0,0 +1,255 @@
+=============
+Floppy Driver
+=============
+
+FAQ list:
+=========
+
+A FAQ list may be found in the fdutils package (see below), and also
+at <http://fdutils.linux.lu/faq.html>.
+
+
+LILO configuration options (Thinkpad users, read this)
+======================================================
+
+The floppy driver is configured using the 'floppy=' option in
+lilo. This option can be typed at the boot prompt, or entered in the
+lilo configuration file.
+
+Example: If your kernel is called linux-2.6.9, type the following line
+at the lilo boot prompt (if you have a thinkpad)::
+
+ linux-2.6.9 floppy=thinkpad
+
+You may also enter the following line in /etc/lilo.conf, in the description
+of linux-2.6.9::
+
+ append = "floppy=thinkpad"
+
+Several floppy-related options may be given, for example::
+
+ linux-2.6.9 floppy=daring floppy=two_fdc
+ append = "floppy=daring floppy=two_fdc"
+
+If you give options both in the lilo config file and on the boot
+prompt, the option strings of both places are concatenated, the boot
+prompt options coming last. That's why there are also options to
+restore the default behavior.
+
+
+Module configuration options
+============================
+
+If you use the floppy driver as a module, use the following syntax::
+
+ modprobe floppy floppy="<options>"
+
+Example::
+
+ modprobe floppy floppy="omnibook messages"
+
+If you need certain options enabled every time you load the floppy driver,
+you can put::
+
+ options floppy floppy="omnibook messages"
+
+in a configuration file in /etc/modprobe.d/.
+
+
+The floppy driver related options are:
+
+ floppy=asus_pci
+ Sets the bit mask to allow only units 0 and 1. (default)
+
+ floppy=daring
+ Tells the floppy driver that you have a well behaved floppy controller.
+ This allows more efficient and smoother operation, but may fail on
+ certain controllers. This may speed up certain operations.
+
+ floppy=0,daring
+ Tells the floppy driver that your floppy controller should be used
+ with caution.
+
+ floppy=one_fdc
+ Tells the floppy driver that you have only one floppy controller.
+ (default)
+
+ floppy=two_fdc / floppy=<address>,two_fdc
+ Tells the floppy driver that you have two floppy controllers.
+ The second floppy controller is assumed to be at <address>.
+ This option is not needed if the second controller is at address
+ 0x370, and if you use the 'cmos' option.
+
+ floppy=thinkpad
+ Tells the floppy driver that you have a Thinkpad. Thinkpads use an
+ inverted convention for the disk change line.
+
+ floppy=0,thinkpad
+ Tells the floppy driver that you don't have a Thinkpad.
+
+ floppy=omnibook / floppy=nodma
+ Tells the floppy driver not to use DMA for data transfers.
+ This is needed on HP Omnibooks, which don't have a workable
+ DMA channel for the floppy driver. This option is also useful
+ if you frequently get "Unable to allocate DMA memory" messages.
+ Indeed, DMA memory needs to be contiguous in physical memory,
+ and is thus harder to find, whereas non-DMA buffers may be
+ allocated in virtual memory. However, I advise against this if
+ you have an FDC without a FIFO (8272A or 82072). 82072A and
+ later are OK. You also need at least a 486 to use nodma.
+ If you use nodma mode, I suggest you also set the FIFO
+ threshold to 10 or lower, in order to limit the number of data
+ transfer interrupts.
+
+ If you have a FIFO-able FDC, the floppy driver automatically
+ falls back on non-DMA mode if no DMA-able memory can be found.
+ If you want to avoid this, explicitly ask for 'yesdma'.
+
+ floppy=yesdma
+ Tells the floppy driver that a workable DMA channel is available.
+ (default)
+
+ floppy=nofifo
+ Disables the FIFO entirely. This is needed if you get "Bus
+ master arbitration error" messages from your Ethernet card (or
+ from other devices) while accessing the floppy.
+
+ floppy=usefifo
+ Enables the FIFO. (default)
+
+ floppy=<threshold>,fifo_depth
+ Sets the FIFO threshold. This is mostly relevant in DMA
+ mode. If this is higher, the floppy driver tolerates more
+ interrupt latency, but it triggers more interrupts (i.e. it
+ imposes more load on the rest of the system). If this is
+ lower, the interrupt latency should be lower too (faster
+ processor). The benefit of a lower threshold is fewer
+ interrupts.
+
+ To tune the FIFO threshold, switch on over/underrun messages
+ using 'floppycontrol --messages'. Then access a floppy
+ disk. If you get a huge amount of "Over/Underrun - retrying"
+ messages, then the FIFO threshold is too low. Try with a
+ higher value, until you only get an occasional Over/Underrun.
+ It is a good idea to compile the floppy driver as a module
+ when doing this tuning, because this lets you try different
+ FIFO values without rebooting the machine for each test. Note
+ that you need to do 'floppycontrol --messages' every time you
+ re-insert the module.
+
+ Usually, tuning the FIFO threshold should not be needed, as
+ the default (0xa) is reasonable.
+
+ floppy=<drive>,<type>,cmos
+ Sets the CMOS type of <drive> to <type>. This is mandatory if
+ you have more than two floppy drives (only two can be
+ described in the physical CMOS), or if your BIOS uses
+ non-standard CMOS types. The CMOS types are:
+
+ == ==================================
+ 0 Use the value of the physical CMOS
+ 1 5 1/4 DD
+ 2 5 1/4 HD
+ 3 3 1/2 DD
+ 4 3 1/2 HD
+ 5 3 1/2 ED
+ 6 3 1/2 ED
+ 16 unknown or not installed
+ == ==================================
+
+ (Note: there are two valid types for ED drives. This is because 5 was
+ initially chosen to represent floppy *tapes*, and 6 for ED drives.
+ AMI ignored this, and used 5 for ED drives. That's why the floppy
+ driver handles both.)
+
+ floppy=unexpected_interrupts
+ Print a warning message when an unexpected interrupt is received.
+ (default)
+
+ floppy=no_unexpected_interrupts / floppy=L40SX
+ Don't print a message when an unexpected interrupt is received. This
+ is needed on IBM L40SX laptops in certain video modes. (There seems
+ to be an interaction between video and floppy. The unexpected
+ interrupts affect only performance, and can be safely ignored.)
+
+ floppy=broken_dcl
+ Don't use the disk change line, but assume that the disk was
+ changed whenever the device node is reopened. Needed on some
+ boxes where the disk change line is broken or unsupported.
+ This should be regarded as a stopgap measure: it makes
+ floppy operation less efficient due to unneeded cache
+ flushes, and slightly less reliable. Please verify your
+ cable, connection and jumper settings if you have any DCL
+ problems. However, some older drives, and also some laptops
+ are known not to have a DCL.
+
+ floppy=debug
+ Print debugging messages.
+
+ floppy=messages
+ Print informational messages for some operations (disk change
+ notifications, warnings about over and underruns, and about
+ autodetection).
+
+ floppy=silent_dcl_clear
+ Uses a less noisy way to clear the disk change line (which
+ doesn't involve seeks). Implied by 'daring' option.
+
+ floppy=<nr>,irq
+ Sets the floppy IRQ to <nr> instead of 6.
+
+ floppy=<nr>,dma
+ Sets the floppy DMA channel to <nr> instead of 2.
+
+ floppy=slow
+ Use PS/2 stepping rate::
+
+ PS/2 floppies have much slower step rates than regular floppies.
+ It's been recommended to take about 1/4 of the default speed
+ in some more extreme cases.
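+
+As a sketch that only combines options documented above (the particular
+combination is arbitrary, and numbering drives from 0 is an assumption):
+a machine with a third drive, taken here to be a 3 1/2 HD unit numbered 2,
+and with informational messages enabled, could be booted with::
+
+ floppy=2,4,cmos floppy=messages
+
+or, with the driver built as a module, configured in a file under
+/etc/modprobe.d/ with::
+
+ options floppy floppy="2,4,cmos messages"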
+
+
+Supporting utilities and additional documentation:
+==================================================
+
+Additional parameters of the floppy driver can be configured at
+runtime. Utilities which do this can be found in the fdutils package.
+This package also contains a new version of mtools which allows
+access to high capacity disks (up to 1992K on a high density 3 1/2 disk!).
+It also contains additional documentation about the floppy driver.
+
+The latest version can be found at the fdutils homepage:
+
+ http://fdutils.linux.lu
+
+The fdutils releases can be found at:
+
+ http://fdutils.linux.lu/download.html
+
+ http://www.tux.org/pub/knaff/fdutils/
+
+ ftp://metalab.unc.edu/pub/Linux/utils/disk-management/
+
+Reporting problems about the floppy driver
+==========================================
+
+If you have a question or a bug report about the floppy driver, mail
+me at Alain.Knaff@poboxes.com . If you post to Usenet, preferably use
+comp.os.linux.hardware. As the volume in these groups is rather high,
+be sure to include the word "floppy" (or "FLOPPY") in the subject
+line. If the reported problem happens when mounting floppy disks, be
+sure to mention also the type of the filesystem in the subject line.
+
+Be sure to read the FAQ before mailing/posting any bug reports!
+
+Alain
+
+Changelog
+=========
+
+10-30-2004 :
+ Cleanup, updating, add reference to module configuration.
+ James Nelson <james4765@gmail.com>
+
+6-3-2000 :
+ Original Document
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
new file mode 100644
index 0000000..b903cf1
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Block Devices
+=============
+
+.. toctree::
+ :maxdepth: 1
+
+ floppy
+ nbd
+ paride
+ ramdisk
+ zram
+
+ drbd/index
diff --git a/Documentation/admin-guide/blockdev/nbd.rst b/Documentation/admin-guide/blockdev/nbd.rst
new file mode 100644
index 0000000..d78dfe5
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/nbd.rst
@@ -0,0 +1,31 @@
+==================================
+Network Block Device (TCP version)
+==================================
+
+1) Overview
+-----------
+
+What is it: With this compiled in the kernel (or as a module), Linux
+can use a remote server as one of its block devices. So every time
+the client computer wants to read, e.g., /dev/nbd0, it sends a
+request over TCP to the server, which will reply with the data read.
+This can be used for stations with low disk space (or even diskless)
+to borrow disk space from another computer.
+Unlike NFS, it is possible to put any filesystem on it, etc.
+
+For more information, or to download the nbd-client and nbd-server
+tools, go to http://nbd.sf.net/.
+
+The nbd kernel module need only be installed on the client
+system, as the nbd-server is completely in userspace. In fact,
+the nbd-server has been successfully ported to other operating
+systems, including Windows.
+
+A) NBD parameters
+-----------------
+
+max_part
+ Number of partitions per device (default: 0).
+
+nbds_max
+ Number of block devices that should be initialized (default: 16).
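+
+As an illustration only (the values are arbitrary, not recommendations), both
+parameters could be set persistently when nbd is built as a module, using a
+line such as the following in a file under /etc/modprobe.d/::
+
+ options nbd nbds_max=4 max_part=8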
diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst
new file mode 100644
index 0000000..87b4278
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/paride.rst
@@ -0,0 +1,439 @@
+===================================
+Linux and parallel port IDE devices
+===================================
+
+PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
+
+1. Introduction
+===============
+
+Owing to the simplicity and near universality of the parallel port interface
+to personal computers, many external devices such as portable hard-disk,
+CD-ROM, LS-120 and tape drives use the parallel port to connect to their
+host computer. While some devices (notably scanners) use ad-hoc methods
+to pass commands and data through the parallel port interface, most
+external devices are actually identical to an internal model, but with
+a parallel-port adapter chip added in. Some of the original parallel port
+adapters were little more than mechanisms for multiplexing a SCSI bus.
+(The Iomega PPA-3 adapter used in the ZIP drives is an example of this
+approach). Most current designs, however, take a different approach.
+The adapter chip reproduces a small ISA or IDE bus in the external device
+and the communication protocol provides operations for reading and writing
+device registers, as well as data block transfer functions. Sometimes,
+the device being addressed via the parallel cable is a standard SCSI
+controller like an NCR 5380. The "ditto" family of external tape
+drives use the ISA replicator to interface a floppy disk controller,
+which is then connected to a floppy-tape mechanism. The vast majority
+of external parallel port devices, however, are now based on standard
+IDE type devices, which require no intermediate controller. If one
+were to open up a parallel port CD-ROM drive, for instance, one would
+find a standard ATAPI CD-ROM drive, a power supply, and a single adapter
+that interconnected a standard PC parallel port cable and a standard
+IDE cable. It is usually possible to exchange the CD-ROM device with
+any other device using the IDE interface.
+
+The document describes the support in Linux for parallel port IDE
+devices. It does not cover parallel port SCSI devices, "ditto" tape
+drives or scanners. Many different devices are supported by the
+parallel port IDE subsystem, including:
+
+ - MicroSolutions backpack CD-ROM
+ - MicroSolutions backpack PD/CD
+ - MicroSolutions backpack hard-drives
+ - MicroSolutions backpack 8000t tape drive
+ - SyQuest EZ-135, EZ-230 & SparQ drives
+ - Avatar Shark
+ - Imation Superdisk LS-120
+ - Maxell Superdisk LS-120
+ - FreeCom Power CD
+ - Hewlett-Packard 5GB and 8GB tape drives
+ - Hewlett-Packard 7100 and 7200 CD-RW drives
+
+as well as most of the clone and no-name products on the market.
+
+To support such a wide range of devices, PARIDE, the parallel port IDE
+subsystem, is actually structured in three parts. There is a base
+paride module which provides a registry and some common methods for
+accessing the parallel ports. The second component is a set of
+high-level drivers for each of the different types of supported devices:
+
+ === =============
+ pd IDE disk
+ pcd ATAPI CD-ROM
+ pf ATAPI disk
+ pt ATAPI tape
+ pg ATAPI generic
+ === =============
+
+(Currently, the pg driver is only used with CD-R drives).
+
+The high-level drivers function according to the relevant standards.
+The third component of PARIDE is a set of low-level protocol drivers
+for each of the parallel port IDE adapter chips. Thanks to the interest
+and encouragement of Linux users from many parts of the world,
+support is available for almost all known adapter protocols:
+
+ ==== ====================================== ====
+ aten ATEN EH-100 (HK)
+ bpck Microsolutions backpack (US)
+ comm DataStor (old-type) "commuter" adapter (TW)
+ dstr DataStor EP-2000 (TW)
+ epat Shuttle EPAT (UK)
+ epia Shuttle EPIA (UK)
+ fit2 FIT TD-2000 (US)
+ fit3 FIT TD-3000 (US)
+ friq Freecom IQ cable (DE)
+ frpw Freecom Power (DE)
+ kbic KingByte KBIC-951A and KBIC-971A (TW)
+ ktti KT Technology PHd adapter (SG)
+ on20 OnSpec 90c20 (US)
+ on26 OnSpec 90c26 (US)
+ ==== ====================================== ====
+
+
+2. Using the PARIDE subsystem
+=============================
+
+While configuring the Linux kernel, you may choose either to build
+the PARIDE drivers into your kernel, or to build them as modules.
+
+In either case, you will need to select "Parallel port IDE device support"
+as well as at least one of the high-level drivers and at least one
+of the parallel port communication protocols. If you do not know
+what kind of parallel port adapter is used in your drive, you could
+begin by checking the file names and any text files on your DOS
+installation floppy. Alternatively, you can look at the markings on
+the adapter chip itself. That's usually sufficient to identify the
+correct device.
+
+You can actually select all the protocol modules, and allow the PARIDE
+subsystem to try them all for you.
+
+For the "brand-name" products listed above, here are the protocol
+and high-level drivers that you would use:
+
+ ================ ============ ====== ========
+ Manufacturer Model Driver Protocol
+ ================ ============ ====== ========
+ MicroSolutions CD-ROM pcd bpck
+ MicroSolutions PD drive pf bpck
+ MicroSolutions hard-drive pd bpck
+ MicroSolutions 8000t tape pt bpck
+ SyQuest EZ, SparQ pd epat
+ Imation Superdisk pf epat
+ Maxell Superdisk pf friq
+ Avatar Shark pd epat
+ FreeCom CD-ROM pcd frpw
+ Hewlett-Packard 5GB Tape pt epat
+ Hewlett-Packard 7200e (CD) pcd epat
+ Hewlett-Packard 7200e (CD-R) pg epat
+ ================ ============ ====== ========
+
+2.1 Configuring built-in drivers
+---------------------------------
+
+We recommend that you get to know how the drivers work and how to
+configure them as loadable modules, before attempting to compile a
+kernel with the drivers built-in.
+
+If you built all of your PARIDE support directly into your kernel,
+and you have just a single parallel port IDE device, your kernel should
+locate it automatically for you. If you have more than one device,
+you may need to give some command line options to your bootloader
+(e.g. LILO); how to do that is beyond the scope of this document.
+
+The high-level drivers accept a number of command line parameters, all
+of which are documented in the source files in linux/drivers/block/paride.
+By default, each driver will automatically try all parallel ports it
+can find, and all protocol types that have been installed, until it finds
+a parallel port IDE adapter. Once it finds one, the probe stops. So,
+if you have more than one device, you will need to tell the drivers
+how to identify them. This requires specifying the port address, the
+protocol identification number and, for some devices, the drive's
+chain ID. While your system is booting, a number of messages are
+displayed on the console. Like all such messages, they can be
+reviewed with the 'dmesg' command. Among those messages will be
+some lines like::
+
+ paride: bpck registered as protocol 0
+ paride: epat registered as protocol 1
+
+The numbers will always be the same until you build a new kernel with
+different protocol selections. You should note these numbers as you
+will need them to identify the devices.
+
+If you happen to be using a MicroSolutions backpack device, you will
+also need to know the unit ID number for each drive. This is usually
+the last two digits of the drive's serial number (but read MicroSolutions'
+documentation about this).
+
+As an example, let's assume that you have a MicroSolutions PD/CD drive
+with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
+EZ-135 connected to the chained port on the PD/CD drive and also an
+Imation Superdisk connected to port 0x278. You could give the following
+options on your boot command::
+
+ pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
+
+In the last option, pf.drive1 configures device /dev/pf1, the 0x378
+is the parallel port base address, the 0 is the protocol registration
+number and 36 is the chain ID.
+
+Please note: while PARIDE will work both with and without the
+PARPORT parallel port sharing system that is included by the
+"Parallel port support" option, PARPORT must be included and enabled
+if you want to use chains of devices on the same parallel port.
+
+2.2 Loading and configuring PARIDE as modules
+----------------------------------------------
+
+It is much faster and simpler to get to understand the PARIDE drivers
+if you use them as loadable kernel modules.
+
+Note 1:
+ using these drivers with the "kerneld" automatic module loading
+ system is not recommended for beginners, and is not documented here.
+
+Note 2:
+ if you build PARPORT support as a loadable module, PARIDE must
+ also be built as loadable modules, and PARPORT must be loaded before
+ the PARIDE modules.
+
+To use PARIDE, you must begin by::
+
+ insmod paride
+
+this loads a base module which provides a registry for the protocols,
+among other tasks.
+
+Then, load as many of the protocol modules as you think you might need.
+As you load each module, it will register the protocols that it supports,
+and print a log message to your kernel log file and your console. For
+example::
+
+ # insmod epat
+ paride: epat registered as protocol 0
+ # insmod kbic
+ paride: k951 registered as protocol 1
+ paride: k971 registered as protocol 2
+
+Finally, you can load high-level drivers for each kind of device that
+you have connected. By default, each driver will autoprobe for a single
+device, but you can support up to four similar devices by giving their
+individual co-ordinates when you load the driver.
+
+For example, if you had two no-name CD-ROM drives both using the
+KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
+you could give the following command::
+
+ # insmod pcd drive0=0x378,1 drive1=0x3bc,1
+
+For most adapters, giving a port address and protocol number is sufficient,
+but check the source files in linux/drivers/block/paride for more
+information. (Hopefully someone will write some man pages one day !).
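+
+As a hedged illustration, the pcd example above could be made persistent
+with a line like the following in a file under /etc/modprobe.d/ (the
+"options" syntax is general modprobe configuration, not something specific
+to PARIDE). Since the protocol numbers in the examples above follow the
+order in which the protocol modules were loaded, this sketch assumes that
+order is always the same::
+
+ options pcd drive0=0x378,1 drive1=0x3bc,1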
+
+As another example, here's what happens when PARPORT is installed, and
+a SyQuest EZ-135 is attached to port 0x378::
+
+ # insmod paride
+ paride: version 1.0 installed
+ # insmod epat
+ paride: epat registered as protocol 0
+ # insmod pd
+ pd: pd version 1.0, major 45, cluster 64, nice 0
+ pda: Sharing parport1 at 0x378
+ pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
+ pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
+ pda: pda1
+
+Note that the last line is the output from the generic partition table
+scanner - in this case it reports that it has found a disk with one partition.
+
+2.3 Using a PARIDE device
+--------------------------
+
+Once the drivers have been loaded, you can access PARIDE devices in the
+same way as their traditional counterparts. You will probably need to
+create the device "special files". Here is a simple script that you can
+cut to a file and execute::
+
+ #!/bin/bash
+ #
+ # mkd -- a script to create the device special files for the PARIDE subsystem
+ #
+ function mkdev {
+ mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
+ }
+ #
+ function pd {
+ D=$( printf "\\$( printf '%03o' $[ $1 + 97 ] )" )
+ mkdev pd$D b 45 $[ $1 * 16 ]
+ for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+ do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
+ done
+ }
+ #
+ cd /dev
+ #
+ for u in 0 1 2 3 ; do pd $u ; done
+ for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
+ for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
+ for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
+ for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
+ for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
+ #
+ # end of mkd
+
+With the device files and drivers in place, you can access PARIDE devices
+like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
+
+ mount /dev/pcd0 /cdrom
+
+If you have a fresh Avatar Shark cartridge, and the drive is pda, you
+might do something like::
+
+ fdisk /dev/pda -- make a new partition table with
+ partition 1 of type 83
+
+ mke2fs /dev/pda1 -- to build the file system
+
+ mkdir /shark -- make a place to mount the disk
+
+ mount /dev/pda1 /shark
+
+Devices like the Imation superdisk work in the same way, except that
+they do not have a partition table. For example, to make a 120MB
+floppy that you could share with a DOS system::
+
+ mkdosfs /dev/pf0
+ mount /dev/pf0 /mnt
+
+
+2.4 The pf driver
+------------------
+
+The pf driver is intended for use with parallel port ATAPI disk
+devices. The most common devices in this category are PD drives
+and LS-120 drives. Traditionally, media for these devices are not
+partitioned. Consequently, the pf driver does not support partitioned
+media. This may be changed in a future version of the driver.
+
+2.5 Using the pt driver
+------------------------
+
+The pt driver for parallel port ATAPI tape drives is a minimal driver.
+It does not yet support many of the standard tape ioctl operations.
+For best performance, a block size of 32KB should be used. You will
+probably want to set the parallel port delay to 0, if you can.
+
+2.6 Using the pg driver
+------------------------
+
+The pg driver can be used in conjunction with the cdrecord program
+to create CD-ROMs. Please get cdrecord version 1.6.1 or later
+from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
+your parallel port should ideally be set to EPP mode, and the "port delay"
+should be set to 0. With those settings it is possible to record at 2x
+speed without any buffer underruns. If you cannot get the driver to work
+in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
+
+
+3. Troubleshooting
+==================
+
+3.1 Use EPP mode if you can
+----------------------------
+
+The most common problems that people report with the PARIDE drivers
+concern the parallel port CMOS settings. At this time, none of the
+PARIDE protocol modules support ECP mode, or any ECP combination modes.
+If you are able to do so, please set your parallel port into EPP mode
+using your CMOS setup procedure.
+
+3.2 Check the port delay
+-------------------------
+
+Some parallel ports cannot reliably transfer data at full speed. To
+offset the errors, the PARIDE protocol modules introduce a "port
+delay" between each access to the i/o ports. Each protocol sets
+a default value for this delay. In most cases, the user can override
+the default and set it to 0 - resulting in somewhat higher transfer
+rates. In some rare cases (especially with older 486 systems) the
+default delays are not long enough. If you experience corrupt data
+transfers, or unexpected failures, you may wish to increase the
+port delay. The delay can be programmed using the "driveN" parameters
+to each of the high-level drivers. Please see the notes above, or
+read the comments at the beginning of the driver source files in
+linux/drivers/block/paride.
+
+3.3 Some drives need a printer reset
+-------------------------------------
+
+There appear to be a number of "noname" external drives on the market
+that do not always power up correctly. We have noticed this with some
+drives based on OnSpec and older Freecom adapters. In these rare cases,
+the adapter can often be reinitialised by issuing a "printer reset" on
+the parallel port. As the reset operation is potentially disruptive in
+multiple device environments, the PARIDE drivers will not do it
+automatically. You can, however, force a printer reset by doing::
+
+ insmod lp reset=1
+ rmmod lp
+
+If you have one of these marginal cases, you should probably build
+your paride drivers as modules, and arrange to do the printer reset
+before loading the PARIDE drivers.
+
+3.4 Use the verbose option and dmesg if you need help
+------------------------------------------------------
+
+While a lot of testing has gone into these drivers to make them work
+as smoothly as possible, problems will arise. If you do have problems,
+please check all the obvious things first: does the drive work in
+DOS with the manufacturer's drivers? If that doesn't yield any useful
+clues, then please make sure that only one drive is hooked to your system,
+and that either (a) PARPORT is enabled or (b) no other device driver
+is using your parallel port (check in /proc/ioports). Then, load the
+appropriate drivers (you can load several protocol modules if you want)
+as in::
+
+ # insmod paride
+ # insmod epat
+ # insmod bpck
+ # insmod kbic
+ ...
+ # insmod pd verbose=1
+
+(using the correct driver for the type of device you have, of course).
+The verbose=1 parameter will cause the drivers to log a trace of their
+activity as they attempt to locate your drive.
+
+Use 'dmesg' to capture a log of all the PARIDE messages (any messages
+beginning with paride:, a protocol module's name or a driver's name) and
+include that with your bug report. You can submit a bug report in one
+of two ways. Either send it directly to the author of the PARIDE suite,
+by e-mail to grant@torque.net, or join the linux-parport mailing list
+and post your report there.
+
+3.5 For more information or help
+---------------------------------
+
+You can join the linux-parport mailing list by sending a mail message
+to:
+
+ linux-parport-request@torque.net
+
+with the single word::
+
+ subscribe
+
+in the body of the mail message (not in the subject line). Please be
+sure that your mail program is correctly set up when you do this, as
+the list manager is a robot that will subscribe you using the reply
+address in your mail headers. REMOVE any anti-spam gimmicks you may
+have in your mail headers, when sending mail to the list server.
+
+You might also find some useful information on the linux-parport
+web pages (although they are not always up to date) at
+
+ http://web.archive.org/web/%2E/http://www.torque.net/parport/
diff --git a/Documentation/admin-guide/blockdev/ramdisk.rst b/Documentation/admin-guide/blockdev/ramdisk.rst
new file mode 100644
index 0000000..b7c2268
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/ramdisk.rst
@@ -0,0 +1,177 @@
+==========================================
+Using the RAM disk block device with Linux
+==========================================
+
+.. Contents:
+
+ 1) Overview
+ 2) Kernel Command Line Parameters
+ 3) Using "rdev -r"
+ 4) An Example of Creating a Compressed RAM Disk
+
+
+1) Overview
+-----------
+
+The RAM disk driver is a way to use main system memory as a block device. It
+is required for initrd, an initial filesystem used if you need to load modules
+in order to access the root filesystem (see Documentation/admin-guide/initrd.rst). It can
+also be used for a temporary filesystem for crypto work, since the contents
+are erased on reboot.
+
+The RAM disk dynamically grows as more space is required. It does this by using
+RAM from the buffer cache. The driver marks the buffers it is using as dirty
+so that the VM subsystem does not try to reclaim them later.
+
+The RAM disk driver supports up to 16 RAM disks by default, and can be reconfigured
+to support an unlimited number of RAM disks (at your own risk). Just change
+the configuration symbol BLK_DEV_RAM_COUNT in the Block drivers config menu
+and (re)build the kernel.
+
+To use RAM disk support with your system, run './MAKEDEV ram' from the /dev
+directory. RAM disks are all major number 1, and start with minor number 0
+for /dev/ram0, etc. If used, modern kernels use /dev/ram0 for an initrd.
+
+The new RAM disk also has the ability to load compressed RAM disk images,
+allowing one to squeeze more programs onto an average installation or
+rescue floppy disk.
+
+
+2) Parameters
+-------------
+
+2a) Kernel Command Line Parameters
+
+ ramdisk_size=N
+ Size of the ramdisk.
+
+This parameter tells the RAM disk driver to set up RAM disks of N kB in size. The
+default is 4096 (4 MB).
+
+2b) Module parameters
+
+ rd_nr
+ /dev/ramX devices created.
+
+ max_part
+ Maximum partition number.
+
+ rd_size
+ See ramdisk_size.
+
+3) Using "rdev -r"
+------------------
+
+The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
+as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
+to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
+14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
+prompt/wait sequence is to be given before trying to read the RAM disk. Since
+the RAM disk dynamically grows as data is being written into it, a size field
+is not required. Bits 11 to 13 are not currently used and may as well be zero.
+These numbers are no magical secrets, as seen below::
+
+ ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
+ ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
+ ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
+
+Consider a typical two floppy disk setup, where you will have the
+kernel on disk one, and have already put a RAM disk image onto disk #2.
+
+Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
+starts at an offset of 0 kB from the beginning of the floppy.
+The command line equivalent is: "ramdisk_start=0"
+
+You want bit 14 as one, indicating that a RAM disk is to be loaded.
+The command line equivalent is: "load_ramdisk=1"
+
+You want bit 15 as one, indicating that you want a prompt/keypress
+sequence so that you have a chance to switch floppy disks.
+The command line equivalent is: "prompt_ramdisk=1"
+
+Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
+So to create disk one of the set, you would do::
+
+ /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
+ /usr/src/linux# rdev /dev/fd0 /dev/fd0
+ /usr/src/linux# rdev -r /dev/fd0 49152
+
+If you make a boot disk that has LILO, then for the above, you would use::
+
+ append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
+
+Since the default start = 0 and the default prompt = 1, you could use::
+
+ append = "load_ramdisk=1"
+
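+The bit layout described above can be checked with a little shell
+arithmetic. The following is only a sketch: the helper name ``rdev_word``
+is hypothetical and is not part of rdev or of the kernel; it simply
+combines the prompt flag (bit 15), the load flag (bit 14) and the start
+offset in kB (bits 0 to 10) into the decimal word that "rdev -r" expects::
+
+    # rdev_word PROMPT LOAD START_KB  (hypothetical helper, illustration only)
+    rdev_word() {
+        prompt=$1 load=$2 start=$3
+        # The offset must fit into the low 11 bits (mask 0x07FF).
+        if [ "$start" -lt 0 ] || [ "$start" -gt 2047 ]; then
+            echo "start offset out of range: $start" >&2
+            return 1
+        fi
+        echo $(( (prompt << 15) | (load << 14) | start ))
+    }
+
+    rdev_word 1 1 0      # word for the two-floppy setup above
+    rdev_word 1 1 400    # RAM disk image stored 400 kB into the floppy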
+
+4) An Example of Creating a Compressed RAM Disk
+-----------------------------------------------
+
+To create a RAM disk image, you will need a spare block device to
+construct it on. This can be the RAM disk device itself, or an
+unused disk partition (such as an unmounted swap partition). For this
+example, we will use the RAM disk device, "/dev/ram0".
+
+Note: This technique should not be done on a machine with less than 8 MB
+of RAM. If using a spare disk partition instead of /dev/ram0, then this
+restriction does not apply.
+
+a) Decide on the RAM disk size that you want. Say 2 MB for this example.
+ Create it by writing to the RAM disk device. (This step is not currently
+ required, but may be in the future.) It is wise to zero out the
+ area (esp. for disks) so that maximal compression is achieved for
+ the unused blocks of the image that you are about to create::
+
+ dd if=/dev/zero of=/dev/ram0 bs=1k count=2048
+
+b) Make a filesystem on it. Say ext2fs for this example::
+
+ mke2fs -vm0 /dev/ram0 2048
+
+c) Mount it, copy the files you want to it (eg: /etc/* /dev/* ...)
+ and unmount it again.
+
+d) Compress the contents of the RAM disk. The level of compression
+ will be approximately 50% of the space used by the files. Unused
+ space on the RAM disk will compress to almost nothing::
+
+ dd if=/dev/ram0 bs=1k count=2048 | gzip -v9 > /tmp/ram_image.gz
+
+e) Put the kernel onto the floppy::
+
+ dd if=zImage of=/dev/fd0 bs=1k
+
+f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
+ that is slightly larger than the kernel, so that you can put another
+ (possibly larger) kernel onto the same floppy later without overlapping
+ the RAM disk image. An offset of 400 kB for kernels about 350 kB in
+ size would be reasonable. Make sure offset+size of ram_image.gz is
+ not larger than the total space on your floppy (usually 1440 kB)::
+
+ dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
+
+g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
+ For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
+ have 2^15 + 2^14 + 400 = 49552::
+
+ rdev /dev/fd0 /dev/fd0
+ rdev -r /dev/fd0 49552
+
+That is it. You now have your boot/root compressed RAM disk floppy. Some
+users may wish to combine steps (d) and (f) by using a pipe.
+
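+Step (f) asks that offset plus image size stay within the capacity of the
+floppy. A minimal sketch of that check, assuming a 1440 kB floppy and
+the file names used in the steps above; the function name
+``fits_on_floppy`` is made up for this example::
+
+    # fits_on_floppy OFFSET_KB IMAGE_BYTES [FLOPPY_KB]
+    # Succeeds if an image of IMAGE_BYTES written at OFFSET_KB still
+    # fits on a floppy of FLOPPY_KB kilobytes (default 1440).
+    fits_on_floppy() {
+        offset_kb=$1 image_bytes=$2 floppy_kb=${3:-1440}
+        # Round the image size up to whole 1 kB blocks, as dd bs=1k does.
+        image_kb=$(( (image_bytes + 1023) / 1024 ))
+        [ $(( offset_kb + image_kb )) -le "$floppy_kb" ]
+    }
+
+    # Example use before the dd of step (f):
+    #   fits_on_floppy 400 "$(wc -c < /tmp/ram_image.gz)" && \
+    #       dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400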
+
+ Paul Gortmaker 12/95
+
+Changelog:
+----------
+
+10-22-04 :
+ Updated to reflect changes in command line options, remove
+ obsolete references, general cleanup.
+ James Nelson (james4765@gmail.com)
+
+
+12-95 :
+ Original Document
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
new file mode 100644
index 0000000..6eccf13
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -0,0 +1,422 @@
+========================================
+zram: Compressed RAM based block devices
+========================================
+
+Introduction
+============
+
+The zram module creates RAM based block devices named /dev/zram<id>
+(<id> = 0, 1, ...). Pages written to these disks are compressed and stored
+in memory itself. These disks allow very fast I/O and compression provides
+good amounts of memory savings. Some of the usecases include /tmp storage,
+use as swap disks, various caches under /var and maybe many more :)
+
+Statistics for individual zram devices are exported through sysfs nodes at
+/sys/block/zram<id>/
+
+Usage
+=====
+
+There are several ways to configure and manage zram device(-s):
+
+a) using zram and zram_control sysfs attributes
+b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org).
+
+In this document we will describe only 'manual' zram configuration steps,
+IOW, zram and zram_control sysfs attributes.
+
+In order to get a better idea about zramctl please consult util-linux
+documentation, zramctl man-page or `zramctl --help`. Please note that
+zram maintainers do not develop/maintain util-linux or zramctl; should
+you have any questions, please contact util-linux@vger.kernel.org.
+
+The following shows a typical sequence of steps for using zram.
+
+WARNING
+=======
+
+For the sake of simplicity we skip error checking parts in most of the
+examples below. However, it is your sole responsibility to handle errors.
+
+zram sysfs attributes always return negative values in case of errors.
+The list of possible return codes:
+
+======== =============================================================
+-EBUSY   an attempt to modify an attribute that cannot be changed once
+         the device has been initialised. Please reset device first;
+-ENOMEM  zram was not able to allocate enough memory to fulfil your
+         needs;
+-EINVAL  invalid input has been provided.
+======== =============================================================
+
+If you use 'echo', the returned value is set by the 'echo' utility,
+and, in the general case, something like::
+
+ echo 3 > /sys/block/zram0/max_comp_streams
+ if [ $? -ne 0 ]; then
+     handle_error
+ fi
+
+should suffice.
+
+1) Load Module
+==============
+
+::
+
+ modprobe zram num_devices=4
+ This creates 4 devices: /dev/zram{0,1,2,3}
+
+num_devices parameter is optional and tells zram how many devices should be
+pre-created. Default: 1.
+
+2) Set max number of compression streams
+========================================
+
+Regardless of the value passed to this attribute, ZRAM will always
+allocate multiple compression streams - one per online CPU - thus
+allowing several concurrent compression operations. The number of
+allocated compression streams goes down when some of the CPUs
+go offline. There is no single-compression-stream mode anymore,
+unless you are running a UP system or have only 1 CPU online.
+
+To find out how many streams are currently available::
+
+ cat /sys/block/zram0/max_comp_streams
+
+3) Select compression algorithm
+===============================
+
+Using the comp_algorithm device attribute one can see the available and
+currently selected (shown in square brackets) compression algorithms,
+and change the selected compression algorithm (once the device is
+initialised there is no way to change the compression algorithm).
+
+Examples::
+
+ #show supported compression algorithms
+ cat /sys/block/zram0/comp_algorithm
+ lzo [lz4]
+
+ #select lzo compression algorithm
+ echo lzo > /sys/block/zram0/comp_algorithm
+
+For the time being, the `comp_algorithm` content does not necessarily
+show every compression algorithm supported by the kernel. We keep this
+list primarily to simplify device configuration and one can configure
+a new device with a compression algorithm that is not listed in
+`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
+and, if some of the algorithms were built as modules, it's impossible
+to list all of them using, for instance, /proc/crypto or any other
+method. This, however, has an advantage of permitting the usage of
+custom crypto compression modules (implementing S/W or H/W compression).
+
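+Because the selected algorithm is marked with square brackets, a script can
+pick it out of the attribute's output. A small sketch, in which the
+function name ``zram_selected_algorithm`` is an assumption of this
+example and not part of any zram tooling::
+
+    # Print the bracketed (currently selected) entry of a comp_algorithm
+    # line such as "lzo [lz4]".
+    zram_selected_algorithm() {
+        printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
+    }
+
+    # Example use on a live system:
+    #   zram_selected_algorithm "$(cat /sys/block/zram0/comp_algorithm)"
+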
+4) Set Disksize
+===============
+
+Set disk size by writing the value to sysfs node 'disksize'.
+The value can be either in bytes or you can use mem suffixes.
+Examples::
+
+ # Initialize /dev/zram0 with 50MB disksize
+ echo $((50*1024*1024)) > /sys/block/zram0/disksize
+
+ # Using mem suffixes
+ echo 256K > /sys/block/zram0/disksize
+ echo 512M > /sys/block/zram0/disksize
+ echo 1G > /sys/block/zram0/disksize
+
+Note:
+There is little point creating a zram of greater than twice the size of memory
+since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
+size of the disk when not in use so a huge zram is wasteful.
+
+5) Set memory limit: Optional
+=============================
+
+Set memory limit by writing the value to sysfs node 'mem_limit'.
+The value can be either in bytes or you can use mem suffixes.
+In addition, you can change the value at runtime.
+Examples::
+
+ # limit /dev/zram0 with 50MB memory
+ echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
+
+ # Using mem suffixes
+ echo 256K > /sys/block/zram0/mem_limit
+ echo 512M > /sys/block/zram0/mem_limit
+ echo 1G > /sys/block/zram0/mem_limit
+
+ # To disable memory limit
+ echo 0 > /sys/block/zram0/mem_limit
+
+6) Activate
+===========
+
+::
+
+ mkswap /dev/zram0
+ swapon /dev/zram0
+
+ mkfs.ext4 /dev/zram1
+ mount /dev/zram1 /tmp
+
+7) Add/remove zram devices
+==========================
+
+zram provides a control interface, which enables dynamic (on-demand) device
+addition and removal.
+
+In order to add a new /dev/zramX device, perform a read operation on the
+hot_add attribute. This will return either the new device's device id
+(meaning that you can use /dev/zram<id>) or an error code.
+
+Example::
+
+ cat /sys/class/zram-control/hot_add
+ 1
+
+To remove the existing /dev/zramX device (where X is a device id)
+execute::
+
+ echo X > /sys/class/zram-control/hot_remove
+
+8) Stats
+========
+
+Per-device statistics are exported as various nodes under /sys/block/zram<id>/
+
+A brief description of exported device attributes. For more details please
+read Documentation/ABI/testing/sysfs-block-zram.
+
+====================== ====== ===============================================
+Name                   access description
+====================== ====== ===============================================
+disksize               RW     show and set the device's disk size
+initstate              RO     shows the initialization state of the device
+reset                  WO     trigger device reset
+mem_used_max           WO     reset the `mem_used_max` counter (see later)
+mem_limit              WO     specifies the maximum amount of memory ZRAM can
+                              use to store the compressed data
+writeback_limit        WO     specifies the maximum amount of write IO zram
+                              can write out to backing device as 4KB unit
+writeback_limit_enable RW     show and set writeback_limit feature
+max_comp_streams       RW     the number of possible concurrent compress
+                              operations
+comp_algorithm         RW     show and change the compression algorithm
+compact                WO     trigger memory compaction
+debug_stat             RO     this file is used for zram debugging purposes
+backing_dev            RW     set up backend storage for zram to write out
+idle                   WO     mark allocated slot as idle
+====================== ====== ===============================================
+
+
+User space is advised to use the following files to read the device statistics.
+
+File /sys/block/zram<id>/stat
+
+Represents block layer statistics. Read Documentation/block/stat.rst for
+details.
+
+File /sys/block/zram<id>/io_stat
+
+The stat file represents device's I/O statistics not accounted by block
+layer and, thus, not available in zram<id>/stat file. It consists of a
+single line of text and contains the following stats separated by
+whitespace:
+
+ ============= =============================================================
+ failed_reads  The number of failed reads
+ failed_writes The number of failed writes
+ invalid_io    The number of non-page-size-aligned I/O requests
+ notify_free   Depending on device usage scenario it may account
+
+               a) the number of pages freed because of swap slot free
+                  notifications
+               b) the number of pages freed because of
+                  REQ_OP_DISCARD requests sent by bio. The former ones are
+                  sent to a swap block device when a swap slot is freed,
+                  which implies that this disk is being used as a swap disk.
+
+               The latter ones are sent by filesystem mounted with
+               discard option, whenever some data blocks are getting
+               discarded.
+ ============= =============================================================
+
+File /sys/block/zram<id>/mm_stat
+
+The stat file represents device's mm statistics. It consists of a single
+line of text and contains the following stats separated by whitespace:
+
+ ================ =============================================================
+ orig_data_size   uncompressed size of data stored in this disk.
+                  This excludes same-element-filled pages (same_pages) since
+                  no memory is allocated for them.
+                  Unit: bytes
+ compr_data_size  compressed size of data stored in this disk
+ mem_used_total   the amount of memory allocated for this disk. This
+                  includes allocator fragmentation and metadata overhead,
+                  allocated for this disk. So, allocator space efficiency
+                  can be calculated using compr_data_size and this statistic.
+                  Unit: bytes
+ mem_limit        the maximum amount of memory ZRAM can use to store
+                  the compressed data
+ mem_used_max     the maximum amount of memory zram has consumed to
+                  store the data
+ same_pages       the number of same element filled pages written to this disk.
+                  No memory is allocated for such pages.
+ pages_compacted  the number of pages freed during compaction
+ huge_pages       the number of incompressible pages
+ ================ =============================================================
+
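+The remark about allocator efficiency can be turned into numbers. The
+sketch below assumes that the fields appear in the order of the table
+above; the function name ``zram_ratios`` is hypothetical::
+
+    # Print the compression ratio (orig_data_size / compr_data_size) and
+    # the allocator efficiency (compr_data_size / mem_used_total) for one
+    # mm_stat line.
+    zram_ratios() {
+        printf '%s\n' "$1" | awk '{
+            if ($2 == 0 || $3 == 0) { print "n/a"; exit }
+            printf "ratio=%.2f efficiency=%.2f\n", $1 / $2, $2 / $3
+        }'
+    }
+
+    # Example use on a live system:
+    #   zram_ratios "$(cat /sys/block/zram0/mm_stat)"
+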
+File /sys/block/zram<id>/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+
+ ============== =============================================================
+ bd_count       size of data written in backing device.
+                Unit: 4K bytes
+ bd_reads       the number of reads from backing device
+                Unit: 4K bytes
+ bd_writes      the number of writes to backing device
+                Unit: 4K bytes
+ ============== =============================================================
+
+9) Deactivate
+=============
+
+::
+
+ swapoff /dev/zram0
+ umount /dev/zram1
+
+10) Reset
+=========
+
+ Write any positive value to 'reset' sysfs node::
+
+ echo 1 > /sys/block/zram0/reset
+ echo 1 > /sys/block/zram1/reset
+
+ This frees all the memory allocated for the given device and
+ resets the disksize to zero. You must set the disksize again
+ before reusing the device.
+
+Optional Feature
+================
+
+writeback
+---------
+
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible pages
+to backing storage rather than keeping them in memory.
+To use the feature, the admin should set up the backing device via::
+
+ echo /dev/sda5 > /sys/block/zramX/backing_dev
+
+before disksize setting. It supports only partitions at the moment.
+If the admin wants to use incompressible page writeback, they can do so via::
+
+ echo huge > /sys/block/zramX/writeback
+
+To use idle page writeback, the user first needs to declare zram pages
+as idle::
+
+ echo all > /sys/block/zramX/idle
+
+From now on, all pages on zram are idle pages. The idle mark
+is removed only when someone requests access to the block.
+IOW, unless there is an access request, those pages remain idle pages.
+
+The admin can request writeback of those idle pages at the right time via::
+
+ echo idle > /sys/block/zramX/writeback
+
+With this command, zram writes back idle pages from memory to the storage.
+
+If the backing device is a flash device, a lot of write IO can lead to
+flash wearout, so the admin needs to design a write limit
+to guarantee storage health for the entire product life.
+
+To address this concern, zram supports the "writeback_limit" feature.
+The default value of "writeback_limit_enable" is 0, so writeback is not
+limited by default. IOW, if the admin wants to apply a writeback budget,
+they should enable writeback_limit_enable via::
+
+ $ echo 1 > /sys/block/zramX/writeback_limit_enable
+
+Once writeback_limit_enable is set, zram doesn't allow any writeback
+until the admin sets the budget via /sys/block/zramX/writeback_limit.
+
+(If the admin doesn't enable writeback_limit_enable, the value assigned
+via /sys/block/zramX/writeback_limit is meaningless.)
+
+If the admin wants to limit writeback to 400M per day, they could do it
+like below::
+
+ $ MB_SHIFT=20
+ $ SHIFT_4K=12
+ $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
+     /sys/block/zram0/writeback_limit
+ $ echo 1 > /sys/block/zram0/writeback_limit_enable
+
+If the admin wants to allow further writes once the budget is exhausted,
+they could do it like below::
+
+ $ echo $((400<<MB_SHIFT>>SHIFT_4K)) > \
+     /sys/block/zram0/writeback_limit
+
+If the admin wants to see the remaining writeback budget since it was last set::
+
+ $ cat /sys/block/zramX/writeback_limit
+
+If the admin wants to disable the writeback limit, they could do::
+
+ $ echo 0 > /sys/block/zramX/writeback_limit_enable
+
+The writeback_limit count is reset whenever zram is reset (e.g., by a
+system reboot or echo 1 > /sys/block/zramX/reset), so it is the user's job
+to keep track of how much writeback happened before the reset in order to
+allocate any extra writeback budget in the next setting.
+
+If the admin wants to measure the writeback count over a certain period,
+they can read it from the 3rd column of /sys/block/zram0/bd_stat.
+
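+One way to measure writeback over a period is to sample bd_stat twice and
+subtract. A sketch, assuming the three-column layout described in the
+bd_stat table; the function name ``zram_writeback_delta`` is made up for
+illustration::
+
+    # Print how many bytes were written back between two bd_stat samples.
+    # bd_writes is the 3rd column and is counted in 4K units.
+    zram_writeback_delta() {
+        before=$(printf '%s\n' "$1" | awk '{ print $3 }')
+        after=$(printf '%s\n' "$2" | awk '{ print $3 }')
+        echo $(( (after - before) * 4096 ))
+    }
+
+    # Example use on a live system:
+    #   s1=$(cat /sys/block/zram0/bd_stat); sleep 3600
+    #   s2=$(cat /sys/block/zram0/bd_stat)
+    #   zram_writeback_delta "$s1" "$s2"
+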
+memory tracking
+---------------
+
+With CONFIG_ZRAM_MEMORY_TRACKING, the user can obtain information about
+zram blocks. It could be useful to catch cold or incompressible
+pages of a process with pagemap.
+
+If you enable the feature, you can see the block state via
+/sys/kernel/debug/zram/zram0/block_state. The output is as follows::
+
+ 300 75.033841 .wh.
+ 301 63.806904 s...
+ 302 63.806919 ..hi
+
+First column
+ zram's block index.
+Second column
+ access time since the system was booted
+Third column
+ state of the block:
+
+ s:
+ same page
+ w:
+ written page to backing store
+ h:
+ huge page
+ i:
+ idle page
+
+The first line of the above example says that the 300th block was accessed
+at 75.033841 sec and that the block is a huge page which has been written
+back to the backing storage. This is a debugging feature, so nobody should
+rely on it to work properly.
+
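+The state column lends itself to simple filtering. A sketch that lists the
+blocks carrying a given flag, assuming the three-column format shown
+above; ``zram_blocks_with`` is a hypothetical helper name::
+
+    # zram_blocks_with FLAG  -- reads block_state lines on stdin and prints
+    # the index of every block whose state column contains FLAG (s, w, h or i).
+    zram_blocks_with() {
+        awk -v flag="$1" 'index($3, flag) { print $1 }'
+    }
+
+    # Example use on a live system:
+    #   zram_blocks_with i < /sys/kernel/debug/zram/zram0/block_state
+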
+Nitin Gupta
+ngupta@vflare.org
diff --git a/Documentation/admin-guide/btmrvl.rst b/Documentation/admin-guide/btmrvl.rst
new file mode 100644
index 0000000..ec57740
--- /dev/null
+++ b/Documentation/admin-guide/btmrvl.rst
@@ -0,0 +1,124 @@
+=============
+btmrvl driver
+=============
+
+All commands are used via debugfs interface.
+
+Set/get driver configurations
+=============================
+
+Path: /debug/btmrvl/config/
+
+gpiogap=[n], hscfgcmd
+ These commands are used to configure the host sleep parameters::
+
+     bit 7:0  -- Gap
+     bit 15:8 -- GPIO
+
+ where GPIO is the pin number of GPIO used to wake up the host.
+ It could be any valid GPIO pin# (e.g. 0-7) or 0xff (SDIO interface
+ wakeup will be used instead).
+
+ where Gap is the gap in milliseconds between wakeup signal and
+ wakeup event, or 0xff for special host sleep setting.
+
+ Usage::
+
+ # Use SDIO interface to wake up the host and set GAP to 0x80:
+ echo 0xff80 > /debug/btmrvl/config/gpiogap
+ echo 1 > /debug/btmrvl/config/hscfgcmd
+
+ # Use GPIO pin #3 to wake up the host and set GAP to 0xff:
+ echo 0x03ff > /debug/btmrvl/config/gpiogap
+ echo 1 > /debug/btmrvl/config/hscfgcmd
+
+psmode=[n], pscmd
+ These commands are used to enable/disable auto sleep mode
+
+ where the option is::
+
+ 1 -- Enable auto sleep mode
+ 0 -- Disable auto sleep mode
+
+ Usage::
+
+ # Enable auto sleep mode
+ echo 1 > /debug/btmrvl/config/psmode
+ echo 1 > /debug/btmrvl/config/pscmd
+
+ # Disable auto sleep mode
+ echo 0 > /debug/btmrvl/config/psmode
+ echo 1 > /debug/btmrvl/config/pscmd
+
+
+hsmode=[n], hscmd
+ These commands are used to enable host sleep or wake up firmware
+
+ where the option is::
+
+ 1 -- Enable host sleep
+ 0 -- Wake up firmware
+
+ Usage::
+
+ # Enable host sleep
+ echo 1 > /debug/btmrvl/config/hsmode
+ echo 1 > /debug/btmrvl/config/hscmd
+
+ # Wake up firmware
+ echo 0 > /debug/btmrvl/config/hsmode
+ echo 1 > /debug/btmrvl/config/hscmd
+
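+The gpiogap values used in the examples above pack the GPIO number into
+the high byte and the gap into the low byte. A sketch of that
+arithmetic; the helper name ``btmrvl_gpiogap`` is hypothetical and is
+not provided by the driver::
+
+    # btmrvl_gpiogap GPIO GAP  -- print the value to write to gpiogap
+    btmrvl_gpiogap() {
+        printf '0x%04x\n' $(( (($1 & 0xff) << 8) | ($2 & 0xff) ))
+    }
+
+    btmrvl_gpiogap 0xff 0x80    # SDIO wakeup, gap 0x80
+    btmrvl_gpiogap 3 0xff       # GPIO pin 3, special host sleep setting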
+
+Get driver status
+=================
+
+Path: /debug/btmrvl/status/
+
+Usage::
+
+ cat /debug/btmrvl/status/<args>
+
+where the args are:
+
+curpsmode
+ This command displays current auto sleep status.
+
+psstate
+ This command displays the power save state.
+
+hsstate
+ This command displays the host sleep state.
+
+txdnldrdy
+ This command displays the value of Tx download ready flag.
+
+Issuing a raw hci command
+=========================
+
+Use hcitool to issue raw hci commands; refer to the hcitool manual.
+
+Usage::
+
+ hcitool cmd <ogf> <ocf> [Parameters]
+
+Interface Control Command::
+
+ hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface
+ hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface
+ hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface
+ hcitool cmd 0x3f 0x5b 0xf5 0x00 0x00 --Disable All interface
+ hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface
+ hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface
+
+SD8688 firmware
+===============
+
+Images:
+
+- /lib/firmware/sd8688_helper.bin
+- /lib/firmware/sd8688.bin
+
+
+The images can be downloaded from:
+
+git.infradead.org/users/dwmw2/linux-firmware.git/libertas/
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
index f278b28..44b8a4e 100644
--- a/Documentation/admin-guide/bug-hunting.rst
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -90,9 +90,9 @@
run a null modem to a second machine and capture the output there
using your favourite communication program. Minicom works well.
-(3) Use Kdump (see Documentation/kdump/kdump.txt),
+(3) Use Kdump (see Documentation/admin-guide/kdump/kdump.rst),
extract the kernel ring buffer from old memory with using dmesg
- gdbmacro in Documentation/kdump/gdbmacros.txt.
+ gdbmacro in Documentation/admin-guide/kdump/gdbmacros.txt.
Finding the bug's location
--------------------------
diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
new file mode 100644
index 0000000..36d43ae
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
@@ -0,0 +1,296 @@
+===================
+Block IO Controller
+===================
+
+Overview
+========
+cgroup subsys "blkio" implements the block io controller. There seems to be
+a need for various kinds of IO control policies (like proportional BW, max BW)
+both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
+The plan is to use the same cgroup based management interface for the blkio
+controller and to switch IO policies in the background based on user options.
+
+One IO control policy is throttling policy which can be used to
+specify upper IO rate limits on devices. This policy is implemented in
+generic block layer and can be used on leaf nodes as well as higher
+level logical devices like device mapper.
+
+HOWTO
+=====
+Throttling/Upper Limit policy
+-----------------------------
+- Enable Block IO controller::
+
+ CONFIG_BLK_CGROUP=y
+
+- Enable throttling in block layer::
+
+ CONFIG_BLK_DEV_THROTTLING=y
+
+- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
+
+ mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
+
+- Specify a bandwidth rate on particular device for root group. The format
+ for policy is "<major>:<minor> <bytes_per_second>"::
+
+ echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
+
+ Above will put a limit of 1MB/second on reads happening for root group
+ on device having major/minor number 8:16.
+
+- Run dd to read a file and see if rate is throttled to 1MB/s or not::
+
+ # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
+ 1024+0 records in
+ 1024+0 records out
+ 4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
+
+ Limits for writes can be put using blkio.throttle.write_bps_device file.
+
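+Since the limit is given in bytes per second, a small helper can build the
+rule string from a more convenient unit. A sketch; the function name
+``blkio_throttle_rule`` is an assumption of this example::
+
+    # blkio_throttle_rule MAJOR MINOR MB_PER_SEC
+    # Print a "<major>:<minor> <bytes_per_second>" rule.
+    blkio_throttle_rule() {
+        echo "$1:$2 $(( $3 * 1024 * 1024 ))"
+    }
+
+    # Example use, equivalent to the 1MB/second limit above:
+    #   blkio_throttle_rule 8 16 1 > \
+    #       /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
+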
+Hierarchical Cgroups
+====================
+
+Throttling implements hierarchy support; however,
+throttling's hierarchy support is enabled iff "sane_behavior" is
+enabled from cgroup side, which currently is a development option and
+not publicly available.
+
+If somebody creates a hierarchy as follows::
+
+           root
+           /  \
+        test1 test2
+           |
+        test3
+
+Throttling with "sane_behavior" will handle the
+hierarchy correctly. For throttling, all limits apply
+to the whole subtree while all statistics are local to the IOs
+directly generated by tasks in that cgroup.
+
+Throttling without "sane_behavior" enabled from cgroup side will
+practically treat all groups as being at the same level, as if the
+hierarchy looked like the following::
+
+             pivot
+          /  /   \  \
+        root  test1 test2  test3
+
+Various user visible config options
+===================================
+CONFIG_BLK_CGROUP
+ - Block IO controller.
+
+CONFIG_BFQ_CGROUP_DEBUG
+ - Debug help. Right now some additional stats files show up in cgroup
+ if this option is enabled.
+
+CONFIG_BLK_DEV_THROTTLING
+ - Enable block device throttling support in block layer.
+
+Details of cgroup files
+=======================
+Proportional weight policy files
+--------------------------------
+- blkio.weight
+ - Specifies per cgroup weight. This is default weight of the group
+ on all the devices until and unless overridden by per device rule.
+ (See blkio.weight_device).
+ Currently allowed range of weights is from 10 to 1000.
+
+- blkio.weight_device
+ - One can specify per cgroup per device rules using this interface.
+ These rules override the default value of group weight as specified
+ by blkio.weight.
+
+ Following is the format::
+
+ # echo dev_maj:dev_minor weight > blkio.weight_device
+
+ Configure weight=300 on /dev/sdb (8:16) in this cgroup::
+
+ # echo 8:16 300 > blkio.weight_device
+ # cat blkio.weight_device
+ dev weight
+ 8:16 300
+
+ Configure weight=500 on /dev/sda (8:0) in this cgroup::
+
+ # echo 8:0 500 > blkio.weight_device
+ # cat blkio.weight_device
+ dev weight
+ 8:0 500
+ 8:16 300
+
+ Remove specific weight for /dev/sda in this cgroup::
+
+ # echo 8:0 0 > blkio.weight_device
+ # cat blkio.weight_device
+ dev weight
+ 8:16 300
+
+- blkio.time
+ - disk time allocated to cgroup per device in milliseconds. First
+ two fields specify the major and minor number of the device and
+ third field specifies the disk time allocated to group in
+ milliseconds.
+
+- blkio.sectors
+ - number of sectors transferred to/from disk by the group. First
+ two fields specify the major and minor number of the device and
+ third field specifies the number of sectors transferred by the
+ group to/from the device.
+
+- blkio.io_service_bytes
+ - Number of bytes transferred to/from the disk by the group. These
+ are further divided by the type of operation - read or write, sync
+ or async. First two fields specify the major and minor number of the
+ device, third field specifies the operation type and the fourth field
+ specifies the number of bytes.
+
+- blkio.io_serviced
+ - Number of IOs (bio) issued to the disk by the group. These
+ are further divided by the type of operation - read or write, sync
+ or async. First two fields specify the major and minor number of the
+ device, third field specifies the operation type and the fourth field
+ specifies the number of IOs.
+
+- blkio.io_service_time
+ - Total amount of time between request dispatch and request completion
+ for the IOs done by this cgroup. This is in nanoseconds to make it
+ meaningful for flash devices too. For devices with queue depth of 1,
+ this time represents the actual service time. When queue_depth > 1,
+ that is no longer true as requests may be served out of order. This
+ may cause the service time for a given IO to include the service time
+ of multiple IOs when served out of order which may result in total
+ io_service_time > actual time elapsed. This time is further divided by
+ the type of operation - read or write, sync or async. First two fields
+ specify the major and minor number of the device, third field
+ specifies the operation type and the fourth field specifies the
+ io_service_time in ns.
+
+- blkio.io_wait_time
+ - Total amount of time the IOs for this cgroup spent waiting in the
+ scheduler queues for service. This can be greater than the total time
+ elapsed since it is cumulative io_wait_time for all IOs. It is not a
+ measure of total time the cgroup spent waiting but rather a measure of
+ the wait_time for its individual IOs. For devices with queue_depth > 1
+ this metric does not include the time spent waiting for service once
+ the IO is dispatched to the device but till it actually gets serviced
+ (there might be a time lag here due to re-ordering of requests by the
+ device). This is in nanoseconds to make it meaningful for flash
+ devices too. This time is further divided by the type of operation -
+ read or write, sync or async. First two fields specify the major and
+ minor number of the device, third field specifies the operation type
+ and the fourth field specifies the io_wait_time in ns.
+
+- blkio.io_merged
+ - Total number of bios/requests merged into requests belonging to this
+ cgroup. This is further divided by the type of operation - read or
+ write, sync or async.
+
+- blkio.io_queued
+ - Total number of requests queued up at any given instant for this
+ cgroup. This is further divided by the type of operation - read or
+ write, sync or async.
+
+- blkio.avg_queue_size
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ The average queue size for this cgroup over the entire time of this
+ cgroup's existence. Queue size samples are taken each time one of the
+ queues of this cgroup gets a timeslice.
+
+- blkio.group_wait_time
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ This is the amount of time the cgroup had to wait since it became busy
+ (i.e., went from 0 to 1 request queued) to get a timeslice for one of
+ its queues. This is different from the io_wait_time which is the
+ cumulative total of the amount of time spent by each IO in that cgroup
+ waiting in the scheduler queue. This is in nanoseconds. If this is
+ read when the cgroup is in a waiting (for timeslice) state, the stat
+ will only report the group_wait_time accumulated till the last time it
+ got a timeslice and will not include the current delta.
+
+- blkio.empty_time
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ This is the amount of time a cgroup spends without any pending
+ requests when not being served, i.e., it does not include any time
+ spent idling for one of the queues of the cgroup. This is in
+ nanoseconds. If this is read when the cgroup is in an empty state,
+ the stat will only report the empty_time accumulated till the last
+ time it had a pending request and will not include the current delta.
+
+- blkio.idle_time
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+ This is the amount of time spent by the IO scheduler idling for a
+ given cgroup in anticipation of a better request than the existing ones
+ from other queues/cgroups. This is in nanoseconds. If this is read
+ when the cgroup is in an idling state, the stat will only report the
+ idle_time accumulated till the last idle period and will not include
+ the current delta.
+
+- blkio.dequeue
+ - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
+ gives statistics about how many times a group was dequeued from
+ the service tree of the device. First two fields specify the major
+ and minor number of the device and third field specifies the number
+ of times a group was dequeued from a particular device.
+
+- blkio.*_recursive
+ - Recursive version of various stats. These files show the
+ same information as their non-recursive counterparts but
+ include stats from all the descendant cgroups.
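+
+Several of the statistics files above share the layout "major:minor
+operation value". The following is a minimal Python sketch of a parser
+for that layout, not part of the kernel interface. The function name
+``parse_blkio_stat`` and the sample values are hypothetical, and the
+operation names in the sample are only illustrative.
+
```python
from collections import defaultdict


def parse_blkio_stat(text):
    """Parse 'major:minor operation value' lines into a nested dict."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        # Keep only three-token per-device records: the device token
        # (major:minor), the operation type, and the value.
        if len(fields) != 3 or ":" not in fields[0]:
            continue
        device, operation, value = fields
        major, minor = (int(part) for part in device.split(":"))
        stats[(major, minor)][operation] = int(value)
    return dict(stats)


# Made-up sample in the layout described above (values in nanoseconds).
sample = """\
8:0 Read 1500000
8:0 Write 250000
8:0 Sync 1600000
8:0 Async 150000
8:16 Read 42
"""

parsed = parse_blkio_stat(sample)
print(parsed[(8, 0)]["Read"])
```
+
+Keying the result by (major, minor) keeps per-device values separate,
+which matches how the files report them.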
+
+Throttling/Upper limit policy files
+-----------------------------------
+- blkio.throttle.read_bps_device
+ - Specifies upper limit on READ rate from the device. IO rate is
+ specified in bytes per second. Rules are per device. Following is
+ the format::
+
+ echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
+
+- blkio.throttle.write_bps_device
+ - Specifies upper limit on WRITE rate to the device. IO rate is
+ specified in bytes per second. Rules are per device. Following is
+ the format::
+
+ echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
+
+- blkio.throttle.read_iops_device
+ - Specifies upper limit on READ rate from the device. IO rate is
+ specified in IO per second. Rules are per device. Following is
+ the format::
+
+ echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
+
+- blkio.throttle.write_iops_device
+ - Specifies upper limit on WRITE rate to the device. IO rate is
+ specified in IO per second. Rules are per device. Following is
+ the format::
+
+ echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
+
+Note: If both BW and IOPS rules are specified for a device, then IO is
+ subjected to both the constraints.
+
+- blkio.throttle.io_serviced
+ - Number of IOs (bio) issued to the disk by the group. These
+ are further divided by the type of operation - read or write, sync
+ or async. First two fields specify the major and minor number of the
+ device, third field specifies the operation type and the fourth field
+ specifies the number of IOs.
+
+- blkio.throttle.io_service_bytes
+ - Number of bytes transferred to/from the disk by the group. These
+ are further divided by the type of operation - read or write, sync
+ or async. First two fields specify the major and minor number of the
+ device, third field specifies the operation type and the fourth field
+ specifies the number of bytes.
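+
+The rule format accepted by the four blkio.throttle.*_device files above
+can be sketched in Python as follows. This is only an illustration: the
+helper names ``throttle_rule`` and ``apply_throttle_rule`` are
+hypothetical, the device numbers are made up, and the cgroup directory
+(``/cgrp`` in the examples above) is assumed to be mounted already.
+
```python
def throttle_rule(major, minor, rate):
    """Return the '<major>:<minor> <rate>' string the files expect."""
    for name, value in (("major", major), ("minor", minor), ("rate", rate)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
    return f"{major}:{minor} {rate}"


def apply_throttle_rule(cgroup_dir, filename, major, minor, rate):
    """Write one rule to e.g. <cgroup_dir>/blkio.throttle.read_bps_device."""
    with open(f"{cgroup_dir}/{filename}", "w") as f:
        f.write(throttle_rule(major, minor, rate))


# A 1 MiB/s limit on device 8:16 (hypothetical device numbers).
print(throttle_rule(8, 16, 1024 * 1024))
```
+
+The same rule string serves the bps and iops files; only the meaning of
+the rate (bytes per second or IOs per second) differs.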
+
+Common files among various policies
+-----------------------------------
+- blkio.reset_stats
+ - Writing an int to this file will result in resetting all the stats
+ for that cgroup.
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
new file mode 100644
index 0000000..b068801
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -0,0 +1,695 @@
+==============
+Control Groups
+==============
+
+Written by Paul Menage <menage@google.com> based on
+Documentation/admin-guide/cgroup-v1/cpusets.rst
+
+Original copyright statements from cpusets.txt:
+
+Portions Copyright (C) 2004 BULL SA.
+
+Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+
+Modified by Paul Jackson <pj@sgi.com>
+
+Modified by Christoph Lameter <cl@linux.com>
+
+.. CONTENTS:
+
+ 1. Control Groups
+ 1.1 What are cgroups ?
+ 1.2 Why are cgroups needed ?
+ 1.3 How are cgroups implemented ?
+ 1.4 What does notify_on_release do ?
+ 1.5 What does clone_children do ?
+ 1.6 How do I use cgroups ?
+ 2. Usage Examples and Syntax
+ 2.1 Basic Usage
+ 2.2 Attaching processes
+ 2.3 Mounting hierarchies by name
+ 3. Kernel API
+ 3.1 Overview
+ 3.2 Synchronization
+ 3.3 Subsystem API
+ 4. Extended attribute usage
+ 5. Questions
+
+1. Control Groups
+=================
+
+1.1 What are cgroups ?
+----------------------
+
+Control Groups provide a mechanism for aggregating/partitioning sets of
+tasks, and all their future children, into hierarchical groups with
+specialized behaviour.
+
+Definitions:
+
+A *cgroup* associates a set of tasks with a set of parameters for one
+or more subsystems.
+
+A *subsystem* is a module that makes use of the task grouping
+facilities provided by cgroups to treat groups of tasks in
+particular ways. A subsystem is typically a "resource controller" that
+schedules a resource or applies per-cgroup limits, but it may be
+anything that wants to act on a group of processes, e.g. a
+virtualization subsystem.
+
+A *hierarchy* is a set of cgroups arranged in a tree, such that
+every task in the system is in exactly one of the cgroups in the
+hierarchy, and a set of subsystems; each subsystem has
+subsystem-specific state attached to each cgroup in the hierarchy.
+Each hierarchy has an instance of the cgroup virtual filesystem
+associated with it.
+
+At any one time there may be multiple active hierarchies of task
+cgroups. Each hierarchy is a partition of all tasks in the system.
+
+User-level code may create and destroy cgroups by name in an
+instance of the cgroup virtual file system, specify and query to
+which cgroup a task is assigned, and list the task PIDs assigned to
+a cgroup. Those creations and assignments only affect the hierarchy
+associated with that instance of the cgroup file system.
+
+On their own, the only use for cgroups is for simple job
+tracking. The intention is that other subsystems hook into the generic
+cgroup support to provide new attributes for cgroups, such as
+accounting/limiting the resources which processes in a cgroup can
+access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow
+you to associate a set of CPUs and a set of memory nodes with the
+tasks in each cgroup.
+
+1.2 Why are cgroups needed ?
+----------------------------
+
+There are multiple efforts to provide process aggregations in the
+Linux kernel, mainly for resource-tracking purposes. Such efforts
+include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server
+namespaces. These all require the basic notion of a
+grouping/partitioning of processes, with newly forked processes ending
+up in the same group (cgroup) as their parent process.
+
+The kernel cgroup patch provides the minimum essential kernel
+mechanisms required to efficiently implement such groups. It has
+minimal impact on the system fast paths, and provides hooks for
+specific subsystems such as cpusets to provide additional behaviour as
+desired.
+
+Multiple hierarchy support is provided to allow for situations where
+the division of tasks into cgroups is distinctly different for
+different subsystems - having parallel hierarchies allows each
+hierarchy to be a natural division of tasks, without having to handle
+complex combinations of tasks that would be present if several
+unrelated subsystems needed to be forced into the same tree of
+cgroups.
+
+At one extreme, each resource controller or subsystem could be in a
+separate hierarchy; at the other extreme, all subsystems
+would be attached to the same hierarchy.
+
+As an example of a scenario (originally proposed by vatsa@in.ibm.com)
+that can benefit from multiple hierarchies, consider a large
+university server with various users - students, professors, system
+tasks etc. The resource planning for this server could be along the
+following lines::
+
+ CPU : "Top cpuset"
+ / \
+ CPUSet1 CPUSet2
+ | |
+ (Professors) (Students)
+
+ In addition (system tasks) are attached to topcpuset (so
+ that they can run anywhere) with a limit of 20%
+
+ Memory : Professors (50%), Students (30%), system (20%)
+
+ Disk : Professors (50%), Students (30%), system (20%)
+
+ Network : WWW browsing (20%), Network File System (60%), others (20%)
+ / \
+ Professors (15%) students (5%)
+
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
+into the NFS network class.
+
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
+depending on who launched it (prof/student).
+
+With the ability to classify tasks differently for different resources
+(by putting those resource subsystems in different hierarchies),
+the admin can easily set up a script which receives exec notifications
+and depending on who is launching the browser he can::
+
+ # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
+
+With only a single hierarchy, he now would potentially have to create
+a separate cgroup for every browser launched and associate it with
+the appropriate network and other resource classes. This may lead to
+a proliferation of such cgroups.
+
+Also let's say that the administrator would like to give enhanced network
+access temporarily to a student's browser (since it is night and the user
+wants to do online gaming :)) OR give one of the student's simulation
+apps enhanced CPU power.
+
+With the ability to write PIDs directly to resource classes, it's just a
+matter of::
+
+ # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
+ (after some time)
+ # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
+
+Without this ability, the administrator would have to split the cgroup into
+multiple separate ones and then associate the new cgroups with the
+new resource classes.
+
+
+
+1.3 How are cgroups implemented ?
+---------------------------------
+
+Control Groups extend the kernel as follows:
+
+ - Each task in the system has a reference-counted pointer to a
+ css_set.
+
+ - A css_set contains a set of reference-counted pointers to
+ cgroup_subsys_state objects, one for each cgroup subsystem
+ registered in the system. There is no direct link from a task to
+ the cgroup of which it's a member in each hierarchy, but this
+ can be determined by following pointers through the
+ cgroup_subsys_state objects. This is because accessing the
+ subsystem state is something that's expected to happen frequently
+ and in performance-critical code, whereas operations that require a
+ task's actual cgroup assignments (in particular, moving between
+ cgroups) are less common. A linked list runs through the cg_list
+ field of each task_struct using the css_set, anchored at
+ css_set->tasks.
+
+ - A cgroup hierarchy filesystem can be mounted for browsing and
+ manipulation from user space.
+
+ - You can list all the tasks (by PID) attached to any cgroup.
+
+The implementation of cgroups requires a few simple hooks
+into the rest of the kernel, none in performance-critical paths:
+
+ - in init/main.c, to initialize the root cgroups and initial
+ css_set at system boot.
+
+ - in fork and exit, to attach and detach a task from its css_set.
+
+In addition, a new file system of type "cgroup" may be mounted, to
+enable browsing and modifying the cgroups presently known to the
+kernel. When mounting a cgroup hierarchy, you may specify a
+comma-separated list of subsystems to mount as the filesystem mount
+options. By default, mounting the cgroup filesystem attempts to
+mount a hierarchy containing all registered subsystems.
+
+If an active hierarchy with exactly the same set of subsystems already
+exists, it will be reused for the new mount. If no existing hierarchy
+matches, and any of the requested subsystems are in use in an existing
+hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
+is activated, associated with the requested subsystems.
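+
+The three outcomes described above can be modelled with a short Python
+sketch. It is a toy model of the decision only, not the kernel's
+implementation; the function name ``resolve_mount`` and the result
+labels are hypothetical.
+
```python
def resolve_mount(requested, active_hierarchies):
    """Decide what mounting the `requested` subsystems would do.

    `active_hierarchies` is a list of sets of subsystem names, one set
    per active hierarchy. Returns ("reuse", index), ("ebusy", None) or
    ("new", None).
    """
    requested = frozenset(requested)
    # An active hierarchy with exactly the same subsystems is reused.
    for index, subsystems in enumerate(active_hierarchies):
        if frozenset(subsystems) == requested:
            return ("reuse", index)
    # No exact match: any subsystem already in use means -EBUSY.
    for subsystems in active_hierarchies:
        if requested & frozenset(subsystems):
            return ("ebusy", None)
    # Otherwise a new hierarchy is activated.
    return ("new", None)


active = [{"cpuset"}, {"cpu", "cpuacct"}]
print(resolve_mount({"cpu", "cpuacct"}, active))
print(resolve_mount({"cpu"}, active))
print(resolve_mount({"memory"}, active))
```
+
+The exact-match check has to come first, since an exact match also
+overlaps with an existing hierarchy.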
+
+It's not currently possible to bind a new subsystem to an active
+cgroup hierarchy, or to unbind a subsystem from an active cgroup
+hierarchy. This may be possible in future, but is fraught with nasty
+error-recovery issues.
+
+When a cgroup filesystem is unmounted, if there are any
+child cgroups created below the top-level cgroup, that hierarchy
+will remain active even though unmounted; if there are no
+child cgroups then the hierarchy will be deactivated.
+
+No new system calls are added for cgroups - all support for
+querying and modifying cgroups is via this cgroup file system.
+
+Each task under /proc has an added file named 'cgroup' displaying,
+for each active hierarchy, the subsystem names and the cgroup name
+as the path relative to the root of the cgroup file system.
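+
+As an illustration, the per-task 'cgroup' file can be parsed with a few
+lines of Python. This sketch assumes each line has three colon-separated
+fields (a hierarchy ID, a comma-separated subsystem list, and the cgroup
+path); the function name ``parse_proc_cgroup`` and the sample content
+are hypothetical.
+
```python
def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only, so that a path containing
        # ':' stays intact.
        hierarchy_id, subsystems, path = line.split(":", 2)
        entries.append({
            "hierarchy_id": int(hierarchy_id),
            "subsystems": subsystems.split(",") if subsystems else [],
            "path": path,
        })
    return entries


# Made-up sample; on a live system read /proc/self/cgroup instead.
sample = "4:cpuset:/Charlie\n3:cpu,cpuacct:/\n"
for entry in parse_proc_cgroup(sample):
    print(entry)
```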
+
+Each cgroup is represented by a directory in the cgroup file system
+containing the following files describing that cgroup:
+
+ - tasks: list of tasks (by PID) attached to that cgroup. This list
+ is not guaranteed to be sorted. Writing a thread ID into this file
+ moves the thread into this cgroup.
+ - cgroup.procs: list of thread group IDs in the cgroup. This list is
+ not guaranteed to be sorted or free of duplicate TGIDs, and userspace
+ should sort/uniquify the list if this property is required.
+ Writing a thread group ID into this file moves all threads in that
+ group into this cgroup.
+ - notify_on_release flag: run the release agent on exit?
+ - release_agent: the path to use for release notifications (this file
+ exists in the top cgroup only)
+
+Other subsystems such as cpusets may add additional files in each
+cgroup dir.
+
+New cgroups are created using the mkdir system call or shell
+command. The properties of a cgroup, such as its flags, are
+modified by writing to the appropriate file in that cgroup's
+directory, as listed above.
+
+The named hierarchical structure of nested cgroups allows partitioning
+a large system into nested, dynamically changeable, "soft-partitions".
+
+The attachment of each task, automatically inherited at fork by any
+children of that task, to a cgroup allows organizing the work load
+on a system into related sets of tasks. A task may be re-attached to
+any other cgroup, if allowed by the permissions on the necessary
+cgroup file system directories.
+
+When a task is moved from one cgroup to another, it gets a new
+css_set pointer - if there's an already existing css_set with the
+desired collection of cgroups then that css_set is reused, otherwise a new
+css_set is allocated. The appropriate existing css_set is located by
+looking into a hash table.
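+
+The find-or-allocate behaviour described above can be illustrated with
+a toy Python model. It is not the kernel data structure: the class name
+``CssSetTable``, the dict-based "css_set" and the explicit refcount are
+simplifications made up for this sketch.
+
```python
class CssSetTable:
    """Toy model: one shared css_set per distinct combination of cgroups."""

    def __init__(self):
        # Maps a hashable key built from the cgroup combination to the
        # css_set object shared by all tasks with that combination.
        self._table = {}

    def find_or_create(self, cgroups_by_hierarchy):
        key = tuple(sorted(cgroups_by_hierarchy.items()))
        css_set = self._table.get(key)
        if css_set is None:
            # No existing css_set for this combination: allocate one.
            css_set = {"cgroups": dict(cgroups_by_hierarchy), "refcount": 0}
            self._table[key] = css_set
        css_set["refcount"] += 1
        return css_set


table = CssSetTable()
a = table.find_or_create({"cpuset": "/Charlie", "memory": "/"})
b = table.find_or_create({"memory": "/", "cpuset": "/Charlie"})
c = table.find_or_create({"cpuset": "/", "memory": "/"})
print(a is b, a is c, a["refcount"])
```
+
+Two tasks with the same cgroup in every hierarchy end up sharing one
+object, which is what keeps the per-task cost down to a single pointer.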
+
+To allow access from a cgroup to the css_sets (and hence tasks)
+that comprise it, a set of cg_cgroup_link objects form a lattice;
+each cg_cgroup_link is linked into a list of cg_cgroup_links for
+a single cgroup on its cgrp_link_list field, and a list of
+cg_cgroup_links for a single css_set on its cg_link_list.
+
+Thus the set of tasks in a cgroup can be listed by iterating over
+each css_set that references the cgroup, and sub-iterating over
+each css_set's task set.
+
+The use of a Linux virtual file system (vfs) to represent the
+cgroup hierarchy provides for a familiar permission and name space
+for cgroups, with a minimum of additional kernel code.
+
+1.4 What does notify_on_release do ?
+------------------------------------
+
+If the notify_on_release flag is enabled (1) in a cgroup, then
+whenever the last task in the cgroup leaves (exits or attaches to
+some other cgroup) and the last child cgroup of that cgroup
+is removed, then the kernel runs the command specified by the contents
+of the "release_agent" file in that hierarchy's root directory,
+supplying the pathname (relative to the mount point of the cgroup
+file system) of the abandoned cgroup. This enables automatic
+removal of abandoned cgroups. The default value of
+notify_on_release in the root cgroup at system boot is disabled
+(0). The default value of other cgroups at creation is the current
+value of their parents' notify_on_release settings. The default value of
+a cgroup hierarchy's release_agent path is empty.
+
+1.5 What does clone_children do ?
+---------------------------------
+
+This flag only affects the cpuset controller. If the clone_children
+flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its
+configuration from the parent during initialization.
+
+1.6 How do I use cgroups ?
+--------------------------
+
+To start a new job that is to be contained within a cgroup, using
+the "cpuset" cgroup subsystem, the steps are something like::
+
+ 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
+ 2) mkdir /sys/fs/cgroup/cpuset
+ 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+ 4) Create the new cgroup by doing mkdir's and write's (or echo's) in
+ the /sys/fs/cgroup/cpuset virtual file system.
+ 5) Start a task that will be the "founding father" of the new job.
+ 6) Attach that task to the new cgroup by writing its PID to the
+ /sys/fs/cgroup/cpuset tasks file for that cgroup.
+ 7) fork, exec or clone the job tasks from this founding father task.
+
+For example, the following sequence of commands will set up a cgroup
+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
+and then start a subshell 'sh' in that cgroup::
+
+ mount -t tmpfs cgroup_root /sys/fs/cgroup
+ mkdir /sys/fs/cgroup/cpuset
+ mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
+ cd /sys/fs/cgroup/cpuset
+ mkdir Charlie
+ cd Charlie
+ /bin/echo 2-3 > cpuset.cpus
+ /bin/echo 1 > cpuset.mems
+ /bin/echo $$ > tasks
+ sh
+ # The subshell 'sh' is now running in cgroup Charlie
+ # The next line should display '/Charlie'
+ cat /proc/self/cgroup
+
+2. Usage Examples and Syntax
+============================
+
+2.1 Basic Usage
+---------------
+
+Creating, modifying, and using cgroups can be done through the cgroup
+virtual filesystem.
+
+To mount a cgroup hierarchy with all available subsystems, type::
+
+ # mount -t cgroup xxx /sys/fs/cgroup
+
+The "xxx" is not interpreted by the cgroup code, but will appear in
+/proc/mounts so may be any useful identifying string that you like.
+
+Note: Some subsystems do not work without some user input first. For instance,
+if cpusets are enabled the user will have to populate the cpus and mems files
+for each new cgroup created before that group can be used.
+
+As explained in section `1.2 Why are cgroups needed?` you should create
+different hierarchies of cgroups for each single resource or group of
+resources you want to control. Therefore, you should mount a tmpfs on
+/sys/fs/cgroup and create directories for each cgroup resource or resource
+group::
+
+ # mount -t tmpfs cgroup_root /sys/fs/cgroup
+ # mkdir /sys/fs/cgroup/rg1
+
+To mount a cgroup hierarchy with just the cpuset and memory
+subsystems, type::
+
+ # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
+
+While remounting cgroups is currently supported, using it is not
+recommended. Remounting allows changing the bound subsystems and
+release_agent. Rebinding is hardly useful as it only works when the
+hierarchy is empty, and release_agent itself should be replaced with
+conventional fsnotify. The support for remounting will be removed in
+the future.
+
+To specify a hierarchy's release_agent::
+
+ # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+ xxx /sys/fs/cgroup/rg1
+
+Note that specifying 'release_agent' more than once will return failure.
+
+Note that changing the set of subsystems is currently only supported
+when the hierarchy consists of a single (root) cgroup. Supporting
+the ability to arbitrarily bind/unbind subsystems from an existing
+cgroup hierarchy is intended to be implemented in the future.
+
+Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
+tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
+is the cgroup that holds the whole system.
+
+If you want to change the value of release_agent::
+
+ # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
+
+It can also be changed via remount.
+
+If you want to create a new cgroup under /sys/fs/cgroup/rg1::
+
+ # cd /sys/fs/cgroup/rg1
+ # mkdir my_cgroup
+
+Now you want to do something with this cgroup::
+
+ # cd my_cgroup
+
+In this directory you can find several files::
+
+ # ls
+ cgroup.procs notify_on_release tasks
+ (plus whatever files are added by the attached subsystems)
+
+Now attach your shell to this cgroup::
+
+ # /bin/echo $$ > tasks
+
+You can also create cgroups inside your cgroup by using mkdir in this
+directory::
+
+ # mkdir my_sub_cs
+
+To remove a cgroup, just use rmdir::
+
+ # rmdir my_sub_cs
+
+This will fail if the cgroup is in use (has cgroups inside, or
+has processes attached, or is held alive by other subsystem-specific
+reference).
+
+2.2 Attaching processes
+-----------------------
+
+::
+
+ # /bin/echo PID > tasks
+
+Note that it is PID, not PIDs. You can only attach ONE task at a time.
+If you have several tasks to attach, you have to do it one after another::
+
+ # /bin/echo PID1 > tasks
+ # /bin/echo PID2 > tasks
+ ...
+ # /bin/echo PIDn > tasks
+
+You can attach the current shell task by echoing 0::
+
+ # echo 0 > tasks
+
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the PID of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
+Note: Since every task is always a member of exactly one cgroup in each
+mounted hierarchy, to remove a task from its current cgroup you must
+move it into a new cgroup (possibly the root cgroup) by writing to the
+new cgroup's tasks file.
+
+Note: Due to some restrictions enforced by some cgroup subsystems, moving
+a process to another cgroup can fail.
+
+2.3 Mounting hierarchies by name
+--------------------------------
+
+Passing the name=<x> option when mounting a cgroup hierarchy
+associates the given name with the hierarchy. This can be used when
+mounting a pre-existing hierarchy, in order to refer to it by name
+rather than by its set of active subsystems. Each hierarchy is either
+nameless, or has a unique name.
+
+The name should match [\w.-]+
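+
+A quick way to check a candidate name against this pattern is sketched
+below in Python. The helper name ``is_valid_hierarchy_name`` is
+hypothetical, and restricting ``\w`` to ASCII is an assumption made
+here, since Python's default ``\w`` also matches non-ASCII word
+characters.
+
```python
import re

# ASCII-only \w is an assumption of this sketch, not stated by the text.
HIERARCHY_NAME = re.compile(r"[\w.-]+", re.ASCII)


def is_valid_hierarchy_name(name):
    """Return True if the whole name matches [\\w.-]+."""
    return HIERARCHY_NAME.fullmatch(name) is not None


for candidate in ("systemd", "my-hier.v1", "bad name", ""):
    print(repr(candidate), is_valid_hierarchy_name(candidate))
```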
+
+When passing a name=<x> option for a new hierarchy, you need to
+specify subsystems manually; the legacy behaviour of mounting all
+subsystems when none are explicitly specified is not supported when
+you give a hierarchy a name.
+
+The name of the hierarchy appears as part of the hierarchy description
+in /proc/mounts and /proc/<pid>/cgroup.
+
+
+3. Kernel API
+=============
+
+3.1 Overview
+------------
+
+Each kernel subsystem that wants to hook into the generic cgroup
+system needs to create a cgroup_subsys object. This contains
+various methods, which are callbacks from the cgroup system, along
+with a subsystem ID which will be assigned by the cgroup system.
+
+Other fields in the cgroup_subsys object include:
+
+- subsys_id: a unique array index for the subsystem, indicating which
+ entry in cgroup->subsys[] this subsystem should be managing.
+
+- name: should be initialized to a unique subsystem name. Should be
+ no longer than MAX_CGROUP_TYPE_NAMELEN.
+
+- early_init: indicate if the subsystem needs early initialization
+ at system boot.
+
+Each cgroup object created by the system has an array of pointers,
+indexed by subsystem ID; each pointer is entirely managed by its
+subsystem; the generic cgroup code will never touch it.
+
+3.2 Synchronization
+-------------------
+
+There is a global mutex, cgroup_mutex, used by the cgroup
+system. This should be taken by anything that wants to modify a
+cgroup. It may also be taken to prevent cgroups from being
+modified, but more specific locks may be more appropriate in that
+situation.
+
+See kernel/cgroup/cgroup.c for more details.
+
+Subsystems can take/release the cgroup_mutex via the functions
+cgroup_lock()/cgroup_unlock().
+
+Accessing a task's cgroup pointer may be done in the following ways:
+
+- while holding cgroup_mutex
+- while holding the task's alloc_lock (via task_lock())
+- inside an rcu_read_lock() section via rcu_dereference()
+
+3.3 Subsystem API
+-----------------
+
+Each subsystem should:
+
+- add an entry in linux/cgroup_subsys.h
+- define a cgroup_subsys object called <name>_cgrp_subsys
+
+Each subsystem may export the following methods. The only mandatory
+methods are css_alloc/free. Any others that are null are presumed to
+be successful no-ops.
+
+``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
+(cgroup_mutex held by caller)
+
+Called to allocate a subsystem state object for a cgroup. The
+subsystem should allocate its subsystem state object for the passed
+cgroup, returning a pointer to the new object on success or an
+ERR_PTR() value on failure. On success, the subsystem pointer should
+point to a structure of type cgroup_subsys_state (typically embedded
+in a larger subsystem-specific object), which will be initialized by the
+cgroup system. Note that this will be called at initialization to
+create the root subsystem state for this subsystem; this case can be
+identified by the passed cgroup object having a NULL parent (since
+it's the root of the hierarchy) and may be an appropriate place for
+initialization code.
+
+``int css_online(struct cgroup *cgrp)``
+(cgroup_mutex held by caller)
+
+Called after @cgrp has successfully completed all allocations and has
+been made visible to cgroup_for_each_child/descendant_*() iterators. The
+subsystem may choose to fail creation by returning -errno. This
+callback can be used to implement reliable state sharing and
+propagation along the hierarchy. See the comment on
+cgroup_for_each_descendant_pre() for details.
+
+``void css_offline(struct cgroup *cgrp)``
+(cgroup_mutex held by caller)
+
+This is the counterpart of css_online() and is called if and only if
+css_online() has succeeded on @cgrp. This signifies the beginning of
+the end of @cgrp. @cgrp is being removed and the subsystem should start
+dropping all references it's holding on @cgrp. When all references are
+dropped, cgroup removal will proceed to the next step - css_free().
+After this callback, @cgrp should be considered dead to the subsystem.
+
+``void css_free(struct cgroup *cgrp)``
+(cgroup_mutex held by caller)
+
+The cgroup system is about to free @cgrp; the subsystem should free
+its subsystem state object. By the time this method is called, @cgrp
+is completely unused; @cgrp->parent is still valid. (Note - can also
+be called for a newly-created cgroup if an error occurs after this
+subsystem's css_alloc() method has been called for the new cgroup).
+
+``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
+(cgroup_mutex held by caller)
+
+Called prior to moving one or more tasks into a cgroup; if the
+subsystem returns an error, this will abort the attach operation.
+@tset contains the tasks to be attached and is guaranteed to have at
+least one task in it.
+
+If there are multiple tasks in the taskset, then:
+ - it's guaranteed that all are from the same thread group
+ - @tset contains all tasks from the thread group whether or not
+ they're switching cgroups
+ - the first task is the leader
+
+Each @tset entry also contains the task's old cgroup and tasks which
+aren't switching cgroup can be skipped easily using the
+cgroup_taskset_for_each() iterator. Note that this isn't called on a
+fork. If this method returns 0 (success) then this should remain valid
+while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in future.
+
+``void css_reset(struct cgroup_subsys_state *css)``
+(cgroup_mutex held by caller)
+
+An optional operation which should restore @css's configuration to the
+initial state. This is currently only used on the unified hierarchy
+when a subsystem is disabled on a cgroup through
+"cgroup.subtree_control" but should remain enabled because other
+subsystems depend on it. cgroup core makes such a css invisible by
+removing the associated interface files and invokes this callback so
+that the hidden subsystem can return to the initial neutral state.
+This prevents unexpected resource control from a hidden css and
+ensures that the configuration is in the initial state when it is made
+visible again later.
+
+``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has side effects should provide this
+function so that it can roll them back; otherwise it is not necessary.
+This will be called only for subsystems whose can_attach() operation has
+succeeded. The parameters are identical to can_attach().
+
+``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
+(cgroup_mutex held by caller)
+
+Called after the task has been attached to the cgroup, to allow any
+post-attachment activity that requires memory allocations or blocking.
+The parameters are identical to can_attach().
+
+``void fork(struct task_struct *task)``
+
+Called when a task is forked into a cgroup.
+
+``void exit(struct task_struct *task)``
+
+Called during task exit.
+
+``void free(struct task_struct *task)``
+
+Called when the task_struct is freed.
+
+``void bind(struct cgroup *root)``
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsystem is rebound to a different hierarchy
+and root cgroup. Currently this will only involve movement between
+the default hierarchy (which never has sub-cgroups) and a hierarchy
+that is being created/destroyed (and hence has no sub-cgroups).
+
+4. Extended attribute usage
+===========================
+
+cgroup filesystem supports certain types of extended attributes in its
+directories and files. The currently supported types are:
+
+ - Trusted (XATTR_TRUSTED)
+ - Security (XATTR_SECURITY)
+
+Both require CAP_SYS_ADMIN capability to set.
+
+Like in tmpfs, the extended attributes in cgroup filesystem are stored
+using kernel memory and it's advised to keep the usage to a minimum. This
+is the reason why user-defined extended attributes are not supported, since
+any user could set them and there's no limit on the value size.
+
+The current known users for this feature are SELinux to limit cgroup usage
+in containers and systemd for assorted metadata like main PID in a cgroup
+(systemd creates a cgroup per service).
+
+5. Questions
+============
+
+::
+
+ Q: What's up with this '/bin/echo'?
+ A: bash's builtin 'echo' command does not check calls to write() against
+ errors. If you use it in the cgroup file system, you won't be
+ able to tell whether a command succeeded or failed.
+
+ Q: When I attach processes, only the first PID on the line actually gets attached!
+ A: We can only return one error code per call to write(). So you should
+ put only ONE PID per write.
diff --git a/Documentation/admin-guide/cgroup-v1/cpuacct.rst b/Documentation/admin-guide/cgroup-v1/cpuacct.rst
new file mode 100644
index 0000000..d30ed81
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cpuacct.rst
@@ -0,0 +1,50 @@
+=========================
+CPU Accounting Controller
+=========================
+
+The CPU accounting controller is used to group tasks using cgroups and
+account the CPU usage of these groups of tasks.
+
+The CPU accounting controller supports multi-hierarchy groups. An accounting
+group accumulates the CPU usage of all of its child groups and the tasks
+directly present in its group.
+
+Accounting groups can be created by first mounting the cgroup filesystem::
+
+ # mount -t cgroup -ocpuacct none /sys/fs/cgroup
+
+With the above step, the initial or the parent accounting group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
+by this group which is essentially the CPU time obtained by all the tasks
+in the system.
+
+New accounting groups can be created under the parent group /sys/fs/cgroup::
+
+ # cd /sys/fs/cgroup
+ # mkdir g1
+ # echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it. CPU time consumed by this bash and its children
+can be obtained from g1/cpuacct.usage and the same is accumulated in
+/sys/fs/cgroup/cpuacct.usage also.
+
+cpuacct.stat file lists a few statistics which further divide the
+CPU time obtained by the cgroup into user and system times. Currently
+the following statistics are supported:
+
+- user: Time spent by tasks of the cgroup in user mode.
+- system: Time spent by tasks of the cgroup in kernel mode.
+
+The user and system values are in USER_HZ units.
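+
+As a minimal sketch of the unit conversion, the following Python snippet
+turns tick counts in the cpuacct.stat layout into seconds. The function name
+``parse_cpuacct_stat`` and the sample values are made up for illustration;
+querying ``SC_CLK_TCK`` is assumed to be an acceptable way to obtain USER_HZ
+from user space.
+

```python
import os


def parse_cpuacct_stat(text, user_hz=None):
    """Convert 'user N' / 'system N' tick counts into seconds."""
    if user_hz is None:
        # Assumption: SC_CLK_TCK reports the USER_HZ value used by the kernel.
        user_hz = os.sysconf("SC_CLK_TCK")
    seconds = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, ticks = fields
        seconds[name] = int(ticks) / user_hz
    return seconds


# Made-up sample content, for illustration only.
sample = "user 250\nsystem 50\n"
print(parse_cpuacct_stat(sample, user_hz=100))
```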
+
+The cpuacct controller uses the percpu_counter interface to collect user and
+system times. This has two side effects:
+
+- It is theoretically possible to see wrong values for user and system times.
+ This is because percpu_counter_read() on 32-bit systems isn't safe
+ against concurrent writes.
+- It is possible to see slightly outdated values for user and system times
+ due to the batch processing nature of percpu_counter.
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
new file mode 100644
index 0000000..86a6ae9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -0,0 +1,866 @@
+=======
+CPUSETS
+=======
+
+Copyright (C) 2004 BULL SA.
+
+Written by Simon.Derr@bull.net
+
+- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+- Modified by Paul Jackson <pj@sgi.com>
+- Modified by Christoph Lameter <cl@linux.com>
+- Modified by Paul Menage <menage@google.com>
+- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
+
+.. CONTENTS:
+
+ 1. Cpusets
+ 1.1 What are cpusets ?
+ 1.2 Why are cpusets needed ?
+ 1.3 How are cpusets implemented ?
+ 1.4 What are exclusive cpusets ?
+ 1.5 What is memory_pressure ?
+ 1.6 What is memory spread ?
+ 1.7 What is sched_load_balance ?
+ 1.8 What is sched_relax_domain_level ?
+ 1.9 How do I use cpusets ?
+ 2. Usage Examples and Syntax
+ 2.1 Basic Usage
+ 2.2 Adding/removing cpus
+ 2.3 Setting flags
+ 2.4 Attaching processes
+ 3. Questions
+ 4. Contact
+
+1. Cpusets
+==========
+
+1.1 What are cpusets ?
+----------------------
+
+Cpusets provide a mechanism for assigning a set of CPUs and Memory
+Nodes to a set of tasks. In this document "Memory Node" refers to
+an on-line node that contains memory.
+
+Cpusets constrain the CPU and Memory placement of tasks to only
+the resources within a task's current cpuset. They form a nested
+hierarchy visible in a virtual file system. These are the essential
+hooks, beyond what is already present, required to manage dynamic
+job placement on large systems.
+
+Cpusets use the generic cgroup subsystem described in
+Documentation/admin-guide/cgroup-v1/cgroups.rst.
+
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that task's cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset. The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting task's mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the cgroup
+virtual file system, manage the attributes and permissions of these
+cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
+specify and query to which cpuset a task is assigned, and list the
+task pids assigned to a cpuset.
+
+
+1.2 Why are cpusets needed ?
+----------------------------
+
+The management of large computer systems, with many processors (CPUs),
+complex memory cache hierarchies and multiple Memory Nodes having
+non-uniform access times (NUMA) presents additional challenges for
+the efficient scheduling and memory placement of processes.
+
+Frequently more modest sized systems can be operated with adequate
+efficiency just by letting the operating system automatically share
+the available CPU and Memory resources amongst the requesting tasks.
+
+But larger systems, which benefit more from careful processor and
+memory placement to reduce memory access times and contention,
+and which typically represent a larger investment for the customer,
+can benefit from explicitly placing jobs on properly sized subsets of
+the system.
+
+This can be especially valuable on:
+
+ * Web Servers running multiple instances of the same web application,
+ * Servers running different applications (for instance, a web server
+ and a database), or
+ * NUMA systems running large HPC applications with demanding
+ performance characteristics.
+
+These subsets, or "soft partitions", must be able to be dynamically
+adjusted, as the job mix changes, without impacting other concurrently
+executing jobs. The location of the running jobs' pages may also be moved
+when the memory locations are changed.
+
+The kernel cpuset patch provides the minimum essential kernel
+mechanisms required to efficiently implement such subsets. It
+leverages existing CPU and Memory Placement facilities in the Linux
+kernel to avoid any additional impact on the critical scheduler or
+memory allocator code.
+
+
+1.3 How are cpusets implemented ?
+---------------------------------
+
+Cpusets provide a Linux kernel mechanism to constrain which CPUs and
+Memory Nodes are used by a process or set of processes.
+
+The Linux kernel already has a pair of mechanisms to specify on which
+CPUs a task may be scheduled (sched_setaffinity) and on which Memory
+Nodes it may obtain memory (mbind, set_mempolicy).
+
+Cpusets extend these two mechanisms as follows:
+
+ - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
+ kernel.
+ - Each task in the system is attached to a cpuset, via a pointer
+ in the task structure to a reference counted cgroup structure.
+ - Calls to sched_setaffinity are filtered to just those CPUs
+ allowed in that task's cpuset.
+ - Calls to mbind and set_mempolicy are filtered to just
+ those Memory Nodes allowed in that task's cpuset.
+ - The root cpuset contains all the system's CPUs and Memory
+ Nodes.
+ - For any cpuset, one can define child cpusets containing a subset
+ of the parent's CPU and Memory Node resources.
+ - The hierarchy of cpusets can be mounted at /dev/cpuset, for
+ browsing and manipulation from user space.
+ - A cpuset may be marked exclusive, which ensures that no other
+ cpuset (except direct ancestors and descendants) may contain
+ any overlapping CPUs or Memory Nodes.
+ - You can list all the tasks (by pid) attached to any cpuset.
+
+The implementation of cpusets requires a few, simple hooks
+into the rest of the kernel, none in performance critical paths:
+
+ - in init/main.c, to initialize the root cpuset at system boot.
+ - in fork and exit, to attach and detach a task from its cpuset.
+ - in sched_setaffinity, to mask the requested CPUs by what's
+ allowed in that task's cpuset.
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
+ the CPUs allowed by their cpuset, if possible.
+ - in the mbind and set_mempolicy system calls, to mask the requested
+ Memory Nodes by what's allowed in that task's cpuset.
+ - in page_alloc.c, to restrict memory to allowed nodes.
+ - in vmscan.c, to restrict page recovery to the current cpuset.
+
+You should mount the "cgroup" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel. No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.
+
+The /proc/<pid>/status file for each task has four added lines,
+displaying the task's cpus_allowed (on which CPUs it may be scheduled)
+and mems_allowed (on which Memory Nodes it may obtain memory),
+in the two formats seen in the following example::
+
+ Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
+ Cpus_allowed_list: 0-127
+ Mems_allowed: ffffffff,ffffffff
+ Mems_allowed_list: 0-63
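+
+The two formats carry the same information. As a hedged illustration, the
+Python sketch below expands both into sets of CPU (or node) numbers. The
+helper names are hypothetical, and the sketch assumes that the commas in the
+mask merely group hexadecimal digits into 32-bit words, most significant
+word first.
+

```python
def parse_cpu_list(text):
    """Expand a list such as '0-3,8' into a set of integers."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            ids.update(range(int(first), int(last) + 1))
        else:
            ids.add(int(part))
    return ids


def parse_cpu_mask(text):
    """Expand a comma-separated hex mask into a set of bit positions."""
    # Assumption: commas only group digits into 32-bit words, and the
    # leftmost word holds the highest-numbered bits.
    value = int(text.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


mask = parse_cpu_mask("ffffffff,ffffffff,ffffffff,ffffffff")
print(mask == parse_cpu_list("0-127"))
```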
+
+Each cpuset is represented by a directory in the cgroup file system
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
+
+ - cpuset.cpus: list of CPUs in that cpuset
+ - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
+ - cpuset.cpu_exclusive flag: is cpu placement exclusive?
+ - cpuset.mem_exclusive flag: is memory placement exclusive?
+ - cpuset.mem_hardwall flag: is memory allocation hardwalled?
+ - cpuset.memory_pressure: measure of how much paging pressure is in the cpuset
+ - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - cpuset.sched_relax_domain_level: the searching range when migrating tasks
+
+In addition, only the root cpuset has the following file:
+
+ - cpuset.memory_pressure_enabled flag: compute memory_pressure?
+
+New cpusets are created using the mkdir system call or shell
+command. The properties of a cpuset, such as its flags, allowed
+CPUs and Memory Nodes, and attached tasks, are modified by writing
+to the appropriate file in that cpuset's directory, as listed above.
+
+The named hierarchical structure of nested cpusets allows partitioning
+a large system into nested, dynamically changeable, "soft-partitions".
+
+The attachment of each task, automatically inherited at fork by any
+children of that task, to a cpuset allows organizing the work load
+on a system into related sets of tasks such that each set is constrained
+to using the CPUs and Memory Nodes of a particular cpuset. A task
+may be re-attached to any other cpuset, if allowed by the permissions
+on the necessary cpuset file system directories.
+
+Such management of a system "in the large" integrates smoothly with
+the detailed placement done on individual tasks and memory regions
+using the sched_setaffinity, mbind and set_mempolicy system calls.
+
+The following rules apply to each cpuset:
+
+ - Its CPUs and Memory Nodes must be a subset of its parent's.
+ - It can't be marked exclusive unless its parent is.
+ - If its cpu or memory is exclusive, they may not overlap any sibling.
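+
+The three rules above can be modelled in a few lines of Python. This is only
+a toy model, not the kernel's validation code: the ``Cpuset`` class and the
+``violations`` function are hypothetical names, and the sketch assumes that
+an overlap is forbidden when either of the two siblings is exclusive.
+

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    cpus: set
    mems: set
    cpu_exclusive: bool = False
    mem_exclusive: bool = False
    children: list = field(default_factory=list)


def violations(parent, child):
    """Return the rules that `child` would break underneath `parent`."""
    broken = []
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not (child.cpus <= parent.cpus and child.mems <= parent.mems):
        broken.append("not a subset of parent")
    # Rule 2: a cpuset can't be exclusive unless its parent is.
    if (child.cpu_exclusive and not parent.cpu_exclusive) or (
        child.mem_exclusive and not parent.mem_exclusive
    ):
        broken.append("exclusive without exclusive parent")
    # Rule 3: exclusive resources may not overlap any sibling.
    # Assumption: exclusivity on either side forbids the overlap.
    for sibling in parent.children:
        if sibling is child:
            continue
        if (child.cpu_exclusive or sibling.cpu_exclusive) and child.cpus & sibling.cpus:
            broken.append("cpu overlap with sibling")
        if (child.mem_exclusive or sibling.mem_exclusive) and child.mems & sibling.mems:
            broken.append("mem overlap with sibling")
    return broken


root = Cpuset({0, 1, 2, 3}, {0, 1}, cpu_exclusive=True, mem_exclusive=True)
first = Cpuset({0, 1}, {0}, cpu_exclusive=True)
root.children.append(first)
print(violations(root, Cpuset({1, 2}, {1})))
```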
+
+These rules, and the natural hierarchy of cpusets, enable efficient
+enforcement of the exclusive guarantee, without having to scan all
+cpusets every time any of them change to ensure nothing overlaps an
+exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
+to represent the cpuset hierarchy provides for a familiar permission
+and name space for cpusets, with a minimum of additional kernel code.
+
+The cpus and mems files in the root (top_cpuset) cpuset are
+read-only. The cpus file automatically tracks the value of
+cpu_online_mask using a CPU hotplug notifier, and the mems file
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
+nodes with memory--using the cpuset_track_online_nodes() hook.
+
+
+1.4 What are exclusive cpusets ?
+--------------------------------
+
+If a cpuset is cpu or mem exclusive, no other cpuset, other than
+a direct ancestor or descendant, may share any of the same CPUs or
+Memory Nodes.
+
+A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
+i.e. it restricts kernel allocations for page, buffer and other data
+commonly shared by the kernel across multiple users. All cpusets,
+whether hardwalled or not, restrict allocations of memory for user
+space. This enables configuring a system so that several independent
+jobs can share common kernel data, such as file system pages, while
+isolating each job's user allocation in its own cpuset. To do this,
+construct a large mem_exclusive cpuset to hold all the jobs, and
+construct child, non-mem_exclusive cpusets for each individual job.
+Only a small amount of typical kernel memory, such as requests from
+interrupt handlers, is allowed to be taken outside even a
+mem_exclusive cpuset.
+
+
+1.5 What is memory_pressure ?
+-----------------------------
+The memory_pressure of a cpuset provides a simple per-cpuset metric
+of the rate at which the tasks in a cpuset are attempting to free up
+in-use memory on the nodes of the cpuset to satisfy additional memory
+requests.
+
+This enables batch managers monitoring jobs running in dedicated
+cpusets to efficiently detect what level of memory pressure that job
+is causing.
+
+This is useful both on tightly managed systems running a wide mix of
+submitted jobs, which may choose to terminate or re-prioritize jobs that
+are trying to use more memory than allowed on the nodes assigned to them,
+and with tightly coupled, long running, massively parallel scientific
+computing jobs that will dramatically fail to meet required performance
+goals if they start to use more memory than allowed to them.
+
+This mechanism provides a very economical way for the batch manager
+to monitor a cpuset for signs of memory pressure. It's up to the
+batch manager or other user code to decide what to do about it and
+take action.
+
+==>
+ Unless this feature is enabled by writing "1" to the special file
+ /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
+ code of __alloc_pages() for this metric reduces to simply noticing
+ that the cpuset_memory_pressure_enabled flag is zero. So only
+ systems that enable this feature will compute the metric.
+
+Why a per-cpuset, running average:
+
+ Because this meter is per-cpuset, rather than per-task or mm,
+ the system load imposed by a batch scheduler monitoring this
+ metric is sharply reduced on large systems, because a scan of
+ the tasklist can be avoided on each set of queries.
+
+ Because this meter is a running average, instead of an accumulating
+ counter, a batch scheduler can detect memory pressure with a
+ single read, instead of having to read and accumulate results
+ for a period of time.
+
+ Because this meter is per-cpuset rather than per-task or mm,
+ the batch scheduler can obtain the key information, memory
+ pressure in a cpuset, with a single read, rather than having to
+ query and accumulate results over all the (dynamically changing)
+ set of tasks in the cpuset.
+
+A per-cpuset simple digital filter (requires a spinlock and 3 words
+of data per-cpuset) is kept, and updated by any task attached to that
+cpuset, if it enters the synchronous (direct) page reclaim code.
+
+A per-cpuset file provides an integer number representing the recent
+(half-life of 10 seconds) rate of direct page reclaims caused by
+the tasks in the cpuset, in units of reclaims attempted per second,
+times 1000.
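+
+To illustrate what a running average with a 10 second half-life reports, here
+is a floating-point sketch in Python. It is not the kernel's filter, which
+works on integers; the class name ``DecayingRate`` and its methods are
+hypothetical. Each event adds a fixed increment chosen so that a steady rate
+of r events per second settles near r * 1000.
+

```python
import math


class DecayingRate:
    """Event rate with exponential decay, reported as events/second * 1000."""

    def __init__(self, half_life=10.0, scale=1000.0):
        self.decay = math.log(2) / half_life  # per-second decay constant
        self.scale = scale
        self.value = 0.0
        self.last = 0.0

    def _age(self, now):
        # Decay the stored value for the time elapsed since the last update.
        self.value *= math.exp(-self.decay * (now - self.last))
        self.last = now

    def mark(self, now):
        """Record one event (for example one direct reclaim) at time `now`."""
        self._age(now)
        self.value += self.decay * self.scale

    def read(self, now):
        """Return the decayed value as of time `now`."""
        self._age(now)
        return self.value


meter = DecayingRate()
meter.mark(0.0)
before = meter.read(0.0)
after = meter.read(10.0)  # one half-life later, the value has halved
print(before, after)
```
+
+A single read of such a value is enough to judge recent pressure, which is
+the property the text above relies on.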
+
+
+1.6 What is memory spread ?
+---------------------------
+There are two boolean flag files per cpuset that control where the
+kernel allocates pages for the file system buffers and related
+in-kernel data structures. They are called 'cpuset.memory_spread_page' and
+'cpuset.memory_spread_slab'.
+
+If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
+the kernel will spread the file system buffers (page cache) evenly
+over all the nodes that the faulting task is allowed to use, instead
+of preferring to put those pages on the node where the task is running.
+
+If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
+then the kernel will spread some file system related slab caches,
+such as for inodes and dentries, evenly over all the nodes that the
+faulting task is allowed to use, instead of preferring to put those
+pages on the node where the task is running.
+
+The setting of these flags does not affect anonymous data segment or
+stack segment pages of a task.
+
+By default, both kinds of memory spreading are off, and memory
+pages are allocated on the node local to where the task is running,
+except perhaps as modified by the task's NUMA mempolicy or cpuset
+configuration, so long as sufficient free memory pages are available.
+
+When new cpusets are created, they inherit the memory spread settings
+of their parent.
+
+Setting memory spreading causes allocations for the affected page
+or slab caches to ignore the task's NUMA mempolicy and be spread
+instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
+mempolicies will not notice any change in these calls as a result of
+their containing task's memory spread settings. If memory spreading
+is turned off, then the currently specified NUMA mempolicy once again
+applies to memory page allocations.
+
+Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
+files. By default they contain "0", meaning that the feature is off
+for that cpuset. If a "1" is written to that file, then that turns
+the named feature on.
+
+The implementation is simple.
+
+Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
+PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
+joins that cpuset. The page allocation calls for the page cache
+are modified to perform an inline check for this PFA_SPREAD_PAGE task
+flag, and if set, a call to a new routine cpuset_mem_spread_node()
+returns the node to prefer for the allocation.
+
+Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
+PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
+pages from the node returned by cpuset_mem_spread_node().
+
+The cpuset_mem_spread_node() routine is also simple. It uses the
+value of a per-task rotor cpuset_mem_spread_rotor to select the next
+node in the current task's mems_allowed to prefer for the allocation.
+
+This memory placement policy is also known (in other contexts) as
+round-robin or interleave.
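+
+The rotor idea can be sketched in Python as follows. This is an illustration
+of round-robin selection over the allowed nodes, not the kernel routine; the
+function name ``next_spread_node`` is hypothetical, and the rotor is kept in
+a plain variable rather than in per-task state.
+

```python
def next_spread_node(rotor, mems_allowed):
    """Return the next allowed node after `rotor`, wrapping around."""
    nodes = sorted(mems_allowed)
    for node in nodes:
        if node > rotor:
            return node
    return nodes[0]  # wrapped past the highest allowed node


rotor = 0
picks = []
for _ in range(5):
    # Each allocation advances the rotor to the next allowed node.
    rotor = next_spread_node(rotor, {0, 2, 3})
    picks.append(rotor)
print(picks)
```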
+
+This policy can provide substantial improvements for jobs that need
+to place thread local data on the corresponding node, but that need
+to access large file system data sets that need to be spread across
+the several nodes in the job's cpuset in order to fit. Without this
+policy, especially for jobs that might have one thread reading in the
+data set, the memory allocation across the nodes in the job's cpuset
+can become very uneven.
+
+1.7 What is sched_load_balance ?
+--------------------------------
+
+The kernel scheduler (kernel/sched/core.c) automatically load balances
+tasks. If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced. So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, including those
+marked isolated using the kernel boot time "isolcpus=" argument. However,
+the isolated CPUs will not participate in load balancing, and will not
+have tasks running on them unless explicitly assigned.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+
+ 1) On large systems, load balancing across many CPUs is expensive.
+ If the system is managed using cpusets to place independent jobs
+ on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+ system overhead on those CPUs, including avoiding task load
+ balancing if that is not needed.
+
+When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets. Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest. Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding. So if each of two partially
+overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
+form a single sched domain that is a superset of both. We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "cpuset.sched_load_balance" enabled,
+and the sched domain configuration. If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above. In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+CPUs in "cpuset.isolcpus" were excluded from load balancing by the
+isolcpus= kernel boot option, and will never be load balanced regardless
+of the value of "cpuset.sched_load_balance" in any cpuset.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.) When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+
+If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can. It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system. This partition is a set of subsets (represented
+as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
+all the CPUs that must be load balanced.
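+
+As a rough model of how such a partition can be derived, the Python sketch
+below takes the CPU sets of all cpusets that have load balancing enabled and
+merges any that overlap, leaving pairwise disjoint subsets. The function name
+``sched_domain_partition`` is hypothetical and the kernel's own code handles
+more cases than this.
+

```python
def sched_domain_partition(balanced_cpusets):
    """Merge overlapping CPU sets until all are pairwise disjoint."""
    partition = []
    for cpus in balanced_cpusets:
        merged = set(cpus)
        if not merged:
            continue  # a cpuset without CPUs contributes nothing
        remaining = []
        for domain in partition:
            if domain & merged:
                merged |= domain  # overlap: fold into a single domain
            else:
                remaining.append(domain)
        partition = remaining + [merged]
    # Sort by lowest CPU number so the result is deterministic.
    return sorted(partition, key=min)


# Two overlapping cpusets collapse into one domain; the third stays apart.
print(sched_domain_partition([{0, 1}, {1, 2}, {4, 5}]))
```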
+
+The cpuset code builds a new such partition and passes it to the
+scheduler sched domain setup code, to have the sched domains rebuilt
+as necessary, whenever:
+
+ - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - or CPUs come or go from a cpuset with this flag enabled,
+ - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
+ and with this flag enabled changes,
+ - or a cpuset with non-empty CPUs and with this flag enabled is removed,
+ - or a cpu is offlined/onlined.
+
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (struct cpumask) in the
+partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
+
+
+1.8 What is sched_relax_domain_level ?
+--------------------------------------
+
+Within a sched domain, the scheduler migrates tasks in two ways: periodic
+load balancing on tick, and at the time of certain scheduling events.
+
+When a task is woken up, the scheduler tries to move the task to an idle CPU.
+For example, if task A running on CPU X activates another task B
+on the same CPU X, and CPU Y is X's sibling and is idle,
+then the scheduler migrates task B to CPU Y so that task B can start on
+CPU Y without waiting for task A on CPU X.
+
+And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
+extra tasks from other busy CPUs to help them before it goes
+idle.
+
+Of course, finding movable tasks and/or idle CPUs has a search cost,
+so the scheduler might not search all CPUs in the domain
+every time. In fact, on some architectures, the search range on
+events is limited to the socket or node where the CPU is located,
+while the load balance on tick searches all.
+
+For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
+is idle while CPU X and its siblings are busy, the scheduler can't migrate
+woken task B from X to Z since Z is outside its search range.
+As a result, task B on CPU X needs to wait for task A or for the load
+balance on the next tick. For some applications in special situations,
+waiting 1 tick may be too long.
+
+The 'cpuset.sched_relax_domain_level' file allows you to request changing
+this search range as you like. This file takes an int value which
+indicates the size of the search range in levels, ideally as follows;
+otherwise it holds the initial value -1, which indicates the cpuset has no
+request.
+
+====== ===========================================================
+ -1 no request. use system default or follow request of others.
+ 0 no search.
+ 1 search siblings (hyperthreads in a core).
+ 2 search cores in a package.
+ 3 search cpus in a node [= system wide on non-NUMA system]
+ 4 search nodes in a chunk of node [on NUMA system]
+ 5 search system wide [on NUMA system]
+====== ===========================================================
+
+The system default is architecture dependent. The system default
+can be changed using the relax_domain_level= boot parameter.
+
+This file is per-cpuset and affects the sched domain to which the cpuset
+belongs. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
+is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
+there is no sched domain belonging to the cpuset.
+
+If multiple cpusets are overlapping and hence they form a single sched
+domain, the largest value among those is used. Be careful: if one
+requests 0 and the others are -1, then 0 is used.
+
+Note that modifying this file will have both good and bad effects,
+and whether it is acceptable or not depends on your situation.
+Don't modify this file if you are not sure.
+
+If your situation is:
+
+ - The migration costs between CPUs can be assumed to be considerably
+ small (for you) due to your special application's behavior or
+ special hardware support for CPU cache etc.
+ - The searching cost doesn't have an impact (for you) or you can make
+ the searching cost small enough by keeping cpusets compact etc.
+ - Low latency is required even if it sacrifices cache hit rate etc.
+ then increasing 'sched_relax_domain_level' would benefit you.
+
+
+1.9 How do I use cpusets ?
+--------------------------
+
+In order to minimize the impact of cpusets on critical kernel
+code, such as the scheduler, and due to the fact that the kernel
+does not support one task updating the memory placement of another
+task directly, the impact on a task of changing its cpuset CPU
+or Memory Node placement, or of changing to which cpuset a task
+is attached, is subtle.
+
+If a cpuset has its Memory Nodes modified, then for each task attached
+to that cpuset, the next time that the kernel attempts to allocate
+a page of memory for that task, the kernel will notice the change
+in the task's cpuset, and update its per-task memory placement to
+remain within the new cpuset's memory placement. If the task was using
+mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
+its new cpuset, then the task will continue to use whatever subset
+of MPOL_BIND nodes are still allowed in the new cpuset. If the task
+was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
+in the new cpuset, then the task will be essentially treated as if it
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
+as queried by get_mempolicy(), doesn't change). If a task is moved
+from one cpuset to another, then the kernel will adjust the task's
+memory placement, as above, the next time that the kernel attempts
+to allocate a page of memory for that task.
+
+If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
+will have its allowed CPU placement changed immediately. Similarly,
+if a task's pid is written to another cpuset's 'tasks' file, then its
+allowed CPU placement is changed immediately. If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.
+
+In summary, the memory placement of a task whose cpuset is changed is
+updated by the kernel, on the next allocation of a page for that task,
+and the processor placement is updated immediately.
+
+Normally, once a page is allocated (given a physical page
+of main memory) then that page stays on whatever node it
+was allocated on, so long as it remains allocated, even if the
+cpuset's memory placement policy 'cpuset.mems' subsequently changes.
+If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
+a task is attached to that cpuset, any pages that task had
+allocated to it on nodes in its previous cpuset are migrated
+to the task's new cpuset. The relative placement of the page within
+the cpuset is preserved during these migration operations if possible.
+For example, if the page was on the second valid node of the prior cpuset,
+then the page will be placed on the second valid node of the new cpuset.
+
+Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
+'cpuset.mems' file is modified, pages allocated to tasks in that
+cpuset, that were on nodes in the previous setting of 'cpuset.mems',
+will be moved to nodes in the new setting of 'cpuset.mems'.
+Pages that were not in the task's prior cpuset, or in the cpuset's
+prior 'cpuset.mems' setting, will not be moved.
+
+There is an exception to the above. If hotplug functionality is used
+to remove all the CPUs that are currently assigned to a cpuset,
+then all the tasks in that cpuset will be moved to the nearest ancestor
+with non-empty cpus. But the moving of some (or all) tasks might fail if
+the cpuset is bound to another cgroup subsystem which has some restrictions
+on task attaching. In this failing case, those tasks will stay
+in the original cpuset, and the kernel will automatically update
+their cpus_allowed to allow all online CPUs. When memory hotplug
+functionality for removing Memory Nodes is available, a similar exception
+is expected to apply there as well. In general, the kernel prefers to
+violate cpuset placement, over starving a task that has had all
+its allowed CPUs or Memory Nodes taken offline.
+
+There is a second exception to the above. GFP_ATOMIC requests are
+kernel internal allocations that must be satisfied immediately.
+The kernel may drop some requests, in rare cases even panic, if a
+GFP_ATOMIC alloc fails. If the request cannot be satisfied within
+the current task's cpuset, then we relax the cpuset, and look for
+memory anywhere we can find it. It's better to violate the cpuset
+than stress the kernel.
+
+To start a new job that is to be contained within a cpuset, the steps are:
+
+ 1) mkdir /sys/fs/cgroup/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+ 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
+ the /sys/fs/cgroup/cpuset virtual file system.
+ 4) Start a task that will be the "founding father" of the new job.
+ 5) Attach that task to the new cpuset by writing its pid to the
+ /sys/fs/cgroup/cpuset tasks file for that cpuset.
+ 6) fork, exec or clone the job tasks from this founding father task.
+
+For example, the following sequence of commands will set up a cpuset
+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
+and then start a subshell 'sh' in that cpuset::
+
+ mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+ cd /sys/fs/cgroup/cpuset
+ mkdir Charlie
+ cd Charlie
+ /bin/echo 2-3 > cpuset.cpus
+ /bin/echo 1 > cpuset.mems
+ /bin/echo $$ > tasks
+ sh
+ # The subshell 'sh' is now running in cpuset Charlie
+ # The next line should display '/Charlie'
+ cat /proc/self/cpuset
+
+There are ways to query or modify cpusets:
+
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+ cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+ (http://sourceforge.net/projects/libcg/)
+ - via the python application cset.
+ (http://code.google.com/p/cpuset/)
+
+The sched_setaffinity calls can also be done at the shell prompt using
+SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
+calls can be done at the shell prompt using the numactl command
+(part of Andi Kleen's numa package).
+
+2. Usage Examples and Syntax
+============================
+
+2.1 Basic Usage
+---------------
+
+Creating, modifying, and using cpusets can be done through the cpuset
+virtual filesystem.
+To mount it, type::
+
+ # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
+
+Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
+tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
+is the cpuset that holds the whole system.
+
+If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
+
+ # cd /sys/fs/cgroup/cpuset
+ # mkdir my_cpuset
+
+Now you want to do something with this cpuset::
+
+ # cd my_cpuset
+
+In this directory you can find several files::
+
+ # ls
+ cgroup.clone_children  cpuset.memory_pressure
+ cgroup.event_control   cpuset.memory_spread_page
+ cgroup.procs           cpuset.memory_spread_slab
+ cpuset.cpu_exclusive   cpuset.mems
+ cpuset.cpus            cpuset.sched_load_balance
+ cpuset.mem_exclusive   cpuset.sched_relax_domain_level
+ cpuset.mem_hardwall    notify_on_release
+ cpuset.memory_migrate  tasks
+
+Reading them will give you information about the state of this cpuset:
+the CPUs and Memory Nodes it can use, the processes that are using
+it, its properties. By writing to these files you can manipulate
+the cpuset.
+
+Set some flags::
+
+ # /bin/echo 1 > cpuset.cpu_exclusive
+
+Add some cpus::
+
+ # /bin/echo 0-7 > cpuset.cpus
+
+Add some mems::
+
+ # /bin/echo 0-7 > cpuset.mems
+
+Now attach your shell to this cpuset::
+
+ # /bin/echo $$ > tasks
+
+You can also create cpusets inside your cpuset by using mkdir in this
+directory::
+
+ # mkdir my_sub_cs
+
+To remove a cpuset, just use rmdir::
+
+ # rmdir my_sub_cs
+
+This will fail if the cpuset is in use (has cpusets inside, or has
+processes attached).
+
+Note that for legacy reasons, the "cpuset" filesystem exists as a
+wrapper around the cgroup filesystem.
+
+The command::
+
+ mount -t cpuset X /sys/fs/cgroup/cpuset
+
+is equivalent to::
+
+ mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
+ echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
+
+2.2 Adding/removing cpus
+------------------------
+
+This is the syntax to use when writing in the cpus or mems files
+in cpuset directories::
+
+ # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
+ # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
+
+To add a CPU to a cpuset, write the new list of CPUs including the
+CPU to be added. To add 6 to the above cpuset::
+
+ # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
+
+Similarly, to remove a CPU from a cpuset, write the new list of CPUs
+without the CPU to be removed.
+
+To remove all the CPUs::
+
+ # /bin/echo "" > cpuset.cpus -> clear cpus list
+
+2.3 Setting flags
+-----------------
+
+The syntax is very simple::
+
+ # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
+ # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
+
+2.4 Attaching processes
+-----------------------
+
+::
+
+ # /bin/echo PID > tasks
+
+Note that it is PID, not PIDs. You can only attach ONE task at a time.
+If you have several tasks to attach, you have to do it one after another::
+
+ # /bin/echo PID1 > tasks
+ # /bin/echo PID2 > tasks
+ ...
+ # /bin/echo PIDn > tasks
+
+
+3. Questions
+============
+
+Q:
+ What's up with this '/bin/echo'?
+
+A:
+ bash's builtin 'echo' command does not check calls to write() against
+ errors. If you use it in the cpuset file system, you won't be
+ able to tell whether a command succeeded or failed.
+
+Q:
+ When I attach processes, only the first one on the line really gets attached!
+
+A:
+ We can only return one error code per call to write(). So you should
+ put only ONE pid per write.
+
+4. Contact
+==========
+
+Web: http://www.bullopensource.org/cpuset
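Note on section 2.2 of cpusets.rst above: the rule "write the new list of CPUs including the CPU to be added" can be sketched as a small shell helper. This is a minimal sketch and not part of the patch. The function name add_cpu and its arguments are hypothetical, and it assumes the kernel accepts a comma-appended list in the same way as the lists shown in section 2.2. It uses /bin/echo, following the convention explained in section 3.

```shell
# add_cpu CPUSET_DIR CPU   (hypothetical helper, not a kernel or libcpuset API)
# Read the current CPU list of a cpuset and write it back with CPU appended.
add_cpu() {
    cpus_file="$1/cpuset.cpus"
    # Fail early if the cpuset directory or its cpus file cannot be read.
    current=$(cat "$cpus_file") || return 1
    if [ -n "$current" ]; then
        new_list="$current,$2"
    else
        # An empty cpuset has no list yet, so the new CPU is the whole list.
        new_list=$2
    fi
    # The exit status of /bin/echo tells the caller whether the write worked.
    /bin/echo "$new_list" > "$cpus_file"
}
```

For example, `add_cpu /sys/fs/cgroup/cpuset/my_cpuset 6` writes `1-4,6` when the current list is `1-4`. The cpuset path is the one created in section 2.1.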
diff --git a/Documentation/admin-guide/cgroup-v1/devices.rst b/Documentation/admin-guide/cgroup-v1/devices.rst
new file mode 100644
index 0000000..e188678
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/devices.rst
@@ -0,0 +1,132 @@
+===========================
+Device Whitelist Controller
+===========================
+
+1. Description
+==============
+
+Implement a cgroup to track and enforce open and mknod restrictions
+on device files. A device cgroup associates a device access
+whitelist with each cgroup. A whitelist entry has 4 fields.
+'type' is a (all), c (char), or b (block). 'all' means it applies
+to all types and all major and minor numbers. Major and minor are
+either an integer or * for all. Access is a composition of r
+(read), w (write), and m (mknod).
+
+The root device cgroup starts with rwm to 'all'. A child device
+cgroup gets a copy of the parent. Administrators can then remove
+devices from the whitelist or add new entries. A child cgroup can
+never receive a device access which is denied by its parent.
+
+2. User Interface
+=================
+
+An entry is added using devices.allow, and removed using
+devices.deny. For instance::
+
+ echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
+
+allows cgroup 1 to read and mknod the device usually known as
+/dev/null. Doing::
+
+ echo a > /sys/fs/cgroup/1/devices.deny
+
+will remove the default 'a *:* rwm' entry. Doing::
+
+ echo a > /sys/fs/cgroup/1/devices.allow
+
+will add the 'a *:* rwm' entry to the whitelist.
+
+3. Security
+===========
+
+Any task can move itself between cgroups. This clearly won't
+suffice, but we can decide the best way to adequately restrict
+movement as people get some experience with this. We may just want
+to require CAP_SYS_ADMIN, which at least is a separate bit from
+CAP_MKNOD. We may want to just refuse moving to a cgroup which
+isn't a descendant of the current one. Or we may want to use
+CAP_MAC_ADMIN, since we really are trying to lock down root.
+
+CAP_SYS_ADMIN is needed to modify the whitelist or move another
+task to a new cgroup. (Again we'll probably want to change that).
+
+A cgroup may not be granted more permissions than the cgroup's
+parent has.
+
+4. Hierarchy
+============
+
+Device cgroups maintain hierarchy by making sure a cgroup never has more
+access permissions than its parent. Every time an entry is written to
+a cgroup's devices.deny file, all its children will have that entry removed
+from their whitelist and all the locally set whitelist entries will be
+re-evaluated. In case one of the locally set whitelist entries would provide
+more access than the cgroup's parent, it'll be removed from the whitelist.
+
+Example::
+
+      A
+     / \
+        B
+
+ group        behavior        exceptions
+ A            allow           "b 8:* rwm", "c 116:1 rw"
+ B            deny            "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
+
+If a device is denied in group A::
+
+ # echo "c 116:* r" > A/devices.deny
+
+it'll propagate down and after revalidating B's entries, the whitelist entry
+"c 116:2 rwm" will be removed::
+
+ group        whitelist entries                        denied devices
+ A            all                                      "b 8:* rwm", "c 116:* rw"
+ B            "c 1:3 rwm", "b 3:* rwm"                 all the rest
+
+In case parent's exceptions change and local exceptions are not allowed
+anymore, they'll be deleted.
+
+Notice that new whitelist entries will not be propagated::
+
+      A
+     / \
+        B
+
+ group        whitelist entries                        denied devices
+ A            "c 1:3 rwm", "c 1:5 r"                   all the rest
+ B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+
+when adding ``c *:3 rwm``::
+
+ # echo "c *:3 rwm" >A/devices.allow
+
+the result::
+
+ group        whitelist entries                        denied devices
+ A            "c *:3 rwm", "c 1:5 r"                   all the rest
+ B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+
+but now it'll be possible to add new entries to B::
+
+ # echo "c 2:3 rwm" >B/devices.allow
+ # echo "c 50:3 r" >B/devices.allow
+
+or even::
+
+ # echo "c *:3 rwm" >B/devices.allow
+
+Allowing or denying all by writing 'a' to devices.allow or devices.deny will
+not be possible once the device cgroup has children.
+
+4.1 Hierarchy (internal implementation)
+---------------------------------------
+
+Device cgroups are implemented internally using a behavior (ALLOW, DENY) and a
+list of exceptions. The internal state is controlled using the same user
+interface to preserve compatibility with the previous whitelist-only
+implementation. Removal or addition of exceptions that will reduce the access
+to devices will be propagated down the hierarchy.
+For every propagated exception, the effective rules will be re-evaluated based
+on current parent's access rules.
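Note on section 1 of devices.rst above: the entry format ('type major:minor access') can be checked before writing to devices.allow or devices.deny. The following is a minimal sketch of such a syntax check, derived only from the field description in the text. The function name valid_dev_entry is hypothetical, and the kernel's own parser may accept or reject forms that this check does not model.

```shell
# valid_dev_entry ENTRY   (hypothetical helper)
# Return 0 if ENTRY looks like 'type major:minor access' as described in
# section 1, or is the bare 'a' used to allow or deny everything.
valid_dev_entry() {
    entry=$1
    [ "$entry" = "a" ] && return 0
    # 'read' splits on whitespace without expanding '*' as a glob.
    read -r type dev access extra <<EOF
$entry
EOF
    [ -z "$extra" ] || return 1
    case $type in
        a|b|c) ;;
        *) return 1 ;;
    esac
    major=${dev%%:*}
    minor=${dev#*:}
    # Rejects entries that have no ':' between major and minor.
    [ "$dev" = "$major:$minor" ] || return 1
    for n in "$major" "$minor"; do
        case $n in
            '*') ;;                      # wildcard: all majors or minors
            ''|*[!0-9]*) return 1 ;;     # otherwise it must be an integer
        esac
    done
    case $access in
        ''|*[!rwm]*) return 1 ;;         # access is built from r, w and m only
    esac
    return 0
}
```

A caller could guard a write with it, for example `valid_dev_entry 'c 1:3 mr' && echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow`, using the path from section 2.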
diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
new file mode 100644
index 0000000..582d342
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
@@ -0,0 +1,127 @@
+==============
+Cgroup Freezer
+==============
+
+The cgroup freezer is useful to batch job management systems which start
+and stop sets of tasks in order to schedule the resources of a machine
+according to the desires of a system administrator. This sort of program
+is often used on HPC clusters to schedule access to the cluster as a
+whole. The cgroup freezer uses cgroups to describe the set of tasks to
+be started/stopped by the batch job management system. It also provides
+a means to start and stop the tasks composing the job.
+
+The cgroup freezer will also be useful for checkpointing running groups
+of tasks. The freezer allows the checkpoint code to obtain a consistent
+image of the tasks by attempting to force the tasks in a cgroup into a
+quiescent state. Once the tasks are quiescent another task can
+walk /proc or invoke a kernel interface to gather information about the
+quiesced tasks. Checkpointed tasks can be restarted later should a
+recoverable error occur. This also allows the checkpointed tasks to be
+migrated between nodes in a cluster by copying the gathered information
+to another node and restarting the tasks there.
+
+Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
+and resuming tasks in userspace. Both of these signals are observable
+from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
+blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
+SIGCONT is especially unsuitable since it can be caught by the task. Any
+programs designed to watch for SIGSTOP and SIGCONT could be broken by
+attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
+demonstrate this problem using nested bash shells::
+
+ $ echo $$
+ 16644
+ $ bash
+ $ echo $$
+ 16690
+
+ From a second, unrelated bash shell:
+ $ kill -SIGSTOP 16690
+ $ kill -SIGCONT 16690
+
+ <at this point 16690 exits and causes 16644 to exit too>
+
+This happens because bash can observe both signals and choose how it
+responds to them.
+
+Another example of a program which catches and responds to these
+signals is gdb. In fact any program designed to use ptrace is likely to
+have a problem with this method of stopping and resuming tasks.
+
+In contrast, the cgroup freezer uses the kernel freezer code to
+prevent the freeze/unfreeze cycle from becoming visible to the tasks
+being frozen. This allows the bash example above and gdb to run as
+expected.
+
+The cgroup freezer is hierarchical. Freezing a cgroup freezes all
+tasks belonging to the cgroup and all its descendant cgroups. Each
+cgroup has its own state (self-state) and the state inherited from the
+parent (parent-state). Iff both states are THAWED, the cgroup is
+THAWED.
+
+The following cgroupfs files are created by cgroup freezer.
+
+* freezer.state: Read-write.
+
+ When read, returns the effective state of the cgroup - "THAWED",
+ "FREEZING" or "FROZEN". This is the combined self and parent-states.
+ If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
+
+ A FREEZING cgroup transitions into FROZEN state when all tasks
+ belonging to the cgroup and its descendants become frozen. Note that
+ a cgroup reverts to FREEZING from FROZEN after a new task is added
+ to the cgroup or one of its descendant cgroups until the new task is
+ frozen.
+
+ When written, sets the self-state of the cgroup. Two values are
+ allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
+ if not already freezing, enters FREEZING state along with all its
+ descendant cgroups.
+
+ If THAWED is written, the self-state of the cgroup is changed to
+ THAWED. Note that the effective state may not change to THAWED if
+ the parent-state is still freezing. If a cgroup's effective state
+ becomes THAWED, all its descendants which are freezing because of
+ the cgroup also leave the freezing state.
+
+* freezer.self_freezing: Read only.
+
+ Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
+ This value is 1 iff the last write to freezer.state was "FROZEN".
+
+* freezer.parent_freezing: Read only.
+
+ Shows the parent-state. 0 if none of the cgroup's ancestors is
+ frozen; otherwise, 1.
+
+The root cgroup is non-freezable and the above interface files don't
+exist.
+
+* Examples of usage::
+
+ # mkdir /sys/fs/cgroup/freezer
+ # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
+ # mkdir /sys/fs/cgroup/freezer/0
+ # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
+
+to get status of the freezer subsystem::
+
+ # cat /sys/fs/cgroup/freezer/0/freezer.state
+ THAWED
+
+to freeze all tasks in the container::
+
+ # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
+ # cat /sys/fs/cgroup/freezer/0/freezer.state
+ FREEZING
+ # cat /sys/fs/cgroup/freezer/0/freezer.state
+ FROZEN
+
+to unfreeze all tasks in the container::
+
+ # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
+ # cat /sys/fs/cgroup/freezer/0/freezer.state
+ THAWED
+
+This is the basic mechanism which should do the right thing for user space
+tasks in a simple scenario.
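Note on the freezer.state description above: because a cgroup passes through FREEZING before it reports FROZEN, a script that needs the tasks fully stopped has to poll. The following is a minimal sketch of that loop, not part of the patch. The function names, the one-second poll interval and the limit of 10 attempts are all assumptions chosen for illustration.

```shell
# wait_for_state STATE_FILE WANTED TRIES   (hypothetical helper)
# Poll STATE_FILE once per second until it reads WANTED.
# Returns 0 on success, 1 if TRIES attempts ran out, 2 on a read error.
wait_for_state() {
    state_file=$1
    wanted=$2
    tries=$3
    while [ "$tries" -gt 0 ]; do
        state=$(cat "$state_file") || return 2
        [ "$state" = "$wanted" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# freeze_cgroup CGROUP_DIR   (hypothetical helper)
# Request a freeze, then wait for the effective state to become FROZEN.
freeze_cgroup() {
    echo FROZEN > "$1/freezer.state" || return 2
    wait_for_state "$1/freezer.state" FROZEN 10
}
```

With the cgroup from the usage example, this would be called as `freeze_cgroup /sys/fs/cgroup/freezer/0`.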
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
new file mode 100644
index 0000000..a3902aa
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -0,0 +1,50 @@
+==================
+HugeTLB Controller
+==================
+
+The HugeTLB controller allows limiting HugeTLB usage per control group and
+enforces the controller limit during page fault. Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that
+the application will get a SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how many
+HugeTLB pages it would require for its use.
+
+The HugeTLB controller can be created by first mounting the cgroup filesystem::
+
+ # mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup::
+
+ # cd /sys/fs/cgroup
+ # mkdir g1
+ # echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files::
+
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB limit
+
+For a system supporting three hugepage sizes (64k, 32M and 1G), the control
+files include::
+
+ hugetlb.1GB.limit_in_bytes
+ hugetlb.1GB.max_usage_in_bytes
+ hugetlb.1GB.usage_in_bytes
+ hugetlb.1GB.failcnt
+ hugetlb.64KB.limit_in_bytes
+ hugetlb.64KB.max_usage_in_bytes
+ hugetlb.64KB.usage_in_bytes
+ hugetlb.64KB.failcnt
+ hugetlb.32MB.limit_in_bytes
+ hugetlb.32MB.max_usage_in_bytes
+ hugetlb.32MB.usage_in_bytes
+ hugetlb.32MB.failcnt
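Note on the control files listed in hugetlb.rst above: since the file names depend on the huge page sizes the system supports, a script can check that the limit file exists before writing to it. The following is a minimal sketch under that assumption. The function name set_hugetlb_limit is hypothetical, and the arguments are the cgroup directory, the size tag as it appears in the file name, and the limit in bytes.

```shell
# set_hugetlb_limit CGROUP_DIR HUGEPAGESIZE BYTES   (hypothetical helper)
# Write BYTES to hugetlb.<HUGEPAGESIZE>.limit_in_bytes in CGROUP_DIR.
set_hugetlb_limit() {
    limit_file="$1/hugetlb.$2.limit_in_bytes"
    # The file only exists for huge page sizes supported by the system.
    if [ ! -e "$limit_file" ]; then
        echo "no control file for huge page size $2 in $1" >&2
        return 1
    fi
    echo "$3" > "$limit_file"
}
```

For the group g1 created above, `set_hugetlb_limit /sys/fs/cgroup/g1 1GB 2147483648` would request a 2 GiB limit on 1GB pages, assuming the system supports that size.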
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
new file mode 100644
index 0000000..10bf48b
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -0,0 +1,28 @@
+========================
+Control Groups version 1
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ cgroups
+
+ blkio-controller
+ cpuacct
+ cpusets
+ devices
+ freezer-subsystem
+ hugetlb
+ memcg_test
+ memory
+ net_cls
+ net_prio
+ pids
+ rdma
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
new file mode 100644
index 0000000..3f7115e
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -0,0 +1,355 @@
+=====================================================
+Memory Resource Controller(Memcg) Implementation Memo
+=====================================================
+
+Last Updated: 2010/2
+
+Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
+
+Because the VM is getting complex (one of the reasons is memcg...), memcg's
+behavior is complex. This document describes memcg's internal behavior.
+Please note that implementation details can change.
+
+(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst
+
+0. How to record usage ?
+========================
+
+ 2 objects are used.
+
+ page_cgroup ....an object per page.
+
+ Allocated at boot or memory hotplug. Freed at memory hot removal.
+
+ swap_cgroup ... an entry per swp_entry.
+
+ Allocated at swapon(). Freed at swapoff().
+
+ The page_cgroup has a USED bit, and double counting against a page_cgroup
+ never occurs. swap_cgroup is used only when a charged page is swapped out.
+
+1. Charge
+=========
+
+ a page/swp_entry may be charged (usage += PAGE_SIZE) at
+
+ mem_cgroup_try_charge()
+
+2. Uncharge
+===========
+
+ a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
+
+ mem_cgroup_uncharge()
+ Called when a page's refcount goes down to 0.
+
+ mem_cgroup_uncharge_swap()
+ Called when swp_entry's refcnt goes down to 0. A charge against swap
+ disappears.
+
+3. charge-commit-cancel
+=======================
+
+ Memcg pages are charged in two steps:
+
+ - mem_cgroup_try_charge()
+ - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
+
+ At try_charge(), there are no flags to say "this page is charged".
+ At this point, usage += PAGE_SIZE.
+
+ At commit(), the page is associated with the memcg.
+
+ At cancel(), simply usage -= PAGE_SIZE.
+
+In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
+
+4. Anonymous
+============
+
+ Anonymous page is newly allocated at
+ - page fault into MAP_ANONYMOUS mapping.
+ - Copy-On-Write.
+
+ 4.1 Swap-in.
+ At swap-in, the page is taken from swap-cache. There are 2 cases.
+
+ (a) If the SwapCache is newly allocated and read, it has no charges.
+ (b) If the SwapCache has been mapped by processes, it has been
+ charged already.
+
+ 4.2 Swap-out.
+ At swap-out, the typical state transition is as follows.
+
+ (a) add to swap cache. (marked as SwapCache)
+ swp_entry's refcnt += 1.
+ (b) fully unmapped.
+ swp_entry's refcnt += # of ptes.
+ (c) write back to swap.
+ (d) delete from swap cache. (remove from SwapCache)
+ swp_entry's refcnt -= 1.
+
+
+ Finally, at task exit,
+ (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
+
+5. Page Cache
+=============
+
+ Page Cache is charged at
+ - add_to_page_cache_locked().
+
+ The logic is very clear. (About migration, see below)
+
+ Note:
+ __remove_from_page_cache() is called by remove_from_page_cache()
+ and __remove_mapping().
+
+6. Shmem(tmpfs) Page Cache
+===========================
+
+ The best way to understand shmem's page state transition is to read
+ mm/shmem.c.
+
+ But brief explanation of the behavior of memcg around shmem will be
+ helpful to understand the logic.
+
+ Shmem's page (just leaf page, not direct/indirect block) can be on
+
+ - radix-tree of shmem's inode.
+ - SwapCache.
+ - Both on radix-tree and SwapCache. This happens at swap-in
+ and swap-out.
+
+ It's charged when...
+
+ - A new page is added to shmem's radix-tree.
+ - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
+
+7. Page Migration
+=================
+
+ mem_cgroup_migrate()
+
+8. LRU
+======
+ Each memcg has its own private LRU. Now, its handling is under global
+ VM's control (means that it's handled under global pgdat->lru_lock).
+ Almost all routines around memcg's LRU are called by global LRU's
+ list management functions under pgdat->lru_lock.
+
+ A special function is mem_cgroup_isolate_pages(). This scans
+ memcg's private LRU and calls __isolate_lru_page() to extract a page
+ from LRU.
+
+ (By __isolate_lru_page(), the page is removed from both of global and
+ private LRU.)
+
+
+9. Typical Tests.
+=================
+
+ Tests for racy cases.
+
+9.1 Small limit to memcg.
+-------------------------
+
+ When you test racy cases, it's a good test to set memcg's limit
+ to be very small rather than GB. Many races were found in tests under
+ xKB or xxMB limits.
+
+ (Memory behavior under GB and memory behavior under MB show very
+ different situations.)
+
+9.2 Shmem
+---------
+
+ Historically, memcg's shmem handling was poor and we saw a fair amount
+ of trouble here. This is because shmem is page-cache but can be
+ SwapCache. Testing with shmem/tmpfs is always a good test.
+
+9.3 Migration
+-------------
+
+ For NUMA, migration is another special case. For easy testing, cpuset
+ is useful. The following is a sample script to do migration::
+
+ mount -t cgroup -o cpuset none /opt/cpuset
+
+ mkdir /opt/cpuset/01
+ echo 1 > /opt/cpuset/01/cpuset.cpus
+ echo 0 > /opt/cpuset/01/cpuset.mems
+ echo 1 > /opt/cpuset/01/cpuset.memory_migrate
+ mkdir /opt/cpuset/02
+ echo 1 > /opt/cpuset/02/cpuset.cpus
+ echo 1 > /opt/cpuset/02/cpuset.mems
+ echo 1 > /opt/cpuset/02/cpuset.memory_migrate
+
+ With the above setup, when you move a task from 01 to 02, page migration
+ from node 0 to node 1 will occur. The following is a script to migrate all
+ tasks under a cpuset::
+
+ --
+ move_task()
+ {
+ for pid in $1
+ do
+ /bin/echo $pid >$2/tasks 2>/dev/null
+ echo -n $pid
+ echo -n " "
+ done
+ echo END
+ }
+
+ G1_TASK=`cat ${G1}/tasks`
+ G2_TASK=`cat ${G2}/tasks`
+ move_task "${G1_TASK}" ${G2} &
+ --
+
+9.4 Memory hotplug
+------------------
+
+ Memory hotplug is another good test.
+
+ To offline memory, do the following::
+
+ # echo offline > /sys/devices/system/memory/memoryXXX/state
+
+ (XXX is the number of the memory block)
+
+ This is an easy way to test page migration, too.
+
+9.5 mkdir/rmdir
+---------------
+
+ When using hierarchy, mkdir/rmdir test should be done.
+ Use tests like the following::
+
+ echo 1 >/opt/cgroup/01/memory.use_hierarchy
+ mkdir /opt/cgroup/01/child_a
+ mkdir /opt/cgroup/01/child_b
+
+ set limit to 01.
+ add limit to 01/child_b
+ run jobs under child_a and child_b
+
+ Create/delete the following groups at random while jobs are running::
+
+ /opt/cgroup/01/child_a/child_aa
+ /opt/cgroup/01/child_b/child_bb
+ /opt/cgroup/01/child_c
+
+ Running new jobs in a new group is also a good test.
+
+9.6 Mount with other subsystems
+-------------------------------
+
+ Mounting with other subsystems is a good test because there is a
+ race and lock dependency with other cgroup subsystems.
+
+ example::
+
+ # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
+
+ and do task moves, mkdir, rmdir etc. under this.
+
+9.7 swapoff
+-----------
+
+ Besides swap management being one of the complicated parts of memcg,
+ the call path of swap-in at swapoff is not the same as the usual swap-in
+ path. It's worth testing explicitly.
+
+ For example, a test like the following is good:
+
+ (Shell-A)::
+
+ # mount -t cgroup none /cgroup -o memory
+ # mkdir /cgroup/test
+ # echo 40M > /cgroup/test/memory.limit_in_bytes
+ # echo 0 > /cgroup/test/tasks
+
+ Run a malloc(100M) program under this. You'll see 60M of swap.
+
+ (Shell-B)::
+
+ # move all tasks in /cgroup/test to /cgroup
+ # /sbin/swapoff -a
+ # rmdir /cgroup/test
+ # kill malloc task.
+
+ Of course, tmpfs vs. swapoff should be tested, too.
+
+9.8 OOM-Killer
+--------------
+
+ Out-of-memory caused by memcg's limit will kill tasks under
+ the memcg. When hierarchy is used, a task under hierarchy
+ will be killed by the kernel.
+
+ In this case, panic_on_oom shouldn't be invoked and tasks
+ in other groups shouldn't be killed.
+
+ It's not difficult to cause OOM under memcg as follows.
+
+ Case A) when you can swapoff::
+
+ #swapoff -a
+ #echo 50M > /memory.limit_in_bytes
+
+ run 51M of malloc
+
+ Case B) when you use mem+swap limitation::
+
+ #echo 50M > memory.limit_in_bytes
+ #echo 50M > memory.memsw.limit_in_bytes
+
+ run 51M of malloc
+
+9.9 Move charges at task migration
+----------------------------------
+
+ Charges associated with a task can be moved along with task migration.
+
+ (Shell-A)::
+
+ #mkdir /cgroup/A
+ #echo $$ >/cgroup/A/tasks
+
+ Run some programs which use some amount of memory in /cgroup/A.
+
+ (Shell-B)::
+
+ #mkdir /cgroup/B
+ #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
+ #echo "pid of the program running in group A" >/cgroup/B/tasks
+
+ You can see charges have been moved by reading ``*.usage_in_bytes`` or
+ memory.stat of both A and B.
+
+ See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst to see what value should
+ be written to move_charge_at_immigrate.
+
+9.10 Memory thresholds
+----------------------
+
+ The memory controller implements memory thresholds using the cgroups notification
+ API. You can use tools/cgroup/cgroup_event_listener.c to test it.
+
+ (Shell-A) Create cgroup and run event listener::
+
+ # mkdir /cgroup/A
+ # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
+
+ (Shell-B) Add task to cgroup and try to allocate and free memory::
+
+ # echo $$ >/cgroup/A/tasks
+ # a="$(dd if=/dev/zero bs=1M count=10)"
+ # a=
+
+ You will see a message from cgroup_event_listener every time you cross
+ the thresholds.
+
+ Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
+
+ It's a good idea to test the root cgroup as well.
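Note on section 9.3 of memcg_test.rst above: the move_task script relies on G1 and G2 being set elsewhere. The following is a minimal self-contained sketch of the same idea, not part of the patch. The function name move_all_tasks is hypothetical, and the two arguments are the source and destination cgroup directories. It writes one PID per write() and uses /bin/echo so that a failed write shows up in the exit status.

```shell
# move_all_tasks SRC_CGROUP DST_CGROUP   (hypothetical helper)
# Move every task listed in SRC_CGROUP/tasks to DST_CGROUP, one PID per
# write, and print how many writes succeeded.
move_all_tasks() {
    src=$1
    dst=$2
    moved=0
    # Snapshot the list first: it shrinks as tasks leave SRC_CGROUP.
    pids=$(cat "$src/tasks") || return 1
    for pid in $pids; do
        # A task may exit between the snapshot and the write; skip it.
        if /bin/echo "$pid" > "$dst/tasks" 2>/dev/null; then
            moved=$((moved + 1))
        fi
    done
    echo "$moved"
}
```

With the cpusets created by the sample script, `move_all_tasks /opt/cpuset/01 /opt/cpuset/02` would trigger the migration described in that section.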
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
new file mode 100644
index 0000000..0ae4f56
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -0,0 +1,1005 @@
+==========================
+Memory Resource Controller
+==========================
+
+NOTE:
+ This document is hopelessly outdated and it asks for a complete
+ rewrite. It still contains useful information so we are keeping it
+ here but make sure to check the current code if you need a deeper
+ understanding.
+
+NOTE:
+ The Memory Resource Controller has generically been referred to as the
+ memory controller in this document. Do not confuse memory controller
+ used here with the memory controller that is used in hardware.
+
+(For editors) In this document:
+ When we mention a cgroup (cgroupfs's directory) with memory controller,
+ we call it "memory cgroup". When you see git-log and source code, you'll
+ see patch's title and function names tend to use "memcg".
+ In this document, we avoid using it.
+
+Benefits and Purpose of the memory controller
+=============================================
+
+The memory controller isolates the memory behaviour of a group of tasks
+from the rest of the system. The article on LWN [12] mentions some probable
+uses of the memory controller. The memory controller can be used to
+
+a. Isolate an application or a group of applications
+ Memory-hungry applications can be isolated and limited to a smaller
+ amount of memory.
+b. Create a cgroup with a limited amount of memory; this can be used
+ as a good alternative to booting with mem=XXXX.
+c. Virtualization solutions can control the amount of memory they want
+ to assign to a virtual machine instance.
+d. A CD/DVD burner could control the amount of memory used by the
+ rest of the system to ensure that burning does not fail due to lack
+ of available memory.
+e. There are several other use cases; find one or use the controller just
+ for fun (to learn and hack on the VM subsystem).
+
+Current Status: linux-2.6.34-mmotm (development version of April 2010)
+
+Features:
+
+ - accounting anonymous pages, file caches, swap caches usage and limiting them.
+ - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
+ - optionally, memory+swap usage can be accounted and limited.
+ - hierarchical accounting
+ - soft limit
+ - moving (recharging) charges when a task is moved is selectable.
+ - usage threshold notifier
+ - memory pressure notifier
+ - oom-killer disable knob and oom-notifier
+ - Root cgroup has no limit controls.
+
+ Kernel memory support is a work in progress, and the current version provides
+ basic functionality. (See Section 2.7)
+
+Brief summary of control files.
+
+==================================== ==========================================
+ tasks                               attach a task(thread) and show list of
+                                     threads
+ cgroup.procs                        show list of processes
+ cgroup.event_control                an interface for event_fd()
+ memory.usage_in_bytes               show current usage for memory
+                                     (See 5.5 for details)
+ memory.memsw.usage_in_bytes         show current usage for memory+Swap
+                                     (See 5.5 for details)
+ memory.limit_in_bytes               set/show limit of memory usage
+ memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
+ memory.failcnt                      show the number of memory usage hits limits
+ memory.memsw.failcnt                show the number of memory+Swap hits limits
+ memory.max_usage_in_bytes           show max memory usage recorded
+ memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
+ memory.soft_limit_in_bytes          set/show soft limit of memory usage
+ memory.stat                         show various statistics
+ memory.use_hierarchy                set/show hierarchical account enabled
+ memory.force_empty                  trigger forced page reclaim
+ memory.pressure_level               set memory pressure notifications
+ memory.swappiness                   set/show swappiness parameter of vmscan
+                                     (See sysctl's vm.swappiness)
+ memory.move_charge_at_immigrate     set/show controls of moving charges
+ memory.oom_control                  set/show oom controls.
+ memory.numa_stat                    show the number of memory usage per numa
+                                     node
+ memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
+                                     This knob is deprecated and shouldn't be
+                                     used. It is planned that this be removed in
+                                     the foreseeable future.
+ memory.kmem.usage_in_bytes          show current kernel memory allocation
+ memory.kmem.failcnt                 show the number of kernel memory usage
+                                     hits limits
+ memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded
+
+ memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
+ memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
+ memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
+                                     hits limits
+ memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
+==================================== ==========================================
+
+1. History
+==========
+
+The memory controller has a long history. A request for comments for the memory
+controller was posted by Balbir Singh [1]. At the time the RFC was posted
+there were several implementations for memory control. The goal of the
+RFC was to build consensus and agreement for the minimal features required
+for memory control. The first RSS controller was posted by Balbir Singh[2]
+in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
+RSS controller. At OLS, at the resource management BoF, everyone suggested
+that we handle both page cache and RSS together. Another request was raised
+to allow user space handling of OOM. The current memory controller is
+at version 6; it combines both mapped (RSS) and unmapped Page
+Cache Control [11].
+
+2. Memory Control
+=================
+
+Memory is a unique resource in the sense that it is present in a limited
+amount. If a task requires a lot of CPU processing, the task can spread
+its processing over a period of hours, days, months or years, but with
+memory, the same physical memory needs to be reused to accomplish the task.
+
+The memory controller implementation has been divided into phases. These
+are:
+
+1. Memory controller
+2. mlock(2) controller
+3. Kernel user memory accounting and slab control
+4. user mappings length controller
+
+The memory controller is the first controller developed.
+
+2.1. Design
+-----------
+
+The core of the design is a counter called the page_counter. The
+page_counter tracks the current memory usage and limit of the group of
+processes associated with the controller. Each cgroup has a memory controller
+specific data structure (mem_cgroup) associated with it.
+
+2.2. Accounting
+---------------
+
+::
+
+            +--------------------+
+            |  mem_cgroup        |
+            |  (page_counter)    |
+            +--------------------+
+             /        ^         \
+            /         |          \
+   +---------------+  |        +---------------+
+   | mm_struct     |  |....    | mm_struct     |
+   |               |  |        |               |
+   +---------------+  |        +---------------+
+                      |
+                      + --------------+
+                                      |
+   +---------------+           +------+--------+
+   | page          +---------->| page_cgroup   |
+   |               |           |               |
+   +---------------+           +---------------+
+
+ (Figure 1: Hierarchy of Accounting)
+
+
+Figure 1 shows the important aspects of the controller:
+
+1. Accounting happens per cgroup
+2. Each mm_struct knows which cgroup it belongs to
+3. Each page has a pointer to the page_cgroup, which in turn knows the
+   cgroup it belongs to
+
+The accounting is done as follows: mem_cgroup_charge_common() is invoked to
+set up the necessary data structures and check if the cgroup that is being
+charged is over its limit. If it is, then reclaim is invoked on the cgroup.
+More details can be found in the reclaim section of this document.
+If everything goes well, a page meta-data-structure called page_cgroup is
+updated. page_cgroup has its own LRU on cgroup.
+(*) page_cgroup structure is allocated at boot/memory-hotplug time.
+
+2.2.1 Accounting details
+------------------------
+
+All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
+Some pages which are never reclaimable and will not be on the LRU
+are not accounted. We just account pages under usual VM management.
+
+RSS pages are accounted at page_fault unless they've already been accounted
+for earlier. A file page will be accounted for as Page Cache when it's
+inserted into inode (radix-tree). While it's mapped into the page tables of
+processes, duplicate accounting is carefully avoided.
+
+An RSS page is unaccounted when it's fully unmapped. A PageCache page is
+unaccounted when it's removed from radix-tree. Even if RSS pages are fully
+unmapped (by kswapd), they may exist as SwapCache in the system until they
+are really freed. Such SwapCaches are also accounted.
+A swapped-in page is not accounted until it's mapped.
+
+Note: The kernel does swapin-readahead and reads multiple swap entries at
+once. This means swapped-in pages may include pages belonging to tasks other
+than the one causing the page fault. So, we avoid accounting at swap-in I/O.
+
+At page migration, accounting information is kept.
+
+Note: we just account pages-on-LRU because our purpose is to control amount
+of used pages; not-on-LRU pages tend to be out-of-control from VM view.
+
+2.3 Shared Page Accounting
+--------------------------
+
+Shared pages are accounted on the basis of the first touch approach. The
+cgroup that first touches a page is accounted for the page. The principle
+behind this approach is that a cgroup that aggressively uses a shared
+page will eventually get charged for it (once it is uncharged from
+the cgroup that brought it in -- this will happen on memory pressure).
+
+But see section 8.2: when moving a task to another cgroup, its pages may
+be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
+
+Exception: when CONFIG_MEMCG_SWAP is not used and you run swapoff, the
+swapped-out pages of shmem (tmpfs) are forced back into memory, and the
+charges for those pages are accounted against the caller of swapoff rather
+than the users of shmem.
+
+2.4 Swap Extension (CONFIG_MEMCG_SWAP)
+--------------------------------------
+
+Swap Extension allows you to record charges for swap. A swapped-in page is
+charged back to the original page allocator if possible.
+
+When swap is accounted, the following files are added.
+
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+memsw means memory+swap. Usage of memory+swap is limited by
+memsw.limit_in_bytes.
+
+Example: Assume a system with 4G of swap. A task which allocates 6G of memory
+(by mistake) under a 2G memory limit will use all of the swap.
+In this case, setting memsw.limit_in_bytes=3G will prevent this bad use of swap.
+By using the memsw limit, you can avoid a system OOM caused by a swap
+shortage.
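+
+The arithmetic of the example above can be sketched with a deliberately
+simplified model. This is an illustration only, not how the kernel decides
+what to swap: it assumes the task's resident memory sits exactly at the memory
+limit and that everything beyond it goes to swap. place_allocation() and its
+parameter names are hypothetical.
+
```python
GIB = 1 << 30

def place_allocation(alloc, mem_limit, swap_total, memsw_limit=None):
    """Toy model: return (resident, swapped, hits_limit), all sizes in bytes."""
    resident = min(alloc, mem_limit)   # memory limit caps the resident part
    overflow = alloc - resident        # the remainder would have to be swapped
    swap_room = swap_total
    if memsw_limit is not None:
        # memory+swap is capped, so swap only gets what the cap leaves over
        swap_room = min(swap_total, max(memsw_limit - resident, 0))
    swapped = min(overflow, swap_room)
    return resident, swapped, overflow > swap_room

# No memsw limit: the runaway task ends up using all 4G of swap.
print(place_allocation(6 * GIB, 2 * GIB, 4 * GIB))
# memsw limit of 3G: the task is stopped after 1G of swap.
print(place_allocation(6 * GIB, 2 * GIB, 4 * GIB, memsw_limit=3 * GIB))
```
+
+In this model the third element of the result tells whether the cgroup ran
+into a limit before the allocation could complete.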
+
+**Why 'memory+swap' rather than swap**
+
+The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out
+moves its charge from memory to swap, so the memory+swap usage does not
+change. In other words, when we want to limit the usage of swap without
+affecting the global LRU, a memory+swap limit is better than just limiting
+swap from an OS point of view.
+
+**What happens when a cgroup hits memory.memsw.limit_in_bytes**
+
+When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
+in this cgroup. The cgroup routine then stops swapping out and drops file
+caches instead. But as mentioned above, the global LRU can still swap out
+memory from the cgroup for the sanity of the system's memory management
+state. You can't forbid that from within the cgroup.
+
+2.5 Reclaim
+-----------
+
+Each cgroup maintains a per cgroup LRU which has the same structure as
+global VM. When a cgroup goes over its limit, we first try
+to reclaim memory from the cgroup so as to make space for the new
+pages that the cgroup has touched. If the reclaim is unsuccessful,
+an OOM routine is invoked to select and kill the bulkiest task in the
+cgroup. (See 10. OOM Control below.)
+
+The reclaim algorithm has not been modified for cgroups, except that
+pages that are selected for reclaiming come from the per-cgroup LRU
+list.
+
+NOTE:
+ Reclaim does not work for the root cgroup, since we cannot set any
+ limits on the root cgroup.
+
+Note2:
+ When panic_on_oom is set to "2", the whole system will panic.
+
+When an OOM event notifier is registered, the event will be delivered.
+(See the oom_control section.)
+
+2.6 Locking
+-----------
+
+ lock_page_cgroup()/unlock_page_cgroup() should not be called under
+ the i_pages lock.
+
+ The rest of the lock order is as follows:
+
+ PG_locked.
+ mm->page_table_lock
+ pgdat->lru_lock
+ lock_page_cgroup.
+
+ In many cases, just lock_page_cgroup() is called.
+
+ per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
+ pgdat->lru_lock; it has no lock of its own.
+
+2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+-----------------------------------------------
+
+With the Kernel memory extension, the Memory Controller is able to limit
+the amount of kernel memory used by the system. Kernel memory is fundamentally
+different than user memory, since it can't be swapped out, which makes it
+possible to DoS the system by consuming too much of this precious resource.
+
+Kernel memory accounting is enabled for all memory cgroups by default. But
+it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
+at boot time. In this case, kernel memory will not be accounted at all.
+
+Kernel memory limits are not imposed for the root cgroup. Usage for the root
+cgroup may or may not be accounted. The memory used is accumulated into
+memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
+(currently only for tcp).
+
+The main "kmem" counter is fed into the main counter, so kmem charges will
+also be visible from the user counter.
+
+Currently no soft limit is implemented for kernel memory. It is future work
+to trigger slab reclaim when those limits are reached.
+
+2.7.1 Current Kernel Memory resources accounted
+-----------------------------------------------
+
+stack pages:
+ every process consumes some stack pages. By accounting into
+ kernel memory, we prevent new processes from being created when the kernel
+ memory usage is too high.
+
+slab pages:
+ pages allocated by the SLAB or SLUB allocator are tracked. A copy
+ of each kmem_cache is created the first time the cache is touched from
+ inside the memcg. The creation is done lazily, so some objects can still be
+ skipped while the cache is being created. All objects in a slab page should
+ belong to the same memcg. This only fails to hold when a task is migrated to a
+ different memcg during the page allocation by the cache.
+
+sockets memory pressure:
+ some sockets protocols have memory pressure
+ thresholds. The Memory Controller allows them to be controlled individually
+ per cgroup, instead of globally.
+
+tcp memory pressure:
+ sockets memory pressure for the tcp protocol.
+
+2.7.2 Common use cases
+----------------------
+
+Because the "kmem" counter is fed to the main user counter, kernel memory can
+never be limited completely independently of user memory. Say "U" is the user
+limit, and "K" the kernel limit. There are three possible ways limits can be
+set:
+
+U != 0, K = unlimited:
+ This is the standard memcg limitation mechanism already present before kmem
+ accounting. Kernel memory is completely ignored.
+
+U != 0, K < U:
+ Kernel memory is a subset of the user memory. This setup is useful in
+ deployments where the total amount of memory per-cgroup is overcommitted.
+ Overcommitting kernel memory limits is definitely not recommended, since the
+ box can still run out of non-reclaimable memory.
+ In this case, the admin could set up K so that the sum of all groups is
+ never greater than the total memory, and freely set U at the cost of his
+ QoS.
+
+WARNING:
+ In the current implementation, memory reclaim will NOT be
+ triggered for a cgroup when it hits K while staying below U, which makes
+ this setup impractical.
+
+U != 0, K >= U:
+ Since kmem charges are also fed to the user counter, reclaim will be
+ triggered for the cgroup for both kinds of memory. This setup gives the
+ admin a unified view of memory, and it is also useful for people who just
+ want to track kernel memory usage.
+
+3. User Interface
+=================
+
+3.0. Configuration
+------------------
+
+a. Enable CONFIG_CGROUPS
+b. Enable CONFIG_MEMCG
+c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
+d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
+
+3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
+-------------------------------------------------------------------
+
+::
+
+ # mount -t tmpfs none /sys/fs/cgroup
+ # mkdir /sys/fs/cgroup/memory
+ # mount -t cgroup none /sys/fs/cgroup/memory -o memory
+
+3.2. Make the new group and move bash into it::
+
+ # mkdir /sys/fs/cgroup/memory/0
+ # echo $$ > /sys/fs/cgroup/memory/0/tasks
+
+Since now we're in the 0 cgroup, we can alter the memory limit::
+
+ # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+
+NOTE:
+ We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
+ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
+ Gibibytes.)
+
+NOTE:
+ We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).
+
+NOTE:
+ We cannot set limits on the root cgroup any more.
+
+::
+
+ # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+ 4194304
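+
+The suffix convention from the notes above can be sketched in user space as
+follows. parse_size() is a hypothetical helper written for this illustration,
+not a kernel or library interface; it only covers the suffixes and the "-1"
+value that this document mentions.
+
```python
# Binary multipliers: the notes above state that k/m/g mean KiB/MiB/GiB.
_SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}

def parse_size(text):
    """Return the size in bytes, or None for "-1" (unlimited)."""
    text = text.strip()
    if text == "-1":
        return None
    suffix = text[-1:].lower()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)

print(parse_size("4M"))
```
+
+For "4M" this yields the same 4194304 that the cat command above reports.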
+
+We can check the usage::
+
+ # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
+ 1216512
+
+A successful write to this file does not guarantee a successful setting of
+this limit to the value written into the file. This can be due to a
+number of factors, such as rounding up to page boundaries or the total
+availability of memory on the system. The user is required to re-read
+this file after a write to guarantee the value committed by the kernel::
+
+ # echo 1 > memory.limit_in_bytes
+ # cat memory.limit_in_bytes
+ 4096
+
+The memory.failcnt field gives the number of times that the cgroup limit was
+exceeded.
+
+The memory.stat file gives accounting information. Now, the number of
+caches, RSS and Active pages/Inactive pages are shown.
+
+4. Testing
+==========
+
+For testing features and implementation, see memcg_test.txt.
+
+Performance testing is also important. To see the memory controller's pure
+overhead, testing on tmpfs will give you good measurements of its small
+overheads. Example: do a kernel make on tmpfs.
+
+Page-fault scalability is also important. When measuring parallel page
+faults, a multi-process test may be better than a multi-thread test because
+the latter has noise from shared objects/status.
+
+But the above two test extreme situations.
+Running your usual tests under the memory controller is always helpful.
+
+4.1 Troubleshooting
+-------------------
+
+Sometimes a user might find that the application under a cgroup is
+terminated by the OOM killer. There are several causes for this:
+
+1. The cgroup limit is too low (just too low to do anything useful)
+2. The user is using anonymous memory and swap is turned off or too low
+
+A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
+some of the pages cached in the cgroup (page cache pages).
+
+To find out what is happening, disable the OOM killer as per "10. OOM Control"
+(below) and see what happens.
+
+4.2 Task migration
+------------------
+
+When a task migrates from one cgroup to another, its charge is not
+carried forward by default. The pages allocated from the original cgroup still
+remain charged to it; the charge is dropped when the page is freed or
+reclaimed.
+
+You can move charges of a task along with task migration.
+See 8. "Move charges at task migration"
+
+4.3 Removing a cgroup
+---------------------
+
+A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
+cgroup might have some charge associated with it, even though all
+tasks have migrated away from it. (because we charge against pages, not
+against tasks.)
+
+We move the stats to root (if use_hierarchy==0) or parent (if
+use_hierarchy==1); the charge itself does not change except that it is
+uncharged from the child.
+
+Charges recorded in swap information are not updated at removal of a cgroup.
+The recorded information is discarded, and a cgroup which uses the swap
+(swapcache) will be charged as its new owner.
+
+About use_hierarchy, see Section 6.
+
+5. Misc. interfaces
+===================
+
+5.1 force_empty
+---------------
+ The memory.force_empty interface is provided to make a cgroup's memory usage
+ empty. When writing anything to this::
+
+ # echo 0 > memory.force_empty
+
+ the cgroup's memory is reclaimed, freeing as many pages as possible.
+
+ The typical use case for this interface is before calling rmdir().
+ Though rmdir() offlines the memcg, the memcg may still stay there due to
+ charged file caches. Some out-of-use page caches may stay charged until
+ memory pressure happens. If you want to avoid that, force_empty will be useful.
+
+ Also, note that when memory.kmem.limit_in_bytes is set the charges due to
+ kernel pages will still be seen. This is not considered a failure and the
+ write will still return success. In this case, it is expected that
+ memory.kmem.usage_in_bytes == memory.usage_in_bytes.
+
+ About use_hierarchy, see Section 6.
+
+5.2 stat file
+-------------
+
+The memory.stat file includes the following statistics:
+
+per-memory cgroup local status
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+=============== ===============================================================
+cache           # of bytes of page cache memory.
+rss             # of bytes of anonymous and swap cache memory (includes
+                transparent hugepages).
+rss_huge        # of bytes of anonymous transparent hugepages.
+mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
+pgpgin          # of charging events to the memory cgroup. The charging
+                event happens each time a page is accounted as either mapped
+                anon page(RSS) or cache page(Page Cache) to the cgroup.
+pgpgout         # of uncharging events to the memory cgroup. The uncharging
+                event happens each time a page is unaccounted from the cgroup.
+swap            # of bytes of swap usage
+dirty           # of bytes that are waiting to get written back to the disk.
+writeback       # of bytes of file/anon cache that are queued for syncing to
+                disk.
+inactive_anon   # of bytes of anonymous and swap cache memory on inactive
+                LRU list.
+active_anon     # of bytes of anonymous and swap cache memory on active
+                LRU list.
+inactive_file   # of bytes of file-backed memory on inactive LRU list.
+active_file     # of bytes of file-backed memory on active LRU list.
+unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
+=============== ===============================================================
+
+status considering hierarchy (see memory.use_hierarchy settings)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+========================= ===================================================
+hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
+                          under which the memory cgroup is
+hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to
+                          hierarchy under which memory cgroup is.
+
+total_<counter>           # hierarchical version of <counter>, which in
+                          addition to the cgroup's own value includes the
+                          sum of all hierarchical children's values of
+                          <counter>, e.g. total_cache
+========================= ===================================================
+
+The following additional stats are dependent on CONFIG_DEBUG_VM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+========================= ========================================
+recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
+recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
+recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
+recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
+========================= ========================================
+
+Memo:
+ recent_rotated means recent frequency of LRU rotation.
+ recent_scanned means recent # of scans to LRU.
+ These are shown to aid debugging; please see the code for their meanings.
+
+Note:
+ Only anonymous and swap cache memory is listed as part of 'rss' stat.
+ This should not be confused with the true 'resident set size' or the
+ amount of physical memory used by the cgroup.
+
+ 'rss + mapped_file' will give you the resident set size of the cgroup.
+
+ (Note: file and shmem may be shared among other cgroups. In that case,
+ mapped_file is accounted only when the memory cgroup is owner of page
+ cache.)
+
+5.3 swappiness
+--------------
+
+Overrides /proc/sys/vm/swappiness for the particular group. The tunable
+in the root cgroup corresponds to the global swappiness setting.
+
+Please note that unlike during the global reclaim, limit reclaim
+enforces that a swappiness of 0 really prevents any swapping even if
+swap storage is available. This might lead to the memcg OOM killer being
+invoked if there are no file pages to reclaim.
+
+5.4 failcnt
+-----------
+
+A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
+This failcnt(== failure count) shows the number of times that a usage counter
+hit its limit. When a memory cgroup hits a limit, failcnt increases and
+memory under it will be reclaimed.
+
+You can reset failcnt by writing 0 to failcnt file::
+
+ # echo 0 > .../memory.failcnt
+
+5.5 usage_in_bytes
+------------------
+
+For efficiency, like other kernel components, the memory cgroup uses some
+optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes is
+affected by this and doesn't show the 'exact' value of memory (and swap)
+usage; it's a fuzzy value for efficient access. (Of course, when necessary,
+it's synchronized.) If you want to know the memory usage more exactly, you
+should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
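+
+A minimal sketch of that calculation, assuming memory.stat is read as plain
+text with one "name value" pair per line. The helper names and the sample
+numbers are made up for illustration.
+
```python
def parse_memory_stat(text):
    """Turn the contents of memory.stat into a {name: int} dictionary."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value:
            stats[name] = int(value)
    return stats

def usage_from_stat(stats, include_swap=False):
    """RSS + CACHE, plus SWAP when memory+swap usage is wanted."""
    total = stats.get("rss", 0) + stats.get("cache", 0)
    if include_swap:
        total += stats.get("swap", 0)
    return total

sample = "cache 4096\nrss 8192\nswap 12288\n"  # invented values
print(usage_from_stat(parse_memory_stat(sample)))
print(usage_from_stat(parse_memory_stat(sample), include_swap=True))
```
+
+On a real system the text would come from reading the cgroup's memory.stat
+file instead of the sample string.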
+
+5.6 numa_stat
+-------------
+
+This is similar to numa_maps but operates on a per-memcg basis. This is
+useful for providing visibility into the numa locality information within
+a memcg since the pages are allowed to be allocated from any physical
+node. One of the use cases is evaluating application performance by
+combining this information with the application's CPU allocation.
+
+Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
+per-node page counts including "hierarchical_<counter>" which sums up all
+hierarchical children's values in addition to the memcg's own value.
+
+The output format of memory.numa_stat is::
+
+ total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
+ file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
+ anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+ unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
+ hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
+
+The "total" count is sum of file + anon + unevictable.
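+
+The format above is simple to parse. The following sketch does so and checks
+the total = file + anon + unevictable relation on invented numbers;
+parse_numa_stat() is a hypothetical helper, and the hierarchical_<counter>
+lines would be parsed the same way.
+
```python
def parse_numa_stat(text):
    """Map each counter name to (total_pages, {node_name: pages})."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, total = fields[0].split("=", 1)
        nodes = {}
        for field in fields[1:]:
            node, pages = field.split("=", 1)
            nodes[node] = int(pages)
        result[name] = (int(total), nodes)
    return result

sample = (  # invented values for a two-node machine
    "total=30 N0=20 N1=10\n"
    "file=12 N0=8 N1=4\n"
    "anon=15 N0=10 N1=5\n"
    "unevictable=3 N0=2 N1=1\n"
)
stats = parse_numa_stat(sample)
print(stats["total"][0] == sum(stats[k][0] for k in ("file", "anon", "unevictable")))
```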
+
+6. Hierarchy support
+====================
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider for example, the following cgroup filesystem
+hierarchy::
+
+       root
+     /  |   \
+    /   |    \
+   a    b     c
+              | \
+              |  \
+              d   e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors that have memory.use_hierarchy
+enabled, up to the root (i.e., c and root). If one of the ancestors goes over
+its limit, the reclaim algorithm reclaims from the tasks in the ancestor and
+the children of the ancestor.
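+
+As a rough illustration of the paragraph above, the sketch below walks the
+example tree and reports which group a new charge would push over its limit.
+It assumes use_hierarchy is enabled everywhere; the PARENT table, the function
+names and the page counts are invented for this example and do not mirror
+kernel data structures.
+
```python
# The example hierarchy: every cgroup maps to its parent.
PARENT = {"a": "root", "b": "root", "c": "root", "d": "c", "e": "c"}

def charged_groups(cgroup):
    """The cgroup itself followed by all of its ancestors up to root."""
    chain = [cgroup]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def first_over_limit(cgroup, usage, limits, pages):
    """First group on the path whose limit the charge would exceed, or None."""
    for group in charged_groups(cgroup):
        limit = limits.get(group)  # a missing entry means "no limit set"
        if limit is not None and usage.get(group, 0) + pages > limit:
            return group
    return None

print(charged_groups("e"))
print(first_over_limit("e", {"e": 10, "c": 90}, {"c": 100}, 20))
```
+
+In this model a charge to e that pushes c over its limit names c as the group
+to reclaim from, which covers the tasks in c and in its children d and e.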
+
+6.1 Enabling hierarchical accounting and reclaim
+------------------------------------------------
+
+A memory cgroup by default disables the hierarchy feature. Support
+can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::
+
+ # echo 1 > memory.use_hierarchy
+
+The feature can be disabled by::
+
+ # echo 0 > memory.use_hierarchy
+
+NOTE1:
+ Enabling/disabling will fail if either the cgroup already has other
+ cgroups created below it, or if the parent cgroup has use_hierarchy
+ enabled.
+
+NOTE2:
+ When panic_on_oom is set to "2", the whole system will panic in
+ case of an OOM event in any cgroup.
+
+7. Soft limits
+==============
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory, control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+Please note that soft limits are a best-effort feature; they come with
+no guarantees, but they do their best to make sure that when memory is
+heavily contended for, memory is allocated based on the soft limit
+hints/setup. Currently soft limit based reclaim is set up such that
+it gets invoked from balance_pgdat (kswapd).
+
+7.1 Interface
+-------------
+
+Soft limits can be set up by using the following commands (in this example we
+assume a soft limit of 256 MiB)::
+
+ # echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use::
+
+ # echo 1G > memory.soft_limit_in_bytes
+
+NOTE1:
+ Soft limits take effect over a long period of time, since they involve
+ reclaiming memory for balancing between memory cgroups.
+NOTE2:
+ It is recommended to set the soft limit always below the hard limit,
+ otherwise the hard limit will take precedence.
+
+8. Move charges at task migration
+=================================
+
+Users can move charges associated with a task along with task migration, that
+is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
+This feature is not supported in !CONFIG_MMU environments because of lack of
+page tables.
+
+8.1 Interface
+-------------
+
+This feature is disabled by default. It can be enabled (and disabled again) by
+writing to memory.move_charge_at_immigrate of the destination cgroup.
+
+If you want to enable it::
+
+ # echo (some positive value) > memory.move_charge_at_immigrate
+
+Note:
+ Each bit of move_charge_at_immigrate has its own meaning about what type
+ of charges should be moved. See 8.2 for details.
+Note:
+ Charges are moved only when you move mm->owner, in other words,
+ a leader of a thread group.
+Note:
+ If we cannot find enough space for the task in the destination cgroup, we
+ try to make space by reclaiming memory. Task migration may fail if we
+ cannot make enough space.
+Note:
+ It can take several seconds if you move many charges.
+
+And if you want to disable it again::
+
+ # echo 0 > memory.move_charge_at_immigrate
+
+8.2 Type of charges which can be moved
+--------------------------------------
+
+Each bit in move_charge_at_immigrate has its own meaning about what type of
+charges should be moved. But in any case, it must be noted that an account of
+a page or a swap can be moved only when it is charged to the task's current
+(old) memory cgroup.
+
++---+--------------------------------------------------------------------------+
+|bit| what type of charges would be moved ?                                    |
++===+==========================================================================+
+| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
+|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
++---+--------------------------------------------------------------------------+
+| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
+|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
+|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
+|   | will be moved even if the task hasn't done page fault, i.e. they might   |
+|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
+|   | And mapcount of the page is ignored (the page can be moved even if       |
+|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
+|   | enable move of swap charges.                                             |
++---+--------------------------------------------------------------------------+
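+
+Since the value written to memory.move_charge_at_immigrate is a bitmask, the
+table can be read as in the following sketch. The constant and function names
+are chosen for this illustration only.
+
```python
MOVE_ANON = 1 << 0  # bit 0: anonymous pages (and their swap)
MOVE_FILE = 1 << 1  # bit 1: mmapped file pages (and their swap)

def describe_move_charge(value):
    """List the charge types selected by a move_charge_at_immigrate value."""
    kinds = []
    if value & MOVE_ANON:
        kinds.append("anonymous pages")
    if value & MOVE_FILE:
        kinds.append("mmapped file pages")
    return kinds

for value in range(4):
    print(value, describe_move_charge(value))
```
+
+Writing 3 therefore selects both types, and writing 0 disables the feature.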
+
+8.3 TODO
+--------
+
+- All charge-moving operations are done under cgroup_mutex. It's not good
+  behavior to hold the mutex too long, so we may need some trick.
+
+9. Memory thresholds
+====================
+
+Memory cgroup implements memory thresholds using the cgroups notification
+API (see cgroups.txt). It allows an application to register multiple memory
+and memsw thresholds and get a notification when usage crosses one of them.
+
+To register a threshold, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
+- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
+ cgroup.event_control.
+
+The application will be notified through eventfd when memory usage crosses
+the threshold in either direction.
+
+This applies to both root and non-root cgroups.
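+
+The registration steps listed in this section could look roughly like this in
+user space. This is a sketch under assumptions: it relies on os.eventfd(),
+which exists only on Linux with Python 3.10 or later, and on a mounted cgroup
+v1 memory hierarchy; register_threshold() and event_control_line() are
+hypothetical helpers.
+
```python
import os

def event_control_line(event_fd, target_fd, threshold_bytes):
    """Build the string that is written to cgroup.event_control."""
    return f"{event_fd} {target_fd} {threshold_bytes}"

def register_threshold(cgroup_dir, threshold_bytes):
    """Register one usage threshold and return (eventfd, usage_fd).

    Both descriptors are left open; the caller closes them when done.
    """
    efd = os.eventfd(0)
    ufd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY)
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as control:
        control.write(event_control_line(efd, ufd, threshold_bytes))
    return efd, ufd

print(event_control_line(5, 6, 4 * 1024 * 1024))
```
+
+After a successful registration, os.eventfd_read(efd) blocks until the usage
+crosses the threshold.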
+
+10. OOM Control
+===============
+
+memory.oom_control file is for OOM notification and other controls.
+
+Memory cgroup implements an OOM notifier using the cgroup notification
+API (see cgroups.txt). It allows an application to register multiple OOM
+notification deliveries and get a notification when OOM happens.
+
+To register a notifier, an application must:
+
+ - create an eventfd using eventfd(2)
+ - open memory.oom_control file
+ - write string like "<event_fd> <fd of memory.oom_control>" to
+ cgroup.event_control
+
+The application will be notified through eventfd when OOM happens.
+OOM notification doesn't work for the root cgroup.
+
+You can disable the OOM-killer by writing "1" to memory.oom_control file, as::
+
+ # echo 1 > memory.oom_control
+
+If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
+in the memory cgroup's OOM-waitqueue when they request accountable memory.
+
+To get them running again, you have to relax the memory cgroup's OOM status by
+
+ * enlarging the limit or reducing usage.
+
+To reduce usage,
+
+ * kill some tasks.
+ * move some tasks to another group with account migration.
+ * remove some files (on tmpfs?)
+
+Then, stopped tasks will work again.
+
+When read, the file shows the current OOM status.
+
+ - oom_kill_disable 0 or 1
+ (if 1, oom-killer is disabled)
+ - under_oom 0 or 1
+ (if 1, the memory cgroup is under OOM, tasks may be stopped.)
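+
+As a minimal sketch of reading these fields from a script (the helper name
+``oom_field`` is hypothetical and not part of any kernel tool; the key names
+are the ones listed above)::
+
+ # Print the value of one key ("oom_kill_disable" or "under_oom") from a
+ # memory.oom_control file whose path is given as the second argument.
+ oom_field() {
+     awk -v key="$1" '$1 == key { print $2 }' "$2"
+ }
+
+ # Possible use from inside the cgroup directory:
+ # oom_field under_oom memory.oom_control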
+
+11. Memory Pressure
+===================
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies of managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache level. Upon notification, the program (typically
+"Activity Manager") might analyze vmstat and act in advance (e.g.
+shut down unimportant services early).
+
+The "medium" level means that the system is experiencing medium memory
+pressure; the system might be swapping, paging out active file caches,
+etc. Upon this event applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing; it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even on its
+way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take immediate action.
+
+By default, events are propagated upward until the event is handled, i.e. the
+events are not pass-through. For example, you have three cgroups: A->B->C. Now
+you set up an event listener on cgroups A, B and C, and suppose group C
+experiences some pressure. In this situation, only group C will receive the
+notification, i.e. groups A and B will not receive it. This is done to avoid
+excessive "broadcasting" of messages, which disturbs the system and which is
+especially bad if we are low on memory or thrashing. Group B will receive
+the notification only if there are no event listeners for group C.
+
+There are three optional modes that specify different propagation behavior:
+
+ - "default": this is the default behavior specified above. This mode is the
+ same as omitting the optional mode parameter, preserved for backwards
+ compatibility.
+
+ - "hierarchy": events always propagate up to the root. This is similar to the
+ default behavior, except that propagation continues regardless of whether
+ there are event listeners at each level. In the above example, groups A, B,
+ and C will all receive notification of memory pressure.
+
+ - "local": events are pass-through, i.e. listeners only receive notifications
+ when memory pressure is experienced in the memcg for which the notification
+ is registered. In the above example, group C will receive notification if
+ registered for "local" notification and the group experiences memory
+ pressure. However, group B will never receive notification, regardless of
+ whether there is an event listener for group C, if group B is registered
+ for local notification.
+
+The level and event notification mode ("hierarchy" or "local", if necessary) are
+specified by a comma-delimited string, e.g. "low,hierarchy" specifies
+hierarchical, pass-through notification for all ancestor memcgs. The default,
+non-pass-through behavior is selected by not specifying a mode.
+"medium,local" specifies pass-through notification for the medium level.
+
+The file memory.pressure_level is only used to set up an eventfd. To
+register a notification, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.pressure_level;
+- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
+ to cgroup.event_control.
+
+The application will be notified through the eventfd when memory pressure is at
+the specified level (or higher). Read/write operations on
+memory.pressure_level are not implemented.
+
+Test:
+
+ Here is a small script example that makes a new cgroup, sets up a
+ memory limit, sets up a notification in the cgroup and then makes child
+ cgroup experience a critical pressure::
+
+ # cd /sys/fs/cgroup/memory/
+ # mkdir foo
+ # cd foo
+ # cgroup_event_listener memory.pressure_level low,hierarchy &
+ # echo 8000000 > memory.limit_in_bytes
+ # echo 8000000 > memory.memsw.limit_in_bytes
+ # echo $$ > tasks
+ # dd if=/dev/zero | read x
+
+ (Expect a bunch of notifications, and eventually, the oom-killer will
+ trigger.)
+
+12. TODO
+========
+
+1. Make per-cgroup scanner reclaim not-shared pages first
+2. Teach controller to account for shared-pages
+3. Start reclamation in the background when the limit is
+ not yet hit but the usage is getting closer
+
+Summary
+=======
+
+Overall, the memory controller has been a stable controller and has been
+commented and discussed quite extensively in the community.
+
+References
+==========
+
+1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
+2. Singh, Balbir. Memory Controller (RSS Control),
+ http://lwn.net/Articles/222762/
+3. Emelianov, Pavel. Resource controllers based on process cgroups
+ http://lkml.org/lkml/2007/3/6/198
+4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
+ http://lkml.org/lkml/2007/4/9/78
+5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
+ http://lkml.org/lkml/2007/5/30/244
+6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
+7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
+ subsystem (v3), http://lwn.net/Articles/235534/
+8. Singh, Balbir. RSS controller v2 test results (lmbench),
+ http://lkml.org/lkml/2007/5/17/232
+9. Singh, Balbir. RSS controller v2 AIM9 results
+ http://lkml.org/lkml/2007/5/18/1
+10. Singh, Balbir. Memory controller v6 test results,
+ http://lkml.org/lkml/2007/8/19/36
+11. Singh, Balbir. Memory controller introduction (v6),
+ http://lkml.org/lkml/2007/8/17/69
+12. Corbet, Jonathan, Controlling memory use in cgroups,
+ http://lwn.net/Articles/243795/
diff --git a/Documentation/admin-guide/cgroup-v1/net_cls.rst b/Documentation/admin-guide/cgroup-v1/net_cls.rst
new file mode 100644
index 0000000..a2cf272
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/net_cls.rst
@@ -0,0 +1,44 @@
+=========================
+Network classifier cgroup
+=========================
+
+The Network classifier cgroup provides an interface to
+tag network packets with a class identifier (classid).
+
+The Traffic Controller (tc) can be used to assign
+different priorities to packets from different cgroups.
+Also, Netfilter (iptables) can use this tag to perform
+actions on such packets.
+
+Creating a net_cls cgroups instance creates a net_cls.classid file.
+This net_cls.classid value is initialized to 0.
+
+You can write hexadecimal values to net_cls.classid; the format for these
+values is 0xAAAABBBB; AAAA is the major handle number and BBBB
+is the minor handle number.
+Reading net_cls.classid yields a decimal result.
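+
+As a sketch of how the two handle numbers map onto the value, the shell
+helper below (``classid`` is a hypothetical name used only for illustration)
+builds the hexadecimal form from a tc-style major and minor handle, which tc
+itself writes in hexadecimal::
+
+ # Build 0xAAAABBBB from hexadecimal major ($1) and minor ($2) handles.
+ classid() {
+     printf '0x%04x%04x\n' "$((0x$1))" "$((0x$2))"
+ }
+
+ classid 10 1            # prints 0x00100001
+ printf '%d\n' 0x100001  # prints 1048577, the decimal form read back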
+
+Example (setting a 10:1 handle)::
+
+ mkdir /sys/fs/cgroup/net_cls
+ mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
+ mkdir /sys/fs/cgroup/net_cls/0
+ echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
+
+- reading the classid back::
+
+ cat /sys/fs/cgroup/net_cls/0/net_cls.classid
+ 1048577
+
+- configuring tc and creating traffic class 10:1::
+
+ tc qdisc add dev eth0 root handle 10: htb
+ tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
+
+- attaching the cgroup filter::
+
+ tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
+
+configuring iptables, basic example::
+
+ iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
diff --git a/Documentation/admin-guide/cgroup-v1/net_prio.rst b/Documentation/admin-guide/cgroup-v1/net_prio.rst
new file mode 100644
index 0000000..b409058
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/net_prio.rst
@@ -0,0 +1,57 @@
+=======================
+Network priority cgroup
+=======================
+
+The Network priority cgroup provides an interface to allow an administrator to
+dynamically set the priority of network traffic generated by various
+applications.
+
+Nominally, an application would set the priority of its traffic via the
+SO_PRIORITY socket option. This, however, is not always possible because:
+
+1) The application may not have been coded to set this value.
+2) The priority of application traffic is often a site-specific administrative
+ decision rather than an application defined one.
+
+This cgroup allows an administrator to assign a process to a group which defines
+the priority of egress traffic on a given interface. Network priority groups can
+be created by first mounting the cgroup filesystem::
+
+ # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
+
+With the above step, the initial group acting as the parent accounting group
+becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
+the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
+
+Each net_prio cgroup contains two files that are subsystem specific:
+
+net_prio.prioidx
+ This file is read-only, and is simply informative. It contains a unique
+ integer value that the kernel uses as an internal representation of this
+ cgroup.
+
+net_prio.ifpriomap
+ This file contains a map of the priorities assigned to traffic originating
+ from processes in this group and egressing the system on various interfaces.
+ It contains a list of tuples in the form <ifname priority>. Contents of this
+ file can be modified by echoing a string into the file using the same tuple
+ format. For example::
+
+ echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
+
+This command would force any traffic originating from processes belonging to the
+iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
+said traffic set to the value 5. The parent accounting group also has a
+writeable 'net_prio.ifpriomap' file that can be used to set a system default
+priority.
+
+Priorities are set immediately prior to queueing a frame to the device
+queueing discipline (qdisc) so priorities will be assigned prior to the hardware
+queue selection being made.
+
+One usage for the net_prio cgroup is with mqprio qdisc allowing application
+traffic to be steered to hardware/driver based traffic classes. These mappings
+can then be managed by administrators or other networking protocols such as
+DCBX.
+
+A new net_prio cgroup inherits the parent's configuration.
diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst
new file mode 100644
index 0000000..6acebd9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/pids.rst
@@ -0,0 +1,92 @@
+=========================
+Process Number Controller
+=========================
+
+Abstract
+--------
+
+The process number controller is used to allow a cgroup hierarchy to stop any
+new tasks from being fork()'d or clone()'d after a certain limit is reached.
+
+Since it is trivial to hit the task limit without hitting any kmemcg limits in
+place, PIDs are a fundamental resource. As such, PID exhaustion must be
+preventable in the scope of a cgroup hierarchy by allowing resource limiting of
+the number of tasks in a cgroup.
+
+Usage
+-----
+
+In order to use the `pids` controller, set the maximum number of tasks in
+pids.max (this is not available in the root cgroup for obvious reasons). The
+number of processes currently in the cgroup is given by pids.current.
+
+Organisational operations are not blocked by cgroup policies, so it is possible
+to have pids.current > pids.max. This can be done by either setting the limit to
+be smaller than pids.current, or attaching enough processes to the cgroup such
+that pids.current > pids.max. However, it is not possible to violate a cgroup
+policy through fork() or clone(). fork() and clone() will return -EAGAIN if the
+creation of a new process would cause a cgroup policy to be violated.
+
+To set a cgroup to have no limit, set pids.max to "max". This is the default for
+all new cgroups (N.B. PID limits are hierarchical, so the most stringent
+limit in the hierarchy is followed).
+
+pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
+superset of parent/child/pids.current.
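+
+As an illustration of how the two files relate, the sketch below computes how
+many more tasks may be created under a limit (``pids_headroom`` is a
+hypothetical helper, not a kernel interface; it looks at one cgroup only and
+ignores stricter limits in ancestors)::
+
+ # $1 is the content of pids.current, $2 the content of pids.max.
+ pids_headroom() {
+     if [ "$2" = "max" ]; then
+         echo "unlimited"
+     elif [ "$1" -ge "$2" ]; then
+         echo 0
+     else
+         echo "$(($2 - $1))"
+     fi
+ }
+
+ # Possible use from inside the cgroup directory:
+ # pids_headroom "$(cat pids.current)" "$(cat pids.max)"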
+
+The pids.events file contains event counters:
+
+ - max: Number of times fork failed because limit was hit.
+
+Example
+-------
+
+First, we mount the pids controller::
+
+ # mkdir -p /sys/fs/cgroup/pids
+ # mount -t cgroup -o pids none /sys/fs/cgroup/pids
+
+Then we create a hierarchy, set limits and attach processes to it::
+
+ # mkdir -p /sys/fs/cgroup/pids/parent/child
+ # echo 2 > /sys/fs/cgroup/pids/parent/pids.max
+ # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
+ # cat /sys/fs/cgroup/pids/parent/pids.current
+ 2
+ #
+
+It should be noted that attempts to overcome the set limit (2 in this case) will
+fail::
+
+ # cat /sys/fs/cgroup/pids/parent/pids.current
+ 2
+ # ( /bin/echo "Here's some processes for you." | cat )
+ sh: fork: Resource temporarily unavailable
+ #
+
+Even if we migrate to a child cgroup (which doesn't have a set limit), we will
+not be able to overcome the most stringent limit in the hierarchy (in this case,
+parent's)::
+
+ # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
+ # cat /sys/fs/cgroup/pids/parent/pids.current
+ 2
+ # cat /sys/fs/cgroup/pids/parent/child/pids.current
+ 2
+ # cat /sys/fs/cgroup/pids/parent/child/pids.max
+ max
+ # ( /bin/echo "Here's some processes for you." | cat )
+ sh: fork: Resource temporarily unavailable
+ #
+
+We can set a limit that is smaller than pids.current, which will stop any new
+processes from being forked at all (note that the shell itself counts towards
+pids.current)::
+
+ # echo 1 > /sys/fs/cgroup/pids/parent/pids.max
+ # /bin/echo "We can't even spawn a single process now."
+ sh: fork: Resource temporarily unavailable
+ # echo 0 > /sys/fs/cgroup/pids/parent/pids.max
+ # /bin/echo "We can't even spawn a single process now."
+ sh: fork: Resource temporarily unavailable
+ #
diff --git a/Documentation/admin-guide/cgroup-v1/rdma.rst b/Documentation/admin-guide/cgroup-v1/rdma.rst
new file mode 100644
index 0000000..2fcb0a9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/rdma.rst
@@ -0,0 +1,117 @@
+===============
+RDMA Controller
+===============
+
+.. Contents
+
+ 1. Overview
+ 1-1. What is RDMA controller?
+ 1-2. Why RDMA controller needed?
+ 1-3. How is RDMA controller implemented?
+ 2. Usage Examples
+
+1. Overview
+===========
+
+1-1. What is RDMA controller?
+-----------------------------
+
+RDMA controller allows the user to limit the RDMA/IB specific resources that a
+given set of processes can use. These processes are grouped using RDMA controller.
+
+RDMA controller defines two resources which can be limited for processes of a
+cgroup.
+
+1-2. Why RDMA controller needed?
+--------------------------------
+
+Currently user space applications can easily take away all the rdma verb
+specific resources such as AH, CQ, QP, MR etc. As a result, other applications
+in other cgroups or kernel space ULPs may not even get a chance to allocate any
+rdma resources. This can lead to service unavailability.
+
+Therefore an RDMA controller is needed, through which the resource consumption
+of processes can be limited. Through this controller different rdma
+resources can be accounted.
+
+1-3. How is RDMA controller implemented?
+----------------------------------------
+
+RDMA cgroup allows limit configuration of resources. Rdma cgroup maintains
+resource accounting per cgroup, per device using a resource pool structure.
+Each such resource pool is limited by rdma cgroup to at most 64 resources,
+which can be extended later if required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases,
+but nothing prevents having more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally; however, there is no
+known use case or requirement for such a configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which share the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+cgroup to another without the major complexity of transferring resource
+ownership, because such ownership is not really present due to the shared
+nature of rdma resources. Linking resources around css also ensures that
+cgroups can be deleted after processes have migrated. This allows process
+migration with active resources as well, even though that is not a primary
+use case.
+
+Whenever RDMA resource charging occurs, the owner rdma cgroup is returned to
+the caller. The same rdma cgroup should be passed while uncharging the resource.
+This also allows a process migrated with active RDMA resources to charge
+new resources to its new owner cgroup. It also allows uncharging a resource of
+a process that has migrated to a new cgroup from the previously charged cgroup,
+even though that is not a primary use case.
+
+A resource pool object is created in the following situations.
+
+(a) The user sets the limit and no previous resource pool exists for the device
+    of interest for the cgroup.
+(b) No resource limits were configured, but the IB/RDMA stack tries to
+    charge the resource. The pool is created so that resources charged while
+    applications run without limits are correctly uncharged once limits are
+    enforced later; otherwise the usage count would drop below zero.
+
+A resource pool is destroyed if all the resource limits are set to max and
+it is the last resource getting deallocated.
+
+The user should set all the limits to the max value if they intend to
+remove/unconfigure the resource pool for a particular device.
+
+The IB stack honors limits enforced by the rdma controller. When an application
+queries the maximum resource limits of an IB device, it gets the minimum of
+what is configured by the user for a given cgroup and what is supported by
+the IB device.
+
+The following resources can be accounted by rdma controller.
+
+ ========== =============================
+ hca_handle Maximum number of HCA Handles
+ hca_object Maximum number of HCA Objects
+ ========== =============================
+
+2. Usage Examples
+=================
+
+(a) Configure resource limit::
+
+ echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
+ echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
+
+(b) Query resource limit::
+
+ cat /sys/fs/cgroup/rdma/2/rdma.max
+ #Output:
+ mlx4_0 hca_handle=2 hca_object=2000
+ ocrdma1 hca_handle=3 hca_object=max
+
+(c) Query current usage::
+
+ cat /sys/fs/cgroup/rdma/2/rdma.current
+ #Output:
+ mlx4_0 hca_handle=1 hca_object=20
+ ocrdma1 hca_handle=1 hca_object=23
+
+(d) Delete resource limit::
+
+ echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193b..5361ebe 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -9,7 +9,7 @@
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
-v1 is available under Documentation/cgroup-v1/.
+v1 is available under Documentation/admin-guide/cgroup-v1/.
.. CONTENTS
@@ -56,11 +56,13 @@
5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
- 5-6-1. RDMA Interface Files
- 5-7. Misc
- 5-7-1. perf_event
+ 5-5. Cpuset
+ 5-5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+ 5-7-1. RDMA Interface Files
+ 5-8. Misc
+ 5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -175,6 +177,15 @@
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
+ memory_localevents
+
+ Only populate memory.events with data for the current cgroup,
+ and not any subtrees. This is legacy behaviour; the default
+ behaviour without this option is to include subtree counts.
+ This option is system wide and can only be set on mount or
+ modified through remount from the init namespace. The mount
+ option is ignored on non-init namespace mounts.
+
Organizing Processes and Threads
--------------------------------
@@ -604,8 +615,8 @@
Protections
-----------
-A cgroup is protected to be allocated upto the configured amount of
-the resource if the usages of all its ancestors are under their
+A cgroup is protected up to the configured amount of the resource
+as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
only upto the amount available to the parent is protected among
@@ -694,6 +705,12 @@
informational files on the root cgroup which end up showing global
information available elsewhere shouldn't exist.
+- The default time unit is microseconds. If a different unit is ever
+ used, an explicit unit suffix must be present.
+
+- A parts-per quantity should use a percentage decimal with at least
+ a two-digit fractional part - e.g. 13.40.
+
- If a controller implements weight based resource distribution, its
interface file should be named "weight" and have the range [1,
10000] with 100 as the default. The values are chosen to allow
@@ -862,6 +879,8 @@
populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+ 1 if the cgroup is frozen; otherwise, 0.
cgroup.max.descendants
A read-write single value files. The default is "max".
@@ -895,6 +914,31 @@
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
+ cgroup.freeze
+ A read-write single value file which exists on non-root cgroups.
+ Allowed values are "0" and "1". The default is "0".
+
+ Writing "1" to the file causes freezing of the cgroup and all
+ descendant cgroups. This means that all processes belonging to them
+ will be stopped and will not run until the cgroup is explicitly
+ unfrozen. Freezing of the cgroup may take some time; when this action
+ is completed, the "frozen" value in the cgroup.events control file
+ will be updated to "1" and the corresponding notification will be
+ issued.
+
+ A cgroup can be frozen either by its own settings, or by settings
+ of any ancestor cgroups. If any ancestor cgroup is frozen, the
+ cgroup will remain frozen.
+
+ Processes in the frozen cgroup can be killed by a fatal signal.
+ They also can enter and leave a frozen cgroup: either by an explicit
+ move by a user, or if freezing of the cgroup races with fork().
+ If a process is moved to a frozen cgroup, it stops. If a process is
+ moved out of a frozen cgroup, it becomes running.
+
+ Frozen status of a cgroup doesn't affect any cgroup tree operations:
+ it's possible to delete a frozen (and empty) cgroup, as well as
+ create new sub-cgroups.
Controllers
===========
@@ -907,6 +951,13 @@
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+In all the above models, cycles distribution is defined only on a temporal
+basis and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows hinting the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -966,6 +1017,39 @@
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for CPU. See
+ Documentation/accounting/psi.rst for details.
+
+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values, similar to sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max", i.e. no utilization capping.
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values, similar to sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+
Memory
------
@@ -1012,7 +1096,10 @@
is within its effective min boundary, the cgroup's memory
won't be reclaimed under any conditions. If there is no
unprotected reclaimable memory available, OOM killer
- is invoked.
+ is invoked. Above the effective min boundary (or
+ effective low boundary if it is higher), pages are reclaimed
+ proportionally to the overage, reducing reclaim pressure for
+ smaller overages.
Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
@@ -1034,7 +1121,10 @@
Best-effort memory protection. If the memory usage of a
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless memory can be reclaimed
- from unprotected cgroups.
+ from unprotected cgroups. Above the effective low boundary (or
+ effective min boundary if it is higher), pages are reclaimed
+ proportionally to the overage, reducing reclaim pressure for
+ smaller overages.
Effective low boundary is limited by memory.low values of
all ancestor cgroups. If there is memory.low overcommitment
@@ -1096,6 +1186,11 @@
otherwise, a value change in this file generates a file
modified event.
+ Note that all fields in this file are hierarchical and the
+ file modified event can be generated due to an event down the
+ hierarchy. For the local events at the cgroup level see
+ memory.events.local.
+
low
The number of times the cgroup is reclaimed due to
high memory pressure even though its usage is under
@@ -1127,10 +1222,19 @@
disk readahead. For now OOM in memory cgroup kills
tasks iff shortage has happened inside page fault.
+ This event is not raised if the OOM killer is not
+ considered as an option, e.g. for failed high-order
+ allocations.
+
oom_kill
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ memory.events.local
+ Similar to memory.events but the fields in the file are local
+ to the cgroup i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
+
memory.stat
A read-only flat-keyed file which exists on non-root cgroups.
@@ -1177,6 +1281,10 @@
Amount of cached filesystem data that was modified and
is currently being written back to disk
+ anon_thp
+ Amount of memory used in anonymous mappings backed by
+ transparent hugepages
+
inactive_anon, active_anon, inactive_file, active_file, unevictable
Amount of memory, swap-backed and filesystem-backed,
on the internal memory management lists used by the
@@ -1236,6 +1344,18 @@
Amount of reclaimed lazyfree pages
+ thp_fault_alloc
+
+ Number of transparent hugepages which were allocated to satisfy
+ a page fault, including COW faults. This counter is not present
+ when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+
+ thp_collapse_alloc
+
+ Number of transparent hugepages which were allocated to allow
+ collapsing an existing range of pages. This counter is not
+ present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+
memory.swap.current
A read-only single value file which exists on non-root
cgroups.
@@ -1271,6 +1391,12 @@
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.pressure
+ A read-only nested-key file which exists on non-root cgroups.
+
+ Shows pressure stall information for memory. See
+ Documentation/accounting/psi.rst for details.
+
Usage Guidelines
~~~~~~~~~~~~~~~~
@@ -1349,6 +1475,103 @@
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
+ io.cost.qos
+ A read-write nested-keyed file which exists only on the root
+ cgroup.
+
+ This file configures the Quality of Service of the IO cost
+ model based controller (CONFIG_BLK_CGROUP_IOCOST) which
+ currently implements "io.weight" proportional control. Lines
+ are keyed by $MAJ:$MIN device numbers and not ordered. The
+ line for a given device is populated on the first write for
+ the device on "io.cost.qos" or "io.cost.model". The following
+ nested keys are defined.
+
+ ====== =====================================
+ enable Weight-based control enable
+ ctrl "auto" or "user"
+ rpct Read latency percentile [0, 100]
+ rlat Read latency threshold
+ wpct Write latency percentile [0, 100]
+ wlat Write latency threshold
+ min Minimum scaling percentage [1, 10000]
+ max Maximum scaling percentage [1, 10000]
+ ====== =====================================
+
+ The controller is disabled by default and can be enabled by
+ setting "enable" to 1. "rpct" and "wpct" parameters default
+ to zero and the controller uses internal device saturation
+ state to adjust the overall IO rate between "min" and "max".
+
+ When a better control quality is needed, latency QoS
+ parameters can be configured. For example::
+
+ 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
+
+ shows that on sdb, the controller is enabled, will consider
+ the device saturated if the 95th percentile of read completion
+ latencies is above 75ms or write 150ms, and adjust the overall
+ IO issue rate between 50% and 150% accordingly.
+
+ The lower the saturation point, the better the latency QoS at
+ the cost of aggregate bandwidth. The narrower the allowed
+ adjustment range between "min" and "max", the more conformant
+ to the cost model the IO behavior. Note that the IO issue
+ base rate may be far off from 100% and setting "min" and "max"
+ blindly can lead to a significant loss of device capacity or
+ control quality. "min" and "max" are useful for regulating
+ devices which show wide temporary behavior changes - e.g. an
+ ssd which accepts writes at the line speed for a while and
+ then completely stalls for multiple seconds.
+
+ When "ctrl" is "auto", the parameters are controlled by the
+ kernel and may change automatically. Setting "ctrl" to "user"
+ or setting any of the percentile and latency parameters puts
+ it into "user" mode and disables the automatic changes. The
+ automatic mode can be restored by setting "ctrl" to "auto".
+
+ io.cost.model
+ A read-write nested-keyed file which exists only on the root
+ cgroup.
+
+ This file configures the cost model of the IO cost model based
+ controller (CONFIG_BLK_CGROUP_IOCOST) which currently
+ implements "io.weight" proportional control. Lines are keyed
+ by $MAJ:$MIN device numbers and not ordered. The line for a
+ given device is populated on the first write for the device on
+ "io.cost.qos" or "io.cost.model". The following nested keys
+ are defined.
+
+ ===== ================================
+ ctrl "auto" or "user"
+ model The cost model in use - "linear"
+ ===== ================================
+
+ When "ctrl" is "auto", the kernel may change all parameters
+ dynamically. When "ctrl" is set to "user" or any other
+ parameters are written to, "ctrl" becomes "user" and the
+ automatic changes are disabled.
+
+ When "model" is "linear", the following model parameters are
+ defined.
+
+ ============= ========================================
+ [r|w]bps The maximum sequential IO throughput
+ [r|w]seqiops The maximum 4k sequential IOs per second
+ [r|w]randiops The maximum 4k random IOs per second
+ ============= ========================================
+
+ From the above, the builtin linear model determines the base
+ costs of a sequential and random IO and the cost coefficient
+ for the IO size. While simple, this model can cover most
+ common device classes acceptably.
+
+ The IO cost model isn't expected to be accurate in an absolute
+ sense and is scaled to the device behavior dynamically.
+
+ If needed, tools/cgroup/iocost_coef_gen.py can be used to
+ generate device-specific coefficients.
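+
+ A minimal sketch of supplying user coefficients for the linear
+ model, assuming cgroup2 is mounted at /sys/fs/cgroup. The mount
+ point, the 8:16 device and every number below are placeholders,
+ not measured values::
+
+   # echo "8:16 ctrl=user model=linear rbps=200000000 rseqiops=50000 rrandiops=40000 wbps=150000000 wseqiops=40000 wrandiops=30000" > /sys/fs/cgroup/io.cost.model
+   # cat /sys/fs/cgroup/io.cost.model
+
+ Real coefficients would normally come from measuring the device,
+ for example with the generator script mentioned above.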
+
io.weight
A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100".
@@ -1408,6 +1631,12 @@
8:16 rbps=2097152 wbps=max riops=max wiops=max
+ io.pressure
+ A read-only nested-keyed file which exists on non-root cgroups.
+
+ Shows pressure stall information for IO. See
+ Documentation/accounting/psi.rst for details.
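+
+ As an illustration, a read on an idle cgroup could look like the
+ following. The cgroup path is hypothetical and the figures are
+ placeholders; the field layout is the one described in
+ Documentation/accounting/psi.rst::
+
+   # cat /sys/fs/cgroup/mygroup/io.pressure
+   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+   full avg10=0.00 avg60=0.00 avg300=0.00 total=0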
+
Writeback
~~~~~~~~~
@@ -1479,7 +1708,7 @@
The limits are only applied at the peer level in the hierarchy. This means that
in the diagram below, only groups A, B, and C will influence each other, and
-groups D and F will influence each other. Group G will influence nobody.
+groups D and F will influence each other. Group G will influence nobody::
[root]
/ | \
@@ -1588,6 +1817,176 @@
of a new process would cause a cgroup policy to be violated.
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical. That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+ cpuset.cpus
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested CPUs to be used by tasks within this
+ cgroup. The actual list of CPUs to be granted, however, is
+ subject to constraints imposed by its parent and can differ
+ from the requested CPUs.
+
+ The CPU numbers are comma-separated numbers or ranges.
+ For example::
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.cpus" or all the available CPUs if none is found.
+
+ The value of "cpuset.cpus" stays constant until the next update
+ and won't be affected by any CPU hotplug events.
+
+ cpuset.cpus.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined CPUs that are actually granted to this
+ cgroup by its parent. These CPUs are allowed to be used by
+ tasks within the current cgroup.
+
+ If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+ all the CPUs from the parent cgroup that can be available to
+ be used by this cgroup. Otherwise, it should be a subset of
+ "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+ can be granted. In this case, it will be treated just like an
+ empty "cpuset.cpus".
+
+ Its value will be affected by CPU hotplug events.
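+
+ For illustration only, assume a hypothetical cgroup "grp" on a
+ machine with CPUs 0-7 whose parent has the cpuset controller
+ enabled and only CPUs 0-3 in its own "cpuset.cpus.effective".
+ A request that is wider than what the parent can grant could be
+ inspected like this::
+
+   # echo "2-5" > grp/cpuset.cpus
+   # cat grp/cpuset.cpus
+   # cat grp/cpuset.cpus.effective
+
+ Going by the descriptions above, the first read is expected to
+ return the request as written, while the second lists only the
+ requested CPUs that the parent actually grants.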
+
+ cpuset.mems
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested memory nodes to be used by tasks within
+ this cgroup. The actual list of memory nodes granted, however,
+ is subject to constraints imposed by its parent and can differ
+ from the requested memory nodes.
+
+ The memory node numbers are comma-separated numbers or ranges.
+ For example::
+
+ # cat cpuset.mems
+ 0-1,3
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.mems" or all the available memory nodes if none
+ is found.
+
+ The value of "cpuset.mems" stays constant until the next update
+ and won't be affected by any memory node hotplug events.
+
+ cpuset.mems.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined memory nodes that are actually granted to
+ this cgroup by its parent. These memory nodes are allowed to
+ be used by tasks within the current cgroup.
+
+ If "cpuset.mems" is empty, it shows all the memory nodes from the
+ parent cgroup that will be available to be used by this cgroup.
+ Otherwise, it should be a subset of "cpuset.mems" unless none of
+ the memory nodes listed in "cpuset.mems" can be granted. In this
+ case, it will be treated just like an empty "cpuset.mems".
+
+ Its value will be affected by memory node hotplug events.
+
+ cpuset.cpus.partition
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups. This flag is owned by the parent cgroup
+ and is not delegatable.
+
+ It accepts only the following input values when written to.
+
+ "root" - a partition root
+ "member" - a non-root member of a partition
+
+ When set to be a partition root, the current cgroup is the
+ root of a new partition or scheduling domain that comprises
+ itself and all its descendants except those that are separate
+ partition roots themselves and their descendants. The root
+ cgroup is always a partition root.
+
+ There are constraints on where a partition root can be set.
+ It can only be set in a cgroup if all the following conditions
+ are true.
+
+ 1) The "cpuset.cpus" is not empty and the list of CPUs is
+ exclusive, i.e. they are not shared by any of its siblings.
+ 2) The parent cgroup is a partition root.
+ 3) The "cpuset.cpus" is also a proper subset of the parent's
+ "cpuset.cpus.effective".
+ 4) There are no child cgroups with cpuset enabled. This is for
+ eliminating corner cases that have to be handled if such a
+ condition is allowed.
+
+ Setting it to partition root will take the CPUs away from the
+ effective CPUs of the parent cgroup. Once it is set, this
+ file cannot be reverted back to "member" if there are any child
+ cgroups with cpuset enabled.
+
+ A parent partition cannot distribute all its CPUs to its
+ child partitions. There must be at least one cpu left in the
+ parent partition.
+
+ Once becoming a partition root, changes to "cpuset.cpus" are
+ generally allowed as long as the first condition above is true,
+ the change will not take away all the CPUs from the parent
+ partition and the new "cpuset.cpus" value is a superset of its
+ children's "cpuset.cpus" values.
+
+ Sometimes, external factors like changes to ancestors'
+ "cpuset.cpus" or cpu hotplug can cause the state of the partition
+ root to change. On read, the "cpuset.cpus.partition" file
+ can show the following values.
+
+ "member" Non-root member of a partition
+ "root" Partition root
+ "root invalid" Invalid partition root
+
+ It is a partition root if the first 2 partition root conditions
+ above are true and at least one CPU from "cpuset.cpus" is
+ granted by the parent cgroup.
+
+ A partition root can become invalid if none of the CPUs requested
+ in "cpuset.cpus" can be granted by the parent cgroup or the
+ parent cgroup is no longer a partition root itself. In this
+ case, it is not a real partition even though the restriction
+ of the first partition root condition above will still apply.
+ The cpu affinity of all the tasks in the cgroup will then be
+ associated with CPUs in the nearest ancestor partition.
+
+ An invalid partition root can be transitioned back to a
+ real partition root if at least one of the requested CPUs
+ can now be granted by its parent. In this case, the cpu
+ affinity of all the tasks in the formerly invalid partition
+ will be associated with the CPUs of the newly formed partition.
+ Changing the partition state of an invalid partition root to
+ "member" is always allowed even if child cpusets are present.
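+
+ A hedged sketch of creating a partition, assuming cgroup2 is
+ mounted at /sys/fs/cgroup, the root cgroup's effective CPUs
+ include 2-3 plus at least one other CPU, and the hypothetical
+ child cgroup "part" has no sibling using CPUs 2-3::
+
+   # cd /sys/fs/cgroup
+   # echo "+cpuset" > cgroup.subtree_control
+   # mkdir part
+   # echo "2-3" > part/cpuset.cpus
+   # echo root > part/cpuset.cpus.partition
+   # cat part/cpuset.cpus.partition
+
+ If the last write is accepted, CPUs 2-3 are taken away from the
+ effective CPUs of the root cgroup, as described above. The final
+ read reports the resulting state, which can later become
+ "root invalid" if the requested CPUs can no longer be granted.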
+
+
Device controller
-----------------
@@ -1857,10 +2256,12 @@
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
- associates the bio with the inode's owner cgroup. Can be
- called anytime between bio allocation and submission.
+ associates the bio with the inode's owner cgroup and the
+ corresponding request queue. This must be called after
+ a queue (device) has been associated with the bio and
+ before submission.
- wbc_account_io(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @page, @bytes)
Should be called for each data segment being written out.
While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most
@@ -1877,7 +2278,7 @@
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.
@@ -2087,8 +2488,10 @@
becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated
-reserve. A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve. A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.
The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.
diff --git a/Documentation/admin-guide/cifs/authors.rst b/Documentation/admin-guide/cifs/authors.rst
new file mode 100644
index 0000000..b02d6dd
--- /dev/null
+++ b/Documentation/admin-guide/cifs/authors.rst
@@ -0,0 +1,69 @@
+=======
+Authors
+=======
+
+Original Author
+---------------
+
+Steve French (sfrench@samba.org)
+
+The author wishes to express his appreciation and thanks to:
+Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS
+improvements. Thanks to IBM for allowing me time and test resources to pursue
+this project, to Jim McDonough from IBM (and the Samba Team) for his help, to
+the IBM Linux JFS team for explaining many esoteric Linux filesystem features.
+Jeremy Allison of the Samba team has done invaluable work in adding the server
+side of the original CIFS Unix extensions and reviewing and implementing
+portions of the newer CIFS POSIX extensions into the Samba 3 file server. Thanks to
+Dave Boutcher of IBM Rochester (author of the OS/400 smb/cifs filesystem client)
+for proving years ago that very good smb/cifs clients could be done on Unix-like
+operating systems. Volker Lendecke, Andrew Tridgell, Urban Widmark, John
+Newbigin and others for their work on the Linux smbfs module. Thanks to
+the other members of the Storage Network Industry Association CIFS Technical
+Workgroup for their work specifying this highly complex protocol and finally
+thanks to the Samba team for their technical advice and encouragement.
+
+Patch Contributors
+------------------
+
+- Zwane Mwaikambo
+- Andi Kleen
+- Amrut Joshi
+- Shobhit Dayal
+- Sergey Vlasov
+- Richard Hughes
+- Yury Umanets
+- Mark Hamzy (for some of the early cifs IPv6 work)
+- Domen Puncer
+- Jesper Juhl (in particular for lots of whitespace/formatting cleanup)
+- Vince Negri and Dave Stahl (for finding an important caching bug)
+- Adrian Bunk (kcalloc cleanups)
+- Miklos Szeredi
+- Kazeon team for various fixes especially for 2.4 version.
+- Asser Ferno (Change Notify support)
+- Shaggy (Dave Kleikamp) for innumerable small fs suggestions and some good cleanup
+- Gunter Kukkukk (testing and suggestions for support of old servers)
+- Igor Mammedov (DFS support)
+- Jeff Layton (many, many fixes, as well as great work on the cifs Kerberos code)
+- Scott Lovenberg
+- Pavel Shilovsky (for great work adding SMB2 support, and various SMB3 features)
+- Aurelien Aptel (for DFS SMB3 work and some key bug fixes)
+- Ronnie Sahlberg (for SMB3 xattr work, bug fixes, and lots of great work on compounding)
+- Shirish Pargaonkar (for many ACL patches over the years)
+- Sachin Prabhu (many bug fixes, including for reconnect, copy offload and security)
+- Paulo Alcantara
+- Long Li (some great work on RDMA, SMB Direct)
+
+
+Test case and Bug Report contributors
+-------------------------------------
+Thanks to those in the community who have submitted detailed bug reports
+and debug of problems they have found: Jochen Dolze, David Blaine,
+Rene Scharfe, Martin Josefsson, Alexander Wild, Anthony Liguori,
+Lars Muller, Urban Widmark, Massimiliano Ferrero, Howard Owen,
+Olaf Kirch, Kieron Briggs, Nick Millington and others. Also special
+mention to the Stanford Checker (SWAT) which pointed out many minor
+bugs in error paths. Valuable suggestions also have come from Al Viro
+and Dave Miller.
+
+And thanks to the IBM LTC and Power test teams and SuSE and Citrix and RedHat testers for finding multiple bugs during excellent stress test runs.
diff --git a/Documentation/admin-guide/cifs/changes.rst b/Documentation/admin-guide/cifs/changes.rst
new file mode 100644
index 0000000..71f2ecb
--- /dev/null
+++ b/Documentation/admin-guide/cifs/changes.rst
@@ -0,0 +1,8 @@
+=======
+Changes
+=======
+
+See https://wiki.samba.org/index.php/LinuxCIFSKernel for summary
+information (that may be easier to read than parsing the output of
+"git log fs/cifs") about fixes/improvements to CIFS/SMB2/SMB3 support (changes
+to cifs.ko module) by kernel version (and cifs internal module version).
diff --git a/Documentation/admin-guide/cifs/index.rst b/Documentation/admin-guide/cifs/index.rst
new file mode 100644
index 0000000..fad5268
--- /dev/null
+++ b/Documentation/admin-guide/cifs/index.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+CIFS
+====
+
+.. toctree::
+ :maxdepth: 2
+
+ introduction
+ usage
+ todo
+ changes
+ authors
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/cifs/introduction.rst b/Documentation/admin-guide/cifs/introduction.rst
new file mode 100644
index 0000000..0b98f67
--- /dev/null
+++ b/Documentation/admin-guide/cifs/introduction.rst
@@ -0,0 +1,53 @@
+============
+Introduction
+============
+
+ This is the client VFS module for the SMB3 NAS protocol as well
+ as for older dialects such as the Common Internet File System (CIFS)
+ protocol which was the successor to the Server Message Block
+ (SMB) protocol, the native file sharing mechanism for most early
+ PC operating systems. New and improved versions of CIFS are now
+ called SMB2 and SMB3. Use of SMB3 (and later, including SMB3.1.1)
+ is strongly preferred over using older dialects like CIFS due to
+ security reasons. All modern dialects, including the most recent,
+ SMB3.1.1 are supported by the CIFS VFS module. The SMB3 protocol
+ is implemented and supported by all major file servers
+ such as all modern versions of Windows (including Windows 2016
+ Server), as well as by Samba (which provides excellent
+ CIFS/SMB2/SMB3 server support and tools for Linux and many other
+ operating systems). Apple systems also support SMB3 well, as
+ do most Network Attached Storage vendors, so this network
+ filesystem client can mount to a wide variety of systems.
+ It also supports mounting to the cloud (for example
+ Microsoft Azure), including the necessary security features.
+
+ The intent of this module is to provide the most advanced network
+ file system function for SMB3 compliant servers, including advanced
+ security features, excellent parallelized high performance i/o, better
+ POSIX compliance, secure per-user session establishment, encryption,
+ high performance safe distributed caching (leases/oplocks), optional packet
+ signing, large files, Unicode support and other internationalization
+ improvements. Since both Samba server and this filesystem client support
+ the CIFS Unix extensions (and in the future SMB3 POSIX extensions),
+ the combination can provide a reasonable alternative to other network and
+ cluster file systems for fileserving in some Linux to Linux environments,
+ not just in Linux to Windows (or Linux to Mac) environments.
+
+ This filesystem has a mount utility (mount.cifs) and various user space
+ tools (including smbinfo and setcifsacl) that can be obtained from
+
+ https://git.samba.org/?p=cifs-utils.git
+
+ or
+
+ git://git.samba.org/cifs-utils.git
+
+ mount.cifs should be installed in the directory with the other mount helpers.
+
+ For more information on the module see the project wiki page at
+
+ https://wiki.samba.org/index.php/LinuxCIFS
+
+ and
+
+ https://wiki.samba.org/index.php/LinuxCIFS_utils
diff --git a/Documentation/admin-guide/cifs/todo.rst b/Documentation/admin-guide/cifs/todo.rst
new file mode 100644
index 0000000..084c25f
--- /dev/null
+++ b/Documentation/admin-guide/cifs/todo.rst
@@ -0,0 +1,133 @@
+====
+TODO
+====
+
+Version 2.14 December 21, 2018
+
+A Partial List of Missing Features
+==================================
+
+Contributions are welcome. There are plenty of opportunities
+for visible, important contributions to this module. Here
+is a partial list of the known problems and missing features:
+
+a) SMB3 (and SMB3.1.1) missing optional features:
+
+ - multichannel (started), integration with RDMA
+ - directory leases (improved metadata caching), started (root dir only)
+ - T10 copy offload ie "ODX" (copy chunk, and "Duplicate Extents" ioctl
+ currently the only two server side copy mechanisms supported)
+
+b) improved sparse file support (fiemap and SEEK_HOLE are implemented
+ but additional features would be supportable by the protocol).
+
+c) Directory entry caching relies on a 1 second timer, rather than
+ using Directory Leases, currently only the root file handle is cached longer
+
+d) quota support (needs minor kernel change for quota calls
+ to make it to network filesystems or deviceless filesystems)
+
+e) Additional use cases can be optimized to use "compounding" (e.g.
+ open/query/close and open/setinfo/close) to reduce the number of
+ roundtrips to the server and improve performance. Various cases
+ (stat, statfs, create, unlink, mkdir) already have been improved by
+ using compounding but more can be done. In addition we could
+ significantly reduce redundant opens by using deferred close (with
+ handle caching leases) and better using reference counters on file
+ handles.
+
+f) Finish inotify support so kde and gnome file list windows
+ will autorefresh (partially complete by Asser). Needs minor kernel
+ vfs change to support removing D_NOTIFY on a file.
+
+g) Add GUI tool to configure /proc/fs/cifs settings and for display of
+ the CIFS statistics (started)
+
+h) implement support for security and trusted categories of xattrs
+ (requires minor protocol extension) to enable better support for SELINUX
+
+i) Add support for tree connect contexts (see MS-SMB2) a new SMB3.1.1 protocol
+ feature (may be especially useful for virtualization).
+
+j) Create UID mapping facility so server UIDs can be mapped on a per
+ mount or a per server basis to client UIDs or nobody if no mapping
+ exists. Also better integration with winbind for resolving SID owners
+
+k) Add tools to take advantage of more smb3 specific ioctls and features
+ (passthrough ioctl/fsctl is now implemented in cifs.ko to allow
+ sending various SMB3 fsctls and query info and set info calls
+ directly from user space) Add tools to make setting various non-POSIX
+ metadata attributes easier from tools (e.g. extending what was done
+ in the smbinfo tool).
+
+l) encrypted file support
+
+m) improved stats gathering tools (perhaps integration with nfsometer?)
+ to extend and make easier to use what is currently in /proc/fs/cifs/Stats
+
+n) Add support for claims based ACLs ("DAC")
+
+o) mount helper GUI (to simplify the various configuration options on mount)
+
+p) Add support for witness protocol (perhaps ioctl to cifs.ko from user space
+ tool listening on witness protocol RPC) to allow for notification of share
+ move, server failover, and server adapter changes. And also improve other
+ failover scenarios, e.g. when client knows multiple DFS entries point to
+ different servers, and the server we are connected to has gone down.
+
+q) Allow mount.cifs to be more verbose in reporting errors with dialect
+ or unsupported feature errors.
+
+r) updating cifs documentation, and user guide.
+
+s) Addressing bugs found by running a broader set of xfstests in standard
+ file system xfstest suite.
+
+t) split cifs and smb3 support into separate modules so legacy (and less
+ secure) CIFS dialect can be disabled in environments that don't need it
+ and simplify the code.
+
+v) POSIX Extensions for SMB3.1.1 (started, create and mkdir support added
+ so far).
+
+w) Add support for additional strong encryption types, and additional spnego
+ authentication mechanisms (see MS-SMB2)
+
+x) Finish support for SMB3.1.1 compression
+
+Known Bugs
+==========
+
+See http://bugzilla.samba.org - search on product "CifsVFS" for
+current bug list. Also check http://bugzilla.kernel.org (Product = File System, Component = CIFS)
+
+1) existing symbolic links (Windows reparse points) are recognized but
+ can not be created remotely. They are implemented for Samba and those that
+ support the CIFS Unix extensions, although earlier versions of Samba
+ overly restrict the pathnames.
+2) follow_link and readdir code does not follow dfs junctions
+ but recognizes them
+
+Misc testing to do
+==================
+1) check out max path names and max path name components against various server
+ types. Try nested symlinks (8 deep). Return max path name in stat -f information
+
+2) Improve xfstest's cifs/smb3 enablement and adapt xfstests where needed to test
+ cifs/smb3 better
+
+3) Additional performance testing and optimization using iozone and similar -
+ there are some easy changes that can be done to parallelize sequential writes,
+ and when signing is disabled to request larger read sizes (larger than
+ negotiated size) and send larger write sizes to modern servers.
+
+4) More exhaustively test against less common servers
+
+5) Continue to extend the smb3 "buildbot" which does automated xfstesting
+ against Windows, Samba and Azure currently - to add additional tests and
+ to allow the buildbot to execute the tests faster. The URL for the
+ buildbot is: http://smb3-test-rhel-75.southcentralus.cloudapp.azure.com
+
+6) Address various coverity warnings (most are not bugs per se, but
+ the more warnings are addressed, the easier it is to spot real
+ problems that static analyzers will point out in the future).
diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst
new file mode 100644
index 0000000..d3fb67b
--- /dev/null
+++ b/Documentation/admin-guide/cifs/usage.rst
@@ -0,0 +1,869 @@
+=====
+Usage
+=====
+
+This module supports the SMB3 family of advanced network protocols (as well
+as older dialects, originally called "CIFS" or SMB1).
+
+The CIFS VFS module for Linux supports many advanced network filesystem
+features such as hierarchical DFS like namespace, hardlinks, locking and more.
+It was designed to comply with the SNIA CIFS Technical Reference (which
+supersedes the 1992 X/Open SMB Standard) as well as to perform best practice
+practical interoperability with Windows 2000, Windows XP, Samba and equivalent
+servers. This code was developed in participation with the Protocol Freedom
+Information Foundation. CIFS and now SMB3 have become a de facto
+standard for interoperating between Macs and Windows and major NAS appliances.
+
+Please see
+MS-SMB2 (for detailed SMB2/SMB3/SMB3.1.1 protocol specification)
+http://protocolfreedom.org/ and
+http://samba.org/samba/PFIF/
+for more details.
+
+
+For questions or bug reports please contact:
+
+ smfrench@gmail.com
+
+See the project page at: https://wiki.samba.org/index.php/LinuxCIFS_utils
+
+Build instructions
+==================
+
+For Linux:
+
+1) Download the kernel (e.g. from http://www.kernel.org)
+ and change directory into the top of the kernel directory tree
+ (e.g. /usr/src/linux-2.5.73)
+2) make menuconfig (or make xconfig)
+3) select cifs from within the network filesystem choices
+4) save and exit
+5) make
+
+
+Installation instructions
+=========================
+
+If you have built the CIFS vfs as a module (successfully) simply
+type ``make modules_install`` (or if you prefer, manually copy the file to
+the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.ko).
+
+If you have built the CIFS vfs into the kernel itself, follow the instructions
+for your distribution on how to install a new kernel (usually you
+would simply type ``make install``).
+
+If you do not have the utility mount.cifs (in the Samba 4.x source tree and on
+the CIFS VFS web site) copy it to the same directory in which mount helpers
+reside (usually /sbin). Although the helper software is not
+required, mount.cifs is recommended. Most distros include a ``cifs-utils``
+package that includes this utility so it is recommended to install this.
+
+Note that running the Winbind pam/nss module (logon service) on all of your
+Linux clients is useful in mapping Uids and Gids consistently across the
+domain to the proper network user. The mount.cifs mount helper can be
+found at cifs-utils.git on git.samba.org
+
+If cifs is built as a module, then the size and number of network buffers
+and maximum number of simultaneous requests to one server can be configured.
+Changing these from their defaults is not recommended. By executing modinfo::
+
+ modinfo kernel/fs/cifs/cifs.ko
+
+on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made
+at module initialization time (by running insmod cifs.ko) can be seen.
+
+Recommendations
+===============
+
+To improve security the SMB2.1 dialect or later (usually will get SMB3) is now
+the new default. To use old dialects (e.g. to mount Windows XP) use "vers=1.0"
+on mount (or vers=2.0 for Windows Vista). Note that the CIFS (vers=1.0) is
+much older and less secure than the default dialect SMB3 which includes
+many advanced security features such as downgrade attack detection
+and encrypted shares and stronger signing and authentication algorithms.
+There are additional mount options that may be helpful for SMB3 to get
+improved POSIX behavior (NB: can use vers=3.0 to force only SMB3, never 2.1):
+
+ ``mfsymlinks`` and ``cifsacl`` and ``idsfromsid``
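+
+As a sketch only (the server, share, mount point and user name below are
+placeholders), such options are passed with ``-o`` like any other::
+
+    mount -t cifs //server/share /mnt -o username=myname,vers=3.0,mfsymlinks
+
+An older server such as Windows XP would instead need ``vers=1.0`` in the
+same option list, with the security trade-offs described above.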
+
+Allowing User Mounts
+====================
+
+Permitting users to mount and unmount over directories they own is possible
+with the cifs vfs. A way to enable such mounting is to mark the mount.cifs
+utility as suid (e.g. ``chmod +s /sbin/mount.cifs``). To enable users to
+umount shares they mount requires
+
+1) mount.cifs version 1.4 or later
+2) an entry for the share in /etc/fstab indicating that a user may
+ unmount it e.g.::
+
+ //server/usersharename /mnt/username cifs user 0 0
+
+Note that when the mount.cifs utility is run suid (allowing user mounts),
+in order to reduce risks, the ``nosuid`` mount flag is passed in on mount to
+disallow execution of an suid program mounted on the remote target.
+When mount is executed as root, nosuid is not passed in by default,
+and execution of suid programs on the remote target would be enabled
+by default. This can be changed, as with nfs and other filesystems,
+by simply specifying ``nosuid`` among the mount options. For user mounts
+though to be able to pass the suid flag to mount requires rebuilding
+mount.cifs with the following flag: CIFS_ALLOW_USR_SUID
+
+There is a corresponding manual page for cifs mounting in the Samba 3.0 and
+later source tree in docs/manpages/mount.cifs.8
+
+Allowing User Unmounts
+======================
+
+To permit users to umount directories that they have user mounted (see above),
+the utility umount.cifs may be used. It may be invoked directly, or if
+umount.cifs is placed in /sbin, umount can invoke the cifs umount helper
+(at least for most versions of the umount utility) for umount of cifs
+mounts, unless umount is invoked with -i (which will avoid invoking a umount
+helper). As with mount.cifs, to enable user unmounts umount.cifs must be marked
+as suid (e.g. ``chmod +s /sbin/umount.cifs``) or equivalent (some distributions
+allow adding entries to the /etc/permissions file to achieve the
+equivalent suid effect). For this utility to succeed the target path
+must be a cifs mount, and the uid of the current user must match the uid
+of the user who mounted the resource.
+
+Also note that the customary way of allowing user mounts and unmounts is
+(instead of using mount.cifs and umount.cifs as suid) to add a line
+to the file /etc/fstab for each //server/share you wish to mount, but
+this can become unwieldy when potential mount targets include many
+or unpredictable UNC names.
+
+Samba Considerations
+====================
+
+Most current servers support SMB2.1 and SMB3 which are more secure,
+but there are useful protocol extensions for the older less secure CIFS
+dialect, so to get the maximum benefit if mounting using the older dialect
+(CIFS/SMB1), we recommend using a server that supports the SNIA CIFS
+Unix Extensions standard (e.g. almost any version of Samba ie version
+2.2.5 or later) but the CIFS vfs works fine with a wide variety of CIFS servers.
+Note that uid, gid and file permissions will display default values if you do
+not have a server that supports the Unix extensions for CIFS (such as Samba
+2.2.5 or later). To enable the Unix CIFS Extensions in the Samba server, add
+the line::
+
+ unix extensions = yes
+
+to your smb.conf file on the server. Note that the following smb.conf settings
+are also useful (on the Samba server) when the majority of clients are Unix or
+Linux::
+
+ case sensitive = yes
+ delete readonly = yes
+ ea support = yes
+
+Note that server ea support is required for supporting xattrs from the Linux
+cifs client, and that EA support is present in later versions of Samba (e.g.
+3.0.6 and later). EA support also works in all versions of Windows, at least to
+shares on NTFS filesystems. Extended Attribute (xattr) support is an optional
+feature of most Linux filesystems which may require enabling via
+make menuconfig. Client support for extended attributes (user xattr) can be
+disabled on a per-mount basis by specifying ``nouser_xattr`` on mount.
+
+The CIFS client can get and set POSIX ACLs (getfacl, setfacl) to Samba servers
+version 3.10 and later. Setting POSIX ACLs requires enabling both XATTR and
+then POSIX support in the CIFS configuration options when building the cifs
+module. POSIX ACL support can be disabled on a per-mount basis by specifying
+``noacl`` on mount.
+
+Some administrators may want to change Samba's smb.conf ``map archive`` and
+``create mask`` parameters from the default. Unless the create mask is changed
+newly created files can end up with an unnecessarily restrictive default mode,
+which may not be what you want, although if the CIFS Unix extensions are
+enabled on the server and client, subsequent setattr calls (e.g. chmod) can
+fix the mode. Note that creating special devices (mknod) remotely
+may require specifying a mkdev function to Samba if you are not using
+Samba 3.0.6 or later. For more information on these see the manual pages
+(``man smb.conf``) on the Samba server system. Note that the cifs vfs,
+unlike the smbfs vfs, does not read the smb.conf on the client system
+(the few optional settings are passed in on mount via -o parameters instead).
+Note that Samba 2.2.7 or later includes a fix that allows the CIFS VFS to delete
+open files (required for strict POSIX compliance). Windows Servers already
+supported this feature. Samba server does not allow symlinks that refer to files
+outside of the share, so in Samba versions prior to 3.0.6, most symlinks to
+files with absolute paths (ie beginning with slash) such as::
+
+ ln -s /mnt/foo bar
+
+would be forbidden. Samba 3.0.6 server or later includes the ability to create
+such symlinks safely by converting unsafe symlinks (i.e. symlinks to server
+files that are outside of the share) to a Samba-specific format on the server
+that is ignored by local server applications and non-cifs clients and that will
+not be traversed by the Samba server. This is opaque to the Linux client
+application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or
+later, but only for remote clients using the CIFS Unix extensions, and will
+be invisible to Windows clients and typically will not affect local
+applications running on the same server as Samba.
+
+Use instructions
+================
+
+Once the CIFS VFS support is built into the kernel or installed as a module
+(cifs.ko), you can use mount syntax like the following to access Samba or
+Mac or Windows servers::
+
+ mount -t cifs //9.53.216.11/e$ /mnt -o username=myname,password=mypassword
+
+Before -o the option -v may be specified to make the mount.cifs
+mount helper display the mount steps more verbosely.
+After -o the following commonly used cifs vfs specific options
+are supported::
+
+ username=<username>
+ password=<password>
+ domain=<domain name>
+
+Other cifs mount options are described below. Use of TCP names (in addition to
+ip addresses) is available if the mount helper (mount.cifs) is installed. If
+you do not trust the server to which you are mounted, or if you do not have
+cifs signing enabled (and the physical network is insecure), consider use
+of the standard mount options ``noexec`` and ``nosuid`` to reduce the risk of
+running an altered binary on your local system (downloaded from a hostile server
+or altered by a hostile router).
+
+Although mounting using format corresponding to the CIFS URL specification is
+not possible in mount.cifs yet, it is possible to use an alternate format
+for the server and sharename (which is somewhat similar to NFS style mount
+syntax) instead of the more widely used UNC format (i.e. \\server\share)::
+
+ mount -t cifs tcp_name_of_server:share_name /mnt -o user=myname,pass=mypasswd
+
+When using the mount helper mount.cifs, the password may be specified via
+alternate mechanisms, instead of specifying it after -o using the normal
+``pass=`` syntax on the command line:
+1) By including it in a credential file. Specify credentials=filename as one
+of the mount options. Credential files contain two lines::
+
+ username=someuser
+ password=your_password
+
+2) By specifying the password in the PASSWD environment variable (similarly
+ the user name can be taken from the USER environment variable).
+3) By specifying the password in a file by name via PASSWD_FILE.
+4) By specifying the password in a file by file descriptor via PASSWD_FD.
+
+If no password is provided, mount.cifs will prompt for password entry.
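+
+The credential file format from item 1 can be illustrated with a short Python
+sketch. This is not how mount.cifs itself reads the file; the helper names
+``write_credentials`` and ``read_credentials`` and the 0600 file mode are
+assumptions made for illustration:
+
```python
import os


def write_credentials(path, username, password):
    """Write a two-line credential file in the format shown above."""
    # Assumption: create the file with mode 0600 and refuse to overwrite an
    # existing one, so the password is not exposed to other local users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(f"username={username}\n")
        f.write(f"password={password}\n")


def read_credentials(path):
    """Return the key=value pairs of a credential file as a dict."""
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if "=" not in line:
                continue  # skip blank or malformed lines
            # Split on the first '=' only, so the value may contain '='.
            key, value = line.split("=", 1)
            creds[key.strip()] = value
    return creds
```
+
+Splitting on the first ``=`` only keeps passwords that themselves contain
+``=`` or a comma intact.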
+
+Restrictions
+============
+
+Servers must support either "pure-TCP" (port 445 TCP/IP CIFS connections) or
+RFC 1001/1002 "Netbios-Over-TCP/IP". This is not likely to be a
+problem as most servers support this.
+
+Valid filenames differ between Windows and Linux. Windows typically restricts
+filenames which contain certain reserved characters (e.g. the character :
+which is used to delimit the beginning of a stream name by Windows), while
+Linux allows a slightly wider set of valid characters in filenames. Windows
+servers can remap such characters when an explicit mapping is specified in
+the Server's registry. Samba starting with version 3.10 will allow such
+filenames (ie those which contain valid Linux characters, which normally
+would be forbidden for Windows/CIFS semantics) as long as the server is
+configured for Unix Extensions (and the client has not disabled
+/proc/fs/cifs/LinuxExtensionsEnabled). In addition the mount option
+``mapposix`` can be used on CIFS (vers=1.0) to force the mapping of
+illegal Windows/NTFS/SMB characters to a remap range (this mount parm
+is the default for SMB3). This remap (``mapposix``) range is also
+compatible with Mac (and "Services for Mac" on some older Windows).
+
+CIFS VFS Mount Options
+======================
+A partial list of the supported mount options follows:
+
+ username
+ The user name to use when trying to establish
+ the CIFS session.
+ password
+ The user password. If the mount helper is
+ installed, the user will be prompted for password
+ if not supplied.
+ ip
+ The ip address of the target server
+ unc
+ The target server Universal Network Name (export) to
+ mount.
+ domain
+ Set the SMB/CIFS workgroup name prepended to the
+ username during CIFS session establishment
+ forceuid
+ Set the default uid for inodes to the uid
+ passed in on mount. For mounts to servers
+ which do support the CIFS Unix extensions, such as a
+ properly configured Samba server, the server provides
+ the uid, gid and mode so this parameter should not be
+ specified unless the server's and client's uid and gid
+ numbering differ. If the server and client are in the
+ same domain (e.g. running winbind or nss_ldap) and
+ the server supports the Unix Extensions then the uid
+ and gid can be retrieved from the server (and uid
+ and gid would not have to be specified on the mount).
+ For servers which do not support the CIFS Unix
+ extensions, the default uid (and gid) returned on lookup
+ of existing files will be the uid (gid) of the person
+ who executed the mount (root, except when mount.cifs
+ is configured setuid for user mounts) unless the ``uid=``
+ (gid) mount option is specified. Also note that permission
+ checks (authorization checks) on accesses to a file occur
+ at the server, but there are cases in which an administrator
+ may want to restrict at the client as well. For those
+ servers which do not report a uid/gid owner
+ (such as Windows), permissions can also be checked at the
+ client, and a crude form of client side permission checking
+ can be enabled by specifying file_mode and dir_mode on
+ the client. (default)
+ forcegid
+ (similar to above but for the groupid instead of uid) (default)
+ noforceuid
+ Fill in file owner information (uid) by requesting it from
+ the server if possible. With this option, the value given in
+ the uid= option (on mount) will only be used if the server
+ can not support returning uids on inodes.
+ noforcegid
+ (similar to above but for the group owner, gid, instead of uid)
+ uid
+ Set the default uid for inodes, and indicate to the
+ cifs kernel driver which local user mounted. If the server
+ supports the unix extensions the default uid is
+ not used to fill in the owner fields of inodes (files)
+ unless the ``forceuid`` parameter is specified.
+ gid
+ Set the default gid for inodes (similar to above).
+ file_mode
+ If CIFS Unix extensions are not supported by the server
+ this overrides the default mode for file inodes.
+ fsc
+ Enable local disk caching using FS-Cache (off by default). This
+ option could be useful to improve performance on a slow link,
+ heavily loaded server and/or network where reading from the
+ disk is faster than reading from the server (over the network).
+ This could also impact scalability positively as the
+ number of calls to the server is reduced. However, local
+ caching is not suitable for all workloads, e.g. read-once
+ type workloads, so you need to consider your
+ workload/scenario carefully before using this option. Currently, local
+ disk caching is functional for CIFS files opened as read-only.
+ dir_mode
+ If CIFS Unix extensions are not supported by the server
+ this overrides the default mode for directory inodes.
+ port
+ attempt to contact the server on this tcp port, before
+ trying the usual ports (port 445, then 139).
+ iocharset
+ Codepage used to convert local path names to and from
+ Unicode. Unicode is used by default for network path
+ names if the server supports it. If iocharset is
+ not specified then the nls_default specified
+ during the local client kernel build will be used.
+ If server does not support Unicode, this parameter is
+ unused.
+ rsize
+ default read size (usually 16K). The client currently
+ can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize
+ defaults to 16K and may be changed (from 8K to the maximum
+ kmalloc size allowed by your kernel) at module install time
+ for cifs.ko. Setting CIFSMaxBufSize to a very large value
+ will cause cifs to use more memory and may reduce performance
+ in some cases. To use rsize greater than 127K (the original
+ cifs protocol maximum) also requires that the server support
+ a new Unix Capability flag (for very large read) which some
+ newer servers (e.g. Samba 3.0.26 or later) do. rsize can be
+ set from a minimum of 2048 to a maximum of 130048 (127K or
+ CIFSMaxBufSize, whichever is smaller)
+ wsize
+ default write size (default 57344)
+ maximum wsize currently allowed by CIFS is 57344 (fourteen
+ 4096 byte pages)
+ actimeo=n
+ attribute cache timeout in seconds (default 1 second).
+ After this timeout, the cifs client requests fresh attribute
+ information from the server. This option allows tuning the
+ attribute cache timeout to suit the workload needs. Shorter
+ timeouts mean better cache coherency, but an increased number
+ of calls to the server. Longer timeouts mean a reduced number
+ of calls to the server at the expense of less strict cache
+ coherency checks (i.e. incorrect attribute cache for a short
+ period of time).
+ rw
+ mount the network share read-write (note that the
+ server may still consider the share read-only)
+ ro
+ mount network share read-only
+ version
+ used to distinguish different versions of the
+ mount helper utility (not typically needed)
+ sep
+ if first mount option (after the -o), overrides
+ the comma as the separator between the mount
+ parms. e.g.::
+
+ -o user=myname,password=mypassword,domain=mydom
+
+ could be passed instead with period as the separator by::
+
+ -o sep=.user=myname.password=mypassword.domain=mydom
+
+ this might be useful when comma is contained within username
+ or password or domain. This option is less important
+ when the cifs mount helper mount.cifs (version 1.1 or later)
+ is used.
+ nosuid
+ Do not allow remote executables with the suid bit
+ set to be executed. This is only meaningful for mounts
+ to servers such as Samba which support the CIFS Unix Extensions.
+ If you do not trust the servers in your network (your mount
+ targets) it is recommended that you specify this option for
+ greater security.
+ exec
+ Permit execution of binaries on the mount.
+ noexec
+ Do not permit execution of binaries on the mount.
+ dev
+ Recognize block devices on the remote mount.
+ nodev
+ Do not recognize devices on the remote mount.
+ suid
+ Allow remote files on this mountpoint with suid enabled to
+ be executed (default for mounts when executed as root,
+ nosuid is default for user mounts).
+ credentials
+ Although ignored by the cifs kernel component, it is used by
+ the mount helper, mount.cifs. When mount.cifs is installed it
+ opens and reads the credential file specified in order
+ to obtain the userid and password arguments which are passed to
+ the cifs vfs.
+ guest
+ Although ignored by the kernel component, the mount.cifs
+ mount helper will not prompt the user for a password
+ if guest is specified on the mount options. If no
+ password is specified a null password will be used.
+ perm
+ Client does permission checks (vfs_permission check of uid
+ and gid of the file against the mode and desired operation).
+ Note that this is in addition to the normal ACL check on the
+ target machine done by the server software.
+ Client permission checking is enabled by default.
+ noperm
+ Client does not do permission checks. This can expose
+ files on this mount to access by other users on the local
+ client system. It is typically only needed when the server
+ supports the CIFS Unix Extensions but the UIDs/GIDs on the
+ client and server system do not match closely enough to allow
+ access by the user doing the mount, but it may be useful with
+ non CIFS Unix Extension mounts for cases in which the default
+ mode is specified on the mount but is not to be enforced on the
+ client (e.g. perhaps when MultiUserMount is enabled).
+ Note that this does not affect the normal ACL check on the
+ target machine done by the server software (of the server
+ ACL against the user name provided at mount time).
+ serverino
+ Use server's inode numbers instead of generating automatically
+ incrementing inode numbers on the client. Although this will
+ make it easier to spot hardlinked files (as they will have
+ the same inode numbers) and inode numbers may be persistent,
+ note that the server does not guarantee that the inode numbers
+ are unique if multiple server side mounts are exported under a
+ single share (since inode numbers on the servers might not
+ be unique if multiple filesystems are mounted under the same
+ shared higher level directory). Note that some older
+ servers (e.g. pre-Windows 2000) do not support returning UniqueIDs
+ or the CIFS Unix Extensions equivalent and for those
+ this mount option will have no effect. Exporting cifs mounts
+ under nfsd requires this mount option on the cifs mount.
+ This is now the default if server supports the
+ required network operation.
+ noserverino
+ Client generates inode numbers (rather than using the actual one
+ from the server). These inode numbers will vary after
+ unmount or reboot which can confuse some applications,
+ but not all server filesystems support unique inode
+ numbers.
+ setuids
+ If the CIFS Unix extensions are negotiated with the server
+ the client will attempt to set the effective uid and gid of
+ the local process on newly created files, directories, and
+ devices (create, mkdir, mknod). If the CIFS Unix Extensions
+ are not negotiated, for newly created files and directories
+ instead of using the default uid and gid specified on
+ the mount, cache the new file's uid and gid locally which means
+ that the uid for the file can change when the inode is
+ reloaded (or the user remounts the share).
+ nosetuids
+ The client will not attempt to set the uid and gid
+ on newly created files, directories, and devices (create,
+ mkdir, mknod) which will result in the server setting the
+ uid and gid to the default (usually the server uid of the
+ user who mounted the share). Letting the server (rather than
+ the client) set the uid and gid is the default. If the CIFS
+ Unix Extensions are not negotiated then the uid and gid for
+ new files will appear to be the uid (gid) of the mounter or the
+ uid (gid) parameter specified on the mount.
+ netbiosname
+ When mounting to servers via port 139, specifies the RFC1001
+ source name to use to represent the client netbios machine
+ name when doing the RFC1001 netbios session initialize.
+ direct
+ Do not do inode data caching on files opened on this mount.
+ This precludes mmapping files on this mount. In some cases
+ with fast networks and little or no caching benefits on the
+ client (e.g. when the application is doing large sequential
+ reads bigger than page size without rereading the same data)
+ this can provide better performance than the default
+ behavior which caches reads (readahead) and writes
+ (writebehind) through the local Linux client pagecache
+ if oplock (caching token) is granted and held. Note that
+ direct allows write operations larger than page size
+ to be sent to the server.
+ strictcache
+ Use for switching on strict cache mode. In this mode the
+ client reads from the cache whenever it holds a Level II oplock,
+ and otherwise reads from the server. All written data are stored
+ in the cache, but if the client doesn't have an Exclusive Oplock,
+ it writes the data to the server.
+ rwpidforward
+ Forward the pid of the process that opened a file to any read or write
+ operation on that file. This prevents applications like WINE
+ from failing on read and write if we use mandatory brlock style.
+ acl
+ Allow setfacl and getfacl to manage posix ACLs if server
+ supports them. (default)
+ noacl
+ Do not allow setfacl and getfacl calls on this mount
+ user_xattr
+ Allow getting and setting user xattrs (those attributes whose
+ name begins with ``user.`` or ``os2.``) as OS/2 EAs (extended
+ attributes) to the server. This allows support of the
+ setfattr and getfattr utilities. (default)
+ nouser_xattr
+ Do not allow getfattr/setfattr to get/set/list xattrs
+ mapchars
+ Translate six of the seven reserved characters (not backslash)::
+
+ *?<>|:
+
+ to the remap range (above 0xF000), which also
+ allows the CIFS client to recognize files created with
+ such characters by Windows's POSIX emulation. This can
+ also be useful when mounting to most versions of Samba
+ (which also forbids creating and opening files
+ whose names contain any of these seven characters).
+ This has no effect if the server does not support
+ Unicode on the wire.
+ nomapchars
+ Do not translate any of these seven characters (default).
+ nocase
+ Request case insensitive path name matching (case
+ sensitive is the default if the server supports it).
+ (mount option ``ignorecase`` is identical to ``nocase``)
+ posixpaths
+ If CIFS Unix extensions are supported, attempt to
+ negotiate posix path name support which allows certain
+ characters forbidden in typical CIFS filenames, without
+ requiring remapping. (default)
+ noposixpaths
+ If CIFS Unix extensions are supported, do not request
+ posix path name support (this may cause servers to
+ reject creating files with certain reserved characters).
+ nounix
+ Disable the CIFS Unix Extensions for this mount (tree
+ connection). This is rarely needed, but it may be useful
+ in order to turn off multiple settings all at once (ie
+ posix acls, posix locks, posix paths, symlink support
+ and retrieving uids/gids/mode from the server) or to
+ work around a bug in servers which implement the Unix
+ Extensions.
+ nobrl
+ Do not send byte range lock requests to the server.
+ This is necessary for certain applications that break
+ with cifs style mandatory byte range locks (and most
+ cifs servers do not yet support requesting advisory
+ byte range locks).
+ forcemandatorylock
+ Even if the server supports posix (advisory) byte range
+ locking, send only mandatory lock requests. Some
+ (presumably rare) applications, originally coded for
+ DOS/Windows, which require Windows style mandatory byte range
+ locking, may be able to take advantage of this option,
+ forcing the cifs client to only send mandatory locks
+ even if the cifs server would support posix advisory locks.
+ ``forcemand`` is accepted as a shorter form of this mount
+ option.
+ nostrictsync
+ If this mount option is set, when an application does an
+ fsync call then the cifs client does not send an SMB Flush
+ to the server (to force the server to write all dirty data
+ for this file immediately to disk), although cifs still sends
+ all dirty (cached) file data to the server and waits for the
+ server to respond to the write. Since SMB Flush can be
+ very slow, and some servers may be reliable enough (to risk
+ delaying slightly flushing the data to disk on the server),
+ turning on this option may be useful to improve performance for
+ applications that fsync too much, at a small risk of server
+ crash. If this mount option is not set, by default cifs will
+ send an SMB flush request (and wait for a response) on every
+ fsync call.
+ nodfs
+ Disable DFS (global name space support) even if the
+ server claims to support it. This can help work around
+ a problem with parsing of DFS paths with Samba server
+ versions 3.0.24 and 3.0.25.
+ remount
+ remount the share (often used to change from ro to rw mounts
+ or vice versa)
+ cifsacl
+ Report mode bits (e.g. on stat) based on the Windows ACL for
+ the file. (EXPERIMENTAL)
+ servern
+ Specify the server's netbios name (RFC1001 name) to use
+ when attempting to set up a session to the server.
+ This is needed for mounting to some older servers (such
+ as OS/2 or Windows 98 and Windows ME) since they do not
+ support a default server name. A server name can be up
+ to 15 characters long and is usually uppercased.
+ sfu
+ When the CIFS Unix Extensions are not negotiated, attempt to
+ create device files and fifos in a format compatible with
+ Services for Unix (SFU). In addition retrieve bits 10-12
+ of the mode via the SETFILEBITS extended attribute (as
+ SFU does). In the future the bottom 9 bits of the
+ mode also will be emulated using queries of the security
+ descriptor (ACL).
+ mfsymlinks
+ Enable support for Minshall+French symlinks
+ (see http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks)
+ This option is ignored when specified together with the
+ 'sfu' option. Minshall+French symlinks are used even if
+ the server supports the CIFS Unix Extensions.
+ sign
+ Must use packet signing (helps avoid unwanted data modification
+ by intermediate systems in the route). Note that signing
+ does not work with lanman or plaintext authentication.
+ seal
+ Must seal (encrypt) all data on this mounted share before
+ sending on the network. Requires support for Unix Extensions.
+ Note that this differs from the sign mount option in that it
+ causes encryption of data sent over this mounted share but other
+ shares mounted to the same server are unaffected.
+ locallease
+ This option is rarely needed. Fcntl F_SETLEASE is
+ used by some applications such as Samba and NFSv4 server to
+ check to see whether a file is cacheable. CIFS has no way
+ to explicitly request a lease, but can check whether a file
+ is cacheable (oplocked). Unfortunately, even if a file
+ is not oplocked, it could still be cacheable (i.e. cifs client
+ could grant fcntl leases if no other local processes are using
+ the file), for example when the server does not
+ support oplocks and the user is sure that the only updates to
+ the file will be from this client. Specifying this mount option
+ will allow the cifs client to check for leases (only) locally
+ for files which are not oplocked instead of denying leases
+ in that case. (EXPERIMENTAL)
+ sec
+ Security mode. Allowed values are:
+
+ none
+ attempt to connect as a null user (no name)
+ krb5
+ Use Kerberos version 5 authentication
+ krb5i
+ Use Kerberos authentication and packet signing
+ ntlm
+ Use NTLM password hashing (default)
+ ntlmi
+ Use NTLM password hashing with signing (can also be
+ the default if /proc/fs/cifs/PacketSigningEnabled is on
+ or if the server requires signing)
+ ntlmv2
+ Use NTLMv2 password hashing
+ ntlmv2i
+ Use NTLMv2 password hashing with packet signing
+ lanman
+ (if configured in kernel config) use older
+ lanman hash
+ hard
+ Retry file operations if server is not responding
+ soft
+ Limit retries to unresponsive servers (usually only
+ one retry) before returning an error. (default)
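+
+The ``sep`` option described in the list above can be illustrated with a
+small Python sketch that builds the string passed after -o. The function name
+``join_mount_options`` is hypothetical, and limiting the separator to a single
+character is an assumption of this sketch rather than a documented rule:
+
```python
def join_mount_options(options, sep=","):
    """Join (name, value) pairs into a string suitable for ``mount -o``.

    A non-comma separator is announced with a leading ``sep=<char>``,
    which has to be the first option for it to take effect.
    """
    if len(sep) != 1:
        raise ValueError("separator must be a single character")
    parts = [f"{name}={value}" for name, value in options]
    for part in parts:
        # A value containing the separator would be split incorrectly.
        if sep in part:
            raise ValueError(f"{part!r} contains the separator {sep!r}")
    joined = sep.join(parts)
    return joined if sep == "," else f"sep={sep}{joined}"
```
+
+With the options from the ``sep`` example, ``join_mount_options(opts, sep=".")``
+yields ``sep=.user=myname.password=mypassword.domain=mydom``.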
+
+The mount.cifs mount helper also accepts a few mount options before -o
+including:
+
+=============== ===============================================================
+ -S take password from stdin (equivalent to setting the environment
+ variable ``PASSWD_FD=0``)
+ -V print mount.cifs version
+ -? display simple usage information
+=============== ===============================================================
+
+With most 2.6 kernel versions of modutils, the version of the cifs kernel
+module can be displayed via modinfo.
+
+Misc /proc/fs/cifs Flags and Debug Info
+=======================================
+
+Informational pseudo-files:
+
+======================= =======================================================
+DebugData Displays information about active CIFS sessions and
+ shares, features enabled as well as the cifs.ko
+ version.
+Stats Lists summary resource usage information as well as per
+ share statistics.
+======================= =======================================================
+
+Configuration pseudo-files:
+
+======================= =======================================================
+SecurityFlags Flags which control security negotiation and
+ also packet signing. Authentication (may/must)
+ flags (e.g. for NTLM and/or NTLMv2) may be combined with
+ the signing flags. Specifying two different password
+ hashing mechanisms (as "must use") on the other hand
+ does not make much sense. Default flags are::
+
+ 0x07007
+
+ (NTLM, NTLMv2 and packet signing allowed). The maximum
+ allowable flags if you want to allow mounts to servers
+ using weaker password hashes is 0x37037 (lanman,
+ plaintext, ntlm, ntlmv2, signing allowed). Some
+ SecurityFlags require the corresponding menuconfig
+ options to be enabled (lanman and plaintext require
+ CONFIG_CIFS_WEAK_PW_HASH for example). Enabling
+ plaintext authentication currently requires also
+ enabling lanman authentication in the security flags
+ because the cifs module only supports sending
+ plaintext passwords using the older lanman dialect
+ form of the session setup SMB. (e.g. for authentication
+ using plain text passwords, set the SecurityFlags
+ to 0x30030)::
+
+ may use packet signing 0x00001
+ must use packet signing 0x01001
+ may use NTLM (most common password hash) 0x00002
+ must use NTLM 0x02002
+ may use NTLMv2 0x00004
+ must use NTLMv2 0x04004
+ may use Kerberos security 0x00008
+ must use Kerberos 0x08008
+ may use lanman (weak) password hash 0x00010
+ must use lanman password hash 0x10010
+ may use plaintext passwords 0x00020
+ must use plaintext passwords 0x20020
+ (reserved for future packet encryption) 0x00040
+
+cifsFYI If set to non-zero value, additional debug information
+ will be logged to the system error log. This field
+ contains three flags controlling different classes of
+ debugging entries. The maximum value it can be set
+ to is 7 which enables all debugging points (default 0).
+ Some debugging statements are not compiled into the
+ cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the
+ kernel configuration. cifsFYI may be set to one or
+ more of the following flags (7 sets them all)::
+
+ +-----------------------------------------------+------+
+ | log cifs informational messages | 0x01 |
+ +-----------------------------------------------+------+
+ | log return codes from cifs entry points | 0x02 |
+ +-----------------------------------------------+------+
+ | log slow responses | 0x04 |
+ | (ie which take longer than 1 second) | |
+ | | |
+ | CONFIG_CIFS_STATS2 must be enabled in .config | |
+ +-----------------------------------------------+------+
+
+traceSMB If set to one, debug information is logged to the
+ system error log with the start of smb requests
+ and responses (default 0)
+LookupCacheEnable If set to one, inode information is kept cached
+ for one second improving performance of lookups
+ (default 1)
+LinuxExtensionsEnabled If set to one then the client will attempt to
+ use the CIFS "UNIX" extensions which are optional
+ protocol enhancements that allow CIFS servers
+ to return accurate UID/GID information as well
+ as support symbolic links. If you use servers
+ such as Samba that support the CIFS Unix
+ extensions but do not want to use symbolic link
+ support and want to map the uid and gid fields
+ to values supplied at mount (rather than the
+ actual values), then set this to zero. (default 1)
+======================= =======================================================
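+
+In the SecurityFlags table above, each "must use" value is the corresponding
+"may use" bit combined with the same bit shifted left by 12. A minimal Python
+sketch of composing a value from that table follows; the short mechanism names
+and the function ``security_flags`` are hypothetical, and the sketch only
+encodes the table, it does not model how the kernel validates the result:
+
```python
# "may use" bit for each mechanism, copied from the table above.
MAY_BITS = {
    "signing": 0x00001,
    "ntlm": 0x00002,
    "ntlmv2": 0x00004,
    "kerberos": 0x00008,
    "lanman": 0x00010,
    "plaintext": 0x00020,
}

# Every "must use" value in the table is the "may use" bit plus that bit
# shifted left by 12 (for example 0x00001 -> 0x01001).
MUST_SHIFT = 12


def security_flags(may=(), must=()):
    """Combine "may use" and "must use" mechanisms into one integer."""
    value = 0
    for name in may:
        value |= MAY_BITS[name]
    for name in must:
        bit = MAY_BITS[name]
        value |= bit | (bit << MUST_SHIFT)
    return value
```
+
+For instance, ``security_flags(must=["lanman", "plaintext"])`` gives 0x30030,
+the value quoted above for plain text password authentication.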
+
+These experimental features and tracing can be enabled by changing flags in
+/proc/fs/cifs (after the cifs module has been installed or built into the
+kernel, e.g. insmod cifs). To enable a feature, set it to 1. For example, to
+enable all three classes of cifsFYI logging to the kernel message log (value 7), type::
+
+ echo 7 > /proc/fs/cifs/cifsFYI
+
+cifsFYI functions as a bit mask. Setting it to 1 enables additional kernel
+logging of various informational messages. 2 enables logging of non-zero
+SMB return codes while 4 enables logging of requests that take longer
+than one second to complete (except for byte range lock requests).
+Setting it to 4 requires CONFIG_CIFS_STATS2 to be set in kernel configuration
+(.config). Setting it to seven enables all three. Finally, tracing
+the start of smb requests and responses can be enabled via::
+
+ echo 1 > /proc/fs/cifs/traceSMB
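+
+Since cifsFYI is a bit mask, a value read back from /proc/fs/cifs/cifsFYI can
+be broken down into the three classes listed above. The following Python
+sketch shows one way to do so; the class labels and the function name
+``describe_cifsfyi`` are hypothetical:
+
```python
# Bit values taken from the cifsFYI description above.
CIFSFYI_FLAGS = {
    "informational messages": 0x01,
    "return codes": 0x02,
    "slow responses": 0x04,
}


def describe_cifsfyi(value):
    """Return the names of the debugging classes enabled by ``value``."""
    if not 0 <= value <= 7:
        raise ValueError("cifsFYI must be between 0 and 7")
    return [name for name, bit in CIFSFYI_FLAGS.items() if value & bit]
```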
+
+Per share (per client mount) statistics are available in /proc/fs/cifs/Stats.
+Additional information is available if CONFIG_CIFS_STATS2 is enabled in the
+kernel configuration (.config). The statistics returned include counters which
+represent the number of attempted and failed (ie non-zero return code from the
+server) SMB3 (or cifs) requests grouped by request type (read, write, close etc.).
+Also recorded is the total bytes read and bytes written to the server for
+that share. Note that due to client caching effects this can be less than the
+number of bytes read and written by the application running on the client.
+Statistics can be reset to zero by ``echo 0 > /proc/fs/cifs/Stats`` which may be
+useful if comparing performance of two different scenarios.
+
+Also note that ``cat /proc/fs/cifs/DebugData`` will display information about
+the active sessions and the shares that are mounted.
+
+Enabling Kerberos (extended security) works but requires version 1.2 or later
+of the helper program cifs.upcall to be present and to be configured in the
+/etc/request-key.conf file. The cifs.upcall helper program is from the Samba
+project (http://www.samba.org). NTLM and NTLMv2 and LANMAN support do not
+require this helper. Note that NTLMv2 security (which does not require the
+cifs.upcall helper program), instead of using Kerberos, is sufficient for
+some use cases.
+
+DFS support allows transparent redirection to shares in an MS-DFS name space.
+In addition, DFS support for target shares which are specified as UNC
+names which begin with host names (rather than IP addresses) requires
+a user space helper (such as cifs.upcall) to be present in order to
+translate host names to ip address, and the user space helper must also
+be configured in the file /etc/request-key.conf. Samba, Windows servers and
+many NAS appliances support DFS as a way of constructing a global name
+space to ease network configuration and improve reliability.
+
+To use cifs Kerberos and DFS support, the Linux keyutils package should be
+installed and something like the following lines should be added to the
+/etc/request-key.conf file::
+
+ create cifs.spnego * * /usr/local/sbin/cifs.upcall %k
+ create dns_resolver * * /usr/local/sbin/cifs.upcall %k
+
+CIFS kernel module parameters
+=============================
+These module parameters can be specified or modified either at module load
+time or at runtime by using the interface::
+
+ /sys/module/cifs/parameters/<param>
+
+i.e.::
+
+ echo "value" > /sys/module/cifs/parameters/<param>
+
+================= ==========================================================
+1. enable_oplocks Enable or disable oplocks. Oplocks are enabled by default
+ [Y/y/1]. To disable use any of [N/n/0].
+================= ==========================================================
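+
+Reading and writing a module parameter through the interface above can be
+sketched in Python as follows. The function names and the ``base`` argument
+are hypothetical; ``base`` exists only so the sketch can be exercised against
+a scratch directory, and writing to the real path normally requires root:
+
```python
import os

# Directory named in the text above; overridable so the sketch can be
# tried without root privileges or a loaded cifs module.
PARAM_DIR = "/sys/module/cifs/parameters"


def _param_path(name, base):
    if os.sep in name:
        raise ValueError("parameter name must not contain a path separator")
    return os.path.join(base, name)


def get_cifs_param(name, base=PARAM_DIR):
    """Return the current value of a cifs module parameter as a string."""
    with open(_param_path(name, base)) as f:
        return f.read().strip()


def set_cifs_param(name, value, base=PARAM_DIR):
    """Write a new value, like ``echo "value" > .../parameters/<param>``."""
    with open(_param_path(name, base), "w") as f:
        f.write(f"{value}\n")
```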
diff --git a/Documentation/admin-guide/cifs/winucase_convert.pl b/Documentation/admin-guide/cifs/winucase_convert.pl
new file mode 100755
index 0000000..322a9c8
--- /dev/null
+++ b/Documentation/admin-guide/cifs/winucase_convert.pl
@@ -0,0 +1,62 @@
+#!/usr/bin/perl -w
+#
+# winucase_convert.pl -- convert "Windows 8 Upper Case Mapping Table.txt" to
+# a two-level set of C arrays.
+#
+# Copyright 2013: Jeff Layton <jlayton@redhat.com>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+while(<>) {
+ next if (!/^0x(..)(..)\t0x(....)\t/);
+ $firstchar = hex($1);
+ $secondchar = hex($2);
+ $uppercase = hex($3);
+
+ $top[$firstchar][$secondchar] = $uppercase;
+}
+
+for ($i = 0; $i < 256; $i++) {
+ next if (!$top[$i]);
+
+ printf("static const wchar_t t2_%2.2x[256] = {", $i);
+ for ($j = 0; $j < 256; $j++) {
+ if (($j % 8) == 0) {
+ print "\n\t";
+ } else {
+ print " ";
+ }
+ printf("0x%4.4x,", $top[$i][$j] ? $top[$i][$j] : 0);
+ }
+ print "\n};\n\n";
+}
+
+printf("static const wchar_t *const toplevel[256] = {");
+for ($i = 0; $i < 256; $i++) {
+ if (($i % 8) == 0) {
+ print "\n\t";
+ } elsif ($top[$i]) {
+ print " ";
+ } else {
+ print " ";
+ }
+
+ if ($top[$i]) {
+ printf("t2_%2.2x,", $i);
+ } else {
+ print "NULL,";
+ }
+}
+print "\n};\n\n";
diff --git a/Documentation/admin-guide/clearing-warn-once.rst b/Documentation/admin-guide/clearing-warn-once.rst
new file mode 100644
index 0000000..211fd92
--- /dev/null
+++ b/Documentation/admin-guide/clearing-warn-once.rst
@@ -0,0 +1,9 @@
+Clearing WARN_ONCE
+------------------
+
+WARN_ONCE / WARN_ON_ONCE / printk_once only emit a message once. Running::
+
+  echo 1 > /sys/kernel/debug/clear_warn_once
+
+clears the state and allows the warnings to print once again.
+This can be useful after test suite runs to reproduce problems.
diff --git a/Documentation/admin-guide/conf.py b/Documentation/admin-guide/conf.py
deleted file mode 100644
index 86f7389..0000000
--- a/Documentation/admin-guide/conf.py
+++ /dev/null
@@ -1,10 +0,0 @@
-# -*- coding: utf-8; mode: python -*-
-
-project = 'Linux Kernel User Documentation'
-
-tags.add("subproject")
-
-latex_documents = [
- ('index', 'linux-user.tex', 'Linux Kernel User Documentation',
- 'The kernel development community', 'manual'),
-]
diff --git a/Documentation/admin-guide/cpu-load.rst b/Documentation/admin-guide/cpu-load.rst
new file mode 100644
index 0000000..2d01ce4
--- /dev/null
+++ b/Documentation/admin-guide/cpu-load.rst
@@ -0,0 +1,114 @@
+========
+CPU load
+========
+
+Linux exports various bits of information via ``/proc/stat`` and
+``/proc/uptime`` that userland tools, such as top(1), use to calculate
+the average time the system spent in a particular state, for example::
+
+ $ iostat
+ Linux 2.6.18.3-exp (linmac) 02/20/2007
+
+ avg-cpu: %user %nice %system %iowait %steal %idle
+ 10.01 0.00 2.92 5.44 0.00 81.63
+
+ ...
+
+Here the system thinks that over the default sampling period the
+system spent 10.01% of the time doing work in user space, 2.92% in the
+kernel, and was idle 81.63% of the time.
+
+In most cases the ``/proc/stat`` information reflects reality quite
+closely; however, because of how and when the kernel collects this
+data, sometimes it cannot be trusted at all.
+
+So how is this information collected? Whenever a timer interrupt is
+signalled, the kernel looks at what kind of task was running at that
+moment and increments the counter that corresponds to this task's
+kind/state. The problem with this is that the system could have
+switched between various states multiple times between two timer
+interrupts, yet the counter is incremented only for the last state.
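+
+The coarse, counter-based accounting described above can be observed
+directly. The following is only a rough sketch: it reads the aggregate
+``cpu`` line of ``/proc/stat`` twice, one second apart, and reports how many
+of the ticks counted in between were idle. It deliberately ignores the
+iowait, irq, softirq and steal columns, and the variable names are
+illustrative::
+
+  #!/bin/sh
+  # Fields after the "cpu" label: user nice system idle (the rest is ignored).
+  read -r label u1 n1 s1 i1 rest < /proc/stat
+  sleep 1
+  read -r label u2 n2 s2 i2 rest < /proc/stat
+
+  busy=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) ))
+  idle=$(( i2 - i1 ))
+  total=$(( busy + idle ))
+
+  # Guard against a zero delta before dividing.
+  if [ "$total" -gt 0 ]; then
+          echo "idle: $(( 100 * idle / total ))%"
+  fi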
+
+
+Example
+-------
+
+Imagine a system with one task that periodically burns cycles
+in the following manner::
+
+ time line between two timer interrupts
+ |--------------------------------------|
+ ^ ^
+ |_ something begins working |
+ |_ something goes to sleep
+ (only to be awakened quite soon)
+
+In the above situation the system will be 0% loaded according to
+``/proc/stat`` (since the timer interrupt will always happen when the
+system is executing the idle handler), but in reality the load is
+closer to 99%.
+
+One can imagine many more situations where this behavior of the kernel
+will lead to quite erratic information inside ``/proc/stat``::
+
+
+ /* gcc -o hog smallhog.c */
+ #include <time.h>
+ #include <limits.h>
+ #include <signal.h>
+ #include <sys/time.h>
+ #define HIST 10
+
+ static volatile sig_atomic_t stop;
+
+ static void sighandler (int signr)
+ {
+ (void) signr;
+ stop = 1;
+ }
+ static unsigned long hog (unsigned long niters)
+ {
+ stop = 0;
+ while (!stop && --niters);
+ return niters;
+ }
+ int main (void)
+ {
+ int i;
+ struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
+ .it_value = { .tv_sec = 0, .tv_usec = 1 } };
+ sigset_t set;
+ unsigned long v[HIST];
+ double tmp = 0.0;
+ unsigned long n;
+ signal (SIGALRM, &sighandler);
+ setitimer (ITIMER_REAL, &it, NULL);
+
+ hog (ULONG_MAX);
+ for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
+ for (i = 0; i < HIST; ++i) tmp += v[i];
+ tmp /= HIST;
+ n = tmp - (tmp / 3.0);
+
+ sigemptyset (&set);
+ sigaddset (&set, SIGALRM);
+
+ for (;;) {
+ hog (n);
+ sigwait (&set, &i);
+ }
+ return 0;
+ }
+
+
+References
+----------
+
+- http://lkml.org/lkml/2007/2/12/6
+- Documentation/filesystems/proc.txt (1.8)
+
+
+Thanks
+------
+
+Con Kolivas, Pavel Machek
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
new file mode 100644
index 0000000..b90dafc
--- /dev/null
+++ b/Documentation/admin-guide/cputopology.rst
@@ -0,0 +1,177 @@
+===========================================
+How CPU topology info is exported via sysfs
+===========================================
+
+CPU topology info is exported via sysfs. Items (attributes) are similar
+to /proc/cpuinfo output of some architectures. They reside in
+/sys/devices/system/cpu/cpuX/topology/:
+
+physical_package_id:
+
+ physical package id of cpuX. Typically corresponds to a physical
+ socket number, but the actual value is architecture and platform
+ dependent.
+
+die_id:
+
+ the CPU die ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
+core_id:
+
+ the CPU core ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
+book_id:
+
+ the book ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
+drawer_id:
+
+ the drawer ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
+core_cpus:
+
+ internal kernel map of CPUs within the same core.
+ (deprecated name: "thread_siblings")
+
+core_cpus_list:
+
+ human-readable list of CPUs within the same core.
+ (deprecated name: "thread_siblings_list")
+
+package_cpus:
+
+ internal kernel map of the CPUs sharing the same physical_package_id.
+ (deprecated name: "core_siblings")
+
+package_cpus_list:
+
+ human-readable list of CPUs sharing the same physical_package_id.
+ (deprecated name: "core_siblings_list")
+
+die_cpus:
+
+ internal kernel map of CPUs within the same die.
+
+die_cpus_list:
+
+ human-readable list of CPUs within the same die.
+
+book_siblings:
+
+ internal kernel map of cpuX's hardware threads within the same
+ book_id.
+
+book_siblings_list:
+
+ human-readable list of cpuX's hardware threads within the same
+ book_id.
+
+drawer_siblings:
+
+ internal kernel map of cpuX's hardware threads within the same
+ drawer_id.
+
+drawer_siblings_list:
+
+ human-readable list of cpuX's hardware threads within the same
+ drawer_id.
+
+The architecture-neutral drivers/base/topology.c exports these attributes.
+However, the book and drawer related sysfs files will only be created if
+CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
+
+CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
+where they reflect the cpu and cache hierarchy.
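+
+As a quick way to inspect these attributes (a sketch that assumes cpu0 is
+present; which files appear depends on the architecture and kernel
+configuration), ``grep`` can print every file in the topology directory
+together with its contents::
+
+  grep . /sys/devices/system/cpu/cpu0/topology/*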
+
+For an architecture to support this feature, it must define some of
+these macros in include/asm-XXX/topology.h::
+
+ #define topology_physical_package_id(cpu)
+ #define topology_die_id(cpu)
+ #define topology_core_id(cpu)
+ #define topology_book_id(cpu)
+ #define topology_drawer_id(cpu)
+ #define topology_sibling_cpumask(cpu)
+ #define topology_core_cpumask(cpu)
+ #define topology_die_cpumask(cpu)
+ #define topology_book_cpumask(cpu)
+ #define topology_drawer_cpumask(cpu)
+
+The type of ``**_id macros`` is int.
+The type of ``**_cpumask macros`` is ``(const) struct cpumask *``. The latter
+correspond with appropriate ``**_siblings`` sysfs attributes (except for
+topology_sibling_cpumask() which corresponds with thread_siblings).
+
+To be consistent on all architectures, include/linux/topology.h
+provides default definitions for any of the above macros that are
+not defined by include/asm-XXX/topology.h:
+
+1) topology_physical_package_id: -1
+2) topology_die_id: -1
+3) topology_core_id: 0
+4) topology_sibling_cpumask: just the given CPU
+5) topology_core_cpumask: just the given CPU
+6) topology_die_cpumask: just the given CPU
+
+For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
+default definitions for topology_book_id() and topology_book_cpumask().
+For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
+no default definitions for topology_drawer_id() and topology_drawer_cpumask().
+
+Additionally, CPU topology information is provided under
+/sys/devices/system/cpu and includes these files. The internal
+source for the output is in brackets ("[]").
+
+ =========== ==========================================================
+ kernel_max: the maximum CPU index allowed by the kernel configuration.
+ [NR_CPUS-1]
+
+ offline: CPUs that are not online because they have been
+ HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
+ of CPUs allowed by the kernel configuration (kernel_max
+ above). [~cpu_online_mask + cpus >= NR_CPUS]
+
+ online: CPUs that are online and being scheduled [cpu_online_mask]
+
+ possible: CPUs that have been allocated resources and can be
+ brought online if they are present. [cpu_possible_mask]
+
+ present: CPUs that have been identified as being present in the
+ system. [cpu_present_mask]
+ =========== ==========================================================
+
+The format for the above output is compatible with cpulist_parse()
+[see <linux/cpumask.h>]. Some examples follow.
+
+In this example, there are 64 CPUs in the system but cpus 32-63 exceed
+the kernel max which is limited to 0..31 by the NR_CPUS config option
+being 32. Note also that CPUs 2 and 4-31 are not online but could be
+brought online as they are both present and possible::
+
+ kernel_max: 31
+ offline: 2,4-31,32-63
+ online: 0-1,3
+ possible: 0-31
+ present: 0-31
+
+In this example, the NR_CPUS config option is 128, but the kernel was
+started with possible_cpus=144. There are 4 CPUs in the system and cpu2
+was manually taken offline (and is the only CPU that can be brought
+online)::
+
+ kernel_max: 127
+ offline: 2,4-127,128-143
+ online: 0-1,3
+ possible: 0-127
+ present: 0-3
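+
+The five files can be printed in the same layout as the examples above with
+a short loop (a sketch; the output depends entirely on the machine)::
+
+  for f in kernel_max offline online possible present; do
+          printf '%s: %s\n' "$f" "$(cat /sys/devices/system/cpu/$f)"
+  done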
+
+See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
+as well as more information on the various cpumasks.
diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst
new file mode 100644
index 0000000..b17fe35
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/cache-policies.rst
@@ -0,0 +1,131 @@
+=============================
+Guidance for writing policies
+=============================
+
+Try to keep transactionality out of it. The core is careful to
+avoid asking about anything that is migrating. This is a pain, but
+makes it easier to write the policies.
+
+Mappings are loaded into the policy at construction time.
+
+Every bio that is mapped by the target is referred to the policy.
+The policy can return a simple HIT or MISS or issue a migration.
+
+Currently there's no way for the policy to issue background work,
+e.g. to start writing back dirty blocks that are going to be evicted
+soon.
+
+Because we map bios rather than requests, it's easy for the policy
+to get fooled by many small bios. For this reason the core target
+issues periodic ticks to the policy. It's suggested that the policy
+doesn't update states (e.g., hit counts) for a block more than once
+for each tick. The core ticks by watching bios complete, and so
+tries to see when the io scheduler has let the ios run.
+
+
+Overview of supplied cache replacement policies
+===============================================
+
+multiqueue (mq)
+---------------
+
+This policy is now an alias for smq (see below).
+
+The following tunables are accepted, but have no effect::
+
+ 'sequential_threshold <#nr_sequential_ios>'
+ 'random_threshold <#nr_random_ios>'
+ 'read_promote_adjustment <value>'
+ 'write_promote_adjustment <value>'
+ 'discard_promote_adjustment <value>'
+
+Stochastic multiqueue (smq)
+---------------------------
+
+This policy is the default.
+
+The stochastic multi-queue (smq) policy addresses some of the problems
+with the multiqueue (mq) policy.
+
+The smq policy (vs mq) offers the promise of less memory utilization,
+improved performance and increased adaptability in the face of changing
+workloads. smq also does not have any cumbersome tuning knobs.
+
+Users may switch from "mq" to "smq" simply by appropriately reloading a
+DM table that is using the cache target. Doing so will cause all of the
+mq policy's hints to be dropped. Also, performance of the cache may
+degrade slightly until smq recalculates the origin device's hotspots
+that should be cached.
+
+Memory usage
+^^^^^^^^^^^^
+
+The mq policy used a lot of memory; 88 bytes per cache block on a 64
+bit machine.
+
+smq uses 28-bit indexes to implement its data structures rather than
+pointers. It avoids storing an explicit hit count for each block. It
+has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
+the entries (each hotspot block covers a larger area than a single
+cache block).
+
+All this means smq uses ~25 bytes per cache block. Still a lot of
+memory, but a substantial improvement nonetheless.
+
+Level balancing
+^^^^^^^^^^^^^^^
+
+mq placed entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)). This meant the bottom
+levels generally had the most entries, and the top ones had very
+few. Having unbalanced levels like this reduced the efficacy of the
+multiqueue.
+
+smq does not maintain a hit count; instead it swaps hit entries with
+the least recently used entry from the level above. The overall
+ordering is a side effect of this stochastic process. With this
+scheme we can decide how many entries occupy each multiqueue level,
+resulting in better promotion/demotion decisions.
+
+Adaptability
+^^^^^^^^^^^^
+
+The mq policy maintained a hit count for each cache block. For a
+different block to get promoted to the cache its hit count had to
+exceed the lowest currently in the cache. This meant it could take a
+long time for the cache to adapt between varying IO patterns.
+
+smq doesn't maintain hit counts, so a lot of this problem just goes
+away. In addition it tracks performance of the hotspot queue, which
+is used to decide which blocks to promote. If the hotspot queue is
+performing badly then it starts moving entries more quickly between
+levels. This lets it adapt to new IO patterns very quickly.
+
+Performance
+^^^^^^^^^^^
+
+Testing smq shows substantially better performance than mq.
+
+cleaner
+-------
+
+The cleaner writes back all dirty blocks in a cache to decommission it.
+
+Examples
+========
+
+The syntax for a table is::
+
+ cache <metadata dev> <cache dev> <origin dev> <block size>
+ <#feature_args> [<feature arg>]*
+ <policy> <#policy_args> [<policy arg>]*
+
+The syntax to send a message using the dmsetup command is::
+
+ dmsetup message <mapped device> 0 sequential_threshold 1024
+ dmsetup message <mapped device> 0 random_threshold 8
+
+Using dmsetup::
+
+  dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \
+      /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8"
+
+This creates a 128GB mapped device named 'blah' with the
+sequential_threshold set to 1024 and the random_threshold set to 8.
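+
+The 128GB figure follows from the table length of 268435456, which is given
+in 512-byte sectors. As a sanity check (assuming a shell with 64-bit
+arithmetic)::
+
+  # 268435456 sectors * 512 bytes per sector, expressed in GiB
+  echo $(( 268435456 * 512 / 1024 / 1024 / 1024 ))    # prints 128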
diff --git a/Documentation/admin-guide/device-mapper/cache.rst b/Documentation/admin-guide/device-mapper/cache.rst
new file mode 100644
index 0000000..f15e525
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/cache.rst
@@ -0,0 +1,337 @@
+=====
+Cache
+=====
+
+Introduction
+============
+
+dm-cache is a device mapper target written by Joe Thornber, Heinz
+Mauelshagen, and Mike Snitzer.
+
+It aims to improve performance of a block device (e.g., a spindle) by
+dynamically migrating some of its data to a faster, smaller device
+(e.g., an SSD).
+
+This device-mapper solution allows us to insert this caching at
+different levels of the dm stack, for instance above the data device for
+a thin-provisioning pool. Caching solutions that are integrated more
+closely with the virtual memory system should give better performance.
+
+The target reuses the metadata library used by the thin-provisioning
+target.
+
+The decision as to what data to migrate and when is left to a plug-in
+policy module. Several of these have been written as we experiment,
+and we hope other people will contribute others for specific io
+scenarios (e.g., a vm image server).
+
+Glossary
+========
+
+ Migration
+ Movement of the primary copy of a logical block from one
+ device to the other.
+ Promotion
+ Migration from slow device to fast device.
+ Demotion
+ Migration from fast device to slow device.
+
+The origin device always contains a copy of the logical block, which
+may be out of date or kept in sync with the copy on the cache device
+(depending on policy).
+
+Design
+======
+
+Sub-devices
+-----------
+
+The target is constructed by passing three devices to it (along with
+other parameters detailed later):
+
+1. An origin device - the big, slow one.
+
+2. A cache device - the small, fast one.
+
+3. A small metadata device - records which blocks are in the cache,
+ which are dirty, and extra hints for use by the policy object.
+ This information could be put on the cache device, but having it
+ separate allows the volume manager to configure it differently,
+ e.g. as a mirror for extra robustness. This metadata device may only
+ be used by a single cache device.
+
+Fixed block size
+----------------
+
+The origin is divided up into blocks of a fixed size. This block size
+is configurable when you first create the cache. Typically we've been
+using block sizes of 256KB - 1024KB. The block size must be between 64
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
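+
+The constraints above can be checked before a table is loaded. A minimal
+sketch (the function name is hypothetical and not part of any tool)::
+
+  # Succeeds if $1 is a valid dm-cache block size in 512-byte sectors.
+  valid_cache_block_size() {
+          [ "$1" -ge 64 ] && [ "$1" -le 2097152 ] && [ $(( $1 % 64 )) -eq 0 ]
+  }
+
+  valid_cache_block_size 512 && echo "512 sectors (256KB) is acceptable"
+  valid_cache_block_size 100 || echo "100 sectors is not a multiple of 64"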
+
+Having a fixed block size simplifies the target a lot. But it is
+something of a compromise. For instance, a small part of a block may be
+getting hit a lot, yet the whole block will be promoted to the cache.
+So large block sizes are bad because they waste cache space. And small
+block sizes are bad because they increase the amount of metadata (both
+in core and on disk).
+
+Cache operating modes
+---------------------
+
+The cache has three operating modes: writeback, writethrough and
+passthrough.
+
+If writeback, the default, is selected then a write to a block that is
+cached will go only to the cache and the block will be marked dirty in
+the metadata.
+
+If writethrough is selected then a write to a cached block will not
+complete until it has hit both the origin and cache devices. Clean
+blocks should remain clean.
+
+If passthrough is selected, useful when the cache contents are not known
+to be coherent with the origin device, then all reads are served from
+the origin device (all reads miss the cache) and all writes are
+forwarded to the origin device; additionally, write hits cause cache
+block invalidates. To enable passthrough mode the cache must be clean.
+Passthrough mode allows a cache device to be activated without having to
+worry about coherency. Coherency that exists is maintained, although
+the cache will gradually cool as writes take place. If the coherency of
+the cache can later be verified, or established through use of the
+"invalidate_cblocks" message, the cache device can be transitioned to
+writethrough or writeback mode while still warm. Otherwise, the cache
+contents can be discarded prior to transitioning to the desired
+operating mode.
+
+A simple cleaner policy is provided, which will clean (write back) all
+dirty blocks in a cache. This is useful for decommissioning or
+shrinking a cache. Shrinking the cache's fast device requires all cache
+blocks, in the area of the cache being removed, to be clean. If the
+area being removed from the cache still contains dirty blocks the resize
+will fail. Care must be taken to never reduce the volume used for the
+cache's fast device until the cache is clean. This is of particular
+importance if writeback mode is used. Writethrough and passthrough
+modes already maintain a clean cache. Future support to partially clean
+the cache, above a specified threshold, will allow for keeping the cache
+warm and in writeback mode during resize.
+
+Migration throttling
+--------------------
+
+Migrating data between the origin and cache device uses bandwidth.
+The user can set a throttle to prevent more than a certain amount of
+migration occurring at any one time. Currently we're not taking any
+account of normal io traffic going to the devices. More work needs
+doing here to avoid migrating during those peak io moments.
+
+For the time being, a message "migration_threshold <#sectors>"
+can be used to set the maximum number of sectors being migrated,
+the default being 2048 sectors (1MB).
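+
+For example, assuming a cache device named ``my_cache`` (a placeholder
+name), the threshold could be raised to 4096 sectors (2MB) with::
+
+  dmsetup message my_cache 0 migration_threshold 4096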
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second. This
+means the cache behaves like a physical disk that has a volatile write
+cache. If power is lost you may lose some recent writes. The metadata
+should always be consistent in spite of any crash.
+
+The 'dirty' state for a cache block changes far too frequently for us
+to keep updating it on the fly. So we treat it as a hint. In normal
+operation it will be written when the dm device is suspended. If the
+system crashes all cache blocks will be assumed dirty when restarted.
+
+Per-block policy hints
+----------------------
+
+Policy plug-ins can store a chunk of data per cache block. It's up to
+the policy how big this chunk is, but it should be kept small. Like the
+dirty flags this data is lost if there's a crash so a safe fallback
+value should always be possible.
+
+Policy hints affect performance, not correctness.
+
+Policy messaging
+----------------
+
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these. Device-mapper
+messages are used. Refer to cache-policies.rst.
+
+Discard bitset resolution
+-------------------------
+
+We can avoid copying data during migration if we know the block has
+been discarded. A prime example of this is when mkfs discards the
+whole block device. We store a bitset tracking the discard state of
+blocks. However, we allow this bitset to have a different block size
+from the cache blocks. This is because we need to track the discard
+state for all of the origin device (compare with the dirty bitset
+which is just for the smaller cache device).
+
+Target interface
+================
+
+Constructor
+-----------
+
+ ::
+
+ cache <metadata dev> <cache dev> <origin dev> <block size>
+ <#feature args> [<feature arg>]*
+ <policy> <#policy args> [policy args]*
+
+ ================ =======================================================
+ metadata dev fast device holding the persistent metadata
+ cache dev fast device holding cached data blocks
+ origin dev slow device holding original data blocks
+ block size cache unit size in sectors
+
+ #feature args number of feature arguments passed
+ feature args writethrough or passthrough (The default is writeback.)
+
+ policy the replacement policy to use
+ #policy args an even number of arguments corresponding to
+ key/value pairs passed to the policy
+ policy args key/value pairs passed to the policy
+ E.g. 'sequential_threshold 1024'
+ See cache-policies.rst for details.
+ ================ =======================================================
+
+Optional feature arguments are:
+
+
+ ==================== ========================================================
+ writethrough write through caching that prohibits cache block
+ content from being different from origin block content.
+ Without this argument, the default behaviour is to write
+ back cache block contents later for performance reasons,
+ so they may differ from the corresponding origin blocks.
+
+ passthrough a degraded mode useful for various cache coherency
+ situations (e.g., rolling back snapshots of
+ underlying storage). Reads and writes always go to
+ the origin. If a write goes to a cached origin
+ block, then the cache block is invalidated.
+ To enable passthrough mode the cache must be clean.
+
+ metadata2 use version 2 of the metadata. This stores the dirty
+ bits in a separate btree, which improves speed of
+ shutting down the cache.
+
+ no_discard_passdown disable passing down discards from the cache
+ to the origin's data device.
+ ==================== ========================================================
+
+A policy called 'default' is always registered. This is an alias for
+the policy we currently think is giving best all round performance.
+
+As the default policy could vary between kernels, if you are relying on
+the characteristics of a specific policy, always request it by name.
+
+Status
+------
+
+::
+
+ <metadata block size> <#used metadata blocks>/<#total metadata blocks>
+ <cache block size> <#used cache blocks>/<#total cache blocks>
+ <#read hits> <#read misses> <#write hits> <#write misses>
+ <#demotions> <#promotions> <#dirty> <#features> <features>*
+ <#core args> <core args>* <policy name> <#policy args> <policy args>*
+ <cache metadata mode> <needs_check>
+
+
+========================= =====================================================
+metadata block size Fixed block size for each metadata block in
+ sectors
+#used metadata blocks Number of metadata blocks used
+#total metadata blocks Total number of metadata blocks
+cache block size Configurable block size for the cache device
+ in sectors
+#used cache blocks Number of blocks resident in the cache
+#total cache blocks Total number of cache blocks
+#read hits Number of times a READ bio has been mapped
+ to the cache
+#read misses Number of times a READ bio has been mapped
+ to the origin
+#write hits Number of times a WRITE bio has been mapped
+ to the cache
+#write misses Number of times a WRITE bio has been
+ mapped to the origin
+#demotions Number of times a block has been removed
+ from the cache
+#promotions Number of times a block has been moved to
+ the cache
+#dirty Number of blocks in the cache that differ
+ from the origin
+#feature args Number of feature args to follow
+feature args 'writethrough' (optional)
+#core args Number of core arguments (must be even)
+core args Key/value pairs for tuning the core
+ e.g. migration_threshold
+policy name Name of the policy
+#policy args Number of policy arguments to follow (must be even)
+policy args Key/value pairs e.g. sequential_threshold
+cache metadata mode ro if read-only, rw if read-write
+
+ In serious cases where even a read-only mode is
+ deemed unsafe no further I/O will be permitted and
+ the status will just contain the string 'Fail'.
+ The userspace recovery tools should then be used.
+needs_check 'needs_check' if set, '-' if not set
+ A metadata operation has failed, resulting in the
+ needs_check flag being set in the metadata's
+ superblock. The metadata device must be
+ deactivated and checked/repaired before the
+ cache can be made fully operational again.
+ '-' indicates needs_check is not set.
+========================= =====================================================
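+
+As an illustration of reading these fields, the read hit ratio can be derived
+from the status line. This sketch assumes a cache device named ``my_cache``
+and that ``dmsetup status`` prefixes the fields above with the start sector,
+the length and the target name, so that ``<#read hits>`` and
+``<#read misses>`` are the 8th and 9th whitespace-separated fields::
+
+  dmsetup status my_cache | awk '{
+          reads = $8 + $9
+          if (reads > 0)
+                  printf "read hit ratio: %.1f%%\n", 100 * $8 / reads
+  }'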
+
+Messages
+--------
+
+Policies will have different tunables, specific to each one, so we
+need a generic way of getting and setting these. Device-mapper
+messages are used. (A sysfs interface would also be possible.)
+
+The message format is::
+
+ <key> <value>
+
+E.g.::
+
+ dmsetup message my_cache 0 sequential_threshold 1024
+
+
+Invalidation is removing an entry from the cache without writing it
+back. Cache blocks can be invalidated via the invalidate_cblocks
+message, which takes an arbitrary number of cblock ranges. Each cblock
+range's end value is "one past the end", meaning 5-10 expresses a range
+of values from 5 to 9. Each cblock must be expressed as a decimal
+value; in the future, a variant message that takes cblock ranges
+expressed in hexadecimal may be needed to better support efficient
+invalidation of larger caches. The cache must be in passthrough mode
+when invalidate_cblocks is used::
+
+ invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
+
+E.g.::
+
+ dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
+
+Examples
+========
+
+The test suite can be found here:
+
+https://github.com/jthornber/device-mapper-test-suite
+
+::
+
+ dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
+ /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
+ dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
+ /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
+ mq 4 sequential_threshold 1024 random_threshold 8'
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst
new file mode 100644
index 0000000..917ba8c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/delay.rst
@@ -0,0 +1,31 @@
+========
+dm-delay
+========
+
+Device-Mapper's "delay" target delays reads and/or writes
+and maps them to different devices.
+
+Parameters::
+
+ <device> <offset> <delay> [<write_device> <write_offset> <write_delay>
+ [<flush_device> <flush_offset> <flush_delay>]]
+
+With separate write parameters, the first set is only used for reads.
+Offsets are specified in sectors.
+Delays are specified in milliseconds.
+
+Example scripts
+===============
+
+::
+
+ #!/bin/sh
+ # Create device delaying rw operation for 500ms
+ echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
+
+::
+
+ #!/bin/sh
+ # Create device delaying only write operation for 500ms and
+ # splitting reads and writes to different devices $1 $2
+ echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
diff --git a/Documentation/admin-guide/device-mapper/dm-clone.rst b/Documentation/admin-guide/device-mapper/dm-clone.rst
new file mode 100644
index 0000000..b43a34c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-clone.rst
@@ -0,0 +1,333 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+========
+dm-clone
+========
+
+Introduction
+============
+
+dm-clone is a device mapper target which produces a one-to-one copy of an
+existing, read-only source device into a writable destination device: It
+presents a virtual block device which makes all data appear immediately, and
+redirects reads and writes accordingly.
+
+The main use case of dm-clone is to clone a potentially remote, high-latency,
+read-only, archival-type block device into a writable, fast, primary-type device
+for fast, low-latency I/O. The cloned device is visible/mountable immediately
+and the copy of the source device to the destination device happens in the
+background, in parallel with user I/O.
+
+For example, one could restore an application backup from a read-only copy,
+accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
+etc.), into a local SSD or NVMe device, and start using the device immediately,
+without waiting for the restore to complete.
+
+When the cloning completes, the dm-clone table can be removed altogether and be
+replaced, e.g., by a linear table, mapping directly to the destination device.
+
+The dm-clone target reuses the metadata library used by the thin-provisioning
+target.
+
+Glossary
+========
+
+ Hydration
+ The process of filling a region of the destination device with data from
+ the same region of the source device, i.e., copying the region from the
+ source to the destination device.
+
+Once a region gets hydrated we redirect all I/O regarding it to the destination
+device.
+
+Design
+======
+
+Sub-devices
+-----------
+
+The target is constructed by passing three devices to it (along with other
+parameters detailed later):
+
+1. A source device - the read-only device that gets cloned and source of the
+ hydration.
+
+2. A destination device - the destination of the hydration, which will become a
+ clone of the source device.
+
+3. A small metadata device - it records which regions are already valid in the
+ destination device, i.e., which regions have already been hydrated, or have
+ been written to directly, via user I/O.
+
+The size of the destination device must be at least equal to the size of the
+source device.
+
+Regions
+-------
+
+dm-clone divides the source and destination devices into fixed-size regions.
+Regions are the unit of hydration, i.e., the minimum amount of data copied from
+the source to the destination device.
+
+The region size is configurable when you first create the dm-clone device. The
+recommended region size is the same as the file system block size, which usually
+is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
+(1GB) and a power of two.
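+
+The constraints on the region size can be checked before creating the device.
+This is a minimal sketch of such a check; the function name valid_region_size
+is hypothetical and the limits are the ones quoted in the paragraph above.
+

```shell
#!/bin/sh
# valid_region_size <sectors>
# Hypothetical helper: succeeds only if the size is between 8 and 2097152
# sectors (inclusive) and is a power of two.
valid_region_size() {
    n=$1
    [ "$n" -ge 8 ] && [ "$n" -le 2097152 ] && [ $(( n & (n - 1) )) -eq 0 ]
}
```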
+
+Reads and writes from/to hydrated regions are serviced from the destination
+device.
+
+A read to a not yet hydrated region is serviced directly from the source device.
+
+A write to a not yet hydrated region is delayed until the region has been
+hydrated; the hydration of that region starts immediately.
+
+Note that a write request with size equal to region size will skip copying of
+the corresponding region from the source device and overwrite the region of the
+destination device directly.
+
+Discards
+--------
+
+dm-clone interprets a discard request to a range that hasn't been hydrated yet
+as a hint to skip hydration of the regions covered by the request, i.e., it
+skips copying the region's data from the source to the destination device, and
+only updates its metadata.
+
+If the destination device supports discards, then by default dm-clone will pass
+down discard requests to it.
+
+Background Hydration
+--------------------
+
+dm-clone copies continuously from the source to the destination device, until
+all of the device has been copied.
+
+Copying data from the source to the destination device uses bandwidth. The user
+can set a throttle to prevent more than a certain amount of copying occurring at
+any one time. Moreover, dm-clone takes into account user I/O traffic going to
+the devices and pauses the background hydration when there is I/O in-flight.
+
+A message `hydration_threshold <#regions>` can be used to set the maximum number
+of regions being copied, the default being 1 region.
+
+dm-clone employs dm-kcopyd for copying portions of the source device to the
+destination device. By default, we issue copy requests of size equal to the
+region size. A message `hydration_batch_size <#regions>` can be used to tune the
+size of these copy requests. Increasing the hydration batch size results in
+dm-clone trying to batch together contiguous regions, so we copy the data in
+batches of this many regions.
+
+When the hydration of the destination device finishes, a dm event will be sent
+to user space.
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
+such requests are made then commits will occur every second. This means the
+dm-clone device behaves like a physical disk that has a volatile write cache. If
+power is lost you may lose some recent writes. The metadata should always be
+consistent in spite of any crash.
+
+Target Interface
+================
+
+Constructor
+-----------
+
+ ::
+
+ clone <metadata dev> <destination dev> <source dev> <region size>
+ [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]
+
+ ================ ==============================================================
+ metadata dev Fast device holding the persistent metadata
+ destination dev The destination device, where the source will be cloned
+ source dev Read only device containing the data that gets cloned
+ region size The size of a region in sectors
+
+ #feature args Number of feature arguments passed
+ feature args no_hydration or no_discard_passdown
+
+ #core args An even number of arguments corresponding to key/value pairs
+ passed to dm-clone
+ core args Key/value pairs passed to dm-clone, e.g. `hydration_threshold
+ 256`
+ ================ ==============================================================
+
+Optional feature arguments are:
+
+ ==================== =========================================================
+ no_hydration Create a dm-clone instance with background hydration
+ disabled
+ no_discard_passdown Disable passing down discards to the destination device
+ ==================== =========================================================
+
+Optional core arguments are:
+
+ ================================ ==============================================
+ hydration_threshold <#regions> Maximum number of regions being copied from
+ the source to the destination device at any
+ one time, during background hydration.
+ hydration_batch_size <#regions> During background hydration, try to batch
+ together contiguous regions, so we copy data
+ from the source to the destination device in
+ batches of this many regions.
+ ================================ ==============================================
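+
+The argument counts in the constructor line are easy to get wrong by hand.
+The sketch below assembles a table line from the pieces described above and
+fills in <#feature args> and <#core args> by counting words. The function
+name clone_table and the device paths used with it are hypothetical, and it
+is an assumption that explicit zero counts are accepted when no optional
+arguments are given.
+

```shell
#!/bin/sh
# clone_table <num_sectors> <metadata dev> <destination dev> <source dev> \
#             <region size> ["<feature args>"] ["<core args>"]
# Hypothetical helper: prints a dm-clone table line; runs in a subshell so
# that it does not disturb the caller's positional parameters.
clone_table() (
    len=$1 meta=$2 dest=$3 src=$4 region=$5
    features=${6:-}
    core=${7:-}

    # Count the words in each optional group.
    set -- $features
    nfeat=$#
    set -- $core
    ncore=$#

    # Core arguments are key/value pairs, so their count must be even.
    if [ $(( ncore % 2 )) -ne 0 ]; then
        echo "clone_table: core args must be key/value pairs" >&2
        exit 1
    fi

    line="0 $len clone $meta $dest $src $region $nfeat"
    if [ "$nfeat" -gt 0 ]; then line="$line $features"; fi
    line="$line $ncore"
    if [ "$ncore" -gt 0 ]; then line="$line $core"; fi
    echo "$line"
)
```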
+
+Status
+------
+
+ ::
+
+ <metadata block size> <#used metadata blocks>/<#total metadata blocks>
+ <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
+ <#feature args> <feature args>* <#core args> <core args>*
+ <clone metadata mode>
+
+ ======================= =======================================================
+ metadata block size Fixed block size for each metadata block in sectors
+ #used metadata blocks Number of metadata blocks used
+ #total metadata blocks Total number of metadata blocks
+ region size Configurable region size for the device in sectors
+ #hydrated regions Number of regions that have finished hydrating
+ #total regions Total number of regions to hydrate
+ #hydrating regions Number of regions currently hydrating
+ #feature args Number of feature arguments to follow
+ feature args Feature arguments, e.g. `no_hydration`
+ #core args Even number of core arguments to follow
+ core args Key/value pairs for tuning the core, e.g.
+ `hydration_threshold 256`
+ clone metadata mode ro if read-only, rw if read-write
+
+ In serious cases where even a read-only mode is deemed
+ unsafe no further I/O will be permitted and the status
+ will just contain the string 'Fail'. If the metadata
+ mode changes, a dm event will be sent to user space.
+ ======================= =======================================================
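+
+The status fields above are enough to estimate hydration progress. The
+following is a minimal sketch: the function name hydration_percent is
+hypothetical, and it expects only the target-specific part of the status
+line, i.e. the fields listed above without the leading start sector, length
+and target name that dmsetup prints.
+

```shell
#!/bin/sh
# hydration_percent "<target-specific status fields>"
# Hypothetical helper: prints the percentage of regions already hydrated,
# or "Fail" (with a non-zero exit status) if the target reports failure.
hydration_percent() (
    set -- $1
    if [ "${1:-}" = "Fail" ]; then
        echo "Fail"
        exit 1
    fi
    # Field 4 is "<#hydrated regions>/<#total regions>".
    [ $# -ge 4 ] || exit 1
    regions=$4
    hydrated=${regions%/*}
    total=${regions#*/}
    [ "$total" -gt 0 ] || exit 1
    echo $(( hydrated * 100 / total ))
)
```

+
+A possible invocation is
+``hydration_percent "$(dmsetup status clone | cut -d' ' -f4-)"``, where the
+cut call strips the three leading fields.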
+
+Messages
+--------
+
+ `disable_hydration`
+ Disable the background hydration of the destination device.
+
+ `enable_hydration`
+ Enable the background hydration of the destination device.
+
+ `hydration_threshold <#regions>`
+ Set background hydration threshold.
+
+ `hydration_batch_size <#regions>`
+ Set background hydration batch size.
+
+Examples
+========
+
+Clone a device containing a file system
+---------------------------------------
+
+1. Create the dm-clone device.
+
+ ::
+
+ dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
+ $source_dev 8 1 no_hydration"
+
+2. Mount the device and trim the file system. dm-clone interprets the discards
+ sent by the file system and it will not hydrate the unused space.
+
+ ::
+
+ mount /dev/mapper/clone /mnt/cloned-fs
+ fstrim /mnt/cloned-fs
+
+3. Enable background hydration of the destination device.
+
+ ::
+
+ dmsetup message clone 0 enable_hydration
+
+4. When the hydration finishes, we can replace the dm-clone table with a linear
+ table.
+
+ ::
+
+ dmsetup suspend clone
+ dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
+ dmsetup resume clone
+
+ The metadata device is no longer needed and can be safely discarded or reused
+ for other purposes.
+
+Known issues
+============
+
+1. We redirect reads of not-yet-hydrated regions to the source device. If
+ reading the source device has high latency and the user repeatedly reads from
+ the same regions, this behaviour could degrade performance. We should use
+ these reads as hints to hydrate the relevant regions sooner. Currently, we
+ rely on the page cache to cache these regions, so we hopefully don't end up
+ reading them multiple times from the source device.
+
+2. Release in-core resources, i.e., the bitmaps tracking which regions are
+ hydrated, after the hydration has finished.
+
+3. During background hydration, if we fail to read the source or write to the
+ destination device, we print an error message, but the hydration process
+ continues indefinitely, until it succeeds. We should stop the background
+ hydration after a number of failures and emit a dm event for user space to
+ notice.
+
+Why not...?
+===========
+
+We explored the following alternatives before implementing dm-clone:
+
+1. Use dm-cache with cache size equal to the source device and implement a new
+ cloning policy:
+
+ * The resulting cache device is not a one-to-one mirror of the source device
+ and thus we cannot remove the cache device once cloning completes.
+
+ * dm-cache writes to the source device, which violates our requirement that
+ the source device must be treated as read-only.
+
+ * Caching is semantically different from cloning.
+
+2. Use dm-snapshot with a COW device equal to the source device:
+
+ * dm-snapshot stores its metadata in the COW device, so the resulting device
+ is not a one-to-one mirror of the source device.
+
+ * No background copying mechanism.
+
+ * dm-snapshot needs to commit its metadata whenever a pending exception
+ completes, to ensure snapshot consistency. In the case of cloning, we don't
+ need to be so strict and can rely on committing metadata every time a FLUSH
+ or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
+ improves the performance significantly.
+
+3. Use dm-mirror: The mirror target has a background copying/mirroring
+ mechanism, but it writes to all mirrors, thus violating our requirement that
+ the source device must be treated as read-only.
+
+4. Use dm-thin's external snapshot functionality. This approach is the most
+ promising among all alternatives, as the thinly-provisioned volume is a
+ one-to-one mirror of the source device and handles reads and writes to
+ un-provisioned/not-yet-cloned areas the same way as dm-clone does.
+
+ Still:
+
+ * There is no background copying mechanism, though one could be implemented.
+
+ * Most importantly, we want to support arbitrary block devices as the
+ destination of the cloning process and not restrict ourselves to
+ thinly-provisioned volumes. Thin-provisioning has an inherent metadata
+ overhead, for maintaining the thin volume mappings, which significantly
+ degrades performance.
+
+ Moreover, cloning a device shouldn't force the use of thin-provisioning. On
+ the other hand, if we wish to use thin provisioning, we can just use a thin
+ LV as dm-clone's destination device.
diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst
new file mode 100644
index 0000000..8f4a3f8
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst
@@ -0,0 +1,173 @@
+========
+dm-crypt
+========
+
+Device-Mapper's "crypt" target provides transparent encryption of block devices
+using the kernel crypto API.
+
+For a more detailed description of supported parameters see:
+https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
+
+Parameters::
+
+ <cipher> <key> <iv_offset> <device path> \
+ <offset> [<#opt_params> <opt_params>]
+
+<cipher>
+ Encryption cipher, encryption mode and Initial Vector (IV) generator.
+
+ The cipher specifications format is::
+
+ cipher[:keycount]-chainmode-ivmode[:ivopts]
+
+ Examples::
+
+ aes-cbc-essiv:sha256
+ aes-xts-plain64
+ serpent-xts-plain64
+
+ Cipher format also supports direct specification with kernel crypto API
+ format (selected by capi: prefix). The IV specification is the same
+ as for the first format type.
+ This format is mainly used for specification of authenticated modes.
+
+ The crypto API cipher specifications format is::
+
+ capi:cipher_api_spec-ivmode[:ivopts]
+
+ Examples::
+
+ capi:cbc(aes)-essiv:sha256
+ capi:xts(aes)-plain64
+
+ Examples of authenticated modes::
+
+ capi:gcm(aes)-random
+ capi:authenc(hmac(sha256),xts(aes))-random
+ capi:rfc7539(chacha20,poly1305)-random
+
+ /proc/crypto contains a list of currently loaded crypto modes.
+
+<key>
+ Key used for encryption. It is encoded either as a hexadecimal number
+ or it can be passed as <key_string> prefixed with single colon
+ character (':') for keys residing in kernel keyring service.
+ You can only use key sizes that are valid for the selected cipher
+ in combination with the selected iv mode.
+ Note that for some iv modes the key string can contain additional
+ keys (for example IV seed) so the key contains more parts concatenated
+ into a single string.
+
+<key_string>
+ The kernel keyring key is identified by a string in the following format:
+ <key_size>:<key_type>:<key_description>.
+
+<key_size>
+ The encryption key size in bytes. The kernel key payload size must match
+ the value passed in <key_size>.
+
+<key_type>
+ Either 'logon' or 'user' kernel key type.
+
+<key_description>
+ The kernel keyring key description that the crypt target should look for
+ when loading a key of <key_type>.
+
+<keycount>
+ Multi-key compatibility mode. You can define <keycount> keys and
+ then sectors are encrypted according to their offsets (sector 0 uses key0;
+ sector 1 uses key1 etc.). <keycount> must be a power of two.
+
+<iv_offset>
+ The IV offset is a sector count that is added to the sector number
+ before creating the IV.
+
+<device path>
+ This is the device that is going to be used as backend and contains the
+ encrypted data. You can specify it as a path like /dev/xxx or a device
+ number <major>:<minor>.
+
+<offset>
+ Starting sector within the device where the encrypted data begins.
+
+<#opt_params>
+ Number of optional parameters. If there are no optional parameters,
+ the optional parameters section can be skipped or #opt_params can be zero.
+ Otherwise #opt_params is the number of following arguments.
+
+ Example of optional parameters section:
+ 3 allow_discards same_cpu_crypt submit_from_crypt_cpus
+
+allow_discards
+ Block discard requests (a.k.a. TRIM) are passed through the crypt device.
+ The default is to ignore discard requests.
+
+ WARNING: Assess the specific security risks carefully before enabling this
+ option. For example, allowing discards on encrypted devices may lead to
+ the leak of information about the ciphertext device (filesystem type,
+ used space etc.) if the discarded blocks can be located easily on the
+ device later.
+
+same_cpu_crypt
+ Perform encryption using the same cpu that IO was submitted on.
+ The default is to use an unbound workqueue so that encryption work
+ is automatically balanced between available CPUs.
+
+submit_from_crypt_cpus
+ Disable offloading writes to a separate thread after encryption.
+ There are some situations where offloading write bios from the
+ encryption threads to a single thread degrades performance
+ significantly. The default is to offload write bios to the same
+ thread because it benefits CFQ to have writes submitted using the
+ same context.
+
+integrity:<bytes>:<type>
+ The device requires additional <bytes> metadata per-sector stored
+ in per-bio integrity structure. This metadata must be provided
+ by the underlying dm-integrity target.
+
+ The <type> can be "none" if metadata is used only for persistent IV.
+
+ For Authenticated Encryption with Additional Data (AEAD)
+ the <type> is "aead". An AEAD mode additionally calculates and verifies
+ integrity for the encrypted device. The additional space is then
+ used for storing authentication tag (and persistent IV if needed).
+
+sector_size:<bytes>
+ Use <bytes> as the encryption unit instead of 512 bytes sectors.
+ This option can be in range 512 - 4096 bytes and must be a power of two.
+ Virtual device will announce this size as a minimal IO and logical sector.
+
+iv_large_sectors
+ IV generators will use sector number counted in <sector_size> units
+ instead of default 512 bytes sectors.
+
+ For example, if <sector_size> is 4096 bytes, plain64 IV for the second
+ sector will be 8 (without flag) and 1 if iv_large_sectors is present.
+ The <iv_offset> must be a multiple of <sector_size> (in 512 bytes units)
+ if this flag is specified.
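+
+The iv_large_sectors arithmetic can be illustrated with a short calculation.
+This is only a sketch of the rule stated above, not dm-crypt code: the
+function name plain64_iv is hypothetical, encryption sectors are indexed
+from 0 (so the "second sector" is index 1), and <iv_offset> is assumed to
+be 0.
+

```shell
#!/bin/sh
# plain64_iv <sector index> <sector_size in bytes> [iv_large_sectors]
# Hypothetical helper: prints the sector number fed to the plain64 IV
# generator for the given encryption sector.
plain64_iv() {
    index=$1
    sector_size=$2
    if [ "${3:-}" = "iv_large_sectors" ]; then
        # Counted in <sector_size> units.
        echo "$index"
    else
        # Counted in 512-byte units (the default).
        echo $(( index * (sector_size / 512) ))
    fi
}
```

+
+With these assumptions, ``plain64_iv 1 4096`` prints 8 and
+``plain64_iv 1 4096 iv_large_sectors`` prints 1, matching the example above.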
+
+Example scripts
+===============
+LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
+encryption with dm-crypt using the 'cryptsetup' utility, see
+https://gitlab.com/cryptsetup/cryptsetup
+
+::
+
+ #!/bin/sh
+ # Create a crypt device using dmsetup
+ dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0"
+
+::
+
+ #!/bin/sh
+ # Create a crypt device using dmsetup when encryption key is stored in keyring service
+ dmsetup create crypt2 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0"
+
+::
+
+ #!/bin/sh
+ # Create a crypt device using cryptsetup and LUKS header with default cipher
+ cryptsetup luksFormat $1
+ cryptsetup luksOpen $1 crypt1
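+
+The <key> argument for a keyring-backed key follows the <key_string> format
+described earlier. The sketch below builds that argument and rejects key
+types other than the two listed under <key_type>. The function name
+keyring_key_arg is hypothetical; the key description in the usage note is
+the one from the second script above.
+

```shell
#!/bin/sh
# keyring_key_arg <key_size in bytes> <key_type> <key_description>
# Hypothetical helper: prints ":<key_size>:<key_type>:<key_description>".
keyring_key_arg() {
    case $2 in
        logon|user) echo ":$1:$2:$3" ;;
        *) echo "keyring_key_arg: key type must be 'logon' or 'user'" >&2
           return 1 ;;
    esac
}
```

+
+For example, ``keyring_key_arg 32 logon my_prefix:my_key`` prints
+``:32:logon:my_prefix:my_key``.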
diff --git a/Documentation/admin-guide/device-mapper/dm-dust.txt b/Documentation/admin-guide/device-mapper/dm-dust.txt
new file mode 100644
index 0000000..954d402
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-dust.txt
@@ -0,0 +1,272 @@
+dm-dust
+=======
+
+This target emulates the behavior of bad sectors at arbitrary
+locations, and the ability to enable the emulation of the failures
+at an arbitrary time.
+
+This target behaves similarly to a linear target. At a given time,
+the user can send a message to the target to start failing read
+requests on specific blocks (to emulate the behavior of a hard disk
+drive with bad sectors).
+
+When the failure behavior is enabled (i.e.: when the output of
+"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
+in the "bad block list" will fail with EIO ("Input/output error").
+
+Writes of blocks in the "bad block list" will result in the following:
+
+1. Remove the block from the "bad block list".
+2. Successfully complete the write.
+
+This emulates the "remapped sector" behavior of a drive with bad
+sectors.
+
+Normally, a drive that is encountering bad sectors will most likely
+encounter more bad sectors, at an unknown time or location.
+With dm-dust, the user can use the "addbadblock" and "removebadblock"
+messages to add arbitrary bad blocks at new locations, and the
+"enable" and "disable" messages to modulate the state of whether the
+configured "bad blocks" will be treated as bad, or bypassed.
+This allows the pre-writing of test data and metadata prior to
+simulating a "failure" event where bad sectors start to appear.
+
+Table parameters:
+-----------------
+<device_path> <offset> <blksz>
+
+Mandatory parameters:
+ <device_path>: path to the block device.
+ <offset>: offset to data area from start of device_path
+ <blksz>: block size in bytes
+ (minimum 512, maximum 1073741824, must be a power of 2)
+
+Usage instructions:
+-------------------
+
+First, find the size (in 512-byte sectors) of the device to be used:
+
+$ sudo blockdev --getsz /dev/vdb1
+33552384
+
+Create the dm-dust device:
+(For a device with a block size of 512 bytes)
+$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512'
+
+(For a device with a block size of 4096 bytes)
+$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096'
+
+Check the status of the read behavior ("bypass" indicates that all I/O
+will be passed through to the underlying device):
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 bypass
+
+$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
+128+0 records in
+128+0 records out
+
+$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
+128+0 records in
+128+0 records out
+
+Adding and removing bad blocks:
+-------------------------------
+
+At any time (i.e.: whether the device has the "bad block" emulation
+enabled or disabled), bad blocks may be added or removed from the
+device via the "addbadblock" and "removebadblock" messages:
+
+$ sudo dmsetup message dust1 0 addbadblock 60
+kernel: device-mapper: dust: badblock added at block 60
+
+$ sudo dmsetup message dust1 0 addbadblock 67
+kernel: device-mapper: dust: badblock added at block 67
+
+$ sudo dmsetup message dust1 0 addbadblock 72
+kernel: device-mapper: dust: badblock added at block 72
+
+These bad blocks will be stored in the "bad block list".
+While the device is in "bypass" mode, reads and writes will succeed:
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 bypass
+
+Enabling block read failures:
+-----------------------------
+
+To enable the "fail read on bad block" behavior, send the "enable" message:
+
+$ sudo dmsetup message dust1 0 enable
+kernel: device-mapper: dust: enabling read failures on bad sectors
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block
+
+With the device in "fail read on bad block" mode, attempting to read a
+block will encounter an "Input/output error":
+
+$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
+dd: error reading '/dev/mapper/dust1': Input/output error
+0+0 records in
+0+0 records out
+0 bytes copied, 0.00040651 s, 0.0 kB/s
+
+...and writing to the bad blocks will remove the blocks from the list,
+therefore emulating the "remap" behavior of hard disk drives:
+
+$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
+128+0 records in
+128+0 records out
+
+kernel: device-mapper: dust: block 60 removed from badblocklist by write
+kernel: device-mapper: dust: block 67 removed from badblocklist by write
+kernel: device-mapper: dust: block 72 removed from badblocklist by write
+kernel: device-mapper: dust: block 87 removed from badblocklist by write
+
+Bad block add/remove error handling:
+------------------------------------
+
+Attempting to add a bad block that already exists in the list will
+result in an "Invalid argument" error, as well as a helpful message:
+
+$ sudo dmsetup message dust1 0 addbadblock 88
+device-mapper: message ioctl on dust1 failed: Invalid argument
+kernel: device-mapper: dust: block 88 already in badblocklist
+
+Attempting to remove a bad block that doesn't exist in the list will
+result in an "Invalid argument" error, as well as a helpful message:
+
+$ sudo dmsetup message dust1 0 removebadblock 87
+device-mapper: message ioctl on dust1 failed: Invalid argument
+kernel: device-mapper: dust: block 87 not found in badblocklist
+
+Counting the number of bad blocks in the bad block list:
+--------------------------------------------------------
+
+To count the number of bad blocks configured in the device, run the
+following message command:
+
+$ sudo dmsetup message dust1 0 countbadblocks
+
+A message will print with the number of bad blocks currently
+configured on the device:
+
+kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
+
+Querying for specific bad blocks:
+---------------------------------
+
+To find out if a specific block is in the bad block list, run the
+following message command:
+
+$ sudo dmsetup message dust1 0 queryblock 72
+
+The following message will print if the block is in the list:
+device-mapper: dust: queryblock: block 72 found in badblocklist
+
+The following message will print if the block is not in the list:
+device-mapper: dust: queryblock: block 72 not found in badblocklist
+
+The "queryblock" message command will work in both the "enabled"
+and "disabled" modes, allowing the verification of whether a block
+will be treated as "bad" without having to issue I/O to the device,
+or having to "enable" the bad block emulation.
+
+Clearing the bad block list:
+----------------------------
+
+To clear the bad block list (without needing to individually run
+a "removebadblock" message command for every block), run the
+following message command:
+
+$ sudo dmsetup message dust1 0 clearbadblocks
+
+After clearing the bad block list, the following message will appear:
+
+kernel: device-mapper: dust: clearbadblocks: badblocks cleared
+
+If there were no bad blocks to clear, the following message will
+appear:
+
+kernel: device-mapper: dust: clearbadblocks: no badblocks found
+
+Message commands list:
+----------------------
+
+Below is a list of the messages that can be sent to a dust device:
+
+Operations on blocks (requires a <blknum> argument):
+
+addbadblock <blknum>
+queryblock <blknum>
+removebadblock <blknum>
+
+...where <blknum> is a block number within range of the device
+ (corresponding to the block size of the device.)
+
+Single argument message commands:
+
+countbadblocks
+clearbadblocks
+disable
+enable
+quiet
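+
+Since "addbadblock" takes a single block number, marking a range of blocks
+needs one message per block. The following is a minimal sketch of a helper
+that prints those commands instead of running them; the function name
+dust_addbadblock_cmds is hypothetical.
+

```shell
#!/bin/sh
# dust_addbadblock_cmds <dm_name> <first_blknum> <last_blknum>
# Hypothetical helper: prints one "dmsetup message ... addbadblock" command
# for every block in the inclusive range; prints nothing for an empty range.
dust_addbadblock_cmds() {
    blk=$2
    while [ "$blk" -le "$3" ]; do
        echo "dmsetup message $1 0 addbadblock $blk"
        blk=$(( blk + 1 ))
    done
}
```

+
+The printed commands can be reviewed first and then executed, for example
+by piping the output to "sudo sh".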
+
+Device removal:
+---------------
+
+When finished, remove the device via the "dmsetup remove" command:
+
+$ sudo dmsetup remove dust1
+
+Quiet mode:
+-----------
+
+On test runs with many bad blocks, it may be desirable to avoid
+excessive logging (from bad blocks added, removed, or "remapped").
+This can be done by enabling "quiet mode" via the following message:
+
+$ sudo dmsetup message dust1 0 quiet
+
+This will suppress log messages from add / remove / removed by write
+operations. Log messages from "countbadblocks" or "queryblock"
+message commands will still print in quiet mode.
+
+The status of quiet mode can be seen by running "dmsetup status":
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block quiet
+
+To disable quiet mode, send the "quiet" message again:
+
+$ sudo dmsetup message dust1 0 quiet
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block verbose
+
+(The presence of "verbose" indicates normal logging.)
+
+"Why not...?"
+-------------
+
+scsi_debug has a "medium error" mode that can fail reads on one
+specified sector (sector 0x1234, hardcoded in the source code), but
+it uses RAM for the persistent storage, which drastically decreases
+the potential device size.
+
+dm-flakey fails all I/O from all block locations at a specified time
+frequency, and not a given point in time.
+
+When a bad sector occurs on a hard disk drive, reads to that sector
+are failed by the device, usually resulting in an error code of EIO
+("I/O error") or ENODATA ("No data available"). However, a write to
+the sector may succeed, and result in the sector becoming readable
+after the device controller no longer experiences errors reading the
+sector (or after a reallocation of the sector). However, there may
+be bad sectors that occur on the device in the future, in a different,
+unpredictable location.
+
+This target seeks to provide a device that can exhibit the behavior
+of a bad sector at a known sector location, at a known time, based
+on a large storage device (at least tens of gigabytes, not occupying
+system memory).
diff --git a/Documentation/admin-guide/device-mapper/dm-flakey.rst b/Documentation/admin-guide/device-mapper/dm-flakey.rst
new file mode 100644
index 0000000..8613873
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-flakey.rst
@@ -0,0 +1,74 @@
+=========
+dm-flakey
+=========
+
+This target is the same as the linear target except that it exhibits
+unreliable behaviour periodically. It's been found useful in simulating
+failing devices for testing purposes.
+
+Starting from the time the table is loaded, the device is available for
+<up interval> seconds, then exhibits unreliable behaviour for <down
+interval> seconds, and then this cycle repeats.
+
+Also, consider using this in combination with the dm-delay target too,
+which can delay reads and writes and/or send them to different
+underlying devices.
+
+Table parameters
+----------------
+
+::
+
+ <dev path> <offset> <up interval> <down interval> \
+ [<num_features> [<feature arguments>]]
+
+Mandatory parameters:
+
+ <dev path>:
+ Full pathname to the underlying block-device, or a
+ "major:minor" device-number.
+ <offset>:
+ Starting sector within the device.
+ <up interval>:
+ Number of seconds device is available.
+ <down interval>:
+ Number of seconds device returns errors.
+
+Optional feature parameters:
+
+ If no feature parameters are present, during the periods of
+ unreliability, all I/O returns errors.
+
+ drop_writes:
+ All write I/O is silently ignored.
+ Read I/O is handled correctly.
+
+ error_writes:
+ All write I/O is failed with an error signalled.
+ Read I/O is handled correctly.
+
+ corrupt_bio_byte <Nth_byte> <direction> <value> <flags>:
+ During <down interval>, replace <Nth_byte> of the data of
+ each matching bio with <value>.
+
+ <Nth_byte>:
+ The offset of the byte to replace.
+ Counting starts at 1, to replace the first byte.
+ <direction>:
+ Either 'r' to corrupt reads or 'w' to corrupt writes.
+ 'w' is incompatible with drop_writes.
+ <value>:
+ The value (from 0-255) to write.
+ <flags>:
+ Perform the replacement only if bio->bi_opf has all the
+ selected flags set.
+
+Examples:
+
+Replaces the 32nd byte of READ bios with the value 1::
+
+ corrupt_bio_byte 32 r 1 0
+
+Replaces the 224th byte of REQ_META (=32) bios with the value 0::
+
+ corrupt_bio_byte 224 w 0 32
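+
+The feature examples above are fragments of a full table line. The sketch
+below assembles a complete line from the table parameters. The function
+name flakey_table and the device path are hypothetical, the <offset> is
+fixed at 0, and it is an assumption that <num_features> counts every word
+of the feature arguments (so corrupt_bio_byte with its four arguments
+counts as 5).
+

```shell
#!/bin/sh
# flakey_table <num_sectors> <dev path> <up interval> <down interval> \
#              [<feature arguments>...]
# Hypothetical helper: prints a dm-flakey table line.
flakey_table() {
    size=$1 dev=$2 up=$3 down=$4
    shift 4
    if [ "$#" -eq 0 ]; then
        # No features: all I/O fails during the down interval.
        echo "0 $size flakey $dev 0 $up $down"
    else
        echo "0 $size flakey $dev 0 $up $down $# $*"
    fi
}
```

+
+For example, ``flakey_table 2048 /dev/sdX 60 5 corrupt_bio_byte 32 r 1 0``
+prints ``0 2048 flakey /dev/sdX 0 60 5 5 corrupt_bio_byte 32 r 1 0``.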
diff --git a/Documentation/admin-guide/device-mapper/dm-init.rst b/Documentation/admin-guide/device-mapper/dm-init.rst
new file mode 100644
index 0000000..e5242ff
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-init.rst
@@ -0,0 +1,125 @@
+================================
+Early creation of mapped devices
+================================
+
+It is possible to configure a device-mapper device to act as the root device for
+your system in two ways.
+
+The first is to build an initial ramdisk which boots to a minimal userspace
+which configures the device, then pivot_root(8) into it.
+
+The second is to create one or more device-mapper devices using the module
+parameter "dm-mod.create=" on the kernel boot command line.
+
+The format is specified as a string of data separated by commas and optionally
+semi-colons, where:
+
+ - a comma is used to separate fields like name, uuid, flags and table
+ (specifies one device)
+ - a semi-colon is used to separate devices.
+
+So the format will look like this::
+
+ dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+
+Where::
+
+ <name>        ::= The device name.
+ <uuid>        ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
+ <minor>       ::= The device minor number | ""
+ <flags>       ::= "ro" | "rw"
+ <table>       ::= <start_sector> <num_sectors> <target_type> <target_args>
+ <target_type> ::= "verity" | "linear" | ... (see list below)
+
+The dm line should be equivalent to the one used by the dmsetup tool with the
+`--concise` argument.
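+
+As a rough illustration of that equivalence (a sketch only; the device name
+``lroot`` and the 98:16 and 98:32 block devices are placeholders borrowed from
+the Examples section of this file), the same concise string can be handed to
+dmsetup on a running system, and the concise description of the active devices
+can be printed back::
+
+ # dmsetup create --concise "lroot,,,rw,0 4096 linear 98:16 0,4096 4096 linear 98:32 0"
+ # dmsetup table --concise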
+
+Target types
+============
+
+Not all target types are available as there are serious risks in allowing
+activation of certain DM targets without first using userspace tools to check
+the validity of associated metadata.
+
+======================= =======================================================
+`cache`                 constrained, userspace should verify cache device
+`crypt`                 allowed
+`delay`                 allowed
+`era`                   constrained, userspace should verify metadata device
+`flakey`                constrained, meant for test
+`linear`                allowed
+`log-writes`            constrained, userspace should verify metadata device
+`mirror`                constrained, userspace should verify main/mirror device
+`raid`                  constrained, userspace should verify metadata device
+`snapshot`              constrained, userspace should verify src/dst device
+`snapshot-origin`       allowed
+`snapshot-merge`        constrained, userspace should verify src/dst device
+`striped`               allowed
+`switch`                constrained, userspace should verify dev path
+`thin`                  constrained, requires dm target message from userspace
+`thin-pool`             constrained, requires dm target message from userspace
+`verity`                allowed
+`writecache`            constrained, userspace should verify cache device
+`zero`                  constrained, not meant for rootfs
+======================= =======================================================
+
+If the target is not listed above, it is constrained by default (not tested).
+
+Examples
+========
+An example of booting to a linear array made up of user-mode linux block
+devices::
+
+ dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
+
+This will boot to a rw dm-linear target of 8192 sectors split across two block
+devices identified by their major:minor numbers. After boot, udev will rename
+this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
+
+An example of multiple device-mappers, with the dm-mod.create="..." contents
+is shown here split on multiple lines for readability::
+
+ dm-linear,,1,rw,
+ 0 32768 linear 8:1 0,
+ 32768 1024000 linear 8:2 0;
+ dm-verity,,3,ro,
+ 0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256
+ ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42
+ 5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51
+
+Other examples (per target):
+
+"crypt"::
+
+ dm-crypt,,8,ro,
+ 0 1048576 crypt aes-xts-plain64
+ babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
+ /dev/sda 0 1 allow_discards
+
+"delay"::
+
+ dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
+
+"linear"::
+
+ dm-linear,,,rw,
+ 0 32768 linear /dev/sda1 0,
+ 32768 1024000 linear /dev/sda2 0,
+ 1056768 204800 linear /dev/sda3 0,
+ 1261568 512000 linear /dev/sda4 0
+
+"snapshot-origin"::
+
+ dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
+
+"striped"::
+
+ dm-striped,,4,ro,0 1638400 striped 4 4096
+ /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
+
+"verity"::
+
+ dm-verity,,4,ro,
+ 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
+ fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
+ 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst
new file mode 100644
index 0000000..a30aa91
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst
@@ -0,0 +1,259 @@
+============
+dm-integrity
+============
+
+The dm-integrity target emulates a block device that has additional
+per-sector tags that can be used for storing integrity information.
+
+A general problem with storing integrity tags with every sector is that
+writing the sector and the integrity tag must be atomic - i.e. in case of
+crash, either both the sector and the integrity tag are written or neither is.
+
+To guarantee write atomicity, the dm-integrity target uses a journal: it
+writes sector data and integrity tags into the journal, commits the journal
+and then copies the data and integrity tags to their respective locations.
+
+The dm-integrity target can be used with the dm-crypt target - in this
+situation the dm-crypt target creates the integrity data and passes them
+to the dm-integrity target via bio_integrity_payload attached to the bio.
+In this mode, the dm-crypt and dm-integrity targets provide authenticated
+disk encryption - if the attacker modifies the encrypted device, an I/O
+error is returned instead of random data.
+
+The dm-integrity target can also be used as a standalone target. In this
+mode it calculates and verifies the integrity tags internally, so it can
+be used to detect silent data corruption on the disk or in the I/O
+path.
+
+There's an alternate mode of operation where dm-integrity uses a bitmap
+instead of a journal. If a bit in the bitmap is 1, the corresponding
+region's data and integrity tags are not synchronized - if the machine
+crashes, the unsynchronized regions will be recalculated. The bitmap mode
+is faster than the journal mode, because we don't have to write the data
+twice, but it is also less reliable, because if data corruption happens
+when the machine crashes, it may not be detected.
+
+When loading the target for the first time, the kernel driver will format
+the device. But it will only format the device if the superblock contains
+zeroes. If the superblock is neither valid nor zeroed, the dm-integrity
+target can't be loaded.
+
+To use the target for the first time:
+
+1. overwrite the superblock with zeroes
+2. load the dm-integrity target with a size of one sector, the kernel driver
+   will format the device
+3. unload the dm-integrity target
+4. read the "provided_data_sectors" value from the superblock
+5. load the dm-integrity target with the target size
+   "provided_data_sectors"
+6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target
+ with the size "provided_data_sectors"
+
+
+Target arguments:
+
+1. the underlying block device
+
+2. the number of reserved sectors at the beginning of the device - the
+   dm-integrity target won't read or write these sectors
+
+3. the size of the integrity tag (if "-" is used, the size is taken from
+ the internal-hash algorithm)
+
+4. mode:
+
+ D - direct writes (without journal)
+ in this mode, journaling is
+ not used and data sectors and integrity tags are written
+ separately. In case of crash, it is possible that the data
+ and integrity tag don't match.
+ J - journaled writes
+ data and integrity tags are written to the
+ journal and atomicity is guaranteed. In case of crash,
+ either both data and tag are written or neither is. The
+ journaled mode halves write throughput because the
+ data have to be written twice.
+ B - bitmap mode - data and metadata are written without any
+ synchronization, the driver maintains a bitmap of dirty
+ regions where data and metadata don't match. This mode can
+ only be used with internal hash.
+ R - recovery mode - in this mode, journal is not replayed,
+ checksums are not checked and writes to the device are not
+ allowed. This mode is useful for data recovery if the
+ device cannot be activated in any of the other standard
+ modes.
+
+5. the number of additional arguments
+
+Additional arguments:
+
+journal_sectors:number
+ The size of the journal. This argument is used only when formatting the
+ device. If the device is already formatted, the value from the
+ superblock is used.
+
+interleave_sectors:number
+ The number of interleaved sectors. This value is rounded down to
+ a power of two. If the device is already formatted, the value from
+ the superblock is used.
+
+meta_device:device
+ Don't interleave the data and metadata on one device. Use a
+ separate device for metadata.
+
+buffer_sectors:number
+ The number of sectors in one buffer. The value is rounded down to
+ a power of two.
+
+ The tag area is accessed using buffers; the buffer size is
+ configurable. A larger buffer size means that each I/O will be
+ larger, but fewer I/Os may be issued.
+
+journal_watermark:number
+ The journal watermark in percent. When the size of the journal
+ exceeds this watermark, the thread that flushes the journal will
+ be started.
+
+commit_time:number
+ Commit time in milliseconds. When this time passes, the journal is
+ written. The journal is also written immediately if a FLUSH
+ request is received.
+
+internal_hash:algorithm(:key) (the key is optional)
+ Use internal hash or crc.
+ When this argument is used, the dm-integrity target won't accept
+ integrity tags from the upper target, but it will automatically
+ generate and verify the integrity tags.
+
+ If you use a crc algorithm (such as crc32), the integrity target
+ will protect the data against accidental corruption.
+ You can also use a hmac algorithm (for example
+ "hmac(sha256):0123456789abcdef"), in this mode it will provide
+ cryptographic authentication of the data without encryption.
+
+ When this argument is not used, the integrity tags are accepted
+ from an upper layer target, such as dm-crypt. The upper layer
+ target should check the validity of the integrity tags.
+
+recalculate
+ Recalculate the integrity tags automatically. It is only valid
+ when using internal hash.
+
+journal_crypt:algorithm(:key) (the key is optional)
+ Encrypt the journal using given algorithm to make sure that the
+ attacker can't read the journal. You can use a block cipher here
+ (such as "cbc(aes)") or a stream cipher (for example "chacha20",
+ "salsa20", "ctr(aes)" or "ecb(arc4)").
+
+ The journal contains the history of the last writes to the block device;
+ an attacker reading the journal could see the last sector numbers
+ that were written. From the sector numbers, the attacker can infer
+ the size of files that were written. To protect against this
+ situation, you can encrypt the journal.
+
+journal_mac:algorithm(:key) (the key is optional)
+ Protect sector numbers in the journal from accidental or malicious
+ modification. To protect against accidental modification, use a
+ crc algorithm, to protect against malicious modification, use a
+ hmac algorithm with a key.
+
+ This option is not needed when using internal-hash because in this
+ mode, the integrity of journal entries is checked when replaying
+ the journal. Thus, a modified sector number would be detected at
+ this stage.
+
+block_size:number
+ The size of a data block in bytes. The larger the block size the
+ less overhead there is for per-block integrity metadata.
+ Supported values are 512, 1024, 2048 and 4096 bytes. If not
+ specified the default block size is 512 bytes.
+
+sectors_per_bit:number
+ In the bitmap mode, this parameter specifies the number of
+ 512-byte sectors that correspond to one bitmap bit.
+
+bitmap_flush_interval:number
+ The bitmap flush interval in milliseconds. The metadata buffers
+ are synchronized when this interval expires.
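+
+As an illustration of the argument order described above, a standalone
+(internal-hash) setup following the first-use steps might look like the
+sketch below. The device ``/dev/sdX``, the mapping name ``integ``, the choice
+of ``crc32c``, zero reserved sectors and the ``<provided_data_sectors>``
+placeholder are assumptions made for the example, not values taken from this
+document; the real size has to be read from the superblock after formatting::
+
+ # dd if=/dev/zero of=/dev/sdX bs=4096 count=1
+ # dmsetup create integ --table "0 1 integrity /dev/sdX 0 - J 1 internal_hash:crc32c"
+ # dmsetup remove integ
+ # dmsetup create integ --table \
+   "0 <provided_data_sectors> integrity /dev/sdX 0 - J 1 internal_hash:crc32c"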
+
+
+The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can
+be changed when reloading the target (load an inactive table and swap the
+tables with suspend and resume). The other arguments should not be changed
+when reloading the target because the layout of disk data depends on them
+and the reloaded target would be non-functional.
+
+
+The layout of the formatted block device:
+
+* reserved sectors
+ (they are not used by this target, they can be used for
+ storing LUKS metadata or for other purposes), the size of the reserved
+ area is specified in the target arguments
+
+* superblock (4kiB)
+ * magic string - identifies that the device was formatted
+ * version
+ * log2(interleave sectors)
+ * integrity tag size
+ * the number of journal sections
+ * provided data sectors - the number of sectors that this target
+ provides (i.e. the size of the device minus the size of all
+ metadata and padding). The user of this target should not send
+ bios that access data beyond the "provided data sectors" limit.
+ * flags
+ SB_FLAG_HAVE_JOURNAL_MAC
+ - a flag is set if journal_mac is used
+ SB_FLAG_RECALCULATING
+ - recalculating is in progress
+ SB_FLAG_DIRTY_BITMAP
+ - journal area contains the bitmap of dirty
+ blocks
+ * log2(sectors per block)
+ * a position where recalculating finished
+* journal
+ The journal is divided into sections, each section contains:
+
+ * metadata area (4kiB), it contains journal entries
+
+ - every journal entry contains:
+
+ * logical sector (specifies where the data and tag should
+ be written)
+ * last 8 bytes of data
+ * integrity tag (the size is specified in the superblock)
+
+ - every metadata sector ends with
+
+ * mac (8-bytes), all the macs in 8 metadata sectors form a
+ 64-byte value. It is used to store hmac of sector
+ numbers in the journal section, to protect against a
+ possibility that the attacker tampers with sector
+ numbers in the journal.
+ * commit id
+
+ * data area (the size is variable; it depends on how many journal
+ entries fit into the metadata area)
+
+ - every sector in the data area contains:
+
+ * data (504 bytes of data, the last 8 bytes are stored in
+ the journal entry)
+ * commit id
+
+ To test if the whole journal section was written correctly, every
+ 512-byte sector of the journal ends with an 8-byte commit id. If the
+ commit id matches on all sectors in a journal section, then it is
+ assumed that the section was written correctly. If the commit id
+ doesn't match, the section was written partially and it should not
+ be replayed.
+
+* one or more runs of interleaved tags and data.
+ Each run contains:
+
+ * tag area - it contains integrity tags. There is one tag for each
+ sector in the data area
+ * data area - it contains data sectors. The number of data sectors
+ in one run must be a power of two. log2 of this value is stored
+ in the superblock.
diff --git a/Documentation/admin-guide/device-mapper/dm-io.rst b/Documentation/admin-guide/device-mapper/dm-io.rst
new file mode 100644
index 0000000..d249291
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-io.rst
@@ -0,0 +1,75 @@
+=====
+dm-io
+=====
+
+Dm-io provides synchronous and asynchronous I/O services. There are three
+types of I/O services available, and each type has a sync and an async
+version.
+
+The user must set up an io_region structure to describe the desired location
+of the I/O. Each io_region indicates a block-device along with the starting
+sector and size of the region::
+
+   struct io_region {
+      struct block_device *bdev;
+      sector_t sector;
+      sector_t count;
+   };
+
+Dm-io can read from one io_region or write to one or more io_regions. Writes
+to multiple regions are specified by an array of io_region structures.
+
+The first I/O service type takes a list of memory pages as the data buffer for
+the I/O, along with an offset into the first page::
+
+   struct page_list {
+      struct page_list *next;
+      struct page *page;
+   };
+
+   int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw,
+                  struct page_list *pl, unsigned int offset,
+                  unsigned long *error_bits);
+   int dm_io_async(unsigned int num_regions, struct io_region *where, int rw,
+                   struct page_list *pl, unsigned int offset,
+                   io_notify_fn fn, void *context);
+
+The second I/O service type takes an array of bio vectors as the data buffer
+for the I/O. This service can be handy if the caller has a pre-assembled bio,
+but wants to direct different portions of the bio to different devices::
+
+   int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where,
+                       int rw, struct bio_vec *bvec,
+                       unsigned long *error_bits);
+   int dm_io_async_bvec(unsigned int num_regions, struct io_region *where,
+                        int rw, struct bio_vec *bvec,
+                        io_notify_fn fn, void *context);
+
+The third I/O service type takes a pointer to a vmalloc'd memory buffer as the
+data buffer for the I/O. This service can be handy if the caller needs to do
+I/O to a large region but doesn't want to allocate a large number of individual
+memory pages::
+
+   int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw,
+                     void *data, unsigned long *error_bits);
+   int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw,
+                      void *data, io_notify_fn fn, void *context);
+
+Callers of the asynchronous I/O services must include the name of a completion
+callback routine and a pointer to some context data for the I/O::
+
+ typedef void (*io_notify_fn)(unsigned long error, void *context);
+
+The "error" parameter in this callback, as well as the `*error_bits` parameter
+in all of the synchronous versions, is a bitset (instead of a simple error
+value). In the case of a write I/O to multiple regions, this bitset allows
+dm-io to indicate success or failure on each individual region.
+
+Before using any of the dm-io services, the user should call dm_io_get()
+and specify the number of pages they expect to perform I/O on concurrently.
+Dm-io will attempt to resize its mempool to make sure enough pages are
+always available in order to avoid unnecessary waiting while performing I/O.
+
+When the user is finished using the dm-io services, they should call
+dm_io_put() and specify the same number of pages that were given on the
+dm_io_get() call.
diff --git a/Documentation/admin-guide/device-mapper/dm-log.rst b/Documentation/admin-guide/device-mapper/dm-log.rst
new file mode 100644
index 0000000..ba4fce3
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-log.rst
@@ -0,0 +1,57 @@
+=====================
+Device-Mapper Logging
+=====================
+The device-mapper logging code is used by some of the device-mapper
+RAID targets to track regions of the disk that are not consistent.
+A region (or portion of the address space) of the disk may be
+inconsistent because a RAID stripe is currently being operated on or
+a machine died while the region was being altered. In the case of
+mirrors, a region would be considered dirty/inconsistent while you
+are writing to it because the writes need to be replicated for all
+the legs of the mirror and may not reach the legs at the same time.
+Once all writes are complete, the region is considered clean again.
+
+There is a generic logging interface that the device-mapper RAID
+implementations use to perform logging operations (see
+dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different
+logging implementations are available and provide different
+capabilities. The list includes:
+
+============== ==============================================================
+Type           Files
+============== ==============================================================
+disk           drivers/md/dm-log.c
+core           drivers/md/dm-log.c
+userspace      drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h
+============== ==============================================================
+
+The "disk" log type
+-------------------
+This log implementation commits the log state to disk. This way, the
+logging state survives reboots/crashes.
+
+The "core" log type
+-------------------
+This log implementation keeps the log state in memory. The log state
+will not survive a reboot or crash, but there may be a small boost in
+performance. This method can also be used if no storage device is
+available for storing log state.
+
+The "userspace" log type
+------------------------
+This log type simply provides a way to export the log API to userspace,
+so log implementations can be done there. This is done by forwarding most
+logging requests to userspace, where a daemon receives and processes the
+request.
+
+The structures used for communication between kernel and userspace are
+located in include/linux/dm-log-userspace.h. Due to the frequency,
+diversity, and 2-way communication nature of the exchanges between
+kernel and userspace, 'connector' is used as the interface for
+communication.
+
+There are currently two userspace log implementations that leverage this
+framework - "clustered-disk" and "clustered-core". These implementations
+provide a cluster-coherent log for shared-storage. Device-mapper mirroring
+can be used in a shared-storage environment when the cluster log implementations
+are employed.
diff --git a/Documentation/admin-guide/device-mapper/dm-queue-length.rst b/Documentation/admin-guide/device-mapper/dm-queue-length.rst
new file mode 100644
index 0000000..d8e381c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-queue-length.rst
@@ -0,0 +1,48 @@
+===============
+dm-queue-length
+===============
+
+dm-queue-length is a path selector module for device-mapper targets,
+which selects a path with the least number of in-flight I/Os.
+The path selector name is 'queue-length'.
+
+Table parameters for each path: [<repeat_count>]
+
+::
+
+ <repeat_count>: The number of I/Os to dispatch using the selected
+                 path before switching to the next path.
+                 If not given, the internal default is used. To check
+                 the default value, see the activated table.
+
+Status for each path: <status> <fail-count> <in-flight>
+
+::
+
+ <status>:     'A' if the path is active, 'F' if the path is failed.
+ <fail-count>: The number of path failures.
+ <in-flight>:  The number of in-flight I/Os on the path.
+
+
+Algorithm
+=========
+
+dm-queue-length increments/decrements 'in-flight' when an I/O is
+dispatched/completed respectively.
+dm-queue-length selects a path with the minimum 'in-flight'.
+
+
+Examples
+========
+In this example, two paths (sda and sdb) are used with repeat_count == 128.
+
+::
+
+ # echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \
+   | dmsetup create test
+ #
+ # dmsetup table
+ test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128
+ #
+ # dmsetup status
+ test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0
diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst
new file mode 100644
index 0000000..2fe255b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-raid.rst
@@ -0,0 +1,419 @@
+=======
+dm-raid
+=======
+
+The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
+It allows the MD RAID drivers to be accessed using a device-mapper
+interface.
+
+
+Mapping Table Interface
+-----------------------
+The target is named "raid" and it accepts the following parameters::
+
+ <raid_type> <#raid_params> <raid_params> \
+ <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
+
+<raid_type>:
+
+ ============= ===============================================================
+ raid0         RAID0 striping (no resilience)
+ raid1         RAID1 mirroring
+ raid4         RAID4 with dedicated last parity disk
+ raid5_n       RAID5 with dedicated last parity disk supporting takeover
+               Same as raid4
+
+               - Transitory layout
+ raid5_la      RAID5 left asymmetric
+
+               - rotating parity 0 with data continuation
+ raid5_ra      RAID5 right asymmetric
+
+               - rotating parity N with data continuation
+ raid5_ls      RAID5 left symmetric
+
+               - rotating parity 0 with data restart
+ raid5_rs      RAID5 right symmetric
+
+               - rotating parity N with data restart
+ raid6_zr      RAID6 zero restart
+
+               - rotating parity zero (left-to-right) with data restart
+ raid6_nr      RAID6 N restart
+
+               - rotating parity N (right-to-left) with data restart
+ raid6_nc      RAID6 N continue
+
+               - rotating parity N (right-to-left) with data continuation
+ raid6_n_6     RAID6 with dedicated parity disks
+
+               - parity and Q-syndrome on the last 2 disks;
+                 layout for takeover from/to raid4/raid5_n
+ raid6_la_6    Same as "raid5_la" plus dedicated last Q-syndrome disk
+
+               - layout for takeover from raid5_la from/to raid6
+ raid6_ra_6    Same as "raid5_ra" plus dedicated last Q-syndrome disk
+
+               - layout for takeover from raid5_ra from/to raid6
+ raid6_ls_6    Same as "raid5_ls" plus dedicated last Q-syndrome disk
+
+               - layout for takeover from raid5_ls from/to raid6
+ raid6_rs_6    Same as "raid5_rs" plus dedicated last Q-syndrome disk
+
+               - layout for takeover from raid5_rs from/to raid6
+ raid10        Various RAID10 inspired algorithms chosen by additional params
+               (see raid10_format and raid10_copies below)
+
+               - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
+               - RAID1E: Integrated Adjacent Stripe Mirroring
+               - RAID1E: Integrated Offset Stripe Mirroring
+               - and other similar RAID10 variants
+ ============= ===============================================================
+
+ Reference: Chapter 4 of
+ http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
+
+<#raid_params>: The number of parameters that follow.
+
+<raid_params> consists of
+
+ Mandatory parameters:
+ <chunk_size>:
+ Chunk size in sectors. This parameter is often known as
+ "stripe size". It is the only mandatory parameter and
+ is placed first.
+
+ followed by optional parameters (in any order):
+ [sync|nosync]
+ Force or prevent RAID initialization.
+
+ [rebuild <idx>]
+ Rebuild drive number 'idx' (first drive is 0).
+
+ [daemon_sleep <ms>]
+ Interval between runs of the bitmap daemon that
+ clear bits. A longer interval means less bitmap I/O but
+ resyncing after a failure is likely to take longer.
+
+ [min_recovery_rate <kB/sec/disk>]
+ Throttle RAID initialization
+ [max_recovery_rate <kB/sec/disk>]
+ Throttle RAID initialization
+ [write_mostly <idx>]
+ Mark drive index 'idx' write-mostly.
+ [max_write_behind <sectors>]
+ See '--write-behind=' (man mdadm)
+ [stripe_cache <sectors>]
+ Stripe cache size (RAID 4/5/6 only)
+ [region_size <sectors>]
+ The region_size multiplied by the number of regions is the
+ logical size of the array. The bitmap records the device
+ synchronisation state for each region.
+
+ [raid10_copies <# copies>], [raid10_format <near|far|offset>]
+ These two options are used to alter the default layout of
+ a RAID10 configuration. The number of copies can be
+ specified, but the default is 2. There are also three
+ variations to how the copies are laid down - the default
+ is "near". Near copies are what most people think of with
+ respect to mirroring. If these options are left unspecified,
+ or 'raid10_copies 2' and/or 'raid10_format near' are given,
+ then the layouts for 2, 3 and 4 devices are:
+
+ ======== ========== ==============
+ 2 drives 3 drives   4 drives
+ ======== ========== ==============
+ A1  A1   A1  A1  A2 A1  A1  A2  A2
+ A2  A2   A2  A3  A3 A3  A3  A4  A4
+ A3  A3   A4  A4  A5 A5  A5  A6  A6
+ A4  A4   A5  A6  A6 A7  A7  A8  A8
+ ..  ..   ..  ..  .. ..  ..  ..  ..
+ ======== ========== ==============
+
+ The 2-device layout is equivalent to 2-way RAID1. The 4-device
+ layout is what a traditional RAID10 would look like. The
+ 3-device layout is what might be called a 'RAID1E - Integrated
+ Adjacent Stripe Mirroring'.
+
+ If 'raid10_copies 2' and 'raid10_format far', then the layouts
+ for 2, 3 and 4 devices are:
+
+ ======== ============ ===================
+ 2 drives 3 drives     4 drives
+ ======== ============ ===================
+ A1  A2   A1   A2   A3 A1   A2   A3   A4
+ A3  A4   A4   A5   A6 A5   A6   A7   A8
+ A5  A6   A7   A8   A9 A9   A10  A11  A12
+ ..  ..   ..   ..   .. ..   ..   ..   ..
+ A2  A1   A3   A1   A2 A2   A1   A4   A3
+ A4  A3   A6   A4   A5 A6   A5   A8   A7
+ A6  A5   A9   A7   A8 A10  A9   A12  A11
+ ..  ..   ..   ..   .. ..   ..   ..   ..
+ ======== ============ ===================
+
+ If 'raid10_copies 2' and 'raid10_format offset', then the
+ layouts for 2, 3 and 4 devices are:
+
+ ======== ========== ================
+ 2 drives 3 drives   4 drives
+ ======== ========== ================
+ A1  A2   A1  A2  A3 A1  A2  A3  A4
+ A2  A1   A3  A1  A2 A2  A1  A4  A3
+ A3  A4   A4  A5  A6 A5  A6  A7  A8
+ A4  A3   A6  A4  A5 A6  A5  A8  A7
+ A5  A6   A7  A8  A9 A9  A10 A11 A12
+ A6  A5   A9  A7  A8 A10 A9  A12 A11
+ ..  ..   ..  ..  .. ..  ..  ..  ..
+ ======== ========== ================
+
+ Here we see layouts closely akin to 'RAID1E - Integrated
+ Offset Stripe Mirroring'.
+
+ [delta_disks <N>]
+ The delta_disks option value (-251 < N < +251) triggers
+ device removal (negative value) or device addition (positive
+ value) to any reshape supporting raid levels 4/5/6 and 10.
+ RAID levels 4/5/6 allow for addition of devices (metadata
+ and data device tuple), raid10_near and raid10_offset only
+ allow for device addition. raid10_far does not support any
+ reshaping at all.
+ A minimum number of devices has to be kept to enforce resilience,
+ which is 3 devices for raid4/5 and 4 devices for raid6.
+
+ [data_offset <sectors>]
+ This option value defines the offset into each data device
+ where the data starts. This is used to provide out-of-place
+ reshaping space to avoid writing over data while
+ changing the layout of stripes, hence an interruption/crash
+ may happen at any time without the risk of losing data.
+ E.g. when adding devices to an existing raid set during
+ forward reshaping, the out-of-place space will be allocated
+ at the beginning of each raid device. The kernel raid4/5/6/10
+ MD personalities supporting such device addition will read the data from
+ the existing first stripes (those with smaller number of stripes)
+ starting at data_offset to fill up a new stripe with the larger
+ number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
+ and write that new stripe to offset 0. Same will be applied to all
+ N-1 other new stripes. This out-of-place scheme is used to change
+ the RAID type (i.e. the allocation algorithm) as well, e.g.
+ changing from raid5_ls to raid5_n.
+
+ [journal_dev <dev>]
+ This option adds a journal device to raid4/5/6 raid sets and
+ uses it to close the 'write hole' caused by the non-atomic updates
+ to the component devices which can cause data loss during recovery.
+ The journal device is used as writethrough thus causing writes to
+ be throttled versus non-journaled raid4/5/6 sets.
+ Takeover/reshape is not possible with a raid4/5/6 journal device;
+ it has to be deconfigured before requesting these.
+
+ [journal_mode <mode>]
+ This option sets the caching mode on journaled raid4/5/6 raid sets
+ (see 'journal_dev <dev>' above) to 'writethrough' or 'writeback'.
+ If 'writeback' is selected the journal device has to be resilient
+ and must not suffer from the 'write hole' problem itself (e.g. use
+ raid1 or raid10) to avoid a single point of failure.
+
+<#raid_devs>: The number of devices composing the array.
+ Each device consists of two entries. The first is the device
+ containing the metadata (if any); the second is the one containing the
+ data. A maximum of 64 metadata/data device entries are supported
+ up to target version 1.8.0.
+ 1.9.0 supports up to 253, a limit enforced by the MD kernel runtime in use.
+
+ If a drive has failed or is missing at creation time, a '-' can be
+ given for both the metadata and data drives for a given position.
+
+
+Example Tables
+--------------
+
+::
+
+ # RAID4 - 4 data drives, 1 parity (no metadata devices)
+ # No metadata devices specified to hold superblock/bitmap info
+ # Chunk size of 1MiB
+ # (Lines separated for easy reading)
+
+ 0 1960893648 raid \
+ raid4 1 2048 \
+ 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
+
+ # RAID4 - 4 data drives, 1 parity (with metadata devices)
+ # Chunk size of 1MiB, force RAID initialization,
+ # min recovery rate at 20 kiB/sec/disk
+
+ 0 1960893648 raid \
+ raid4 4 2048 sync min_recovery_rate 20 \
+ 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
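+
+ # RAID10 - 4 devices, 2 copies in the "offset" layout (no metadata devices)
+ # Illustrative sketch only: the array length and the device numbers are
+ # assumed values, not taken from a real system
+ # Chunk size of 1MiB; 5 raid_params = chunk size plus two option/value pairs
+
+ 0 8388608 raid \
+ raid10 5 2048 raid10_copies 2 raid10_format offset \
+ 4 - 8:17 - 8:33 - 8:49 - 8:65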
+
+
+Status Output
+-------------
+'dmsetup table' displays the table used to construct the mapping.
+The optional parameters are always printed in the order listed
+above with "sync" or "nosync" always output ahead of the other
+arguments, regardless of the order used when originally loading the table.
+Arguments that can be repeated are ordered by value.
+
+
+'dmsetup status' yields information on the state and health of the array.
+The output is as follows (normally a single line, but expanded here for
+clarity)::
+
+ 1: <s> <l> raid \
+ 2: <raid_type> <#devices> <health_chars> \
+ 3: <sync_ratio> <sync_action> <mismatch_cnt>
+
+Line 1 is the standard output produced by device-mapper.
+
+Lines 2 and 3 are produced by the raid target and are best explained by example::
+
+ 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
+
+Here we can see the RAID type is raid4, there are 5 devices - all of
+which are 'A'live, and the array is 2/490221568 complete with its initial
+recovery. Here is a fuller description of the individual fields:
+
+ =============== =========================================================
+ <raid_type>     Same as the <raid_type> used to create the array.
+ <health_chars>  One char for each device, indicating:
+
+                 - 'A' = alive and in-sync
+                 - 'a' = alive but not in-sync
+                 - 'D' = dead/failed.
+ <sync_ratio>    The ratio indicating how much of the array has undergone
+                 the process described by 'sync_action'. If the
+                 'sync_action' is "check" or "repair", then the process
+                 of "resync" or "recover" can be considered complete.
+ <sync_action>   One of the following possible states:
+
+                 idle
+                         - No synchronization action is being performed.
+                 frozen
+                         - The current action has been halted.
+                 resync
+                         - Array is undergoing its initial synchronization
+                           or is resynchronizing after an unclean shutdown
+                           (possibly aided by a bitmap).
+                 recover
+                         - A device in the array is being rebuilt or
+                           replaced.
+                 check
+                         - A user-initiated full check of the array is
+                           being performed. All blocks are read and
+                           checked for consistency. The number of
+                           discrepancies found is recorded in
+                           <mismatch_cnt>. No changes are made to the
+                           array by this action.
+                 repair
+                         - The same as "check", but discrepancies are
+                           corrected.
+                 reshape
+                         - The array is undergoing a reshape.
+ <mismatch_cnt>  The number of discrepancies found between mirror copies
+                 in RAID1/10 or wrong parity values found in RAID4/5/6.
+                 This value is valid only after a "check" of the array
+                 is performed. A healthy array has a 'mismatch_cnt' of 0.
+ <data_offset>   The current data offset to the start of the user data on
+                 each component device of a raid set (see the respective
+                 raid parameter to support out-of-place reshaping).
+ <journal_char>  - 'A' - active write-through journal device.
+                 - 'a' - active write-back journal device.
+                 - 'D' - dead journal device.
+                 - '-' - no journal device.
+ =============== =========================================================
+
+
+Message Interface
+-----------------
+The dm-raid target will accept certain actions through the 'message' interface.
+('man dmsetup' for more information on the message interface.) These actions
+include:
+
+ ========= ================================================
+ "idle"    Halt the current sync action.
+ "frozen"  Freeze the current sync action.
+ "resync"  Initiate/continue a resync.
+ "recover" Initiate/continue a recover process.
+ "check"   Initiate a check (i.e. a "scrub") of the array.
+ "repair"  Initiate a repair of the array.
+ ========= ================================================
+
+
+Discard Support
+---------------
+The implementation of discard support among hardware vendors varies.
+When a block is discarded, some storage devices will return zeroes when
+the block is read. These devices set the 'discard_zeroes_data'
+attribute. Other devices will return random data. Confusingly, some
+devices that advertise 'discard_zeroes_data' will not reliably return
+zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks
+from a number of devices to calculate parity blocks and (for performance
+reasons) relies on 'discard_zeroes_data' being reliable, it is important
+that the devices be consistent. Blocks may be discarded in the middle
+of a RAID 4/5/6 stripe and if subsequent read results are not
+consistent, the parity blocks may be calculated differently at any time,
+making the parity blocks useless for redundancy. It is important to
+understand how your hardware behaves with discards if you are going to
+enable discards with RAID 4/5/6.
+
+Since the behavior of storage devices is unreliable in this respect,
+even when reporting 'discard_zeroes_data', by default RAID 4/5/6
+discard support is disabled -- this ensures data integrity at the
+expense of losing some performance.
+
+Storage devices that properly support 'discard_zeroes_data' are
+increasingly whitelisted in the kernel and can thus be trusted.
+
+For trusted devices, the following dm-raid module parameter can be set
+to safely enable discard support for RAID 4/5/6:
+
+ 'devices_handle_discard_safely'
+
+
+Version History
+---------------
+
+::
+
+ 1.0.0 Initial version. Support for RAID 4/5/6
+ 1.1.0 Added support for RAID 1
+ 1.2.0 Handle creation of arrays that contain failed devices.
+ 1.3.0 Added support for RAID 10
+ 1.3.1 Allow device replacement/rebuild for RAID 10
+ 1.3.2 Fix/improve redundancy checking for RAID10
+ 1.4.0 Non-functional change. Removes arg from mapping function.
+ 1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5).
+ 1.4.2 Add RAID10 "far" and "offset" algorithm support.
+ 1.5.0 Add message interface to allow manipulation of the sync_action.
+ New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
+ 1.5.1 Add ability to restore transiently failed devices on resume.
+ 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
+ 1.6.0 Add discard support (and devices_handle_discard_safely module param).
+ 1.7.0 Add support for MD RAID0 mappings.
+ 1.8.0 Explicitly check for compatible flags in the superblock metadata
+ and reject to start the raid set if any are set by a newer
+ target version, thus avoiding data corruption on a raid set
+ with a reshape in progress.
+ 1.9.0 Add support for RAID level takeover/reshape/region size
+ and set size reduction.
+ 1.9.1 Fix activation of existing RAID 4/10 mapped devices
+ 1.9.2 Don't emit '- -' on the status table line in case the constructor
+ fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and
+ 'D' on the status line. If '- -' is passed into the constructor, emit
+ '- -' on the table line and '-' as the status line health character.
+ 1.10.0 Add support for raid4/5/6 journal device
+ 1.10.1 Fix data corruption on reshape request
+ 1.11.0 Fix table line argument order
+ (wrong raid10_copies/raid10_format sequence)
+ 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
+ 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
+ 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
+ 1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size and
+ state races.
+ 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
+ 1.14.0 Fix reshape race on small devices. Fix stripe adding reshape
+ deadlock/potential data corruption. Update superblock when
+ specific devices are requested via rebuild. Fix RAID leg
+ rebuild errors.
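
As an illustration of the 'dmsetup status' fields described in the Status
Output section of dm-raid.rst above, the following is a minimal Python sketch
that splits one status line into named fields. The names `parse_raid_status`
and `failed_devices` and the dictionary keys are hypothetical, chosen only for
this example. It assumes the line carries no leading device name, and that
<data_offset> and <journal_char>, when present, follow <mismatch_cnt> in the
order in which the table lists them.

```python
def parse_raid_status(line):
    """Split one dm-raid 'dmsetup status' line into named fields."""
    fields = line.split()
    if len(fields) < 9 or fields[2] != "raid":
        raise ValueError("not a dm-raid status line: %r" % line)
    # <sync_ratio> is printed as "<done>/<total>".
    done, total = (int(part) for part in fields[6].split("/"))
    status = {
        "start": int(fields[0]),
        "length": int(fields[1]),
        "raid_type": fields[3],
        "devices": int(fields[4]),
        "health_chars": fields[5],
        "sync_done": done,
        "sync_total": total,
        "sync_action": fields[7],
        "mismatch_cnt": int(fields[8]),
    }
    # Assumption: optional trailing fields appear in the documented order.
    if len(fields) > 9:
        status["data_offset"] = int(fields[9])
    if len(fields) > 10:
        status["journal_char"] = fields[10]
    return status


def failed_devices(status):
    """Return the positions whose health character is 'D' (dead/failed)."""
    return [i for i, c in enumerate(status["health_chars"]) if c == "D"]


example = parse_raid_status("0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0")
print(example["raid_type"], example["health_chars"], failed_devices(example))
```

A monitoring script could call `failed_devices()` on each status line and
raise an alert when the returned list is not empty.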
diff --git a/Documentation/admin-guide/device-mapper/dm-service-time.rst b/Documentation/admin-guide/device-mapper/dm-service-time.rst
new file mode 100644
index 0000000..facf277
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-service-time.rst
@@ -0,0 +1,101 @@
+===============
+dm-service-time
+===============
+
+dm-service-time is a path selector module for device-mapper targets,
+which selects a path with the shortest estimated service time for
+the incoming I/O.
+
+The service time for each path is estimated by dividing the total size
+of in-flight I/Os on a path by the performance value of the path.
+The performance value is a relative throughput value among all paths
+in a path-group, and it can be specified as a table argument.
+
+The path selector name is 'service-time'.
+
+Table parameters for each path:
+
+ [<repeat_count> [<relative_throughput>]]
+ <repeat_count>:
+ The number of I/Os to dispatch using the selected
+ path before switching to the next path.
+ If not given, internal default is used. To check
+ the default value, see the activated table.
+ <relative_throughput>:
+ The relative throughput value of the path
+ among all paths in the path-group.
+ The valid range is 0-100.
+ If not given, minimum value '1' is used.
+ If '0' is given, the path isn't selected while
+ other paths having a positive value are available.
+
+Status for each path:
+
+ <status> <fail-count> <in-flight-size> <relative_throughput>
+ <status>:
+ 'A' if the path is active, 'F' if the path is failed.
+ <fail-count>:
+ The number of path failures.
+ <in-flight-size>:
+ The size of in-flight I/Os on the path.
+ <relative_throughput>:
+ The relative throughput value of the path
+ among all paths in the path-group.
+
+
+Algorithm
+=========
+
+dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
+dispatched and subtracts when completed.
+Basically, dm-service-time selects a path having minimum service time
+which is calculated by::
+
+ ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput'
+
+However, some optimizations below are used to reduce the calculation
+as much as possible.
+
+ 1. If the paths have the same 'relative_throughput', skip
+ the division and just compare the 'in-flight-size'.
+
+ 2. If the paths have the same 'in-flight-size', skip the division
+ and just compare the 'relative_throughput'.
+
+ 3. If some paths have non-zero 'relative_throughput' and others
+ have zero 'relative_throughput', ignore those paths with zero
+ 'relative_throughput'.
+
+If such optimizations can't be applied, calculate service time, and
+compare service time.
+If calculated service time is equal, the path having maximum
+'relative_throughput' may be better. So compare 'relative_throughput'
+then.
+
+
+Examples
+========
+If two paths (sda and sdb) are used with repeat_count == 128, and sda has
+an average throughput of 1GB/s while sdb has 4GB/s, the
+'relative_throughput' value may be '1' for sda and '4' for sdb::
+
+ # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" | \
+ dmsetup create test
+ #
+ # dmsetup table
+ test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
+ #
+ # dmsetup status
+ test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
+
+
+Alternatively, '2' for sda and '8' for sdb would be equally valid::
+
+ # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" | \
+ dmsetup create test
+ #
+ # dmsetup table
+ test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
+ #
+ # dmsetup status
+ test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
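
The selection rule from the Algorithm section of dm-service-time.rst above can
be sketched in Python as follows. This is an illustration of the documented
rule only, not the kernel implementation: the `Path` record and `select_path`
function are hypothetical names, and exact fractions are used so that equal
service times compare as equal.

```python
from collections import namedtuple
from fractions import Fraction

# Hypothetical record for one path; field names follow the status fields.
Path = namedtuple("Path", "name in_flight_size relative_throughput")


def select_path(paths, incoming_size):
    """Pick the path with the shortest estimated service time."""
    # Paths with zero relative_throughput are ignored while any path
    # with a positive value is available.
    usable = [p for p in paths if p.relative_throughput > 0]
    if not usable:
        # All paths share the same (zero) throughput, so only the
        # in-flight size can be compared.
        return min(paths, key=lambda p: p.in_flight_size)

    def key(p):
        service_time = Fraction(p.in_flight_size + incoming_size,
                                p.relative_throughput)
        # On equal service time, prefer the larger relative_throughput.
        return (service_time, -p.relative_throughput)

    return min(usable, key=key)


paths = [Path("sda", 0, 1), Path("sdb", 0, 4)]
print(select_path(paths, 8).name)
```

With both paths idle, the incoming I/O goes to sdb, since dividing by its
larger 'relative_throughput' yields the shorter estimated service time.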
diff --git a/Documentation/admin-guide/device-mapper/dm-uevent.rst b/Documentation/admin-guide/device-mapper/dm-uevent.rst
new file mode 100644
index 0000000..4a8ee8d
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-uevent.rst
@@ -0,0 +1,110 @@
+====================
+device-mapper uevent
+====================
+
+The device-mapper uevent code adds the capability to device-mapper to create
+and send kobject uevents (uevents). Previously device-mapper events were only
+available through the ioctl interface. The advantage of the uevents interface
+is that the event contains environment attributes providing increased context
+for the event, avoiding the need to query the state of the device-mapper device
+after the event is received.
+
+There are two functions currently for device-mapper events. The first function
+listed creates the event and the second function sends the event(s)::
+
+ void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti,
+ const char *path, unsigned nr_valid_paths)
+
+ void dm_send_uevents(struct list_head *events, struct kobject *kobj)
+
+
+The variables added to the uevent environment are:
+
+Variable Name: DM_TARGET
+------------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: string
+:Description:
+:Value: Name of device-mapper target that generated the event.
+
+Variable Name: DM_ACTION
+------------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: string
+:Description:
+:Value: Device-mapper specific action that caused the uevent action.
+ PATH_FAILED - A path has failed;
+ PATH_REINSTATED - A path has been reinstated.
+
+Variable Name: DM_SEQNUM
+------------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: unsigned integer
+:Description: A sequence number for this specific device-mapper device.
+:Value: Valid unsigned integer range.
+
+Variable Name: DM_PATH
+----------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: string
+:Description: Major and minor number of the path device pertaining to this
+ event.
+:Value: Path name in the form of "Major:Minor"
+
+Variable Name: DM_NR_VALID_PATHS
+--------------------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: unsigned integer
+:Description:
+:Value: Valid unsigned integer range.
+
+Variable Name: DM_NAME
+----------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: string
+:Description: Name of the device-mapper device.
+:Value: Name
+
+Variable Name: DM_UUID
+----------------------
+:Uevent Action(s): KOBJ_CHANGE
+:Type: string
+:Description: UUID of the device-mapper device.
+:Value: UUID. (Empty string if there isn't one.)
+
+Examples of the uevents generated, as captured by udevmonitor, are shown
+below.
+
+1.) Path failure::
+
+ UEVENT[1192521009.711215] change@/block/dm-3
+ ACTION=change
+ DEVPATH=/block/dm-3
+ SUBSYSTEM=block
+ DM_TARGET=multipath
+ DM_ACTION=PATH_FAILED
+ DM_SEQNUM=1
+ DM_PATH=8:32
+ DM_NR_VALID_PATHS=0
+ DM_NAME=mpath2
+ DM_UUID=mpath-35333333000002328
+ MINOR=3
+ MAJOR=253
+ SEQNUM=1130
+
+2.) Path reinstate::
+
+ UEVENT[1192521132.989927] change@/block/dm-3
+ ACTION=change
+ DEVPATH=/block/dm-3
+ SUBSYSTEM=block
+ DM_TARGET=multipath
+ DM_ACTION=PATH_REINSTATED
+ DM_SEQNUM=2
+ DM_PATH=8:32
+ DM_NR_VALID_PATHS=1
+ DM_NAME=mpath2
+ DM_UUID=mpath-35333333000002328
+ MINOR=3
+ MAJOR=253
+ SEQNUM=1131
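
A consumer of the uevents shown in dm-uevent.rst above might turn the captured
text into a dictionary before acting on it. The following is a minimal Python
sketch under that assumption; `parse_uevent` and `all_paths_lost` are
hypothetical helper names, and the sample text is a shortened copy of the
path-failure example.

```python
def parse_uevent(text):
    """Collect the KEY=VALUE lines of one captured uevent into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        # The "UEVENT[...]" header and blank lines contain no '='.
        if sep:
            env[key] = value
    return env


def all_paths_lost(env):
    """True if this event reports a path failure with no valid path left."""
    return (env.get("DM_ACTION") == "PATH_FAILED"
            and int(env.get("DM_NR_VALID_PATHS", "1")) == 0)


sample = """\
UEVENT[1192521009.711215] change@/block/dm-3
ACTION=change
DM_TARGET=multipath
DM_ACTION=PATH_FAILED
DM_SEQNUM=1
DM_PATH=8:32
DM_NR_VALID_PATHS=0
DM_NAME=mpath2
"""
env = parse_uevent(sample)
print(env["DM_NAME"], env["DM_PATH"], all_paths_lost(env))
```

Because DM_NR_VALID_PATHS travels with the event, the check needs no further
query of the device-mapper device.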
diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst
new file mode 100644
index 0000000..07f56eb
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst
@@ -0,0 +1,146 @@
+========
+dm-zoned
+========
+
+The dm-zoned device mapper target exposes a zoned block device (ZBC and
+ZAC compliant devices) as a regular block device without any write
+pattern constraints. In effect, it implements a drive-managed zoned
+block device which hides from the user (a file system or an application
+doing raw block device accesses) the sequential write constraints of
+host-managed zoned block devices and can mitigate the potential
+device-side performance degradation due to excessive random writes on
+host-aware zoned block devices.
+
+For a more detailed description of the zoned block device models and
+their constraints see (for SCSI devices):
+
+http://www.t10.org/drafts.htm#ZBC_Family
+
+and (for ATA devices):
+
+http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
+
+The dm-zoned implementation is simple and minimizes system overhead (CPU
+and memory usage as well as storage capacity loss). For a 10TB
+host-managed disk with 256 MB zones, dm-zoned memory usage per disk
+instance is at most 4.5 MB and as little as 5 zones will be used
+internally for storing metadata and performing reclaim operations.
+
+dm-zoned target devices are formatted and checked using the dmzadm
+utility available at:
+
+https://github.com/hgst/dm-zoned-tools
+
+Algorithm
+=========
+
+dm-zoned implements an on-disk buffering scheme to handle non-sequential
+write accesses to the sequential zones of a zoned block device.
+Conventional zones are used for caching as well as for storing internal
+metadata.
+
+The zones of the device are separated into 2 types:
+
+1) Metadata zones: these are conventional zones used to store metadata.
+Metadata zones are not reported as useable capacity to the user.
+
+2) Data zones: all remaining zones, the vast majority of which will be
+sequential zones used exclusively to store user data. The conventional
+zones of the device may also be used for buffering user random writes.
+Data in these zones may be directly mapped to the conventional zone, but
+later moved to a sequential zone so that the conventional zone can be
+reused for buffering incoming random writes.
+
+dm-zoned exposes a logical device with a sector size of 4096 bytes,
+irrespective of the physical sector size of the backend zoned block
+device being used. This allows reducing the amount of metadata needed to
+manage valid blocks (blocks written).
+
+The on-disk metadata format is as follows:
+
+1) The first block of the first conventional zone found contains the
+super block which describes the on disk amount and position of metadata
+blocks.
+
+2) Following the super block, a set of blocks is used to describe the
+mapping of the logical device blocks. The mapping is done per chunk of
+blocks, with the chunk size equal to the zone size of the zoned block
+device. The mapping table is indexed by chunk number and each mapping
+entry indicates the zone number of the device storing the chunk of data.
+Each mapping entry may also indicate the zone number of a conventional
+zone used to buffer random modifications to the data zone.
+
+3) A set of blocks used to store bitmaps indicating the validity of
+blocks in the data zones follows the mapping table. A valid block is
+defined as a block that was written and not discarded. For a buffered
+data chunk, a block is always valid only in the data zone mapping the
+chunk or in the buffer zone of the chunk.
+
+For a logical chunk mapped to a conventional zone, all write operations
+are processed by directly writing to the zone. If the mapping zone is a
+sequential zone, the write operation is processed directly only if the
+write offset within the logical chunk is equal to the write pointer
+offset within the sequential data zone (i.e. the write operation is
+aligned on the zone write pointer). Otherwise, write operations are
+processed indirectly using a buffer zone. In that case, an unused
+conventional zone is allocated and assigned to the chunk being
+accessed. Writing a block to the buffer zone of a chunk will
+automatically invalidate the same block in the sequential zone mapping
+the chunk. If all blocks of the sequential zone become invalid, the zone
+is freed and the chunk buffer zone becomes the primary zone mapping the
+chunk, resulting in native random write performance similar to a regular
+block device.
+
+Read operations are processed according to the block validity
+information provided by the bitmaps. Valid blocks are read either from
+the sequential zone mapping a chunk, or if the chunk is buffered, from
+the buffer zone assigned. If the accessed chunk has no mapping, or the
+accessed blocks are invalid, the read buffer is zeroed and the read
+operation terminated.
+
+After some time, the limited number of conventional zones available may
+be exhausted (all used to map chunks or buffer sequential zones) and
+unaligned writes to unbuffered chunks become impossible. To avoid this
+situation, a reclaim process regularly scans used conventional zones and
+tries to reclaim the least recently used zones by copying the valid
+blocks of the buffer zone to a free sequential zone. Once the copy
+completes, the chunk mapping is updated to point to the sequential zone
+and the buffer zone freed for reuse.
+
+Metadata Protection
+===================
+
+To protect metadata against corruption in case of sudden power loss or
+system crash, 2 sets of metadata zones are used. One set, the primary
+set, is used as the main metadata region, while the secondary set is
+used as a staging area. Modified metadata is first written to the
+secondary set and validated by updating the super block in the secondary
+set; a generation counter is used to indicate that this set contains the
+newest metadata. Once this operation completes, in-place metadata
+block updates can be done in the primary metadata set. This ensures that
+one of the sets is always consistent (all modifications committed or none
+at all). Flush operations are used as a commit point. Upon reception of
+a flush request, metadata modification activity is temporarily blocked
+(for both incoming BIO processing and reclaim process) and all dirty
+metadata blocks are staged and updated. Normal operation is then
+resumed. Flushing metadata thus only temporarily delays write and
+discard requests. Read requests can be processed concurrently while
+metadata flush is being executed.
+
+Usage
+=====
+
+A zoned block device must first be formatted using the dmzadm tool. This
+will analyze the device zone configuration, determine where to place the
+metadata sets on the device and initialize the metadata sets.
+
+Ex::
+
+ dmzadm --format /dev/sdxx
+
+For a formatted device, the target can be created normally with the
+dmsetup utility. The only parameter that dm-zoned requires is the
+underlying zoned block device name. Ex::
+
+ echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
+ dmsetup create dmz-`basename ${dev}`
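
The write-path decision described in the Algorithm section of dm-zoned.rst
above (direct write versus write through a buffer zone) can be summarized with
a small Python sketch. `route_write` and its parameters are hypothetical names
used only for illustration; the real target works on BIOs and zone metadata,
which this sketch leaves out.

```python
def route_write(zone_is_sequential, chunk_offset, write_pointer_offset):
    """Return 'direct' or 'buffered' for a write to a mapped chunk.

    chunk_offset is the write offset within the logical chunk and
    write_pointer_offset is the write pointer offset within the zone.
    """
    if not zone_is_sequential:
        # Conventional zones accept writes at any offset.
        return "direct"
    if chunk_offset == write_pointer_offset:
        # The write is aligned on the zone write pointer.
        return "direct"
    # Unaligned write to a sequential zone: go through a buffer zone.
    return "buffered"


print(route_write(True, 128, 128), route_write(True, 0, 128))
```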
diff --git a/Documentation/admin-guide/device-mapper/era.rst b/Documentation/admin-guide/device-mapper/era.rst
new file mode 100644
index 0000000..90dd5c6
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/era.rst
@@ -0,0 +1,116 @@
+======
+dm-era
+======
+
+Introduction
+============
+
+dm-era is a target that behaves similarly to the linear target. In
+addition it keeps track of which blocks were written within a user
+defined period of time called an 'era'. Each era target instance
+maintains the current era as a monotonically increasing 32-bit
+counter.
+
+Use cases include tracking changed blocks for backup software, and
+partially invalidating the contents of a cache to restore cache
+coherency after rolling back a vendor snapshot.
+
+Constructor
+===========
+
+era <metadata dev> <origin dev> <block size>
+
+ ================ ======================================================
+ metadata dev     fast device holding the persistent metadata
+ origin dev       device holding data blocks that may change
+ block size       block size of origin data device, granularity that is
+                  tracked by the target
+ ================ ======================================================
+
+Messages
+========
+
+None of the dm messages take any arguments.
+
+checkpoint
+----------
+
+Possibly move to a new era. You shouldn't assume the era has
+incremented. After sending this message, you should check the
+current era via the status line.
+
+take_metadata_snap
+------------------
+
+Create a clone of the metadata, to allow a userland process to read it.
+
+drop_metadata_snap
+------------------
+
+Drop the metadata snapshot.
+
+Status
+======
+
+<metadata block size> <#used metadata blocks>/<#total metadata blocks>
+<current era> <held metadata root | '-'>
+
+========================= ==============================================
+metadata block size       Fixed block size for each metadata block in
+                          sectors
+#used metadata blocks     Number of metadata blocks used
+#total metadata blocks    Total number of metadata blocks
+current era               The current era
+held metadata root        The location, in blocks, of the metadata root
+                          that has been 'held' for userspace read
+                          access. '-' indicates there is no held root
+========================= ==============================================
+
+Detailed use case
+=================
+
+The scenario of invalidating a cache when rolling back a vendor
+snapshot was the primary use case when developing this target:
+
+Taking a vendor snapshot
+------------------------
+
+- Send a checkpoint message to the era target
+- Make a note of the current era in its status line
+- Take vendor snapshot (the era and snapshot should be forever
+ associated now).
+
+Rolling back to a vendor snapshot
+---------------------------------
+
+- Cache enters passthrough mode (see: dm-cache's docs in cache.rst)
+- Rollback vendor storage
+- Take metadata snapshot
+- Ascertain which blocks have been written since the snapshot was taken
+ by checking each block's era
+- Invalidate those blocks in the caching software
+- Cache returns to writeback/writethrough mode
+
+Memory usage
+============
+
+The target uses a bitset to record writes in the current era. It also
+has a spare bitset ready for switching over to a new era. Other than
+that it uses a few 4k blocks for updating metadata::
+
+ (4 * nr_blocks) bytes + buffers
+
+Resilience
+==========
+
+Metadata is updated on disk before a write to a previously unwritten
+block is performed. As such dm-era should not be affected by a hard
+crash such as power failure.
+
+Userland tools
+==============
+
+Userland tools are found in the increasingly poorly named
+thin-provisioning-tools project:
+
+ https://github.com/jthornber/thin-provisioning-tools
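
To make the formula in the Memory usage section of era.rst above concrete, the
following Python sketch evaluates the "(4 * nr_blocks) bytes" term for a given
origin device. The function name is hypothetical, and the sketch assumes that
both the origin size and the block size are expressed in 512-byte sectors; the
"+ buffers" term is not modelled.

```python
def era_bitset_bytes(origin_sectors, block_size_sectors):
    """Evaluate the (4 * nr_blocks) term of the dm-era memory formula."""
    # Round up so that a trailing partial block is still counted.
    nr_blocks = -(-origin_sectors // block_size_sectors)
    return 4 * nr_blocks


# A 1 TiB origin (2**31 sectors) tracked at 4 MiB (8192 sectors) granularity.
print(era_bitset_bytes(2**31, 8192))
```

Under these assumptions the origin is split into 262144 blocks, so the
formula gives 1048576 bytes (1 MiB) before buffers.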
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
new file mode 100644
index 0000000..c77c58b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -0,0 +1,42 @@
+=============
+Device Mapper
+=============
+
+.. toctree::
+ :maxdepth: 1
+
+ cache-policies
+ cache
+ delay
+ dm-crypt
+ dm-flakey
+ dm-init
+ dm-integrity
+ dm-io
+ dm-log
+ dm-queue-length
+ dm-raid
+ dm-service-time
+ dm-uevent
+ dm-zoned
+ era
+ kcopyd
+ linear
+ log-writes
+ persistent-data
+ snapshot
+ statistics
+ striped
+ switch
+ thin-provisioning
+ unstriped
+ verity
+ writecache
+ zero
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/device-mapper/kcopyd.rst b/Documentation/admin-guide/device-mapper/kcopyd.rst
new file mode 100644
index 0000000..7651d39
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/kcopyd.rst
@@ -0,0 +1,47 @@
+======
+kcopyd
+======
+
+Kcopyd provides the ability to copy a range of sectors from one block-device
+to one or more other block-devices, with an asynchronous completion
+notification. It is used by dm-snapshot and dm-mirror.
+
+Users of kcopyd must first create a client and indicate how many memory pages
+to set aside for their copy jobs. This is done with a call to
+kcopyd_client_create()::
+
+ int kcopyd_client_create(unsigned int num_pages,
+ struct kcopyd_client **result);
+
+To start a copy job, the user must set up io_region structures to describe
+the source and destinations of the copy. Each io_region indicates a
+block-device along with the starting sector and size of the region. The source
+of the copy is given as one io_region structure, and the destinations of the
+copy are given as an array of io_region structures::
+
+ struct io_region {
+ struct block_device *bdev;
+ sector_t sector;
+ sector_t count;
+ };
+
+To start the copy, the user calls kcopyd_copy(), passing in the client
+pointer, pointers to the source and destination io_regions, the name of a
+completion callback routine, and a pointer to some context data for the copy::
+
+ int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from,
+ unsigned int num_dests, struct io_region *dests,
+ unsigned int flags, kcopyd_notify_fn fn, void *context);
+
+ typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err,
+ void *context);
+
+When the copy completes, kcopyd will call the user's completion routine,
+passing back the user's context pointer. It will also indicate if a read or
+write error occurred during the copy.
+
+When a user is done with all their copy jobs, they should call
+kcopyd_client_destroy() to delete the kcopyd client, which will release the
+associated memory pages::
+
+ void kcopyd_client_destroy(struct kcopyd_client *kc);
diff --git a/Documentation/admin-guide/device-mapper/linear.rst b/Documentation/admin-guide/device-mapper/linear.rst
new file mode 100644
index 0000000..9d17fc6
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/linear.rst
@@ -0,0 +1,63 @@
+=========
+dm-linear
+=========
+
+Device-Mapper's "linear" target maps a linear range of the Device-Mapper
+device onto a linear range of another device. This is the basic building
+block of logical volume managers.
+
+Parameters: <dev path> <offset>
+ <dev path>:
+ Full pathname to the underlying block-device, or a
+ "major:minor" device-number.
+ <offset>:
+ Starting sector within the device.
+
+
+Example scripts
+===============
+
+::
+
+ #!/bin/sh
+ # Create an identity mapping for a device
+ echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity
+
+::
+
+ #!/bin/sh
+ # Join 2 devices together
+ size1=`blockdev --getsz $1`
+ size2=`blockdev --getsz $2`
+ echo "0 $size1 linear $1 0
+ $size1 $size2 linear $2 0" | dmsetup create joined
+
+::
+
+ #!/usr/bin/perl -w
+ # Split a device into 4M chunks and then join them together in reverse order.
+
+ my $name = "reverse";
+ my $extent_size = 4 * 1024 * 2;
+ my $dev = $ARGV[0];
+ my $table = "";
+ my $count = 0;
+
+ if (!defined($dev)) {
+ die("Please specify a device.\n");
+ }
+
+ my $dev_size = `blockdev --getsz $dev`;
+ my $extents = int($dev_size / $extent_size) -
+ (($dev_size % $extent_size) ? 1 : 0);
+
+ while ($extents > 0) {
+ my $this_start = $count * $extent_size;
+ $extents--;
+ $count++;
+ my $this_offset = $extents * $extent_size;
+
+ $table .= "$this_start $extent_size linear $dev $this_offset\n";
+ }
+
+ `echo \"$table\" | dmsetup create $name`;
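
The table built by the last script in linear.rst above can also be generated
without invoking any tools, which makes the layout easy to inspect. The
following Python sketch is an illustration only: `reverse_table` is a
hypothetical name, the device size is passed in as a number of sectors instead
of being read with blockdev, and only whole extents are mapped, so a trailing
partial extent is left out.

```python
def reverse_table(dev, dev_size, extent_size=4 * 1024 * 2):
    """Build a dm-linear table mapping 4M extents of dev in reverse order.

    dev_size and extent_size are in 512-byte sectors.
    """
    extents = dev_size // extent_size
    lines = []
    for count in range(extents):
        start = count * extent_size                   # position in the new device
        offset = (extents - 1 - count) * extent_size  # position on dev
        lines.append(f"{start} {extent_size} linear {dev} {offset}")
    return "\n".join(lines)


print(reverse_table("/dev/sdb", 3 * 8192 + 100))
```

The resulting text can be fed to "dmsetup create" on standard input in the
same way as the tables in the shell examples.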
diff --git a/Documentation/admin-guide/device-mapper/log-writes.rst b/Documentation/admin-guide/device-mapper/log-writes.rst
new file mode 100644
index 0000000..23141f2
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/log-writes.rst
@@ -0,0 +1,145 @@
+=============
+dm-log-writes
+=============
+
+This target takes 2 devices, one to pass all IO to normally, and one to log all
+of the write operations to. This is intended for file system developers wishing
+to verify the integrity of metadata or data as the file system is written to.
+There is a log_write_entry written for every WRITE request and the target is
+able to take arbitrary data from userspace to insert into the log. The data
+that is in the WRITE requests is copied into the log to make the replay happen
+exactly as it happened originally.
+
+Log Ordering
+============
+
+We log things in order of completion once we are sure the write is no longer in
+cache. This means that normal WRITE requests are not actually logged until the
+next REQ_PREFLUSH request. This is to make it easier for userspace to replay
+the log in a way that correlates to what is on disk and not what is in cache,
+to make it easier to detect improper waiting/flushing.
+
+This works by attaching all WRITE requests to a list once the write completes.
+Once we see a REQ_PREFLUSH request we splice this list onto the request and once
+the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
+completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
+simulate the worst case scenario with regard to power failures. Consider the
+following example (W means write, C means complete):
+
+ W1,W2,W3,C3,C2,Wflush,C1,Cflush
+
+The log would show the following:
+
+ W3,W2,flush,W1....
+
+Again, this is to simulate what is actually on disk; this allows us to detect
+cases where a power failure at a particular point in time would create an
+inconsistent file system.
+
+Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
+they complete as those requests will obviously bypass the device cache.
+
+Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
+have all the DISCARD requests, and then the WRITE requests and then the FLUSH
+request. Consider the following example:
+
+ WRITE block 1, DISCARD block 1, FLUSH
+
+If we logged DISCARD when it completed, the replay would look like this:
+
+ DISCARD 1, WRITE 1, FLUSH
+
+which isn't quite what happened and wouldn't be caught during the log replay.
+
+Target interface
+================
+
+i) Constructor
+
+ log-writes <dev_path> <log_dev_path>
+
+ ============= ==============================================
+ dev_path Device that all of the IO will go to normally.
+ log_dev_path Device where the log entries are written to.
+ ============= ==============================================
+
+ii) Status
+
+ <#logged entries> <highest allocated sector>
+
+ =========================== ========================
+ #logged entries Number of logged entries
+ highest allocated sector Highest allocated sector
+ =========================== ========================
+
+iii) Messages
+
+ mark <description>
+
+ You can use a dmsetup message to set an arbitrary mark in a log.
+ For example, say you want to fsck a file system after every
+ write, but first you need to replay up to the mkfs to make sure
+ you are fsck'ing something reasonable. You would do something like
+ this::
+
+ mkfs.btrfs -f /dev/mapper/log
+ dmsetup message log 0 mark mkfs
+ <run test>
+
+ This would allow you to replay the log up to the mkfs mark and
+ then replay from that point on doing the fsck check in the
+ interval that you want.
+
+ Every log has a mark at the end labeled "dm-log-writes-end".
+
+Userspace component
+===================
+
+There is a userspace tool that will replay the log for you in various ways.
+It can be found here: https://github.com/josefbacik/log-writes
+
+Example usage
+=============
+
+Say you want to test fsync on your file system. You would do something like
+this::
+
+ TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+ dmsetup create log --table "$TABLE"
+ mkfs.btrfs -f /dev/mapper/log
+ dmsetup message log 0 mark mkfs
+
+ mount /dev/mapper/log /mnt/btrfs-test
+ <some test that does fsync at the end>
+ dmsetup message log 0 mark fsync
+ md5sum /mnt/btrfs-test/foo
+ umount /mnt/btrfs-test
+
+ dmsetup remove log
+ replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
+ mount /dev/sdb /mnt/btrfs-test
+ md5sum /mnt/btrfs-test/foo
+ <verify md5sum's are correct>
+
+ Another option is to do a complicated file system operation and verify the file
+ system is consistent during the entire operation. You could do this with:
+
+ TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
+ dmsetup create log --table "$TABLE"
+ mkfs.btrfs -f /dev/mapper/log
+ dmsetup message log 0 mark mkfs
+
+ mount /dev/mapper/log /mnt/btrfs-test
+ <fsstress to dirty the fs>
+ btrfs filesystem balance /mnt/btrfs-test
+ umount /mnt/btrfs-test
+ dmsetup remove log
+
+ replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
+ btrfsck /dev/sdb
+ replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
+ --fsck "btrfsck /dev/sdb" --check fua
+
+This will replay the log until it sees a FUA request and run the fsck command.
+If the fsck passes, it will replay to the next FUA, and so on until the log is
+completed or the fsck command exits abnormally.
diff --git a/Documentation/admin-guide/device-mapper/persistent-data.rst b/Documentation/admin-guide/device-mapper/persistent-data.rst
new file mode 100644
index 0000000..2065c3c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/persistent-data.rst
@@ -0,0 +1,88 @@
+===============
+Persistent data
+===============
+
+Introduction
+============
+
+The more-sophisticated device-mapper targets require complex metadata
+that is managed in kernel. In late 2010 we were seeing that various
+different targets were rolling their own data structures, for example:
+
+- Mikulas Patocka's multisnap implementation
+- Heinz Mauelshagen's thin provisioning target
+- Another btree-based caching target posted to dm-devel
+- Another multi-snapshot target based on a design of Daniel Phillips
+
+Maintaining these data structures takes a lot of work, so if possible
+we'd like to reduce the number.
+
+The persistent-data library is an attempt to provide a re-usable
+framework for people who want to store metadata in device-mapper
+targets. It's currently used by the thin-provisioning target and an
+upcoming hierarchical storage target.
+
+Overview
+========
+
+The main documentation is in the header files which can all be found
+under drivers/md/persistent-data.
+
+The block manager
+-----------------
+
+dm-block-manager.[hc]
+
+This provides access to the data on disk in fixed-size blocks. There
+is a read/write locking interface to prevent concurrent accesses, and
+to keep data that is being used in the cache.
+
+Clients of persistent-data are unlikely to use this directly.
+
+The transaction manager
+-----------------------
+
+dm-transaction-manager.[hc]
+
+This restricts access to blocks and enforces copy-on-write semantics.
+The only way you can get hold of a writable block through the
+transaction manager is by shadowing an existing block (i.e. doing
+copy-on-write) or allocating a fresh one. Shadowing is elided within
+the same transaction so performance is reasonable. The commit method
+ensures that all data is flushed before it writes the superblock.
+On power failure your metadata will be as it was when last committed.
+
+The Space Maps
+--------------
+
+dm-space-map.h
+dm-space-map-metadata.[hc]
+dm-space-map-disk.[hc]
+
+On-disk data structures that keep track of reference counts of blocks.
+They also act as the allocator of new blocks. There are currently two
+implementations: a simpler one for managing blocks on a different
+device (e.g. thinly-provisioned data blocks), and one for managing
+the metadata space. The latter is complicated by the need to store
+its own data within the space it's managing.
+
+The data structures
+-------------------
+
+dm-btree.[hc]
+dm-btree-remove.c
+dm-btree-spine.c
+dm-btree-internal.h
+
+Currently there is only one data structure, a hierarchical btree.
+There are plans to add more. For example, something with an
+array-like interface would see a lot of use.
+
+The btree is 'hierarchical' in that you can define it to be composed
+of nested btrees, and take multiple keys. For example, the
+thin-provisioning target uses a btree with two levels of nesting.
+The first maps a device id to a mapping tree, and that in turn maps a
+virtual block to a physical block.
+
+Values stored in the btrees can have arbitrary size. Keys are always
+64 bits, although nesting allows you to use multiple keys.
diff --git a/Documentation/admin-guide/device-mapper/snapshot.rst b/Documentation/admin-guide/device-mapper/snapshot.rst
new file mode 100644
index 0000000..ccdd8b5
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/snapshot.rst
@@ -0,0 +1,196 @@
+==============================
+Device-mapper snapshot support
+==============================
+
+Device-mapper allows you, without massive data copying:
+
+- To create snapshots of any block device, i.e. mountable, saved states of
+ the block device which are also writable without interfering with the
+ original content;
+- To create device "forks", i.e. multiple different versions of the
+ same data stream;
+- To merge a snapshot of a block device back into the snapshot's origin
+ device.
+
+In the first two cases, dm copies only the chunks of data that get
+changed and uses a separate copy-on-write (COW) block device for
+storage.
+
+For snapshot merge the contents of the COW storage are merged back into
+the origin device.
+
+
+There are three dm targets available:
+snapshot, snapshot-origin, and snapshot-merge.
+
+- snapshot-origin <origin>
+
+which will normally have one or more snapshots based on it.
+Reads will be mapped directly to the backing device. For each write, the
+original data will be saved in the <COW device> of each snapshot to keep
+its visible content unchanged, at least until the <COW device> fills up.
+
+
+- snapshot <origin> <COW device> <persistent?> <chunksize>
+ [<# feature args> [<arg>]*]
+
+A snapshot of the <origin> block device is created. Changed chunks of
+<chunksize> sectors will be stored on the <COW device>. Writes will
+only go to the <COW device>. Reads will come from the <COW device> or
+from <origin> for unchanged data. <COW device> will often be
+smaller than the origin and if it fills up the snapshot will become
+useless and be disabled, returning errors. So it is important to monitor
+the amount of free space and expand the <COW device> before it fills up.
+
+<persistent?> is P (Persistent) or N (Not persistent - will not survive
+after reboot). O (Overflow) can be added as a persistent store option
+to allow userspace to advertise its support for seeing "Overflow" in the
+snapshot status. So supported store types are "P", "PO" and "N".
+
+The difference between persistent and transient is that with transient
+snapshots less metadata must be saved on disk - it can be kept in
+memory by the kernel.
+
+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+
+Optional features:
+
+ discard_zeroes_cow - a discard issued to the snapshot device that
+ maps to entire chunks will zero the corresponding exception(s) in
+ the snapshot's exception store.
+
+ discard_passdown_origin - a discard to the snapshot device is passed
+ down to the snapshot-origin's underlying device. This doesn't cause
+ copy-out to the snapshot exception store because the snapshot-origin
+ target is bypassed.
+
+ The discard_passdown_origin feature depends on the discard_zeroes_cow
+ feature being enabled.
+
+
+- snapshot-merge <origin> <COW device> <persistent> <chunksize>
+ [<# feature args> [<arg>]*]
+
+takes the same table arguments as the snapshot target except it only
+works with persistent snapshots. This target assumes the role of the
+"snapshot-origin" target and must not be loaded if the "snapshot-origin"
+is still present for <origin>.
+
+Creates a merging snapshot that takes control of the changed chunks
+stored in the <COW device> of an existing snapshot, through a handover
+procedure, and merges these chunks back into the <origin>. Once merging
+has started (in the background) the <origin> may be opened and the merge
+will continue while I/O is flowing to it. Changes to the <origin> are
+deferred until the merging snapshot's corresponding chunk(s) have been
+merged. Once merging has started the snapshot device, associated with
+the "snapshot" target, will return -EIO when accessed.
+
+
+How snapshot is used by LVM2
+============================
+When you create the first LVM2 snapshot of a volume, four dm devices are used:
+
+1) a device containing the original mapping table of the source volume;
+2) a device used as the <COW device>;
+3) a "snapshot" device, combining #1 and #2, which is the visible snapshot
+ volume;
+4) the "original" volume (which uses the device number used by the original
+ source volume), whose table is replaced by a "snapshot-origin" mapping
+ from device #1.
+
+A fixed naming scheme is used, so with the following commands::
+
+ lvcreate -L 1G -n base volumeGroup
+ lvcreate -L 100M --snapshot -n snap volumeGroup/base
+
+we'll have this situation (with volumes in the above order)::
+
+ # dmsetup table|grep volumeGroup
+
+ volumeGroup-base-real: 0 2097152 linear 8:19 384
+ volumeGroup-snap-cow: 0 204800 linear 8:19 2097536
+ volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16
+ volumeGroup-base: 0 2097152 snapshot-origin 254:11
+
+ # ls -lL /dev/mapper/volumeGroup-*
+ brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
+ brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow
+ brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap
+ brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base
+
+
+How snapshot-merge is used by LVM2
+==================================
+A merging snapshot assumes the role of the "snapshot-origin" while
+merging. As such the "snapshot-origin" is replaced with
+"snapshot-merge". The "-real" device is not changed and the "-cow"
+device is renamed to <origin name>-cow to aid LVM2's cleanup of the
+merging snapshot after it completes. The "snapshot" that hands over its
+COW device to the "snapshot-merge" is deactivated (unless using lvchange
+--refresh); but if it is left active it will simply return I/O errors.
+
+A snapshot will merge into its origin with the following command::
+
+ lvconvert --merge volumeGroup/snap
+
+we'll now have this situation::
+
+ # dmsetup table|grep volumeGroup
+
+ volumeGroup-base-real: 0 2097152 linear 8:19 384
+ volumeGroup-base-cow: 0 204800 linear 8:19 2097536
+ volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
+
+ # ls -lL /dev/mapper/volumeGroup-*
+ brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
+ brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
+ brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
+
+
+How to determine when merging is complete
+=========================================
+The snapshot-merge and snapshot status lines end with:
+
+ <sectors_allocated>/<total_sectors> <metadata_sectors>
+
+Both <sectors_allocated> and <total_sectors> include both data and metadata.
+During merging, the number of sectors allocated gets smaller and
+smaller. Merging has finished when the number of sectors holding data
+is zero, in other words <sectors_allocated> == <metadata_sectors>.
+
+Here is a practical example (using a hybrid of lvm and dmsetup commands)::
+
+ # lvs
+ LV VG Attr LSize Origin Snap% Move Log Copy% Convert
+ base volumeGroup owi-a- 4.00g
+ snap volumeGroup swi-a- 1.00g base 18.97
+
+ # dmsetup status volumeGroup-snap
+ 0 8388608 snapshot 397896/2097152 1560
+ ^^^^ metadata sectors
+
+ # lvconvert --merge -b volumeGroup/snap
+ Merging of volume snap started.
+
+ # lvs volumeGroup/snap
+ LV VG Attr LSize Origin Snap% Move Log Copy% Convert
+ base volumeGroup Owi-a- 4.00g 17.23
+
+ # dmsetup status volumeGroup-base
+ 0 8388608 snapshot-merge 281688/2097152 1104
+
+ # dmsetup status volumeGroup-base
+ 0 8388608 snapshot-merge 180480/2097152 712
+
+ # dmsetup status volumeGroup-base
+ 0 8388608 snapshot-merge 16/2097152 16
+
+Merging has finished.
+
+::
+
+ # lvs
+ LV VG Attr LSize Origin Snap% Move Log Copy% Convert
+ base volumeGroup owi-a- 4.00g
diff --git a/Documentation/admin-guide/device-mapper/statistics.rst b/Documentation/admin-guide/device-mapper/statistics.rst
new file mode 100644
index 0000000..41ded0b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/statistics.rst
@@ -0,0 +1,225 @@
+=============
+DM statistics
+=============
+
+Device Mapper supports the collection of I/O statistics on user-defined
+regions of a DM device. If no regions are defined, no statistics are
+collected, so there isn't any performance impact. Only bio-based DM
+devices are currently supported.
+
+Each user-defined region specifies a starting sector, length and step.
+Individual statistics will be collected for each step-sized area within
+the range specified.
+
+The I/O statistics counters for each step-sized area of a region are
+in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see:
+Documentation/admin-guide/iostats.rst). But two extra counters (12 and 13) are
+provided: total time spent reading and writing. When the histogram
+argument is used, a 14th counter is reported that represents the
+histogram of latencies. All these counters may be accessed by sending
+the @stats_print message to the appropriate DM device via dmsetup.
+
+The reported times are in milliseconds and the granularity depends on
+the kernel ticks. When the option precise_timestamps is used, the
+reported times are in nanoseconds.
+
+Each region has a corresponding unique identifier, which we call a
+region_id, that is assigned when the region is created. The region_id
+must be supplied when querying statistics about the region, deleting the
+region, etc. Unique region_ids enable multiple userspace programs to
+request and process statistics for the same DM device without stepping
+on each other's data.
+
+The creation of DM statistics will allocate memory via kmalloc or
+fall back to using vmalloc space. At most, 1/4 of the overall system
+memory may be allocated by DM statistics. The admin can see how much
+memory is used by reading:
+
+ /sys/module/dm_mod/parameters/stats_current_allocated_bytes
+
+Messages
+========
+
+ @stats_create <range> <step> [<number_of_optional_arguments> <optional_arguments>...] [<program_id> [<aux_data>]]
+ Create a new region and return the region_id.
+
+ <range>
+ "-"
+ whole device
+ "<start_sector>+<length>"
+ a range of <length> 512-byte sectors
+ starting with <start_sector>.
+
+ <step>
+ "<area_size>"
+ the range is subdivided into areas each containing
+ <area_size> sectors.
+ "/<number_of_areas>"
+ the range is subdivided into the specified
+ number of areas.
+
+ <number_of_optional_arguments>
+ The number of optional arguments
+
+ <optional_arguments>
+ The following optional arguments are supported:
+
+ precise_timestamps
+ use precise timer with nanosecond resolution
+ instead of the "jiffies" variable. When this argument is
+ used, the resulting times are in nanoseconds instead of
+ milliseconds. Precise timestamps are a little bit slower
+ to obtain than jiffies-based timestamps.
+ histogram:n1,n2,n3,n4,...
+ collect histogram of latencies. The
+ numbers n1, n2, etc are times that represent the boundaries
+ of the histogram. If precise_timestamps is not used, the
+ times are in milliseconds, otherwise they are in
+ nanoseconds. For each range, the kernel will report the
+ number of requests that completed within this range. For
+ example, if we use "histogram:10,20,30", the kernel will
+ report four numbers a:b:c:d. a is the number of requests
+ that took 0-10 ms to complete, b is the number of requests
+ that took 10-20 ms to complete, c is the number of requests
+ that took 20-30 ms to complete and d is the number of
+ requests that took more than 30 ms to complete.
+
+ <program_id>
+ An optional parameter. A name that uniquely identifies
+ the userspace owner of the range. This groups ranges together
+ so that userspace programs can identify the ranges they
+ created and ignore those created by others.
+ The kernel returns this string back in the output of
+ @stats_list message, but it doesn't use it for anything else.
+ If we omit the number of optional arguments, program id must not
+ be a number, otherwise it would be interpreted as the number of
+ optional arguments.
+
+ <aux_data>
+ An optional parameter. A word that provides auxiliary data
+ that is useful to the client program that created the range.
+ The kernel returns this string back in the output of
+ @stats_list message, but it doesn't use this value for anything.
+
+ @stats_delete <region_id>
+ Delete the region with the specified id.
+
+ <region_id>
+ region_id returned from @stats_create
+
+ @stats_clear <region_id>
+ Clear all the counters except the in-flight i/o counters.
+
+ <region_id>
+ region_id returned from @stats_create
+
+ @stats_list [<program_id>]
+ List all regions registered with @stats_create.
+
+ <program_id>
+ An optional parameter.
+ If this parameter is specified, only matching regions
+ are returned.
+ If it is not specified, all regions are returned.
+
+ Output format:
+ <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+ precise_timestamps histogram:n1,n2,n3,...
+
+ The strings "precise_timestamps" and "histogram" are printed only
+ if they were specified when creating the region.
+
+ @stats_print <region_id> [<starting_line> <number_of_lines>]
+ Print counters for each step-sized area of a region.
+
+ <region_id>
+ region_id returned from @stats_create
+
+ <starting_line>
+ The index of the starting line in the output.
+ If omitted, all lines are returned.
+
+ <number_of_lines>
+ The number of lines to include in the output.
+ If omitted, all lines are returned.
+
+ Output format for each step-sized area of a region:
+
+ <start_sector>+<length>
+ counters
+
+ The first 11 counters have the same meaning as
+ `/sys/block/*/stat` or `/proc/diskstats`.
+
+ Please refer to Documentation/admin-guide/iostats.rst for details.
+
+ 1. the number of reads completed
+ 2. the number of reads merged
+ 3. the number of sectors read
+ 4. the number of milliseconds spent reading
+ 5. the number of writes completed
+ 6. the number of writes merged
+ 7. the number of sectors written
+ 8. the number of milliseconds spent writing
+ 9. the number of I/Os currently in progress
+ 10. the number of milliseconds spent doing I/Os
+ 11. the weighted number of milliseconds spent doing I/Os
+
+ Additional counters:
+
+ 12. the total time spent reading in milliseconds
+ 13. the total time spent writing in milliseconds
+
+ @stats_print_clear <region_id> [<starting_line> <number_of_lines>]
+ Atomically print and then clear all the counters except the
+ in-flight i/o counters. Useful when the client consuming the
+ statistics does not want to lose any statistics (those updated
+ between printing and clearing).
+
+ <region_id>
+ region_id returned from @stats_create
+
+ <starting_line>
+ The index of the starting line in the output.
+ If omitted, all lines are printed and then cleared.
+
+ <number_of_lines>
+ The number of lines to process.
+ If omitted, all lines are printed and then cleared.
+
+ @stats_set_aux <region_id> <aux_data>
+ Store auxiliary data aux_data for the specified region.
+
+ <region_id>
+ region_id returned from @stats_create
+
+ <aux_data>
+ The string that identifies data which is useful to the client
+ program that created the range. The kernel returns this
+ string back in the output of @stats_list message, but it
+ doesn't use this value for anything.
+
+Examples
+========
+
+Subdivide the DM device 'vol' into 100 pieces and start collecting
+statistics on them::
+
+ dmsetup message vol 0 @stats_create - /100
+
+Set the auxiliary data string to "foo bar baz" (the escape for each
+space must also be escaped, otherwise the shell will consume them)::
+
+ dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
+
+List the statistics::
+
+ dmsetup message vol 0 @stats_list
+
+Print the statistics::
+
+ dmsetup message vol 0 @stats_print 0
+
+Delete the statistics::
+
+ dmsetup message vol 0 @stats_delete 0
diff --git a/Documentation/admin-guide/device-mapper/striped.rst b/Documentation/admin-guide/device-mapper/striped.rst
new file mode 100644
index 0000000..e9a8da1
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/striped.rst
@@ -0,0 +1,61 @@
+=========
+dm-stripe
+=========
+
+Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0)
+device across one or more underlying devices. Data is written in "chunks",
+with consecutive chunks rotating among the underlying devices. This can
+potentially provide improved I/O throughput by utilizing several physical
+devices in parallel.
+
+Parameters: <num devs> <chunk size> [<dev path> <offset>]+
+ <num devs>:
+ Number of underlying devices.
+ <chunk size>:
+ Size of each chunk of data. Must be at least as
+ large as the system's PAGE_SIZE.
+ <dev path>:
+ Full pathname to the underlying block-device, or a
+ "major:minor" device-number.
+ <offset>:
+ Starting sector within the device.
+
+One or more underlying devices can be specified. The striped device size must
+be a multiple of the chunk size multiplied by the number of underlying devices.
+
+
+Example scripts
+===============
+
+::
+
+ #!/usr/bin/perl -w
+ # Create a striped device across any number of underlying devices. The device
+ # will be called "stripe_dev" and have a chunk-size of 128k.
+
+ my $chunk_size = 128 * 2;
+ my $dev_name = "stripe_dev";
+ my $num_devs = @ARGV;
+ my @devs = @ARGV;
+ my ($min_dev_size, $stripe_dev_size, $i);
+
+ if (!$num_devs) {
+ die("Specify at least one device\n");
+ }
+
+ $min_dev_size = `blockdev --getsz $devs[0]`;
+ for ($i = 1; $i < $num_devs; $i++) {
+ my $this_size = `blockdev --getsz $devs[$i]`;
+ $min_dev_size = ($min_dev_size < $this_size) ?
+ $min_dev_size : $this_size;
+ }
+
+ $stripe_dev_size = $min_dev_size * $num_devs;
+ $stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs);
+
+ $table = "0 $stripe_dev_size striped $num_devs $chunk_size";
+ for ($i = 0; $i < $num_devs; $i++) {
+ $table .= " $devs[$i] 0";
+ }
+
+ `echo $table | dmsetup create $dev_name`;
diff --git a/Documentation/admin-guide/device-mapper/switch.rst b/Documentation/admin-guide/device-mapper/switch.rst
new file mode 100644
index 0000000..7dde06b
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/switch.rst
@@ -0,0 +1,141 @@
+=========
+dm-switch
+=========
+
+The device-mapper switch target creates a device that supports an
+arbitrary mapping of fixed-size regions of I/O across a fixed set of
+paths. The path used for any specific region can be switched
+dynamically by sending the target a message.
+
+It maps I/O to underlying block devices efficiently when there is a large
+number of fixed-sized address regions but there is no simple pattern
+that would allow for a compact representation of the mapping such as
+dm-stripe.
+
+Background
+----------
+
+Dell EqualLogic and some other iSCSI storage arrays use a distributed
+frameless architecture. In this architecture, the storage group
+consists of a number of distinct storage arrays ("members") each having
+independent controllers, disk storage and network adapters. When a LUN
+is created it is spread across multiple members. The details of the
+spreading are hidden from initiators connected to this storage system.
+The storage group exposes a single target discovery portal, no matter
+how many members are being used. When iSCSI sessions are created, each
+session is connected to an eth port on a single member. Data to a LUN
+can be sent on any iSCSI session, and if the blocks being accessed are
+stored on another member the I/O will be forwarded as required. This
+forwarding is invisible to the initiator. The storage layout is also
+dynamic, and the blocks stored on disk may be moved from member to
+member as needed to balance the load.
+
+This architecture simplifies the management and configuration of both
+the storage group and initiators. In a multipathing configuration, it
+is possible to set up multiple iSCSI sessions to use multiple network
+interfaces on both the host and target to take advantage of the
+increased network bandwidth. An initiator could use a simple round
+robin algorithm to send I/O across all paths and let the storage array
+members forward it as necessary, but there is a performance advantage to
+sending data directly to the correct member.
+
+A device-mapper table already lets you map different regions of a
+device onto different targets. However in this architecture the LUN is
+spread with an address region size on the order of tens of MBs, which
+means the resulting table could have more than a million entries and
+consume far too much memory.
+
+Using this device-mapper switch target we can now build a two-layer
+device hierarchy:
+
+ Upper Tier - Determine which array member the I/O should be sent to.
+ Lower Tier - Load balance amongst paths to a particular member.
+
+The lower tier consists of a single dm multipath device for each member.
+Each of these multipath devices contains the set of paths directly to
+the array member in one priority group, and leverages existing path
+selectors to load balance amongst these paths. We also build a
+non-preferred priority group containing paths to other array members for
+failover reasons.
+
+The upper tier consists of a single dm-switch device. This device uses
+a bitmap to look up the location of the I/O and choose the appropriate
+lower tier device to route the I/O. By using a bitmap we are able to
+use 4 bits for each address range in a 16 member group (which is very
+large for us). This is a much denser representation than the dm table
+b-tree can achieve.
+
+Construction Parameters
+=======================
+
+ <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
+ <num_paths>
+ The number of paths across which to distribute the I/O.
+
+ <region_size>
+ The number of 512-byte sectors in a region. Each region can be redirected
+ to any of the available paths.
+
+ <num_optional_args>
+ The number of optional arguments. Currently, no optional arguments
+ are supported and so this must be zero.
+
+ <dev_path>
+ The block device that represents a specific path to the device.
+
+ <offset>
+ The offset of the start of data on the specific <dev_path> (in units
+ of 512-byte sectors). This number is added to the sector number when
+ forwarding the request to the specific path. Typically it is zero.
+
+Messages
+========
+
+set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
+
+Modify the region table by specifying which regions are redirected to
+which paths.
+
+<index>
+ The region number (region size was specified in constructor parameters).
+ If index is omitted, the next region (previous index + 1) is used.
+ Expressed in hexadecimal (WITHOUT any prefix like 0x).
+
+<path_nr>
+ The path number in the range 0 ... (<num_paths> - 1).
+ Expressed in hexadecimal (WITHOUT any prefix like 0x).
+
+R<n>,<m>
+ This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
+ are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
+ slots.
+
+Status
+======
+
+No status line is reported.
+
+Example
+=======
+
+Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
+the same size.
+
+Create a switch device with 64kB region size::
+
+ dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
+ switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
+
+Set mappings for the first 7 entries to point to devices switch0, switch1,
+switch2, switch0, switch1, switch2, switch1::
+
+ dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
+
+Set repetitive mapping. This command::
+
+ dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
+
+is equivalent to::
+
+ dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
+ :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
diff --git a/Documentation/admin-guide/device-mapper/thin-provisioning.rst b/Documentation/admin-guide/device-mapper/thin-provisioning.rst
new file mode 100644
index 0000000..bafebf7
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/thin-provisioning.rst
@@ -0,0 +1,427 @@
+=================
+Thin provisioning
+=================
+
+Introduction
+============
+
+This document describes a collection of device-mapper targets that
+between them implement thin-provisioning and snapshots.
+
+The main highlight of this implementation, compared to the previous
+implementation of snapshots, is that it allows many virtual devices to
+be stored on the same data volume. This simplifies administration and
+allows the sharing of data between volumes, thus reducing disk usage.
+
+Another significant feature is support for an arbitrary depth of
+recursive snapshots (snapshots of snapshots of snapshots ...). The
+previous implementation of snapshots did this by chaining together
+lookup tables, and so performance was O(depth). This new
+implementation uses a single data structure to avoid this degradation
+with depth. Fragmentation may still be an issue, however, in some
+scenarios.
+
+Metadata is stored on a separate device from data, giving the
+administrator some freedom, for example to:
+
+- Improve metadata resilience by storing metadata on a mirrored volume
+ but data on a non-mirrored one.
+
+- Improve performance by storing the metadata on SSD.
+
+Status
+======
+
+These targets are considered safe for production use. But different use
+cases will have different performance characteristics, for example due
+to fragmentation of the data volume.
+
+If you find this software is not performing as expected please mail
+dm-devel@redhat.com with details and we'll try our best to improve
+things for you.
+
+Userspace tools for checking and repairing the metadata have been fully
+developed and are available as 'thin_check' and 'thin_repair'. The name
+of the package that provides these utilities varies by distribution (on
+a Red Hat distribution it is named 'device-mapper-persistent-data').
+
+Cookbook
+========
+
+This section describes some quick recipes for using thin provisioning.
+They use the dmsetup program to control the device-mapper driver
+directly. End users will be advised to use a higher-level volume
+manager such as LVM2 once support has been added.
+
+Pool device
+-----------
+
+The pool device ties together the metadata volume and the data volume.
+It maps I/O linearly to the data volume and updates the metadata via
+two mechanisms:
+
+- Function calls from the thin targets
+
+- Device-mapper 'messages' from userspace which control the creation of new
+ virtual devices amongst other things.
+
+Setting up a fresh pool device
+------------------------------
+
+Setting up a pool device requires a valid metadata device, and a
+data device. If you do not have an existing metadata device you can
+make one by zeroing the first 4k to indicate empty metadata::
+
+ dd if=/dev/zero of=$metadata_dev bs=4096 count=1
+
+The amount of metadata you need will vary according to how many blocks
+are shared between thin devices (i.e. through snapshots). If you have
+less sharing than average you'll need a larger-than-average metadata device.
+
+As a guide, we suggest you calculate the number of bytes to use in the
+metadata device as 48 * $data_dev_size / $data_block_size but round it up
+to 2MB if the answer is smaller. If you're creating large numbers of
+snapshots which are recording large amounts of change, you may find you
+need to increase this.
+
+The largest size supported is 16GB. If the device is larger,
+a warning will be issued and the excess space will not be used.
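+
+The guideline above can be sketched in bash as follows. This is only an
+illustration: the helper name ``thin_metadata_bytes`` is hypothetical, both
+arguments are assumed to be given in 512-byte sectors, and clamping the
+result to 16GB is this sketch's reading of the limit above, not something
+the kernel requires::
+
+    # Hypothetical helper: suggested thin-pool metadata size in bytes.
+    # $1 = data device size, $2 = data block size (same units, e.g. sectors)
+    thin_metadata_bytes() {
+        local data_dev_size=$1 data_block_size=$2
+        local min=$((2 * 1024 * 1024))            # 2MB floor
+        local max=$((16 * 1024 * 1024 * 1024))    # 16GB largest supported size
+        local bytes=$((48 * data_dev_size / data_block_size))
+        if [ "$bytes" -lt "$min" ]; then bytes=$min; fi
+        if [ "$bytes" -gt "$max" ]; then bytes=$max; fi
+        echo "$bytes"
+    }
+
+    # 10GB data device (20971520 sectors) with 64KB (128-sector) blocks
+    thin_metadata_bytes 20971520 128
+
+For this input the helper prints 7864320, i.e. 7.5MB of metadata.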
+
+Reloading a pool table
+----------------------
+
+You may reload a pool's table; indeed, this is how the pool is resized
+if it runs out of space. (N.B. While specifying a different metadata
+device when reloading is not forbidden at the moment, things will go
+wrong if it does not route I/O to exactly the same on-disk location as
+previously.)
+
+Using an existing pool device
+-----------------------------
+
+::
+
+ dmsetup create pool \
+ --table "0 20971520 thin-pool $metadata_dev $data_dev \
+ $data_block_size $low_water_mark"
+
+$data_block_size gives the smallest unit of disk space that can be
+allocated at a time expressed in units of 512-byte sectors.
+$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
+multiple of 128 (64KB). $data_block_size cannot be changed after the
+thin-pool is created. People primarily interested in thin provisioning
+may want to use a value such as 1024 (512KB). People doing lots of
+snapshotting may want a smaller value such as 128 (64KB). If you are
+not zeroing newly-allocated data, a larger $data_block_size in the
+region of 256000 (about 125MB) is suggested.
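+
+The constraints on $data_block_size can be checked before creating a pool.
+The following bash sketch is illustrative only; the function name
+``check_data_block_size`` is hypothetical, and the kernel performs its own
+validation when the table is loaded::
+
+    # Hypothetical helper: validate a thin-pool data block size given in
+    # 512-byte sectors and report the equivalent size in KB.
+    check_data_block_size() {
+        local bs=$1
+        if [ "$bs" -ge 128 ] && [ "$bs" -le 2097152 ] &&
+           [ $((bs % 128)) -eq 0 ]; then
+            echo "ok: $((bs / 2))KB"      # two sectors per KB
+        else
+            echo "invalid: $bs"
+            return 1
+        fi
+    }
+
+    check_data_block_size 1024           # ok: 512KB
+    check_data_block_size 200 || true    # invalid: 200 (not a multiple of 128)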
+
+$low_water_mark is expressed in blocks of size $data_block_size. If
+free space on the data device drops below this level then a dm event
+will be triggered, which a userspace daemon should catch so that it can
+extend the pool device. Only one such event will be sent.
+
+No special event is triggered if a just-resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.
+
+A low water mark for the metadata device is maintained in the kernel and
+will trigger a dm event if free space on the metadata device drops below
+it.
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second. This
+means the thin-provisioning target behaves like a physical disk that has
+a volatile write cache. If power is lost you may lose some recent
+writes. The metadata should always be consistent in spite of any crash.
+
+If data space is exhausted the pool will either error or queue IO
+according to the configuration (see: error_if_no_space). If metadata
+space is exhausted or a metadata operation fails, the pool will error IO
+until the pool is taken offline and repair is performed to 1) fix any
+potential inconsistencies and 2) clear the flag that imposes repair.
+Once the pool's metadata device is repaired it may be resized, which
+will allow the pool to return to normal operation. Note that if a pool
+is flagged as needing repair, the pool's data and metadata devices
+cannot be resized until repair is performed. It should also be noted
+that when the pool's metadata space is exhausted the current metadata
+transaction is aborted. Given that the pool will cache IO whose
+completion may have already been acknowledged to upper IO layers
+(e.g. filesystem) it is strongly suggested that consistency checks
+(e.g. fsck) be performed on those layers when repair of the pool is
+required.
+
+Thin provisioning
+-----------------
+
+i) Creating a new thinly-provisioned volume.
+
+ To create a new thinly-provisioned volume you must send a message to an
+ active pool device, /dev/mapper/pool in this example::
+
+ dmsetup message /dev/mapper/pool 0 "create_thin 0"
+
+ Here '0' is an identifier for the volume, a 24-bit number. It's up
+ to the caller to allocate and manage these identifiers. If the
+ identifier is already in use, the message will fail with -EEXIST.
+
+ii) Using a thinly-provisioned volume.
+
+ Thinly-provisioned volumes are activated using the 'thin' target::
+
+ dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
+
+ The last parameter is the identifier for the thin device.
+
+Internal snapshots
+------------------
+
+i) Creating an internal snapshot.
+
+ Snapshots are created with another message to the pool.
+
+ N.B. If the origin device that you wish to snapshot is active, you
+ must suspend it before creating the snapshot to avoid corruption.
+ This is NOT enforced at the moment, so please be careful!
+
+ ::
+
+ dmsetup suspend /dev/mapper/thin
+ dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
+ dmsetup resume /dev/mapper/thin
+
+ Here '1' is the identifier for the volume, a 24-bit number. '0' is the
+ identifier for the origin device.
+
+ii) Using an internal snapshot.
+
+ Once created, the user doesn't have to worry about any connection
+ between the origin and the snapshot. Indeed the snapshot is no
+ different from any other thinly-provisioned device and can be
+ snapshotted itself via the same method. It's perfectly legal to
+ have only one of them active, and there's no ordering requirement on
+ activating or removing them both. (This differs from conventional
+ device-mapper snapshots.)
+
+ Activate it exactly the same way as any other thinly-provisioned volume::
+
+ dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
+
+External snapshots
+------------------
+
+You can use an external **read only** device as an origin for a
+thinly-provisioned volume. Any read to an unprovisioned area of the
+thin device will be passed through to the origin. Writes trigger
+the allocation of new blocks as usual.
+
+One use case for this is VM hosts that want to run guests on
+thinly-provisioned volumes but have the base image on another device
+(possibly shared between many VMs).
+
+You must not write to the origin device if you use this technique!
+Of course, you may write to the thin device and take internal snapshots
+of the thin volume.
+
+i) Creating a snapshot of an external device
+
+ This is the same as creating a thin device.
+ You don't mention the origin at this stage.
+
+ ::
+
+ dmsetup message /dev/mapper/pool 0 "create_thin 0"
+
+ii) Using a snapshot of an external device.
+
+ Append an extra parameter to the thin target specifying the origin::
+
+ dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
+
+ N.B. All descendants (internal snapshots) of this snapshot require the
+ same extra origin parameter.
+
+Deactivation
+------------
+
+All devices using a pool must be deactivated before the pool itself
+can be.
+
+::
+
+ dmsetup remove thin
+ dmsetup remove snap
+ dmsetup remove pool
+
+Reference
+=========
+
+'thin-pool' target
+------------------
+
+i) Constructor
+
+ ::
+
+ thin-pool <metadata dev> <data dev> <data block size (sectors)> \
+ <low water mark (blocks)> [<number of feature args> [<arg>]*]
+
+ Optional feature arguments:
+
+ skip_block_zeroing:
+ Skip the zeroing of newly-provisioned blocks.
+
+ ignore_discard:
+ Disable discard support.
+
+ no_discard_passdown:
+ Don't pass discards down to the underlying
+ data device, but just remove the mapping.
+
+ read_only:
+ Don't allow any changes to be made to the pool
+ metadata. This mode is only available after the
+ thin-pool has been created and first used in full
+ read/write mode. It cannot be specified on initial
+ thin-pool creation.
+
+ error_if_no_space:
+ Error IOs, instead of queueing, if no space.
+
+ Data block size must be between 64KB (128 sectors) and 1GB
+ (2097152 sectors) inclusive.
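+
+To show how the feature-argument count fits into the constructor line, here
+is a small bash sketch that assembles a thin-pool table. The function name
+``thin_pool_table`` and the device paths are hypothetical placeholders; the
+count is simply the number of feature arguments that follow it::
+
+    # Hypothetical helper: build a thin-pool table line.
+    # $1 = length in sectors, $2 = metadata dev, $3 = data dev,
+    # $4 = data block size (sectors), $5 = low water mark (blocks),
+    # remaining arguments = optional feature args
+    thin_pool_table() {
+        local len=$1 meta=$2 data=$3 bs=$4 lwm=$5
+        shift 5
+        local table="0 $len thin-pool $meta $data $bs $lwm"
+        if [ "$#" -gt 0 ]; then
+            table="$table $# $*"    # feature count, then the features
+        fi
+        echo "$table"
+    }
+
+    thin_pool_table 20971520 /dev/sdb1 /dev/sdb2 128 32768 \
+        skip_block_zeroing error_if_no_space
+
+The resulting string can be passed to ``dmsetup create pool --table``.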
+
+
+ii) Status
+
+ ::
+
+ <transaction id> <used metadata blocks>/<total metadata blocks>
+ <used data blocks>/<total data blocks> <held metadata root>
+ ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+ needs_check|- metadata_low_watermark
+
+ transaction id:
+ A 64-bit number used by userspace to help synchronise with metadata
+ from volume managers.
+
+ used data blocks / total data blocks
+ If the number of free blocks drops below the pool's low water mark a
+ dm event will be sent to userspace. This event is edge-triggered and
+ it will occur only once after each resume so volume manager writers
+ should register for the event and then check the target's status.
+
+ held metadata root:
+ The location, in blocks, of the metadata root that has been
+ 'held' for userspace read access. '-' indicates there is no
+ held root.
+
+ discard_passdown|no_discard_passdown
+ Whether or not discards are actually being passed down to the
+ underlying device. Even if this is enabled when loading the table,
+ it can get disabled if the underlying device doesn't support it.
+
+ ro|rw|out_of_data_space
+ If the pool encounters certain types of device failures it will
+ drop into a read-only metadata mode in which no changes to
+ the pool metadata (like allocating new blocks) are permitted.
+
+ In serious cases where even a read-only mode is deemed unsafe
+ no further I/O will be permitted and the status will just
+ contain the string 'Fail'. The userspace recovery tools
+ should then be used.
+
+ error_if_no_space|queue_if_no_space
+ If the pool runs out of data or metadata space, the pool will
+ either queue or error the IO destined to the data device. The
+ default is to queue the IO until more space is added or the
+ 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
+ module parameter can be used to change this timeout -- it
+ defaults to 60 seconds but may be disabled using a value of 0.
+
+ needs_check
+ A metadata operation has failed, resulting in the needs_check
+ flag being set in the metadata's superblock. The metadata
+ device must be deactivated and checked/repaired before the
+ thin-pool can be made fully operational again. '-' indicates
+ needs_check is not set.
+
+ metadata_low_watermark:
+ Value of metadata low watermark in blocks. The kernel sets this
+ value internally but userspace needs to know this value to
+ determine if an event was caused by crossing this threshold.
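+
+A userspace daemon handling a dm event might compare free data space with
+the low water mark along the following lines. This bash sketch is
+illustrative only: the function name ``pool_free_data_blocks`` is
+hypothetical, the sample status string is made up, and the function expects
+just the target status fields listed above (in ``dmsetup status`` output
+they follow the start, length and target name)::
+
+    # Hypothetical helper: free data blocks from a thin-pool status string.
+    pool_free_data_blocks() {
+        # Split the status string into positional fields; field 3 is
+        # <used data blocks>/<total data blocks>.
+        set -- $1
+        local used=${3%/*} total=${3#*/}
+        echo $((total - used))
+    }
+
+    # Made-up sample status, for illustration only
+    status="1 406/4161600 800/163840 - rw discard_passdown queue_if_no_space - 1024"
+    low_water_mark=32768
+    free=$(pool_free_data_blocks "$status")
+    if [ "$free" -lt "$low_water_mark" ]; then
+        echo "pool needs extending ($free blocks free)"
+    else
+        echo "ok ($free blocks free)"
+    fi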
+
+iii) Messages
+
+ create_thin <dev id>
+ Create a new thinly-provisioned device.
+ <dev id> is an arbitrary unique 24-bit identifier chosen by
+ the caller.
+
+ create_snap <dev id> <origin id>
+ Create a new snapshot of another thinly-provisioned device.
+ <dev id> is an arbitrary unique 24-bit identifier chosen by
+ the caller.
+ <origin id> is the identifier of the thinly-provisioned device
+ of which the new device will be a snapshot.
+
+ delete <dev id>
+ Deletes a thin device. Irreversible.
+
+ set_transaction_id <current id> <new id>
+ Userland volume managers, such as LVM, need a way to
+ synchronise their external metadata with the internal metadata of the
+ pool target. The thin-pool target offers to store an
+ arbitrary 64-bit transaction id and return it on the target's
+ status line. To avoid races you must provide what you think
+ the current transaction id is when you change it with this
+ compare-and-swap message.
+
+ reserve_metadata_snap
+ Reserve a copy of the data mapping btree for use by userland.
+ This allows userland to inspect the mappings as they were when
+ this message was executed. Use the pool's status command to
+ get the root block associated with the metadata snapshot.
+
+ release_metadata_snap
+ Release a previously reserved copy of the data mapping btree.
+
+'thin' target
+-------------
+
+i) Constructor
+
+ ::
+
+ thin <pool dev> <dev id> [<external origin dev>]
+
+ pool dev:
+ the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
+
+ dev id:
+ the internal device identifier of the device to be
+ activated.
+
+ external origin dev:
+ an optional block device outside the pool to be treated as a
+ read-only snapshot origin: reads to unprovisioned areas of the
+ thin target will be mapped to this device.
+
+The pool doesn't store any size against the thin devices. If you
+load a thin target that is smaller than you've been using previously,
+then you'll have no access to blocks mapped beyond the end. If you
+load a target that is bigger than before, then extra blocks will be
+provisioned as and when needed.
+
+ii) Status
+
+ <nr mapped sectors> <highest mapped sector>
+ If the pool has encountered device errors and failed, the status
+ will just contain the string 'Fail'. The userspace recovery
+ tools should then be used.
+
+ In the case where <nr mapped sectors> is 0, there is no highest
+ mapped sector and the value of <highest mapped sector> is unspecified.
diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst
new file mode 100644
index 0000000..0a8d3eb
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/unstriped.rst
@@ -0,0 +1,135 @@
+================================
+Device-mapper "unstriped" target
+================================
+
+Introduction
+============
+
+The device-mapper "unstriped" target provides a transparent mechanism to
+unstripe a device-mapper "striped" target to access the underlying disks
+without having to touch the true backing block-device. It can also be
+used to unstripe a hardware RAID-0 to access backing disks.
+
+Parameters:
+<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
+
+<number of stripes>
+ The number of stripes in the RAID 0.
+
+<chunk size>
+ The number of 512B sectors in each striping chunk.
+
+<dev_path>
+ The block device you wish to unstripe.
+
+<stripe #>
+ The stripe number within the device that corresponds to the physical
+ drive you wish to unstripe. This must be 0-indexed.
+
+
+Why use this module?
+====================
+
+An example of undoing an existing dm-stripe
+-------------------------------------------
+
+This small bash script will set up 4 loop devices and use the existing
+striped target to combine the 4 devices into one. It then will use
+the unstriped target on top of the striped device to access the
+individual backing loop devices. We write data to the newly exposed
+unstriped devices and verify the data written matches the correct
+underlying device on the striped array::
+
+ #!/bin/bash
+
+ MEMBER_SIZE=$((128 * 1024 * 1024))
+ NUM=4
+ SEQ_END=$((${NUM}-1))
+ CHUNK=256
+ BS=4096
+
+ RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
+ DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
+ COUNT=$((${MEMBER_SIZE} / ${BS}))
+
+ for i in $(seq 0 ${SEQ_END}); do
+ dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
+ losetup /dev/loop${i} member-${i}
+ DM_PARMS+=" /dev/loop${i} 0"
+ done
+
+ echo $DM_PARMS | dmsetup create raid0
+ for i in $(seq 0 ${SEQ_END}); do
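+ # Each unstriped device spans one member: MEMBER_SIZE bytes in 512-byte sectors
+ echo "0 $((${MEMBER_SIZE}/512)) unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}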
+ done;
+
+ for i in $(seq 0 ${SEQ_END}); do
+ dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
+ diff /dev/mapper/set-${i} member-${i}
+ done;
+
+ for i in $(seq 0 ${SEQ_END}); do
+ dmsetup remove set-${i}
+ done
+
+ dmsetup remove raid0
+
+ for i in $(seq 0 ${SEQ_END}); do
+ losetup -d /dev/loop${i}
+ rm -f member-${i}
+ done
+
+Another example
+---------------
+
+Intel NVMe drives contain two cores on the physical device.
+Each core of the drive has segregated access to its LBA range.
+The current LBA model has a RAID 0 128k chunk on each core, resulting
+in a 256k stripe across the two cores::
+
+ Core 0: Core 1:
+ __________ __________
+ | LBA 512| | LBA 768|
+ | LBA 0 | | LBA 256|
+ ---------- ----------
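+
+The layout in the diagram can be reproduced with a little arithmetic. The
+bash sketch below assumes a conventional round-robin RAID 0 layout, which
+is consistent with the diagram; the function name ``stripe_of_sector`` is
+hypothetical, and 256 is the 128k chunk expressed in 512-byte sectors::
+
+    # Hypothetical helper: which stripe (core) holds a given sector?
+    # $1 = sector on the aggregate device, $2 = chunk size in sectors,
+    # $3 = number of stripes
+    stripe_of_sector() {
+        local sector=$1 chunk=$2 stripes=$3
+        echo $(( (sector / chunk) % stripes ))
+    }
+
+    for lba in 0 256 512 768; do
+        echo "LBA $lba -> core $(stripe_of_sector "$lba" 256 2)"
+    done
+
+LBAs 0 and 512 map to core 0 and LBAs 256 and 768 map to core 1, as drawn
+above.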
+
+The purpose of this unstriping is to provide better QoS in noisy
+neighbor environments. When two partitions are created on the
+aggregate drive without this unstriping, reads on one partition
+can affect writes on another partition. This is because the partitions
+are striped across the two cores. When we unstripe this hardware RAID 0
+and make partitions on each newly exposed device, the two partitions are now
+physically separated.
+
+With the dm-unstriped target we're able to segregate an fio script that
+has read and write jobs that are independent of each other. Compared to
+when we run the test on a combined drive with partitions, we were able
+to get a 92% reduction in read latency using this device mapper target.
+
+
+Example dmsetup usage
+=====================
+
+unstriped on top of Intel NVMe device that has 2 cores
+------------------------------------------------------
+
+::
+
+ dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
+ dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
+
+There will now be two devices that expose Intel NVMe core 0 and 1
+respectively::
+
+ /dev/mapper/nvmset0
+ /dev/mapper/nvmset1
+
+unstriped on top of striped with 4 drives using 128K chunk size
+---------------------------------------------------------------
+
+::
+
+ dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
+ dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
+ dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
+ dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
new file mode 100644
index 0000000..bb02caa
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -0,0 +1,236 @@
+=========
+dm-verity
+=========
+
+Device-Mapper's "verity" target provides transparent integrity checking of
+block devices using a cryptographic digest provided by the kernel crypto API.
+This target is read-only.
+
+Construction Parameters
+=======================
+
+::
+
+ <version> <dev> <hash_dev>
+ <data_block_size> <hash_block_size>
+ <num_data_blocks> <hash_start_block>
+ <algorithm> <digest> <salt>
+ [<#opt_params> <opt_params>]
+
+<version>
+ This is the type of the on-disk hash format.
+
+ 0 is the original format used in Chromium OS.
+ The salt is appended when hashing, digests are stored continuously and
+ the rest of the block is padded with zeroes.
+
+ 1 is the current format that should be used for new devices.
+ The salt is prepended when hashing and each digest is
+ padded with zeroes to the power of two.
+
+<dev>
+ This is the device containing data, the integrity of which needs to be
+ checked. It may be specified as a path, like /dev/sdaX, or a device number,
+ <major>:<minor>.
+
+<hash_dev>
+ This is the device that supplies the hash tree data. It may be
+ specified similarly to the device path and may be the same device. If the
+ same device is used, the hash_start should be outside the configured
+ dm-verity device.
+
+<data_block_size>
+ The block size on a data device in bytes.
+ Each block corresponds to one digest on the hash device.
+
+<hash_block_size>
+ The size of a hash block in bytes.
+
+<num_data_blocks>
+ The number of data blocks on the data device. Additional blocks are
+ inaccessible. You can place hashes on the same partition as data; in this
+ case hashes are placed after <num_data_blocks>.
+
+<hash_start_block>
+ This is the offset, in <hash_block_size>-blocks, from the start of hash_dev
+ to the root block of the hash tree.
+
+<algorithm>
+ The cryptographic hash algorithm used for this device. This should
+ be the name of the algorithm, like "sha1".
+
+<digest>
+ The hexadecimal encoding of the cryptographic hash of the root hash block
+ and the salt. This hash should be trusted as there is no other authenticity
+ beyond this point.
+
+<salt>
+ The hexadecimal encoding of the salt value.
+
+<#opt_params>
+ Number of optional parameters. If there are no optional parameters,
+ the optional parameters section can be skipped or #opt_params can be zero.
+ Otherwise #opt_params is the number of following arguments.
+
+ Example of optional parameters section:
+ 1 ignore_corruption
+
+ignore_corruption
+ Log corrupted blocks, but allow read operations to proceed normally.
+
+restart_on_corruption
+ Restart the system when a corrupted block is discovered. This option is
+ not compatible with ignore_corruption and requires user space support to
+ avoid restart loops.
+
+ignore_zero_blocks
+ Do not verify blocks that are expected to contain zeroes and always return
+ zeroes instead. This may be useful if the partition contains unused blocks
+ that are not guaranteed to contain zeroes.
+
+use_fec_from_device <fec_dev>
+ Use forward error correction (FEC) to recover from corruption if hash
+ verification fails. Use encoding data from the specified device. This
+ may be the same device where data and hash blocks reside, in which case
+ fec_start must be outside data and hash areas.
+
+ If the encoding data covers additional metadata, it must be accessible
+ on the hash device after the hash blocks.
+
+ Note: block sizes for data and hash devices must match. Also, if the
+ verity <dev> is encrypted the <fec_dev> should be too.
+
+fec_roots <num>
+ Number of generator roots. This equals the number of parity bytes in
+ the encoding data. For example, in RS(M, N) encoding, the number of roots
+ is M-N.
+
+fec_blocks <num>
+ The number of encoding data blocks on the FEC device. The block size for
+ the FEC device is <data_block_size>.
+
+fec_start <offset>
+ This is the offset, in <data_block_size> blocks, from the start of the
+ FEC device to the beginning of the encoding data.
+
+check_at_most_once
+ Verify data blocks only the first time they are read from the data device,
+ rather than every time. This reduces the overhead of dm-verity so that it
+ can be used on systems that are memory and/or CPU constrained. However, it
+ provides a reduced level of security because only offline tampering of the
+ data device's content will be detected, not online tampering.
+
+ Hash blocks are still verified each time they are read from the hash device,
+ since verification of hash blocks is less performance critical than data
+ blocks, and a hash block will not be verified any more after all the data
+ blocks it covers have been verified anyway.
+
+root_hash_sig_key_desc <key_description>
+ This is the description of the USER_KEY that the kernel will look up to get
+ the pkcs7 signature of the roothash. The pkcs7 signature is used to validate
+ the root hash during the creation of the device mapper block device.
+ Verification of roothash depends on the config DM_VERITY_VERIFY_ROOTHASH_SIG
+ being set in the kernel.
+
+Theory of operation
+===================
+
+dm-verity is meant to be set up as part of a verified boot path. This
+may be anything ranging from a boot using tboot or trustedgrub to just
+booting from a known-good device (like a USB drive or CD).
+
+When a dm-verity device is configured, it is expected that the caller
+has been authenticated in some way (cryptographic signatures, etc).
+After instantiation, all hashes will be verified on-demand during
+disk access. If they cannot be verified up to the root node of the
+tree, the root hash, then the I/O will fail. This should detect
+tampering with any data on the device and the hash data.
+
+Cryptographic hashes are used to assert the integrity of the device on a
+per-block basis. This allows for a lightweight hash computation on first read
+into the page cache. Block hashes are stored linearly, aligned to the nearest
+block size.
+
+If forward error correction (FEC) support is enabled, any recovery of
+corrupted data will be verified using the cryptographic hash of the
+corresponding data. This is why combining error correction with
+integrity checking is essential.
+
+Hash Tree
+---------
+
+Each node in the tree is a cryptographic hash. If it is a leaf node, the hash
+of some data block on disk is calculated. If it is an intermediary node,
+the hash of a number of child nodes is calculated.
+
+Each entry in the tree is a collection of neighboring nodes that fit in one
+block. The number is determined based on block_size and the size of the
+selected cryptographic digest algorithm. The hashes are linearly-ordered in
+this entry and any unaligned trailing space is ignored but included when
+calculating the parent node.
+
+The tree looks something like:
+
+ alg = sha256, num_blocks = 32768, block_size = 4096
+
+::
+
+ [ root ]
+ / . . . \
+ [entry_0] [entry_1]
+ / . . . \ . . . \
+ [entry_0_0] . . . [entry_0_127] . . . . [entry_1_127]
+ / ... \ / . . . \ / \
+ blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767
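+
+The shape of this tree follows from the parameters. As a rough bash sketch
+(the function name ``verity_tree_levels`` is hypothetical, and it assumes
+the digest size is already a power of two, as the 32-byte sha256 digest
+is), the number of hash blocks at each level, from the leaves up to the
+root, can be computed like this::
+
+    # Hypothetical helper: hash blocks per level, leaf level first.
+    # $1 = number of data blocks, $2 = block size in bytes,
+    # $3 = digest size in bytes
+    verity_tree_levels() {
+        local num_blocks=$1 block_size=$2 digest_size=$3
+        local per_block=$((block_size / digest_size))   # digests per block
+        local n=$num_blocks
+        while [ "$n" -gt 1 ]; do
+            n=$(( (n + per_block - 1) / per_block ))    # round up
+            echo "$n"
+        done
+    }
+
+    verity_tree_levels 32768 4096 32
+
+For the example above each hash block holds 128 digests, and the sketch
+prints 256, 2 and 1: 256 leaf-level hash blocks, the two blocks entry_0 and
+entry_1, and the root.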
+
+
+On-disk format
+==============
+
+The verity kernel code does not read the verity metadata on-disk header.
+It only reads the hash blocks which directly follow the header.
+It is expected that a user-space tool will verify the integrity of the
+verity header.
+
+Alternatively, the header can be omitted and the dmsetup parameters can
+be passed via the kernel command-line in a rooted chain of trust where
+the command-line is verified.
+
+Directly following the header (and with sector number padded to the next hash
+block boundary) are the hash blocks which are stored a depth at a time
+(starting from the root), sorted in order of increasing index.
+
+The full specification of kernel parameters and on-disk metadata format
+is available at the cryptsetup project's wiki page
+
+ https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity
+
+Status
+======
+V (for Valid) is returned if every check performed so far was valid.
+If any check failed, C (for Corruption) is returned.
+
+Example
+=======
+Set up a device::
+
+ # dmsetup create vroot --readonly --table \
+ "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\
+ "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\
+ "1234000000000000000000000000000000000000000000000000000000000000"
+
+A command line tool veritysetup is available to compute or verify
+the hash tree or activate the kernel device. This is available from
+the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/
+(as a libcryptsetup extension).
+
+Create hash on the device::
+
+ # veritysetup format /dev/sda1 /dev/sda2
+ ...
+ Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
+
+Activate the device::
+
+ # veritysetup create vroot /dev/sda1 /dev/sda2 \
+ 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076
diff --git a/Documentation/admin-guide/device-mapper/writecache.rst b/Documentation/admin-guide/device-mapper/writecache.rst
new file mode 100644
index 0000000..d3d7690
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/writecache.rst
@@ -0,0 +1,79 @@
+=================
+Writecache target
+=================
+
+The writecache target caches writes on persistent memory or on SSD. It
+doesn't cache reads because reads are supposed to be cached in page cache
+in normal RAM.
+
+When the device is constructed, the first sector should be zeroed or the
+first sector should contain a valid superblock from a previous invocation.
+
+Constructor parameters:
+
+1. type of the cache device - "p" or "s"
+
+ - p - persistent memory
+ - s - SSD
+2. the underlying device that will be cached
+3. the cache device
+4. block size (4096 is recommended; the maximum block size is the page
+ size)
+5. the number of optional parameters (the parameters with an argument
+ count as two)
+
+ start_sector n (default: 0)
+ offset from the start of cache device in 512-byte sectors
+ high_watermark n (default: 50)
+ start writeback when the number of used blocks reaches this
+ watermark
+ low_watermark x (default: 45)
+ stop writeback when the number of used blocks drops below
+ this watermark
+ writeback_jobs n (default: unlimited)
+ limit the number of blocks that are in flight during
+ writeback. Setting this value reduces writeback
+ throughput, but it may improve latency of read requests
+ autocommit_blocks n (default: 64 for pmem, 65536 for ssd)
+ when the application writes this number of blocks without
+ issuing the FLUSH request, the blocks are automatically
+ committed
+ autocommit_time ms (default: 1000)
+ autocommit time in milliseconds. The data is automatically
+ committed if this time passes and no FLUSH request is
+ received
+ fua (by default on)
+ applicable only to persistent memory - use the FUA flag
+ when writing data from persistent memory back to the
+ underlying device
+ nofua
+ applicable only to persistent memory - don't use the FUA
+ flag when writing back data and send the FLUSH request
+ afterwards
+
+ - some underlying devices perform better with fua, some
+ with nofua. The user should test it
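+
+The rule that a parameter with an argument counts as two can be seen when
+assembling a table line. The bash sketch below is illustrative only; the
+function name ``writecache_table`` and the device paths are hypothetical
+placeholders::
+
+    # Hypothetical helper: build a writecache table line.
+    # $1 = length in sectors, $2 = cache type (p or s),
+    # $3 = underlying device, $4 = cache device, $5 = block size,
+    # remaining arguments = optional parameters and their values
+    writecache_table() {
+        local len=$1 type=$2 origin=$3 cache=$4 bs=$5
+        shift 5
+        # $# counts every remaining word, so "high_watermark 60" adds two
+        local table="0 $len writecache $type $origin $cache $bs $#"
+        if [ "$#" -gt 0 ]; then table="$table $*"; fi
+        echo "$table"
+    }
+
+    writecache_table 2097152 s /dev/sdb1 /dev/sdc1 4096 \
+        high_watermark 60 low_watermark 50
+
+Here the count emitted before the optional parameters is 4.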
+
+Status:
+
+1. error indicator - 0 if there was no error, otherwise error number
+2. the number of blocks
+3. the number of free blocks
+4. the number of blocks under writeback
+
+Messages:
+ flush
+ flush the cache device. The message returns successfully
+ if the cache device was flushed without an error
+ flush_on_suspend
+ flush the cache device on next suspend. Use this message
+ when you are going to remove the cache device. The proper
+ sequence for removing the cache device is:
+
+ 1. send the "flush_on_suspend" message
+ 2. load an inactive table with a linear target that maps
+ to the underlying device
+ 3. suspend the device
+ 4. ask for status and verify that there are no errors
+ 5. resume the device, so that it will use the linear
+ target
+ 6. the cache device is now inactive and it can be deleted
diff --git a/Documentation/admin-guide/device-mapper/zero.rst b/Documentation/admin-guide/device-mapper/zero.rst
new file mode 100644
index 0000000..11fb5cf
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/zero.rst
@@ -0,0 +1,37 @@
+=======
+dm-zero
+=======
+
+Device-Mapper's "zero" target provides a block-device that always returns
+zero'd data on reads and silently drops writes. This is similar behavior to
+/dev/zero, but as a block-device instead of a character-device.
+
+Dm-zero has no target-specific parameters.
+
+One very interesting use of dm-zero is for creating "sparse" devices in
+conjunction with dm-snapshot. A sparse device reports a device-size larger
+than the amount of actual storage space available for that device. A user can
+write data anywhere within the sparse device and read it back like a normal
+device. Reads to previously unwritten areas will return a zero'd buffer. When
+enough data has been written to fill up the actual storage space, the sparse
+device is deactivated. This can be very useful for testing device and
+filesystem limitations.
+
+To create a sparse device, start by creating a dm-zero device that's the
+desired size of the sparse device. For this example, we'll assume a 10TB
+sparse device::
+
+ TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors
+ echo "0 $TEN_TERABYTES zero" | dmsetup create zero1
+
+Then create a snapshot of the zero device, using any available block-device as
+the COW device. The size of the COW device will determine the amount of real
+space available to the sparse device. For this example, we'll assume /dev/sdb1
+is an available 10GB partition::
+
+ echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \
+ dmsetup create sparse1
+
+This will create a 10TB sparse device called /dev/mapper/sparse1 that has
+10GB of actual storage space available. If more than 10GB of data is written
+to this device, it will start returning I/O errors.
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
index 7fadc05..d41671a 100644
--- a/Documentation/admin-guide/devices.rst
+++ b/Documentation/admin-guide/devices.rst
@@ -1,3 +1,4 @@
+.. _admin_devices:
Linux allocated devices (4.x+ version)
======================================
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index 1649117..1c5d228 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -1647,8 +1647,17 @@
0 = /dev/comedi0 First comedi device
1 = /dev/comedi1 Second comedi device
...
+ 47 = /dev/comedi47 48th comedi device
- See http://stm.lbl.gov/comedi.
+ Minors 48 to 255 are reserved for comedi subdevices with
+ pathnames of the form "/dev/comediX_subdY", where "X" is the
+ minor number of the associated comedi device and "Y" is the
+ subdevice number. These subdevice minors are assigned
+ dynamically, so there is no fixed mapping from subdevice
+ pathnames to minor numbers.
+
+ See http://www.comedi.org/ for information about the Comedi
+ project.
98 block User-mode virtual block device
0 = /dev/ubda First user-mode block device
@@ -2693,8 +2702,8 @@
41 = /dev/ttySMX0 Motorola i.MX - port 0
42 = /dev/ttySMX1 Motorola i.MX - port 1
43 = /dev/ttySMX2 Motorola i.MX - port 2
- 44 = /dev/ttyMM0 Marvell MPSC - port 0
- 45 = /dev/ttyMM1 Marvell MPSC - port 1
+ 44 = /dev/ttyMM0 Marvell MPSC - port 0 (obsolete unused)
+ 45 = /dev/ttyMM1 Marvell MPSC - port 1 (obsolete unused)
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index fdf7242..252e5ef 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -110,8 +110,8 @@
~# cat query-batch-file > <debugfs>/dynamic_debug/control
-A another way is to use wildcard. The match rule support ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character).For
+Another way is to use wildcards. The match rule supports ``*`` (matches
+zero or more characters) and ``?`` (matches exactly one character). For
example, you can match all usb drivers::
~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
@@ -258,7 +258,7 @@
If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
boot time, without effect, but will be reprocessed when module is
-loaded later. ``dyndbg_query=`` and bare ``dyndbg=`` are only processed at
+loaded later. ``ddebug_query=`` and bare ``dyndbg=`` are only processed at
boot.
@@ -301,7 +301,7 @@
For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
-the sysfs interface if the debug messages are no longer needed::
+the debugfs interface if the debug messages are no longer needed::
echo "module module_name -p" > <debugfs>/dynamic_debug/control
diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst
new file mode 100644
index 0000000..833edb0
--- /dev/null
+++ b/Documentation/admin-guide/efi-stub.rst
@@ -0,0 +1,100 @@
+=================
+The EFI Boot Stub
+=================
+
+On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade
+as a PE/COFF image, thereby convincing EFI firmware loaders to load
+it as an EFI executable. The code that modifies the bzImage header,
+along with the EFI-specific entry point that the firmware loader
+jumps to, are collectively known as the "EFI boot stub", and live in
+arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
+respectively. For ARM the EFI stub is implemented in
+arch/arm/boot/compressed/efi-header.S and
+arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
+between architectures is in drivers/firmware/efi/libstub.
+
+For arm64, there is no compressed kernel support, so the Image itself
+masquerades as a PE/COFF image and the EFI stub is linked into the
+kernel. The arm64 EFI stub lives in arch/arm64/kernel/efi-entry.S
+and drivers/firmware/efi/libstub/arm64-stub.c.
+
+By using the EFI boot stub it's possible to boot a Linux kernel
+without the use of a conventional EFI boot loader, such as grub or
+elilo. Since the EFI boot stub performs the jobs of a boot loader, in
+a certain sense it *IS* the boot loader.
+
+The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option.
+
+
+How to install bzImage.efi
+--------------------------
+
+The bzImage located in arch/x86/boot/bzImage must be copied to the EFI
+System Partition (ESP) and renamed with the extension ".efi". Without
+the extension the EFI firmware loader will refuse to execute it. It's
+not possible to execute bzImage.efi from the usual Linux file systems
+because EFI firmware doesn't have support for them. For ARM the
+arch/arm/boot/zImage should be copied to the system partition, and it
+may not need to be renamed. Similarly for arm64, arch/arm64/boot/Image
+should be copied but not necessarily renamed.
+
+
+Passing kernel parameters from the EFI shell
+--------------------------------------------
+
+Arguments to the kernel can be passed after bzImage.efi, e.g.::
+
+ fs0:> bzImage.efi console=ttyS0 root=/dev/sda4
+
+
+The "initrd=" option
+--------------------
+
+Like most boot loaders, the EFI stub allows the user to specify
+multiple initrd files using the "initrd=" option. This is the only EFI
+stub-specific command line parameter; everything else is passed to the
+kernel when it boots.
+
+The path to the initrd file must be an absolute path from the
+beginning of the ESP; relative path names do not work. Also, the path
+is an EFI-style path and directory elements must be separated with
+backslashes (\). For example, given the following directory layout::
+
+ fs0:>
+ Kernels\
+ bzImage.efi
+ initrd-large.img
+
+ Ramdisks\
+ initrd-small.img
+ initrd-medium.img
+
+to boot with the initrd-large.img file if the current working
+directory is fs0:\Kernels, the following command must be used::
+
+ fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img
+
+Notice how bzImage.efi can be specified with a relative path. That's
+because the image we're executing is interpreted by the EFI shell,
+which understands relative paths, whereas the rest of the command line
+is passed to bzImage.efi.
+
+
+The "dtb=" option
+-----------------
+
+For the ARM and arm64 architectures, a device tree must be provided to
+the kernel. Normally firmware shall supply the device tree via the
+EFI CONFIGURATION TABLE. However, the "dtb=" command line option can
+be used to override the firmware supplied device tree, or to supply
+one when firmware is unable to.
+
+Please note: Firmware adds runtime configuration information to the
+device tree before booting the kernel. If dtb= is used to override
+the device tree, then any runtime data provided by firmware will be
+lost. The dtb= option should only be used either as a debug tool, or
+as a last resort when a device tree is not provided in the EFI
+CONFIGURATION TABLE.
+
+"dtb=" is processed in the same manner as the "initrd=" option that is
+described above.
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
new file mode 100644
index 0000000..059ddcb
--- /dev/null
+++ b/Documentation/admin-guide/ext4.rst
@@ -0,0 +1,612 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+ext4 General Information
+========================
+
+Ext4 is an advanced level of the ext3 filesystem which incorporates
+scalability and reliability enhancements for supporting large filesystems
+(64 bit) in keeping with increasing disk capacities and state-of-the-art
+feature requirements.
+
+Mailing list: linux-ext4@vger.kernel.org
+Web site: http://ext4.wiki.kernel.org
+
+
+Quick usage instructions
+========================
+
+Note: More extensive information for getting started with ext4 can be
+found at the ext4 wiki site at the URL:
+http://ext4.wiki.kernel.org/index.php/Ext4_Howto
+
+ - The latest version of e2fsprogs can be found at:
+
+ https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
+
+ or
+
+ http://sourceforge.net/project/showfiles.php?group_id=2406
+
+ or grab the latest git repository from:
+
+ https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+ - Create a new filesystem using the ext4 filesystem type:
+
+ # mke2fs -t ext4 /dev/hda1
+
+ Or to configure an existing ext3 filesystem to support extents:
+
+ # tune2fs -O extents /dev/hda1
+
+ If the filesystem was created with 128 byte inodes, it can be
+ converted to use 256 byte for greater efficiency via:
+
+ # tune2fs -I 256 /dev/hda1
+
+ - Mounting:
+
+ # mount -t ext4 /dev/hda1 /wherever
+
+ - When comparing performance with other filesystems, it's always
+ important to try multiple workloads; very often a subtle change in a
+ workload parameter can completely change the ranking of which
+ filesystems do well compared to others. When comparing versus ext3,
+ note that ext4 enables write barriers by default, while ext3 does
+ not enable write barriers by default. So it is useful to
+ explicitly specify whether barriers are enabled or not via the
+ '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
+ for a fair comparison. When tuning ext3 for best benchmark numbers,
+ it is often worthwhile to try changing the data journaling mode; '-o
+ data=writeback' can be faster for some workloads. (Note however that
+ running mounted with data=writeback can potentially leave stale data
+ exposed in recently written files in case of an unclean shutdown,
+ which could be a security exposure in some situations.) Configuring
+ the filesystem with a large journal can also be helpful for
+ metadata-intensive workloads.
+
+Features
+========
+
+Currently Available
+-------------------
+
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
+* extent format reduces metadata overhead (RAM, IO for access, transactions)
+* extent format more robust in face of on-disk corruption due to magics,
+ internal redundancy in tree
+* improved file allocation (multi-block alloc)
+* lift 32000 subdirectory limit imposed by i_links_count[1]
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g. for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+ flex_bg feature
+* large file support
+* inode allocation using large virtual block groups via flex_bg
+* delayed allocation
+* large block (up to pagesize) support
+* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
+ the ordering)
+* Case-insensitive file name lookups
+
+[1] Filesystems with a block size of 1k may see a limit imposed by the
+directory hash tree having a maximum depth of two.
+
+Case-insensitive file name lookups
+======================================================
+
+The case-insensitive file name lookup feature is supported on a
+per-directory basis, allowing the user to mix case-insensitive and
+case-sensitive directories in the same filesystem. It is enabled by
+flipping the +F inode attribute of an empty directory. The
+case-insensitive string match operation is only defined when we know how
+text is encoded in a byte sequence. For that reason, in order to enable
+case-insensitive directories, the filesystem must have the
+casefold feature, which stores the filesystem-wide encoding
+model used. By default, the charset adopted is the latest version of
+Unicode (12.1.0, at the time of this writing), encoded in the UTF-8
+form. The comparison algorithm is implemented by normalizing the
+strings to the Canonical decomposition form, as defined by Unicode,
+followed by a byte per byte comparison.
+
+The case-awareness is name-preserving on the disk, meaning that the file
+name provided by userspace is a byte-per-byte match to what is actually
+written to the disk. The Unicode normalization format used by the
+kernel is thus an internal representation, and not exposed to the
+userspace nor to the disk, with the important exception of disk hashes,
+used on large case-insensitive directories with DX feature. On DX
+directories, the hash must be calculated using the casefolded version of
+the filename, meaning that the normalization format used actually has an
+impact on where the directory entry is stored.
+
+When we change from viewing filenames as opaque byte sequences to seeing
+them as encoded strings we need to address what happens when a program
+tries to create a file with an invalid name. The Unicode subsystem
+within the kernel leaves the decision of what to do in this case to the
+filesystem, which selects its preferred behavior by enabling/disabling
+the strict mode. When Ext4 encounters one of those strings and the
+filesystem did not require strict mode, it falls back to considering the
+entire string as an opaque byte sequence, which still allows the user to
+operate on that file, but the case-insensitive lookups won't work.
+
+Options
+=======
+
+When mounting an ext4 filesystem, the following options are accepted:
+(*) == default
+
+ ro
+ Mount filesystem read only. Note that ext4 will replay the journal (and
+ thus write to the partition) even when mounted "read only". The mount
+ options "ro,noload" can be used to prevent writes to the filesystem.
+
+ journal_checksum
+ Enable checksumming of the journal transactions. This will allow the
+ recovery code in e2fsck and the kernel to detect corruption in the
+ journal. It is a compatible change and will be ignored by older
+ kernels.
+
+ journal_async_commit
+ Commit block can be written to disk without waiting for descriptor
+ blocks. If enabled, older kernels cannot mount the device. This will
+ enable 'journal_checksum' internally.
+
+ journal_path=path, journal_dev=devnum
+ When the external journal device's major/minor numbers have changed,
+ these options allow the user to specify the new journal location. The
+ journal device is identified through either its new major/minor numbers
+ encoded in devnum, or via a path to the device.
+
+ norecovery, noload
+ Don't load the journal on mounting. Note that if the filesystem was
+ not unmounted cleanly, skipping the journal replay will lead to the
+ filesystem containing inconsistencies that can lead to any number of
+ problems.
+
+ data=journal
+ All data are committed into the journal prior to being written into the
+ main file system. Enabling this mode will disable delayed allocation
+ and O_DIRECT support.
+
+ data=ordered (*)
+ All data are forced directly out to the main file system prior to its
+ metadata being committed to the journal.
+
+ data=writeback
+ Data ordering is not preserved; data may be written into the main file
+ system after its metadata has been committed to the journal.
+
+ commit=nrsec (*)
+ Ext4 can be told to sync all its data and metadata every 'nrsec'
+ seconds. The default value is 5 seconds. This means that if you lose
+ your power, you will lose as much as the latest 5 seconds of work (your
+ filesystem will not be damaged though, thanks to the journaling). This
+ default value (or any low value) will hurt performance, but it's good
+ for data-safety. Setting it to 0 will have the same effect as leaving
+ it at the default (5 seconds). Setting it to very large values will
+ improve performance.
+
+ barrier=<0|1(*)>, barrier(*), nobarrier
+ This enables/disables the use of write barriers in the jbd code.
+ barrier=0 disables, barrier=1 enables. This also requires an IO stack
+ which can support barriers, and if jbd gets an error on a barrier
+ write, it will disable again with a warning. Write barriers enforce
+ proper on-disk ordering of journal commits, making volatile disk write
+ caches safe to use, at some performance penalty. If your disks are
+ battery-backed in one way or another, disabling barriers may safely
+ improve performance. The mount options "barrier" and "nobarrier" can
+ also be used to enable or disable barriers, for consistency with other
+ ext4 mount options.
+
+ inode_readahead_blks=n
+ This tuning parameter controls the maximum number of inode table blocks
+ that ext4's inode table readahead algorithm will pre-read into the
+ buffer cache. The default value is 32 blocks.
+
+ nouser_xattr
+ Disables Extended User Attributes. See the attr(5) manual page for
+ more information about extended attributes.
+
+ noacl
+ This option disables POSIX Access Control List support. If ACL support
+ is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL
+ is enabled by default on mount. See the acl(5) manual page for more
+ information about acl.
+
+ bsddf (*)
+ Make 'df' act like BSD.
+
+ minixdf
+ Make 'df' act like Minix.
+
+ debug
+ Extra debugging information is sent to syslog.
+
+ abort
+ Simulate the effects of calling ext4_abort() for debugging purposes.
+ This is normally used while remounting a filesystem which is already
+ mounted.
+
+ errors=remount-ro
+ Remount the filesystem read-only on an error.
+
+ errors=continue
+ Keep going on a filesystem error.
+
+ errors=panic
+ Panic and halt the machine if an error occurs. (These mount options
+ override the errors behavior specified in the superblock, which can be
+ configured using tune2fs)
+
+ data_err=ignore(*)
+ Just print an error message if an error occurs in a file data buffer in
+ ordered mode.
+ data_err=abort
+ Abort the journal if an error occurs in a file data buffer in ordered
+ mode.
+
+ grpid | bsdgroups
+ New objects have the group ID of their parent.
+
+ nogrpid (*) | sysvgroups
+ New objects have the group ID of their creator.
+
+ resgid=n
+ The group ID which may use the reserved blocks.
+
+ resuid=n
+ The user ID which may use the reserved blocks.
+
+ sb=
+ Use alternate superblock at this location.
+
+ quota, noquota, grpquota, usrquota
+ These options are ignored by the filesystem. They are used only by
+ quota tools to recognize volumes where quota should be turned on. See
+ documentation in the quota-tools package for more details
+ (http://sourceforge.net/projects/linuxquota).
+
+ jqfmt=<quota type>, usrjquota=<file>, grpjquota=<file>
+ These options tell the filesystem details about quota so that quota
+ information can be properly updated during journal replay. They replace
+ the above quota options. See documentation in the quota-tools package
+ for more details (http://sourceforge.net/projects/linuxquota).
+
+ stripe=n
+ Number of filesystem blocks that mballoc will try to use for allocation
+ size and alignment. For RAID5/6 systems this should be the number of
+ data disks * RAID chunk size in file system blocks.
+
+ delalloc (*)
+ Defer block allocation until just before ext4 writes out the block(s)
+ in question. This allows ext4 to make better allocation decisions more
+ efficiently.
+
+ nodelalloc
+ Disable delayed allocation. Blocks are allocated when the data is
+ copied from userspace to the page cache, either via the write(2) system
+ call or when an mmap'ed page which was previously unallocated is
+ written for the first time.
+
+ max_batch_time=usec
+ Maximum amount of time ext4 should wait for additional filesystem
+ operations to be batched together with a synchronous write operation.
+ Since a synchronous write operation is going to force a commit and then
+ a wait for the I/O to complete, it doesn't cost much, and can be a huge
+ throughput win, we wait for a small amount of time to see if any other
+ transactions can piggyback on the synchronous write. The algorithm
+ used is designed to automatically tune for the speed of the disk, by
+ measuring the amount of time (on average) that it takes to finish
+ committing a transaction. Call this time the "commit time". If the
+ time that the transaction has been running is less than the commit
+ time, ext4 will try sleeping for the commit time to see if other
+ operations will join the transaction. The commit time is capped by
+ the max_batch_time, which defaults to 15000us (15ms). This
+ optimization can be turned off entirely by setting max_batch_time to 0.
+
+ min_batch_time=usec
+ This parameter sets the commit time (as described above) to be at least
+ min_batch_time. It defaults to zero microseconds. Increasing this
+ parameter may improve the throughput of multi-threaded, synchronous
+ workloads on very fast disks, at the cost of increasing latency.
+
+ journal_ioprio=prio
+ The I/O priority (from 0 to 7, where 0 is the highest priority) which
+ should be used for I/O operations submitted by kjournald2 during a
+ commit operation. This defaults to 3, which is a slightly higher
+ priority than the default I/O priority.
+
+ auto_da_alloc(*), noauto_da_alloc
+ Many broken applications don't use fsync() when replacing existing
+ files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
+ rename("foo.new", "foo"), or worse yet, fd = open("foo",
+ O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4
+ will detect the replace-via-rename and replace-via-truncate patterns
+ and force any delayed allocation blocks to be allocated such that at
+ the next journal commit, in the default data=ordered mode, the data
+ blocks of the new file are forced to disk before the rename() operation
+ is committed. This provides roughly the same level of guarantees as
+ ext3, and avoids the "zero-length" problem that can happen when a
+ system crashes before the delayed allocation blocks are forced to disk.
+
+ noinit_itable
+ Do not initialize any uninitialized inode table blocks in the
+ background. This feature may be used by installation CDs so that the
+ install process can complete as quickly as possible; the inode table
+ initialization process would then be deferred until the next time the
+ file system is unmounted.
+
+ init_itable=n
+ The lazy itable init code will wait n times the number of milliseconds
+ it took to zero out the previous block group's inode table. This
+ minimizes the impact on system performance while the file system's
+ inode table is being initialized.
+
+ discard, nodiscard(*)
+ Controls whether ext4 should issue discard/TRIM commands to the
+ underlying block device when blocks are freed. This is useful for SSD
+ devices and sparse/thinly-provisioned LUNs, but it is off by default
+ until sufficient testing has been done.
+
+ nouid32
+ Disables 32-bit UIDs and GIDs. This is for interoperability with
+ older kernels which only store and expect 16-bit values.
+
+ block_validity(*), noblock_validity
+ These options enable or disable the in-kernel facility for tracking
+ filesystem metadata blocks within internal data structures. This
+ allows the multi-block allocator and other routines to notice bugs or
+ corrupted allocation bitmaps which cause blocks to be allocated which
+ overlap with filesystem metadata blocks.
+
+ dioread_lock, dioread_nolock
+ Controls whether or not ext4 should use the DIO read locking. If the
+ dioread_nolock option is specified, ext4 will allocate an uninitialized
+ extent before buffer write and convert the extent to initialized after
+ IO completes. This approach allows ext4 code to avoid using the inode
+ mutex, which improves scalability on high speed storage. However, this
+ does not work with data journaling, and the dioread_nolock option will be
+ ignored with a kernel warning. Note that the dioread_nolock code path is
+ only used for extent-based files. Because of the restrictions this option
+ comprises, it is off by default (i.e. dioread_lock).
+
+ max_dir_size_kb=n
+ This limits the size of directories so that any attempt to expand them
+ beyond the specified limit in kilobytes will cause an ENOSPC error.
+ This is useful in memory constrained environments, where a very large
+ directory can cause severe performance problems or even provoke the Out
+ Of Memory killer. (For example, if there is only 512mb memory
+ available, a 176mb directory may seriously cramp the system's style.)
+
+ i_version
+ Enable 64-bit inode version support. This option is off by default.
+
+ dax
+ Use direct access (no page cache). See
+ Documentation/filesystems/dax.txt. Note that this option is
+ incompatible with data=journal.
+
+Data Mode
+=========
+There are 3 different data modes:
+
+* writeback mode
+
+ In data=writeback mode, ext4 does not journal data at all. This mode provides
+ a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
+ mode - metadata journaling. A crash+recovery can cause incorrect data to
+ appear in files which were written shortly before the crash. This mode will
+ typically provide the best ext4 performance.
+
+* ordered mode
+
+ In data=ordered mode, ext4 only officially journals metadata, but it logically
+ groups metadata information related to data changes with the data blocks into
+ a single unit called a transaction. When it's time to write the new metadata
+ out to disk, the associated data blocks are written first. In general, this
+ mode performs slightly slower than writeback but significantly faster than
+ journal mode.
+
+* journal mode
+
+ data=journal mode provides full data and metadata journaling. All new data is
+ written to the journal first, and then to its final location. In the event of
+ a crash, the journal can be replayed, bringing both data and metadata into a
+ consistent state. This mode is the slowest except when data needs to be read
+ from and written to disk at the same time where it outperforms all other
+ modes. Enabling this mode will disable delayed allocation and O_DIRECT
+ support.
+
+/proc entries
+=============
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4. Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /proc/fs/ext4/<devname>
+
+ mb_groups
+ details of multiblock allocator buddy cache of free blocks
+
+/sys entries
+============
+
+Information about mounted ext4 file systems can be found in
+/sys/fs/ext4. Each mounted filesystem will have a directory in
+/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
+/sys/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /sys/fs/ext4/<devname>:
+
+(see also Documentation/ABI/testing/sysfs-fs-ext4)
+
+ delayed_allocation_blocks
+ This file is read-only and shows the number of blocks that are dirty in
+ the page cache, but which do not have their location in the filesystem
+ allocated yet.
+
+ inode_goal
+ Tuning parameter which (if non-zero) controls the goal inode used by
+ the inode allocator in preference to all other allocation heuristics.
+ This is intended for debugging use only, and should be 0 on production
+ systems.
+
+ inode_readahead_blks
+ Tuning parameter which controls the maximum number of inode table
+ blocks that ext4's inode table readahead algorithm will pre-read into
+ the buffer cache.
+
+ lifetime_write_kbytes
+ This file is read-only and shows the number of kilobytes of data that
+ have been written to this filesystem since it was created.
+
+ max_writeback_mb_bump
+ The maximum number of megabytes the writeback code will try to write
+ out before moving on to another inode.
+
+ mb_group_prealloc
+ The multiblock allocator will round up allocation requests to a
+ multiple of this tuning parameter if the stripe size is not set in the
+ ext4 superblock.
+
+ mb_max_to_scan
+ The maximum number of extents the multiblock allocator will search to
+ find the best extent.
+
+ mb_min_to_scan
+ The minimum number of extents the multiblock allocator will search to
+ find the best extent.
+
+ mb_order2_req
+ Tuning parameter which controls the minimum size for requests (as a
+ power of 2) where the buddy cache is used.
+
+ mb_stats
+ Controls whether the multiblock allocator should collect statistics,
+ which are shown during the unmount. 1 means to collect statistics, 0
+ means not to collect statistics.
+
+ mb_stream_req
+ Files which have fewer blocks than this tunable parameter will have
+ their blocks allocated out of a block group specific preallocation
+ pool, so that small files are packed closely together. Each large file
+ will have its blocks allocated out of its own unique preallocation
+ pool.
+
+ session_write_kbytes
+ This file is read-only and shows the number of kilobytes of data that
+ have been written to this filesystem since it was mounted.
+
+ reserved_clusters
+ This is a read-write file that contains the number of reserved clusters
+ in the file system, which are used in specific situations to avoid costly
+ zeroout, unexpected ENOSPC, or possible data loss. The default is 2% or
+ 4096 clusters, whichever is smaller. This can be changed, but it can
+ never exceed the number of clusters in the file system. If there is not
+ enough space for the reserved clusters when mounting the file system,
+ the mount will _not_ fail.
+
+Ioctls
+======
+
+There is some Ext4 specific functionality which can be accessed by applications
+through the system call interfaces. The list of all Ext4 specific ioctls is
+shown in the table below.
+
+Table of Ext4 specific ioctls
+
+ EXT4_IOC_GETFLAGS
+ Get additional attributes associated with inode. The ioctl argument is
+ an integer bitfield, with bit values described in ext4.h. This ioctl is
+ an alias for FS_IOC_GETFLAGS.
+
+ EXT4_IOC_SETFLAGS
+ Set additional attributes associated with inode. The ioctl argument is
+ an integer bitfield, with bit values described in ext4.h. This ioctl is
+ an alias for FS_IOC_SETFLAGS.
+
+ EXT4_IOC_GETVERSION, EXT4_IOC_GETVERSION_OLD
+ Get the inode i_generation number stored for each inode. The
+ i_generation number is normally changed only when a new inode is created
+ and it is particularly useful for network filesystems. The '_OLD'
+ version of this ioctl is an alias for FS_IOC_GETVERSION.
+
+ EXT4_IOC_SETVERSION, EXT4_IOC_SETVERSION_OLD
+ Set the inode i_generation number stored for each inode. The '_OLD'
+ version of this ioctl is an alias for FS_IOC_SETVERSION.
+
+ EXT4_IOC_GROUP_EXTEND
+ This ioctl has the same purpose as the resize mount option. It allows
+ resizing the filesystem to the end of the last existing block group;
+ further resizing has to be done with resize2fs, either online or
+ offline. The argument points to the unsigned long number representing
+ the filesystem's new block count.
+
+ EXT4_IOC_MOVE_EXT
+ Move the block extents from orig_fd (the one this ioctl is pointing to)
+ to the donor_fd (the one specified in move_extent structure passed as
+ an argument to this ioctl). Then, exchange inode metadata between
+ orig_fd and donor_fd. This is especially useful for online
+ defragmentation, because the allocator has the opportunity to allocate
+ moved blocks better, ideally into one contiguous extent.
+
+ EXT4_IOC_GROUP_ADD
+ Add a new group descriptor to an existing or new group descriptor
+ block. The new group descriptor is described by ext4_new_group_input
+ structure, which is passed as an argument to this ioctl. This is
+ especially useful in conjunction with EXT4_IOC_GROUP_EXTEND, which
+ allows online resize of the filesystem to the end of the last existing
+ block group. Those two ioctls combined are used in userspace online
+ resize tools (e.g. resize2fs).
+
+ EXT4_IOC_MIGRATE
+ This ioctl operates on the filesystem itself. It converts (migrates)
+ an ext3 indirect block mapped inode to an ext4 extent mapped inode by
+ walking through the indirect block mapping of the original inode and
+ converting contiguous block ranges into ext4 extents of a temporary
+ inode. Then, the inodes are swapped. This ioctl might help when migrating
+ from an ext3 to an ext4 filesystem; however, the suggestion is to create
+ a fresh ext4 filesystem and copy data from the backup. Note that the
+ filesystem has to support extents for this ioctl to work.
+
+ EXT4_IOC_ALLOC_DA_BLKS
+ Force all of the delay allocated blocks to be allocated to preserve
+ application-expected ext3 behaviour. Note that this will also start
+ triggering a write of the data blocks, but this behaviour may change in
+ the future as it is not necessary and has been done this way only for
+ the sake of simplicity.
+
+ EXT4_IOC_RESIZE_FS
+ Resize the filesystem to a new size. The number of blocks of the resized
+ filesystem is passed in via a 64-bit integer argument. The kernel
+ allocates bitmaps and the inode table, so the userspace tool just passes
+ the new number of blocks.
+
+ EXT4_IOC_SWAP_BOOT
+ Swap i_data and associated attributes (like i_blocks, i_size,
+ i_flags, ...) from the specified inode with inode EXT4_BOOT_LOADER_INO
+ (#5). This is typically used to store a boot loader in a secure part of
+ the filesystem, where it can't be changed by a normal user by accident.
+ The data blocks of the previous boot loader will be associated with the
+ given inode.
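
As an illustration of calling one of these ioctls from userspace, the sketch below reads the inode flags through FS_IOC_GETFLAGS, for which EXT4_IOC_GETFLAGS is an alias. It rests on assumptions that are not stated in this document: the request number is built as ``_IOR('f', 1, long)`` using the generic ioctl bit layout (as on x86 and arm; some architectures differ), and the kernel is assumed to store an ``int`` in the buffer. The helper names ``ioc_read`` and ``get_inode_flags`` are hypothetical.

```python
import fcntl
import struct

_IOC_READ = 2  # direction code in the generic ioctl layout (assumption)


def ioc_read(type_char: str, nr: int, size: int) -> int:
    """Build a read-direction ioctl number: dir<<30 | size<<16 | type<<8 | nr."""
    return (_IOC_READ << 30) | (size << 16) | (ord(type_char) << 8) | nr


def get_inode_flags(path: str) -> int:
    """Return the flag bitfield of a regular file via FS_IOC_GETFLAGS."""
    long_size = struct.calcsize("l")
    request = ioc_read("f", 1, long_size)  # assumed definition of FS_IOC_GETFLAGS
    buf = bytearray(long_size)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), request, buf)  # kernel fills buf in place
    # Only the first int-sized part of the buffer is interpreted (assumption).
    return struct.unpack_from("i", buf)[0]
```

The returned integer can then be tested against the bit values described in ext4.h.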
+
+References
+==========
+
+kernel source: <file:fs/ext4/>
+ <file:fs/jbd2/>
+
+programs: http://e2fsprogs.sourceforge.net/
+
+useful links: http://fedoraproject.org/wiki/ext3-devel
+ http://www.bullopensource.org/ext4/
+ http://ext4.wiki.kernel.org/index.php/Main_Page
+ http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
new file mode 100644
index 0000000..a244ba4
--- /dev/null
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+gpio
+====
+
+.. toctree::
+ :maxdepth: 1
+
+ sysfs
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/gpio/sysfs.rst b/Documentation/admin-guide/gpio/sysfs.rst
new file mode 100644
index 0000000..ec09ffd
--- /dev/null
+++ b/Documentation/admin-guide/gpio/sysfs.rst
@@ -0,0 +1,167 @@
+GPIO Sysfs Interface for Userspace
+==================================
+
+.. warning::
+
+ THIS ABI IS DEPRECATED, THE ABI DOCUMENTATION HAS BEEN MOVED TO
+ Documentation/ABI/obsolete/sysfs-gpio AND NEW USERSPACE CONSUMERS
+ ARE SUPPOSED TO USE THE CHARACTER DEVICE ABI. THIS OLD SYSFS ABI WILL
+ NOT BE DEVELOPED (NO NEW FEATURES), IT WILL JUST BE MAINTAINED.
+
+Refer to the examples in tools/gpio/* for an introduction to the new
+character device ABI. Also see the userspace header in
+include/uapi/linux/gpio.h
+
+The deprecated sysfs ABI
+------------------------
+Platforms which use the "gpiolib" implementors framework may choose to
+configure a sysfs user interface to GPIOs. This is different from the
+debugfs interface, since it provides control over GPIO direction and
+value instead of just showing a gpio state summary. Plus, it could be
+present on production systems without debugging support.
+
+Given appropriate hardware documentation for the system, userspace could
+know for example that GPIO #23 controls the write protect line used to
+protect boot loader segments in flash memory. System upgrade procedures
+may need to temporarily remove that protection, first importing a GPIO,
+then changing its output state, then updating the code before re-enabling
+the write protection. In normal use, GPIO #23 would never be touched,
+and the kernel would have no need to know about it.
+
+Again depending on appropriate hardware documentation, on some systems
+userspace GPIO can be used to determine system configuration data that
+standard kernels won't know about. And for some tasks, simple userspace
+GPIO drivers could be all that the system really needs.
+
+DO NOT ABUSE SYSFS TO CONTROL HARDWARE THAT HAS PROPER KERNEL DRIVERS.
+PLEASE READ THE DOCUMENT AT Documentation/driver-api/gpio/drivers-on-gpio.rst
+TO AVOID REINVENTING KERNEL WHEELS IN USERSPACE. I MEAN IT. REALLY.
+
+Paths in Sysfs
+--------------
+There are three kinds of entries in /sys/class/gpio:
+
+ - Control interfaces used to get userspace control over GPIOs;
+
+ - GPIOs themselves; and
+
+ - GPIO controllers ("gpio_chip" instances).
+
+That's in addition to standard files including the "device" symlink.
+
+The control interfaces are write-only:
+
+ /sys/class/gpio/
+
+ "export" ...
+ Userspace may ask the kernel to export control of
+ a GPIO to userspace by writing its number to this file.
+
+ Example: "echo 19 > export" will create a "gpio19" node
+ for GPIO #19, if that's not requested by kernel code.
+
+ "unexport" ...
+ Reverses the effect of exporting to userspace.
+
+ Example: "echo 19 > unexport" will remove a "gpio19"
+ node exported using the "export" file.
+
+GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
+and have the following read/write attributes:
+
+ /sys/class/gpio/gpioN/
+
+ "direction" ...
+ reads as either "in" or "out". This value may
+ normally be written. Writing as "out" defaults to
+ initializing the value as low. To ensure glitch free
+ operation, values "low" and "high" may be written to
+ configure the GPIO as an output with that initial value.
+
+ Note that this attribute *will not exist* if the kernel
+ doesn't support changing the direction of a GPIO, or
+ it was exported by kernel code that didn't explicitly
+ allow userspace to reconfigure this GPIO's direction.
+
+ "value" ...
+ reads as either 0 (low) or 1 (high). If the GPIO
+ is configured as an output, this value may be written;
+ any nonzero value is treated as high.
+
+ If the pin can be configured as an interrupt-generating input
+ and if it has been configured to generate interrupts (see the
+ description of "edge"), you can poll(2) on that file and
+ poll(2) will return whenever the interrupt was triggered. If
+ you use poll(2), set the events POLLPRI and POLLERR. If you
+ use select(2), set the file descriptor in exceptfds. After
+ poll(2) returns, either lseek(2) to the beginning of the sysfs
+ file and read the new value or close the file and re-open it
+ to read the value.
+
+ "edge" ...
+ reads as either "none", "rising", "falling", or
+ "both". Write these strings to select the signal edge(s)
+ that will make poll(2) on the "value" file return.
+
+ This file exists only if the pin can be configured as an
+ interrupt generating input pin.
+
+ "active_low" ...
+ reads as either 0 (false) or 1 (true). Write
+ any nonzero value to invert the value attribute both
+ for reading and writing. Existing and subsequent
+ poll(2) support configuration via the edge attribute
+ for "rising" and "falling" edges will follow this
+ setting.
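
The export, direction and value files described above can be exercised from a short script. This is a minimal sketch under stated assumptions: the helper names ``export_gpio``, ``drive_output`` and ``read_value`` are hypothetical, the ``root`` parameter exists only so the code can be pointed at a scratch directory, and the caller is assumed to have write permission on the sysfs files.

```python
import os


def export_gpio(number, root="/sys/class/gpio"):
    """Ask the kernel to export a GPIO unless its node already exists."""
    node = os.path.join(root, f"gpio{number}")
    if not os.path.isdir(node):
        with open(os.path.join(root, "export"), "w") as f:
            f.write(str(number))
    return node


def drive_output(number, high, root="/sys/class/gpio"):
    """Configure the GPIO as an output with a glitch-free initial value."""
    node = export_gpio(number, root)
    # Writing "high" or "low" sets direction and initial value in one step.
    with open(os.path.join(node, "direction"), "w") as f:
        f.write("high" if high else "low")


def read_value(number, root="/sys/class/gpio"):
    """Read the current value attribute: 0 (low) or 1 (high)."""
    with open(os.path.join(root, f"gpio{number}", "value")) as f:
        return int(f.read().strip())
```

Writing "high" or "low" to direction is used here instead of writing "out" and then setting value, because the text above notes that this avoids a glitch on the line.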
+
+GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
+controller implementing GPIOs starting at #42) and have the following
+read-only attributes:
+
+ /sys/class/gpio/gpiochipN/
+
+ "base" ...
+ same as N, the first GPIO managed by this chip
+
+ "label" ...
+ provided for diagnostics (not always unique)
+
+ "ngpio" ...
+ how many GPIOs this manages (N to N + ngpio - 1)
+
+Board documentation should in most cases cover what GPIOs are used for
+what purposes. However, those numbers are not always stable; GPIOs on
+a daughtercard might be different depending on the base board being used,
+or other cards in the stack. In such cases, you may need to use the
+gpiochip nodes (possibly in conjunction with schematics) to determine
+the correct GPIO number to use for a given signal.
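
When GPIO numbers are not stable, the gpiochip nodes can be scanned to find which controller owns a given number. The sketch below relies only on the "base", "label" and "ngpio" attributes described above; the function name ``find_chip`` and the ``root`` parameter are hypothetical.

```python
import os


def find_chip(gpio, root="/sys/class/gpio"):
    """Return (chip node, label, offset within chip) for a GPIO, or None."""
    for entry in sorted(os.listdir(root)):
        if not entry.startswith("gpiochip"):
            continue
        chip = os.path.join(root, entry)
        with open(os.path.join(chip, "base")) as f:
            base = int(f.read())
        with open(os.path.join(chip, "ngpio")) as f:
            ngpio = int(f.read())
        # A chip manages GPIOs base .. base + ngpio - 1.
        if base <= gpio < base + ngpio:
            with open(os.path.join(chip, "label")) as f:
                label = f.read().strip()
            return entry, label, gpio - base
    return None
```

Because labels are not always unique, the result is best cross-checked against board schematics.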
+
+
+Exporting from Kernel code
+--------------------------
+Kernel code can explicitly manage exports of GPIOs which have already been
+requested using gpio_request()::
+
+ /* export the GPIO to userspace */
+ int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
+
+ /* reverse gpiod_export() */
+ void gpiod_unexport(struct gpio_desc *desc);
+
+ /* create a sysfs link to an exported GPIO node */
+ int gpiod_export_link(struct device *dev, const char *name,
+ struct gpio_desc *desc);
+
+After a kernel driver requests a GPIO, it may only be made available in
+the sysfs interface by gpiod_export(). The driver can control whether the
+signal direction may change. This helps drivers prevent userspace code
+from accidentally clobbering important system state.
+
+This explicit exporting can help with debugging (by making some kinds
+of experiments easier), or can provide an always-there interface that's
+suitable for documenting as part of a board support package.
+
+After the GPIO has been exported, gpiod_export_link() allows creating
+symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
+use this to provide the interface under their own device in sysfs with
+a descriptive name.
diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst
new file mode 100644
index 0000000..6ee7046
--- /dev/null
+++ b/Documentation/admin-guide/highuid.rst
@@ -0,0 +1,80 @@
+===================================================
+Notes on the change from 16-bit UIDs to 32-bit UIDs
+===================================================
+
+:Author: Chris Wing <wingc@umich.edu>
+:Last updated: January 11, 2000
+
+- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t
+ when communicating between user and kernel space in an ioctl or data
+ structure.
+
+- kernel code should use uid_t and gid_t in kernel-private structures and
+ code.
+
+What's left to be done for 32-bit UIDs on all Linux architectures:
+
+- Disk quotas have an interesting limitation that is not related to the
+ maximum UID/GID. They are limited by the maximum file size on the
+ underlying filesystem, because quota records are written at offsets
+ corresponding to the UID in question.
+ Further investigation is needed to see if the quota system can cope
+ properly with huge UIDs. If it can deal with 64-bit file offsets on all
+ architectures, this should not be a problem.
+
+- Decide whether or not to keep backwards compatibility with the system
+ accounting file, or if we should break it as the comments suggest
+ (currently, the old 16-bit UID and GID are still written to disk, and
+ part of the former pad space is used to store separate 32-bit UID and
+ GID)
+
+- Need to validate that OS emulation calls the 16-bit UID
+ compatibility syscalls, if the OS being emulated used 16-bit UIDs, or
+ uses the 32-bit UID system calls properly otherwise.
+
+ This affects at least:
+
+ - iBCS on Intel
+
+ - sparc32 emulation on sparc64
+ (need to support whatever new 32-bit UID system calls are added to
+ sparc32)
+
+- Validate that all filesystems behave properly.
+
+ At present, 32-bit UIDs _should_ work for:
+
+ - ext2
+ - ufs
+ - isofs
+ - nfs
+ - coda
+ - udf
+
+ Ioctl() fixups have been made for:
+
+ - ncpfs
+ - smbfs
+
+ Filesystems with simple fixups to prevent 16-bit UID wraparound:
+
+ - minix
+ - sysv
+ - qnx4
+
+ Other filesystems have not been checked yet.
+
+- The ncpfs and smbfs filesystems cannot presently use 32-bit UIDs in
+ all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but
+ more are needed. (as well as new user<->kernel data structures)
+
+- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k,
+ sh, and sparc32. Fixing this is probably not that important, but would
+ require adding a new ELF section.
+
+- The ioctl()s used to control the in-kernel NFS server only support
+ 16-bit UIDs on arm, i386, m68k, sh, and sparc32.
+
+- make sure that the UID mapping feature of AX25 networking works properly
+ (it should be safe because it's always used a 32-bit integer to
+ communicate between user and kernel)
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
new file mode 100644
index 0000000..0795e3c
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -0,0 +1,16 @@
+========================
+Hardware vulnerabilities
+========================
+
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+ :maxdepth: 1
+
+ spectre
+ l1tf
+ mds
+ tsx_async_abort
+ multihit.rst
diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/hw-vuln/l1tf.rst
similarity index 98%
rename from Documentation/admin-guide/l1tf.rst
rename to Documentation/admin-guide/hw-vuln/l1tf.rst
index bae52b8..f83212f 100644
--- a/Documentation/admin-guide/l1tf.rst
+++ b/Documentation/admin-guide/hw-vuln/l1tf.rst
@@ -241,7 +241,7 @@
For further information about confining guests to a single or to a group
of cores consult the cpusets documentation:
- https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+ https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
.. _interrupt_isolation:
@@ -405,6 +405,9 @@
off Disables hypervisor mitigations and doesn't emit any
warnings.
+ It also drops the swap size and available RAM limit restrictions
+ on both hypervisor and bare metal.
+
============ =============================================================
The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
@@ -442,6 +445,7 @@
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
+.. _mitigation_selection:
Mitigation selection guide
--------------------------
@@ -553,7 +557,7 @@
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine. VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
-bare metal hypervisor it wiil:
+bare metal hypervisor it will:
- Flush the L1D cache on every switch from the nested hypervisor to the
nested virtual machine, so that the nested hypervisor's secrets are not
@@ -576,7 +580,8 @@
The kernel default mitigations for vulnerable processors are:
- PTE inversion to protect against malicious user space. This is done
- unconditionally and cannot be controlled.
+ unconditionally and cannot be controlled. The swap storage is limited
+ to ~16TB.
- L1D conditional flushing on VMENTER when EPT is enabled for
a guest.
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
new file mode 100644
index 0000000..2d19c9f
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -0,0 +1,311 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a hardware vulnerability which allows
+unprivileged speculative access to data which is available in various CPU
+internal buffers.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+ - Processors from AMD, Centaur and other non Intel vendors
+
+ - Older processor models, where the CPU family is < 6
+
+ - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+ - Intel processors which have the ARCH_CAP_MDS_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+Whether a processor is affected or not can be read out from the MDS
+vulnerability file in sysfs. See :ref:`mds_sys_info`.
+
+Not all processors are affected by all variants of MDS, but the mitigation
+is identical for all of them so the kernel treats them as a single
+vulnerability.
+
+Related CVEs
+------------
+
+The following CVE entries are related to the MDS vulnerability:
+
+ ============== ===== ===================================================
+ CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling
+ CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling
+ CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling
+ CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory
+ ============== ===== ===================================================
+
+Problem
+-------
+
+When performing store, load, or L1 refill operations, processors write data
+into temporary microarchitectural structures (buffers). The data in the
+buffer can be forwarded to load operations as an optimization.
+
+Under certain conditions, usually a fault/assist caused by a load
+operation, data unrelated to the load memory address can be speculatively
+forwarded from the buffers. Because the load operation causes a fault or
+assist and its result will be discarded, the forwarded data will not cause
+incorrect program execution or state changes. But a malicious operation
+may be able to forward this speculative data to a disclosure gadget, which
+in turn allows the value to be inferred via a cache side channel attack.
+
+Because the buffers are potentially shared between Hyper-Threads, cross
+Hyper-Thread attacks are possible.
+
+Deeper technical information is available in the MDS specific x86
+architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the MDS vulnerabilities can be mounted from malicious
+unprivileged user space applications running on hosts or guests. Malicious
+guest OSes can obviously mount attacks as well.
+
+Contrary to other speculation based vulnerabilities, the MDS vulnerability
+does not allow the attacker to control the memory target address. As a
+consequence, the attacks are purely sampling based, but as demonstrated with
+the TLBleed attack, samples can be postprocessed successfully.
+
+Web-Browsers
+^^^^^^^^^^^^
+
+ It's unclear whether attacks through Web-Browsers are possible at
+ all. The exploitation through Java-Script is considered very unlikely,
+ but other widely used web technologies like Webassembly could possibly be
+ abused.
+
+
+.. _mds_sys_info:
+
+MDS system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current MDS
+status of the system: whether the system is vulnerable, and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation is enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable but microcode is not updated.
+
+ The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+
+If the processor is vulnerable then the following information is appended
+to the above information:
+
+ ======================== ============================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT mitigated' SMT is enabled and mitigated
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ============================================
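
A script that consumes this file can split the status from the appended SMT information. The sketch below assumes that the two parts are separated by "; " on a single line, which is not stated in this document; the function name ``parse_mds_status`` is hypothetical.

```python
def parse_mds_status(text):
    """Split the sysfs string into (status, smt_state or None)."""
    # Assumption: the SMT information is appended after a "; " separator.
    status, _, smt = text.strip().partition("; ")
    return status, (smt or None)


def read_mds_status(path="/sys/devices/system/cpu/vulnerabilities/mds"):
    """Read and parse the MDS vulnerability file."""
    with open(path) as f:
        return parse_mds_status(f.read())
```

A monitoring check could, for example, flag any host whose status does not start with "Not affected" or "Mitigation:".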
+
+.. _vmwerv:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the processor is vulnerable, but the availability of the microcode based
+ mitigation mechanism is not advertised via CPUID the kernel selects a best
+ effort mitigation mode. This mode invokes the mitigation instructions
+ without a guarantee that they clear the CPU buffers.
+
+ This is done to address virtualization scenarios where the host has the
+ microcode update applied, but the hypervisor is not yet updated to expose
+ the CPUID to the guest. If the host has updated microcode the protection
+ takes effect; otherwise a few CPU cycles are wasted pointlessly.
+
+ The state in the mds sysfs file reflects this situation accordingly.
+
+
+Mitigation mechanism
+-------------------------
+
+The kernel detects the affected CPUs and the presence of the microcode
+which is required.
+
+If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default. The mitigation can be controlled at boot
+time via a kernel command line option. See
+:ref:`mds_mitigation_control_command_line`.
+
+.. _cpu_buffer_clear:
+
+CPU buffer clearing
+^^^^^^^^^^^^^^^^^^^
+
+ The mitigation for MDS clears the affected CPU buffers on return to user
+ space and when entering a guest.
+
+ If SMT is enabled it also clears the buffers on idle entry when the CPU
+ is only affected by MSBDS and not any other MDS variant, because the
+ other variants cannot be protected against cross Hyper-Thread attacks.
+
+ For CPUs which are only affected by MSBDS the user space, guest and idle
+ transition mitigations are sufficient and SMT is not affected.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection for host to guest transition depends on the L1TF
+ vulnerability of the CPU:
+
+ - CPU is affected by L1TF:
+
+ If the L1D flush mitigation is enabled and up to date microcode is
+ available, the L1D flush mitigation is automatically protecting the
+ guest transition.
+
+ If the L1D flush mitigation is disabled then the MDS mitigation is
+ invoked explicitly when the host MDS mitigation is enabled.
+
+ For details on L1TF and virtualization see:
+ :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <mitigation_control_kvm>`.
+
+ - CPU is not affected by L1TF:
+
+ CPU buffers are flushed before entering the guest when the host MDS
+ mitigation is enabled.
+
+ The resulting MDS protection matrix for the host to guest transition:
+
+ ============ ===== ============= ============ =================
+ L1TF MDS VMX-L1FLUSH Host MDS MDS-State
+
+ Don't care No Don't care N/A Not affected
+
+ Yes Yes Disabled Off Vulnerable
+
+ Yes Yes Disabled Full Mitigated
+
+ Yes Yes Enabled Don't care Mitigated
+
+ No Yes N/A Off Vulnerable
+
+ No Yes N/A Full Mitigated
+ ============ ===== ============= ============ =================
+
+ This only covers the host to guest transition, i.e. prevents leakage from
+ host to guest, but does not protect the guest internally. Guests need to
+ have their own protections.
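
The protection matrix above can be restated as a small decision function. This is only a transcription of the table for illustration, not kernel code; the function and parameter names are hypothetical, and ``vmx_l1flush`` is assumed to mean that the L1D flush mitigation is enabled with up to date microcode.

```python
def guest_transition_state(l1tf, mds, vmx_l1flush, host_mds_full):
    """Return the MDS state for the host to guest transition.

    l1tf, mds      -- whether the CPU is affected by L1TF / MDS
    vmx_l1flush    -- whether the L1D flush on VMENTER is enabled
    host_mds_full  -- whether the host MDS mitigation is enabled ("Full")
    """
    if not mds:
        return "Not affected"
    if l1tf and vmx_l1flush:
        # The L1D flush already protects the guest transition.
        return "Mitigated"
    return "Mitigated" if host_mds_full else "Vulnerable"
```

Note that the L1D flush column only matters on CPUs that are affected by L1TF; on other CPUs the outcome depends on the host MDS mitigation alone.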
+
+.. _xeon_phi:
+
+XEON PHI specific considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The XEON PHI processor family is affected by MSBDS which can be exploited
+ cross Hyper-Threads when entering idle states. Some XEON PHI variants allow
+ the use of MWAIT in user space (Ring 3), which opens a potential attack
+ vector for malicious user space. The exposure can be disabled on the kernel
+ command line with the 'ring3mwait=disable' command line option.
+
+ XEON PHI is not affected by the other MDS variants and MSBDS is mitigated
+ before the CPU enters an idle state. As XEON PHI is not affected by L1TF
+ either, disabling SMT is not required for full protection.
+
+.. _mds_smt_control:
+
+SMT control
+^^^^^^^^^^^
+
+ All MDS variants except MSBDS can be attacked cross Hyper-Threads. That
+ means on CPUs which are affected by MFBDS or MLPDS it is necessary to
+ disable SMT for full protection. These are most of the affected CPUs; the
+ exception is XEON PHI, see :ref:`xeon_phi`.
+
+ Disabling SMT can have a significant performance impact, but the impact
+ depends on the type of workloads.
+
+ See the relevant chapter in the L1TF mitigation documentation for details:
+ :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
+
+
+.. _mds_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows the MDS mitigations to be controlled at boot
+time with the option "mds=". The valid arguments for this option are:
+
+ ============ =============================================================
+ full If the CPU is vulnerable, enable all available mitigations
+ for the MDS vulnerability, CPU buffer clearing on exit to
+ userspace and when entering a VM. Idle transitions are
+ protected as well if SMT is enabled.
+
+ It does not automatically disable SMT.
+
+ full,nosmt The same as mds=full, with SMT disabled on vulnerable
+ CPUs. This is the complete mitigation.
+
+ off Disables MDS mitigations completely.
+
+ ============ =============================================================
+
+Not specifying this option is equivalent to "mds=full". For processors
+that are affected by both TAA (TSX Asynchronous Abort) and MDS,
+specifying just "mds=off" without an accompanying "tsx_async_abort=off"
+will have no effect as the same mitigation is used for both
+vulnerabilities.
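
The interaction between "mds=" and "tsx_async_abort=" described above can be sketched as follows, for a CPU that is affected by MDS. This is an illustration of the rule as stated in this section, not the kernel's selection logic; the function and parameter names are hypothetical, and any "tsx_async_abort=" value other than "off" is assumed to keep the TAA mitigation enabled.

```python
def cpu_buffer_clearing_active(mds="full", taa_affected=False,
                               tsx_async_abort="full"):
    """Return True if CPU buffer clearing stays enabled on an MDS-affected CPU."""
    mds_wants_clearing = mds != "off"  # "full" and "full,nosmt" both enable it
    # TAA uses the same mitigation, so it keeps clearing on unless disabled too.
    taa_wants_clearing = taa_affected and tsx_async_abort != "off"
    return mds_wants_clearing or taa_wants_clearing
```

On a processor affected by both issues, the function returns False only when both options are set to "off".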
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+ If all userspace applications are from a trusted source and do not
+ execute untrusted code which is supplied externally, then the mitigation
+ can be disabled.
+
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The same considerations as above versus trusted user space apply.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The protection depends on the state of the L1TF mitigations.
+ See :ref:`virt_mechanism`.
+
+ If the MDS mitigation is enabled and SMT is disabled, guest to host and
+ guest to guest attacks are prevented.
+
+.. _mds_default_mitigations:
+
+Default mitigations
+-------------------
+
+ The kernel default mitigations for vulnerable processors are:
+
+ - Enable CPU buffer clearing
+
+ The kernel does not by default enforce the disabling of SMT, which leaves
+ SMT systems vulnerable when running untrusted code. The same rationale as
+ for L1TF applies.
+ See :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <default_mitigations>`.
diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst
new file mode 100644
index 0000000..ba9988d
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/multihit.rst
@@ -0,0 +1,163 @@
+iTLB multihit
+=============
+
+iTLB multihit is an erratum where some processors may incur a machine check
+error, possibly resulting in an unrecoverable CPU lockup, when an
+instruction fetch hits multiple entries in the instruction TLB. This can
+occur when the page size is changed along with either the physical address
+or cache type. A malicious guest running on a virtualized system can
+exploit this erratum to perform a denial of service attack.
+
+
+Affected processors
+-------------------
+
+Variations of this erratum are present on most Intel Core and Xeon processor
+models. The erratum is not present on:
+
+ - non-Intel processors
+
+ - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont)
+
+ - Intel processors that have the PSCHANGE_MC_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR.
+
+
+Related CVEs
+------------
+
+The following CVE entry is related to this issue:
+
+ ============== =================================================
+ CVE-2018-12207 Machine Check Error Avoidance on Page Size Change
+ ============== =================================================
+
+
+Problem
+-------
+
+Privileged software, including OS and virtual machine managers (VMM), is in
+charge of memory management. A key component in memory management is the control
+of the page tables. Modern processors use virtual memory, a technique that creates
+the illusion of a very large memory for processors. This virtual space is split
+into pages of a given size. Page tables translate virtual addresses to physical
+addresses.
+
+To reduce latency when performing a virtual to physical address translation,
+processors include a structure, called TLB, that caches recent translations.
+There are separate TLBs for instruction (iTLB) and data (dTLB).
+
+Under this erratum, instructions are fetched from a linear address translated
+using a 4 KB translation cached in the iTLB. Privileged software modifies the
+paging structure so that the same linear address uses a large page size (2 MB, 4
+MB, 1 GB) with a different physical address or memory type. After the page
+structure modification but before the software invalidates any iTLB entries for
+the linear address, a code fetch that happens on the same linear address may
+cause a machine-check error which can result in a system hang or shutdown.
+
+
+Attack scenarios
+----------------
+
+Attacks against the iTLB multihit erratum can be mounted from malicious
+guests in a virtualized system.
+
+
+iTLB multihit system information
+--------------------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current iTLB
+multihit status of the system: whether the system is vulnerable and which
+mitigations are active. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/itlb_multihit
+
+The possible values in this file are:
+
+.. list-table::
+
+ * - Not affected
+ - The processor is not vulnerable.
+ * - KVM: Mitigation: Split huge pages
+ - Software changes mitigate this issue.
+ * - KVM: Vulnerable
+ - The processor is vulnerable, but no mitigation is enabled.
+
+
+Enumeration of the erratum
+--------------------------------
+
+A new bit (PSCHANGE_MC_NO) has been allocated in the IA32_ARCH_CAPABILITIES MSR
+and will be set on CPUs which are mitigated against this issue.
+
+ ======================================= =========== ================================
+ IA32_ARCH_CAPABILITIES MSR Not present Possibly vulnerable, check model
+ IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '0' Likely vulnerable, check model
+ IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '1' Not vulnerable
+ ======================================= =========== ================================
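
The enumeration table can be read as a three-way decision. In the sketch below the MSR value is passed in by the caller (``None`` when the MSR is not present); reading the MSR itself is out of scope. The bit position used for PSCHANGE_MC_NO (bit 6) is an assumption that is not given in this document, and the function name ``classify_itlb_multihit`` is hypothetical.

```python
PSCHANGE_MC_NO = 1 << 6  # assumed bit position within IA32_ARCH_CAPABILITIES


def classify_itlb_multihit(arch_capabilities):
    """Map an IA32_ARCH_CAPABILITIES value (or None) to the table's verdict."""
    if arch_capabilities is None:
        return "Possibly vulnerable, check model"
    if arch_capabilities & PSCHANGE_MC_NO:
        return "Not vulnerable"
    return "Likely vulnerable, check model"
```

In the two "check model" cases, the processor model still has to be compared against the list of unaffected models given earlier.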
+
+
+Mitigation mechanism
+-------------------------
+
+This erratum can be mitigated by restricting the use of large page sizes to
+non-executable pages. This forces all iTLB entries to be 4K, and removes
+the possibility of multiple hits.
+
+In order to mitigate the vulnerability, KVM initially marks all huge pages
+as non-executable. If the guest attempts to execute in one of those pages,
+the page is broken down into 4K pages, which are then marked executable.
+
+If EPT is disabled or not available on the host, KVM is in control of TLB
+flushes and the problematic situation cannot happen. However, the shadow
+EPT paging mechanism used by nested virtualization is vulnerable, because
+the nested guest can trigger multiple iTLB hits by modifying its own
+(non-nested) page tables. For simplicity, KVM will make large pages
+non-executable in all shadow paging modes.
+
+Mitigation control on the kernel command line and KVM - module parameter
+------------------------------------------------------------------------
+
+The KVM hypervisor mitigation mechanism for marking huge pages as
+non-executable can be controlled with a module parameter "nx_huge_pages=".
+The kernel command line allows the iTLB multihit mitigations to be controlled
+at boot time with the option "kvm.nx_huge_pages=".
+
+The valid arguments for these options are:
+
+ ==========  ================================================================
+ force       Mitigation is enabled. In this case, the mitigation implements
+             non-executable huge pages in the Linux kernel KVM module. All huge
+             pages in the EPT are marked as non-executable.
+             If a guest attempts to execute in one of those pages, the page is
+             broken down into 4K pages, which are then marked executable.
+
+ off         Mitigation is disabled.
+
+ auto        Enable mitigation only if the platform is affected and the kernel
+             was not booted with the "mitigations=off" command line parameter.
+             This is the default option.
+ ==========  ================================================================
+
+
+Mitigation selection guide
+--------------------------
+
+1. No virtualization in use
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The system is protected by the kernel unconditionally and no further
+ action is required.
+
+2. Virtualization with trusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ If the guest comes from a trusted source, you may assume that the guest will
+ not attempt to maliciously exploit this erratum and no further action is
+ required.
+
+3. Virtualization with untrusted guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ If the guest comes from an untrusted source, the host kernel will need
+ to apply the iTLB multihit mitigation via the kernel command line or the KVM
+ module parameter.
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
new file mode 100644
index 0000000..e05e581
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -0,0 +1,769 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Spectre Side Channels
+=====================
+
+Spectre is a class of side channel attacks that exploit branch prediction
+and speculative execution on modern CPUs to read memory, possibly
+bypassing access controls. Speculative execution side channel exploits
+do not modify memory but attempt to infer privileged data in the memory.
+
+This document covers Spectre variant 1 and Spectre variant 2.
+
+Affected processors
+-------------------
+
+Speculative execution side channel methods affect a wide range of modern
+high performance processors, since most modern high speed processors
+use branch prediction and speculative execution.
+
+The following CPUs are vulnerable:
+
+ - Intel Core, Atom, Pentium, and Xeon processors
+
+ - AMD Phenom, EPYC, and Zen processors
+
+ - IBM POWER and zSeries processors
+
+ - Higher end ARM processors
+
+ - Apple CPUs
+
+ - Higher end MIPS CPUs
+
+ - Likely most other high performance CPUs. Contact your CPU vendor for details.
+
+Whether a processor is affected or not can be read out from the Spectre
+vulnerability files in sysfs. See :ref:`spectre_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entries describe Spectre variants:
+
+ =============  =======================  ==========================
+ CVE-2017-5753  Bounds check bypass      Spectre variant 1
+ CVE-2017-5715  Branch target injection  Spectre variant 2
+ CVE-2019-1125  Spectre v1 swapgs        Spectre variant 1 (swapgs)
+ =============  =======================  ==========================
+
+Problem
+-------
+
+CPUs use speculative operations to improve performance. That may leave
+traces of memory accesses or computations in the processor's caches,
+buffers, and branch predictors. Malicious software may be able to
+influence the speculative execution paths, and then use the side effects
+of the speculative execution in the CPUs' caches and buffers to infer
+privileged data touched during the speculative execution.
+
+Spectre variant 1 attacks take advantage of speculative execution of
+conditional branches, while Spectre variant 2 attacks use speculative
+execution of indirect branches to leak privileged memory.
+See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
+:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
+
+Spectre variant 1 (Bounds Check Bypass)
+---------------------------------------
+
+The bounds check bypass attack :ref:`[2] <spec_ref2>` takes advantage
+of speculative execution that bypasses conditional branch instructions
+used for memory access bounds check (e.g. checking if the index of an
+array results in memory access within a valid range). This results in
+memory accesses to invalid memory (with out-of-bound index) that are
+done speculatively before validation checks resolve. Such speculative
+memory accesses can leave side effects, creating side channels which
+leak information to the attacker.
+
+There are some extensions of Spectre variant 1 attacks for reading data
+over the network, see :ref:`[12] <spec_ref12>`. However such attacks
+are difficult, low bandwidth, fragile, and are considered low risk.
+
+Note that, despite the "Bounds Check Bypass" name, Spectre variant 1 is not
+only about user-controlled array bounds checks. It can affect any
+conditional checks. The kernel entry code interrupt, exception, and NMI
+handlers all have conditional swapgs checks. Those may be problematic
+in the context of Spectre v1, as kernel code can speculatively run with
+a user GS.
+
+Spectre variant 2 (Branch Target Injection)
+-------------------------------------------
+
+The branch target injection attack takes advantage of speculative
+execution of indirect branches :ref:`[3] <spec_ref3>`. The indirect
+branch predictors inside the processor used to guess the target of
+indirect branches can be influenced by an attacker, causing gadget code
+to be speculatively executed, thus exposing sensitive data touched by
+the victim. The side effects left in the CPU's caches during speculative
+execution can be measured to infer data values.
+
+.. _poison_btb:
+
+In Spectre variant 2 attacks, the attacker can steer speculative indirect
+branches in the victim to gadget code by poisoning the branch target
+buffer of a CPU used for predicting indirect branch addresses. Such
+poisoning could be done by indirect branching into existing code,
+with the address offset of the indirect branch under the attacker's
+control. Since the branch prediction on impacted hardware does not
+fully disambiguate branch address and uses the offset for prediction,
+this could cause privileged code's indirect branch to jump to gadget
+code with the same offset.
+
+The most useful gadgets take an attacker-controlled input parameter (such
+as a register value) so that the memory read can be controlled. Gadgets
+without input parameters might be possible, but the attacker would have
+very little control over what memory can be read, reducing the risk of
+the attack revealing useful data.
+
+One other variant 2 attack vector is for the attacker to poison the
+return stack buffer (RSB) :ref:`[13] <spec_ref13>` to cause speculative
+subroutine return instruction execution to go to a gadget. An attacker's
+imbalanced subroutine call instructions might "poison" entries in the
+return stack buffer which are later consumed by a victim's subroutine
+return instructions. This attack can be mitigated by flushing the return
+stack buffer on context switch, or virtual machine (VM) exit.
+
+On systems with simultaneous multi-threading (SMT), attacks are possible
+from the sibling thread, as level 1 cache and branch target buffer
+(BTB) may be shared between hardware threads in a CPU core. A malicious
+program running on the sibling thread may influence its peer's BTB to
+steer its indirect branch speculations to gadget code, and measure the
+speculative execution's side effects left in level 1 cache to infer the
+victim's data.
+
+Attack scenarios
+----------------
+
+The following list of attack scenarios has been anticipated, but may
+not cover all possible attack vectors.
+
+1. A user process attacking the kernel
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Spectre variant 1
+~~~~~~~~~~~~~~~~~
+
+ The attacker passes a parameter to the kernel via a register or
+ via a known address in memory during a syscall. Such a parameter may
+ be used later by the kernel as an index to an array or to derive
+ a pointer for a Spectre variant 1 attack. The index or pointer
+ is invalid, but bounds checks are bypassed in the code branch taken
+ for speculative execution. This could cause privileged memory to be
+ accessed and leaked.
+
+ For kernel code that has been identified where data pointers could
+ potentially be influenced for Spectre attacks, new "nospec" accessor
+ macros are used to prevent speculative loading of data.
+
+Spectre variant 1 (swapgs)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ An attacker can train the branch predictor to speculatively skip the
+ swapgs path for an interrupt or exception. If they initialize
+ the GS register to a user-space value and the swapgs is speculatively
+ skipped, subsequent GS-related percpu accesses in the speculation
+ window will be done with the attacker-controlled GS value. This
+ could cause privileged memory to be accessed and leaked.
+
+ For example:
+
+ ::
+
+   if (coming from user space)
+       swapgs
+   mov %gs:<percpu_offset>, %reg
+   mov (%reg), %reg1
+
+ When coming from user space, the CPU can speculatively skip the
+ swapgs, and then do a speculative percpu load using the user GS
+ value. So the user can speculatively force a read of any kernel
+ value. If a gadget exists which uses the percpu value as an address
+ in another load/store, then the contents of the kernel value may
+ become visible via an L1 side channel attack.
+
+ A similar attack exists when coming from kernel space. The CPU can
+ speculatively do the swapgs, causing the user GS to get used for the
+ rest of the speculative window.
+
+Spectre variant 2
+~~~~~~~~~~~~~~~~~
+
+ A Spectre variant 2 attacker can :ref:`poison <poison_btb>` the branch
+ target buffer (BTB) before issuing a syscall to launch an attack.
+ After entering the kernel, the kernel could use the poisoned branch
+ target buffer on an indirect jump and jump to gadget code in speculative
+ execution.
+
+ If an attacker tries to control the memory addresses leaked during
+ speculative execution, they would also need to pass a parameter to the
+ gadget, either through a register or a known address in memory. After
+ the gadget has executed, they can measure the side effect.
+
+ The kernel can protect itself against consuming poisoned branch
+ target buffer entries by using return trampolines (also known as
+ "retpoline") :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` for all
+ indirect branches. Return trampolines trap speculative execution paths
+ to prevent jumping to gadget code during speculative execution.
+ x86 CPUs with Enhanced Indirect Branch Restricted Speculation
+ (Enhanced IBRS) available in hardware should use the feature to
+ mitigate Spectre variant 2 instead of retpoline. Enhanced IBRS is
+ more efficient than retpoline.
+
+ There may be gadget code in firmware which could be exploited with
+ a Spectre variant 2 attack by a rogue user process. To mitigate such
+ attacks on x86, the Indirect Branch Restricted Speculation (IBRS) feature
+ is turned on before the kernel invokes any firmware code.
+
+2. A user process attacking another user process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ A malicious user process can try to attack another user process,
+ either via a context switch on the same hardware thread, or from the
+ sibling hyperthread sharing a physical processor core on a simultaneous
+ multi-threading (SMT) system.
+
+ Spectre variant 1 attacks generally require passing parameters
+ between the processes, which needs a data passing relationship, such
+ as remote procedure calls (RPC). Those parameters are used in gadget
+ code to derive invalid data pointers accessing privileged memory in
+ the attacked process.
+
+ Spectre variant 2 attacks can be launched from a rogue process by
+ :ref:`poisoning <poison_btb>` the branch target buffer. This can
+ influence the indirect branch targets for a victim process that either
+ runs later on the same hardware thread, or runs concurrently on
+ a sibling hardware thread sharing the same physical core.
+
+ A user process can protect itself against Spectre variant 2 attacks
+ by using the prctl() syscall to disable indirect branch speculation
+ for itself. An administrator can also cordon off an unsafe process
+ from polluting the branch target buffer by disabling the process's
+ indirect branch speculation. This comes with a performance cost
+ from not using indirect branch speculation and clearing the branch
+ target buffer. When SMT is enabled on x86, for a process that has
+ indirect branch speculation disabled, Single Threaded Indirect Branch
+ Predictors (STIBP) :ref:`[4] <spec_ref4>` are turned on to prevent the
+ sibling thread from controlling the branch target buffer. In addition,
+ the Indirect Branch Prediction Barrier (IBPB) is issued to clear the
+ branch target buffer when context switching to and from such a process.
+
+ On x86, the return stack buffer is stuffed on context switch.
+ This prevents the branch target buffer from being used for branch
+ prediction when the return stack buffer underflows while switching to
+ a deeper call stack. Any poisoned entries in the return stack buffer
+ left by the previous process will also be cleared.
+
+ User programs should use address space randomization to make attacks
+ more difficult (Set /proc/sys/kernel/randomize_va_space = 1 or 2).
+
+3. A virtualized guest attacking the host
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The attack mechanism is similar to how user processes attack the
+ kernel. The kernel is entered via hyper-calls or other virtualization
+ exit paths.
+
+ For Spectre variant 1 attacks, rogue guests can pass parameters
+ (e.g. in registers) via hyper-calls to derive invalid pointers to
+ speculate into privileged memory after entering the kernel. For places
+ where such kernel code has been identified, nospec accessor macros
+ are used to stop speculative memory access.
+
+ For Spectre variant 2 attacks, rogue guests can :ref:`poison
+ <poison_btb>` the branch target buffer or return stack buffer, causing
+ the kernel to jump to gadget code in the speculative execution paths.
+
+ To mitigate variant 2, the host kernel can use return trampolines
+ for indirect branches to bypass the poisoned branch target buffer,
+ and flush the return stack buffer on VM exit. This prevents rogue
+ guests from affecting indirect branching in the host kernel.
+
+ To protect host processes from rogue guests, host processes can have
+ indirect branch speculation disabled via prctl(). The branch target
+ buffer is cleared before context switching to such processes.
+
+4. A virtualized guest attacking other guest
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ A rogue guest may attack another guest to get data accessible by the
+ other guest.
+
+ Spectre variant 1 attacks are possible if parameters can be passed
+ between guests. This may be done via mechanisms such as shared memory
+ or message passing. Such parameters could be used to derive data
+ pointers to privileged data in the guest. The privileged data could be
+ accessed by gadget code in the victim's speculation paths.
+
+ Spectre variant 2 attacks can be launched from a rogue guest by
+ :ref:`poisoning <poison_btb>` the branch target buffer or the return
+ stack buffer. Such poisoned entries could be used to influence
+ speculative execution paths in the victim guest.
+
+ The Linux kernel mitigates attacks on other guests running in the same
+ CPU hardware thread by flushing the return stack buffer on VM exit,
+ and clearing the branch target buffer before switching to a new guest.
+
+ If SMT is used, Spectre variant 2 attacks from an untrusted guest
+ in the sibling hyperthread can be mitigated by the administrator,
+ by turning off the unsafe guest's indirect branch speculation via
+ prctl(). A guest can also protect itself by turning on microcode
+ based mitigations (such as IBPB or STIBP on x86) within the guest.
+
+.. _spectre_sys_info:
+
+Spectre system information
+--------------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current
+mitigation status of the system for Spectre: whether the system is
+vulnerable, and which mitigations are active.
+
+The sysfs file showing Spectre variant 1 mitigation status is:
+
+ /sys/devices/system/cpu/vulnerabilities/spectre_v1
+
+The possible values in this file are:
+
+ .. list-table::
+
+    * - 'Not affected'
+      - The processor is not vulnerable.
+    * - 'Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers'
+      - The swapgs protections are disabled; otherwise it has
+        protection in the kernel on a case by case basis with explicit
+        pointer sanitization and usercopy LFENCE barriers.
+    * - 'Mitigation: usercopy/swapgs barriers and __user pointer sanitization'
+      - Protection in the kernel on a case by case basis with explicit
+        pointer sanitization, usercopy LFENCE barriers, and swapgs LFENCE
+        barriers.
+
+However, the protections are put in place on a case by case basis,
+and there is no guarantee that all possible attack vectors for Spectre
+variant 1 are covered.
+
+The spectre_v2 kernel file reports if the kernel has been compiled with
+retpoline mitigation or if the CPU has hardware mitigation, and if the
+CPU has support for additional process-specific mitigation.
+
+This file also reports CPU features enabled by microcode to mitigate
+attacks between user processes:
+
+1. Indirect Branch Prediction Barrier (IBPB) to add additional
+ isolation between processes of different users.
+2. Single Thread Indirect Branch Predictors (STIBP) to add additional
+ isolation between CPU threads running on the same core.
+
+These CPU features may impact performance when used and can be enabled
+per process on a case-by-case basis.
+
+The sysfs file showing Spectre variant 2 mitigation status is:
+
+ /sys/devices/system/cpu/vulnerabilities/spectre_v2
+
+The possible values in this file are:
+
+ - Kernel status:
+
+ ====================================  =================================
+ 'Not affected'                        The processor is not vulnerable
+ 'Vulnerable'                          Vulnerable, no mitigation
+ 'Mitigation: Full generic retpoline'  Software-focused mitigation
+ 'Mitigation: Full AMD retpoline'      AMD-specific software mitigation
+ 'Mitigation: Enhanced IBRS'           Hardware-focused mitigation
+ ====================================  =================================
+
+ - Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
+ used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
+
+ ==========  =============================================================
+ 'IBRS_FW'   Protection against user program attacks when calling firmware
+ ==========  =============================================================
+
+ - Indirect branch prediction barrier (IBPB) status for protection between
+ processes of different users. This feature can be controlled through
+ prctl() per process, or through kernel command line options. This is
+ an x86 only feature. For more details see below.
+
+ ===================  ========================================================
+ 'IBPB: disabled'     IBPB unused
+ 'IBPB: always-on'    Use IBPB on all tasks
+ 'IBPB: conditional'  Use IBPB on SECCOMP or indirect branch restricted tasks
+ ===================  ========================================================
+
+ - Single thread indirect branch predictors (STIBP) status for protection
+ between different hyper threads. This feature can be controlled through
+ prctl() per process, or through kernel command line options. This is
+ an x86 only feature. For more details see below.
+
+ ====================  ========================================================
+ 'STIBP: disabled'     STIBP unused
+ 'STIBP: forced'       Use STIBP on all tasks
+ 'STIBP: conditional'  Use STIBP on SECCOMP or indirect branch restricted tasks
+ ====================  ========================================================
+
+ - Return stack buffer (RSB) protection status:
+
+ ============= ===========================================
+ 'RSB filling' Protection of RSB on context switch enabled
+ ============= ===========================================
+
+Full mitigation might require a microcode update from the CPU
+vendor. When the necessary microcode is not available, the kernel will
+report vulnerability.
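+
+As a minimal sketch (not taken from the kernel sources), a C program could
+read the two status files named above as follows. The helper names
+``read_first_line()`` and ``print_spectre_status()`` are hypothetical and
+exist only for this example. On kernels without these files the helper
+simply reports failure.
+
+::
+
+    #include <stdio.h>
+    #include <string.h>
+
+    /*
+     * Hypothetical helper: copy the first line of 'path' into 'buf'.
+     * Returns 0 on success, -1 if the file cannot be read.
+     */
+    static int read_first_line(const char *path, char *buf, size_t len)
+    {
+        FILE *f;
+        char *ok;
+
+        if (len == 0)
+            return -1;
+        f = fopen(path, "r");
+        if (!f)
+            return -1;
+        ok = fgets(buf, (int)len, f);
+        fclose(f);
+        if (!ok)
+            return -1;
+        buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
+        return 0;
+    }
+
+    /* Print the Spectre v1 and v2 status strings, if available. */
+    static void print_spectre_status(void)
+    {
+        static const char *const files[] = {
+            "/sys/devices/system/cpu/vulnerabilities/spectre_v1",
+            "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
+        };
+        char line[512];
+        size_t i;
+
+        for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
+            if (read_first_line(files[i], line, sizeof(line)) == 0)
+                printf("%s: %s\n", files[i], line);
+            else
+                printf("%s: not available\n", files[i]);
+        }
+    }
+
+The spectre_v2 line may combine several of the status strings listed above,
+so a caller would typically search it for substrings rather than compare the
+whole line.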
+
+Turning on mitigation for Spectre variant 1 and Spectre variant 2
+-----------------------------------------------------------------
+
+1. Kernel mitigation
+^^^^^^^^^^^^^^^^^^^^
+
+Spectre variant 1
+~~~~~~~~~~~~~~~~~
+
+ For the Spectre variant 1, vulnerable kernel code (as determined
+ by code audit or scanning tools) is annotated on a case by case
+ basis to use nospec accessor macros for bounds clipping :ref:`[2]
+ <spec_ref2>` to avoid any usable disclosure gadgets. However, it may
+ not cover all attack vectors for Spectre variant 1.
+
+ Copy-from-user code has an LFENCE barrier to prevent the access_ok()
+ check from being mis-speculated. The barrier is done by the
+ barrier_nospec() macro.
+
+ For the swapgs variant of Spectre variant 1, LFENCE barriers are
+ added to interrupt, exception and NMI entry where needed. These
+ barriers are done by the FENCE_SWAPGS_KERNEL_ENTRY and
+ FENCE_SWAPGS_USER_ENTRY macros.
+
+Spectre variant 2
+~~~~~~~~~~~~~~~~~
+
+ For Spectre variant 2 mitigation, the compiler turns indirect calls or
+ jumps in the kernel into equivalent return trampolines (retpolines)
+ :ref:`[3] <spec_ref3>` :ref:`[9] <spec_ref9>` to go to the target
+ addresses. Speculative execution paths under retpolines are trapped
+ in an infinite loop to prevent any speculative execution jumping to
+ a gadget.
+
+ To turn on retpoline mitigation on a vulnerable CPU, the kernel
+ needs to be compiled with a gcc compiler that supports the
+ -mindirect-branch=thunk-extern -mindirect-branch-register options.
+ If the kernel is compiled with a Clang compiler, the compiler needs
+ to support the -mretpoline-external-thunk option. The kernel config
+ CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
+ the latest updated microcode.
+
+ On Intel Skylake-era systems the mitigation covers most, but not all,
+ cases. See :ref:`[3] <spec_ref3>` for more details.
+
+ On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
+ IBRS on x86), retpoline is automatically disabled at run time.
+
+ The retpoline mitigation is turned on by default on vulnerable
+ CPUs. It can be forced on or off by the administrator
+ via the kernel command line and sysfs control files. See
+ :ref:`spectre_mitigation_control_command_line`.
+
+ On x86, indirect branch restricted speculation is turned on by default
+ before invoking any firmware code to prevent Spectre variant 2 exploits
+ using the firmware.
+
+ Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
+ and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
+ attacks on the kernel generally more difficult.
+
+2. User program mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ User programs can mitigate Spectre variant 1 using LFENCE or "bounds
+ clipping". For more details see :ref:`[2] <spec_ref2>`.
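+
+ A minimal sketch of "bounds clipping" in a user program follows. It
+ assumes a compiler that implements right shift of negative signed values
+ as an arithmetic shift (as GCC and Clang do) and indexes and sizes no
+ larger than LONG_MAX. The names ``index_mask_nospec()`` and
+ ``read_clipped()`` are hypothetical; the mask computation is modelled on
+ the kernel's generic array_index_mask_nospec(). An optimizing compiler
+ may still turn the mask into a branch, so the generated code should be
+ inspected before relying on it.
+
+ ::
+
+     #include <limits.h>
+
+     /*
+      * Returns ~0UL when index < size and 0 otherwise, computed without
+      * a conditional branch so that it cannot be mispredicted.
+      */
+     static unsigned long index_mask_nospec(unsigned long index,
+                                            unsigned long size)
+     {
+         return ~(long)(index | (size - 1UL - index)) >>
+                (sizeof(long) * CHAR_BIT - 1);
+     }
+
+     /*
+      * Returns table[index], or -1 if index is out of range. Even if
+      * the bounds check is speculatively bypassed, the masked index is
+      * forced to 0 and stays inside the table.
+      */
+     static int read_clipped(const int *table, unsigned long size,
+                             unsigned long index)
+     {
+         if (index >= size)
+             return -1;
+         index &= index_mask_nospec(index, size);
+         return table[index];
+     }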
+
+ For Spectre variant 2 mitigation, individual user programs
+ can be compiled with return trampolines for indirect branches.
+ This protects them from consuming poisoned entries in the branch
+ target buffer left by malicious software. Alternatively, the
+ programs can disable their indirect branch speculation via prctl()
+ (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+ On x86, this will turn on STIBP to guard against attacks from the
+ sibling thread when the user program is running, and use IBPB to
+ flush the branch target buffer when switching to/from the program.
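+
+ A minimal sketch of the prctl() route, assuming kernel headers that
+ define PR_SET_SPECULATION_CTRL and PR_SPEC_INDIRECT_BRANCH. The two
+ helper names are hypothetical. The calls can fail, for example on
+ kernels or CPUs without this control, in which case the sketch returns
+ the negated errno value.
+
+ ::
+
+     #include <errno.h>
+     #include <sys/prctl.h>
+     #include <linux/prctl.h>
+
+     /*
+      * Ask the kernel to disable indirect branch speculation for the
+      * calling task. Returns 0 on success, -errno on failure.
+      */
+     static int disable_indirect_branch_speculation(void)
+     {
+         if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
+                   PR_SPEC_DISABLE, 0, 0) != 0)
+             return -errno;
+         return 0;
+     }
+
+     /*
+      * Query the current state. Returns a bit mask of PR_SPEC_* flags
+      * on success, -errno on failure.
+      */
+     static int indirect_branch_speculation_state(void)
+     {
+         int ret = prctl(PR_GET_SPECULATION_CTRL,
+                         PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
+
+         return ret < 0 ? -errno : ret;
+     }
+
+ A security-sensitive program would call the first helper early in its
+ startup and treat a failure as "mitigation not applied", not as a fatal
+ error.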
+
+ Restricting indirect branch speculation on a user program will
+ also prevent the program from launching a variant 2 attack
+ on x86. All sand-boxed SECCOMP programs have indirect branch
+ speculation restricted by default. Administrators can change
+ that behavior via the kernel command line and sysfs control files.
+ See :ref:`spectre_mitigation_control_command_line`.
+
+ Programs that disable their indirect branch speculation will have
+ more overhead and run slower.
+
+ User programs should use address space randomization
+ (/proc/sys/kernel/randomize_va_space = 1 or 2) to make attacks more
+ difficult.
+
+3. VM mitigation
+^^^^^^^^^^^^^^^^
+
+ Within the kernel, Spectre variant 1 attacks from rogue guests are
+ mitigated on a case by case basis in VM exit paths. Vulnerable code
+ uses nospec accessor macros for "bounds clipping", to avoid any
+ usable disclosure gadgets. However, this may not cover all variant
+ 1 attack vectors.
+
+ For Spectre variant 2 attacks from rogue guests to the kernel, the
+ Linux kernel uses retpoline or Enhanced IBRS to prevent consumption of
+ poisoned entries in branch target buffer left by rogue guests. It also
+ flushes the return stack buffer on every VM exit. This prevents a return
+ stack buffer underflow, in which the poisoned branch target buffer could be
+ used, and clears poisoned entries left in the return stack buffer by guests.
+
+ To mitigate guest-to-guest attacks in the same CPU hardware thread,
+ the branch target buffer is sanitized by flushing before switching
+ to a new guest on a CPU.
+
+ The above mitigations are turned on by default on vulnerable CPUs.
+
+ To mitigate guest-to-guest attacks from a sibling thread when SMT is
+ in use, an untrusted guest running in the sibling thread can have
+ its indirect branch speculation disabled by the administrator via prctl().
+
+ The kernel also allows guests to use any microcode based mitigation
+ they choose to use (such as IBPB or STIBP on x86) to protect themselves.
+
+.. _spectre_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+Spectre variant 2 mitigation can be disabled or force enabled at the
+kernel command line.
+
+ nospectre_v1
+
+ [X86,PPC] Disable mitigations for Spectre Variant 1
+ (bounds check bypass). With this option data leaks are
+ possible in the system.
+
+ nospectre_v2
+
+ [X86] Disable all mitigations for the Spectre variant 2
+ (indirect branch prediction) vulnerability. System may
+ allow data leaks with this option, which is equivalent
+ to spectre_v2=off.
+
+
+ spectre_v2=
+
+ [X86] Control mitigation of Spectre variant 2
+ (indirect branch speculation) vulnerability.
+ The default operation protects the kernel from
+ user space attacks.
+
+ on
+ unconditionally enable, implies
+ spectre_v2_user=on
+ off
+ unconditionally disable, implies
+ spectre_v2_user=off
+ auto
+ kernel detects whether your CPU model is
+ vulnerable
+
+ Selecting 'on' will, and 'auto' may, choose a
+ mitigation method at run time according to the
+ CPU, the available microcode, the setting of the
+ CONFIG_RETPOLINE configuration option, and the
+ compiler with which the kernel was built.
+
+ Selecting 'on' will also enable the mitigation
+ against user space to user space task attacks.
+
+ Selecting 'off' will disable both the kernel and
+ the user space protections.
+
+ Specific mitigations can also be selected manually:
+
+ retpoline
+ replace indirect branches
+ retpoline,generic
+ Google's original retpoline
+ retpoline,amd
+ AMD-specific minimal thunk
+
+ Not specifying this option is equivalent to
+ spectre_v2=auto.
+
+For user space mitigation:
+
+ spectre_v2_user=
+
+ [X86] Control mitigation of Spectre variant 2
+ (indirect branch speculation) vulnerability between
+ user space tasks
+
+ on
+ Unconditionally enable mitigations. Is
+ enforced by spectre_v2=on
+
+ off
+ Unconditionally disable mitigations. Is
+ enforced by spectre_v2=off
+
+ prctl
+ Indirect branch speculation is enabled,
+ but mitigation can be enabled via prctl
+ per thread. The mitigation control state
+ is inherited on fork.
+
+ prctl,ibpb
+ Like "prctl" above, but only STIBP is
+ controlled per thread. IBPB is issued
+ always when switching between different user
+ space processes.
+
+ seccomp
+ Same as "prctl" above, but all seccomp
+ threads will enable the mitigation unless
+ they explicitly opt out.
+
+ seccomp,ibpb
+ Like "seccomp" above, but only STIBP is
+ controlled per thread. IBPB is issued
+ always when switching between different
+ user space processes.
+
+ auto
+ Kernel selects the mitigation depending on
+ the available CPU features and vulnerability.
+
+ Default mitigation:
+ If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+
+ Not specifying this option is equivalent to
+ spectre_v2_user=auto.
+
+ In general the kernel by default selects
+ reasonable mitigations for the current CPU. To
+ disable Spectre variant 2 mitigations, boot with
+ spectre_v2=off. Spectre variant 1 mitigations
+ cannot be disabled.
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace
+^^^^^^^^^^^^^^^^^^^^
+
+ If all userspace applications are from trusted sources and do not
+ execute externally supplied untrusted code, then the mitigations can
+ be disabled.
+
+2. Protect sensitive programs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ For security-sensitive programs that have secrets (e.g. crypto
+ keys), protection against Spectre variant 2 can be put in place by
+ disabling indirect branch speculation when the program is running
+ (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+
+3. Sandbox untrusted programs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Untrusted programs that could be a source of attacks can be cordoned
+ off by disabling their indirect branch speculation when they are run
+ (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+ This prevents untrusted programs from polluting the branch target
+ buffer. All programs running in SECCOMP sandboxes have indirect
+ branch speculation restricted by default. This behavior can be
+ changed via the kernel command line and sysfs control files. See
+ :ref:`spectre_mitigation_control_command_line`.
+
+4. High security mode
+^^^^^^^^^^^^^^^^^^^^^
+
+ All Spectre variant 2 mitigations can be forced on
+ at boot time for all programs (See the "on" option in
+ :ref:`spectre_mitigation_control_command_line`). This will add
+ overhead as indirect branch speculations for all programs will be
+ restricted.
+
+ On x86, the branch target buffer will be flushed with IBPB when switching
+ to a new program. STIBP is left on all the time to protect programs
+ against variant 2 attacks originating from programs running on
+ sibling threads.
+
+ Alternatively, STIBP can be used only when running programs
+ whose indirect branch speculation is explicitly disabled,
+ while IBPB is still used all the time when switching to a new
+ program to clear the branch target buffer (See the "ibpb" option in
+ :ref:`spectre_mitigation_control_command_line`). This "ibpb" option
+ has less performance cost than the "on" option, which leaves STIBP
+ on all the time.
+
+References on Spectre
+---------------------
+
+Intel white papers:
+
+.. _spec_ref1:
+
+[1] `Intel analysis of speculative execution side channels <https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf>`_.
+
+.. _spec_ref2:
+
+[2] `Bounds check bypass <https://software.intel.com/security-software-guidance/software-guidance/bounds-check-bypass>`_.
+
+.. _spec_ref3:
+
+[3] `Deep dive: Retpoline: A branch target injection mitigation <https://software.intel.com/security-software-guidance/insights/deep-dive-retpoline-branch-target-injection-mitigation>`_.
+
+.. _spec_ref4:
+
+[4] `Deep Dive: Single Thread Indirect Branch Predictors <https://software.intel.com/security-software-guidance/insights/deep-dive-single-thread-indirect-branch-predictors>`_.
+
+AMD white papers:
+
+.. _spec_ref5:
+
+[5] `AMD64 technology indirect branch control extension <https://developer.amd.com/wp-content/resources/Architecture_Guidelines_Update_Indirect_Branch_Control.pdf>`_.
+
+.. _spec_ref6:
+
+[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
+
+ARM white papers:
+
+.. _spec_ref7:
+
+[7] `Cache speculation side-channels <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/download-the-whitepaper>`_.
+
+.. _spec_ref8:
+
+[8] `Cache speculation issues update <https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/latest-updates/cache-speculation-issues-update>`_.
+
+Google white paper:
+
+.. _spec_ref9:
+
+[9] `Retpoline: a software construct for preventing branch-target-injection <https://support.google.com/faqs/answer/7625886>`_.
+
+MIPS white paper:
+
+.. _spec_ref10:
+
+[10] `MIPS: response on speculative execution and side channel vulnerabilities <https://www.mips.com/blog/mips-response-on-speculative-execution-and-side-channel-vulnerabilities/>`_.
+
+Academic papers:
+
+.. _spec_ref11:
+
+[11] `Spectre Attacks: Exploiting Speculative Execution <https://spectreattack.com/spectre.pdf>`_.
+
+.. _spec_ref12:
+
+[12] `NetSpectre: Read Arbitrary Memory over Network <https://arxiv.org/abs/1807.10535>`_.
+
+.. _spec_ref13:
+
+[13] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://www.usenix.org/system/files/conference/woot18/woot18-paper-koruyeh.pdf>`_.
diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
new file mode 100644
index 0000000..af6865b
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
@@ -0,0 +1,279 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+TAA - TSX Asynchronous Abort
+======================================
+
+TAA is a hardware vulnerability that allows unprivileged speculative access to
+data which is available in various CPU internal buffers by using asynchronous
+aborts within an Intel TSX transactional region.
+
+Affected processors
+-------------------
+
+This vulnerability only affects Intel processors that support Intel
+Transactional Synchronization Extensions (TSX) when the TAA_NO bit (bit 8)
+is 0 in the IA32_ARCH_CAPABILITIES MSR. On processors where the MDS_NO bit
+(bit 5) is 0 in the IA32_ARCH_CAPABILITIES MSR, the existing MDS mitigations
+also mitigate against TAA.
+
+Whether a processor is affected or not can be read out from the TAA
+vulnerability file in sysfs. See :ref:`tsx_async_abort_sys_info`.
+
+Related CVEs
+------------
+
+The following CVE entry is related to this TAA issue:
+
+   ==============  =====  ===================================================
+   CVE-2019-11135  TAA    TSX Asynchronous Abort (TAA) condition on some
+                          microprocessors utilizing speculative execution may
+                          allow an authenticated user to potentially enable
+                          information disclosure via a side channel with
+                          local access.
+   ==============  =====  ===================================================
+
+Problem
+-------
+
+When performing store, load or L1 refill operations, processors write
+data into temporary microarchitectural structures (buffers). The data in
+those buffers can be forwarded to load operations as an optimization.
+
+Intel TSX is an extension to the x86 instruction set architecture that adds
+hardware transactional memory support to improve performance of multi-threaded
+software. TSX lets the processor expose and exploit concurrency hidden in an
+application due to dynamically avoiding unnecessary synchronization.
+
+TSX supports atomic memory transactions that are either committed (success) or
+aborted. During an abort, operations that happened within the transactional region
+are rolled back. An asynchronous abort takes place, among other options, when a
+different thread accesses a cache line that is also used within the transactional
+region when that access might lead to a data race.
+
+Immediately after an uncompleted asynchronous abort, certain speculatively
+executed loads may read data from those internal buffers and pass it to dependent
+operations. This can then be used to infer the value via a cache side channel
+attack.
+
+Because the buffers are potentially shared between Hyper-Threads, cross
+Hyper-Thread attacks are possible.
+
+The victim of a malicious actor does not need to make use of TSX. Only the
+attacker needs to begin a TSX transaction and raise an asynchronous abort
+which in turn potentially leaks data stored in the buffers.
+
+More detailed technical information is available in the TAA specific x86
+architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`.
+
+
+Attack scenarios
+----------------
+
+Attacks against the TAA vulnerability can be implemented from unprivileged
+applications running on hosts or guests.
+
+As for MDS, the attacker has no control over the memory addresses that can
+be leaked. Only the victim is responsible for bringing data to the CPU. As
+a result, the malicious actor has to sample as much data as possible and
+then postprocess it to try to infer any useful information from it.
+
+A potential attacker only has read access to the data. Also, there is no direct
+privilege escalation by using this technique.
+
+
+.. _tsx_async_abort_sys_info:
+
+TAA system information
+-----------------------
+
+The Linux kernel provides a sysfs interface to enumerate the current TAA status
+of mitigated systems. The relevant sysfs file is:
+
+/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
+
+The possible values in this file are:
+
+.. list-table::
+
+   * - 'Vulnerable'
+     - The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied.
+   * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+     - The system tries to clear the buffers but the microcode might not support the operation.
+   * - 'Mitigation: Clear CPU buffers'
+     - The microcode has been updated to clear the buffers. TSX is still enabled.
+   * - 'Mitigation: TSX disabled'
+     - TSX is disabled.
+   * - 'Not affected'
+     - The CPU is not affected by this issue.
+
+.. _ucode_needed:
+
+Best effort mitigation mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the processor is vulnerable, but the availability of the microcode-based
+mitigation mechanism is not advertised via CPUID, the kernel selects a best
+effort mitigation mode. This mode invokes the mitigation instructions
+without a guarantee that they clear the CPU buffers.
+
+This is done to address virtualization scenarios where the host has the
+microcode update applied, but the hypervisor is not yet updated to expose the
+CPUID to the guest. If the host has updated microcode the protection takes
+effect; otherwise a few CPU cycles are wasted pointlessly.
+
+The state in the tsx_async_abort sysfs file reflects this situation
+accordingly.
+
+
+Mitigation mechanism
+--------------------
+
+The kernel detects the affected CPUs and the presence of the microcode which is
+required. If a CPU is affected and the microcode is available, then the kernel
+enables the mitigation by default.
+
+
+The mitigation can be controlled at boot time via a kernel command line option.
+See :ref:`taa_mitigation_control_command_line`.
+
+.. _virt_mechanism:
+
+Virtualization mitigation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Affected systems where the host has the TAA microcode and TAA is mitigated
+by having disabled TSX are not vulnerable, regardless of the status of the
+VMs.
+
+In all other cases, if the host either does not have the TAA microcode or
+the kernel is not mitigated, the system might be vulnerable.
+
+
+.. _taa_mitigation_control_command_line:
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+The kernel command line allows control of the TAA mitigations at boot time with
+the option "tsx_async_abort=". The valid arguments for this option are:
+
+  ============  =============================================================
+  off           This option disables the TAA mitigation on affected platforms.
+                If the system has TSX enabled (see next parameter) and the CPU
+                is affected, the system is vulnerable.
+
+  full          TAA mitigation is enabled. If TSX is enabled, on an affected
+                system it will clear CPU buffers on ring transitions. On
+                systems which are MDS-affected and deploy MDS mitigation,
+                TAA is also mitigated. Specifying this option on those
+                systems will have no effect.
+
+  full,nosmt    The same as tsx_async_abort=full, with SMT disabled on
+                vulnerable CPUs that have TSX enabled. This is the complete
+                mitigation. When TSX is disabled, SMT is not disabled because
+                the CPU is not vulnerable to cross-thread TAA attacks.
+  ============  =============================================================
+
+Not specifying this option is equivalent to "tsx_async_abort=full". For
+processors that are affected by both TAA and MDS, specifying just
+"tsx_async_abort=off" without an accompanying "mds=off" will have no
+effect as the same mitigation is used for both vulnerabilities.
+
+The kernel command line also allows control of the TSX feature using the
+parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used
+to control the TSX feature and the enumeration of the TSX feature bits (RTM
+and HLE) in CPUID.
+
+The valid options are:
+
+  ============  =============================================================
+  off           Disables TSX on the system.
+
+                Note that this option takes effect only on newer CPUs which are
+                not vulnerable to MDS, i.e., have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1
+                and which get the new IA32_TSX_CTRL MSR through a microcode
+                update. This new MSR allows for the reliable deactivation of
+                the TSX functionality.
+
+  on            Enables TSX.
+
+                Although there are mitigations for all known security
+                vulnerabilities, TSX has been known to be an accelerator for
+                several previous speculation-related CVEs, and so there may be
+                unknown security risks associated with leaving it enabled.
+
+  auto          Disables TSX if X86_BUG_TAA is present, otherwise enables TSX
+                on the system.
+  ============  =============================================================
+
+Not specifying this option is equivalent to "tsx=off".
+
+The following combinations of the "tsx_async_abort" and "tsx" options are
+possible. For affected platforms tsx=auto is equivalent to tsx=off and the result will be:
+
+  =========  ==========================  =========================================
+  tsx=on     tsx_async_abort=full        The system will use VERW to clear CPU
+                                         buffers. Cross-thread attacks are still
+                                         possible on SMT machines.
+  tsx=on     tsx_async_abort=full,nosmt  As above, cross-thread attacks on SMT
+                                         mitigated.
+  tsx=on     tsx_async_abort=off         The system is vulnerable.
+  tsx=off    tsx_async_abort=full        TSX might be disabled if microcode
+                                         provides a TSX control MSR. If so,
+                                         system is not vulnerable.
+  tsx=off    tsx_async_abort=full,nosmt  Ditto
+  tsx=off    tsx_async_abort=off         Ditto
+  =========  ==========================  =========================================
+
+
+For unaffected platforms, "tsx=on" and "tsx_async_abort=full" do not clear CPU
+buffers. For platforms without TSX control (MSR_IA32_ARCH_CAPABILITIES.MDS_NO=0),
+the "tsx" command line argument has no effect.
+
+For the affected platforms, the table below indicates the mitigation status for the
+combinations of CPUID bit MD_CLEAR and IA32_ARCH_CAPABILITIES MSR bits MDS_NO
+and TSX_CTRL_MSR.
+
+  =======  =========  =============  ========================================
+  MDS_NO   MD_CLEAR   TSX_CTRL_MSR   Status
+  =======  =========  =============  ========================================
+  0        0          0              Vulnerable (needs microcode)
+  0        1          0              MDS and TAA mitigated via VERW
+  1        1          0              MDS fixed, TAA vulnerable if TSX enabled
+                                     because MD_CLEAR has no meaning and
+                                     VERW is not guaranteed to clear buffers
+  1        X          1              MDS fixed, TAA can be mitigated by
+                                     VERW or TSX_CTRL_MSR
+  =======  =========  =============  ========================================
+
+Mitigation selection guide
+--------------------------
+
+1. Trusted userspace and guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If all user space applications are from a trusted source and do not execute
+untrusted code which is supplied externally, then the mitigation can be
+disabled. The same applies to virtualized environments with trusted guests.
+
+
+2. Untrusted userspace and guests
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If there are untrusted applications or guests on the system, enabling TSX
+might allow a malicious actor to leak data from the host or from other
+processes running on the same physical core.
+
+If the microcode is available and TSX is disabled on the host, attacks
+are prevented in a virtualized environment as well, even if the VMs do not
+explicitly enable the mitigation.
+
+
+.. _taa_default_mitigations:
+
+Default mitigations
+-------------------
+
+The kernel's default action for vulnerable processors is:
+
+ - Deploy TSX disable mitigation (tsx_async_abort=full tsx=off).
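Note on the tsx_async_abort.rst file added above (not part of the patch): the sysfs status strings it lists can be checked from user space. The following is a minimal Python sketch under stated assumptions. The helper names `classify_taa_status` and `read_taa_status` and the category labels they return are hypothetical; only the sysfs path and the status strings come from the documentation.

```python
from pathlib import Path

# Path documented in tsx_async_abort.rst.
TAA_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/tsx_async_abort")


def classify_taa_status(text):
    """Map a sysfs status line to a coarse category.

    The category labels are hypothetical; the matched strings are the
    documented values ('Not affected', 'Mitigation: ...', 'Vulnerable...').
    """
    status = text.strip()
    if status == "Not affected":
        return "not_affected"
    if status.startswith("Mitigation:"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def read_taa_status(path=TAA_SYSFS):
    """Return (raw_status, category), or (None, 'unknown') if unreadable."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        # The file is absent on kernels without TAA reporting.
        return None, "unknown"
    return raw, classify_taa_status(raw)
```

Calling `read_taa_status()` on a system without this sysfs file should return `(None, "unknown")` rather than raise. Prefix matching is used so that the best effort state ('Vulnerable: Clear CPU buffers attempted, no microcode') still counts as vulnerable.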
diff --git a/Documentation/admin-guide/hw_random.rst b/Documentation/admin-guide/hw_random.rst
new file mode 100644
index 0000000..121de96
--- /dev/null
+++ b/Documentation/admin-guide/hw_random.rst
@@ -0,0 +1,105 @@
+==========================================================
+Linux support for random number generator in i8xx chipsets
+==========================================================
+
+Introduction
+============
+
+The hw_random framework is software that makes use of a
+special hardware feature on your CPU or motherboard,
+a Random Number Generator (RNG). The software has two parts:
+a core providing the /dev/hwrng character device and its
+sysfs support, plus a hardware-specific driver that plugs
+into that core.
+
+To make the most effective use of these mechanisms, you
+should download the support software as well. Download the
+latest version of the "rng-tools" package from the
+hw_random driver's official Web site:
+
+ http://sourceforge.net/projects/gkernel/
+
+Those tools use /dev/hwrng to fill the kernel entropy pool,
+which is used internally and exported by the /dev/urandom and
+/dev/random special files.
+
+Theory of operation
+===================
+
+CHARACTER DEVICE. Using the standard open()
+and read() system calls, you can read random data from
+the hardware RNG device. This data is NOT CHECKED by any
+fitness tests, and could potentially be bogus (if the
+hardware is faulty or has been tampered with). Data is only
+output if the hardware "has-data" flag is set, but nevertheless
+a security-conscious person would run fitness tests on the
+data before assuming it is truly random.
+
+The rng-tools package uses such tests in "rngd", and lets you
+run them by hand with a "rngtest" utility.
+
+/dev/hwrng is char device major 10, minor 183.
+
+CLASS DEVICE. There is a /sys/class/misc/hw_random node with
+two unique attributes, "rng_available" and "rng_current". The
+"rng_available" attribute lists the hardware-specific drivers
+available, while "rng_current" lists the one which is currently
+connected to /dev/hwrng. If your system has more than one
+RNG available, you may change the one used by writing a name from
+the list in "rng_available" into "rng_current".
+
+==========================================================================
+
+
+Hardware driver for Intel/AMD/VIA Random Number Generators (RNG)
+ - Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com>
+ - Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com>
+
+
+About the Intel RNG hardware, from the firmware hub datasheet
+=============================================================
+
+The Firmware Hub integrates a Random Number Generator (RNG)
+using thermal noise generated from inherently random quantum
+mechanical properties of silicon. When not generating new random
+bits the RNG circuitry will enter a low power state. Intel will
+provide a binary software driver to give third party software
+access to our RNG for use as a security feature. At this time,
+the RNG is only to be used with a system in an OS-present state.
+
+Intel RNG Driver notes
+======================
+
+FIXME: support poll(2)
+
+.. note::
+
+ request_mem_region was removed, for three reasons:
+
+ 1) Only one RNG is supported by this driver;
+ 2) The location used by the RNG is a fixed location in
+ MMIO-addressable memory;
+ 3) users with properly working BIOS e820 handling will always
+ have the region in which the RNG is located reserved, so
+ request_mem_region calls always fail for proper setups.
+ However, for people who use mem=XX, BIOS e820 information is
+ **not** in /proc/iomem, and request_mem_region(RNG_ADDR) can
+ succeed.
+
+Driver details
+==============
+
+Based on:
+ Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet
+ May 1999 Order Number: 290658-002 R
+
+Intel 82802 Firmware Hub:
+ Random Number Generator
+ Programmer's Reference Manual
+ December 1999 Order Number: 298029-001 R
+
+Intel 82802 Firmware HUB Random Number Generator Driver
+ Copyright (c) 2000 Matt Sottek <msottek@quiknet.com>
+
+Special thanks to Matt Sottek. I did the "guts", he
+did the "brains" and all the testing.
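Note on the hw_random.rst file added above (not part of the patch): switching the active RNG through the two sysfs attributes can be sketched in Python as below. The helper names are hypothetical, the sketch assumes that "rng_available" holds whitespace-separated driver names, and writing "rng_current" on a real system normally requires root.

```python
from pathlib import Path

# Class device node named in hw_random.rst.
HW_RANDOM_DIR = Path("/sys/class/misc/hw_random")


def list_rngs(sysfs_dir=HW_RANDOM_DIR):
    """Return the driver names in rng_available (assumed whitespace-separated)."""
    return (Path(sysfs_dir) / "rng_available").read_text().split()


def current_rng(sysfs_dir=HW_RANDOM_DIR):
    """Return the name of the RNG currently connected to /dev/hwrng."""
    return (Path(sysfs_dir) / "rng_current").read_text().strip()


def select_rng(name, sysfs_dir=HW_RANDOM_DIR):
    """Write a name taken from rng_available into rng_current."""
    if name not in list_rngs(sysfs_dir):
        raise ValueError(f"{name!r} is not listed in rng_available")
    (Path(sysfs_dir) / "rng_current").write_text(name)
```

The `sysfs_dir` parameter exists only so the helpers can be pointed at a scratch directory for testing. Data read from /dev/hwrng after switching is still unchecked, as the text above warns.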
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 0873685..34cc20e 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -16,15 +16,14 @@
README
kernel-parameters
devices
+ sysctl/index
-This section describes CPU vulnerabilities and provides an overview of the
-possible mitigations along with guidance for selecting mitigations if they
-are configurable at compile, boot or run time.
+This section describes CPU vulnerabilities and their mitigations.
.. toctree::
:maxdepth: 1
- l1tf
+ hw-vuln/index
Here is a set of documents aimed at users who are trying to track down
problems and bugs in particular.
@@ -40,6 +39,8 @@
ramoops
dynamic-debug-howto
init
+ kdump/index
+ perf/index
This is the beginning of a section with information of interest to
application developers. Documents covering various aspects of the kernel
@@ -58,11 +59,13 @@
initrd
cgroup-v2
+ cgroup-v1/index
serial-console
braille-console
parport
md
module-signing
+ rapidio
sysrq
unicode
vga-softcursor
@@ -71,10 +74,43 @@
java
ras
bcache
+ blockdev/index
+ ext4
+ binderfs
+ cifs/index
+ xfs
+ jfs
+ ufs
pm/index
thunderbolt
LSM/index
mm/index
+ namespaces/index
+ perf-security
+ acpi/index
+ aoe/index
+ btmrvl
+ clearing-warn-once
+ cpu-load
+ cputopology
+ device-mapper/index
+ efi-stub
+ gpio/index
+ highuid
+ hw_random
+ iostats
+ kernel-per-CPU-kthreads
+ laptops/index
+ auxdisplay/index
+ lcd-panel-cgram
+ ldm
+ lockup-watchdogs
+ numastat
+ pnp
+ rtc
+ svga
+ wimax/index
+ video-output
.. only:: subproject and html
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
new file mode 100644
index 0000000..5d63b18
--- /dev/null
+++ b/Documentation/admin-guide/iostats.rst
@@ -0,0 +1,197 @@
+=====================
+I/O statistics fields
+=====================
+
+Since 2.4.20 (and some versions before, with patches), and 2.5.45,
+more extensive disk statistics have been introduced to help measure disk
+activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
+the work for you, but in case you are interested in creating your own
+tools, the fields are explained here.
+
+In 2.4, the information is found as additional fields in
+``/proc/partitions``. In 2.6 and later, the same information is found in two
+places: one is in the file ``/proc/diskstats``, and the other is within
+the sysfs file system, which must be mounted in order to obtain
+the information. Throughout this document we'll assume that sysfs
+is mounted on ``/sys``, although of course it may be mounted anywhere.
+Both ``/proc/diskstats`` and sysfs use the same source for the information
+and so should not differ.
+
+Here are examples of these different formats::
+
+ 2.4:
+ 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+ 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030
+
+ 2.6+ sysfs:
+ 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+ 35486 38030 38030 38030
+
+ 2.6+ diskstats:
+ 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
+ 3 1 hda1 35486 38030 38030 38030
+
+ 4.18+ diskstats:
+ 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0
+
+On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
+a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.
+
+The advantage of one over the other is that the sysfs choice works well
+if you are watching a known, small set of disks. ``/proc/diskstats`` may
+be a better choice if you are watching a large number of disks because
+you'll avoid the overhead of 50, 100, or 500 or more opens/closes with
+each snapshot of your disk statistics.
+
+In 2.4, the statistics fields are those after the device name. In
+the above example, the first field of statistics would be 446216.
+By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
+find just the eleven fields, beginning with 446216. If you look at
+``/proc/diskstats``, the eleven fields will be preceded by the major and
+minor device numbers, and device name. Each of these formats provides
+eleven fields of statistics, each meaning exactly the same things.
+All fields except field 9 are cumulative since boot. Field 9 should
+go to zero as I/Os complete; all others only increase (unless they
+overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long
+(native word size) numbers, and on a very busy or long-lived system they
+may wrap. Applications should be prepared to deal with that; unless
+your observations are measured in large numbers of minutes or hours,
+they should not wrap twice before you notice them.
+
+Each set of stats only applies to the indicated device; if you want
+system-wide stats you'll have to find all the devices and sum them all up.
+
+Field 1 -- # of reads completed
+ This is the total number of reads completed successfully.
+
+Field 2 -- # of reads merged, field 6 -- # of writes merged
+ Reads and writes which are adjacent to each other may be merged for
+ efficiency. Thus two 4K reads may become one 8K read before it is
+ ultimately handed to the disk, and so it will be counted (and queued)
+ as only one I/O. This field lets you know how often this was done.
+
+Field 3 -- # of sectors read
+ This is the total number of sectors read successfully.
+
+Field 4 -- # of milliseconds spent reading
+ This is the total number of milliseconds spent by all reads (as
+ measured from __make_request() to end_that_request_last()).
+
+Field 5 -- # of writes completed
+ This is the total number of writes completed successfully.
+
+Field 6 -- # of writes merged
+ See the description of field 2.
+
+Field 7 -- # of sectors written
+ This is the total number of sectors written successfully.
+
+Field 8 -- # of milliseconds spent writing
+ This is the total number of milliseconds spent by all writes (as
+ measured from __make_request() to end_that_request_last()).
+
+Field 9 -- # of I/Os currently in progress
+ The only field that should go to zero. Incremented as requests are
+ given to appropriate struct request_queue and decremented as they finish.
+
+Field 10 -- # of milliseconds spent doing I/Os
+ This field increases so long as field 9 is nonzero.
+
+ Since 5.0 this field counts jiffies when at least one request was
+ started or completed. If request runs more than 2 jiffies then some
+ I/O time will not be accounted unless there are other requests.
+
+Field 11 -- weighted # of milliseconds spent doing I/Os
+ This field is incremented at each I/O start, I/O completion, I/O
+ merge, or read of these stats by the number of I/Os in progress
+ (field 9) times the number of milliseconds spent doing I/O since the
+ last update of this field. This can provide an easy measure of both
+ I/O completion time and the backlog that may be accumulating.
+
+Field 12 -- # of discards completed
+ This is the total number of discards completed successfully.
+
+Field 13 -- # of discards merged
+ See the description of field 2.
+
+Field 14 -- # of sectors discarded
+ This is the total number of sectors discarded successfully.
+
+Field 15 -- # of milliseconds spent discarding
+ This is the total number of milliseconds spent by all discards (as
+ measured from __make_request() to end_that_request_last()).
+
+To avoid introducing performance bottlenecks, no locks are held while
+modifying these counters. This implies that minor inaccuracies may be
+introduced when changes collide, so (for instance) adding up all the
+read I/Os issued per partition should equal those made to the disks ...
+but due to the lack of locking it may only be very close.
+
+In 2.6+, there are counters for each CPU, which make the lack of locking
+almost a non-issue. When the statistics are read, the per-CPU counters
+are summed (possibly overflowing the unsigned long variable they are
+summed to) and the result given to the user. There is no convenient
+user interface for accessing the per-CPU counters themselves.
+
+Disks vs Partitions
+-------------------
+
+There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
+As a result, some statistic information disappeared. The translation from
+a disk address relative to a partition to the disk address relative to
+the host disk happens much earlier. All merges and timings now happen
+at the disk level rather than at both the disk and partition level as
+in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
+partitions from that for disks. There are only *four* fields available
+for partitions on 2.6+ machines. This is reflected in the examples above.
+
+Field 1 -- # of reads issued
+ This is the total number of reads issued to this partition.
+
+Field 2 -- # of sectors read
+ This is the total number of sectors requested to be read from this
+ partition.
+
+Field 3 -- # of writes issued
+ This is the total number of writes issued to this partition.
+
+Field 4 -- # of sectors written
+ This is the total number of sectors requested to be written to
+ this partition.
+
+Note that since the address is translated to a disk-relative one, and no
+record of the partition-relative address is kept, the subsequent success
+or failure of the read cannot be attributed to the partition. In other
+words, the number of reads for partitions is counted slightly before time
+of queuing for partitions, and at completion for whole disks. This is
+a subtle distinction that is probably uninteresting for most cases.
+
+More significant is the error induced by counting the numbers of
+reads/writes before merges for partitions and after for disks. Since a
+typical workload usually contains a lot of successive and adjacent requests,
+the number of reads/writes issued can be several times higher than the
+number of reads/writes completed.
+
+In 2.6.25, the full statistic set is again available for partitions and
+disk and partition statistics are consistent again. Since we still don't
+keep record of the partition-relative address, an operation is attributed to
+the partition which contains the first sector of the request after the
+eventual merges. As requests can be merged across partitions, this could lead
+to some (probably insignificant) inaccuracy.
+
+Additional notes
+----------------
+
+In 2.6+, sysfs is not mounted by default. If your distribution of
+Linux hasn't added it already, here's the line you'll want to add to
+your ``/etc/fstab``::
+
+ none /sys sysfs defaults 0 0
+
+
+In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
+appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
+``/proc/stat`` take a very different format from those in ``/proc/partitions``
+(see proc(5), if your system has it.)
+
+-- ricklind@us.ibm.com
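Note on the iostats.rst file added above (not part of the patch): for readers writing their own tools, the field layout can be sketched in Python as below. The field names in `FIELD_NAMES` and both helper names are hypothetical labels chosen for illustration. The sketch assumes full-format ``/proc/diskstats`` lines (eleven or more statistics fields), not the four-field partition format used before 2.6.25, and `counter_delta` assumes at most one wrap between samples.

```python
# Hypothetical labels for fields 1-15, in the order iostats.rst describes them.
FIELD_NAMES = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
    "discards_completed", "discards_merged", "sectors_discarded",
    "ms_discarding",
)


def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into (major, minor, name, stats)."""
    parts = line.split()
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    values = [int(v) for v in parts[3:]]
    # zip() stops at the shorter sequence, so 11-field (2.6+) and
    # 15-field (4.18+) lines both work; any extra fields are ignored.
    return major, minor, name, dict(zip(FIELD_NAMES, values))


def counter_delta(prev, curr, bits=64):
    """Difference between two samples of a cumulative counter.

    Tolerates a single wrap of an unsigned counter that is `bits` wide.
    Do not apply this to field 9, which is a gauge, not a cumulative count.
    """
    return (curr - prev) % (1 << bits)
```

Taking deltas modulo the native word size is one way to deal with the wrap-around that the text warns about; pass `bits=32` for counters sampled on a 32-bit kernel.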
diff --git a/Documentation/admin-guide/jfs.rst b/Documentation/admin-guide/jfs.rst
new file mode 100644
index 0000000..9e12d93
--- /dev/null
+++ b/Documentation/admin-guide/jfs.rst
@@ -0,0 +1,66 @@
+===========================================
+IBM's Journaled File System (JFS) for Linux
+===========================================
+
+JFS Homepage: http://jfs.sourceforge.net/
+
+The following mount options are supported:
+
+(*) == default
+
+iocharset=name
+ Character set to use for converting from Unicode to
+ ASCII. The default is to do no conversion. Use
+ iocharset=utf8 for UTF-8 translations. This requires
+ CONFIG_NLS_UTF8 to be set in the kernel .config file.
+ iocharset=none specifies the default behavior explicitly.
+
+resize=value
+ Resize the volume to <value> blocks. JFS only supports
+ growing a volume, not shrinking it. This option is only
+ valid during a remount, when the volume is mounted
+ read-write. The resize keyword with no value will grow
+ the volume to the full size of the partition.
+
+nointegrity
+ Do not write to the journal. The primary use of this option
+ is to allow for higher performance when restoring a volume
+ from backup media. The integrity of the volume is not
+ guaranteed if the system terminates abnormally.
+
+integrity(*)
+ Commit metadata changes to the journal. Use this option to
+ remount a volume where the nointegrity option was
+ previously specified in order to restore normal behavior.
+
+errors=continue
+ Keep going on a filesystem error.
+errors=remount-ro(*)
+ Remount the filesystem read-only on an error.
+errors=panic
+ Panic and halt the machine if an error occurs.
+
+uid=value
+ Override on-disk uid with specified value
+gid=value
+ Override on-disk gid with specified value
+umask=value
+ Override on-disk umask with specified octal value. For
+ directories, the execute bit will be set if the corresponding
+ read bit is set.
+
+discard=minlen, discard/nodiscard(*)
+ This enables/disables the use of discard/TRIM commands.
+ The discard/TRIM commands are sent to the underlying
+ block device when blocks are freed. This is useful for SSD
+ devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl
+ command is also available together with the nodiscard option.
+ The value of minlen specifies the minimum block count at which
+ a TRIM command to the block device is considered useful.
+ When no value is given to the discard option, it defaults to
+ 64 blocks, which means 256KiB in JFS.
+ The minlen value of discard overrides the minlen value given
+ on an FITRIM ioctl().
+
+The JFS mailing list can be subscribed to by using the link labeled
+"Mail list Subscribe" at our web page http://jfs.sourceforge.net/
diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
new file mode 100644
index 0000000..220d0a8
--- /dev/null
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -0,0 +1,264 @@
+#
+# This file contains a few gdb macros (user defined commands) to extract
+# useful information from kernel crashdump (kdump) like stack traces of
+# all the processes or a particular process and trapinfo.
+#
+# These macros can be used by copying this file in .gdbinit (put in home
+# directory or current directory) or by invoking gdb command with
+# --command=<command-file-name> option
+#
+# Credits:
+# Alexander Nyberg <alexn@telia.com>
+# V Srivatsa <vatsa@in.ibm.com>
+# Maneesh Soni <maneesh@in.ibm.com>
+#
+
+define bttnobp
+ set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
+ set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
+ set $init_t=&init_task
+ set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
+ set var $stacksize = sizeof(union thread_union)
+ while ($next_t != $init_t)
+ set $next_t=(struct task_struct *)$next_t
+ printf "\npid %d; comm %s:\n", $next_t.pid, $next_t.comm
+ printf "===================\n"
+ set var $stackp = $next_t.thread.sp
+ set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
+
+ while ($stackp < $stack_top)
+ if (*($stackp) > _stext && *($stackp) < _sinittext)
+ info symbol *($stackp)
+ end
+ set $stackp += 4
+ end
+ set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
+ while ($next_th != $next_t)
+ set $next_th=(struct task_struct *)$next_th
+ printf "\npid %d; comm %s:\n", $next_th.pid, $next_th.comm
+ printf "===================\n"
+ set var $stackp = $next_th.thread.sp
+ set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
+
+ while ($stackp < $stack_top)
+ if (*($stackp) > _stext && *($stackp) < _sinittext)
+ info symbol *($stackp)
+ end
+ set $stackp += 4
+ end
+ set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
+ end
+ set $next_t=(char *)($next_t->tasks.next) - $tasks_off
+ end
+end
+document bttnobp
+ dump all thread stack traces on a kernel compiled with !CONFIG_FRAME_POINTER
+end
+
+define btthreadstack
+ set var $pid_task = $arg0
+
+ printf "\npid %d; comm %s:\n", $pid_task.pid, $pid_task.comm
+ printf "task struct: "
+ print $pid_task
+ printf "===================\n"
+ set var $stackp = $pid_task.thread.sp
+ set var $stacksize = sizeof(union thread_union)
+ set var $stack_top = ($stackp & ~($stacksize - 1)) + $stacksize
+ set var $stack_bot = ($stackp & ~($stacksize - 1))
+
+ set $stackp = *((unsigned long *) $stackp)
+ while (($stackp < $stack_top) && ($stackp > $stack_bot))
+ set var $addr = *(((unsigned long *) $stackp) + 1)
+ info symbol $addr
+ set $stackp = *((unsigned long *) $stackp)
+ end
+end
+document btthreadstack
+ dump a thread stack using the given task structure pointer
+end
+
+
+define btt
+ set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
+ set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
+ set $init_t=&init_task
+ set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
+ while ($next_t != $init_t)
+ set $next_t=(struct task_struct *)$next_t
+ btthreadstack $next_t
+
+ set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
+ while ($next_th != $next_t)
+ set $next_th=(struct task_struct *)$next_th
+ btthreadstack $next_th
+ set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
+ end
+ set $next_t=(char *)($next_t->tasks.next) - $tasks_off
+ end
+end
+document btt
+ dump all thread stack traces on a kernel compiled with CONFIG_FRAME_POINTER
+end
+
+define btpid
+ set var $pid = $arg0
+ set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
+ set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
+ set $init_t=&init_task
+ set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
+ set var $pid_task = 0
+
+ while ($next_t != $init_t)
+ set $next_t=(struct task_struct *)$next_t
+
+ if ($next_t.pid == $pid)
+ set $pid_task = $next_t
+ end
+
+ set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
+ while ($next_th != $next_t)
+ set $next_th=(struct task_struct *)$next_th
+ if ($next_th.pid == $pid)
+ set $pid_task = $next_th
+ end
+ set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
+ end
+ set $next_t=(char *)($next_t->tasks.next) - $tasks_off
+ end
+
+ btthreadstack $pid_task
+end
+document btpid
+ backtrace of pid
+end
+
+
+define trapinfo
+ set var $pid = $arg0
+ set $tasks_off=((size_t)&((struct task_struct *)0)->tasks)
+ set $pid_off=((size_t)&((struct task_struct *)0)->thread_group.next)
+ set $init_t=&init_task
+ set $next_t=(((char *)($init_t->tasks).next) - $tasks_off)
+ set var $pid_task = 0
+
+ while ($next_t != $init_t)
+ set $next_t=(struct task_struct *)$next_t
+
+ if ($next_t.pid == $pid)
+ set $pid_task = $next_t
+ end
+
+ set $next_th=(((char *)$next_t->thread_group.next) - $pid_off)
+ while ($next_th != $next_t)
+ set $next_th=(struct task_struct *)$next_th
+ if ($next_th.pid == $pid)
+ set $pid_task = $next_th
+ end
+ set $next_th=(((char *)$next_th->thread_group.next) - $pid_off)
+ end
+ set $next_t=(char *)($next_t->tasks.next) - $tasks_off
+ end
+
+ printf "Trapno %ld, cr2 0x%lx, error_code %ld\n", $pid_task.thread.trap_no, \
+ $pid_task.thread.cr2, $pid_task.thread.error_code
+
+end
+document trapinfo
+ Run 'info threads' and look up the pid of thread #1.
+ 'trapinfo <pid>' then reports which trap, and possibly which
+ address, caused the kernel to panic.
+end
+
+define dump_log_idx
+ set $idx = $arg0
+ if ($argc > 1)
+ set $prev_flags = $arg1
+ else
+ set $prev_flags = 0
+ end
+ set $msg = ((struct printk_log *) (log_buf + $idx))
+ set $prefix = 1
+ set $newline = 1
+ set $log = log_buf + $idx + sizeof(*$msg)
+
+ # prev & LOG_CONT && !(msg->flags & LOG_PREFIX)
+ if (($prev_flags & 8) && !($msg->flags & 4))
+ set $prefix = 0
+ end
+
+ # msg->flags & LOG_CONT
+ if ($msg->flags & 8)
+ # (prev & LOG_CONT && !(prev & LOG_NEWLINE))
+ if (($prev_flags & 8) && !($prev_flags & 2))
+ set $prefix = 0
+ end
+ # (!(msg->flags & LOG_NEWLINE))
+ if (!($msg->flags & 2))
+ set $newline = 0
+ end
+ end
+
+ if ($prefix)
+ printf "[%5lu.%06lu] ", $msg->ts_nsec / 1000000000, $msg->ts_nsec % 1000000000
+ end
+ if ($msg->text_len != 0)
+ eval "printf \"%%%d.%ds\", $log", $msg->text_len, $msg->text_len
+ end
+ if ($newline)
+ printf "\n"
+ end
+ if ($msg->dict_len > 0)
+ set $dict = $log + $msg->text_len
+ set $idx = 0
+ set $line = 1
+ while ($idx < $msg->dict_len)
+ if ($line)
+ printf " "
+ set $line = 0
+ end
+ set $c = $dict[$idx]
+ if ($c == '\0')
+ printf "\n"
+ set $line = 1
+ else
+ if ($c < ' ' || $c >= 127 || $c == '\\')
+ printf "\\x%02x", $c
+ else
+ printf "%c", $c
+ end
+ end
+ set $idx = $idx + 1
+ end
+ printf "\n"
+ end
+end
+document dump_log_idx
+ Dump a single log record given its index in the log buffer. The first
+ parameter is the index into log_buf; the second is optional and
+ specifies the previous log record's flags, used for properly
+ formatting continued lines.
+end
+
+define dmesg
+ set $i = log_first_idx
+ set $end_idx = log_first_idx
+ set $prev_flags = 0
+
+ while (1)
+ set $msg = ((struct printk_log *) (log_buf + $i))
+ if ($msg->len == 0)
+ set $i = 0
+ else
+ dump_log_idx $i $prev_flags
+ set $i = $i + $msg->len
+ set $prev_flags = $msg->flags
+ end
+ if ($i == $end_idx)
+ loop_break
+ end
+ end
+end
+document dmesg
+ print the kernel ring buffer
+end
diff --git a/Documentation/admin-guide/kdump/index.rst b/Documentation/admin-guide/kdump/index.rst
new file mode 100644
index 0000000..8e2ebd0
--- /dev/null
+++ b/Documentation/admin-guide/kdump/index.rst
@@ -0,0 +1,20 @@
+
+================================================================
+Documentation for Kdump - The kexec-based Crash Dumping Solution
+================================================================
+
+This document includes overview, setup and installation, and analysis
+information.
+
+.. toctree::
+ :maxdepth: 1
+
+ kdump
+ vmcoreinfo
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
new file mode 100644
index 0000000..ac7e131
--- /dev/null
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -0,0 +1,534 @@
+================================================================
+Documentation for Kdump - The kexec-based Crash Dumping Solution
+================================================================
+
+This document includes overview, setup and installation, and analysis
+information.
+
+Overview
+========
+
+Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
+dump of the system kernel's memory needs to be taken (for example, when
+the system panics). The system kernel's memory image is preserved across
+the reboot and is accessible to the dump-capture kernel.
+
+You can use common commands, such as cp and scp, to copy the
+memory image to a dump file on the local disk, or across the network to
+a remote system.
+
+Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64,
+s390x, arm and arm64 architectures.
+
+When the system kernel boots, it reserves a small section of memory for
+the dump-capture kernel. This ensures that ongoing Direct Memory Access
+(DMA) from the system kernel does not corrupt the dump-capture kernel.
+The kexec -p command loads the dump-capture kernel into this reserved
+memory.
+
+On x86 machines, the first 640 KB of physical memory is needed to boot,
+regardless of where the kernel loads. Therefore, kexec backs up this
+region just before rebooting into the dump-capture kernel.
+
+Similarly, on PPC64 machines the first 32KB of physical memory is needed
+for booting regardless of where the kernel is loaded. To support a 64K
+page size, kexec backs up the first 64KB of memory.
+
+For s390x, when kdump is triggered, the crashkernel region is exchanged
+with the region [0, crashkernel region size] and then the kdump kernel
+runs in [0, crashkernel region size]. Therefore no relocatable kernel is
+needed for s390x.
+
+All of the necessary information about the system kernel's core image is
+encoded in the ELF format, and stored in a reserved area of memory
+before a crash. The physical address of the start of the ELF header is
+passed to the dump-capture kernel through the elfcorehdr= boot
+parameter. Optionally the size of the ELF header can also be passed
+when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax.
+
+
+With the dump-capture kernel, you can access the memory image through
+/proc/vmcore. This exports the dump as an ELF-format file that you can
+write out using file copy commands such as cp or scp. Further, you can
+use analysis tools such as the GNU Debugger (GDB) and the Crash tool to
+debug the dump file. This method ensures that the dump pages are correctly
+ordered.
+
+
+Setup and Installation
+======================
+
+Install kexec-tools
+-------------------
+
+1) Log in as the root user.
+
+2) Download the kexec-tools user-space package from the following URL:
+
+http://kernel.org/pub/linux/utils/kernel/kexec/kexec-tools.tar.gz
+
+This is a symlink to the latest version.
+
+The latest kexec-tools git tree is available at:
+
+- git://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
+- http://www.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git
+
+There is also a gitweb interface available at
+http://www.kernel.org/git/?p=utils/kernel/kexec/kexec-tools.git
+
+More information about kexec-tools can be found at
+http://horms.net/projects/kexec/
+
+3) Unpack the tarball with the tar command, as follows::
+
+ tar xvpzf kexec-tools.tar.gz
+
+4) Change to the kexec-tools directory, as follows::
+
+ cd kexec-tools-VERSION
+
+5) Configure the package, as follows::
+
+ ./configure
+
+6) Compile the package, as follows::
+
+ make
+
+7) Install the package, as follows::
+
+ make install
+
+
+Build the system and dump-capture kernels
+-----------------------------------------
+There are two possible methods of using Kdump.
+
+1) Build a separate custom dump-capture kernel for capturing the
+ kernel core dump.
+
+2) Or use the system kernel binary itself as the dump-capture kernel, in
+ which case there is no need to build a separate dump-capture kernel.
+ This is possible only on architectures that support a relocatable
+ kernel. As of today, the i386, x86_64, ppc64, ia64, arm and arm64
+ architectures support a relocatable kernel.
+
+Using a relocatable kernel has the advantage that one does not have to
+build a second kernel for capturing the dump. At the same time, one
+might still want to build a custom dump-capture kernel suited to their
+needs.
+
+The following configuration settings are required for the system and
+dump-capture kernels to enable kdump support.
+
+System kernel config options
+----------------------------
+
+1) Enable "kexec system call" in "Processor type and features."::
+
+ CONFIG_KEXEC=y
+
+2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
+ filesystems." This is usually enabled by default::
+
+ CONFIG_SYSFS=y
+
+ Note that "sysfs file system support" might not appear in the "Pseudo
+ filesystems" menu if "Configure standard kernel features (for small
+ systems)" is not enabled in "General Setup." In this case, check the
+ .config file itself to ensure that sysfs is turned on, as follows::
+
+ grep 'CONFIG_SYSFS' .config
+
+3) Enable "Compile the kernel with debug info" in "Kernel hacking."::
+
+ CONFIG_DEBUG_INFO=y
+
+ This causes the kernel to be built with debug symbols. The dump
+ analysis tools require a vmlinux with debug symbols in order to read
+ and analyze a dump file.
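+
+As a quick check, the three options above can be confirmed in the build
+tree before installing the system kernel. This is only a sketch: it assumes
+the current directory is the top of the configured kernel source tree, and
+each of the three options is expected to be reported as "=y"::
+
+ grep -E '^CONFIG_(KEXEC|SYSFS|DEBUG_INFO)=' .config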
+
+Dump-capture kernel config options (Arch Independent)
+-----------------------------------------------------
+
+1) Enable "kernel crash dumps" support under "Processor type and
+ features"::
+
+ CONFIG_CRASH_DUMP=y
+
+2) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems"::
+
+ CONFIG_PROC_VMCORE=y
+
+ (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
+
+Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
+--------------------------------------------------------------------
+
+1) On i386, enable high memory support under "Processor type and
+ features"::
+
+ CONFIG_HIGHMEM64G=y
+
+ or::
+
+ CONFIG_HIGHMEM4G=y
+
+2) On i386 and x86_64, disable symmetric multi-processing support
+ under "Processor type and features"::
+
+ CONFIG_SMP=n
+
+ (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
+ when loading the dump-capture kernel, see section "Load the Dump-capture
+ Kernel".)
+
+3) If one wants to build and use a relocatable kernel,
+ enable "Build a relocatable kernel" support under "Processor type and
+ features"::
+
+ CONFIG_RELOCATABLE=y
+
+4) Use a suitable value for "Physical address where the kernel is
+ loaded" (under "Processor type and features"). This only appears when
+ "kernel crash dumps" is enabled. A suitable value depends upon
+ whether the kernel is relocatable or not.
+
+ If you are using a relocatable kernel, use CONFIG_PHYSICAL_START=0x100000.
+ This will compile the kernel for physical address 1MB, but because the
+ kernel is relocatable, it can be run from any physical address; hence the
+ kexec boot loader will load it into the memory region reserved for the
+ dump-capture kernel.
+
+ Otherwise it should be the start of the memory region reserved for the
+ second kernel using the boot parameter "crashkernel=Y@X". Here X is the
+ start of the memory region reserved for the dump-capture kernel.
+ Generally X is 16MB (0x1000000), so you can set
+ CONFIG_PHYSICAL_START=0x1000000.
+
+5) Make and install the kernel and its modules. DO NOT add this kernel
+ to the boot loader configuration files.
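+
+For illustration only, the dump-capture kernel options for a
+non-relocatable x86_64 kernel might be combined as follows. The fragment
+assumes the system kernel is booted with "crashkernel=64M@16M", so that the
+reserved region starts at 16MB; CONFIG_PHYSICAL_START has to be adjusted if
+a different start address is reserved::
+
+ CONFIG_CRASH_DUMP=y
+ CONFIG_PROC_VMCORE=y
+ # CONFIG_RELOCATABLE is not set
+ CONFIG_PHYSICAL_START=0x1000000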
+
+Dump-capture kernel config options (Arch Dependent, ppc64)
+----------------------------------------------------------
+
+1) Enable "Build a kdump crash kernel" support under "Kernel" options::
+
+ CONFIG_CRASH_DUMP=y
+
+2) Enable "Build a relocatable kernel" support::
+
+ CONFIG_RELOCATABLE=y
+
+ Make and install the kernel and its modules.
+
+Dump-capture kernel config options (Arch Dependent, ia64)
+----------------------------------------------------------
+
+- No specific options are required to create a dump-capture kernel
+ for ia64, other than those specified in the arch independent section
+ above. This means that it is possible to use the system kernel
+ as a dump-capture kernel if desired.
+
+ The crashkernel region can be automatically placed by the system
+ kernel at run time. This is done by specifying the base address as 0,
+ or omitting it altogether::
+
+ crashkernel=256M@0
+
+ or::
+
+ crashkernel=256M
+
+ If the start address is specified, note that the start address of the
+ kernel will be aligned to 64MB, so if the specified start address is not
+ aligned, any space below the alignment point will be wasted.
+
+Dump-capture kernel config options (Arch Dependent, arm)
+----------------------------------------------------------
+
+- To use a relocatable kernel,
+ enable "AUTO_ZRELADDR" support under "Boot" options::
+
+ CONFIG_AUTO_ZRELADDR=y
+
+Dump-capture kernel config options (Arch Dependent, arm64)
+----------------------------------------------------------
+
+- Please note that KVM will not be enabled in the dump-capture kernel
+ on non-VHE systems even if it is configured. This is because the CPU
+ will not be reset to EL2 on panic.
+
+Extended crashkernel syntax
+===========================
+
+While the "crashkernel=size[@offset]" syntax is sufficient for most
+configurations, sometimes it's handy to have the reserved memory dependent
+on the value of System RAM -- that's mostly for distributors that pre-setup
+the kernel command line to avoid an unbootable system after some memory has
+been removed from the machine.
+
+The syntax is::
+
+ crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
+ range=start-[end]
+
+For example::
+
+ crashkernel=512M-2G:64M,2G-:128M
+
+This would mean:
+
+ 1) if the RAM is smaller than 512M, then don't reserve anything
+ (this is the "rescue" case)
+ 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
+ 3) if the RAM size is 2G or larger, then reserve 128M
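+
+As a worked illustration of the example above (the RAM sizes are chosen
+arbitrarily, and the amount of System RAM seen by the kernel may be
+slightly smaller than the installed memory)::
+
+ System RAM   matching range   reserved
+ 256M         none             nothing
+ 1G           512M-2G          64M
+ 4G           2G-              128M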
+
+
+
+Boot into System Kernel
+=======================
+
+1) Update the boot loader (such as grub, yaboot, or lilo) configuration
+ files as necessary.
+
+2) Boot the system kernel with the boot parameter "crashkernel=Y@X",
+ where Y specifies how much memory to reserve for the dump-capture kernel
+ and X specifies the beginning of this reserved memory. For example,
+ "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
+ starting at physical address 0x01000000 (16MB) for the dump-capture kernel.
+
+ On x86 and x86_64, use "crashkernel=64M@16M".
+
+ On ppc64, use "crashkernel=128M@32M".
+
+ On ia64, 256M@256M is a generous value that typically works.
+ The region may be automatically placed on ia64, see the
+ dump-capture kernel config option notes above.
+ If sparse memory is used, the size should be rounded to GRANULE boundaries.
+
+ On s390x, typically use "crashkernel=xxM". The value of xx is dependent
+ on the memory consumption of the kdump system. In general this is not
+ dependent on the memory size of the production system.
+
+ On arm, the use of "crashkernel=Y@X" is no longer necessary; the
+ kernel will automatically locate the crash kernel image within the
+ first 512MB of RAM if X is not given.
+
+ On arm64, use "crashkernel=Y[@X]". Note that the start address of
+ the kernel, X if explicitly specified, must be aligned to 2MiB (0x200000).
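+
+After rebooting, the reservation can be checked from the running system
+kernel. This is a sketch that assumes "crashkernel=64M@16M" was added to
+the boot loader configuration; how that configuration is edited depends on
+the boot loader and the distribution::
+
+ # the parameter should appear on the kernel command line
+ cat /proc/cmdline
+
+ # a "Crash kernel" entry should be listed if the reservation succeeded
+ grep -i 'crash kernel' /proc/iomem
+
+ # size of the reserved region, in bytes
+ cat /sys/kernel/kexec_crash_size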
+
+Load the Dump-capture Kernel
+============================
+
+After booting to the system kernel, the dump-capture kernel needs to be
+loaded.
+
+Based on the architecture and type of image (relocatable or not), one
+can choose to load the uncompressed vmlinux or compressed bzImage/vmlinuz
+of the dump-capture kernel. The following is a summary.
+
+For i386 and x86_64:
+
+ - Use vmlinux if kernel is not relocatable.
+ - Use bzImage/vmlinuz if kernel is relocatable.
+
+For ppc64:
+
+ - Use vmlinux
+
+For ia64:
+
+ - Use vmlinux or vmlinuz.gz
+
+For s390x:
+
+ - Use image or bzImage
+
+For arm:
+
+ - Use zImage
+
+For arm64:
+
+ - Use vmlinux or Image
+
+If you are using an uncompressed vmlinux image, then use the following
+command to load the dump-capture kernel::
+
+ kexec -p <dump-capture-kernel-vmlinux-image> \
+ --initrd=<initrd-for-dump-capture-kernel> --args-linux \
+ --append="root=<root-dev> <arch-specific-options>"
+
+If you are using a compressed bzImage/vmlinuz, then use the following
+command to load the dump-capture kernel::
+
+ kexec -p <dump-capture-kernel-bzImage> \
+ --initrd=<initrd-for-dump-capture-kernel> \
+ --append="root=<root-dev> <arch-specific-options>"
+
+If you are using a compressed zImage, then use the following command
+to load the dump-capture kernel::
+
+ kexec --type zImage -p <dump-capture-kernel-zImage> \
+ --initrd=<initrd-for-dump-capture-kernel> \
+ --dtb=<dtb-for-dump-capture-kernel> \
+ --append="root=<root-dev> <arch-specific-options>"
+
+If you are using an uncompressed Image, then use the following command
+to load the dump-capture kernel::
+
+ kexec -p <dump-capture-kernel-Image> \
+ --initrd=<initrd-for-dump-capture-kernel> \
+ --append="root=<root-dev> <arch-specific-options>"
+
+Please note that --args-linux does not need to be specified for ia64.
+It is planned to make this a no-op on that architecture, but for now
+it should be omitted.
+
+The following arch-specific command line options should be used when
+loading the dump-capture kernel.
+
+For i386, x86_64 and ia64:
+
+ "1 irqpoll maxcpus=1 reset_devices"
+
+For ppc64:
+
+ "1 maxcpus=1 noirqdistrib reset_devices"
+
+For s390x:
+
+ "1 maxcpus=1 cgroup_disable=memory"
+
+For arm:
+
+ "1 maxcpus=1 reset_devices"
+
+For arm64:
+
+ "1 maxcpus=1 reset_devices"
+
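+Putting the pieces together, loading a relocatable x86_64 dump-capture
+kernel could look like the following sketch. The kernel and initrd paths
+and the root device are hypothetical placeholders and must be replaced
+with the values for the actual system::
+
+ kexec -p /boot/vmlinuz-kdump \
+ --initrd=/boot/initrd-kdump.img \
+ --append="root=/dev/sda1 1 irqpoll maxcpus=1 reset_devices"
+
+ # reads as 1 once a dump-capture kernel is loaded
+ cat /sys/kernel/kexec_crash_loaded
+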
+Notes on loading the dump-capture kernel:
+
+* By default, the ELF headers are stored in ELF64 format to support
+ systems with more than 4GB memory. On i386, kexec automatically checks if
+ the physical RAM size exceeds the 4 GB limit and if not, uses ELF32.
+ So, on non-PAE systems, ELF32 is always used.
+
+ The --elf32-core-headers option can be used to force the generation of ELF32
+ headers. This is necessary because GDB currently cannot open vmcore files
+ with ELF64 headers on 32-bit systems.
+
+* The "irqpoll" boot parameter reduces driver initialization failures
+ due to shared interrupts in the dump-capture kernel.
+
+* You must specify <root-dev> in the format corresponding to the root
+ device name in the output of the mount command.
+
+* Boot parameter "1" boots the dump-capture kernel into single-user
+ mode without networking. If you want networking, use "3".
+
+* We generally don't have to bring up an SMP kernel just to capture the
+ dump. Hence it is generally useful either to build a UP dump-capture
+ kernel or to specify the maxcpus=1 option while loading the dump-capture
+ kernel. Note that although maxcpus always works, it is better to replace
+ it with nr_cpus to save memory if the architecture supports it, such as x86.
+
+* You should enable multi-cpu support in the dump-capture kernel if you
+ intend to use multi-threaded programs with it, such as the parallel dump
+ feature of makedumpfile. Otherwise, multi-threaded programs may suffer
+ severe performance degradation. To enable multi-cpu support, you should
+ bring up an SMP dump-capture kernel and specify the maxcpus/nr_cpus and
+ disable_cpu_apicid=[X] options while loading it.
+
+* For s390x there are two kdump modes: If an ELF header is specified with
+ the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
+ is done on all other architectures. If no elfcorehdr= kernel parameter is
+ specified, the s390x kdump kernel dynamically creates the header. The
+ second mode has the advantage that for CPU and memory hotplug, kdump does
+ not have to be reloaded with kexec_load().
+
+* For s390x systems with many attached devices the "cio_ignore" kernel
+ parameter should be used for the kdump kernel in order to prevent allocation
+ of kernel memory for devices that are not relevant for kdump. The same
+ applies to systems that use SCSI/FCP devices. In that case the
+ "allow_lun_scan" zfcp module parameter should be set to zero before
+ setting FCP devices online.
+
+Kernel Panic
+============
+
+After successfully loading the dump-capture kernel as previously
+described, the system will reboot into the dump-capture kernel if a
+system crash is triggered. Trigger points are located in panic(),
+die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
+
+The following conditions will execute a crash trigger point:
+
+If a hard lockup is detected and "NMI watchdog" is configured, the system
+will boot into the dump-capture kernel ( die_nmi() ).
+
+If die() is called by a thread with pid 0 or 1, or die() is called
+inside interrupt context, or die() is called and panic_on_oops is set,
+the system will boot into the dump-capture kernel.
+
+On powerpc systems when a soft-reset is generated, die() is called by all cpus
+and the system will boot into the dump-capture kernel.
+
+For testing purposes, you can trigger a crash by using "ALT-SysRq-c" or
+"echo c > /proc/sysrq-trigger", or by writing a module that forces a panic.
+
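+A minimal test sequence, run as root on a system where the dump-capture
+kernel has already been loaded, might be the following. Note that this
+crashes the running system immediately, so it should only be used on a
+test machine::
+
+ # allow all SysRq functions from the keyboard (only needed for ALT-SysRq-c)
+ echo 1 > /proc/sys/kernel/sysrq
+ # trigger a crash
+ echo c > /proc/sysrq-trigger
+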
+Write Out the Dump File
+=======================
+
+After the dump-capture kernel is booted, write out the dump file with
+the following command::
+
+ cp /proc/vmcore <dump-file>
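+
+The dump can also be written to a remote system or filtered while it is
+copied. Both commands below are sketches: the user, host and target paths
+are hypothetical, and the makedumpfile example assumes that the tool is
+installed in the dump-capture environment. The compressed output of
+makedumpfile is meant for the Crash utility rather than for GDB::
+
+ # copy across the network
+ scp /proc/vmcore user@remote-host:/var/crash/vmcore
+
+ # compress and exclude unnecessary pages (dump level 31)
+ makedumpfile -c -d 31 /proc/vmcore /var/crash/vmcore.compressed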
+
+
+Analysis
+========
+
+Before analyzing the dump image, you should reboot into a stable kernel.
+
+You can do limited analysis using GDB on the dump file copied out of
+/proc/vmcore. Use the debug vmlinux built with -g and run the following
+command::
+
+ gdb vmlinux <dump-file>
+
+Stack trace for the task on processor 0, register display, and memory
+display work fine.
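+
+The GDB macros shipped with this documentation can be loaded in the same
+session. A sketch, assuming the macro file has been saved as gdbmacros.txt
+in the current directory; btt expects a kernel built with
+CONFIG_FRAME_POINTER, otherwise bttnobp can be tried instead::
+
+ gdb --command=gdbmacros.txt vmlinux <dump-file>
+ (gdb) bt
+ (gdb) btt
+ (gdb) dmesg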
+
+Note: GDB cannot analyze core files generated in ELF64 format for x86.
+On systems with a maximum of 4GB of memory, you can generate
+ELF32-format headers by passing the --elf32-core-headers option to kexec
+when loading the dump-capture kernel.
+
+You can also use the Crash utility to analyze dump files in Kdump
+format. Crash is available on Dave Anderson's site at the following URL:
+
+ http://people.redhat.com/~anderson/
+
+Trigger Kdump on WARN()
+=======================
+
+The kernel parameter panic_on_warn causes panic() to be called in all
+WARN() paths. This will cause a kdump to occur at the panic() call. In
+cases where a user wants to specify this during runtime,
+/proc/sys/kernel/panic_on_warn can be set to 1 to achieve the same behaviour.
+
+Contact
+=======
+
+- Vivek Goyal (vgoyal@redhat.com)
+- Maneesh Soni (maneesh@in.ibm.com)
+
+GDB macros
+==========
+
+.. include:: gdbmacros.txt
+ :literal:
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
new file mode 100644
index 0000000..007a6b8
--- /dev/null
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -0,0 +1,488 @@
+==========
+VMCOREINFO
+==========
+
+What is it?
+===========
+
+VMCOREINFO is a special ELF note section. It contains various
+information from the kernel like structure size, page size, symbol
+values, field offsets, etc. These data are packed into an ELF note
+section and used by user-space tools like crash and makedumpfile to
+analyze a kernel's memory layout.
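+
+The note is plain text, with one "KEY=value" entry per line. The excerpt
+below is illustrative only: the entries shown are a small subset, the
+addresses are left as placeholders, and the sizes and offsets are those of
+a typical 64-bit kernel::
+
+ OSRELEASE=5.4.2
+ PAGESIZE=4096
+ SYMBOL(init_uts_ns)=<address>
+ SYMBOL(_stext)=<address>
+ SIZE(list_head)=16
+ OFFSET(list_head.next)=0
+ OFFSET(list_head.prev)=8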
+
+Common variables
+================
+
+init_uts_ns.name.release
+------------------------
+
+The version of the Linux kernel. Used to find the corresponding source
+code from which the kernel has been built. For example, crash uses it to
+find the corresponding vmlinux in order to process vmcore.
+
+PAGE_SIZE
+---------
+
+The size of a page. It is the smallest unit of data used by the memory
+management facilities. It is usually 4096 bytes in size, and pages are
+aligned on 4096-byte boundaries. Used for computing page addresses.
+
+init_uts_ns
+-----------
+
+The UTS namespace which is used to isolate two specific elements of the
+system that relate to the uname(2) system call. It is named after the
+data structure used to store information returned by the uname(2) system
+call.
+
+User-space tools can get the kernel name, host name, kernel release
+number, kernel version, architecture name and OS type from it.
+
+node_online_map
+---------------
+
+An array node_states[N_ONLINE] which represents the set of online nodes
+in a system, one bit position per node number. Used to keep track of
+which nodes are in the system and online.
+
+swapper_pg_dir
+--------------
+
+The global page directory pointer of the kernel. Used to translate
+virtual to physical addresses.
+
+_stext
+------
+
+Defines the beginning of the text section. In general, _stext indicates
+the kernel start address. Used to convert a virtual address from the
+direct kernel map to a physical address.
+
+vmap_area_list
+--------------
+
+Stores the virtual area list. makedumpfile gets the vmalloc start value
+from this variable and its value is necessary for vmalloc translation.
+
+mem_map
+-------
+
+Physical addresses are translated to struct pages by treating them as
+an index into the mem_map array. Right-shifting a physical address
+PAGE_SHIFT bits converts it into a page frame number which is an index
+into that mem_map array.
+
+Used to map an address to the corresponding struct page.
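+
+As a worked example, assuming 4096-byte pages (PAGE_SHIFT = 12) and a flat
+memory model in which page frame 0 corresponds to mem_map[0]::
+
+ physical address  = 0x12345678
+ page frame number = 0x12345678 >> 12 = 0x12345
+ struct page       = &mem_map[0x12345]
+ offset in page    = 0x12345678 & 0xfff = 0x678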
+
+contig_page_data
+----------------
+
+Makedumpfile gets the pglist_data structure from this symbol, which is
+used to describe the memory layout.
+
+User-space tools use this to exclude free pages when dumping memory.
+
+mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
+--------------------------------------------------------------------------
+
+The address of the mem_section array, its length, structure size, and
+the section_mem_map offset.
+
+It exists in the sparse memory mapping model. It is somewhat similar to
+the mem_map variable in that both are used to translate an
+address.
+
+page
+----
+
+The size of a page structure. struct page is an important data structure
+and it is widely used to compute contiguous memory.
+
+pglist_data
+-----------
+
+The size of a pglist_data structure. This value is used to check if the
+pglist_data structure is valid. It is also used for checking the memory
+type.
+
+zone
+----
+
+The size of a zone structure. This value is used to check if the zone
+structure has been found. It is also used for excluding free pages.
+
+free_area
+---------
+
+The size of a free_area structure. It indicates whether the free_area
+structure is valid or not. Useful when excluding free pages.
+
+list_head
+---------
+
+The size of a list_head structure. Used when iterating lists in a
+post-mortem analysis session.
+
+nodemask_t
+----------
+
+The size of a nodemask_t type. Used to compute the number of online
+nodes.
+
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
+-------------------------------------------------------------------------------------------------
+
+User-space tools compute their values based on the offset of these
+variables. The variables are used when excluding unnecessary pages.
+
+(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|node_spanned_pages|node_id)
+-----------------------------------------------------------------------------------------
+
+On NUMA machines, each NUMA node has a pg_data_t to describe its memory
+layout. On UMA machines there is a single pglist_data which describes the
+whole memory.
+
+These values are used to check the memory type and to compute the
+virtual address for memory map.
+
+(zone, free_area|vm_stat|spanned_pages)
+---------------------------------------
+
+Each node is divided into a number of blocks called zones which
+represent ranges within memory. A zone is described by a structure zone.
+
+User-space tools compute required values based on the offset of these
+variables.
+
+(free_area, free_list)
+----------------------
+
+Offset of the free_list member. This value is used to compute the number
+of free pages.
+
+Each zone has a free_area structure array called free_area[MAX_ORDER].
+The free_list represents a linked list of free page blocks.
+
+(list_head, next|prev)
+----------------------
+
+Offsets of the list_head's members. list_head is used to define a
+circular linked list. User-space tools need these in order to traverse
+lists.
+
+(vmap_area, va_start|list)
+--------------------------
+
+Offsets of the vmap_area's members. They carry vmalloc-specific
+information. Makedumpfile gets the start address of the vmalloc region
+from this.
+
+(zone.free_area, MAX_ORDER)
+---------------------------
+
+Free areas descriptor. User-space tools use this value to iterate the
+free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+
+log_first_idx
+-------------
+
+Index of the first record stored in the buffer log_buf. Used by
+user-space tools to read the strings in the log_buf.
+
+log_buf
+-------
+
+Console output is written to the ring buffer log_buf at index
+log_first_idx. Used to get the kernel log.
+
+log_buf_len
+-----------
+
+log_buf's length.
+
+clear_idx
+---------
+
+The index of the next printk() record to read after the last clear
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, such as one issued by 'dmesg -c'. Used by user-space
+tools to dump the dmesg log.
+
+log_next_idx
+------------
+
+The index of the next record to store in the buffer log_buf. Used to
+compute the index of the current buffer position.
+
+printk_log
+----------
+
+The size of a struct printk_log. Used to compute the size of
+messages and to extract the dmesg log. The structure encapsulates header
+information for log_buf records, such as the timestamp, syslog level, etc.
+
+(printk_log, ts_nsec|len|text_len|dict_len)
+-------------------------------------------
+
+The field offsets in struct printk_log. User-space tools parse them
+and check whether the values of printk_log's members have been
+changed.
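+
+How the log-related values fit together can be sketched in Python. This
+is a simplified illustration under stated assumptions, not the logic of
+any particular tool: ``len`` and ``text_len`` are assumed to be
+little-endian 16-bit fields, the message text is assumed to follow the
+header directly, and a record with a zero ``len`` is assumed to mark the
+wrap to the start of the buffer. The ``off`` dictionary is a
+hypothetical container for the exported offsets::
+
+    import struct
+
+    def read_log(log_buf, first_idx, next_idx, size, off):
+        """Return the text of each record from first_idx to next_idx.
+
+        log_buf is the raw ring buffer as bytes, size is the exported
+        size of struct printk_log, and off maps "len" and "text_len"
+        to their exported offsets (names chosen for this sketch).
+        """
+        texts = []
+        idx = first_idx
+        while idx != next_idx:
+            (rec_len,) = struct.unpack_from("<H", log_buf, idx + off["len"])
+            if rec_len == 0:
+                if idx == 0:
+                    break  # corrupted buffer, avoid looping forever
+                idx = 0    # assumed wrap marker: restart at the beginning
+                continue
+            (text_len,) = struct.unpack_from(
+                "<H", log_buf, idx + off["text_len"])
+            start = idx + size  # text assumed to follow the header
+            text = log_buf[start:start + text_len]
+            texts.append(text.decode(errors="replace"))
+            idx += rec_len
+        return texts
+
+Passing the size and offsets in as parameters, rather than hard-coding
+them, mirrors the reason these values are exported: the layout of the
+structure may differ between kernel versions.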
+
+(free_area.free_list, MIGRATE_TYPES)
+------------------------------------
+
+The number of migrate types for pages. The free_list is described by the
+array. Used by tools to compute the number of free pages.
+
+NR_FREE_PAGES
+-------------
+
+On linux-2.6.21 or later, the number of free pages is in
+vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
+
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison|PG_head_mask
+------------------------------------------------------------------------------
+
+Page attributes. These flags are used to filter out various pages that
+are unnecessary for dumping.
+
+PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+-----------------------------------------------------------------------------
+
+More page attributes. These flags are used to filter out various pages
+that are unnecessary for dumping.
+
+
+HUGETLB_PAGE_DTOR
+-----------------
+
+The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
+excludes these pages.
+
+x86_64
+======
+
+phys_base
+---------
+
+Used to convert the virtual address of an exported kernel symbol to its
+corresponding physical address.
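+
+The arithmetic can be sketched in Python as follows. The constant is
+the x86_64 base of the kernel text mapping (``__START_KERNEL_map``); the
+function name is chosen for this illustration only::
+
+    # Base of the x86_64 kernel text mapping (__START_KERNEL_map).
+    START_KERNEL_MAP = 0xFFFFFFFF80000000
+
+    def kernel_sym_to_phys(vaddr, phys_base):
+        """Translate a kernel-text virtual address to a physical one.
+
+        Only valid for addresses inside the kernel text mapping;
+        other mappings need a page table walk instead.
+        """
+        if vaddr < START_KERNEL_MAP:
+            raise ValueError("not in the kernel text mapping")
+        return vaddr - START_KERNEL_MAP + phys_base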
+
+init_top_pgt
+------------
+
+Used to walk through the whole page table and convert virtual addresses
+to physical addresses. The init_top_pgt is somewhat similar to
+swapper_pg_dir, but it is only used in x86_64.
+
+pgtable_l5_enabled
+------------------
+
+User-space tools need to know whether the crash kernel was in 5-level
+paging mode.
+
+node_data
+---------
+
+This is a struct pglist_data array that stores information about all
+NUMA nodes. Makedumpfile gets the pglist_data structure from it.
+
+(node_data, MAX_NUMNODES)
+-------------------------
+
+The maximum number of nodes in the system.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+KERNEL_IMAGE_SIZE
+-----------------
+
+Currently unused by Makedumpfile. Used to compute the module virtual
+address by Crash.
+
+sme_mask
+--------
+
+AMD-specific with SME support: it indicates the secure memory encryption
+mask. Makedumpfile tools need to know whether the crash kernel was
+encrypted. If SME is enabled in the first kernel, the crash kernel's
+page table entries (pgd/pud/pmd/pte) contain the memory encryption
+mask. This is used to remove the SME mask and obtain the true physical
+address.
+
+Currently, sme_mask stores the value of the C-bit position. If needed,
+additional SME-relevant info can be placed in that variable.
+
+For example::
+
+     [ misc          ][ enc bit  ][ other misc SME info       ]
+     0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
+     63   59   55   51   47   43   39   35   31   27   ... 3
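+
+Removing the mask from a page table entry can be sketched in Python.
+This is an illustration under assumptions: ``PTE_PFN_MASK`` covers bits
+12 to 51, which presumes 4 KiB pages and a 52-bit physical address
+limit, and the function name is hypothetical::
+
+    # Assumed address field of an entry: bits 12..51.
+    PTE_PFN_MASK = 0x000FFFFFFFFFF000
+
+    def pte_to_phys(pte, sme_mask):
+        """Return the physical page address held in a page table entry.
+
+        The encryption bit lies inside the address field, so it is
+        cleared before the flag bits are masked off.
+        """
+        return pte & ~sme_mask & PTE_PFN_MASK
+
+With the C-bit at position 47, as in the diagram above, ``sme_mask``
+would be ``1 << 47``.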
+
+x86_32
+======
+
+X86_PAE
+-------
+
+Denotes whether physical address extensions are enabled. It has the cost
+of a higher page table lookup overhead, and also consumes more page
+table space per process. Used to check whether PAE was enabled in the
+crash kernel when converting virtual addresses to physical addresses.
+
+ia64
+====
+
+pgdat_list|(pgdat_list, MAX_NUMNODES)
+-------------------------------------
+
+pg_data_t array storing information about all NUMA nodes. MAX_NUMNODES
+indicates the number of nodes.
+
+node_memblk|(node_memblk, NR_NODE_MEMBLKS)
+------------------------------------------
+
+List of node memory chunks. Filled when parsing the SRAT table to obtain
+information about memory nodes. NR_NODE_MEMBLKS indicates the number of
+node memory chunks.
+
+These values are used to compute the number of nodes the crashed kernel used.
+
+node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
+----------------------------------------------------------------
+
+The size of a struct node_memblk_s and the offsets of the
+node_memblk_s's members. Used to compute the number of nodes.
+
+PGTABLE_3|PGTABLE_4
+-------------------
+
+User-space tools need to know whether the crash kernel was in 3-level or
+4-level paging mode. Used to distinguish the page table.
+
+ARM64
+=====
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+kimage_voffset
+--------------
+
+The offset between the kernel virtual and physical mappings. Used to
+translate virtual to physical addresses.
+
+PHYS_OFFSET
+-----------
+
+Indicates the physical address of the start of memory. Similar to
+kimage_voffset, which is used to translate virtual to physical
+addresses.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+arm
+===
+
+ARM_LPAE
+--------
+
+It indicates whether the crash kernel supports large physical address
+extensions. Used to translate virtual to physical addresses.
+
+s390
+====
+
+lowcore_ptr
+-----------
+
+An array with a pointer to the lowcore of every CPU. Used to print the
+PSW and all register information.
+
+high_memory
+-----------
+
+Used to get the vmalloc_start address from the high_memory symbol.
+
+(lowcore_ptr, NR_CPUS)
+----------------------
+
+The maximum number of CPUs.
+
+powerpc
+=======
+
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+contig_page_data
+----------------
+
+See above.
+
+vmemmap_list
+------------
+
+The vmemmap_list maintains the entire vmemmap physical mapping. Used
+to get vmemmap list count and populated vmemmap regions info. If the
+vmemmap address translation information is stored in the crash kernel,
+it is used to translate vmemmap kernel virtual addresses.
+
+mmu_vmemmap_psize
+-----------------
+
+The size of a page. Used to translate virtual to physical addresses.
+
+mmu_psize_defs
+--------------
+
+Page size definitions, i.e. 4k, 64k, or 16M.
+
+Used to make vtop translations.
+
+vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|(vmemmap_backing, virt_addr)
+--------------------------------------------------------------------------------------------
+
+The vmemmap virtual address space management does not have a traditional
+page table to track which virtual struct pages are backed by a physical
+mapping. The virtual to physical mappings are tracked in a simple linked
+list format.
+
+User-space tools need to know the offset of list, phys and virt_addr
+when computing the count of vmemmap regions.
+
+mmu_psize_def|(mmu_psize_def, shift)
+------------------------------------
+
+The size of a struct mmu_psize_def and the offset of mmu_psize_def's
+member.
+
+Used in vtop translations.
+
+sh
+==
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+X2TLB
+-----
+
+Indicates whether the crashed kernel enabled SH extended mode.
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index b8d0bc0..d05d531 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -9,11 +9,11 @@
punctuation and sorting digits before letters in a case insensitive
manner), and with descriptions where known.
-The kernel parses parameters from the kernel command line up to "--";
+The kernel parses parameters from the kernel command line up to "``--``";
if it doesn't recognize a parameter and it doesn't contain a '.', the
parameter gets passed to init: parameters with '=' go into init's
environment, others are passed as command line arguments to init.
-Everything after "--" is passed as an argument to init.
+Everything after "``--``" is passed as an argument to init.
Module parameters can be specified in two ways: via the kernel command
line with a module name prefix, or via modprobe, e.g.::
@@ -88,6 +88,7 @@
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
ARM ARM architecture is enabled.
+ ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
CLK Common clock infrastructure is enabled.
CMA Contiguous Memory Area support is enabled.
@@ -117,7 +118,7 @@
LOOP Loopback device support is enabled.
M68k M68k architecture is enabled.
These options have more detailed description inside of
- Documentation/m68k/kernel-options.txt.
+ Documentation/m68k/kernel-options.rst.
MDA MDA console support is enabled.
MIPS MIPS architecture is enabled.
MOUSE Appropriate mouse support is enabled.
@@ -166,7 +167,7 @@
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
- Documentation/x86/x86_64/boot-options.txt .
+ Documentation/x86/x86_64/boot-options.rst.
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
X86_UV SGI UV support is enabled.
XEN Xen support is enabled
@@ -180,10 +181,10 @@
Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
Do not modify the syntax of boot loader parameters without extreme
-need or coordination with <Documentation/x86/boot.txt>.
+need or coordination with <Documentation/x86/boot.rst>.
There are also arch-specific kernel-parameters not documented here.
-See for example <Documentation/x86/x86_64/boot-options.txt>.
+See for example <Documentation/x86/x86_64/boot-options.rst>.
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0c404cd..9983ac7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -13,7 +13,7 @@
For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force"
are available
- See also Documentation/power/runtime_pm.txt, pci=noacpi
+ See also Documentation/power/runtime_pm.rst, pci=noacpi
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
@@ -53,7 +53,7 @@
ACPI_DEBUG_PRINT statements, e.g.,
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
The debug_level mask defaults to "info". See
- Documentation/acpi/debug.txt for more information about
+ Documentation/firmware-guide/acpi/debug.rst for more information about
debug layers and levels.
Enable processor driver info messages:
@@ -223,7 +223,7 @@
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
old_ordering, nonvs, sci_force_enable, nobl }
- See Documentation/power/video.txt for information on
+ See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
@@ -331,7 +331,7 @@
APC and your system crashes randomly.
apic= [APIC,X86] Advanced Programmable Interrupt Controller
- Change the output verbosity whilst booting
+ Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
@@ -430,7 +430,7 @@
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
- See Documentation/block/cmdline-partition.txt
+ See Documentation/block/cmdline-partition.rst
boot_delay= Milliseconds to delay each printk during boot.
Values larger than 10 seconds (10000) are changed to
@@ -461,6 +461,11 @@
possible to determine what the correct size should be.
This option provides an override for these situations.
+ carrier_timeout=
+ [NET] Specifies the amount of time (in seconds) that
+ the kernel should wait for a network carrier. By default
+ it waits 120 seconds.
+
ca_keys= [KEYS] This parameter identifies a specific key(s) on
the system trusted keyring to be used for certificate
trust validation.
@@ -473,7 +478,7 @@
others).
ccw_timeout_log [S390]
- See Documentation/s390/CommonIO for details.
+ See Documentation/s390/common_io.rst for details.
cgroup_disable= [KNL] Disable a particular controller
Format: {name of the controller(s) to disable}
@@ -486,10 +491,14 @@
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
- cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1
- Format: { controller[,controller...] | "all" }
+ cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
+ Format: { { controller | "all" | "named" }
+ [,{ controller | "all" | "named" }...] }
Like cgroup_disable, but only applies to cgroup v1;
the blacklisted controllers remain available in cgroup2.
+ "all" blacklists all controllers and "named" disables
+ named mounts. Specifying both "all" and "named" disables
+ all v1 hierarchies.
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
@@ -507,7 +516,7 @@
/selinux/checkreqprot.
cio_ignore= [S390]
- See Documentation/s390/CommonIO for details.
+ See Documentation/s390/common_io.rst for details.
clk_ignore_unused
[CLK]
Prevents the clock framework from automatically gating
@@ -674,6 +683,9 @@
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
+ cpuidle.governor=
+ [CPU_IDLE] Name of the cpuidle governor to use.
+
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system
@@ -692,15 +704,18 @@
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
- is selected automatically. Check
- Documentation/kdump/kdump.txt for further details.
+ is selected automatically.
+ [KNL, x86_64] select a region under 4G first, and
+ fall back to reserve region above 4G when '@offset'
+ hasn't been specified.
+ See Documentation/admin-guide/kdump/kdump.rst for further details.
crashkernel=range1:size1[,range2:size2,...][@offset]
[KNL] Same as above, but depends on the memory
in the running system. The syntax of range is
start-[end] where start and end are both
a memory unit (amount[KMG]). See also
- Documentation/kdump/kdump.txt for an example.
+ Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
[KNL, x86_64] range could be above 4G. Allow kernel
@@ -790,12 +805,12 @@
tracking down these problems.
debug_pagealloc=
- [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
- parameter enables the feature at boot time. In
- default, it is disabled. We can avoid allocating huge
- chunk of memory for debug pagealloc if we don't enable
- it at boot time and the system will work mostly same
- with the kernel built without CONFIG_DEBUG_PAGEALLOC.
+ [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
+ enables the feature at boot time. By default, it is
+ disabled and the system will work mostly the same as a
+ kernel built without CONFIG_DEBUG_PAGEALLOC.
+ Note: to get the most out of debug_pagealloc error reports,
+ it's useful to also enable the page_owner functionality.
on: enable the feature
debugpat [X86] Enable PAT debugging
@@ -847,6 +862,10 @@
disable_radix [PPC]
Disable RADIX MMU mode on POWER9
+ disable_tlbie [PPC]
+ Disable TLBIE instruction. Currently does not work
+ with KVM, with HASH MMU, or with coherent accelerators.
+
disable_cpu_apicid= [X86,APIC,SMP]
Format: <int>
The number of initial APIC ID for the
@@ -856,6 +875,12 @@
causing system reset or hang due to sending
INIT from AP to BSP.
+ perf_v4_pmi= [X86,INTEL]
+ Format: <bool>
+ Disable Intel PMU counter freezing feature.
+ The feature only exists starting from
+ Arch Perfmon v4 (Skylake and newer).
+
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this if
to workaround buggy firmware.
@@ -897,6 +922,10 @@
The filter can be disabled or changed to another
driver later using sysfs.
+ driver_async_probe= [KNL]
+ List of driver names to be probed asynchronously.
+ Format: <driver_name1>,<driver_name2>...
+
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
Broken monitors, graphic adapters, KVMs and EDIDless
panels may send no or incorrect EDID data sets.
@@ -907,7 +936,7 @@
edid/1680x1050.bin, or edid/1920x1080.bin is given
and no file with the same name exists. Details and
instructions how to build your own EDID data are
- available in Documentation/EDID/HOWTO.txt. An EDID
+ available in Documentation/driver-api/edid.rst. An EDID
data set will only be used for a particular connector,
if its name and a colon are prepended to the EDID
name. Each connector may use a unique EDID data
@@ -938,7 +967,7 @@
for details.
nompx [X86] Disables Intel Memory Protection Extensions.
- See Documentation/x86/intel_mpx.txt for more
+ See Documentation/x86/intel_mpx.rst for more
information about the feature.
nopku [X86] Disable Memory Protection Keys CPU feature found
@@ -1015,6 +1044,16 @@
specified address. The serial port must already be
setup and configured. Options are not yet supported.
+ rda,<addr>
+ Start an early, polled-mode console on a serial port
+ of an RDA Micro SoC, such as RDA8810PL, at the
+ specified address. The serial port must already be
+ setup and configured. Options are not yet supported.
+
+ sbi
+ Use RISC-V SBI (Supervisor Binary Interface) for early
+ console.
+
smh Use ARM semihosting calls for early console.
s3c2410,<addr>
@@ -1054,9 +1093,21 @@
specified address. The serial port must already be
setup and configured. Options are not yet supported.
+ efifb,[options]
+ Start an early, unaccelerated console on the EFI
+ memory mapped framebuffer (if available). On cache
+ coherent non-x86 systems that use system memory for
+ the framebuffer, pass the 'ram' option so that it is
+ mapped with the correct attributes.
+
+ linflex,<addr>
+ Use early console provided by Freescale LinFlex UART
+ serial driver for NXP S32V234 SoCs. A valid base
+ address must be provided, and the serial port must
+ already be setup and configured.
+
earlyprintk= [X86,SH,ARM,M68k,S390]
earlyprintk=vga
- earlyprintk=efi
earlyprintk=sclp
earlyprintk=xen
earlyprintk=serial[,ttySn[,baudrate]]
@@ -1152,7 +1203,7 @@
that is to be dynamically loaded by Linux. If there are
multiple variables with the same name but with different
vendor GUIDs, all of them will be loaded. See
- Documentation/acpi/ssdt-overlays.txt for details.
+ Documentation/admin-guide/acpi/ssdt-overlays.rst for details.
eisa_irq_edge= [PARISC,HW]
@@ -1162,16 +1213,11 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
- elevator= [IOSCHED]
- Format: {"cfq" | "deadline" | "noop"}
- See Documentation/block/cfq-iosched.txt and
- Documentation/block/deadline-iosched.txt for details.
-
elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
kexec loader will pass this option to capture kernel.
- See Documentation/kdump/kdump.txt for details.
+ See Documentation/admin-guide/kdump/kdump.rst for details.
enable_mtrr_cleanup [X86]
The kernel tries to adjust MTRR layout from continuous
@@ -1213,7 +1259,7 @@
See also Documentation/fault-injection/.
floppy= [HW]
- See Documentation/blockdev/floppy.txt.
+ See Documentation/admin-guide/blockdev/floppy.rst.
force_pal_cache_flush
[IA-64] Avoid check_sal_cache_flush which may hang on
@@ -1350,9 +1396,6 @@
Valid parameters: "on", "off"
Default: "on"
- hisax= [HW,ISDN]
- See Documentation/isdn/README.HiSax.
-
hlt [BUGS=ARM,SH]
hpet= [X86-32,HPET] option to control HPET usage
@@ -1389,6 +1432,11 @@
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
+
+ hv_nopvspin [X86,HYPER_V] Disables the paravirt spinlock optimizations
+ which allow the hypervisor to 'idle' the
+ guest on lock contention.
+
keep_bootcon [KNL]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
@@ -1464,7 +1512,7 @@
Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
.vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
.cdrom .chs .ignore_cable are additional options
- See Documentation/ide/ide.txt.
+ See Documentation/ide/ide.rst.
ide-generic.probe-mask= [HW] (E)IDE subsystem
Format: <int>
@@ -1545,7 +1593,7 @@
Format: { "off" | "enforce" | "fix" | "log" }
default: "enforce"
- ima_appraise_tcb [IMA]
+ ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead.
The builtin appraise policy appraises all files
owned by uid=0.
@@ -1572,8 +1620,7 @@
uid=0.
The "appraise_tcb" policy appraises the integrity of
- all files owned by root. (This is the equivalent
- of ima_appraise_tcb.)
+ all files owned by root.
The "secure_boot" policy appraises the integrity
of files (eg. kexec kernel image, kernel modules,
@@ -1631,6 +1678,15 @@
initrd= [BOOT] Specify the location of the initial ramdisk
+ init_on_alloc= [MM] Fill newly allocated pages and heap objects with
+ zeroes.
+ Format: 0 | 1
+ Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
+
+ init_on_free= [MM] Fill freed pages and heap objects with zeroes.
+ Format: 0 | 1
+ Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
+
init_pkru= [x86] Specify the default memory protection keys rights
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
@@ -1672,12 +1728,11 @@
By default, super page will be supported if Intel IOMMU
has the capability. With this option, super page will
not be supported.
- ecs_off [Default Off]
- By default, extended context tables will be supported if
- the hardware advertises that it has support both for the
- extended tables themselves, and also PASID support. With
- this option set, extended tables will not be used even
- on hardware which claims to support them.
+ sm_on [Default Off]
+ By default, scalable mode will be disabled even if the
+ hardware advertises that it has support for the scalable
+ mode translation. With this option set, scalable mode
+ will be used on hardware which claims to support it.
tboot_noforce [Default Off]
Do not force the Intel IOMMU enabled under tboot.
By default, tboot will force Intel IOMMU on, which
@@ -1687,6 +1742,11 @@
Note that using this option lowers the security
provided by tboot because it makes the system
vulnerable to DMA attacks.
+ nobounce [Default off]
+ Disable bounce buffer for untrusted devices such as
+ the Thunderbolt devices. This will treat the untrusted
+ devices as the trusted ones, hence might expose security
+ risks of DMA attacks.
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and fall back on acpi_idle.
@@ -1753,12 +1813,24 @@
nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
+ iommu.strict= [ARM64] Configure TLB invalidation behaviour
+ Format: { "0" | "1" }
+ 0 - Lazy mode.
+ Request that DMA unmap operations use deferred
+ invalidation of hardware TLBs, for increased
+ throughput at the cost of reduced device isolation.
+ Will fall back to strict mode if not supported by
+ the relevant IOMMU driver.
+ 1 - Strict mode (default).
+ DMA unmap operations invalidate IOMMU hardware TLBs
+ synchronously.
+
iommu.passthrough=
- [ARM64] Configure DMA to bypass the IOMMU by default.
+ [ARM64, X86] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
- unset - Use IOMMU translation for DMA.
+ unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
io7= [HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
@@ -1777,6 +1849,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
+ ipcmni_extend [KNL] Extend the maximum number of unique System V
+ IPC identifiers from 32,768 to 16,777,216.
+
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
@@ -1795,6 +1870,11 @@
to let secondary kernels in charge of setting up
LPIs.
+ irqchip.gicv3_pseudo_nmi= [ARM64]
+ Enables support for pseudo-NMIs in the kernel. This
+ requires the kernel to be built with
+ CONFIG_ARM64_PSEUDO_NMI.
+
irqfixup [HW]
When an interrupt is not handled search all handlers
for it. Intended to get systems with badly broken
@@ -1946,6 +2026,25 @@
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
the default is off.
+ kprobe_event=[probe-list]
+ [FTRACE] Add kprobe events and enable at boot time.
+ The probe-list is a semicolon delimited list of probe
+ definitions. Each definition is the same as in the
+ kprobe_events interface, but the parameters are comma
+ delimited. For example, to add a kprobe event on vfs_read
+ with arg1 and arg2, add to the command line:
+
+ kprobe_event=p,vfs_read,$arg1,$arg2
+
+ See also Documentation/trace/kprobetrace.rst "Kernel
+ Boot Parameter" section.
+
+ kpti= [ARM64] Control page table isolation of user
+ and kernel address spaces.
+ Default: enabled on cores which need mitigation.
+ 0: force disabled
+ 1: force enabled
+
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
@@ -1956,6 +2055,25 @@
KVM MMU at runtime.
Default is 0 (off)
+ kvm.nx_huge_pages=
+ [KVM] Controls the software workaround for the
+ X86_BUG_ITLB_MULTIHIT bug.
+ force : Always deploy workaround.
+ off : Never deploy workaround.
+ auto : Deploy workaround based on the presence of
+ X86_BUG_ITLB_MULTIHIT.
+
+ Default is 'auto'.
+
+ If the software workaround is enabled for the host,
+ guests do not need to enable it for nested guests.
+
+ kvm.nx_huge_pages_recovery_ratio=
+ [KVM] Controls how many 4KiB pages are periodically zapped
+ back to huge pages. 0 disables the recovery, otherwise if
+ the value is N KVM will zap 1/Nth of the 4KiB pages every
+ minute. The default is 60.
+
kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
Default is 1 (enabled)
@@ -2073,10 +2191,13 @@
off
Disables hypervisor mitigations and doesn't
emit any warnings.
+ It also drops the swap size and available
+ RAM limit restriction on both hypervisor and
+ bare metal.
Default is 'flush'.
- For details see: Documentation/admin-guide/l1tf.rst
+ For details see: Documentation/admin-guide/hw-vuln/l1tf.rst
l2cr= [PPC]
@@ -2160,7 +2281,7 @@
memblock=debug [KNL] Enable memblock debug messages.
load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -2174,6 +2295,15 @@
lockd.nlm_udpport=M [NFS] Assign UDP port.
Format: <integer>
+ lockdown= [SECURITY]
+ { integrity | confidentiality }
+ Enable the kernel lockdown feature. If set to
+ integrity, kernel features that allow userland to
+ modify the running kernel are disabled. If set to
+ confidentiality, kernel features that allow userland
+ to extract confidential information from the kernel
+ are also disabled.
+
locktorture.nreaders_stress= [KNL]
Set the number of locking read-acquisition kthreads.
Defaults to being automatically set based on the
@@ -2278,9 +2408,15 @@
ltpc= [NET]
Format: <io>,<irq>,<dma>
+ lsm.debug [SECURITY] Enable LSM initialization debugging output.
+
+ lsm=lsm1,...,lsmN
+ [SECURITY] Choose order of LSM initialization. This
+ overrides CONFIG_LSM, and the "security=" parameter.
+
machvec= [IA-64] Force the use of a particular machine-vector
(machvec) in a generic kernel.
- Example: machvec=hpzx1_swiotlb
+ Example: machvec=hpzx1
machtype= [Loongson] Share the same kernel image file between different
yeeloong laptop.
@@ -2307,7 +2443,7 @@
mce [X86-32] Machine Check Exception
- mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt
+ mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
@@ -2316,6 +2452,38 @@
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
+ mds= [X86,INTEL]
+ Control mitigation for the Micro-architectural Data
+ Sampling (MDS) vulnerability.
+
+ Certain CPUs are vulnerable to an exploit against CPU
+ internal buffers which can forward information to a
+ disclosure gadget under certain conditions.
+
+ In vulnerable processors, the speculatively
+ forwarded data can be used in a cache side channel
+ attack, to access data to which the attacker does
+ not have direct access.
+
+ This parameter controls the MDS mitigation. The
+ options are:
+
+ full - Enable MDS mitigation on vulnerable CPUs
+ full,nosmt - Enable MDS mitigation and disable
+ SMT on vulnerable CPUs
+ off - Unconditionally disable MDS mitigation
+
+ On TAA-affected machines, mds=off can be prevented by
+ an active TAA mitigation, as both vulnerabilities are
+ mitigated with the same mechanism. To disable this
+ mitigation, you also need to specify
+ tsx_async_abort=off.
+
+ Not specifying this option is equivalent to
+ mds=full.
+
+ For details see: Documentation/admin-guide/hw-vuln/mds.rst
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used when the kernel is not able
to see the whole system memory or for test.
@@ -2337,7 +2505,7 @@
set according to the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option.
- See Documentation/memory-hotplug.txt.
+ See Documentation/admin-guide/mm/memory-hotplug.rst.
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
@@ -2408,7 +2576,7 @@
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
- memtest= [KNL,X86,ARM] Enable memtest
+ memtest= [KNL,X86,ARM,PPC] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
@@ -2426,7 +2594,7 @@
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
- Refer to Documentation/x86/amd-memory-encryption.txt
+ Refer to Documentation/virt/kvm/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@@ -2473,6 +2641,50 @@
in the "bleeding edge" mini2440 support kernel at
http://repo.or.cz/w/linux-2.6/mini2440.git
+ mitigations=
+ [X86,PPC,S390,ARM64] Control optional mitigations for
+ CPU vulnerabilities. This is a set of curated,
+ arch-independent options, each of which is an
+ aggregation of existing arch-specific options.
+
+ off
+ Disable all optional CPU mitigations. This
+ improves system performance, but it may also
+ expose users to several CPU vulnerabilities.
+ Equivalent to: nopti [X86,PPC]
+ kpti=0 [ARM64]
+ nospectre_v1 [X86,PPC]
+ nobp=0 [S390]
+ nospectre_v2 [X86,PPC,S390,ARM64]
+ spectre_v2_user=off [X86]
+ spec_store_bypass_disable=off [X86,PPC]
+ ssbd=force-off [ARM64]
+ l1tf=off [X86]
+ mds=off [X86]
+ tsx_async_abort=off [X86]
+ kvm.nx_huge_pages=off [X86]
+
+ Exceptions:
+ This does not have any effect on
+ kvm.nx_huge_pages when
+ kvm.nx_huge_pages=force.
+
+ auto (default)
+ Mitigate all CPU vulnerabilities, but leave SMT
+ enabled, even if it's vulnerable. This is for
+ users who don't want to be surprised by SMT
+ getting disabled across kernel upgrades, or who
+ have other ways of avoiding SMT-based attacks.
+ Equivalent to: (default behavior)
+
+ auto,nosmt
+ Mitigate all CPU vulnerabilities, disabling SMT
+ if needed. This is for users who always want to
+ be fully mitigated, even if it means losing SMT.
+ Equivalent to: l1tf=flush,nosmt [X86]
+ mds=full,nosmt [X86]
+ tsx_async_abort=full,nosmt [X86]
+
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
@@ -2698,8 +2910,9 @@
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
When panic is specified, panic when an NMI watchdog
- timeout occurs (or 'nopanic' to override the opposite
- default). To disable both hard and soft lockup detectors,
+ timeout occurs (or 'nopanic' to not panic on an NMI
+ watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set).
+ To disable both hard and soft lockup detectors,
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
@@ -2734,6 +2947,17 @@
/sys/module/printk/parameters/console_suspend) to
turn on/off it dynamically.
+ novmcoredd [KNL,KDUMP]
+ Disable device dump. Device dump allows drivers to
+ append dump data to vmcore so you can collect driver
+ specified debug info. Drivers can append the data
+ without any limit and this data is stored in memory,
+ so this may cause significant memory stress. Disabling
+ device dump can help save memory but the driver debug
+ data will no longer be available. This parameter
+ is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
+ is set.
+
noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
caches in the slab allocator. Saves per-node memory,
but will impact performance.
@@ -2768,11 +2992,11 @@
noexec=on: enable non-executable mappings (default)
noexec=off: disable non-executable mappings
- nosmap [X86]
+ nosmap [X86,PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
- nosmep [X86]
+ nosmep [X86,PPC]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
@@ -2789,7 +3013,7 @@
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
- nohugeiomap [KNL,x86] Disable kernel huge I/O mappings.
+ nohugeiomap [KNL,x86,PPC] Disable kernel huge I/O mappings.
nosmt [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
@@ -2798,14 +3022,14 @@
nosmt=force: Force disable SMT, cannot be undone
via the sysfs control file.
- nospectre_v1 [PPC] Disable mitigations for Spectre Variant 1 (bounds
- check bypass). With this option data leaks are possible
- in the system.
+ nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
+ (bounds check bypass). With this option data leaks are
+ possible in the system.
- nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
- (indirect branch prediction) vulnerability. System may
- allow data leaks with this option, which is equivalent
- to spectre_v2=off.
+ nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
+ the Spectre variant 2 (indirect branch prediction)
+ vulnerability. System may allow data leaks with this
+ option.
nospec_store_bypass_disable
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
@@ -3001,7 +3225,7 @@
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
- See Documentation/sysctl/vm.txt for details.
+ See Documentation/admin-guide/sysctl/vm.rst for details.
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
See Documentation/debugging-via-ohci1394.txt for more
@@ -3039,6 +3263,16 @@
This will also cause panics on machine check exceptions.
Useful together with panic=30 to trigger a reboot.
+ page_alloc.shuffle=
+ [KNL] Boolean flag to control whether the page allocator
+ should randomize its free lists. The randomization may
+ be automatically enabled if the kernel detects it is
+ running on a platform with a direct-mapped memory-side
+ cache, and this parameter can be used to
+ override/disable that behavior. The state of the flag
+ can be read from sysfs at:
+ /sys/module/page_alloc/parameters/shuffle.
+
page_owner= [KNL] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled in default. With this switch,
@@ -3057,6 +3291,15 @@
timeout < 0: reboot immediately
Format: <timeout>
+ panic_print= Bitmask for printing system info when panic happens.
+ User can choose a combination of the following bits:
+ bit 0: print all tasks info
+ bit 1: print system memory info
+ bit 2: print timer info
+ bit 3: print locks info if CONFIG_LOCKDEP is on
+ bit 4: print ftrace buffer
+ bit 5: print all printk messages in buffer
+
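Because panic_print is a bitmask, the value to pass is the sum of the selected bits. A minimal sketch of that arithmetic follows; the short labels ("tasks", "memory", and so on) are hypothetical names chosen for this example, and only the bit positions come from the list above.

```python
# Bit positions from the panic_print list; the label names are illustrative.
PANIC_PRINT_BITS = {
    "tasks": 0,    # print all tasks info
    "memory": 1,   # print system memory info
    "timers": 2,   # print timer info
    "locks": 3,    # print locks info if CONFIG_LOCKDEP is on
    "ftrace": 4,   # print ftrace buffer
    "printk": 5,   # print all printk messages in buffer
}


def panic_print_mask(*names):
    """Combine the selected labels into a single integer bitmask."""
    mask = 0
    for name in names:
        mask |= 1 << PANIC_PRINT_BITS[name]
    return mask


def panic_print_arg(*names):
    """Format the bitmask as a decimal command-line argument."""
    return f"panic_print={panic_print_mask(*names)}"
```

Selecting tasks info and memory info sets bits 0 and 1, which gives the argument panic_print=3.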
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
@@ -3106,7 +3349,7 @@
pcd. [PARIDE]
See header of drivers/block/paride/pcd.c.
- See also Documentation/blockdev/paride.txt.
+ See also Documentation/admin-guide/blockdev/paride.rst.
pci=option[,option...] [PCI] various PCI subsystem options.
@@ -3266,12 +3509,13 @@
specify the device is described above.
If <order of align> is not specified,
PAGE_SIZE is used as alignment.
- PCI-PCI bridge can be specified, if resource
+ A PCI-PCI bridge can be specified if resource
windows need to be expanded.
To specify the alignment for several
instances of a device, the PCI vendor,
device, subvendor, and subdevice may be
- specified, e.g., 4096@pci:8086:9c22:103c:198f
+ specified, e.g., 12@pci:8086:9c22:103c:198f
+ for 4096-byte alignment.
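The change from 4096 to 12 in the example reflects that the value before "@" is an order (a power of two), not a byte count. A small sketch of the conversion, with hypothetical helper names:

```python
def alignment_bytes(order):
    """Byte alignment implied by an <order of align> value (2**order)."""
    return 1 << order


def alignment_specifier(order, vendor, device, subvendor, subdevice):
    """Build the <order>@pci:<vendor>:<device>:<subvendor>:<subdevice> string."""
    ids = (vendor, device, subvendor, subdevice)
    return f"{order}@pci:" + ":".join(f"{i:04x}" for i in ids)
```

With order 12 this yields 4096 bytes, matching the example above.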
ecrc= Enable/disable PCIe ECRC (transaction layer
end-to-end CRC checking).
bios: Use BIOS/firmware settings. This is the
@@ -3315,6 +3559,8 @@
bridges without forcing it upstream. Note:
this removes isolation between devices and
may put more devices in an IOMMU group.
+ force_floating [S390] Force usage of floating interrupts.
+ nomio [S390] Do not use MIO instructions.
pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
@@ -3348,7 +3594,7 @@
needed on a platform with proper driver support.
pd. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/admin-guide/blockdev/paride.rst.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
@@ -3363,13 +3609,13 @@
and performance comparison.
pf. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/admin-guide/blockdev/paride.rst.
pg. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/admin-guide/blockdev/paride.rst.
pirq= [SMP,APIC] Manual mp-table setup
- See Documentation/x86/i386/IO-APIC.txt.
+ See Documentation/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
@@ -3478,7 +3724,11 @@
prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
before loading.
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
+
+ psi= [KNL] Enable or disable pressure stall information
+ tracking.
+ Format: <bool>
psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to
probe for; one of (bare|imps|exps|lifebook|any).
@@ -3496,7 +3746,7 @@
pstore.backend= Specify the name of the pstore backend to use
pt. [PARIDE]
- See Documentation/blockdev/paride.txt.
+ See Documentation/admin-guide/blockdev/paride.rst.
pti= [X86_64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
@@ -3525,7 +3775,7 @@
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
- See Documentation/blockdev/ramdisk.txt.
+ See Documentation/admin-guide/blockdev/ramdisk.rst.
random.trust_cpu={on,off}
[KNL] Enable or disable trusting the use of the
@@ -3540,18 +3790,20 @@
see CONFIG_RAS_CEC help text.
rcu_nocbs= [KNL]
- The argument is a cpu list, as described above.
+ The argument is a cpu list, as described above,
+ except that the string "all" can be used to
+ specify every CPU on the system.
In kernels built with CONFIG_RCU_NOCB_CPU=y, set
the specified list of CPUs to be no-callback CPUs.
- Invocation of these CPUs' RCU callbacks will
- be offloaded to "rcuox/N" kthreads created for
- that purpose, where "x" is "b" for RCU-bh, "p"
- for RCU-preempt, and "s" for RCU-sched, and "N"
- is the CPU number. This reduces OS jitter on the
- offloaded CPUs, which can be useful for HPC and
- real-time workloads. It can also improve energy
- efficiency for asymmetric multiprocessors.
+ Invocation of these CPUs' RCU callbacks will be
+ offloaded to "rcuox/N" kthreads created for that
+ purpose, where "x" is "p" for RCU-preempt, and
+ "s" for RCU-sched, and "N" is the CPU number.
+ This reduces OS jitter on the offloaded CPUs,
+ which can be useful for HPC and real-time
+ workloads. It can also improve energy efficiency
+ for asymmetric multiprocessors.
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
@@ -3587,6 +3839,12 @@
the propagation of recent CPU-hotplug changes up
the rcu_node combining tree.
+ rcutree.use_softirq= [KNL]
+ If set to zero, move all RCU_SOFTIRQ processing to
+ per-CPU rcuc kthreads. Defaults to a non-zero
+ value, meaning that RCU_SOFTIRQ is used by default.
+ Specify rcutree.use_softirq=0 to use rcuc kthreads.
+
rcutree.rcu_fanout_exact= [KNL]
Disable autobalancing of the rcu_node combining
tree. This is used by rcutorture, and might
@@ -3601,12 +3859,6 @@
latencies, which will choose a value aligned
with the appropriate hardware boundaries.
- rcutree.jiffies_till_sched_qs= [KNL]
- Set required age in jiffies for a
- given grace period before RCU starts
- soliciting quiescent-state help from
- rcu_note_context_switch().
-
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
@@ -3618,6 +3870,20 @@
quiescent states. Units are jiffies, minimum
value is one, and maximum value is HZ.
+ rcutree.jiffies_till_sched_qs= [KNL]
+ Set required age in jiffies for a
+ given grace period before RCU starts
+ soliciting quiescent-state help from
+ rcu_note_context_switch() and cond_resched().
+ If not specified, the kernel will calculate
+ a value based on the most recent settings
+ of rcutree.jiffies_till_first_fqs
+ and rcutree.jiffies_till_next_fqs.
+ This calculated value may be viewed in
+ rcutree.jiffies_to_sched_qs. Any attempt to set
+ rcutree.jiffies_to_sched_qs will be cheerfully
+ overwritten.
+
rcutree.kthread_prio= [KNL,BOOT]
Set the SCHED_FIFO priority of the RCU per-CPU
kthreads (rcuc/N). This value is also used for
@@ -3629,12 +3895,13 @@
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
- rcutree.rcu_nocb_leader_stride= [KNL]
- Set the number of NOCB kthread groups, which
- defaults to the square root of the number of
- CPUs. Larger numbers reduces the wakeup overhead
- on the per-CPU grace-period kthreads, but increases
- that same overhead on each group's leader.
+ rcutree.rcu_nocb_gp_stride= [KNL]
+ Set the number of NOCB callback kthreads in
+ each group, which defaults to the square root
+ of the number of CPUs. Larger numbers reduce
+ the wakeup overhead on the global grace-period
+ kthread, but increase that same overhead on
+ each group's NOCB grace-period kthread.
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
@@ -3661,6 +3928,11 @@
This wake_up() will be accompanied by a
WARN_ONCE() splat and an ftrace_dump().
+ rcutree.sysrq_rcu= [KNL]
+ Commandeer a sysrq key to dump out Tree RCU's
+ rcu_node tree with an eye towards determining
+ why a new grace period has not yet started.
+
rcuperf.gp_async= [KNL]
Measure performance of asynchronous
grace-period primitives such as call_rcu().
@@ -3712,24 +3984,6 @@
in microseconds. The default of zero says
no holdoff.
- rcutorture.cbflood_inter_holdoff= [KNL]
- Set holdoff time (jiffies) between successive
- callback-flood tests.
-
- rcutorture.cbflood_intra_holdoff= [KNL]
- Set holdoff time (jiffies) between successive
- bursts of callbacks within a given callback-flood
- test.
-
- rcutorture.cbflood_n_burst= [KNL]
- Set the number of bursts making up a given
- callback-flood test. Set this to zero to
- disable callback-flood testing.
-
- rcutorture.cbflood_n_per_burst= [KNL]
- Set the number of callbacks to be registered
- in a given burst of a callback-flood test.
-
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
@@ -3742,6 +3996,23 @@
Set wait time between force_quiescent_state bursts
in seconds.
+ rcutorture.fwd_progress= [KNL]
+ Enable RCU grace-period forward-progress testing
+ for the types of RCU supporting this notion.
+
+ rcutorture.fwd_progress_div= [KNL]
+ Specify the fraction of a CPU-stall-warning
+ period to do tight-loop forward-progress testing.
+
+ rcutorture.fwd_progress_holdoff= [KNL]
+ Number of seconds to wait between successive
+ forward-progress tests.
+
+ rcutorture.fwd_progress_need_resched= [KNL]
+ Enclose cond_resched() calls within checks for
+ need_resched() during tight-loop forward-progress
+ testing.
+
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
primitives, if available.
@@ -3835,6 +4106,10 @@
rcutorture.verbose= [KNL]
Enable additional printk() statements.
+ rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
+ Dump ftrace buffer after reporting RCU CPU
+ stall warning.
+
rcupdate.rcu_cpu_stall_suppress= [KNL]
Suppress RCU CPU stall warning messages.
@@ -3873,17 +4148,18 @@
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
- rcupdate.rcu_self_test_bh= [KNL]
- Run the RCU bh early boot self tests
-
- rcupdate.rcu_self_test_sched= [KNL]
- Run the RCU sched early boot self tests
-
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
+ rdrand= [X86]
+ force - Override the decision by the kernel to hide the
+ advertisement of RDRAND support (this affects
+ certain AMD processors because of buggy BIOS
+ support, specifically around the suspend/resume
+ path).
+
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
@@ -3897,7 +4173,9 @@
[[,]s[mp]#### \
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
[[,]f[orce]
- Where reboot_mode is one of warm (soft) or cold (hard) or gpio,
+ Where reboot_mode is one of warm (soft) or cold (hard) or gpio
+ (prefix with 'panic_' to set mode for panic
+ reboot only),
reboot_type is one of bios, acpi, kbd, triple, efi, or pci,
reboot_force is either force or not specified,
reboot_cpu is s[mp]#### with #### being the processor
@@ -3905,7 +4183,7 @@
relax_domain_level=
[KNL, SMP] Set scheduler's default relax_domain_level.
- See Documentation/cgroup-v1/cpusets.txt.
+ See Documentation/admin-guide/cgroup-v1/cpusets.rst.
reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -3935,7 +4213,7 @@
Specify the offset from the beginning of the partition
given by "resume=" at which the swap header is located,
in <PAGE_SIZE> units (needed only for swap files).
- See Documentation/power/swsusp-and-swap-files.txt
+ See Documentation/power/swsusp-and-swap-files.rst
resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to
read the resume files
@@ -4037,11 +4315,9 @@
Note: increases power consumption, thus should only be
enabled if running jitter sensitive (HPC/RT) workloads.
- security= [SECURITY] Choose a security module to enable at boot.
- If this boot parameter is not specified, only the first
- security module asking for security registration will be
- loaded. An invalid security module name will be treated
- as if no module has been chosen.
+ security= [SECURITY] Choose a legacy "major" security module to
+ enable at boot. This has been deprecated by the
+ "lsm=" parameter.
selinux= [SELINUX] Disable or enable SELinux at boot time.
Format: { "0" | "1" }
@@ -4165,7 +4441,7 @@
Format: <integer>
sonypi.*= [HW] Sony Programmable I/O Control Device driver
- See Documentation/laptops/sonypi.txt
+ See Documentation/admin-guide/laptops/sonypi.rst
spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
@@ -4414,10 +4690,15 @@
/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
is set. Default value is 5.
+ svm= [PPC]
+ Format: { on | off | y | n | 1 | 0 }
+ This parameter controls use of the Protected
+ Execution Facility on pSeries.
+
swapaccount=[0|1]
[KNL] Enable accounting of swap in memory resource
controller if no parameter or 1 is given or disable
- it if 0 is given (See Documentation/cgroup-v1/memory.txt)
+ it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> | force | noforce }
@@ -4492,27 +4773,6 @@
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
- tmem [KNL,XEN]
- Enable the Transcendent memory driver if built-in.
-
- tmem.cleancache=0|1 [KNL, XEN]
- Default is on (1). Disable the usage of the cleancache
- API to send anonymous pages to the hypervisor.
-
- tmem.frontswap=0|1 [KNL, XEN]
- Default is on (1). Disable the usage of the frontswap
- API to send swap pages to the hypervisor. If disabled
- the selfballooning and selfshrinking are force disabled.
-
- tmem.selfballooning=0|1 [KNL, XEN]
- Default is on (1). Disable the driving of swap pages
- to the hypervisor.
-
- tmem.selfshrinking=0|1 [KNL, XEN]
- Default is on (1). Partial swapoff that immediately
- transfers pages from Xen hypervisor back to the
- kernel based on different criteria.
-
topology= [S390]
Format: {off | on}
Specify if the kernel should make use of the cpu
@@ -4616,6 +4876,80 @@
[x86] unstable: mark the TSC clocksource as unstable, this
marks the TSC unconditionally unstable at bootup and
avoids any further wobbles once the TSC watchdog notices.
+ [x86] nowatchdog: disable clocksource watchdog. Used
+ in situations with strict latency requirements (where
+ interruptions from clocksource watchdog are not
+ acceptable).
+
+ tsx= [X86] Control Transactional Synchronization
+ Extensions (TSX) feature in Intel processors that
+ support TSX control.
+
+ This parameter controls the TSX feature. The options are:
+
+ on - Enable TSX on the system. Although there are
+ mitigations for all known security vulnerabilities,
+ TSX has been known to be an accelerator for
+ several previous speculation-related CVEs, and
+ so there may be unknown security risks associated
+ with leaving it enabled.
+
+ off - Disable TSX on the system. (Note that this
+ option takes effect only on newer CPUs which are
+ not vulnerable to MDS, i.e., have
+ MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get
+ the new IA32_TSX_CTRL MSR through a microcode
+ update. This new MSR allows for the reliable
+ deactivation of the TSX functionality.)
+
+ auto - Disable TSX if X86_BUG_TAA is present,
+ otherwise enable TSX on the system.
+
+ Not specifying this option is equivalent to tsx=off.
+
+ See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
+ for more details.
+
+ tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
+ Abort (TAA) vulnerability.
+
+ Similar to Micro-architectural Data Sampling (MDS),
+ certain CPUs that support Transactional
+ Synchronization Extensions (TSX) are vulnerable to an
+ exploit against CPU internal buffers which can forward
+ information to a disclosure gadget under certain
+ conditions.
+
+ In vulnerable processors, the speculatively forwarded
+ data can be used in a cache side channel attack, to
+ access data to which the attacker does not have direct
+ access.
+
+ This parameter controls the TAA mitigation. The
+ options are:
+
+ full - Enable TAA mitigation on vulnerable CPUs
+ if TSX is enabled.
+
+ full,nosmt - Enable TAA mitigation and disable SMT on
+ vulnerable CPUs. If TSX is disabled, SMT
+ is not disabled because CPU is not
+ vulnerable to cross-thread TAA attacks.
+ off - Unconditionally disable TAA mitigation
+
+ On MDS-affected machines, tsx_async_abort=off can be
+ prevented by an active MDS mitigation, because both
+ vulnerabilities are mitigated with the same mechanism. To
+ disable this mitigation, you need to specify mds=off too.
+
+ Not specifying this option is equivalent to
+ tsx_async_abort=full. On CPUs which are MDS affected
+ and deploy MDS mitigation, TAA mitigation is not
+ required and doesn't provide any additional
+ mitigation.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
@@ -4645,7 +4979,8 @@
usbcore.authorized_default=
[USB] Default USB device authorization:
(default -1 = authorized except for wireless USB,
- 0 = not authorized, 1 = authorized)
+ 0 = not authorized, 1 = authorized, 2 = authorized
+ if device connected to internal port)
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
@@ -4666,7 +5001,8 @@
usbcore.old_scheme_first=
[USB] Start with the old device initialization
- scheme (default 0 = off).
+ scheme, applies only to low and full-speed devices
+ (default 0 = off).
usbcore.usbfs_memory_mb=
[USB] Memory limit (in MB) for buffers allocated by
@@ -4849,7 +5185,7 @@
vector=percpu: enable percpu vector domain
video= [FB] Frame buffer configuration
- See Documentation/fb/modedb.txt.
+ See Documentation/fb/modedb.rst.
video.brightness_switch_enabled= [0,1]
If set to 1, on receiving an ACPI notify event
@@ -4877,12 +5213,24 @@
Can be used multiple times for multiple devices.
vga= [BOOT,X86-32] Select a particular video mode
- See Documentation/x86/boot.txt and
- Documentation/svga.txt.
+ See Documentation/x86/boot.rst and
+ Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
passed to the kernel using a special protocol.
+ vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y.
+ May slow down system boot speed, especially when
+ enabled on systems with a large amount of memory.
+ All options are enabled by default, and this
+ interface is meant to allow for selectively
+ enabling or disabling specific virtual memory
+ debugging features.
+
+ Available options are:
+ P Enable page structure init time poisoning
+ - Disable all of the above options
+
vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
size of <nn>. This can be used to increase the
minimum size (128MB on x86). It can also be used to
@@ -4911,13 +5259,12 @@
targets for exploits that can control RIP.
emulate [default] Vsyscalls turn into traps and are
- emulated reasonably safely.
+ emulated reasonably safely. The vsyscall
+ page is readable.
- native Vsyscalls are native syscall instructions.
- This is a little bit faster than trapping
- and makes a few dynamic recompilers work
- better than they would in emulation mode.
- It also makes exploits much easier to write.
+ xonly Vsyscalls turn into traps and are
+ emulated reasonably safely. The vsyscall
+ page is not readable.
none Vsyscalls don't work at all. This makes
them quite hard to use for exploits but
@@ -4973,10 +5320,18 @@
Default: 3 = cyan.
watchdog timers [HW,WDT] For information on watchdog timers,
- see Documentation/watchdog/watchdog-parameters.txt
+ see Documentation/watchdog/watchdog-parameters.rst
or other driver-specific files in the
Documentation/watchdog/ directory.
+ watchdog_thresh=
+ [KNL]
+ Set the hard lockup detector stall duration
+ threshold in seconds. The soft lockup detector
+ threshold is set to twice the value. A value of 0
+ disables both lockup detectors. Default is 10
+ seconds.
+
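The relationship between the two detector thresholds can be sketched as follows. This only restates the description above (soft threshold is twice the hard one, zero disables both); the function name is hypothetical and the kernel's internal handling may differ in detail.

```python
def lockup_thresholds(watchdog_thresh=10):
    """Return the hard and soft lockup thresholds in seconds.

    Returns None when watchdog_thresh is 0, which disables both detectors.
    """
    if watchdog_thresh < 0:
        raise ValueError("watchdog_thresh must not be negative")
    if watchdog_thresh == 0:
        return None
    return {"hard": watchdog_thresh, "soft": 2 * watchdog_thresh}
```

With the default of 10 seconds, the soft lockup detector threshold works out to 20 seconds.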
workqueue.watchdog_thresh=
If CONFIG_WQ_WATCHDOG is configured, workqueue can
warn stall conditions and dump internal state to
@@ -5050,6 +5405,10 @@
the unplug protocol
never -- do not unplug even if version check succeeds
+ xen_legacy_crash [X86,XEN]
+ Crash from Xen panic notifier, without executing late
+ panic() code such as dumping handler.
+
xen_nopvspin [X86,XEN]
Disables the ticketlock slowpath using Xen PV
optimizations.
@@ -5057,6 +5416,8 @@
xen_nopv [X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
+ This option is obsoleted by the "nopv" option, which
+ has an equivalent effect on the XEN platform.
xen_scrub_pages= [XEN]
Boolean option to control scrubbing pages before giving them back
@@ -5064,11 +5425,51 @@
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
+ xen_timer_slop= [X86-64,XEN]
+ Set the timer slop (in nanoseconds) for the virtual Xen
+ timers (default is 100000). This adjusts the minimum
+ delta of virtualized Xen timers, where lower values
+ improve timer resolution at the expense of processing
+ more timer interrupts.
+
+ nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
+ Disables the PV optimizations forcing the guest to run
+ as generic guest with no PV drivers. Currently supports
+ XEN HVM, KVM, HYPER_V and VMWARE guests.
+
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
+ xive= [PPC]
+ By default on POWER9 and above, the kernel will
+ natively use the XIVE interrupt controller. This option
+ allows the fallback firmware mode to be used:
+
+ off Fall back to firmware control of XIVE interrupt
+ controller on both pseries and powernv
+ platforms. Only useful on POWER9 and above.
+
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be
consulted in header drivers/usb/host/xhci.h.
+
+ xmon [PPC]
+ Format: { early | on | rw | ro | off }
+ Controls if xmon debugger is enabled. Default is off.
+ Passing only "xmon" is equivalent to "xmon=early".
+ early Call xmon as early as possible on boot; xmon
+ debugger is called from setup_arch().
+ on xmon debugger hooks will be installed so xmon
+ is only called on a kernel crash. Default mode,
+ i.e. either "ro" or "rw" mode, is controlled
+ with CONFIG_XMON_DEFAULT_RO_MODE.
+ rw xmon debugger hooks will be installed so xmon
+ is called only on a kernel crash, mode is write,
+ meaning SPR registers, memory, and other data
+ can be written using xmon commands.
+ ro same as "rw" option above but SPR registers,
+ memory, and other data can't be written using
+ xmon commands.
+ off xmon is disabled.
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
new file mode 100644
index 0000000..baeeba8
--- /dev/null
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -0,0 +1,354 @@
+==========================================
+Reducing OS jitter due to per-CPU kthreads
+==========================================
+
+This document lists per-CPU kthreads in the Linux kernel and presents
+options to control their OS jitter. Note that non-per-CPU kthreads are
+not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
+them to a "housekeeping" CPU dedicated to such work.
+
+References
+==========
+
+- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
+
+- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
+
+- man taskset: Using the taskset command to bind tasks to sets
+ of CPUs.
+
+- man sched_setaffinity: Using the sched_setaffinity() system
+ call to bind tasks to sets of CPUs.
+
+- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
+ writing "0" to offline and "1" to online.
+
+- In order to locate kernel-generated OS jitter on CPU N:
+
+ cd /sys/kernel/debug/tracing
+ echo 1 > max_graph_depth # Increase the "1" for more detail
+ echo function_graph > current_tracer
+ # run workload
+ cat per_cpu/cpuN/trace
+
+kthreads
+========
+
+Name:
+ ehca_comp/%u
+
+Purpose:
+ Periodically process Infiniband-related work.
+
+To reduce its OS jitter, do any of the following:
+
+1. Don't use eHCA Infiniband hardware, instead choosing hardware
+ that does not require per-CPU kthreads. This will prevent these
+ kthreads from being created in the first place. (This will
+ work for most people, as this hardware, though important, is
+ relatively old and is produced in relatively low unit volumes.)
+2. Do all eHCA-Infiniband-related work on other CPUs, including
+ interrupts.
+3. Rework the eHCA driver so that its per-CPU kthreads are
+ provisioned only on selected CPUs.
+
+
+Name:
+ irq/%d-%s
+
+Purpose:
+ Handle threaded interrupts.
+
+To reduce its OS jitter, do the following:
+
+1. Use irq affinity to force the irq threads to execute on
+ some other CPU.
+
+Name:
+ kcmtpd_ctr_%d
+
+Purpose:
+ Handle Bluetooth work.
+
+To reduce its OS jitter, do one of the following:
+
+1. Don't use Bluetooth, in which case these kthreads won't be
+ created in the first place.
+2. Use irq affinity to force Bluetooth-related interrupts to
+ occur on some other CPU and furthermore initiate all
+ Bluetooth activity on some other CPU.
+
+Name:
+ ksoftirqd/%u
+
+Purpose:
+ Execute softirq handlers when threaded or when under heavy load.
+
+To reduce its OS jitter, each softirq vector must be handled
+separately as follows:
+
+TIMER_SOFTIRQ
+-------------
+
+Do all of the following:
+
+1. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle, for example, by avoiding system calls and by forcing
+ both kernel threads and interrupts to execute elsewhere.
+2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
+ the CPU offline, then bring it back online. This forces
+ recurring timers to migrate elsewhere. If you are concerned
+ with multiple CPUs, force them all offline before bringing the
+ first one back online. Once you have onlined the CPUs in question,
+ do not offline any other CPUs, because doing so could force the
+ timer back onto one of the CPUs in question.
+
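The offline/online cycle in step 2 can be sketched with the /sys/devices/system/cpu/cpuN/online files listed under References. This is an illustration only: the function name and the sysfs_root parameter are assumptions made so the sketch can be exercised against a scratch directory, and on a real system it needs root privileges and CONFIG_HOTPLUG_CPU=y.

```python
from pathlib import Path


def cycle_cpus(cpus, sysfs_root="/sys/devices/system/cpu"):
    """Take every listed CPU offline, then bring them all back online.

    All CPUs go offline before the first one comes back, so that recurring
    timers cannot migrate onto another CPU in the list.
    """
    controls = [Path(sysfs_root) / f"cpu{n}" / "online" for n in cpus]
    for control in controls:
        control.write_text("0")  # "0" requests offline
    for control in controls:
        control.write_text("1")  # "1" requests online
    return controls
```

After this runs, avoid offlining any other CPUs, for the reason given in step 2.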
+NET_TX_SOFTIRQ and NET_RX_SOFTIRQ
+---------------------------------
+
+Do all of the following:
+
+1. Force networking interrupts onto other CPUs.
+2. Initiate any network I/O on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+
+BLOCK_SOFTIRQ
+-------------
+
+Do all of the following:
+
+1. Force block-device interrupts onto some other CPU.
+2. Initiate any block I/O on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+
+IRQ_POLL_SOFTIRQ
+----------------
+
+Do all of the following:
+
+1. Force block-device interrupts onto some other CPU.
+2. Initiate any block I/O and block-I/O polling on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+
+TASKLET_SOFTIRQ
+---------------
+
+Do one or more of the following:
+
+1. Avoid use of drivers that use tasklets. (Such drivers will contain
+ calls to things like tasklet_schedule().)
+2. Convert all drivers that you must use from tasklets to workqueues.
+3. Force interrupts for drivers using tasklets onto other CPUs,
+ and also do I/O involving these drivers on other CPUs.
+
+SCHED_SOFTIRQ
+-------------
+
+Do all of the following:
+
+1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
+ for example, ensure that at most one runnable kthread is present
+ on that CPU. If a thread that expects to run on the de-jittered
+ CPU awakens, the scheduler will send an IPI that can result in
+ a subsequent SCHED_SOFTIRQ.
+2. Build with CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered
+ is marked as an adaptive-ticks CPU using the "nohz_full="
+ boot parameter. This reduces the number of scheduler-clock
+ interrupts that the de-jittered CPU receives, minimizing its
+ chances of being selected to do the load balancing work that
+ runs in SCHED_SOFTIRQ context.
+3. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle, for example, by avoiding system calls and by
+ forcing both kernel threads and interrupts to execute elsewhere.
+ This further reduces the number of scheduler-clock interrupts
+ received by the de-jittered CPU.
+
+HRTIMER_SOFTIRQ
+---------------
+
+Do all of the following:
+
+1. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle. For example, avoid system calls and force both
+ kernel threads and interrupts to execute elsewhere.
+2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
+ CPU offline, then bring it back online. This forces recurring
+ timers to migrate elsewhere. If you are concerned with multiple
+ CPUs, force them all offline before bringing the first one
+ back online. Once you have onlined the CPUs in question, do not
+ offline any other CPUs, because doing so could force the timer
+ back onto one of the CPUs in question.
+
+RCU_SOFTIRQ
+-----------
+
+Do at least one of the following:
+
+1. Offload callbacks and keep the CPU in either dyntick-idle or
+ adaptive-ticks state by doing all of the following:
+
+ a. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be
+ de-jittered is marked as an adaptive-ticks CPU using the
+ "nohz_full=" boot parameter. Bind the rcuo kthreads to
+ housekeeping CPUs, which can tolerate OS jitter.
+ b. To the extent possible, keep the CPU out of the kernel
+ when it is non-idle, for example, by avoiding system
+ calls and by forcing both kernel threads and interrupts
+ to execute elsewhere.
+
+2. Enable RCU to do its processing remotely via dyntick-idle by
+ doing all of the following:
+
+ a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
+ b. Ensure that the CPU goes idle frequently, allowing other
+ CPUs to detect that it has passed through an RCU quiescent
+ state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
+ userspace execution also allows other CPUs to detect that
+ the CPU in question has passed through a quiescent state.
+ c. To the extent possible, keep the CPU out of the kernel
+ when it is non-idle, for example, by avoiding system
+ calls and by forcing both kernel threads and interrupts
+ to execute elsewhere.
+
+Name:
+ kworker/%u:%d%s (cpu, id, priority)
+
+Purpose:
+ Execute workqueue requests
+
+To reduce its OS jitter, do any of the following:
+
+1. Run your workload at a real-time priority, which will allow
+ preempting the kworker daemons.
+2. A given workqueue can be made visible in the sysfs filesystem
+ by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
+ Such a workqueue can be confined to a given subset of the
+ CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs
+ files. The set of WQ_SYSFS workqueues can be displayed using
+ "ls /sys/devices/virtual/workqueue". That said, the workqueues
+ maintainer would like to caution people against indiscriminately
+ sprinkling WQ_SYSFS across all the workqueues. The reason for
+ caution is that it is easy to add WQ_SYSFS, but because sysfs is
+ part of the formal user/kernel API, it can be nearly impossible
+ to remove it, even if its addition was a mistake.
+3. Do any of the following needed to avoid jitter that your
+ application cannot tolerate:
+
+ a. Build your kernel with CONFIG_SLUB=y rather than
+ CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
+ use of each CPU's workqueues to run its cache_reap()
+ function.
+ b. Avoid using oprofile, thus avoiding OS jitter from
+ wq_sync_buffer().
+ c. Limit your CPU frequency so that a CPU-frequency
+ governor is not required, possibly enlisting the aid of
+ special heatsinks or other cooling technologies. If done
+ correctly, and if your CPU architecture permits, you should
+ be able to build your kernel with CONFIG_CPU_FREQ=n to
+ avoid the CPU-frequency governor periodically running
+ on each CPU, including cs_dbs_timer() and od_dbs_timer().
+
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ d. As of v3.18, Christoph Lameter's on-demand vmstat workers
+ commit prevents OS jitter due to vmstat_update() on
+ CONFIG_SMP=y systems. Before v3.18, it is not possible
+ to entirely get rid of the OS jitter, but you can
+ decrease its frequency by writing a large value to
+ /proc/sys/vm/stat_interval. The default value is HZ,
+ for an interval of one second. Of course, larger values
+ will make your virtual-memory statistics update more
+ slowly. You can also run your workload at
+ a real-time priority, thus preempting vmstat_update(),
+ but if your workload is CPU-bound, this is a bad idea.
+ However, there is an RFC patch from Christoph Lameter
+ (based on an earlier one from Gilad Ben-Yossef) that
+ reduces or even eliminates vmstat overhead for some
+ workloads at https://lkml.org/lkml/2013/9/4/379.
+ e. If running on high-end powerpc servers, build with
+ CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
+ daemon from running on each CPU every second or so.
+ (This will require editing Kconfig files and will defeat
+ this platform's RAS functionality.) This avoids jitter
+ due to the rtas_event_scan() function.
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ f. If running on Cell Processor, build your kernel with
+ CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
+ spu_gov_work().
+ WARNING: Please check your CPU specifications to
+ make sure that this is safe on your particular system.
+ g. If running on PowerMAC, build your kernel with
+ CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
+ avoiding OS jitter from rackmeter_do_timer().
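+
+As a rough illustration of item 2 above: the cpumask files are expected to
+take a hexadecimal CPU mask. The sketch below is not part of the kernel.
+``cpus_to_mask`` is a hypothetical helper name, the workqueue name in the
+comment is made up, and the code assumes a shell with 64-bit arithmetic and
+CPU numbers below 63.
+

```shell
# cpus_to_mask CPU...: print the hexadecimal CPU mask that has one bit
# set for each CPU number given on the command line.
# The body runs in a subshell so that no variables leak to the caller.
cpus_to_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)

# Example (as root), confining a hypothetical WQ_SYSFS workqueue named
# "example_wq" to CPUs 0 and 1:
#   cpus_to_mask 0 1 > /sys/devices/virtual/workqueue/example_wq/cpumask
```

+
+Computing the mask separately makes it easy to check the value before
+anything is written to sysfs.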
+
+Name:
+ rcuc/%u
+
+Purpose:
+ Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
+
+To reduce its OS jitter, do at least one of the following:
+
+1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
+ kthreads from being created in the first place, and also obviates
+ the need for RCU priority boosting. This approach is feasible
+ for workloads that do not require high degrees of responsiveness.
+2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
+ kthreads from being created in the first place. This approach
+ is feasible only if your workload never requires RCU priority
+ boosting, for example, if you ensure frequent idle time on all
+ CPUs that might execute within the kernel.
+3. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs=
+ boot parameter offloading RCU callbacks from all CPUs susceptible
+ to OS jitter. This approach prevents the rcuc/%u kthreads from
+ having any work to do, so that they are never awakened.
+4. Ensure that the CPU never enters the kernel, and, in particular,
+ avoid initiating any CPU hotplug operations on this CPU. This is
+ another way of preventing any callbacks from being queued on the
+ CPU, again preventing the rcuc/%u kthreads from having any work
+ to do.
+
+Name:
+ rcuop/%d and rcuos/%d
+
+Purpose:
+ Offload RCU callbacks from the corresponding CPU.
+
+To reduce its OS jitter, do at least one of the following:
+
+1. Use affinity, cgroups, or another mechanism to force these kthreads
+ to execute on some other CPU.
+2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
+ kthreads from being created in the first place. However, please
+ note that this will not eliminate OS jitter, but will instead
+ shift it to RCU_SOFTIRQ.
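+
+Option 1 above could be approached with taskset(1). The following is only a
+sketch: ``bind_kthreads_cmds`` is a hypothetical helper, the CPU list 0-1
+stands in for whatever housekeeping CPUs your system uses, and the helper
+prints the commands instead of running them so that they can be reviewed
+first.
+

```shell
# bind_kthreads_cmds CPULIST: read PIDs (one per line) on stdin and print
# one "taskset -cp" command per PID. Nothing is executed here.
bind_kthreads_cmds() (
    cpulist=$1
    while read -r pid; do
        if [ -n "$pid" ]; then
            printf 'taskset -cp %s %s\n' "$cpulist" "$pid"
        fi
    done
)

# Example (as root), assuming CPUs 0-1 are the housekeeping CPUs:
#   pgrep '^rcuo' | bind_kthreads_cmds 0-1        # review the commands
#   pgrep '^rcuo' | bind_kthreads_cmds 0-1 | sh   # then run them
```
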
+
+Name:
+ watchdog/%u
+
+Purpose:
+ Detect software lockups on each CPU.
+
+To reduce its OS jitter, do at least one of the following:
+
+1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
+ kthreads from being created in the first place.
+2. Boot with "nosoftlockup=0", which will also prevent these kthreads
+ from being created. Other related watchdog and softlockup boot
+ parameters may be found in Documentation/admin-guide/kernel-parameters.rst
+ and Documentation/watchdog/watchdog-parameters.rst.
+3. Echo a zero to /proc/sys/kernel/watchdog to disable the
+ watchdog timer.
+4. Echo a large number to /proc/sys/kernel/watchdog_thresh in
+ order to reduce the frequency of OS jitter due to the watchdog
+ timer down to a level that is acceptable for your workload.
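+
+Options 3 and 4 can also be expressed as sysctl settings, since files under
+/proc/sys/kernel map to ``kernel.*`` sysctl names. A minimal sketch follows;
+``watchdog_sysctl_lines`` is a hypothetical helper, and the default of 60
+seconds is only an example value, not a recommendation.
+

```shell
# watchdog_sysctl_lines MODE [SECONDS]: print sysctl.conf-style settings.
#   MODE "off"  -> disable the watchdog (option 3)
#   MODE "slow" -> keep it enabled but raise watchdog_thresh (option 4)
watchdog_sysctl_lines() {
    case $1 in
        off)
            printf 'kernel.watchdog = 0\n'
            ;;
        slow)
            printf 'kernel.watchdog_thresh = %d\n' "${2:-60}"
            ;;
        *)
            echo "usage: watchdog_sysctl_lines off|slow [seconds]" >&2
            return 1
            ;;
    esac
}

# Example (as root); the file name below is an arbitrary choice:
#   watchdog_sysctl_lines slow 30 > /etc/sysctl.d/90-watchdog.conf
```
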
diff --git a/Documentation/admin-guide/laptops/asus-laptop.rst b/Documentation/admin-guide/laptops/asus-laptop.rst
new file mode 100644
index 0000000..9517632
--- /dev/null
+++ b/Documentation/admin-guide/laptops/asus-laptop.rst
@@ -0,0 +1,271 @@
+==================
+Asus Laptop Extras
+==================
+
+Version 0.1
+
+August 6, 2009
+
+Corentin Chary <corentincj@iksaif.net>
+http://acpi4asus.sf.net/
+
+ This driver provides support for extra features of ACPI-compatible ASUS laptops.
+ It may also support some MEDION, JVC or VICTOR laptops (such as MEDION 9675 or
+ VICTOR XP7210 for example). It makes all the extra buttons generate input
+ events (like keyboards).
+
+ On some models, it adds support for changing the display brightness and output,
+ switching the LCD backlight on and off, and most importantly, allows you to
+ blink those fancy LEDs intended for reporting mail and wireless status.
+
+This driver supersedes the old asus_acpi driver.
+
+Requirements
+------------
+
+ Kernel 2.6.X sources, configured for your computer, with ACPI support.
+ You also need CONFIG_INPUT and CONFIG_ACPI.
+
+Status
+------
+
+ The features currently supported are the following (see below for
+ detailed description):
+
+ - Fn key combinations
+ - Bluetooth enable and disable
+ - Wlan enable and disable
+ - GPS enable and disable
+ - Video output switching
+ - Ambient Light Sensor on and off
+ - LED control
+ - LED Display control
+ - LCD brightness control
+ - LCD on and off
+
+ A compatibility table by model and feature is maintained on the web
+ site, http://acpi4asus.sf.net/.
+
+Usage
+-----
+
+ Try "modprobe asus-laptop". Check your dmesg (simply type dmesg). You should
+ see some lines like this:
+
+ Asus Laptop Extras version 0.42
+ - L2D model detected.
+
+ If this is not the output you see on your laptop, send it (and the laptop's
+ DSDT) to me.
+
+ That's all. Now all the events generated by the hotkeys of your laptop
+ should be reported via netlink events. You can check with
+ "acpi_genl monitor" (part of the acpica project).
+
+ Hotkeys are also reported as input keys (like keyboards). You can check
+ which keys are supported using "xev" under X11.
+
+ You can get information on the version of your DSDT table by reading the
+ /sys/devices/platform/asus-laptop/infos entry. If you have a question or a
+ bug report, please include the output of this entry.
+
+LEDs
+----
+
+ You can modify LEDs by echoing values to `/sys/class/leds/asus/*/brightness`::
+
+ echo 1 > /sys/class/leds/asus::mail/brightness
+
+ will switch the mail LED on.
+
+ You can also find out whether they are on or off by reading these files,
+ and you can use kernel triggers like disk-activity or heartbeat.
+
+Backlight
+---------
+
+ You can control LCD backlight power and brightness with
+ /sys/class/backlight/asus-laptop/. Brightness values are between 0 and 15.
+
+Wireless devices
+----------------
+
+ You can turn the internal Bluetooth adapter on/off with the bluetooth entry
+ (only on models with Bluetooth). This usually controls the associated LED.
+ The same applies to the Wlan adapter.
+
+Display switching
+-----------------
+
+ Note: the display switching code is currently considered EXPERIMENTAL.
+
+ Switching works for the following models:
+
+ - L3800C
+ - A2500H
+ - L5800C
+ - M5200N
+ - W1000N (albeit with some glitches)
+ - M6700R
+ - A6JC
+ - F3J
+
+ Switching doesn't work for the following:
+
+ - M3700N
+ - L2X00D (locks the laptop under certain conditions)
+
+ To switch the displays, echo values from 0 to 15 to
+ /sys/devices/platform/asus-laptop/display. The significance of those values
+ is as follows:
+
+ +-------+-----+-----+-----+-----+-----+
+ | Bin | Val | DVI | TV | CRT | LCD |
+ +-------+-----+-----+-----+-----+-----+
+ | 0000 | 0 | | | | |
+ +-------+-----+-----+-----+-----+-----+
+ | 0001 | 1 | | | | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 0010 | 2 | | | X | |
+ +-------+-----+-----+-----+-----+-----+
+ | 0011 | 3 | | | X | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 0100 | 4 | | X | | |
+ +-------+-----+-----+-----+-----+-----+
+ | 0101 | 5 | | X | | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 0110 | 6 | | X | X | |
+ +-------+-----+-----+-----+-----+-----+
+ | 0111 | 7 | | X | X | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 1000 | 8 | X | | | |
+ +-------+-----+-----+-----+-----+-----+
+ | 1001 | 9 | X | | | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 1010 | 10 | X | | X | |
+ +-------+-----+-----+-----+-----+-----+
+ | 1011 | 11 | X | | X | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 1100 | 12 | X | X | | |
+ +-------+-----+-----+-----+-----+-----+
+ | 1101 | 13 | X | X | | X |
+ +-------+-----+-----+-----+-----+-----+
+ | 1110 | 14 | X | X | X | |
+ +-------+-----+-----+-----+-----+-----+
+ | 1111 | 15 | X | X | X | X |
+ +-------+-----+-----+-----+-----+-----+
+
+ In most cases, the appropriate displays must be plugged in for the above
+ combinations to work. TV-Out may need to be initialized at boot time.
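+
+ The table above is a plain bitmask: LCD is bit 0 (1), CRT is bit 1 (2),
+ TV is bit 2 (4) and DVI is bit 3 (8). As an illustration only, a small
+ helper (the name ``display_value`` is hypothetical, not something the
+ driver provides) can compute the value to write:
+

```shell
# display_value OUTPUT...: print the value for the "display" sysfs entry
# that enables exactly the named outputs (lcd, crt, tv, dvi).
# The body runs in a subshell so that no variables leak to the caller.
display_value() (
    value=0
    for output in "$@"; do
        case $output in
            lcd) bit=1 ;;
            crt) bit=2 ;;
            tv)  bit=4 ;;
            dvi) bit=8 ;;
            *)   echo "unknown output: $output" >&2; exit 1 ;;
        esac
        value=$(( value | bit ))
    done
    echo "$value"
)

# Example (as root): enable LCD and CRT together.
#   display_value lcd crt > /sys/devices/platform/asus-laptop/display
```
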
+
+ Debugging:
+
+ 1) Check whether the Fn+F8 key:
+
+ a) does not lock the laptop (try a boot with noapic / nolapic if it does)
+ b) generates events (0x6n, where n is the value corresponding to the
+ configuration above)
+ c) actually works
+
+ Record the disp value at every configuration.
+ 2) Echo values from 0 to 15 to /sys/devices/platform/asus-laptop/display.
+ Record its value, note any change. If nothing changes, try a broader range,
+ up to 65535.
+ 3) Send ANY output (both positive and negative reports are needed, unless your
+ machine is already listed above) to the acpi4asus-user mailing list.
+
+ Note: on some machines (e.g. L3C), after the module has been loaded, only 0x6n
+ events are generated and no actual switching occurs. In such a case, a line
+ like::
+
+ echo $((10#$arg-60)) > /sys/devices/platform/asus-laptop/display
+
+ will usually do the trick ($arg is the 0000006n-like event passed to acpid).
+
+ Note: there is currently no reliable way to read display status on xxN
+ (Centrino) models.
+
+LED display
+-----------
+
+ Some models like the W1N have an LED display that can be used to display
+ several items of information.
+
+ LED display works for the following models:
+
+ - W1000N
+ - W1J
+
+ To control the LED display, use the following::
+
+ echo 0x0T000DDD > /sys/devices/platform/asus-laptop/
+
+ where T controls the 3-letter display and DDD the 3-digit display,
+ according to the tables below::
+
+ DDD (digits)
+ 000 to 999 = display digits
+ AAA = ---
+ BBB to FFF = turn-off
+
+ T (type)
+ 0 = off
+ 1 = dvd
+ 2 = vcd
+ 3 = mp3
+ 4 = cd
+ 5 = tv
+ 6 = cpu
+ 7 = vol
+
+ For example "echo 0x01000001 >/sys/devices/platform/asus-laptop/ledd"
+ would display "DVD001".
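+
+ The 0x0T000DDD layout can be assembled mechanically. The sketch below
+ uses a hypothetical helper named ``ledd_value``; it only handles decimal
+ digits 0 to 999, not the special AAA to FFF codes from the table above.
+

```shell
# ledd_value TYPE NUMBER: print the 0x0T000DDD value for the LED display.
# TYPE is one of off, dvd, vcd, mp3, cd, tv, cpu, vol; NUMBER is 0..999.
ledd_value() (
    case $1 in
        off) t=0 ;; dvd) t=1 ;; vcd) t=2 ;; mp3) t=3 ;;
        cd)  t=4 ;; tv)  t=5 ;; cpu) t=6 ;; vol) t=7 ;;
        *)   echo "unknown type: $1" >&2; exit 1 ;;
    esac
    case $2 in
        ''|*[!0-9]*) echo "not a number: $2" >&2; exit 1 ;;
    esac
    # Strip leading zeros so printf does not read the number as octal.
    n=$2
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    if [ "$n" -gt 999 ]; then
        echo "number out of range: $2" >&2
        exit 1
    fi
    printf '0x0%d000%03d\n' "$t" "$n"
)

# Example (as root), matching the "DVD001" example above:
#   ledd_value dvd 1 > /sys/devices/platform/asus-laptop/ledd
```
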
+
+Driver options
+--------------
+
+ Options can be passed to the asus-laptop driver using the standard
+ module argument syntax (<param>=<value> when passing the option to the
+ module or asus-laptop.<param>=<value> on the kernel boot line when
+ asus-laptop is statically linked into the kernel).
+
+ wapf: WAPF defines the behavior of the Fn+Fx wlan key
+ The significance of values is yet to be found, but
+ most of the time:
+
+ - 0x0 should do nothing
+ - 0x1 should allow controlling the device with the Fn+Fx key.
+ - 0x4 should send an ACPI event (0x88) while pressing the Fn+Fx key
+ - 0x5 like 0x1 or 0x4
+
+ The default value is 0x1.
+
+Unsupported models
+------------------
+
+ These models will never be supported by this module, as they use a completely
+ different mechanism to handle LEDs and extra stuff (meaning we have no clue
+ how it works):
+
+ - ASUS A1300 (A1B), A1370D
+ - ASUS L7300G
+ - ASUS L8400
+
+Patches, Errors, Questions
+--------------------------
+
+ I appreciate any success or failure
+ reports, especially if they add to or correct the compatibility table.
+ Please include the following information in your report:
+
+ - Asus model name
+ - a copy of your ACPI tables, using the "acpidump" utility
+ - a copy of /sys/devices/platform/asus-laptop/infos
+ - which driver features work and which don't
+ - the observed behavior of non-working features
+
+ Any other comments or patches are also more than welcome.
+
+ acpi4asus-user@lists.sourceforge.net
+
+ http://sourceforge.net/projects/acpi4asus
diff --git a/Documentation/admin-guide/laptops/disk-shock-protection.rst b/Documentation/admin-guide/laptops/disk-shock-protection.rst
new file mode 100644
index 0000000..e97c5f7
--- /dev/null
+++ b/Documentation/admin-guide/laptops/disk-shock-protection.rst
@@ -0,0 +1,151 @@
+==========================
+Hard disk shock protection
+==========================
+
+Author: Elias Oltmanns <eo@nebensachen.de>
+
+Last modified: 2008-10-03
+
+
+.. 0. Contents
+
+ 1. Intro
+ 2. The interface
+ 3. References
+ 4. CREDITS
+
+
+1. Intro
+--------
+
+ATA/ATAPI-7 specifies the IDLE IMMEDIATE command with unload feature.
+Issuing this command should cause the drive to switch to idle mode and
+unload disk heads. This feature is being used in modern laptops in
+conjunction with accelerometers and appropriate software to implement
+a shock protection facility. The idea is to stop all I/O operations on
+the internal hard drive and park its heads on the ramp when critical
+situations are anticipated. The desire to have such a feature
+available on GNU/Linux systems has been the original motivation to
+implement a generic disk head parking interface in the Linux kernel.
+Please note, however, that other components have to be set up on your
+system in order to get disk shock protection working (see
+section 3. References below for pointers to more information about
+that).
+
+
+2. The interface
+----------------
+
+For each ATA device, the kernel exports the file
+`block/*/device/unload_heads` in sysfs (here assumed to be mounted under
+/sys). Access to `/sys/block/*/device/unload_heads` is denied with
+-EOPNOTSUPP if the device does not support the unload feature.
+Otherwise, writing an integer value to this file will take the heads
+of the respective drive off the platter and block all I/O operations
+for the specified number of milliseconds. When the timeout expires and
+no further disk head park request has been issued in the meantime,
+normal operation will be resumed. The maximal value accepted for a
+timeout is 30000 milliseconds. Exceeding this limit will return
+-EOVERFLOW, but heads will be parked anyway and the timeout will be
+set to 30 seconds. However, you can always change a timeout to any
+value between 0 and 30000 by issuing a subsequent head park request
+before the timeout of the previous one has expired. In particular, the
+total timeout can exceed 30 seconds and, more importantly, you can
+cancel a previously set timeout and resume normal operation
+immediately by specifying a timeout of 0. Values below -2 are rejected
+with -EINVAL (see below for the special meaning of -1 and -2). If the
+timeout specified for a recent head park request has not yet expired,
+reading from `/sys/block/*/device/unload_heads` will report the number
+of milliseconds remaining until normal operation will be resumed;
+otherwise, reading the unload_heads attribute will return 0.
+
+For example, do the following in order to park the heads of drive
+/dev/sda and stop all I/O operations for five seconds::
+
+ # echo 5000 > /sys/block/sda/device/unload_heads
+
+A simple::
+
+ # cat /sys/block/sda/device/unload_heads
+
+will show you how many milliseconds are left before normal operation
+will be resumed.
+
+A word of caution: The fact that the interface operates on a basis of
+milliseconds may raise expectations that cannot be satisfied in
+reality. In fact, the ATA specs clearly state that the time for an
+unload operation to complete is vendor specific. The hint in ATA-7
+that this will typically be within 500 milliseconds apparently has
+been dropped in ATA-8.
+
+There is a technical detail of this implementation that may cause some
+confusion and should be discussed here. When a head park request has
+been issued to a device successfully, all I/O operations on the
+controller port this device is attached to will be deferred. That is
+to say, any other device that may be connected to the same port will
+be affected too. The only exception is that a subsequent head unload
+request to that other device will be executed immediately. Further
+operations on that port will be deferred until the timeout specified
+for either device on the port has expired. As far as PATA (old style
+IDE) configurations are concerned, there can only be two devices
+attached to any single port. In SATA world we have port multipliers
+which means that a user-issued head parking request to one device may
+actually result in stopping I/O to a whole bunch of devices. However,
+since this feature is supposed to be used on laptops and does not seem
+to be very useful in any other environment, there will be mostly one
+device per port. Even if the CD/DVD writer happens to be connected to
+the same port as the hard drive, it generally *should* recover just
+fine from the occasional buffer under-run incurred by a head park
+request to the HD. Actually, when you are using an ide driver rather
+than its libata counterpart (i.e. your disk is called /dev/hda
+instead of /dev/sda), then parking the heads of one drive (drive X)
+will generally not affect the mode of operation of another drive
+(drive Y) on the same port as described above. It is only when a port
+reset is required to recover from an exception on drive Y that further
+I/O operations on that drive (and the reset itself) will be delayed
+until drive X is no longer in the parked state.
+
+Finally, there are some hard drives that only comply with an earlier
+version of the ATA standard than ATA-7, but do support the unload
+feature nonetheless. Unfortunately, there is no safe way Linux can
+detect these devices, so you won't be able to write to the
+unload_heads attribute. If you know that your device really does
+support the unload feature (for instance, because the vendor of your
+laptop or the hard drive itself told you so), then you can tell the
+kernel to enable the usage of this feature for that drive by writing
+the special value -1 to the unload_heads attribute::
+
+ # echo -1 > /sys/block/sda/device/unload_heads
+
+will enable the feature for /dev/sda, and giving -2 instead of -1 will
+disable it again.
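+
+The value ranges described in this section can be summarized in a few lines
+of shell. This is only a reading aid under the rules stated above, not code
+taken from the kernel; ``unload_heads_meaning`` is a hypothetical name.
+

```shell
# unload_heads_meaning VALUE: describe what writing VALUE to the
# unload_heads attribute is documented to do.
unload_heads_meaning() (
    v=$1
    if [ "$v" -lt -2 ]; then
        echo "rejected with -EINVAL"
    elif [ "$v" -eq -2 ]; then
        echo "disable the unload feature for this drive"
    elif [ "$v" -eq -1 ]; then
        echo "enable the unload feature for this drive"
    elif [ "$v" -eq 0 ]; then
        echo "cancel the timeout and resume normal operation"
    elif [ "$v" -le 30000 ]; then
        echo "park heads for $v ms"
    else
        echo "park heads for 30000 ms and return -EOVERFLOW"
    fi
)
```

+
+For example, ``unload_heads_meaning 45000`` reports that the heads are parked
+for 30000 ms even though the write itself returns -EOVERFLOW.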
+
+
+3. References
+-------------
+
+There are several laptops from different vendors featuring shock
+protection capabilities. As manufacturers have refused to support open
+source development of the required software components so far, Linux
+support for shock protection varies considerably between different
+hardware implementations. Ideally, this section should contain a list
+of pointers at different projects aiming at an implementation of shock
+protection on different systems. Unfortunately, I only know of a
+single project which, although still considered experimental, is fit
+for use. Please feel free to add projects that have been the victims
+of my ignorance.
+
+- http://www.thinkwiki.org/wiki/HDAPS
+
+ See this page for information about Linux support of the hard disk
+ active protection system as implemented in IBM/Lenovo Thinkpads.
+
+
+4. CREDITS
+----------
+
+This implementation of disk head parking has been inspired by a patch
+originally published by Jon Escombe <lists@dresco.co.uk>. My efforts
+to develop an implementation of this feature that is fit to be merged
+into mainline have been aided by various kernel developers, in
+particular by Tejun Heo and Bartlomiej Zolnierkiewicz.
diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst
new file mode 100644
index 0000000..cd9a1c2
--- /dev/null
+++ b/Documentation/admin-guide/laptops/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Laptop Drivers
+==============
+
+.. toctree::
+ :maxdepth: 1
+
+ asus-laptop
+ disk-shock-protection
+ laptop-mode
+ lg-laptop
+ sony-laptop
+ sonypi
+ thinkpad-acpi
+ toshiba_haps
diff --git a/Documentation/admin-guide/laptops/laptop-mode.rst b/Documentation/admin-guide/laptops/laptop-mode.rst
new file mode 100644
index 0000000..c984c42
--- /dev/null
+++ b/Documentation/admin-guide/laptops/laptop-mode.rst
@@ -0,0 +1,781 @@
+===============================================
+How to conserve battery power using laptop-mode
+===============================================
+
+Document Author: Bart Samwel (bart@samwel.tk)
+
+Date created: January 2, 2004
+
+Last modified: December 06, 2004
+
+Introduction
+------------
+
+Laptop mode is used to minimize the time that the hard disk needs to be spun up,
+to conserve battery power on laptops. It has been reported to cause significant
+power savings.
+
+.. Contents
+
+ * Introduction
+ * Installation
+ * Caveats
+ * The Details
+ * Tips & Tricks
+ * Control script
+ * ACPI integration
+ * Monitoring tool
+
+
+Installation
+------------
+
+To use laptop mode, you don't need to set any kernel configuration options
+or anything. Simply install all the files included in this document, and
+laptop mode will automatically be started when you're on battery. For
+your convenience, a tarball containing an installer can be downloaded at:
+
+ http://www.samwel.tk/laptop_mode/laptop_mode/
+
+To configure laptop mode, you need to edit the configuration file, which is
+located in /etc/default/laptop-mode on Debian-based systems, or in
+/etc/sysconfig/laptop-mode on other systems.
+
+Unfortunately, automatic enabling of laptop mode does not work for
+laptops that don't have ACPI. On those laptops, you need to start laptop
+mode manually. To start laptop mode, run "laptop_mode start", and to
+stop it, run "laptop_mode stop". (Note: The laptop mode tools package now
+has experimental support for APM, you might want to try that first.)
+
+
+Caveats
+-------
+
+* The downside of laptop mode is that you have a chance of losing up to 10
+ minutes of work. If you cannot afford this, don't use it! The supplied ACPI
+ scripts automatically turn off laptop mode when the battery almost runs out,
+ so that you won't lose any data at the end of your battery life.
+
+* Most desktop hard drives have a very limited lifetime measured in spindown
+ cycles, typically about 50,000 (it's usually listed on the spec sheet).
+ Check your drive's rating, and don't wear down your drive's lifetime if you
+ don't need to.
+
+* If you mount some of your ext3/reiserfs filesystems with the -n option, then
+ the control script will not be able to remount them correctly. You must set
+ DO_REMOUNTS=0 in the control script, otherwise it will remount them with the
+ wrong options -- or it will fail because it cannot write to /etc/mtab.
+
+* If you have your filesystems listed as type "auto" in fstab, like I did, then
+ the control script will not recognize them as filesystems that need remounting.
+ You must list the filesystems with their true type instead.
+
+* It has been reported that some versions of the mutt mail client use file access
+ times to determine whether a folder contains new mail. If you use mutt and
+ experience this, you must disable the noatime remounting by setting the option
+ DO_REMOUNT_NOATIME to 0 in the configuration file.
+
+
+The Details
+-----------
+
+Laptop mode is controlled by the knob /proc/sys/vm/laptop_mode. This knob is
+present for all kernels that have the laptop mode patch, regardless of any
+configuration options. When the knob is set, any physical disk I/O (that might
+have caused the hard disk to spin up) causes Linux to flush all dirty blocks. The
+result of this is that after a disk has spun down, it will not be spun up
+anymore to write dirty blocks, because those blocks had already been written
+immediately after the most recent read operation. The value of the laptop_mode
+knob determines the time between the occurrence of disk I/O and when the flush
+is triggered. A sensible value for the knob is 5 seconds. Setting the knob to
+0 disables laptop mode.
+
+To increase the effectiveness of the laptop_mode strategy, the laptop_mode
+control script increases dirty_expire_centisecs and dirty_writeback_centisecs in
+/proc/sys/vm to about 10 minutes (by default), which means that pages that are
+dirtied are not forced to be written to disk as often. The control script also
+changes the dirty background ratio, so that background writeback of dirty pages
+is not done anymore. Combined with a higher commit value (also 10 minutes) for
+ext3 or ReiserFS filesystems (also done automatically by the control script),
+this results in concentration of disk activity in a small time interval which
+occurs only once every 10 minutes, or whenever the disk is forced to spin up by
+a cache miss. The disk can then be spun down in the periods of inactivity.
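+
+To make the units concrete: the two ``*_centisecs`` knobs are expressed in
+hundredths of a second, so 10 minutes is 60000. The sketch below is not the
+laptop_mode control script; the helper names are hypothetical, and it covers
+only the three knobs named above (not the dirty ratios or the filesystem
+commit interval).
+

```shell
# minutes_to_centisecs MINUTES: 1 minute = 60 s = 6000 centiseconds.
minutes_to_centisecs() {
    echo $(( $1 * 60 * 100 ))
}

# laptop_mode_sysctl_lines MINUTES [FLUSH_SECONDS]: print sysctl.conf-style
# settings. FLUSH_SECONDS is the laptop_mode knob value (default 5).
laptop_mode_sysctl_lines() (
    age=$(minutes_to_centisecs "$1")
    printf 'vm.laptop_mode = %d\n' "${2:-5}"
    printf 'vm.dirty_expire_centisecs = %d\n' "$age"
    printf 'vm.dirty_writeback_centisecs = %d\n' "$age"
)

# Example: show the settings for the default 10-minute interval.
#   laptop_mode_sysctl_lines 10
```
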
+
+If you want to find out which process caused the disk to spin up, you can
+gather information by setting the flag /proc/sys/vm/block_dump. When this flag
+is set, Linux reports all disk read and write operations that take place, and
+all block dirtyings done to files. This makes it possible to debug why a disk
+needs to spin up, and to increase battery life even more. The output of
+block_dump is written to the kernel output, and it can be retrieved using
+"dmesg". When you use block_dump and your kernel logging level also includes
+kernel debugging messages, you probably want to turn off klogd, otherwise
+the output of block_dump will be logged, causing disk activity that is not
+normally there.
+
+
+Configuration
+-------------
+
+The laptop mode configuration file is located in /etc/default/laptop-mode on
+Debian-based systems, or in /etc/sysconfig/laptop-mode on other systems. It
+contains the following options:
+
+MAX_AGE:
+
+Maximum time, in seconds, of hard drive spindown time that you are
+comfortable with. Worst case, it's possible that you could lose this
+amount of work if your battery fails while you're in laptop mode.
+
+MINIMUM_BATTERY_MINUTES:
+
+Automatically disable laptop mode if the remaining number of minutes of
+battery power is less than this value. Default is 10 minutes.
+
+AC_HD/BATT_HD:
+
+The idle timeout that should be set on your hard drive when laptop mode
+is active (BATT_HD) and when it is not active (AC_HD). The defaults are
+20 seconds (value 4) for BATT_HD and 2 hours (value 244) for AC_HD. The
+possible values are those listed in the manual page for "hdparm" for the
+"-S" option.
+
+HD:
+
+The devices for which the spindown timeout should be adjusted by laptop mode.
+Default is /dev/hda. If you specify multiple devices, separate them by a space.
+
+READAHEAD:
+
+Disk readahead, in 512-byte sectors, while laptop mode is active. A large
+readahead can prevent disk accesses for things like executable pages (which are
+loaded on demand while the application executes) and sequentially accessed data
+(MP3s).
+
+DO_REMOUNTS:
+
+The control script automatically remounts any mounted journaled filesystems
+with appropriate commit interval options. When this option is set to 0, this
+feature is disabled.
+
+DO_REMOUNT_NOATIME:
+
+When remounting, should the filesystems be remounted with the noatime option?
+Normally, this is set to "1" (enabled), but there may be programs that require
+access time recording.
+
+DIRTY_RATIO:
+
+The percentage of memory that is allowed to contain "dirty" or unsaved data
+before a writeback is forced, while laptop mode is active. Corresponds to
+the /proc/sys/vm/dirty_ratio sysctl.
+
+DIRTY_BACKGROUND_RATIO:
+
+The percentage of memory that is allowed to contain "dirty" or unsaved data
+after a forced writeback is done due to an exceeding of DIRTY_RATIO. Set
+this nice and low. This corresponds to the /proc/sys/vm/dirty_background_ratio
+sysctl.
+
+Note that the behaviour of dirty_background_ratio is quite different
+when laptop mode is active and when it isn't. When laptop mode is inactive,
+dirty_background_ratio is the threshold percentage at which background writeouts
+start taking place. When laptop mode is active, however, background writeouts
+are disabled, and the dirty_background_ratio only determines how much writeback
+is done when dirty_ratio is reached.
+
+DO_CPU:
+
+Enable CPU frequency scaling when in laptop mode. (Requires CPUFreq to be set up.
+See Documentation/admin-guide/pm/cpufreq.rst for more info. Disabled by default.)
+
+CPU_MAXFREQ:
+
+When on battery, what is the maximum CPU speed that the system should use? Legal
+values are "slowest" for the slowest speed that your CPU is able to operate at,
+or a value listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.
+
+
+Tips & Tricks
+-------------
+
+* Bartek Kania reports getting up to 50 minutes of extra battery life (on top
+ of his regular 3 to 3.5 hours) using a spindown time of 5 seconds (BATT_HD=1).
+
+* You can spin down the disk while playing MP3, by setting disk readahead
+ to 8MB (READAHEAD=16384). Effectively, the disk will read a complete MP3 at
+ once, and will then spin down while the MP3 is playing. (Thanks to Bartek
+ Kania.)
+
+* Drew Scott Daniels observed: "I don't know why, but when I decrease the number
+ of colours that my display uses it consumes less battery power. I've seen
+ this on powerbooks too. I hope that this is a piece of information that
+ might be useful to the Laptop Mode patch or its users."
+
+* In syslog.conf, you can prefix entries with a dash `-` to omit syncing the
+ file after every logging. When you're using laptop-mode and your disk doesn't
+ spin down, this is a likely culprit.
+
+* Richard Atterer observed that laptop mode does not work well with noflushd
+ (http://noflushd.sourceforge.net/); it seems that noflushd prevents laptop-mode
+ from doing its thing.
+
+* If you're worried about your data, you might want to consider using a USB
+ memory stick or something like that as a "working area". (Be aware though
+ that flash memory can only handle a limited number of writes, and overuse
+ may wear out your memory stick pretty quickly. Do _not_ use journalling
+ filesystems on flash memory sticks.)
+
+
+Configuration file for control and ACPI battery scripts
+-------------------------------------------------------
+
+This allows the tunables to be changed for the scripts via an external
+configuration file.
+
+It should be installed as /etc/default/laptop-mode on Debian, and as
+/etc/sysconfig/laptop-mode on Red Hat, SUSE, Mandrake, and other work-alikes.
+
+Config file::
+
+ # Maximum time, in seconds, of hard drive spindown time that you are
+ # comfortable with. Worst case, it's possible that you could lose this
+ # amount of work if your battery fails you while in laptop mode.
+ #MAX_AGE=600
+
+ # Automatically disable laptop mode when the number of minutes of battery
+ # that you have left goes below this threshold.
+ MINIMUM_BATTERY_MINUTES=10
+
+ # Read-ahead, in 512-byte sectors. You can spin down the disk while playing MP3/OGG
+ # by setting the disk readahead to 8MB (READAHEAD=16384). Effectively, the disk
+ # will read a complete MP3 at once, and will then spin down while the MP3/OGG is
+ # playing.
+ #READAHEAD=4096
+
+ # Shall we remount journaled fs. with appropriate commit interval? (1=yes)
+ #DO_REMOUNTS=1
+
+ # And shall we add the "noatime" option to that as well? (1=yes)
+ #DO_REMOUNT_NOATIME=1
+
+ # Dirty synchronous ratio. At this percentage of dirty pages the process
+ # which calls write() does its own writeback
+ #DIRTY_RATIO=40
+
+ #
+ # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
+ # exceeded, the kernel will wake flusher threads which will then reduce the
+ # amount of dirty memory to dirty_background_ratio. Set this nice and low,
+ # so once some writeout has commenced, we do a lot of it.
+ #
+ #DIRTY_BACKGROUND_RATIO=5
+
+ # kernel default dirty buffer age
+ #DEF_AGE=30
+ #DEF_UPDATE=5
+ #DEF_DIRTY_BACKGROUND_RATIO=10
+ #DEF_DIRTY_RATIO=40
+ #DEF_XFS_AGE_BUFFER=15
+ #DEF_XFS_SYNC_INTERVAL=30
+ #DEF_XFS_BUFD_INTERVAL=1
+
+ # This must be adjusted manually to the value of HZ in the running kernel
+ # on 2.4, until the XFS people change their 2.4 external interfaces to work in
+ # centisecs. This can be automated, but it's a work in progress that still
+ # needs some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for
+ # external interfaces, and that is currently always set to 100. So you don't
+ # need to change this on 2.6.
+ #XFS_HZ=100
+
+ # Should the maximum CPU frequency be adjusted down while on battery?
+ # Requires CPUFreq to be set up.
+ # See Documentation/admin-guide/pm/cpufreq.rst for more info
+ #DO_CPU=0
+
+ # When on battery what is the maximum CPU speed that the system should
+ # use? Legal values are "slowest" for the slowest speed that your
+ # CPU is able to operate at, or a value listed in:
+ # /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
+ # Only applicable if DO_CPU=1.
+ #CPU_MAXFREQ=slowest
+
+ # Idle timeout for your hard drive (man hdparm for valid values, -S option)
+ # Default is 2 hours on AC (AC_HD=244) and 20 seconds for battery (BATT_HD=4).
+ #AC_HD=244
+ #BATT_HD=4
+
+ # The drives for which to adjust the idle timeout. Separate them by a space,
+ # e.g. HD="/dev/hda /dev/hdb".
+ #HD="/dev/hda"
+
+ # Set the spindown timeout on a hard drive?
+ #DO_HD=1
+
+
+Control script
+--------------
+
+Please note that this control script works for the Linux 2.4 and 2.6 series (thanks
+to Kiko Piris).
+
+Control script::
+
+ #!/bin/bash
+
+ # start or stop laptop_mode, best run by a power management daemon when
+ # ac gets connected/disconnected from a laptop
+ #
+ # install as /sbin/laptop_mode
+ #
+ # Contributors to this script: Kiko Piris
+ # Bart Samwel
+ # Micha Feigin
+ # Andrew Morton
+ # Herve Eychenne
+ # Dax Kelson
+ #
+ # Original Linux 2.4 version by: Jens Axboe
+
+ #############################################################################
+
+ # Source config
+ if [ -f /etc/default/laptop-mode ] ; then
+ # Debian
+ . /etc/default/laptop-mode
+ elif [ -f /etc/sysconfig/laptop-mode ] ; then
+ # Others
+ . /etc/sysconfig/laptop-mode
+ fi
+
+ # Don't raise an error if the config file is incomplete
+ # set defaults instead:
+
+ # Maximum time, in seconds, of hard drive spindown time that you are
+ # comfortable with. Worst case, it's possible that you could lose this
+ # amount of work if your battery fails you while in laptop mode.
+ MAX_AGE=${MAX_AGE:-'600'}
+
+ # Read-ahead, in kilobytes
+ READAHEAD=${READAHEAD:-'4096'}
+
+ # Shall we remount journaled fs. with appropriate commit interval? (1=yes)
+ DO_REMOUNTS=${DO_REMOUNTS:-'1'}
+
+ # And shall we add the "noatime" option to that as well? (1=yes)
+ DO_REMOUNT_NOATIME=${DO_REMOUNT_NOATIME:-'1'}
+
+ # Shall we adjust the idle timeout on a hard drive?
+ DO_HD=${DO_HD:-'1'}
+
+ # Adjust idle timeout on which hard drive?
+ HD="${HD:-/dev/hda}"
+
+ # spindown time for HD (hdparm -S values)
+ AC_HD=${AC_HD:-'244'}
+ BATT_HD=${BATT_HD:-'4'}
+
+ # Dirty synchronous ratio. At this percentage of dirty pages the process which
+ # calls write() does its own writeback
+ DIRTY_RATIO=${DIRTY_RATIO:-'40'}
+
+ # cpu frequency scaling
+ # See Documentation/admin-guide/pm/cpufreq.rst for more info
+ DO_CPU=${DO_CPU:-'0'}
+ CPU_MAXFREQ=${CPU_MAXFREQ:-'slowest'}
+
+ #
+ # Allowed dirty background ratio, in percent. Once DIRTY_RATIO has been
+ # exceeded, the kernel will wake flusher threads which will then reduce the
+ # amount of dirty memory to dirty_background_ratio. Set this nice and low,
+ # so once some writeout has commenced, we do a lot of it.
+ #
+ DIRTY_BACKGROUND_RATIO=${DIRTY_BACKGROUND_RATIO:-'5'}
+
+ # kernel default dirty buffer age
+ DEF_AGE=${DEF_AGE:-'30'}
+ DEF_UPDATE=${DEF_UPDATE:-'5'}
+ DEF_DIRTY_BACKGROUND_RATIO=${DEF_DIRTY_BACKGROUND_RATIO:-'10'}
+ DEF_DIRTY_RATIO=${DEF_DIRTY_RATIO:-'40'}
+ DEF_XFS_AGE_BUFFER=${DEF_XFS_AGE_BUFFER:-'15'}
+ DEF_XFS_SYNC_INTERVAL=${DEF_XFS_SYNC_INTERVAL:-'30'}
+ DEF_XFS_BUFD_INTERVAL=${DEF_XFS_BUFD_INTERVAL:-'1'}
+
+ # This must be adjusted manually to the value of HZ in the running kernel
+ # on 2.4, until the XFS people change their 2.4 external interfaces to work in
+ # centisecs. This can be automated, but it's a work in progress that still needs
+ # some fixes. On 2.6 kernels, XFS uses USER_HZ instead of HZ for external
+ # interfaces, and that is currently always set to 100. So you don't need to
+ # change this on 2.6.
+ XFS_HZ=${XFS_HZ:-'100'}
+
+ #############################################################################
+
+ KLEVEL="$(uname -r |
+ {
+ IFS='.' read a b c
+ echo $a.$b
+ }
+ )"
+ case "$KLEVEL" in
+ "2.4"|"2.6")
+ ;;
+ *)
+ echo "Unhandled kernel version: $KLEVEL ('uname -r' = '$(uname -r)')" >&2
+ exit 1
+ ;;
+ esac
+
+ if [ ! -e /proc/sys/vm/laptop_mode ] ; then
+ echo "Kernel is not patched with laptop_mode patch." >&2
+ exit 1
+ fi
+
+ if [ ! -w /proc/sys/vm/laptop_mode ] ; then
+ echo "You do not have enough privileges to enable laptop_mode." >&2
+ exit 1
+ fi
+
+ # Remove an option (the first parameter) of the form option=<number> from
+ # a mount options string (the rest of the parameters).
+ parse_mount_opts () {
+ OPT="$1"
+ shift
+ echo ",$*," | sed \
+ -e 's/,'"$OPT"'=[0-9]*,/,/g' \
+ -e 's/,,*/,/g' \
+ -e 's/^,//' \
+ -e 's/,$//'
+ }
+
+ # Remove an option (the first parameter) without any arguments from
+ # a mount option string (the rest of the parameters).
+ parse_nonumber_mount_opts () {
+ OPT="$1"
+ shift
+ echo ",$*," | sed \
+ -e 's/,'"$OPT"',/,/g' \
+ -e 's/,,*/,/g' \
+ -e 's/^,//' \
+ -e 's/,$//'
+ }
+
+ # Find out the state of a yes/no option (e.g. "atime"/"noatime") in
+ # fstab for a given filesystem, and use this state to replace the
+ # value of the option in another mount options string. The device
+ # is the first argument, the option name the second, and the default
+ # value the third. The remainder is the mount options string.
+ #
+ # Example:
+ # parse_yesno_opts_wfstab /dev/hda1 atime atime defaults,noatime
+ #
+ # If fstab contains, say, "rw" for this filesystem, then the result
+ # will be "defaults,atime".
+ parse_yesno_opts_wfstab () {
+ L_DEV="$1"
+ OPT="$2"
+ DEF_OPT="$3"
+ shift 3
+ L_OPTS="$*"
+ PARSEDOPTS1="$(parse_nonumber_mount_opts $OPT $L_OPTS)"
+ PARSEDOPTS1="$(parse_nonumber_mount_opts no$OPT $PARSEDOPTS1)"
+ # Watch for a default atime in fstab
+ FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
+ if echo "$FSTAB_OPTS" | grep "$OPT" > /dev/null ; then
+ # option specified in fstab: extract the value and use it
+ if echo "$FSTAB_OPTS" | grep "no$OPT" > /dev/null ; then
+ echo "$PARSEDOPTS1,no$OPT"
+ else
+ # no$OPT not found -- so we must have $OPT.
+ echo "$PARSEDOPTS1,$OPT"
+ fi
+ else
+ # option not specified in fstab -- choose the default.
+ echo "$PARSEDOPTS1,$DEF_OPT"
+ fi
+ }
+
+ # Find out the state of a numbered option (e.g. "commit=NNN") in
+ # fstab for a given filesystem, and use this state to replace the
+ # value of the option in another mount options string. The device
+ # is the first argument, and the option name the second. The
+ # remainder is the mount options string in which the replacement
+ # must be done.
+ #
+ # Example:
+ # parse_mount_opts_wfstab /dev/hda1 commit defaults,commit=7
+ #
+ # If fstab contains, say, "commit=3,rw" for this filesystem, then the
+ # result will be "rw,commit=3".
+ parse_mount_opts_wfstab () {
+ L_DEV="$1"
+ OPT="$2"
+ shift 2
+ L_OPTS="$*"
+ PARSEDOPTS1="$(parse_mount_opts $OPT $L_OPTS)"
+ # Watch for a default commit in fstab
+ FSTAB_OPTS="$(awk '$1 == "'$L_DEV'" { print $4 }' /etc/fstab)"
+ if echo "$FSTAB_OPTS" | grep "$OPT=" > /dev/null ; then
+ # option specified in fstab: extract the value, and use it
+ echo -n "$PARSEDOPTS1,$OPT="
+ echo ",$FSTAB_OPTS," | sed \
+ -e 's/.*,'"$OPT"'=//' \
+ -e 's/,.*//'
+ else
+ # option not specified in fstab: set it to 0
+ echo "$PARSEDOPTS1,$OPT=0"
+ fi
+ }
+
+ deduce_fstype () {
+ MP="$1"
+ # My root filesystem unfortunately has
+ # type "unknown" in /etc/mtab. If we encounter
+ # "unknown", we try to get the type from fstab.
+ cat /etc/fstab |
+ grep -v '^#' |
+ while read FSTAB_DEV FSTAB_MP FSTAB_FST FSTAB_OPTS FSTAB_DUMP FSTAB_PASS ; do
+ if [ "$FSTAB_MP" = "$MP" ]; then
+ echo $FSTAB_FST
+ exit 0
+ fi
+ done
+ }
+
+ if [ $DO_REMOUNT_NOATIME -eq 1 ] ; then
+ NOATIME_OPT=",noatime"
+ fi
+
+ case "$1" in
+ start)
+ AGE=$((100*$MAX_AGE))
+ XFS_AGE=$(($XFS_HZ*$MAX_AGE))
+ echo -n "Starting laptop_mode"
+
+ if [ -d /proc/sys/vm/pagebuf ] ; then
+ # (For 2.4 and early 2.6.)
+ # This only needs to be set, not reset -- it is only used when
+ # laptop mode is enabled.
+ echo $XFS_AGE > /proc/sys/vm/pagebuf/lm_flush_age
+ echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
+ elif [ -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
+ # (A couple of early 2.6 laptop mode patches had these.)
+ # The same goes for these.
+ echo $XFS_AGE > /proc/sys/fs/xfs/lm_age_buffer
+ echo $XFS_AGE > /proc/sys/fs/xfs/lm_sync_interval
+ elif [ -f /proc/sys/fs/xfs/age_buffer ] ; then
+ # (2.6.6)
+ # But not for these -- they are also used in normal
+ # operation.
+ echo $XFS_AGE > /proc/sys/fs/xfs/age_buffer
+ echo $XFS_AGE > /proc/sys/fs/xfs/sync_interval
+ elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
+ # (2.6.7 upwards)
+ # And not for these either. These are in centisecs,
+ # not USER_HZ, so we have to use $AGE, not $XFS_AGE.
+ echo $AGE > /proc/sys/fs/xfs/age_buffer_centisecs
+ echo $AGE > /proc/sys/fs/xfs/xfssyncd_centisecs
+ echo 3000 > /proc/sys/fs/xfs/xfsbufd_centisecs
+ fi
+
+ case "$KLEVEL" in
+ "2.4")
+ echo 1 > /proc/sys/vm/laptop_mode
+ echo "30 500 0 0 $AGE $AGE 60 20 0" > /proc/sys/vm/bdflush
+ ;;
+ "2.6")
+ echo 5 > /proc/sys/vm/laptop_mode
+ echo "$AGE" > /proc/sys/vm/dirty_writeback_centisecs
+ echo "$AGE" > /proc/sys/vm/dirty_expire_centisecs
+ echo "$DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
+ echo "$DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
+ ;;
+ esac
+ if [ $DO_REMOUNTS -eq 1 ]; then
+ cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
+ PARSEDOPTS="$(parse_mount_opts "$OPTS")"
+ if [ "$FST" = 'unknown' ]; then
+ FST=$(deduce_fstype $MP)
+ fi
+ case "$FST" in
+ "ext3"|"reiserfs")
+ PARSEDOPTS="$(parse_mount_opts commit "$OPTS")"
+ mount $DEV -t $FST $MP -o remount,$PARSEDOPTS,commit=$MAX_AGE$NOATIME_OPT
+ ;;
+ "xfs")
+ mount $DEV -t $FST $MP -o remount,$OPTS$NOATIME_OPT
+ ;;
+ esac
+ if [ -b $DEV ] ; then
+ blockdev --setra $(($READAHEAD * 2)) $DEV
+ fi
+ done
+ fi
+ if [ $DO_HD -eq 1 ] ; then
+ for THISHD in $HD ; do
+ /sbin/hdparm -S $BATT_HD $THISHD > /dev/null 2>&1
+ /sbin/hdparm -B 1 $THISHD > /dev/null 2>&1
+ done
+ fi
+ if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
+ if [ $CPU_MAXFREQ = 'slowest' ]; then
+ CPU_MAXFREQ=`cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq`
+ fi
+ echo $CPU_MAXFREQ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
+ fi
+ echo "."
+ ;;
+ stop)
+ U_AGE=$((100*$DEF_UPDATE))
+ B_AGE=$((100*$DEF_AGE))
+ echo -n "Stopping laptop_mode"
+ echo 0 > /proc/sys/vm/laptop_mode
+ if [ -f /proc/sys/fs/xfs/age_buffer -a ! -f /proc/sys/fs/xfs/lm_age_buffer ] ; then
+ # These need to be restored, if there are no lm_*.
+ echo $(($XFS_HZ*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer
+ echo $(($XFS_HZ*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/sync_interval
+ elif [ -f /proc/sys/fs/xfs/age_buffer_centisecs ] ; then
+ # These need to be restored as well.
+ echo $((100*$DEF_XFS_AGE_BUFFER)) > /proc/sys/fs/xfs/age_buffer_centisecs
+ echo $((100*$DEF_XFS_SYNC_INTERVAL)) > /proc/sys/fs/xfs/xfssyncd_centisecs
+ echo $((100*$DEF_XFS_BUFD_INTERVAL)) > /proc/sys/fs/xfs/xfsbufd_centisecs
+ fi
+ case "$KLEVEL" in
+ "2.4")
+ echo "30 500 0 0 $U_AGE $B_AGE 60 20 0" > /proc/sys/vm/bdflush
+ ;;
+ "2.6")
+ echo "$U_AGE" > /proc/sys/vm/dirty_writeback_centisecs
+ echo "$B_AGE" > /proc/sys/vm/dirty_expire_centisecs
+ echo "$DEF_DIRTY_RATIO" > /proc/sys/vm/dirty_ratio
+ echo "$DEF_DIRTY_BACKGROUND_RATIO" > /proc/sys/vm/dirty_background_ratio
+ ;;
+ esac
+ if [ $DO_REMOUNTS -eq 1 ] ; then
+ cat /etc/mtab | while read DEV MP FST OPTS DUMP PASS ; do
+ # Reset commit and atime options to defaults.
+ if [ "$FST" = 'unknown' ]; then
+ FST=$(deduce_fstype $MP)
+ fi
+ case "$FST" in
+ "ext3"|"reiserfs")
+ PARSEDOPTS="$(parse_mount_opts_wfstab $DEV commit $OPTS)"
+ PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $PARSEDOPTS)"
+ mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
+ ;;
+ "xfs")
+ PARSEDOPTS="$(parse_yesno_opts_wfstab $DEV atime atime $OPTS)"
+ mount $DEV -t $FST $MP -o remount,$PARSEDOPTS
+ ;;
+ esac
+ if [ -b $DEV ] ; then
+ blockdev --setra 256 $DEV
+ fi
+ done
+ fi
+ if [ $DO_HD -eq 1 ] ; then
+ for THISHD in $HD ; do
+ /sbin/hdparm -S $AC_HD $THISHD > /dev/null 2>&1
+ /sbin/hdparm -B 255 $THISHD > /dev/null 2>&1
+ done
+ fi
+ if [ $DO_CPU -eq 1 -a -e /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq ]; then
+ echo `cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq` > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
+ fi
+ echo "."
+ ;;
+ *)
+ echo "Usage: $0 {start|stop}" >&2
+ exit 1
+ ;;
+
+ esac
+
+ exit 0
+
+
+ACPI integration
+----------------
+
+Dax Kelson submitted this so that the ACPI acpid daemon will
+kick off the laptop_mode script and run hdparm. The part that
+automatically disables laptop mode when the battery is low was
+written by Jan Topinski.
+
+/etc/acpi/events/ac_adapter::
+
+ event=ac_adapter
+ action=/etc/acpi/actions/ac.sh %e
+
+/etc/acpi/events/battery::
+
+ event=battery.*
+ action=/etc/acpi/actions/battery.sh %e
+
+/etc/acpi/actions/ac.sh::
+
+ #!/bin/bash
+
+ # ac on/offline event handler
+
+ status=`awk '/^state: / { print $2 }' /proc/acpi/ac_adapter/$2/state`
+
+ case $status in
+ "on-line")
+ /sbin/laptop_mode stop
+ exit 0
+ ;;
+ "off-line")
+ /sbin/laptop_mode start
+ exit 0
+ ;;
+ esac
+
+
+/etc/acpi/actions/battery.sh::
+
+ #! /bin/bash
+
+ # Automatically disable laptop mode when the battery almost runs out.
+
+ BATT_INFO=/proc/acpi/battery/$2/state
+
+ if [[ -f /proc/sys/vm/laptop_mode ]]
+ then
+ LM=`cat /proc/sys/vm/laptop_mode`
+ if [[ $LM -gt 0 ]]
+ then
+ if [[ -f $BATT_INFO ]]
+ then
+ # Source the config file only now that we know we need
+ if [ -f /etc/default/laptop-mode ] ; then
+ # Debian
+ . /etc/default/laptop-mode
+ elif [ -f /etc/sysconfig/laptop-mode ] ; then
+ # Others
+ . /etc/sysconfig/laptop-mode
+ fi
+ MINIMUM_BATTERY_MINUTES=${MINIMUM_BATTERY_MINUTES:-'10'}
+
+ ACTION="`cat $BATT_INFO | grep charging | cut -c 26-`"
+ if [[ "$ACTION" == "discharging" ]]
+ then
+ PRESENT_RATE=`cat $BATT_INFO | grep "present rate:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
+ REMAINING=`cat $BATT_INFO | grep "remaining capacity:" | sed "s/.* \([0-9][0-9]* \).*/\1/" `
+ # Only estimate the remaining time while discharging at a non-zero rate.
+ if ((PRESENT_RATE > 0 && REMAINING * 60 / PRESENT_RATE < MINIMUM_BATTERY_MINUTES))
+ then
+ /sbin/laptop_mode stop
+ fi
+ fi
+ else
+ logger -p daemon.warning "You are using laptop mode and your battery interface $BATT_INFO is missing. This may lead to loss of data when the battery runs out. Check kernel ACPI support and /proc/acpi/battery folder, and edit /etc/acpi/battery.sh to set BATT_INFO to the correct path."
+ fi
+ fi
+ fi
+
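+The low-battery check in battery.sh boils down to one estimate: remaining
+capacity times 60, divided by the present discharge rate, gives the minutes
+left. The sketch below isolates that arithmetic. It assumes both values use
+matching units (for example mAh and mA) and adds a guard against a zero rate;
+``battery_minutes_left`` is a hypothetical helper, not part of the script
+above.
+

```shell
# Sketch: estimate remaining battery minutes from a remaining capacity and
# a present discharge rate given in matching units (e.g. mAh and mA).
# battery_minutes_left is a hypothetical helper, not part of battery.sh.
battery_minutes_left () {
    remaining="$1"
    rate="$2"
    if [ "$rate" -le 0 ]; then
        # Avoid a division by zero when no discharge rate is reported.
        return 1
    fi
    echo $((remaining * 60 / rate))
}

battery_minutes_left 1500 3000
```

+With 1500 units remaining at a rate of 3000 per hour this prints 30, which is
+still above the default MINIMUM_BATTERY_MINUTES of 10.
+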
+
+Monitoring tool
+---------------
+
+Bartek Kania submitted this; it can be used to measure how much time your disk
+spends spun up/down. See tools/laptop/dslm/dslm.c
diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst
new file mode 100644
index 0000000..ce9b146
--- /dev/null
+++ b/Documentation/admin-guide/laptops/lg-laptop.rst
@@ -0,0 +1,84 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+
+LG Gram laptop extra features
+=============================
+
+By Matan Ziv-Av <matan@svgalib.org>
+
+
+Hotkeys
+-------
+
+The following FN keys are ignored by the kernel without this driver:
+
+- FN-F1 (LG control panel) - Generates F15
+- FN-F5 (Touchpad toggle) - Generates F13
+- FN-F6 (Airplane mode) - Generates RFKILL
+- FN-F8 (Keyboard backlight) - Generates F16.
+ This key also changes keyboard backlight mode.
+- FN-F9 (Reader mode) - Generates F14
+
+The rest of the FN keys work without a need for a special driver.
+
+
+Reader mode
+-----------
+
+Writing 0/1 to /sys/devices/platform/lg-laptop/reader_mode disables/enables
+reader mode. In this mode the screen colors change (blue color reduced),
+and the reader mode indicator LED (on F9 key) turns on.
+
+
+FN Lock
+-------
+
+Writing 0/1 to /sys/devices/platform/lg-laptop/fn_lock disables/enables
+FN lock.
+
+
+Battery care limit
+------------------
+
+Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
+sets the maximum capacity to charge the battery. Limiting the charge
+reduces battery capacity loss over time.
+
+This value is reset to 100 when the kernel boots.
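+
+Because the setting does not survive a reboot, it may be convenient to
+reapply it from a boot script. The following is only a sketch: the function
+name ``set_battery_care_limit`` and its optional second argument (an
+alternate target file, handy for a dry run) are hypothetical, and writing to
+the real sysfs file normally requires root.
+

```shell
# Sketch: set the LG battery care limit, accepting only the two values the
# documentation lists. set_battery_care_limit and the optional second
# argument (an alternate target file for dry runs) are hypothetical.
set_battery_care_limit () {
    limit="$1"
    target="${2:-/sys/devices/platform/lg-laptop/battery_care_limit}"
    case "$limit" in
        80|100)
            echo "$limit" > "$target"
            ;;
        *)
            echo "battery care limit must be 80 or 100" >&2
            return 1
            ;;
    esac
}
```

+Calling ``set_battery_care_limit 80`` as root writes 80 to the sysfs file;
+any other value than 80 or 100 is rejected before anything is written.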
+
+
+Fan mode
+--------
+
+Writing 1/0 to /sys/devices/platform/lg-laptop/fan_mode disables/enables
+the fan silent mode.
+
+
+USB charge
+----------
+
+Writing 0/1 to /sys/devices/platform/lg-laptop/usb_charge disables/enables
+charging another device from the USB port while the device is turned off.
+
+This value is reset to 0 when the kernel boots.
+
+
+LEDs
+~~~~
+
+There are two LED devices supported by the driver:
+
+Keyboard backlight
+------------------
+
+An LED device named kbd_led controls the keyboard backlight. There are three
+lighting levels: off (0), low (127) and high (255).
+
+The keyboard backlight is also controlled by the key combination FN-F8
+which cycles through those levels.
+
+
+Touchpad indicator LED
+----------------------
+
+On the F5 key. Controlled by the LED device named tpad_led.
diff --git a/Documentation/admin-guide/laptops/sony-laptop.rst b/Documentation/admin-guide/laptops/sony-laptop.rst
new file mode 100644
index 0000000..9edcc7f
--- /dev/null
+++ b/Documentation/admin-guide/laptops/sony-laptop.rst
@@ -0,0 +1,174 @@
+=========================================
+Sony Notebook Control Driver (SNC) Readme
+=========================================
+
+ - Copyright (C) 2004- 2005 Stelian Pop <stelian@popies.net>
+ - Copyright (C) 2007 Mattia Dongili <malattia@linux.it>
+
+This mini-driver drives the SNC and SPIC device present in the ACPI BIOS of the
+Sony Vaio laptops. This driver mixes both devices' functions under the same
+(hopefully consistent) interface. This also means that the sonypi driver is
+obsoleted by sony-laptop now.
+
+Fn keys (hotkeys):
+------------------
+
+Some models report hotkeys through the SNC or SPIC devices; such events are
+reported both through the ACPI subsystem as acpi events and through the INPUT
+subsystem. See the logs of /proc/bus/input/devices to find out what those
+events are and which input devices are created by the driver.
+Additionally, loading the driver with the debug option will report all events
+in the kernel log.
+
+The "scancodes" passed to the input system (that can be remapped with udev)
+are indexes to the table "sony_laptop_input_keycode_map" in the sony-laptop.c
+module. For example the "FN/E" key combination (EJECTCD on some models)
+generates the scancode 20 (0x14).
+
+Backlight control:
+------------------
+If your laptop model supports it, you will find sysfs files in the
+/sys/class/backlight/sony/
+directory. You will be able to query and set the current screen
+brightness:
+
+ ====================== =========================================
+ brightness get/set screen brightness (an integer
+ between 0 and 7)
+ actual_brightness reading from this file will query the HW
+ to get real brightness value
+ max_brightness the maximum brightness value
+ ====================== =========================================
+
+
+Platform specific:
+------------------
+Loading the sony-laptop module will create a
+/sys/devices/platform/sony-laptop/
+directory populated with some files.
+
+You then read/write integer values from/to those files by using
+standard UNIX tools.
+
+The files are:
+
+ ====================== ==========================================
+ brightness_default screen brightness which will be set
+ when the laptop is rebooted
+ cdpower power on/off the internal CD drive
+ audiopower power on/off the internal sound card
+ lanpower power on/off the internal ethernet card
+ (only in debug mode)
+ bluetoothpower power on/off the internal bluetooth device
+ fanspeed get/set the fan speed
+ ====================== ==========================================
+
+Note that some files may be missing if they are not supported
+by your particular laptop model.
+
+Example usage::
+
+ # echo "1" > /sys/devices/platform/sony-laptop/brightness_default
+
+sets the lowest screen brightness for the next and later reboots
+
+::
+
+ # echo "8" > /sys/devices/platform/sony-laptop/brightness_default
+
+sets the highest screen brightness for the next and later reboots
+
+::
+
+ # cat /sys/devices/platform/sony-laptop/brightness_default
+
+retrieves the value
+
+::
+
+ # echo "0" > /sys/devices/platform/sony-laptop/audiopower
+
+powers off the sound card
+
+::
+
+ # echo "1" > /sys/devices/platform/sony-laptop/audiopower
+
+powers on the sound card.
+
+
+RFkill control:
+---------------
+More recent Vaio models expose a consistent set of ACPI methods to
+control radio frequency emitting devices. If you are a lucky owner of
+such a laptop you will find the necessary rfkill devices under
+/sys/class/rfkill. Check those starting with sony-* in::
+
+ # grep . /sys/class/rfkill/*/{state,name}
+
+
+Development:
+------------
+
+If you want to help with the development of this driver (and
+you are not afraid of any side effects doing strange things with
+your ACPI BIOS could have on your laptop), load the driver and
+pass the option 'debug=1'.
+
+REPEAT:
+ **DON'T DO THIS IF YOU DON'T LIKE RISKY BUSINESS.**
+
+In your kernel logs you will find the list of all ACPI methods
+the SNC device has on your laptop.
+
+* For new models you will see a long list of meaningless method names,
+ reading the DSDT table source should reveal that:
+
+(1) the SNC device uses an internal capability lookup table
+(2) SN00 is used to find values in the lookup table
+(3) SN06 and SN07 are used to call into the real methods based on
+ offsets you can obtain iterating the table using SN00
+(4) SN02 is used to enable events.
+
+Some values in the capability lookup table are more or less known, see
+the code for all sony_call_snc_handle calls, others are more obscure.
+
+* For old models you can see the GCDP/GCDP methods used to power on/off
+ the CD drive, but there are others and they are usually different from
+ model to model.
+
+**I HAVE NO IDEA WHAT THOSE METHODS DO.**
+
+The sony-laptop driver creates, for some of those methods (the most
+current ones found on several Vaio models), an entry under
+/sys/devices/platform/sony-laptop, just like the 'cdpower' one.
+You can create other entries corresponding to your own laptop methods by
+further editing the source (see the 'sony_nc_values' table, and add a new
+entry to this table with your get/set method names using the
+SNC_HANDLE_NAMES macro).
+
+Your mission, should you accept it, is to try finding out what
+those entries are for, by reading/writing random values from/to those
+files and finding out what the impact on your laptop is.
+
+Should you find anything interesting, please report it back to me,
+I will not disavow all knowledge of your actions :)
+
+See also http://www.linux.it/~malattia/wiki/index.php/Sony_drivers for other
+useful info.
+
+Bugs/Limitations:
+-----------------
+
+* This driver is not based on official documentation from Sony
+ (because there is none), so there is no guarantee this driver
+ will work at all, or do the right thing. Although this hasn't
+ happened to me, this driver could do very bad things to your
+ laptop, including permanent damage.
+
+* The sony-laptop and sonypi drivers do not interact at all. In the
+ future, sonypi will be removed and replaced by sony-laptop.
+
+* spicctrl, which is the userspace tool used to communicate with the
+ sonypi driver (through /dev/sonypi) is deprecated as well since all
+ its features are now available under the sysfs tree via sony-laptop.
diff --git a/Documentation/admin-guide/laptops/sonypi.rst b/Documentation/admin-guide/laptops/sonypi.rst
new file mode 100644
index 0000000..c6eaaf4
--- /dev/null
+++ b/Documentation/admin-guide/laptops/sonypi.rst
@@ -0,0 +1,158 @@
+==================================================
+Sony Programmable I/O Control Device Driver Readme
+==================================================
+
+ - Copyright (C) 2001-2004 Stelian Pop <stelian@popies.net>
+ - Copyright (C) 2001-2002 Alcôve <www.alcove.com>
+ - Copyright (C) 2001 Michael Ashley <m.ashley@unsw.edu.au>
+ - Copyright (C) 2001 Junichi Morita <jun1m@mars.dti.ne.jp>
+ - Copyright (C) 2000 Takaya Kinjo <t-kinjo@tc4.so-net.ne.jp>
+ - Copyright (C) 2000 Andrew Tridgell <tridge@samba.org>
+
+This driver enables access to the Sony Programmable I/O Control Device which
+can be found in many Sony Vaio laptops. Some newer Sony laptops (seems to be
+limited to new FX series laptops, at least the FX501 and the FX702) lack a
+sonypi device and are not supported at all by this driver.
+
+It will give access (through a user space utility) to some events those laptops
+generate, like:
+
+ - jogdial events (the small wheel on the side of Vaios)
+ - capture button events (only on Vaio Picturebook series)
+ - Fn keys
+ - bluetooth button (only on C1VR model)
+ - programmable keys, back, help, zoom, thumbphrase buttons, etc.
+ (when available)
+
+Those events (see linux/sonypi.h) can be polled using the character device node
+/dev/sonypi (major 10, minor auto allocated or specified as an option).
+A simple daemon which translates the jogdial movements into mouse wheel events
+can be downloaded at: <http://popies.net/sonypi/>
+
+Another option to intercept the events is to get them directly through the
+input layer.
+
+This driver also supports some ioctl commands for setting the LCD screen
+brightness and querying the batteries charge information (some more
+commands may be added in the future).
+
+This driver can also be used to set the camera controls on Picturebook series
+(brightness, contrast etc), and is used by the video4linux driver for the
+Motion Eye camera.
+
+Please note that this driver was created by reverse engineering the Windows
+driver and the ACPI BIOS, because Sony doesn't agree to release any programming
+specs for its laptops. If someone convinces them to do so, drop me a note.
+
+Driver options:
+---------------
+
+Several options can be passed to the sonypi driver using the standard
+module argument syntax (<param>=<value> when passing the option to the
+module or sonypi.<param>=<value> on the kernel boot line when sonypi is
+statically linked into the kernel). Those options are:
+
+ =============== =======================================================
+ minor: minor number of the misc device /dev/sonypi,
+ default is -1 (automatic allocation, see /proc/misc
+ or kernel logs)
+
+ camera: if you have a PictureBook series Vaio (with the
+ integrated MotionEye camera), set this parameter to 1
+ in order to let the driver access the camera
+
+ fnkeyinit: on some Vaios (C1VE, C1VR etc), the Fn key events don't
+ get enabled unless you set this parameter to 1.
+ Do not use this option unless it's actually necessary,
+ some Vaio models don't deal well with this option.
+ This option is available only if the kernel is
+ compiled without ACPI support (since it conflicts
+ with it and it shouldn't be required anyway if
+ ACPI is already enabled).
+
+ verbose: set to 1 to print unknown events received from the
+ sonypi device.
+ set to 2 to print all events received from the
+ sonypi device.
+
+ compat: uses some compatibility code for enabling the sonypi
+ events. If the driver worked for you in the past
+ (prior to version 1.5) and does not work anymore,
+ add this option and report to the author.
+
+ mask: event mask telling the driver what events will be
+ reported to the user. This parameter is required for
+ some Vaio models where the hardware reuses values
+ used in other Vaio models (like the FX series which does
+ not have a jogdial but reuses the jogdial events for
+ programmable keys events). The default event mask is
+ set to 0xffffffff, meaning that all possible events
+ will be tried. You can use the following bits to
+ construct your own event mask (from
+ drivers/char/sonypi.h)::
+
+ SONYPI_JOGGER_MASK 0x0001
+ SONYPI_CAPTURE_MASK 0x0002
+ SONYPI_FNKEY_MASK 0x0004
+ SONYPI_BLUETOOTH_MASK 0x0008
+ SONYPI_PKEY_MASK 0x0010
+ SONYPI_BACK_MASK 0x0020
+ SONYPI_HELP_MASK 0x0040
+ SONYPI_LID_MASK 0x0080
+ SONYPI_ZOOM_MASK 0x0100
+ SONYPI_THUMBPHRASE_MASK 0x0200
+ SONYPI_MEYE_MASK 0x0400
+ SONYPI_MEMORYSTICK_MASK 0x0800
+ SONYPI_BATTERY_MASK 0x1000
+ SONYPI_WIRELESS_MASK 0x2000
+
+ useinput: if set (which is the default) two input devices are
+ created, one which interprets the jogdial events as
+ mouse events, the other one which acts like a
+ keyboard reporting the pressing of the special keys.
+ =============== =======================================================
+
+Module use:
+-----------
+
+In order to automatically load the sonypi module on use, you can put these
+lines in a configuration file in /etc/modprobe.d/::
+
+ alias char-major-10-250 sonypi
+ options sonypi minor=250
+
+This supposes the use of minor 250 for the sonypi device::
+
+ # mknod /dev/sonypi c 10 250
+
+Bugs:
+-----
+
+ - several users reported that this driver disables the BIOS-managed
+ Fn-keys which put the laptop in sleeping state, or switch the
+ external monitor on/off. There is no workaround yet, since this
+ driver disables all APM management for those keys, by enabling the
+ ACPI management (and the ACPI core stuff is not complete yet). If
+ you have one of those laptops with working Fn keys and want to
+ continue to use them, don't use this driver.
+
+ - some users reported that the laptop speed is lower (dhrystone
+ tested) when using the driver with the fnkeyinit parameter. I cannot
+ reproduce it on my laptop and not all users have this problem.
+ This happens because the fnkeyinit parameter enables the ACPI
+ mode (but without additional ACPI control, like processor
+ speed handling etc). Use ACPI instead of APM if it works on your
+ laptop.
+
+ - sonypi lacks the ability to distinguish between certain key
+ events on some models.
+
+ - some models with the nvidia card (geforce go 6200 tc) use a
+ different way to adjust the backlighting of the screen. There
+ is a userspace utility to adjust the brightness on those models,
+ which can be downloaded from
+ http://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2
+
+ - since all development was done by reverse engineering, there is
+ *absolutely no guarantee* that this driver will not crash your
+ laptop. Permanently.
diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
new file mode 100644
index 0000000..822907d
--- /dev/null
+++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
@@ -0,0 +1,1585 @@
+===========================
+ThinkPad ACPI Extras Driver
+===========================
+
+Version 0.25
+
+October 16th, 2013
+
+- Borislav Deianov <borislav@users.sf.net>
+- Henrique de Moraes Holschuh <hmh@hmh.eng.br>
+
+http://ibm-acpi.sf.net/
+
+This is a Linux driver for the IBM and Lenovo ThinkPad laptops. It
+supports various features of these laptops which are accessible
+through the ACPI and ACPI EC framework, but not otherwise fully
+supported by the generic Linux ACPI drivers.
+
+This driver used to be named ibm-acpi until kernel 2.6.21 and release
+0.13-20070314. It used to be in the drivers/acpi tree, but it was
+moved to the drivers/misc tree and renamed to thinkpad-acpi for kernel
+2.6.22, and release 0.14. It was moved to drivers/platform/x86 for
+kernel 2.6.29 and release 0.22.
+
+The driver is named "thinkpad-acpi". In some places, like module
+names and log messages, "thinkpad_acpi" is used because of userspace
+issues.
+
+"tpacpi" is used as a shorthand where "thinkpad-acpi" would be too
+long due to length limitations on some Linux kernel versions.
+
+Status
+------
+
+The features currently supported are the following (see below for
+detailed description):
+
+ - Fn key combinations
+ - Bluetooth enable and disable
+ - video output switching, expansion control
+ - ThinkLight on and off
+ - CMOS/UCMS control
+ - LED control
+ - ACPI sounds
+ - temperature sensors
+ - Experimental: embedded controller register dump
+ - LCD brightness control
+ - Volume control
+ - Fan control and monitoring: fan speed, fan enable/disable
+ - WAN enable and disable
+ - UWB enable and disable
+ - LCD Shadow (PrivacyGuard) enable and disable
+
+A compatibility table by model and feature is maintained on the web
+site, http://ibm-acpi.sf.net/. I appreciate any success or failure
+reports, especially if they add to or correct the compatibility table.
+Please include the following information in your report:
+
+ - ThinkPad model name
+ - a copy of your ACPI tables, using the "acpidump" utility
+ - a copy of the output of dmidecode, with serial numbers
+ and UUIDs masked off
+ - which driver features work and which don't
+ - the observed behavior of non-working features
+
+Any other comments or patches are also more than welcome.
+
+
+Installation
+------------
+
+If you are compiling this driver as included in the Linux kernel
+sources, look for the CONFIG_THINKPAD_ACPI Kconfig option.
+It is located on the menu path: "Device Drivers" -> "X86 Platform
+Specific Device Drivers" -> "ThinkPad ACPI Laptop Extras".
+
+
+Features
+--------
+
+The driver exports two different interfaces to userspace, which can be
+used to access the features it provides. One is a legacy procfs-based
+interface, which will be removed at some time in the future. The other
+is a new sysfs-based interface which is not complete yet.
+
+The procfs interface creates the /proc/acpi/ibm directory. There is a
+file under that directory for each feature it supports. The procfs
+interface is mostly frozen, and will change very little if at all: it
+will not be extended to add any new functionality in the driver, instead
+all new functionality will be implemented on the sysfs interface.
+
+The sysfs interface tries to blend in the generic Linux sysfs subsystems
+and classes as much as possible. Since some of these subsystems are not
+yet ready or stabilized, it is expected that this interface will change,
+and any and all userspace programs must deal with it.
+
+
+Notes about the sysfs interface
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Unlike what was done with the procfs interface, correctness when talking
+to the sysfs interfaces will be enforced, as will correctness in the
+thinkpad-acpi's implementation of sysfs interfaces.
+
+Also, any bugs in the thinkpad-acpi sysfs driver code or in the
+thinkpad-acpi's implementation of the sysfs interfaces will be fixed for
+maximum correctness, even if that means changing an interface in
+non-compatible ways. As these interfaces mature both in the kernel and
+in thinkpad-acpi, such changes should become quite rare.
+
+Applications interfacing to the thinkpad-acpi sysfs interfaces must
+follow all sysfs guidelines and correctly process all errors (the sysfs
+interface makes extensive use of errors). File descriptors and open /
+close operations to the sysfs inodes must also be properly implemented.
+
+The version of thinkpad-acpi's sysfs interface is exported by the driver
+as a driver attribute (see below).
+
+Sysfs driver attributes are on the driver's sysfs attribute space,
+for 2.6.23+ this is /sys/bus/platform/drivers/thinkpad_acpi/ and
+/sys/bus/platform/drivers/thinkpad_hwmon/
+
+Sysfs device attributes are on the thinkpad_acpi device sysfs attribute
+space, for 2.6.23+ this is /sys/devices/platform/thinkpad_acpi/.
+
+Sysfs device attributes for the sensors and fan are on the
+thinkpad_hwmon device's sysfs attribute space, but you should locate it
+looking for a hwmon device with the name attribute of "thinkpad", or
+better yet, through libsensors. For 4.14+ sysfs attributes were moved to the
+hwmon device (/sys/bus/platform/devices/thinkpad_hwmon/hwmon/hwmon? or
+/sys/class/hwmon/hwmon?).
+
+Driver version
+--------------
+
+procfs: /proc/acpi/ibm/driver
+
+sysfs driver attribute: version
+
+The driver name and version. No commands can be written to this file.
+
+
+Sysfs interface version
+-----------------------
+
+sysfs driver attribute: interface_version
+
+Version of the thinkpad-acpi sysfs interface, as an unsigned long
+(output in hex format: 0xAAAABBCC), where:
+
+ AAAA
+ - major revision
+ BB
+ - minor revision
+ CC
+ - bugfix revision
+
+The sysfs interface version changelog for the driver can be found at the
+end of this document. Changes to the sysfs interface done by the kernel
+subsystems are not documented here, nor are they tracked by this
+attribute.
+
+Changes to the thinkpad-acpi sysfs interface are only considered
+non-experimental when they are submitted to Linux mainline, at which
+point the changes in this interface are documented and interface_version
+may be updated. If you are using any thinkpad-acpi features not yet
+sent to mainline for merging, you do so at your own risk: these features
+may disappear, or be implemented in a different and incompatible way by
+the time they are merged in Linux mainline.
+
+Changes that are backwards-compatible by nature (e.g. the addition of
+attributes that do not change the way the other attributes work) do not
+always warrant an update of interface_version. Therefore, one must
+expect that an attribute might not be there, and deal with it properly
+(an attribute not being there *is* a valid way to make it clear that a
+feature is not available in sysfs).
+
+
+Hot keys
+--------
+
+procfs: /proc/acpi/ibm/hotkey
+
+sysfs device attribute: hotkey_*
+
+In a ThinkPad, the ACPI HKEY handler is responsible for communicating
+some important events and also keyboard hot key presses to the operating
+system. Enabling the hotkey functionality of thinkpad-acpi signals the
+firmware that such a driver is present, and modifies how the ThinkPad
+firmware will behave in many situations.
+
+The driver enables the HKEY ("hot key") event reporting automatically
+when loaded, and disables it when it is removed.
+
+The driver will report HKEY events in the following format::
+
+ ibm/hotkey HKEY 00000080 0000xxxx
+
+Some of these events refer to hot key presses, but not all of them.
+
+The driver will generate events over the input layer for hot keys and
+radio switches, and over the ACPI netlink layer for other events. The
+input layer support accepts the standard IOCTLs to remap the keycodes
+assigned to each hot key.
+
+The hot key bit mask allows some control over which hot keys generate
+events. If a key is "masked" (bit set to 0 in the mask), the firmware
+will handle it. If it is "unmasked", it signals the firmware that
+thinkpad-acpi would prefer to handle it, if the firmware would be so
+kind to allow it (and it often doesn't!).
+
+Not all bits in the mask can be modified. Not all bits that can be
+modified do anything. Not all hot keys can be individually controlled
+by the mask. Some models do not support the mask at all. The behaviour
+of the mask is, therefore, highly dependent on the ThinkPad model.
+
+The driver will filter out any masked hotkeys, so even if the firmware
+doesn't allow disabling a specific hotkey, the driver will not report
+events for masked hotkeys.
+
+Note that unmasking some keys prevents their default behavior. For
+example, if Fn+F5 is unmasked, that key will no longer enable/disable
+Bluetooth by itself in firmware.
+
+Note also that not all Fn key combinations are supported through ACPI
+depending on the ThinkPad model and firmware version. On those
+ThinkPads, it is still possible to support some extra hotkeys by
+polling the "CMOS NVRAM" at least 10 times per second. The driver
+attempts to enable this functionality automatically when required.
+
+procfs notes
+^^^^^^^^^^^^
+
+The following commands can be written to the /proc/acpi/ibm/hotkey file::
+
+ echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys
+ echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys
+ ... any other 8-hex-digit mask ...
+ echo reset > /proc/acpi/ibm/hotkey -- restore the recommended mask
+
+The following commands have been deprecated and will cause the kernel
+to log a warning::
+
+ echo enable > /proc/acpi/ibm/hotkey -- does nothing
+ echo disable > /proc/acpi/ibm/hotkey -- returns an error
+
+The procfs interface does not support NVRAM polling control. So as to
+maintain maximum bug-to-bug compatibility, it does not report any masks,
+nor does it allow one to manipulate the hot key mask when the firmware
+does not support masks at all, even if NVRAM polling is in use.
+
+sysfs notes
+^^^^^^^^^^^
+
+ hotkey_bios_enabled:
+ DEPRECATED, WILL BE REMOVED SOON.
+
+ Returns 0.
+
+ hotkey_bios_mask:
+ DEPRECATED, DON'T USE, WILL BE REMOVED IN THE FUTURE.
+
+ Returns the hot keys mask when thinkpad-acpi was loaded.
+ Upon module unload, the hot keys mask will be restored
+ to this value. This is always 0x80c, because those are
+ the hotkeys that were supported by ancient firmware
+ without mask support.
+
+ hotkey_enable:
+ DEPRECATED, WILL BE REMOVED SOON.
+
+ 0: returns -EPERM
+ 1: does nothing
+
+ hotkey_mask:
+ bit mask to enable reporting (and depending on
+ the firmware, ACPI event generation) for each hot key
+ (see above). Returns the current status of the hot keys
+ mask, and allows one to modify it.
+
+ hotkey_all_mask:
+ bit mask that should enable event reporting for all
+ supported hot keys, when echoed to hotkey_mask above.
+ Unless you know which events need to be handled
+ passively (because the firmware *will* handle them
+ anyway), do *not* use hotkey_all_mask. Use
+ hotkey_recommended_mask, instead. You have been warned.
+
+ hotkey_recommended_mask:
+ bit mask that should enable event reporting for all
+ supported hot keys, except those which are always
+ handled by the firmware anyway. Echo it to
+ hotkey_mask above, to use. This is the default mask
+ used by the driver.
+
+ hotkey_source_mask:
+ bit mask that selects which hot keys the driver will
+ poll the NVRAM for. This is auto-detected by the driver
+ based on the capabilities reported by the ACPI firmware,
+ but it can be overridden at runtime.
+
+ Hot keys whose bits are set in hotkey_source_mask are
+ polled for in NVRAM, and reported as hotkey events if
+ enabled in hotkey_mask. Only a few hot keys are
+ available through CMOS NVRAM polling.
+
+ Warning: when in NVRAM mode, the volume up/down/mute
+ keys are synthesized according to changes in the mixer,
+ which uses a single volume up or volume down hotkey
+ press to unmute, as per the ThinkPad volume mixer user
+ interface. When in ACPI event mode, volume up/down/mute
+ events are reported by the firmware and can behave
+ differently (and that behaviour changes with firmware
+ version -- not just with firmware models -- as well as
+ OSI(Linux) state).
+
+ hotkey_poll_freq:
+ frequency in Hz for hot key polling. It must be between
+ 0 and 25 Hz. Polling is only carried out when strictly
+ needed.
+
+ Setting hotkey_poll_freq to zero disables polling, and
+ will cause hot key presses that require NVRAM polling
+ to never be reported.
+
+ Setting hotkey_poll_freq too low may cause repeated
+ pressings of the same hot key to be misreported as a
+ single key press, or to not even be detected at all.
+ The recommended polling frequency is 10Hz.
+
+ hotkey_radio_sw:
+ If the ThinkPad has a hardware radio switch, this
+ attribute will read 0 if the switch is in the "radios
+ disabled" position, and 1 if the switch is in the
+ "radios enabled" position.
+
+ This attribute has poll()/select() support.
+
+ hotkey_tablet_mode:
+ If the ThinkPad has tablet capabilities, this attribute
+ will read 0 if the ThinkPad is in normal mode, and
+ 1 if the ThinkPad is in tablet mode.
+
+ This attribute has poll()/select() support.
+
+ wakeup_reason:
+ Set to 1 if the system is waking up because the user
+ requested a bay ejection. Set to 2 if the system is
+ waking up because the user requested the system to
+ undock. Set to zero for normal wake-ups or wake-ups
+ due to unknown reasons.
+
+ This attribute has poll()/select() support.
+
+ wakeup_hotunplug_complete:
+ Set to 1 if the system was woken up because of an
+ undock or bay ejection request, and that request
+ was successfully completed. At this point, it might
+ be useful to send the system back to sleep, at the
+ user's choice. Refer to HKEY events 0x4003 and
+ 0x3003, below.
+
+ This attribute has poll()/select() support.
+
+input layer notes
+^^^^^^^^^^^^^^^^^
+
+A hot key is mapped to a single input layer EV_KEY event, possibly
+followed by an EV_MSC MSC_SCAN event that shall contain that key's scan
+code. An EV_SYN event will always be generated to mark the end of the
+event block.
+
+Do not use the EV_MSC MSC_SCAN events to process keys. They are to be
+used as a helper to remap keys, only. They are particularly useful when
+remapping KEY_UNKNOWN keys.
+
+The events are available in an input device, with the following id:
+
+ ============== ==============================
+ Bus BUS_HOST
+ vendor 0x1014 (PCI_VENDOR_ID_IBM) or
+ 0x17aa (PCI_VENDOR_ID_LENOVO)
+ product 0x5054 ("TP")
+ version 0x4101
+ ============== ==============================
+
+The version will have its LSB incremented if the keymap changes in a
+backwards-compatible way. The MSB shall always be 0x41 for this input
+device. If the MSB is not 0x41, do not use the device as described in
+this section, as it is either something else (e.g. another input device
+exported by a thinkpad driver, such as HDAPS) or its functionality has
+been changed in a non-backwards compatible way.
+
+Adding other event types for other functionalities shall be considered a
+backwards-compatible change for this input device.
+
+Thinkpad-acpi Hot Key event map (version 0x4101):
+
+======= ======= ============== ==============================================
+ACPI Scan
+event code Key Notes
+======= ======= ============== ==============================================
+0x1001 0x00 FN+F1 -
+
+0x1002 0x01 FN+F2 IBM: battery (rare)
+ Lenovo: Screen lock
+
+0x1003 0x02 FN+F3 Many IBM models always report
+ this hot key, even with hot keys
+ disabled or with Fn+F3 masked
+ off
+ IBM: screen lock, often turns
+ off the ThinkLight as side-effect
+ Lenovo: battery
+
+0x1004 0x03 FN+F4 Sleep button (ACPI sleep button
+ semantics, i.e. sleep-to-RAM).
+ It always generates some kind
+ of event, either the hot key
+ event or an ACPI sleep button
+ event. The firmware may
+ refuse to generate further FN+F4
+ key presses until a S3 or S4 ACPI
+ sleep cycle is performed or some
+ time passes.
+
+0x1005 0x04 FN+F5 Radio. Enables/disables
+ the internal Bluetooth hardware
+ and W-WAN card if left in control
+ of the firmware. Does not affect
+ the WLAN card.
+ Should be used to turn on/off all
+ radios (Bluetooth+W-WAN+WLAN),
+ really.
+
+0x1006 0x05 FN+F6 -
+
+0x1007 0x06 FN+F7 Video output cycle.
+ Do you feel lucky today?
+
+0x1008 0x07 FN+F8 IBM: toggle screen expand
+ Lenovo: configure UltraNav,
+ or toggle screen expand
+
+0x1009 0x08 FN+F9 -
+
+... ... ... ...
+
+0x100B 0x0A FN+F11 -
+
+0x100C 0x0B FN+F12 Sleep to disk. You are always
+ supposed to handle it yourself,
+ either through the ACPI event,
+ or through a hotkey event.
+ The firmware may refuse to
+ generate further FN+F12 key
+ press events until a S3 or S4
+ ACPI sleep cycle is performed,
+ or some time passes.
+
+0x100D 0x0C FN+BACKSPACE -
+0x100E 0x0D FN+INSERT -
+0x100F 0x0E FN+DELETE -
+
+0x1010 0x0F FN+HOME Brightness up. This key is
+ always handled by the firmware
+ in IBM ThinkPads, even when
+ unmasked. Just leave it alone.
+ For Lenovo ThinkPads with a new
+ BIOS, it has to be handled either
+ by the ACPI OSI, or by userspace.
+ The driver does the right thing,
+ never mess with this.
+0x1011 0x10 FN+END Brightness down. See brightness
+ up for details.
+
+0x1012 0x11 FN+PGUP ThinkLight toggle. This key is
+ always handled by the firmware,
+ even when unmasked.
+
+0x1013 0x12 FN+PGDOWN -
+
+0x1014 0x13 FN+SPACE Zoom key
+
+0x1015 0x14 VOLUME UP Internal mixer volume up. This
+ key is always handled by the
+ firmware, even when unmasked.
+ NOTE: Lenovo seems to be changing
+ this.
+0x1016 0x15 VOLUME DOWN Internal mixer volume down. This
+ key is always handled by the
+ firmware, even when unmasked.
+ NOTE: Lenovo seems to be changing
+ this.
+0x1017 0x16 MUTE Mute internal mixer. This
+ key is always handled by the
+ firmware, even when unmasked.
+
+0x1018 0x17 THINKPAD ThinkPad/Access IBM/Lenovo key
+
+0x1019 0x18 unknown
+
+... ... ...
+
+0x1020 0x1F unknown
+======= ======= ============== ==============================================
+
+The ThinkPad firmware does not allow one to differentiate when most hot
+keys are pressed or released (either that, or we don't know how to, yet).
+For these keys, the driver generates a set of events for a key press and
+immediately issues the same set of events for a key release. It is
+unknown by the driver if the ThinkPad firmware triggered these events on
+hot key press or release, but the firmware will do it for either one, not
+both.
+
+If a key is mapped to KEY_RESERVED, it generates no input events at all.
+If a key is mapped to KEY_UNKNOWN, it generates an input event that
+includes a scan code. If a key is mapped to anything else, it will
+generate input device EV_KEY events.
+
+In addition to the EV_KEY events, thinkpad-acpi may also issue EV_SW
+events for switches:
+
+============== ==============================================
+SW_RFKILL_ALL T60 and later hardware rfkill rocker switch
+SW_TABLET_MODE Tablet ThinkPads HKEY events 0x5009 and 0x500A
+============== ==============================================
+
+Non hotkey ACPI HKEY event map
+------------------------------
+
+Events that are never propagated by the driver:
+
+====== ==================================================
+0x2304 System is waking up from suspend to undock
+0x2305 System is waking up from suspend to eject bay
+0x2404 System is waking up from hibernation to undock
+0x2405 System is waking up from hibernation to eject bay
+0x5001 Lid closed
+0x5002 Lid opened
+0x5009 Tablet swivel: switched to tablet mode
+0x500A Tablet swivel: switched to normal mode
+0x5010 Brightness level changed/control event
+0x6000 KEYBOARD: Numlock key pressed
+0x6005 KEYBOARD: Fn key pressed (TO BE VERIFIED)
+0x7000 Radio Switch may have changed state
+====== ==================================================
+
+
+Events that are propagated by the driver to userspace:
+
+====== =====================================================
+0x2313 ALARM: System is waking up from suspend because
+ the battery is nearly empty
+0x2413 ALARM: System is waking up from hibernation because
+ the battery is nearly empty
+0x3003 Bay ejection (see 0x2x05) complete, can sleep again
+0x3006 Bay hotplug request (hint to power up SATA link when
+ the optical drive tray is ejected)
+0x4003 Undocked (see 0x2x04), can sleep again
+0x4010 Docked into hotplug port replicator (non-ACPI dock)
+0x4011 Undocked from hotplug port replicator (non-ACPI dock)
+0x500B Tablet pen inserted into its storage bay
+0x500C Tablet pen removed from its storage bay
+0x6011 ALARM: battery is too hot
+0x6012 ALARM: battery is extremely hot
+0x6021 ALARM: a sensor is too hot
+0x6022 ALARM: a sensor is extremely hot
+0x6030 System thermal table changed
+0x6032 Thermal Control command set completion (DYTC, Windows)
+0x6040 Nvidia Optimus/AC adapter related (TO BE VERIFIED)
+0x60C0 X1 Yoga 2016, Tablet mode status changed
+0x60F0 Thermal Transformation changed (GMTS, Windows)
+====== =====================================================
+
+Battery nearly empty alarms are a last resort attempt to get the
+operating system to hibernate or shut down cleanly (0x2313), or shut down
+cleanly (0x2413) before power is lost. They must be acted upon, as the
+wake up caused by the firmware will have negated most safety nets...
+
+When any of the "too hot" alarms happen, according to Lenovo the user
+should suspend or hibernate the laptop (and in the case of battery
+alarms, unplug the AC adapter) to let it cool down. These alarms do
+signal that something is wrong: they should never happen under normal
+operating conditions.
+
+The "extremely hot" alarms are emergencies. According to Lenovo, the
+operating system is to force either an immediate suspend or hibernate
+cycle, or a system shutdown. Obviously, something is very wrong if this
+happens.
+
+
+Brightness hotkey notes
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Don't mess with the brightness hotkeys in a ThinkPad. If you want
+notifications for OSD, use the sysfs backlight class event support.
+
+The driver will issue KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN events
+automatically for the cases where userspace has to do something to
+implement brightness changes. When you override these events, you will
+either fail to handle properly the ThinkPads that require explicit
+action to change backlight brightness, or the ThinkPads that require
+that no action be taken to work properly.
+
+
+Bluetooth
+---------
+
+procfs: /proc/acpi/ibm/bluetooth
+
+sysfs device attribute: bluetooth_enable (deprecated)
+
+sysfs rfkill class: switch "tpacpi_bluetooth_sw"
+
+This feature shows the presence and current state of a ThinkPad
+Bluetooth device in the internal ThinkPad CDC slot.
+
+If the ThinkPad supports it, the Bluetooth state is stored in NVRAM,
+so it is kept across reboots and power-off.
+
+Procfs notes
+^^^^^^^^^^^^
+
+If Bluetooth is installed, the following commands can be used::
+
+ echo enable > /proc/acpi/ibm/bluetooth
+ echo disable > /proc/acpi/ibm/bluetooth
+
+Sysfs notes
+^^^^^^^^^^^
+
+ If the Bluetooth CDC card is installed, it can be enabled /
+ disabled through the "bluetooth_enable" thinkpad-acpi device
+ attribute, and its current status can also be queried.
+
+ enable:
+
+ - 0: disables Bluetooth / Bluetooth is disabled
+ - 1: enables Bluetooth / Bluetooth is enabled.
+
+ Note: this interface has been superseded by the generic rfkill
+ class. It has been deprecated, and it will be removed in year
+ 2010.
+
+ rfkill controller switch "tpacpi_bluetooth_sw": refer to
+ Documentation/driver-api/rfkill.rst for details.
+
+
+Video output control -- /proc/acpi/ibm/video
+--------------------------------------------
+
+This feature allows control over the devices used for video output -
+LCD, CRT or DVI (if available). The following commands are available::
+
+ echo lcd_enable > /proc/acpi/ibm/video
+ echo lcd_disable > /proc/acpi/ibm/video
+ echo crt_enable > /proc/acpi/ibm/video
+ echo crt_disable > /proc/acpi/ibm/video
+ echo dvi_enable > /proc/acpi/ibm/video
+ echo dvi_disable > /proc/acpi/ibm/video
+ echo auto_enable > /proc/acpi/ibm/video
+ echo auto_disable > /proc/acpi/ibm/video
+ echo expand_toggle > /proc/acpi/ibm/video
+ echo video_switch > /proc/acpi/ibm/video
+
+NOTE:
+ Access to this feature is restricted to processes owning the
+ CAP_SYS_ADMIN capability for safety reasons, as it can interact badly
+ enough with some versions of X.org to crash it.
+
+Each video output device can be enabled or disabled individually.
+Reading /proc/acpi/ibm/video shows the status of each device.
+
+Automatic video switching can be enabled or disabled. When automatic
+video switching is enabled, certain events (e.g. opening the lid,
+docking or undocking) cause the video output device to change
+automatically. While this can be useful, it also causes flickering
+and, on the X40, video corruption. By disabling automatic switching,
+the flickering or video corruption can be avoided.
+
+The video_switch command cycles through the available video outputs
+(it simulates the behavior of Fn-F7).
+
+Video expansion can be toggled through this feature. This controls
+whether the display is expanded to fill the entire LCD screen when a
+mode with less than full resolution is used. Note that the current
+video expansion status cannot be determined through this feature.
+
+Note that on many models (particularly those using Radeon graphics
+chips) the X driver configures the video card in a way which prevents
+Fn-F7 from working. This also disables the video output switching
+features of this driver, as it uses the same ACPI methods as
+Fn-F7. Video switching on the console should still work.
+
+UPDATE: refer to https://bugs.freedesktop.org/show_bug.cgi?id=2000
+
+
+ThinkLight control
+------------------
+
+procfs: /proc/acpi/ibm/light
+
+sysfs attributes: as per LED class, for the "tpacpi::thinklight" LED
+
+procfs notes
+^^^^^^^^^^^^
+
+The ThinkLight status can be read and set through the procfs interface. A
+few models which do not make the status available will show the ThinkLight
+status as "unknown". The available commands are::
+
+ echo on > /proc/acpi/ibm/light
+ echo off > /proc/acpi/ibm/light
+
+sysfs notes
+^^^^^^^^^^^
+
+The ThinkLight sysfs interface is documented by the LED class
+documentation, in Documentation/leds/leds-class.rst. The ThinkLight LED name
+is "tpacpi::thinklight".
+
+Due to limitations in the sysfs LED class, if the status of the ThinkLight
+cannot be read or if it is unknown, thinkpad-acpi will report it as "off".
+It is impossible to know if the status returned through sysfs is valid.
+
+
+CMOS/UCMS control
+-----------------
+
+procfs: /proc/acpi/ibm/cmos
+
+sysfs device attribute: cmos_command
+
+This feature is mostly used internally by the ACPI firmware to keep the legacy
+CMOS NVRAM bits in sync with the current machine state, and to record this
+state so that the ThinkPad will retain such settings across reboots.
+
+Some of these commands actually perform actions in some ThinkPad models, but
+this is expected to disappear more and more in newer models. As an example, in
+a T43 and in a X40, commands 12 and 13 still control the ThinkLight state for
+real, but commands 0 to 2 don't control the mixer anymore (they have been
+phased out) and just update the NVRAM.
+
+The range of valid cmos command numbers is 0 to 21, but not all have an
+effect and the behavior varies from model to model. Here is the behavior
+on the X40 (tpb is the ThinkPad Buttons utility):
+
+ - 0 - Related to "Volume down" key press
+ - 1 - Related to "Volume up" key press
+ - 2 - Related to "Mute on" key press
+ - 3 - Related to "Access IBM" key press
+ - 4 - Related to "LCD brightness up" key press
+ - 5 - Related to "LCD brightness down" key press
+ - 11 - Related to "toggle screen expansion" key press/function
+ - 12 - Related to "ThinkLight on"
+ - 13 - Related to "ThinkLight off"
+ - 14 - Related to "ThinkLight" key press (toggle ThinkLight)
+
+The cmos command interface is prone to firmware split-brain problems, as
+in newer ThinkPads it is just a compatibility layer. Do not use it; it is
+exported just as a debug tool.
+
+
+LED control
+-----------
+
+procfs: /proc/acpi/ibm/led
+sysfs attributes: as per LED class, see below for names
+
+Some of the LED indicators can be controlled through this feature. On
+some older ThinkPad models, it is possible to query the status of the
+LED indicators as well. Newer ThinkPads cannot query the real status
+of the LED indicators.
+
+Because misuse of the LEDs could induce an unaware user to perform
+dangerous actions (like undocking or ejecting a bay device while the
+buses are still active), or mask an important alarm (such as a nearly
+empty battery, or a broken battery), access to most LEDs is
+restricted.
+
+Unrestricted access to all LEDs requires that thinkpad-acpi be
+compiled with the CONFIG_THINKPAD_ACPI_UNSAFE_LEDS option enabled.
+Distributions must never enable this option. Individual users that
+are aware of the consequences are welcome to enable it.
+
+Audio mute and microphone mute LEDs are supported, but currently not
+visible to userspace. They are used by the snd-hda-intel audio driver.
+
+procfs notes
+^^^^^^^^^^^^
+
+The available commands are::
+
+ echo '<LED number> on' >/proc/acpi/ibm/led
+ echo '<LED number> off' >/proc/acpi/ibm/led
+ echo '<LED number> blink' >/proc/acpi/ibm/led
+
+The <LED number> range is 0 to 15. The set of LEDs that can be
+controlled varies from model to model. Here is the common ThinkPad
+mapping:
+
+ - 0 - power
+ - 1 - battery (orange)
+ - 2 - battery (green)
+ - 3 - UltraBase/dock
+ - 4 - UltraBay
+ - 5 - UltraBase battery slot
+ - 6 - (unknown)
+ - 7 - standby
+ - 8 - dock status 1
+ - 9 - dock status 2
+ - 10, 11 - (unknown)
+ - 12 - thinkvantage
+ - 13, 14, 15 - (unknown)
+
+All of the above can be turned on and off and can be made to blink.
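+
+As a minimal sketch of the procfs commands above, the shell function below
+validates its arguments and prints a command string suitable for
+/proc/acpi/ibm/led. The name led_cmd is hypothetical and is not part of
+thinkpad-acpi. The range check only mirrors the 0 to 15 range stated above
+and says nothing about which LEDs a given model really has.
+

```shell
# led_cmd <LED number> <on|off|blink>
# Hypothetical helper: prints "<LED number> <state>", or fails on bad input.
led_cmd() {
    case "$1" in
        ''|*[!0-9]*)
            echo "LED number must be a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "$1" -gt 15 ]; then
        echo "LED number must be in the range 0 to 15" >&2
        return 1
    fi
    case "$2" in
        on|off|blink) printf '%s %s\n' "$1" "$2" ;;
        *)
            echo "state must be on, off or blink" >&2
            return 1 ;;
    esac
}
```

+
+On a ThinkPad, the output could be redirected to the procfs file, for
+example with "led_cmd 0 blink >/proc/acpi/ibm/led".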
+
+sysfs notes
+^^^^^^^^^^^
+
+The ThinkPad LED sysfs interface is described in detail by the LED class
+documentation, in Documentation/leds/leds-class.rst.
+
+The LEDs are named (in LED ID order, from 0 to 12):
+"tpacpi::power", "tpacpi:orange:batt", "tpacpi:green:batt",
+"tpacpi::dock_active", "tpacpi::bay_active", "tpacpi::dock_batt",
+"tpacpi::unknown_led", "tpacpi::standby", "tpacpi::dock_status1",
+"tpacpi::dock_status2", "tpacpi::unknown_led2", "tpacpi::unknown_led3",
+"tpacpi::thinkvantage".
+
+Due to limitations in the sysfs LED class, if the status of the LED
+indicators cannot be read due to an error, thinkpad-acpi will report it as
+a brightness of zero (same as LED off).
+
+If the thinkpad firmware doesn't support reading the current status,
+trying to read the current LED brightness will just return whatever
+brightness was last written to that attribute.
+
+These LEDs can blink using hardware acceleration. To request that a
+ThinkPad indicator LED should blink in hardware accelerated mode, use the
+"timer" trigger, and leave the delay_on and delay_off parameters set to
+zero (to request hardware acceleration autodetection).
+
+LEDs that are known not to exist in a given ThinkPad model are not
+made available through the sysfs interface. If you have a dock and you
+notice there are LEDs listed for your ThinkPad that do not exist (and
+are not in the dock), or if you notice that there are missing LEDs,
+a report to ibm-acpi-devel@lists.sourceforge.net is appreciated.
+
+
+ACPI sounds -- /proc/acpi/ibm/beep
+----------------------------------
+
+The BEEP method is used internally by the ACPI firmware to provide
+audible alerts in various situations. This feature allows the same
+sounds to be triggered manually.
+
+The commands are non-negative integer numbers::
+
+ echo <number> >/proc/acpi/ibm/beep
+
+The valid <number> range is 0 to 17. Not all numbers trigger sounds
+and the sounds vary from model to model. Here is the behavior on the
+X40:
+
+ - 0 - stop a sound in progress (but use 17 to stop 16)
+ - 2 - two beeps, pause, third beep ("low battery")
+ - 3 - single beep
+ - 4 - high, followed by low-pitched beep ("unable")
+ - 5 - single beep
+ - 6 - very high, followed by high-pitched beep ("AC/DC")
+ - 7 - high-pitched beep
+ - 9 - three short beeps
+ - 10 - very long beep
+ - 12 - low-pitched beep
+ - 15 - three high-pitched beeps repeating constantly, stop with 0
+ - 16 - one medium-pitched beep repeating constantly, stop with 17
+ - 17 - stop 16
+
+
+Temperature sensors
+-------------------
+
+procfs: /proc/acpi/ibm/thermal
+
+sysfs device attributes: (hwmon "thinkpad") temp*_input
+
+Most ThinkPads include six or more separate temperature sensors but only
+expose the CPU temperature through the standard ACPI methods. This
+feature shows readings from up to eight different sensors on older
+ThinkPads, and up to sixteen different sensors on newer ThinkPads.
+
+For example, on the X40, a typical output may be::
+
+ temperatures:
+ 42 42 45 41 36 -128 33 -128
+
+On the T43/p, a typical output may be::
+
+ temperatures:
+ 48 48 36 52 38 -128 31 -128 48 52 48 -128 -128 -128 -128 -128
+
+The mapping of thermal sensors to physical locations varies depending on
+system-board model (and thus, on ThinkPad model).
+
+http://thinkwiki.org/wiki/Thermal_Sensors is a public wiki page that
+tries to track down these locations for various models.
+
+Most (newer?) models seem to follow this pattern:
+
+- 1: CPU
+- 2: (depends on model)
+- 3: (depends on model)
+- 4: GPU
+- 5: Main battery: main sensor
+- 6: Bay battery: main sensor
+- 7: Main battery: secondary sensor
+- 8: Bay battery: secondary sensor
+- 9-15: (depends on model)
+
+For the R51 (source: Thomas Gruber):
+
+- 2: Mini-PCI
+- 3: Internal HDD
+
+For the T43, T43/p (source: Shmidoax/Thinkwiki.org)
+http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_T43.2C_T43p
+
+- 2: System board, left side (near PCMCIA slot), reported as HDAPS temp
+- 3: PCMCIA slot
+- 9: MCH (northbridge) to DRAM Bus
+- 10: Clock-generator, mini-pci card and ICH (southbridge), under Mini-PCI
+ card, under touchpad
+- 11: Power regulator, underside of system board, below F2 key
+
+The A31 has a very atypical layout for the thermal sensors
+(source: Milos Popovic, http://thinkwiki.org/wiki/Thermal_Sensors#ThinkPad_A31)
+
+- 1: CPU
+- 2: Main Battery: main sensor
+- 3: Power Converter
+- 4: Bay Battery: main sensor
+- 5: MCH (northbridge)
+- 6: PCMCIA/ambient
+- 7: Main Battery: secondary sensor
+- 8: Bay Battery: secondary sensor
+
+
+Procfs notes
+^^^^^^^^^^^^
+
+ Readings from sensors that are not available return -128.
+ No commands can be written to this file.
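+
+The -128 convention makes the procfs output easy to filter. The sketch
+below, with the hypothetical name list_valid_temps, reads the procfs output
+on standard input and prints the index and reading of each available
+sensor. It assumes that the input consists only of the "temperatures:"
+label followed by whitespace-separated readings, and it numbers the sensors
+from 1 as in the lists above.
+

```shell
# list_valid_temps: reads "temperatures: <t1> <t2> ..." on stdin and prints
# "<sensor index>: <reading>" for every sensor that does not report -128.
list_valid_temps() {
    idx=0
    # Word-split the whole input; word 0 is the "temperatures:" label.
    for word in $(cat); do
        if [ "$idx" -gt 0 ] && [ "$word" != "-128" ]; then
            printf '%d: %s\n' "$idx" "$word"
        fi
        idx=$((idx + 1))
    done
}
```

+
+A possible invocation is "list_valid_temps </proc/acpi/ibm/thermal".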
+
+Sysfs notes
+^^^^^^^^^^^
+
+ Sensors that are not available return the ENXIO error. This
+ status may change at runtime, as there are hotplug thermal
+ sensors, like those inside the batteries and docks.
+
+ thinkpad-acpi thermal sensors are reported through the hwmon
+ subsystem, and follow all of the hwmon guidelines at
+ Documentation/hwmon.
+
+EXPERIMENTAL: Embedded controller register dump
+-----------------------------------------------
+
+This feature is not included in the thinkpad driver anymore.
+Instead the EC can be accessed through /sys/kernel/debug/ec with
+a userspace tool which can be found here:
+ftp://ftp.suse.com/pub/people/trenn/sources/ec
+
+Use it to determine the register holding the fan
+speed on some models. To do that, do the following:
+
+ - make sure the battery is fully charged
+ - make sure the fan is running
+ - use the above-mentioned tool to read out the EC
+
+Often fan and temperature values vary between
+readings. Since temperatures don't change very fast, you can take
+several quick dumps to eliminate them.
+
+You can use a similar method to figure out the meaning of other
+embedded controller registers - e.g. make sure nothing else changes
+except the charging or discharging battery to determine which
+registers contain the current battery capacity, etc. If you experiment
+with this, do send me your results (including some complete dumps with
+a description of the conditions when they were taken.)
+
+
+LCD brightness control
+----------------------
+
+procfs: /proc/acpi/ibm/brightness
+
+sysfs backlight device "thinkpad_screen"
+
+This feature allows software control of the LCD brightness on ThinkPad
+models which don't have a hardware brightness slider.
+
+It has some limitations: the LCD backlight cannot be actually turned
+on or off by this interface, it just controls the backlight brightness
+level.
+
+On IBM (and some of the earlier Lenovo) ThinkPads, the backlight control
+has eight brightness levels, ranging from 0 to 7. Some of the levels
+may not be distinct. Later Lenovo models that implement the ACPI
+display backlight brightness control methods have 16 levels, ranging
+from 0 to 15.
+
+For IBM ThinkPads, there are two interfaces to the firmware for direct
+brightness control, EC and UCMS (or CMOS). To select which one should be
+used, use the brightness_mode module parameter: brightness_mode=1 selects
+EC mode, brightness_mode=2 selects UCMS mode, brightness_mode=3 selects EC
+mode with NVRAM backing (so that brightness changes are remembered across
+shutdown/reboot).
+
+The driver tries to select which interface to use from a table of
+defaults for each ThinkPad model. If it makes a wrong choice, please
+report this as a bug, so that we can fix it.
+
+Lenovo ThinkPads only support brightness_mode=2 (UCMS).
+
+When display backlight brightness controls are available through the
+standard ACPI interface, it is best to use it instead of this direct
+ThinkPad-specific interface. The driver will disable its native
+backlight brightness control interface if it detects that the standard
+ACPI interface is available in the ThinkPad.
+
+If you want to use the thinkpad-acpi backlight brightness control
+instead of the generic ACPI video backlight brightness control for some
+reason, you should use the acpi_backlight=vendor kernel parameter.
+
+The brightness_enable module parameter can be used to control whether
+the LCD brightness control feature will be enabled when available.
+brightness_enable=0 forces it to be disabled. brightness_enable=1
+forces it to be enabled when available, even if the standard ACPI
+interface is also available.
+
+Procfs notes
+^^^^^^^^^^^^
+
+The available commands are::
+
+ echo up >/proc/acpi/ibm/brightness
+ echo down >/proc/acpi/ibm/brightness
+ echo 'level <level>' >/proc/acpi/ibm/brightness
+
+Sysfs notes
+^^^^^^^^^^^
+
+The interface is implemented through the backlight sysfs class, which is
+poorly documented at this time.
+
+Locate the thinkpad_screen device under /sys/class/backlight, and inside
+it there will be the following attributes:
+
+ max_brightness:
+     Reads the maximum brightness the hardware can be set to.
+     The minimum is always zero.
+
+ actual_brightness:
+     Reads what brightness the screen is set to at this instant.
+
+ brightness:
+     Writes request the driver to change brightness to the
+     given value. Reads will tell you what brightness the
+     driver is trying to set the display to when "power" is set
+     to zero and the display has not been dimmed by a kernel
+     power management event.
+
+ power:
+     power management mode, where 0 is "display on", and 1 to 3
+     will dim the display backlight to brightness level 0
+     because thinkpad-acpi cannot really turn the backlight
+     off. Kernel power management events can temporarily
+     increase the current power management level, i.e. they can
+     dim the display.
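+
+As an illustration of how max_brightness and brightness fit together, the
+sketch below sets the backlight to a percentage of the maximum. The function
+name set_backlight_percent and its optional directory argument are
+assumptions made for this example; the directory argument only exists so
+that the function can be tried against a scratch directory instead of the
+real sysfs device.
+

```shell
# set_backlight_percent <percent> [<backlight device directory>]
# Hypothetical helper: scales <percent> (0 to 100) to the device's range.
set_backlight_percent() {
    dev=${2:-/sys/class/backlight/thinkpad_screen}
    case "$1" in
        ''|*[!0-9]*)
            echo "percent must be an integer from 0 to 100" >&2
            return 1 ;;
    esac
    [ "$1" -le 100 ] || {
        echo "percent must be an integer from 0 to 100" >&2
        return 1
    }
    max=$(cat "$dev/max_brightness") || return 1
    # Integer scaling, rounded down: 0 maps to 0, 100 maps to max_brightness.
    echo $(( $1 * max / 100 )) > "$dev/brightness"
}
```

+
+With max_brightness equal to 15, a request for 50 percent writes 7.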
+
+
+WARNING:
+
+ Whatever you do, do NOT ever call thinkpad-acpi backlight-level change
+ interface and the ACPI-based backlight level change interface
+ (available on newer BIOSes, and driven by the Linux ACPI video driver)
+ at the same time. The two will interact in bad ways, do funny things,
+ and maybe reduce the life of the backlight lamps by needlessly kicking
+ its level up and down at every change.
+
+
+Volume control (Console Audio control)
+--------------------------------------
+
+procfs: /proc/acpi/ibm/volume
+
+ALSA: "ThinkPad Console Audio Control", default ID: "ThinkPadEC"
+
+NOTE: by default, the volume control interface operates in read-only
+mode, as it is supposed to be used for on-screen-display purposes.
+The read/write mode can be enabled through the use of the
+"volume_control=1" module parameter.
+
+NOTE: distros are urged not to enable volume_control by default; this
+should be done by the local admin only. The ThinkPad UI is for the
+console audio control to be done through the volume keys only, and for
+the desktop environment to just provide on-screen-display feedback.
+Software volume control should be done only in the main AC97/HDA
+mixer.
+
+
+About the ThinkPad Console Audio control
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ThinkPads have a built-in amplifier and muting circuit that drives the
+console headphone and speakers. This circuit is after the main AC97
+or HDA mixer in the audio path, and under exclusive control of the
+firmware.
+
+ThinkPads have three special hotkeys to interact with the console
+audio control: volume up, volume down and mute.
+
+It is worth noting that the normal way the mute function works (on
+ThinkPads that do not have a "mute LED") is:
+
+1. Press mute to mute. It will *always* mute: you can press it as
+   many times as you want, and the sound will remain muted.
+
+2. Press either volume key to unmute the ThinkPad (it will *not*
+   change the volume, it will just unmute).
+
+This is a far superior design when compared to the cheap software-only
+mute-toggle solution found on normal consumer laptops: you can be
+absolutely sure the ThinkPad will not make noise if you press the mute
+button, no matter the previous state.
+
+The IBM ThinkPads, and the earlier Lenovo ThinkPads have variable-gain
+amplifiers driving the speakers and headphone output, and the firmware
+also handles volume control for the headphone and speakers on these
+ThinkPads without any help from the operating system (this volume
+control stage exists after the main AC97 or HDA mixer in the audio
+path).
+
+The newer Lenovo models only have firmware mute control, and depend on
+the main HDA mixer to do volume control (which is done by the operating
+system). In this case, the volume keys are filtered out for unmute
+key press (there are some firmware bugs in this area) and delivered as
+normal key presses to the operating system (thinkpad-acpi is not
+involved).
+
+
+The ThinkPad-ACPI volume control
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The preferred way to interact with the Console Audio control is the
+ALSA interface.
+
+The legacy procfs interface allows one to read the current state,
+and if volume control is enabled, accepts the following commands::
+
+ echo up >/proc/acpi/ibm/volume
+ echo down >/proc/acpi/ibm/volume
+ echo mute >/proc/acpi/ibm/volume
+ echo unmute >/proc/acpi/ibm/volume
+ echo 'level <level>' >/proc/acpi/ibm/volume
+
+The <level> number range is 0 to 14 although not all of them may be
+distinct. To unmute the volume after the mute command, use either the
+up or down command (the level command will not unmute the volume), or
+the unmute command.
+
+You can use the volume_capabilities parameter to tell the driver
+whether your thinkpad has volume control or mute-only control:
+volume_capabilities=1 for mixers with mute and volume control,
+volume_capabilities=2 for mixers with only mute control.
+
+If the driver misdetects the capabilities for your ThinkPad model,
+please report this to ibm-acpi-devel@lists.sourceforge.net, so that we
+can update the driver.
+
+There are two strategies for volume control. To select which one
+should be used, use the volume_mode module parameter: volume_mode=1
+selects EC mode, and volume_mode=3 selects EC mode with NVRAM backing
+(so that volume/mute changes are remembered across shutdown/reboot).
+
+The driver will operate in volume_mode=3 by default. If that does not
+work well on your ThinkPad model, please report this to
+ibm-acpi-devel@lists.sourceforge.net.
+
+The driver supports the standard ALSA module parameters. If the ALSA
+mixer is disabled, the driver will disable all volume functionality.
+
+
+Fan control and monitoring: fan speed, fan enable/disable
+---------------------------------------------------------
+
+procfs: /proc/acpi/ibm/fan
+
+sysfs device attributes: (hwmon "thinkpad") fan1_input, pwm1, pwm1_enable, fan2_input
+
+sysfs hwmon driver attributes: fan_watchdog
+
+NOTE NOTE NOTE:
+ fan control operations are disabled by default for
+ safety reasons. To enable them, the module parameter "fan_control=1"
+ must be given to thinkpad-acpi.
+
+This feature attempts to show the current fan speed, control mode and
+other fan data that might be available. The speed is read directly
+from the hardware registers of the embedded controller. This is known
+to work on later R, T, X and Z series ThinkPads but may show a bogus
+value on other models.
+
+Some Lenovo ThinkPads support a secondary fan. This fan cannot be
+controlled separately; it shares the main fan control.
+
+Fan levels
+^^^^^^^^^^
+
+Most ThinkPad fans work in "levels" at the firmware interface. Level 0
+stops the fan. The higher the level, the higher the fan speed, although
+adjacent levels often map to the same fan speed. 7 is the highest
+level, where the fan reaches the maximum recommended speed.
+
+Level "auto" means the EC changes the fan level according to some
+internal algorithm, usually based on readings from the thermal sensors.
+
+There is also a "full-speed" level, also known as "disengaged" level.
+In this level, the EC disables the speed-locked closed-loop fan control,
+and drives the fan as fast as it can go, which might exceed hardware
+limits, so use this level with caution.
+
+The fan usually ramps up or down slowly from one speed to another, and
+it is normal for the EC to take several seconds to react to fan
+commands. The full-speed level may take up to two minutes to ramp up to
+maximum speed, and in some ThinkPads, the tachometer readings go stale
+while the EC is transitioning to the full-speed level.
+
+WARNING WARNING WARNING: do not leave the fan disabled unless you are
+monitoring all of the temperature sensor readings and you are ready to
+enable it if necessary to avoid overheating.
+
+An enabled fan in level "auto" may stop spinning if the EC decides the
+ThinkPad is cool enough and doesn't need the extra airflow. This is
+normal, and the EC will spin the fan up if the various thermal readings
+rise too much.
+
+On the X40, this seems to depend on the CPU and HDD temperatures.
+Specifically, the fan is turned on when either the CPU temperature
+climbs to 56 degrees or the HDD temperature climbs to 46 degrees. The
+fan is turned off when the CPU temperature drops to 49 degrees and the
+HDD temperature drops to 41 degrees. These thresholds cannot
+currently be controlled.
+
+The ThinkPad's ACPI DSDT code will reprogram the fan on its own when
+certain conditions are met. It will override any fan programming done
+through thinkpad-acpi.
+
+The thinkpad-acpi kernel driver can be programmed to revert the fan
+level to a safe setting if userspace does not issue one of the procfs
+fan commands: "enable", "disable", "level" or "watchdog", or if there
+are no writes to pwm1_enable (or to pwm1 *if and only if* pwm1_enable is
+set to 1, manual mode) within a configurable amount of time of up to
+120 seconds. This functionality is called fan safety watchdog.
+
+Note that the watchdog timer stops after it enables the fan. It will be
+rearmed again automatically (using the same interval) when one of the
+above mentioned fan commands is received. The fan watchdog is,
+therefore, not suitable to protect against fan mode changes made through
+means other than the "enable", "disable", and "level" procfs fan
+commands, or the hwmon fan control sysfs interface.
+
+Procfs notes
+^^^^^^^^^^^^
+
+The fan may be enabled or disabled with the following commands::
+
+ echo enable >/proc/acpi/ibm/fan
+ echo disable >/proc/acpi/ibm/fan
+
+Placing a fan on level 0 is the same as disabling it. Enabling a fan
+will try to place it in a safe level if it is too slow or disabled.
+
+The fan level can be controlled with the command::
+
+ echo 'level <level>' > /proc/acpi/ibm/fan
+
+Where <level> is an integer from 0 to 7, or one of the words "auto" or
+"full-speed" (without the quotes). Not all ThinkPads support the "auto"
+and "full-speed" levels. The driver accepts "disengaged" as an alias for
+"full-speed", and reports it as "disengaged" for backwards
+compatibility.
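+
+The accepted level values can be summarized in a small shell function. This
+is only a sketch: the name fan_level_cmd is hypothetical, and the function
+checks the syntax described above without knowing whether a given ThinkPad
+supports the "auto" and "full-speed" levels.
+

```shell
# fan_level_cmd <level>
# Hypothetical helper: prints a "level" command for /proc/acpi/ibm/fan.
fan_level_cmd() {
    case "$1" in
        [0-7]|auto|full-speed|disengaged)
            printf 'level %s\n' "$1" ;;
        *)
            echo "level must be 0 to 7, auto, full-speed or disengaged" >&2
            return 1 ;;
    esac
}
```

+
+For example, "fan_level_cmd auto > /proc/acpi/ibm/fan" would hand control
+back to the EC, provided that the module was loaded with fan_control=1.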
+
+On the X31 and X40 (and ONLY on those models), the fan speed can be
+controlled to a certain degree. Once the fan is running, it can be
+forced to run faster or slower with the following command::
+
+ echo 'speed <speed>' > /proc/acpi/ibm/fan
+
+The sustainable range of fan speeds on the X40 appears to be from about
+3700 to about 7350. Values outside this range either do not have any
+effect or the fan speed eventually settles somewhere in that range. The
+fan cannot be stopped or started with this command. This functionality
+is incomplete, and not available through the sysfs interface.
+
+To program the safety watchdog, use the "watchdog" command::
+
+ echo 'watchdog <interval in seconds>' > /proc/acpi/ibm/fan
+
+If you want to disable the watchdog, use 0 as the interval.
+
+Sysfs notes
+^^^^^^^^^^^
+
+The sysfs interface follows the hwmon subsystem guidelines for the most
+part, and the exception is the fan safety watchdog.
+
+Writes to any of the sysfs attributes may return the EINVAL error if
+that operation is not supported in a given ThinkPad or if the parameter
+is out-of-bounds, and EPERM if it is forbidden. They may also return
+EINTR (interrupted system call), and EIO (I/O error while trying to talk
+to the firmware).
+
+Features not yet implemented by the driver return ENOSYS.
+
+hwmon device attribute pwm1_enable:
+ - 0: PWM offline (fan is set to full-speed mode)
+ - 1: Manual PWM control (use pwm1 to set fan level)
+ - 2: Hardware PWM control (EC "auto" mode)
+ - 3: reserved (Software PWM control, not implemented yet)
+
+ Modes 0 and 2 are not supported by all ThinkPads, and the
+ driver is not always able to detect this. If it does know a
+ mode is unsupported, it will return -EINVAL.
+
+hwmon device attribute pwm1:
+ Fan level, scaled from the firmware values of 0-7 to the hwmon
+ scale of 0-255. 0 means fan stopped, 255 means highest normal
+ speed (level 7).
+
+ This attribute only commands the fan if pwm1_enable is set to 1
+ (manual PWM control).
+
+hwmon device attribute fan1_input:
+ Fan tachometer reading, in RPM. May go stale on certain
+ ThinkPads while the EC transitions the PWM to offline mode,
+ which can take up to two minutes. May return rubbish on older
+ ThinkPads.
+
+hwmon device attribute fan2_input:
+ Fan tachometer reading, in RPM, for the secondary fan.
+ Available only on some ThinkPads. If the secondary fan is
+ not installed, will always read 0.
+
+hwmon driver attribute fan_watchdog:
+ Fan safety watchdog timer interval, in seconds. Minimum is
+ 1 second, maximum is 120 seconds. 0 disables the watchdog.
+
+To stop the fan: set pwm1 to zero, and pwm1_enable to 1.
+
+To start the fan in a safe mode: set pwm1_enable to 2. If that fails
+with EINVAL, try to set pwm1_enable to 1 and pwm1 to at least 128 (255
+would be the safest choice, though).
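+
+The relation between firmware levels and pwm1 values can be illustrated with
+a little arithmetic. The sketch below assumes a plain linear scaling with
+integer division; this is an assumption of the example, and the rounding
+used by the driver may differ. The name fan_level_to_pwm is hypothetical.
+

```shell
# fan_level_to_pwm <level>
# Hypothetical helper: maps a firmware level (0 to 7) to the hwmon 0-255
# scale, assuming linear scaling rounded down.
fan_level_to_pwm() {
    case "$1" in
        [0-7]) echo $(( $1 * 255 / 7 )) ;;
        *)
            echo "level must be an integer from 0 to 7" >&2
            return 1 ;;
    esac
}
```

+
+Under this assumption, level 0 maps to 0 and level 7 maps to 255, while
+level 4 maps to 145, which is above the 128 suggested above as a minimum
+for starting the fan in manual mode.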
+
+
+WAN
+---
+
+procfs: /proc/acpi/ibm/wan
+
+sysfs device attribute: wwan_enable (deprecated)
+
+sysfs rfkill class: switch "tpacpi_wwan_sw"
+
+This feature shows the presence and current state of the built-in
+Wireless WAN device.
+
+If the ThinkPad supports it, the WWAN state is stored in NVRAM,
+so it is kept across reboots and power-off.
+
+It was tested on a Lenovo ThinkPad X60. It should probably work on other
+ThinkPad models which come with this module installed.
+
+Procfs notes
+^^^^^^^^^^^^
+
+If the W-WAN card is installed, the following commands can be used::
+
+ echo enable > /proc/acpi/ibm/wan
+ echo disable > /proc/acpi/ibm/wan
+
+Sysfs notes
+^^^^^^^^^^^
+
+ If the W-WAN card is installed, it can be enabled /
+ disabled through the "wwan_enable" thinkpad-acpi device
+ attribute, and its current status can also be queried.
+
+ enable:
+ - 0: disables WWAN card / WWAN card is disabled
+ - 1: enables WWAN card / WWAN card is enabled.
+
+ Note: this interface has been superseded by the generic rfkill
+ class. It has been deprecated, and it will be removed in year
+ 2010.
+
+ rfkill controller switch "tpacpi_wwan_sw": refer to
+ Documentation/driver-api/rfkill.rst for details.
+
+
+LCD Shadow control
+------------------
+
+procfs: /proc/acpi/ibm/lcdshadow
+
+Some newer T480s and T490s ThinkPads provide a feature called
+PrivacyGuard. By turning this feature on, the usable vertical and
+horizontal viewing angles of the LCD can be limited (as if some privacy
+screen was applied manually in front of the display).
+
+procfs notes
+^^^^^^^^^^^^
+
+The available commands are::
+
+ echo '0' >/proc/acpi/ibm/lcdshadow
+ echo '1' >/proc/acpi/ibm/lcdshadow
+
+The first command ensures the best viewing angle and the latter one turns
+on the feature, restricting the viewing angles.
+
+
+EXPERIMENTAL: UWB
+-----------------
+
+This feature is considered EXPERIMENTAL because it has not been extensively
+tested and validated in various ThinkPad models yet. The feature may not
+work as expected. USE WITH CAUTION! To use this feature, you need to supply
+the experimental=1 parameter when loading the module.
+
+sysfs rfkill class: switch "tpacpi_uwb_sw"
+
+This feature exports an rfkill controller for the UWB device, if one is
+present and enabled in the BIOS.
+
+Sysfs notes
+^^^^^^^^^^^
+
+ rfkill controller switch "tpacpi_uwb_sw": refer to
+ Documentation/driver-api/rfkill.rst for details.
+
+Adaptive keyboard
+-----------------
+
+sysfs device attribute: adaptive_kbd_mode
+
+This sysfs attribute controls the keyboard "face" that will be shown on the
+Lenovo X1 Carbon 2nd gen (2014)'s adaptive keyboard. The value can be read
+and set.
+
+- 1 = Home mode
+- 2 = Web-browser mode
+- 3 = Web-conference mode
+- 4 = Function mode
+- 5 = Layflat mode
+
+For more details about which buttons will appear depending on the mode, please
+review the laptop's user guide:
+http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
+
+Multiple Commands, Module Parameters
+------------------------------------
+
+Multiple commands can be written to the proc files in one shot by
+separating them with commas, for example::
+
+ echo enable,0xffff > /proc/acpi/ibm/hotkey
+ echo lcd_disable,crt_enable > /proc/acpi/ibm/video
+
+Commands can also be specified when loading the thinkpad-acpi module,
+for example::
+
+ modprobe thinkpad_acpi hotkey=enable,0xffff video=auto_disable
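+
+When the commands come from a script, the comma-separated form can be built
+from individual arguments. The helper below is a sketch with the
+hypothetical name join_cmds; it only joins strings and does not check that
+the commands are valid for any particular proc file.
+

```shell
# join_cmds <command>...
# Hypothetical helper: joins its arguments with commas.
join_cmds() {
    # "$*" joins the arguments with the first character of IFS; the
    # subshell keeps the IFS change local to this function.
    ( IFS=,; printf '%s\n' "$*" )
}
```

+
+For example, "join_cmds lcd_disable crt_enable > /proc/acpi/ibm/video" is
+equivalent to the second echo command above.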
+
+
+Enabling debugging output
+-------------------------
+
+The module takes a debug parameter which can be used to selectively
+enable various classes of debugging output, for example::
+
+ modprobe thinkpad_acpi debug=0xffff
+
+will enable all debugging output classes. It takes a bitmask, so
+to enable more than one output class, just add their values.
+
+ ============= ======================================
+ Debug bitmask Description
+ ============= ======================================
+ 0x8000        Disclose PID of userspace programs
+               accessing some functions of the driver
+ 0x0001        Initialization and probing
+ 0x0002        Removal
+ 0x0004        RF Transmitter control (RFKILL)
+               (bluetooth, WWAN, UWB...)
+ 0x0008        HKEY event interface, hotkeys
+ 0x0010        Fan control
+ 0x0020        Backlight brightness
+ 0x0040        Audio mixer/volume control
+ ============= ======================================
+
+There is also a kernel build option to enable more debugging
+information, which may be necessary to debug driver problems.
+
+The level of debugging information output by the driver can be changed
+at runtime through sysfs, using the driver attribute debug_level. The
+attribute takes the same bitmask as the debug module parameter above.
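+
+Combining output classes is a bitwise OR of the values in the table above.
+The following sketch computes a mask from symbolic names. The function name
+tpacpi_debug_mask and the class names (init, exit, rfkill, hkey, fan,
+backlight, mixer, pid) are labels invented for this example; only the
+numeric values come from the table.
+

```shell
# tpacpi_debug_mask <class>...
# Hypothetical helper: ORs together the debug bits for the named classes
# and prints the result as a four-digit hexadecimal mask.
tpacpi_debug_mask() {
    mask=0
    for class in "$@"; do
        case "$class" in
            init)      bit=0x0001 ;;
            exit)      bit=0x0002 ;;
            rfkill)    bit=0x0004 ;;
            hkey)      bit=0x0008 ;;
            fan)       bit=0x0010 ;;
            backlight) bit=0x0020 ;;
            mixer)     bit=0x0040 ;;
            pid)       bit=0x8000 ;;
            *)
                echo "unknown debug class: $class" >&2
                return 1 ;;
        esac
        mask=$(( mask | $bit ))
    done
    printf '0x%04x\n' "$mask"
}
```

+
+The result can be passed to the module, for example as
+"modprobe thinkpad_acpi debug=$(tpacpi_debug_mask fan backlight)".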
+
+
+Force loading of module
+-----------------------
+
+If thinkpad-acpi refuses to detect your ThinkPad, you can try to specify
+the module parameter force_load=1. Regardless of whether this works or
+not, please contact ibm-acpi-devel@lists.sourceforge.net with a report.
+
+
+Sysfs interface changelog
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+========= ===============================================================
+0x000100: Initial sysfs support, as a single platform driver and
+          device.
+0x000200: Hot key support for 32 hot keys, and radio slider switch
+          support.
+0x010000: Hot keys are now handled by default over the input
+          layer, the radio switch generates input event EV_RADIO,
+          and the driver enables hot key handling by default in
+          the firmware.
+
+0x020000: ABI fix: added a separate hwmon platform device and
+          driver, which must be located by name (thinkpad)
+          and the hwmon class for libsensors4 (lm-sensors 3)
+          compatibility. Moved all hwmon attributes to this
+          new platform device.
+
+0x020100: Marker for thinkpad-acpi with hot key NVRAM polling
+          support. If you must, use it to know you should not
+          start a userspace NVRAM poller (allows to detect when
+          NVRAM is compiled out by the user because it is
+          unneeded/undesired in the first place).
+0x020101: Marker for thinkpad-acpi with hot key NVRAM polling
+          and proper hotkey_mask semantics (version 8 of the
+          NVRAM polling patch). Some development snapshots of
+          0.18 had an earlier version that did strange things
+          to hotkey_mask.
+
+0x020200: Add poll()/select() support to the following attributes:
+          hotkey_radio_sw, wakeup_hotunplug_complete, wakeup_reason
+
+0x020300: hotkey enable/disable support removed, attributes
+          hotkey_bios_enabled and hotkey_enable deprecated and
+          marked for removal.
+
+0x020400: Marker for 16 LEDs support. Also, LEDs that are known
+          to not exist in a given model are not registered with
+          the LED sysfs class anymore.
+
+0x020500: Updated hotkey driver, hotkey_mask is always available
+          and it is always able to disable hot keys. Very old
+          thinkpads are properly supported. hotkey_bios_mask
+          is deprecated and marked for removal.
+
+0x020600: Marker for backlight change event support.
+
+0x020700: Support for mute-only mixers.
+          Volume control in read-only mode by default.
+          Marker for ALSA mixer support.
+
+0x030000: Thermal and fan sysfs attributes were moved to the hwmon
+          device instead of being attached to the backing platform
+          device.
+========= ===============================================================
diff --git a/Documentation/admin-guide/laptops/toshiba_haps.rst b/Documentation/admin-guide/laptops/toshiba_haps.rst
new file mode 100644
index 0000000..d28b6c3
--- /dev/null
+++ b/Documentation/admin-guide/laptops/toshiba_haps.rst
@@ -0,0 +1,87 @@
+====================================
+Toshiba HDD Active Protection Sensor
+====================================
+
+Kernel driver: toshiba_haps
+
+Author: Azael Avalos <coproscefalo@gmail.com>
+
+
+.. 0. Contents
+
+ 1. Description
+ 2. Interface
+ 3. Accelerometer axes
+ 4. Supported devices
+ 5. Usage
+
+
+1. Description
+--------------
+
+This driver provides support for the accelerometer found in various Toshiba
+laptops, officially called "Toshiba HDD Protection - Shock Sensor". Laptops
+with this device are detected automatically.
+On Windows, Toshiba-provided software monitors this device and provides
+automatic HDD protection (head unload) on sudden moves or harsh vibrations.
+This driver, however, only provides a notification via a sysfs file to let
+userspace tools or daemons act accordingly, as well as a sysfs file to set
+the desired protection level or sensor sensitivity.
+
+
+2. Interface
+------------
+
+This device comes with 3 methods:
+
+==== =====================================================================
+_STA Checks existence of the device, returning zero if the device does not
+     exist or is not supported.
+PTLV Sets the desired protection level.
+RSSS Shuts down the HDD protection interface for a few seconds,
+     then restores normal operation.
+==== =====================================================================
+
+Note:
+ The presence of Solid State Drives (SSD) can make this driver fail to load.
+ Such drives have no moving parts and thus do not require any "protection",
+ and the evaluation of the _STA method found under this device fails when
+ they are present.
+
+
+3. Accelerometer axes
+---------------------
+
+This device does not report any axes. To query the sensor position, a
+couple of HCI (Hardware Configuration Interface) calls (0x6D and 0xA6) are
+provided; these are handled by the kernel module toshiba_acpi since kernel
+version 3.15.
+
+
+4. Supported devices
+--------------------
+
+This driver binds itself to the ACPI device TOS620A, and any Toshiba laptop
+with this device is supported, provided that it has a conventional HDD,
+either alone or in combination with an SSD.
+
+
+5. Usage
+--------
+
+The sysfs files under /sys/devices/LNXSYSTM:00/LNXSYBUS:00/TOS620A:00/ are:
+
+================ ============================================================
+protection_level The protection_level file is readable and writeable. It
+                 lets userspace query the current protection level and
+                 set the desired one. The available protection levels
+                 are::
+
+                   ============ ======= ========== ========
+                   0 - Disabled 1 - Low 2 - Medium 3 - High
+                   ============ ======= ========== ========
+
+reset_protection The reset_protection file is write-only and accepts
+                 only the value "1", which triggers a reset of the
+                 protection interface.
+================ ============================================================
diff --git a/Documentation/admin-guide/lcd-panel-cgram.rst b/Documentation/admin-guide/lcd-panel-cgram.rst
new file mode 100644
index 0000000..a3eb00c
--- /dev/null
+++ b/Documentation/admin-guide/lcd-panel-cgram.rst
@@ -0,0 +1,27 @@
+======================================
+Parallel port LCD/Keypad Panel support
+======================================
+
+Some LCDs allow you to define up to 8 characters, mapped to ASCII
+characters 0 to 7. The escape code to define a new character is
+'\e[LG' followed by one digit from 0 to 7, representing the character
+number, and up to 8 pairs of hex digits terminated by a semicolon
+(';'). Each pair of digits represents a line, with a 1 bit for each
+illuminated pixel and the LSB on the right. Lines are numbered from the
+top of the character to the bottom. On a 5x7 matrix, only the 5 lower
+bits of the first 7 bytes are used for each character. If the string
+is incomplete, only complete lines will be redefined. Here are some
+examples::
+
+ printf "\e[LG0010101050D1F0C04;" => 0 = [enter]
+ printf "\e[LG1040E1F0000000000;" => 1 = [up]
+ printf "\e[LG2000000001F0E0400;" => 2 = [down]
+ printf "\e[LG3040E1F001F0E0400;" => 3 = [up-down]
+ printf "\e[LG40002060E1E0E0602;" => 4 = [left]
+ printf "\e[LG500080C0E0F0E0C08;" => 5 = [right]
+ printf "\e[LG60016051516141400;" => 6 = "IP"
+
+ printf "\e[LG00103071F1F070301;" => big speaker
+ printf "\e[LG00002061E1E060200;" => small speaker
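+
+As a rough illustration of the encoding described above, the following Python
+sketch builds the escape sequence from a list of row bitmaps. The helper name
+``cgram_sequence`` is hypothetical and not part of any driver or tool; only
+the sequence format itself is taken from this text.

```python
def cgram_sequence(index, rows):
    """Build the '\\e[LG' sequence that redefines custom character `index`.

    `rows` lists the bitmap rows from top to bottom, one integer per row,
    with the least significant bit as the rightmost pixel.
    (Hypothetical helper, written only to illustrate the format.)
    """
    if not 0 <= index <= 7:
        raise ValueError("character number must be between 0 and 7")
    if len(rows) > 8:
        raise ValueError("at most 8 rows can be defined")
    if any(not 0 <= row <= 0xFF for row in rows):
        raise ValueError("each row must fit in one byte")
    # One digit for the character number, then two hex digits per row,
    # terminated by a semicolon.
    return "\x1b[LG%d%s;" % (index, "".join("%02X" % row for row in rows))


# Character 1, the [up] arrow from the examples above.
up_arrow = [0x04, 0x0E, 0x1F, 0x00, 0x00, 0x00, 0x00, 0x00]
print(repr(cgram_sequence(1, up_arrow)))
```

+Writing the returned string to the LCD device would have the same effect as
+the corresponding printf line; which device node to write to depends on the
+system and is not covered here.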
+
+Willy
diff --git a/Documentation/admin-guide/ldm.rst b/Documentation/admin-guide/ldm.rst
new file mode 100644
index 0000000..12c5713
--- /dev/null
+++ b/Documentation/admin-guide/ldm.rst
@@ -0,0 +1,121 @@
+==========================================
+LDM - Logical Disk Manager (Dynamic Disks)
+==========================================
+
+:Author: Originally Written by FlatCap - Richard Russon <ldm@flatcap.org>.
+:Last Updated: Anton Altaparmakov on 30 March 2007 for Windows Vista.
+
+Overview
+--------
+
+Windows 2000, XP, and Vista use a new partitioning scheme. It is a complete
+replacement for the MSDOS style partitions. It stores its information in a
+1MiB journalled database at the end of the physical disk. The size of
+partitions is limited only by disk space. The maximum number of partitions is
+nearly 2000.
+
+Any partitions created under the LDM are called "Dynamic Disks". There are no
+longer any primary or extended partitions. Normal MSDOS style partitions are
+now known as Basic Disks.
+
+If you wish to use Spanned, Striped, Mirrored or RAID 5 Volumes, you must use
+Dynamic Disks. The journalling allows Windows to make changes to these
+partitions and filesystems without the need to reboot.
+
+Once the LDM driver has divided up the disk, you can use the MD driver to
+assemble any multi-partition volumes, e.g. Stripes, RAID5.
+
+To prevent legacy applications from repartitioning the disk, the LDM creates a
+dummy MSDOS partition table containing one disk-sized partition. This is what
+is supported with the Linux LDM driver.
+
+A newer approach that has been implemented with Vista is to put LDM on top of a
+GPT label disk. This is not supported by the Linux LDM driver yet.
+
+
+Example
+-------
+
+Below we have a 50MiB disk, divided into seven partitions.
+
+.. note::
+
+ The missing 1MiB at the end of the disk is where the LDM database is
+ stored.
+
++-------++--------------+---------+-----++--------------+---------+----+
+|Device || Offset Bytes | Sectors | MiB || Size Bytes | Sectors | MiB|
++=======++==============+=========+=====++==============+=========+====+
+|hda || 0 | 0 | 0 || 52428800 | 102400 | 50|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda1 || 51380224 | 100352 | 49 || 1048576 | 2048 | 1|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda2 || 16384 | 32 | 0 || 6979584 | 13632 | 6|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda3 || 6995968 | 13664 | 6 || 10485760 | 20480 | 10|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda4 || 17481728 | 34144 | 16 || 4194304 | 8192 | 4|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda5 || 21676032 | 42336 | 20 || 5242880 | 10240 | 5|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda6 || 26918912 | 52576 | 25 || 10485760 | 20480 | 10|
++-------++--------------+---------+-----++--------------+---------+----+
+|hda7 || 37404672 | 73056 | 35 || 13959168 | 27264 | 13|
++-------++--------------+---------+-----++--------------+---------+----+
+
+The LDM Database may not store the partitions in the order that they appear on
+disk, but the driver will sort them.
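+
+The sector and MiB columns in the table follow directly from the byte values.
+The short Python sketch below reproduces that arithmetic. It assumes 512-byte
+sectors, which is what the table implies (52428800 bytes / 102400 sectors);
+the function name ``describe`` is only illustrative.

```python
SECTOR_SIZE = 512   # assumption: bytes per sector, as implied by the table
MIB = 1024 * 1024


def describe(nbytes):
    """Return (sectors, whole MiB) for a byte count, rounding MiB down."""
    return nbytes // SECTOR_SIZE, nbytes // MIB


# (offset, size) in bytes for hda2 and hda3, copied from the table.
hda2 = (16384, 6979584)
hda3 = (6995968, 10485760)

print(describe(hda2[1]))             # sectors and MiB of hda2's size
print(hda2[0] + hda2[1] == hda3[0])  # does hda3 start where hda2 ends?
```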
+
+When Linux boots, you will see something like::
+
+ hda: 102400 sectors w/32KiB Cache, CHS=50/64/32
+ hda: [LDM] hda1 hda2 hda3 hda4 hda5 hda6 hda7
+
+
+Compiling LDM Support
+---------------------
+
+To enable LDM, choose the following two options:
+
+ - "Advanced partition selection" CONFIG_PARTITION_ADVANCED
+ - "Windows Logical Disk Manager (Dynamic Disk) support" CONFIG_LDM_PARTITION
+
+If you believe the driver isn't working as it should, you can enable the extra
+debugging code. This will produce a LOT of output. The option is:
+
+ - "Windows LDM extra logging" CONFIG_LDM_DEBUG
+
+N.B. The partition code cannot be compiled as a module.
+
+As with all the partition code, if the driver doesn't see signs of its type of
+partition, it will pass control to another driver, so there is no harm in
+enabling it.
+
+If you have Dynamic Disks but don't enable the driver, then all you will see
+is a dummy MSDOS partition filling the whole disk. You won't be able to mount
+any of the volumes on the disk.
+
+
+Booting
+-------
+
+If you enable LDM support, then lilo is capable of booting from any of the
+discovered partitions. However, grub does not understand the LDM partitioning
+and cannot boot from a Dynamic Disk.
+
+
+More Documentation
+------------------
+
+There is an Overview of the LDM together with complete Technical Documentation.
+It is available for download.
+
+ http://www.linux-ntfs.org/
+
+If you have any LDM questions that aren't answered in the documentation, email
+me.
+
+Cheers,
+ FlatCap - Richard Russon
+ ldm@flatcap.org
+
diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst
new file mode 100644
index 0000000..290840c
--- /dev/null
+++ b/Documentation/admin-guide/lockup-watchdogs.rst
@@ -0,0 +1,83 @@
+===============================================================
+Softlockup detector and hardlockup detector (aka nmi_watchdog)
+===============================================================
+
+The Linux kernel can act as a watchdog to detect both soft and hard
+lockups.
+
+A 'softlockup' is defined as a bug that causes the kernel to loop in
+kernel mode for more than 20 seconds (see "Implementation" below for
+details), without giving other tasks a chance to run. The current
+stack trace is displayed upon detection and, by default, the system
+will stay locked up. Alternatively, the kernel can be configured to
+panic; a sysctl, "kernel.softlockup_panic", a kernel parameter,
+"softlockup_panic" (see "Documentation/admin-guide/kernel-parameters.rst" for
+details), and a compile option, "BOOTPARAM_SOFTLOCKUP_PANIC", are
+provided for this.
+
+A 'hardlockup' is defined as a bug that causes the CPU to loop in
+kernel mode for more than 10 seconds (see "Implementation" below for
+details), without letting other interrupts have a chance to run.
+Similarly to the softlockup case, the current stack trace is displayed
+upon detection and the system will stay locked up unless the default
+behavior is changed, which can be done through a sysctl,
+'hardlockup_panic', a compile time knob, "BOOTPARAM_HARDLOCKUP_PANIC",
+and a kernel parameter, "nmi_watchdog"
+(see "Documentation/admin-guide/kernel-parameters.rst" for details).
+
+The panic option can be used in combination with panic_timeout (this
+timeout is set through the confusingly named "kernel.panic" sysctl),
+to cause the system to reboot automatically after a specified amount
+of time.
+
+Implementation
+==============
+
+The soft and hard lockup detectors are built on top of the hrtimer and
+perf subsystems, respectively. A direct consequence of this is that,
+in principle, they should work in any architecture where these
+subsystems are present.
+
+A periodic hrtimer runs to generate interrupts and kick the watchdog
+task. An NMI perf event is generated every "watchdog_thresh"
+(compile-time initialized to 10 and configurable through sysctl of the
+same name) seconds to check for hardlockups. If any CPU in the system
+does not receive any hrtimer interrupt during that time the
+'hardlockup detector' (the handler for the NMI perf event) will
+generate a kernel warning or call panic, depending on the
+configuration.
+
+The watchdog task is a high priority kernel thread that updates a
+timestamp every time it is scheduled. If that timestamp is not updated
+for 2*watchdog_thresh seconds (the softlockup threshold) the
+'softlockup detector' (coded inside the hrtimer callback function)
+will dump useful debug information to the system log, after which it
+will call panic if it was instructed to do so or resume execution of
+other kernel code.
+
+The period of the hrtimer is 2*watchdog_thresh/5, which means it has
+two or three chances to generate an interrupt before the hardlockup
+detector kicks in.
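+
+The relationships between these values can be written down as a small Python
+sketch. It only restates the formulas from this section; the function and key
+names are made up for illustration and do not correspond to kernel symbols.

```python
def watchdog_timings(watchdog_thresh=10):
    """Derive the detector timings, in seconds, from watchdog_thresh."""
    hrtimer_period = 2 * watchdog_thresh / 5
    return {
        # The NMI perf event fires every watchdog_thresh seconds.
        "hardlockup_check_interval": watchdog_thresh,
        # The softlockup threshold is twice watchdog_thresh.
        "softlockup_threshold": 2 * watchdog_thresh,
        "hrtimer_period": hrtimer_period,
        # How many hrtimer periods fit into one NMI check interval.
        "hrtimer_periods_per_check": watchdog_thresh / hrtimer_period,
    }


print(watchdog_timings())
```

+The ratio between the check interval and the hrtimer period is 2.5 for any
+value of watchdog_thresh, which is why the hrtimer gets two or three chances
+to fire between two hardlockup checks.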
+
+As explained above, a kernel knob is provided that allows
+administrators to configure the period of the hrtimer and the perf
+event. The right value for a particular environment is a trade-off
+between fast response to lockups and detection overhead.
+
+By default, the watchdog runs on all online cores. However, on a
+kernel configured with NO_HZ_FULL, by default the watchdog runs only
+on the housekeeping cores, not the cores specified in the "nohz_full"
+boot argument. If we allowed the watchdog to run by default on
+the "nohz_full" cores, we would have to run timer ticks to activate
+the scheduler, which would prevent the "nohz_full" functionality
+from protecting the user code on those cores from the kernel.
+Of course, disabling it by default on the nohz_full cores means that
+when those cores do enter the kernel, by default we will not be
+able to detect if they lock up. However, allowing the watchdog
+to continue to run on the housekeeping (non-tickless) cores means
+that we will continue to detect lockups properly on those cores.
+
+In either case, the set of cores excluded from running the watchdog
+may be adjusted via the kernel.watchdog_cpumask sysctl. For
+nohz_full cores, this may be useful for debugging a case where the
+kernel seems to be hanging on the nohz_full cores.
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 84de718..3c51084 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -756,3 +756,6 @@
The cache mode for raid5. raid5 could include an extra disk for
caching. The mode can be "write-throuth" and "write-back". The
default is "write-through".
+
+ ppl_write_hint
+ NVMe stream ID to be set for each PPL write request.
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
new file mode 100644
index 0000000..4e06ffa
--- /dev/null
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -0,0 +1,25 @@
+=====================
+CMA Debugfs Interface
+=====================
+
+The CMA debugfs interface is useful to retrieve basic information out of the
+different CMA areas and to test allocation/release in each of the areas.
+
+Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
+kernel's CMA index. So the first CMA zone would be:
+
+ <debugfs>/cma/cma-0
+
+The structure of the files created under that directory is as follows:
+
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] count: Amount of memory in the CMA area.
+ - [RO] order_per_bit: Order of pages represented by one bit.
+ - [RO] bitmap: The bitmap of page states in the zone.
+ - [WO] alloc: Allocate N pages from that CMA area. For example::
+
+ echo 5 > <debugfs>/cma/cma-2/alloc
+
+would try to allocate 5 pages from the cma-2 area.
+
+ - [WO] free: Free N pages from that CMA area, similar to the above.
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index 291699c..c2531b1 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -4,13 +4,13 @@
Concepts overview
=================
-The memory management in Linux is complex system that evolved over the
-years and included more and more functionality to support variety of
+The memory management in Linux is a complex system that evolved over the
+years and included more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
-management for systems without MMU is called ``nommu`` and it
+management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
-here we assume that MMU is available and CPU can translate a virtual
+here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.
.. contents:: :local:
@@ -21,10 +21,10 @@
The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
-necessary contiguous, it might be accessible as a set of distinct
+necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
-different implementations of the same architecture have different view
-how these address ranges defined.
+different implementations of the same architecture have different views
+of how these address ranges are defined.
All this makes dealing directly with physical memory quite complex and
to avoid this complexity a concept of virtual memory was developed.
@@ -48,8 +48,8 @@
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
-translation from virtual address used by programs to real address in
-the physical memory. The page tables organized hierarchically.
+translation from a virtual address used by programs to the physical
+memory address. The page tables are organized hierarchically.
The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
@@ -121,8 +121,8 @@
Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
-processor. Each bank is referred as `node` and for each node Linux
-constructs an independent memory management subsystem. A node has it's
+processor. Each bank is referred to as a `node` and for each node Linux
+constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
@@ -149,9 +149,9 @@
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
-page filled with zeroes. When the program performs a write, regular
+page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
-will be marked dirty and if the kernel will decide to repurpose it,
+will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.
Reclaim
@@ -181,8 +181,8 @@
The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
-of the system. When system is not loaded, most of the memory is free
-and allocation request will be satisfied immediately from the free
+of the system. When the system is not loaded, most of the memory is free
+and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (high watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
@@ -190,7 +190,7 @@
they contain is available elsewhere, or evict to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
-will trigger the `direct reclaim`. In this case allocation is stalled
+will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.
Compaction
@@ -200,7 +200,7 @@
fragmented. Although with virtual memory it is possible to present
scattered physical pages as virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
-need may arise, for instance, when a device driver requires large
+need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
@@ -208,15 +208,16 @@
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.
-Like reclaim, the compaction may happen asynchronously in ``kcompactd``
-daemon or synchronously as a result of memory allocation request.
+Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
+daemon or synchronously as a result of a memory allocation request.
OOM killer
==========
-It may happen, that on a loaded machine memory will be exhausted. When
-the kernel detects that the system runs out of memory (OOM) it invokes
-`OOM killer`. Its mission is simple: all it has to do is to select a
-task to sacrifice for the sake of the overall system health. The
-selected task is killed in a hope that after it exits enough memory
-will be freed to continue normal operation.
+It is possible that on a loaded machine memory will be exhausted and the
+kernel will be unable to reclaim enough memory to continue to operate. In
+order to save the rest of the system, it invokes the `OOM killer`.
+
+The `OOM killer` selects a task to sacrifice for the sake of the overall
+system health. The selected task is killed in a hope that after it exits
+enough memory will be freed to continue normal operation.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index ceead68..11db464 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -11,7 +11,7 @@
Linux memory management is a complex system with many configurable
settings. Most of these settings are available via ``/proc``
filesystem and can be quired and adjusted using ``sysctl``. These APIs
-are described in Documentation/sysctl/vm.txt and in `man 5 proc`_.
+are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
@@ -26,10 +26,13 @@
:maxdepth: 1
concepts
+ cma_debugfs
hugetlbpage
idle_page_tracking
ksm
+ memory-hotplug
numa_memory_policy
+ numaperf
pagemap
soft-dirty
transhuge
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 9303786..874eb0c 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -59,7 +59,7 @@
If a region of memory must be split into at least one new MADV_MERGEABLE
or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
-will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.txt).
+will exceed ``vm.max_map_count`` (see Documentation/admin-guide/sysctl/vm.rst).
Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
new file mode 100644
index 0000000..5c4432c
--- /dev/null
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -0,0 +1,444 @@
+.. _admin_guide_memory_hotplug:
+
+==============
+Memory Hotplug
+==============
+
+:Created: Jul 28 2007
+:Updated: Add some details about locking internals: Aug 20 2018
+
+This document describes memory hotplug, including how to use it and its
+current status. Because memory hotplug is still under development, the
+contents of this text will change often.
+
+.. contents:: :local:
+
+.. note::
+
+    (1) x86_64 has a special implementation of memory hotplug.
+        This text does not describe it.
+    (2) This text assumes that sysfs is mounted at ``/sys``.
+
+
+Introduction
+============
+
+Purpose of memory hotplug
+-------------------------
+
+Memory Hotplug allows users to increase/decrease the amount of memory.
+Generally, there are two purposes.
+
+(A) For changing the amount of memory.
+ This is to allow a feature like capacity on demand.
+(B) For installing/removing DIMMs or NUMA-nodes physically.
+ This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+
+(A) is required by highly virtualized environments and (B) is required by
+hardware which supports memory power management.
+
+Linux memory hotplug is designed for both purposes.
+
+Phases of memory hotplug
+------------------------
+
+There are 2 phases in Memory Hotplug:
+
+ 1) Physical Memory Hotplug phase
+ 2) Logical Memory Hotplug phase.
+
+The first phase communicates with the hardware/firmware and sets up or
+tears down the environment for hotplugged memory. This phase is necessary
+for purpose (B), but it is also a good phase for communication in
+highly virtualized environments.
+
+When memory is hotplugged, the kernel recognizes the new memory, creates new
+memory management tables, and creates sysfs files for operating on the new
+memory.
+
+If the firmware supports notifying the OS when new memory is connected,
+this phase is triggered automatically. ACPI can deliver this notification.
+If not, the system administrator uses the "probe" operation instead
+(see :ref:`memory_hotplug_physical_mem`).
+
+The logical memory hotplug phase changes the memory state to
+available/unavailable for users. The amount of memory from the user's view
+is changed by this phase. When a memory range becomes available, the kernel
+turns all memory in it into free pages.
+
+In this document, this phase is described as online/offline.
+
+The logical memory hotplug phase is triggered when the system administrator
+writes to a sysfs file. For the hot-add case, it must be executed by hand
+after the physical hotplug phase.
+(However, if you write udev hotplug scripts for memory hotplug, these
+phases can be executed seamlessly.)
+
+Unit of Memory online/offline operation
+---------------------------------------
+
+Memory hotplug uses the SPARSEMEM memory model, which allows memory to be
+divided into chunks of the same size. These chunks are called "sections". The
+size of a memory section is architecture dependent. For example, power uses
+16MiB and ia64 uses 1GiB.
+
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and represents the logical
+unit upon which memory online/offline operations are to be performed. The
+default size of a memory block is the same as memory section size unless an
+architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+
+To determine the size (in bytes) of a memory block please read this file::
+
+ /sys/devices/system/memory/block_size_bytes
+
+Kernel Configuration
+====================
+
+To use the memory hotplug feature, the kernel must be compiled with the
+following config options.
+
+- For all memory hotplug:
+ - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
+ - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
+
+- To enable memory removal, the following are also necessary:
+ - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
+ - Page Migration (``CONFIG_MIGRATION``)
+
+- For ACPI memory hotplug, the following are also necessary:
+ - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
+ - This option can be built as a kernel module.
+
+- As a related configuration, if your machine supports NUMA-node hotplug
+ via ACPI, then this option is necessary too.
+
+ - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
+ (``CONFIG_ACPI_CONTAINER``).
+
+ This option can be built as a kernel module too.
+
+
+.. _memory_hotplug_sysfs_files:
+
+sysfs files for memory hotplug
+==============================
+
+All memory blocks have their device information in sysfs. Each memory block
+is described under ``/sys/devices/system/memory`` as::
+
+ /sys/devices/system/memory/memoryXXX
+
+where XXX is the memory block id.
+
+For the memory block covered by the sysfs directory, it is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
+
+For example, assume a 1GiB memory block size. The device for memory starting at
+0x100000000 is ``/sys/devices/system/memory/memory4``::
+
+    (0x100000000 / 1GiB = 4)
+
+This device covers address range [0x100000000 ... 0x140000000)
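+
+The same calculation can be sketched in Python. The function name
+``memory_block`` is hypothetical; the block size is passed in as a parameter
+because, as noted above, it is architecture dependent.

```python
GIB = 1 << 30


def memory_block(phys_addr, block_size):
    """Return (block id, start, end) of the block containing phys_addr.

    The covered range is half-open: [start, end).
    """
    block_id = phys_addr // block_size
    start = block_id * block_size
    return block_id, start, start + block_size


# The example from the text: 1GiB blocks, memory starting at 0x100000000.
block_id, start, end = memory_block(0x100000000, GIB)
print("memory%d covers [%#x ... %#x)" % (block_id, start, end))
```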
+
+Under each memory block, you can see 5 files:
+
+- ``/sys/devices/system/memory/memoryXXX/phys_index``
+- ``/sys/devices/system/memory/memoryXXX/phys_device``
+- ``/sys/devices/system/memory/memoryXXX/state``
+- ``/sys/devices/system/memory/memoryXXX/removable``
+- ``/sys/devices/system/memory/memoryXXX/valid_zones``
+
+=================== ============================================================
+``phys_index``      read-only and contains memory block id, same as XXX.
+``state``           read-write
+
+                    - at read:  contains online/offline state of memory.
+                    - at write: user can specify "online_kernel",
+                      "online_movable", "online", "offline" command
+                      which will be performed on all sections in the block.
+``phys_device``     read-only: designed to show the name of physical memory
+                    device.  This is not well implemented now.
+``removable``       read-only: contains an integer value indicating
+                    whether the memory block is removable or not
+                    removable.  A value of 1 indicates that the memory
+                    block is removable and a value of 0 indicates that
+                    it is not removable. A memory block is removable only if
+                    every section in the block is removable.
+``valid_zones``     read-only: designed to show which zones this memory block
+                    can be onlined to.
+
+                    The first column shows its default zone.
+
+                    "memory6/valid_zones: Normal Movable" shows this memory
+                    block can be onlined to ZONE_NORMAL by default and to
+                    ZONE_MOVABLE by online_movable.
+
+                    "memory7/valid_zones: Movable Normal" shows this memory
+                    block can be onlined to ZONE_MOVABLE by default and to
+                    ZONE_NORMAL by online_kernel.
+=================== ============================================================
+
+.. note::
+
+ These directories/files appear after the physical memory hotplug phase.
+
+If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
+via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
+
+For example::
+
+ /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+A backlink will also be created::
+
+ /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+
+.. _memory_hotplug_physical_mem:
+
+Physical memory hot-add phase
+=============================
+
+Hardware(Firmware) Support
+--------------------------
+
+On x86_64/ia64 platforms, memory hotplug via ACPI is supported.
+
+In general, firmware (ACPI) which supports memory hotplug defines a
+memory class object with _HID "PNP0C80". When a notify is asserted to PNP0C80,
+Linux's ACPI handler hot-adds the memory to the system and calls a hotplug udev
+script. This is done automatically.
+
+However, scripts for memory hotplug are not (yet) included in the generic udev
+package. You may have to write them yourself or online/offline memory by hand.
+Please see :ref:`memory_hotplug_how_to_online_memory` and
+:ref:`memory_hotplug_how_to_offline_memory`.
+
+If the firmware supports NUMA-node hotplug and defines an object with _HID
+"ACPI0004", "PNP0A05", or "PNP0A06", a notification is asserted to it, and the
+ACPI handler calls the hotplug code for all objects which are defined in it.
+If a memory device is found, the memory hotplug code will be called.
+
+Notify memory hot-add event by hand
+-----------------------------------
+
+On some architectures, the firmware may not notify the kernel of a memory
+hotplug event. Therefore, the memory "probe" interface is supported to
+explicitly notify the kernel. This interface depends on
+CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
+if hotplug is supported, although for x86 this should be handled by ACPI
+notification.
+
+Probe interface is located at::
+
+ /sys/devices/system/memory/probe
+
+You can tell the physical address of new memory to the kernel by::
+
+ % echo start_address_of_new_memory > /sys/devices/system/memory/probe
+
+Then, the memory range [start_address_of_new_memory,
+start_address_of_new_memory + memory_block_size) is hot-added. In this case,
+the hotplug script is not called (in the current implementation). You'll have
+to online the memory yourself. Please see
+:ref:`memory_hotplug_how_to_online_memory`.
+
+Logical Memory hot-add phase
+============================
+
+State of memory
+---------------
+
+To see (online/offline) state of a memory block, read 'state' file::
+
+ % cat /sys/devices/system/memory/memoryXXX/state
+
+
+- If the memory block is online, you'll read "online".
+- If the memory block is offline, you'll read "offline".
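+
+To check many blocks at once, the state files can be read in a loop. The
+following Python sketch is one possible way to do that; the function name
+``block_states`` is hypothetical, and the ``root`` parameter exists only so
+that the code can be pointed at a different directory for testing.

```python
import os
import re


def block_states(root="/sys/devices/system/memory"):
    """Map each memoryXXX directory under `root` to the content of its
    'state' file (hypothetical helper, for illustration only)."""
    states = {}
    for name in sorted(os.listdir(root)):
        if not re.fullmatch(r"memory\d+", name):
            continue  # skip other entries such as block_size_bytes or probe
        with open(os.path.join(root, name, "state")) as f:
            states[name] = f.read().strip()
    return states
```

+Calling ``block_states()`` without arguments reads the real sysfs tree and
+returns a dictionary such as one mapping "memory4" to "online".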
+
+
+.. _memory_hotplug_how_to_online_memory:
+
+How to online memory
+--------------------
+
+When the memory is hot-added, the kernel decides whether or not to "online"
+it according to the policy which can be read from "auto_online_blocks" file::
+
+ % cat /sys/devices/system/memory/auto_online_blocks
+
+The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
+option. If it is disabled the default is "offline" which means the newly added
+memory is not in a ready-to-use state and you have to "online" the newly added
+memory blocks manually. Automatic onlining can be requested by writing "online"
+to "auto_online_blocks" file::
+
+ % echo online > /sys/devices/system/memory/auto_online_blocks
+
+This sets a global policy and impacts all memory blocks that will subsequently
+be hotplugged. Currently offline blocks keep their state. It is possible, under
+certain circumstances, that some memory blocks will be added but will fail to
+online. User space tools can check their "state" files
+(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
+
+If the automatic onlining wasn't requested, failed, or some memory block was
+offlined it is possible to change the individual block's state by writing to the
+"state" file::
+
+ % echo online > /sys/devices/system/memory/memoryXXX/state
+
+This onlining will not change the ZONE type of the target memory block.
+If the memory block doesn't belong to any zone, an appropriate kernel zone
+(usually ZONE_NORMAL) will be used, unless the movable_node kernel command line
+option is specified, in which case ZONE_MOVABLE will be used.
+
+You can explicitly request to associate it with ZONE_MOVABLE by::
+
+ % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+
+.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
+
+Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+
+ % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+
+.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
+
+An explicit zone onlining can fail (e.g. when the range is already within
+an existing and incompatible zone).
+
+After this, memory block XXX's state will be 'online' and the amount of
+available memory will be increased.
+
+This may be changed in future.
+
+Logical memory remove
+=====================
+
+Memory offline and ZONE_MOVABLE
+-------------------------------
+
+Memory offlining is more complicated than memory online. Because memory offline
+has to make the whole memory block be unused, memory offline can fail if
+the memory block includes memory which cannot be freed.
+
+In general, memory offline can use 2 techniques.
+
+(1) reclaim and free all memory in the memory block.
+(2) migrate all pages in the memory block.
+
+In the current implementation, Linux's memory offline uses method (2), freeing
+all pages in the memory block by page migration. But not all pages are
+migratable. Under current Linux, migratable pages are anonymous pages and
+page caches. For offlining a memory block by migration, the kernel has to
+guarantee that the memory block contains only migratable pages.
+
+Now, a boot option for making a memory block which consists of migratable pages
+is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
+create ZONE_MOVABLE...a zone which is just used for movable pages.
+(See also Documentation/admin-guide/kernel-parameters.rst)
+
+Assuming the system has "TOTAL" amount of memory at boot time, this boot option
+creates ZONE_MOVABLE as follows.
+
+1) When kernelcore=YYYY boot option is used,
+ Size of memory not for movable pages (not for offline) is YYYY.
+ Size of memory for movable pages (for offline) is TOTAL-YYYY.
+
+2) When movablecore=ZZZZ boot option is used,
+ Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
+ Size of memory for movable pages (for offline) is ZZZZ.
+
+.. note::
+
+ Unfortunately, there is no information to show which memory block belongs
+ to ZONE_MOVABLE. This is TBD.
+
+.. _memory_hotplug_how_to_offline_memory:
+
+How to offline memory
+---------------------
+
+You can offline a memory block by using the same sysfs interface that was used
+in memory onlining::
+
+ % echo offline > /sys/devices/system/memory/memoryXXX/state
+
+If offline succeeds, the state of the memory block is changed to be "offline".
+If it fails, some error code (like -EBUSY) will be returned by the kernel.
+Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
+it. If it doesn't contain 'unmovable' memory, you'll get success.
+
+A memory block under ZONE_MOVABLE is considered to be able to be offlined
+easily. But under some busy state, it may return -EBUSY. Even if a memory
+block cannot be offlined due to -EBUSY, you can retry offlining it and may be
+able to offline it (or not). (For example, a page is referred to by some kernel
+internal call and released soon.)
+
+Consideration:
+ Memory hotplug's design direction is to make the possibility of memory
+ offlining higher and to guarantee unplugging memory under any situation. But
+ it needs more work. Returning -EBUSY in some situations may be good because
+ the user can then decide whether to retry. Currently, the memory
+ offlining code does some amount of retry with a 120 second timeout.
+
+Physical memory remove
+======================
+
+Need more implementation yet....
+ - Notification completion of remove works by OS to firmware.
+ - Guard from remove if not yet.
+
+
+Locking Internals
+=================
+
+When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
+the device_hotplug_lock should be held to:
+
+- synchronize against online/offline requests (e.g. via sysfs). This way, memory
+ block devices can only be accessed (.online/.state attributes) by user
+ space once memory has been fully added. And when removing memory, we
+ know nobody is in critical sections.
+- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
+
+Especially, there is a possible lock inversion that is avoided using
+device_hotplug_lock when adding memory and user space tries to online that
+memory faster than expected:
+
+- device_online() will first take the device_lock(), followed by
+ mem_hotplug_lock
+- add_memory_resource() will first take the mem_hotplug_lock, followed by
+ the device_lock() (while creating the devices, during bus_add_device()).
+
+As the device is visible to user space before taking the device_lock(), this
+can result in a lock inversion.
+
+Onlining/offlining of memory should be done via device_online()/
+device_offline() - to make sure it is properly synchronized to actions
+via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type).
+
+When adding/removing/onlining/offlining memory or adding/removing
+heterogeneous/device memory, we should always hold the mem_hotplug_lock in
+write mode to serialise memory hotplug (e.g. access to global/zone
+variables).
+
+In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
+mode allows for a quite efficient get_online_mems/put_online_mems
+implementation, so code accessing memory can protect from that memory
+vanishing.
+
+
+Future Work
+===========
+
+ - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
+ sysctl or new control file.
+ - showing memory block and physical device relationship.
+ - test and improve memory offlining.
+ - support HugeTLB page migration and offlining.
+ - memmap removing at memory offline.
+ - physical remove memory.
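The manual onlining step described under "How to online memory" in the memory hotplug document above can be sketched as a small POSIX shell function. This is an illustration under stated assumptions, not part of the patch: the function name `online_offline_blocks` and its optional directory argument are hypothetical (the argument exists only so the loop can be tried against a scratch tree), while the `memoryXXX/state` files and the "online"/"offline" values come from the document.

```shell
# Hypothetical helper: write "online" to every memory block whose state
# file currently reads "offline".
# $1 optionally overrides the sysfs memory directory (for dry runs).
online_offline_blocks() {
    dir="${1:-/sys/devices/system/memory}"
    for state in "$dir"/memory*/state; do
        [ -f "$state" ] || continue      # glob matched nothing
        if [ "$(cat "$state")" = "offline" ]; then
            # The write can fail (for example with -EBUSY or because of an
            # incompatible zone); report the block and keep going.
            if echo online > "$state" 2>/dev/null; then
                echo "onlined ${state%/state}"
            else
                echo "failed ${state%/state}" >&2
            fi
        fi
    done
}
```

On a real system the function would be called as root without an argument. Blocks that fail to online are only reported, which matches the document's advice to check the "state" files and retry manually.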
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index d78c5b3..8463f55 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -15,7 +15,7 @@
support.
Memory policies should not be confused with cpusets
-(``Documentation/cgroup-v1/cpusets.txt``)
+(``Documentation/admin-guide/cgroup-v1/cpusets.rst``)
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
new file mode 100644
index 0000000..a80c3c3
--- /dev/null
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -0,0 +1,170 @@
+.. _numaperf:
+
+=============
+NUMA Locality
+=============
+
+Some platforms may have multiple types of memory attached to a compute
+node. These disparate memory ranges may share some characteristics, such
+as CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supports such heterogeneous memory by grouping each memory type
+under different domains, or "nodes", based on locality and performance
+characteristics. Some memory may share the same node as a CPU, and others
+are provided as memory only nodes. While memory only nodes do not provide
+CPUs, they may still be local to one or more compute nodes relative to
+other nodes. The following diagram shows one such example of two compute
+nodes with local memory and a memory only node for each compute node::
+
+ +------------------+     +------------------+
+ | Compute Node 0   +-----+ Compute Node 1   |
+ | Local Node0 Mem  |     | Local Node1 Mem  |
+ +--------+---------+     +--------+---------+
+          |                        |
+ +--------+---------+     +--------+---------+
+ | Slower Node2 Mem |     | Slower Node3 Mem |
+ +------------------+     +------------------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests.
+A "memory target" is a node containing one or more physical address
+ranges accessible from one or more memory initiators.
+
+When multiple memory initiators exist, they may not all have the same
+performance when accessing a given memory target. Each initiator-target
+pair may be organized into different ranked access classes to represent
+this relationship. The highest performing initiator to a given target
+is considered to be one of that target's local initiators, and given
+the highest access class, 0. Any given target may have one or more
+local initiators, and any given initiator may have multiple local
+memory targets.
+
+To aid applications matching memory targets with their initiators, the
+kernel provides symlinks between them. The following example lists the
+relationship for the access class "0" memory initiators and targets::
+
+ # symlinks -v /sys/devices/system/node/nodeX/access0/targets/
+ relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
+
+ # symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
+ relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
+
+A memory initiator may have multiple memory targets in the same access
+class. The target memory's initiators in a given class indicate that the
+nodes' access characteristics share the same performance relative to other
+linked initiator nodes. The targets within an initiator's access class,
+though, do not necessarily perform the same as each other.
+
+================
+NUMA Performance
+================
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If
+the system provides these attributes, the kernel exports them under the
+node sysfs hierarchy by appending the attributes directory under the
+memory node's access class 0 initiators as follows::
+
+ /sys/devices/system/node/nodeY/access0/initiators/
+
+These attributes apply only when accessed from nodes that are linked
+under this access class's initiators.
+
+The performance characteristics the kernel provides for the local initiators
+are exported as follows::
+
+ # tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
+ /sys/devices/system/node/nodeY/access0/initiators/
+ |-- read_bandwidth
+ |-- read_latency
+ |-- write_bandwidth
+ `-- write_latency
+
+The bandwidth attributes are provided in MiB/second.
+
+The latency attributes are provided in nanoseconds.
+
+The values reported here correspond to the rated latency and bandwidth
+for the platform.
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide large address space of
+slower performing memory cached by a smaller higher performing memory. The
+system physical addresses memory initiators are aware of are provided
+by the last memory level in the hierarchy. The system meanwhile uses
+higher performing memory to transparently cache access to progressively
+slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing
+initiator access, and the term "near memory" represents the fastest
+cache provided by the system.
+
+This numbering is different than CPU caches where the cache level (ex:
+L1, L2, L3) uses the CPU-side view where each increased level is lower
+performing. In contrast, the memory cache level is centric to the last
+level memory, so the higher numbered cache level corresponds to memory
+nearer to the CPU, and further from far memory.
+
+The memory-side caches are not directly addressable by software. When
+software accesses a system address, the system will return it from the
+near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+An application does not need to know about caching attributes in order
+to use the system. Software may optionally query the memory cache
+attributes in order to maximize the performance out of such a setup.
+If the system provides a way for the kernel to discover this information,
+for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
+the kernel will append these attributes to the NUMA node memory target.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+ /sys/devices/system/node/nodeX/memory_side_cache/
+
+If that directory is not present, the system either does not provide
+a memory-side cache, or that information is not accessible to the kernel.
+
+The attributes for each level of cache are provided under its cache
+level index::
+
+ /sys/devices/system/node/nodeX/memory_side_cache/indexA/
+ /sys/devices/system/node/nodeX/memory_side_cache/indexB/
+ /sys/devices/system/node/nodeX/memory_side_cache/indexC/
+
+Each cache level's directory provides its attributes. For example, the
+following shows a single cache level and the attributes available for
+software to query::
+
+ # tree /sys/devices/system/node/node0/memory_side_cache/
+ /sys/devices/system/node/node0/memory_side_cache/
+ |-- index1
+ | |-- indexing
+ | |-- line_size
+ | |-- size
+ | `-- write_policy
+
+The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
+for any other index-based, multi-way associative cache.
+
+The "line_size" is the number of bytes accessed from the next cache
+level on a miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
+
+========
+See Also
+========
+
+[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
+- Section 5.2.27
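The access class 0 attributes described in the NUMA Performance section above can be read with a short POSIX shell function. This is a sketch under stated assumptions: the function name `show_node_perf` and its optional second argument (an alternative root directory, useful for trying it on a scratch tree) are hypothetical. The attribute names, their location under `access0/initiators/`, and the units (MiB/second and nanoseconds) are taken from the document.

```shell
# Hypothetical helper: print the access class 0 performance attributes of
# one memory target node.
# $1 is the node number; $2 optionally overrides /sys/devices/system/node.
show_node_perf() {
    base="${2:-/sys/devices/system/node}/node$1/access0/initiators"
    for attr in read_bandwidth write_bandwidth read_latency write_latency; do
        # Attributes exist only if the platform provides them.
        if [ -r "$base/$attr" ]; then
            case "$attr" in
                *_bandwidth) unit="MiB/s" ;;
                *_latency)   unit="ns" ;;
            esac
            echo "node$1 $attr: $(cat "$base/$attr") $unit"
        fi
    done
}
```

Calling `show_node_perf 1` prints nothing on systems that do not export these attributes, so an empty result should be read as "not provided" rather than as an error.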
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 3f7bade..340a5ae 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -75,9 +75,10 @@
20. NOPAGE
21. KSM
22. THP
- 23. BALLOON
+ 23. OFFLINE
24. ZERO_PAGE
25. IDLE
+ 26. PGTABLE
* ``/proc/kpagecgroup``. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
@@ -118,8 +119,8 @@
identical memory pages dynamically shared between one or more processes
22 - THP
contiguous pages which construct transparent hugepages
-23 - BALLOON
- balloon compaction page
+23 - OFFLINE
+ page is logically offline
24 - ZERO_PAGE
zero page for pfn_zero or huge_zero page
25 - IDLE
@@ -128,6 +129,8 @@
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
+26 - PGTABLE
+ page is in use as a page table
IO related page flags
---------------------
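The flag numbers touched by the pagemap change above are bit positions in a per-page flags word. As a rough sketch, the POSIX shell function below decodes bits 20 through 26 of such a value using the names listed in the document. The function name `decode_kpageflags` is hypothetical, and extracting the 64-bit word for a given PFN from `/proc/kpageflags` is deliberately left out.

```shell
# Hypothetical helper: print the names of the flags (bits 20..26 only)
# that are set in a flags value given as $1 (decimal or 0x-prefixed hex).
decode_kpageflags() {
    flags=$(( $1 ))
    names=""
    for entry in 20:NOPAGE 21:KSM 22:THP 23:OFFLINE 24:ZERO_PAGE 25:IDLE 26:PGTABLE; do
        bit=${entry%%:*}     # text before the colon: bit position
        name=${entry#*:}     # text after the colon: flag name
        if [ $(( (flags >> bit) & 1 )) -eq 1 ]; then
            names="${names:+$names }$name"
        fi
    done
    echo "$names"
}
```

For example, `decode_kpageflags 0x4800000` reports the two flags this change introduces, OFFLINE (bit 23) and PGTABLE (bit 26).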
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8..bd57145 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -53,7 +53,7 @@
collapses sequences of basic pages into huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
-interface and using madivse(2) and prctl(2) system calls.
+interface and using madvise(2) and prctl(2) system calls.
Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
diff --git a/Documentation/admin-guide/namespaces/compatibility-list.rst b/Documentation/admin-guide/namespaces/compatibility-list.rst
new file mode 100644
index 0000000..318800b
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/compatibility-list.rst
@@ -0,0 +1,43 @@
+=============================
+Namespaces compatibility list
+=============================
+
+This document contains information about the problems a user
+may have when creating tasks living in different namespaces.
+
+Here's the summary. This matrix shows the known problems that
+occur when tasks share some namespace (the columns) while living
+in different other namespaces (the rows):
+
+====  ===  ===  ===  ===  ====  ===
+-     UTS  IPC  VFS  PID  User  Net
+====  ===  ===  ===  ===  ====  ===
+UTS   X
+IPC        X    1
+VFS             X
+PID        1    1    X
+User       2    2         X
+Net                             X
+====  ===  ===  ===  ===  ====  ===
+
+1. Both the IPC and the PID namespaces provide IDs to address
+ objects inside the kernel. E.g. semaphore with IPCID or
+ process group with pid.
+
+ In both cases, tasks shouldn't try exposing this ID to some
+ other task living in a different namespace via a shared filesystem
+ or IPC shmem/message. The fact is that this ID is only valid
+ within the namespace it was obtained in and may refer to some
+ other object in another namespace.
+
+2. Intentionally, two equal user IDs in different user namespaces
+ should not be equal from the VFS point of view. In other
+ words, user 10 in one user namespace shouldn't have the same
+ access permissions to files, belonging to user 10 in another
+ namespace.
+
+ The same is true for the IPC namespaces being shared - two users
+ from different user namespaces should not access the same IPC objects
+ even having equal UIDs.
+
+ But currently this is not so.
diff --git a/Documentation/admin-guide/namespaces/index.rst b/Documentation/admin-guide/namespaces/index.rst
new file mode 100644
index 0000000..384f2e0
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/index.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Namespaces
+==========
+
+.. toctree::
+ :maxdepth: 1
+
+ compatibility-list
+ resource-control
diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst
new file mode 100644
index 0000000..369556e
--- /dev/null
+++ b/Documentation/admin-guide/namespaces/resource-control.rst
@@ -0,0 +1,18 @@
+===========================
+Namespaces resource control
+===========================
+
+There are a lot of kinds of objects in the kernel that don't have
+individual limits or that have limits that are ineffective when a set
+of processes is allowed to switch user ids. With user namespaces
+enabled in a kernel for people who don't trust their users or their
+users' programs to play nice, this problem becomes more acute.
+
+Therefore it is recommended that memory control groups be enabled in
+kernels that enable user namespaces, and it is further recommended
+that userspace configure memory control groups to limit how much
+memory users they don't trust to play nice can use.
+
+Memory control groups can be configured by installing the libcgroup
+package present on most distros, editing /etc/cgrules.conf and
+/etc/cgconfig.conf, and setting up libpam-cgroup.
diff --git a/Documentation/admin-guide/numastat.rst b/Documentation/admin-guide/numastat.rst
new file mode 100644
index 0000000..aaf1667
--- /dev/null
+++ b/Documentation/admin-guide/numastat.rst
@@ -0,0 +1,30 @@
+===============================
+NUMA policy hit/miss statistics
+===============================
+
+/sys/devices/system/node/node*/numastat
+
+All units are pages. Hugepages have separate counters.
+
+=============== ============================================================
+numa_hit A process wanted to allocate memory from this node,
+ and succeeded.
+
+numa_miss A process wanted to allocate memory from another node,
+ but ended up with memory from this node.
+
+numa_foreign A process wanted to allocate on this node,
+ but ended up with memory from another one.
+
+local_node A process ran on this node and got memory from it.
+
+other_node A process ran on this node and got memory from another node.
+
+interleave_hit Interleaving wanted to allocate from this node
+ and succeeded.
+=============== ============================================================
+
+For easier reading you can use the numastat utility from the numactl package
+(http://oss.sgi.com/projects/libnuma/). Note that it only works
+well right now on machines with a small number of CPUs.
+
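The counters in the numastat table above are plain "name value" lines, so they are easy to post-process. The sketch below computes, for one node, the percentage of allocated pages that were counted as `numa_miss` rather than `numa_hit`. Both the function name `numa_miss_percent` and this particular ratio are illustrative choices, not something the document defines. The default path assumes node0 exists.

```shell
# Hypothetical helper: print numa_miss as an integer percentage of
# (numa_hit + numa_miss) for one numastat file.
# $1 optionally overrides the file to read.
numa_miss_percent() {
    awk '$1 == "numa_hit"  { hit = $2 }
         $1 == "numa_miss" { miss = $2 }
         END {
             total = hit + miss
             if (total > 0)
                 printf "%d\n", miss * 100 / total
             else
                 print 0
         }' "${1:-/sys/devices/system/node/node0/numastat}"
}
```

A persistently high value suggests that processes often ask for memory on other nodes and fall back to this one, which may be worth checking against the memory policy in use.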
diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
new file mode 100644
index 0000000..72effa7
--- /dev/null
+++ b/Documentation/admin-guide/perf-security.rst
@@ -0,0 +1,230 @@
+.. _perf_security:
+
+Perf Events and tool security
+=============================
+
+Overview
+--------
+
+Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_
+can pose a considerable risk of leaking sensitive data accessed by
+monitored processes. The data leakage is possible both in scenarios of
+direct usage of perf_events system call API [2]_ and over data files
+generated by Perf tool user mode utility (Perf) [3]_ , [4]_ . The risk
+depends on the nature of data that perf_events performance monitoring
+units (PMU) [2]_ and Perf collect and expose for performance analysis.
+Collected system and performance data may be split into several
+categories:
+
+1. System hardware and software configuration data, for example: a CPU
+ model and its cache configuration, an amount of available memory and
+ its topology, used kernel and Perf versions, performance monitoring
+ setup including experiment time, events configuration, Perf command
+ line parameters, etc.
+
+2. User and kernel module paths and their load addresses with sizes,
+ process and thread names with their PIDs and TIDs, timestamps for
+ captured hardware and software events.
+
+3. Content of kernel software counters (e.g., for context switches, page
+ faults, CPU migrations), architectural hardware performance counters
+ (PMC) [8]_ and machine specific registers (MSR) [9]_ that provide
+ execution metrics for various monitored parts of the system (e.g.,
+ memory controller (IMC), interconnect (QPI/UPI) or peripheral (PCIe)
+ uncore counters) without direct attribution to any execution context
+ state.
+
+4. Content of architectural execution context registers (e.g., RIP, RSP,
+ RBP on x86_64), process user and kernel space memory addresses and
+ data, content of various architectural MSRs that capture data from
+ this category.
+
+Data that belong to the fourth category can potentially contain
+sensitive process data. If PMUs in some monitoring modes capture values
+of execution context registers or data from process memory, then access
+to such monitoring capabilities needs to be ordered and secured
+properly. So, perf_events/Perf performance monitoring is subject to
+security access control management [5]_ .
+
+perf_events/Perf access control
+-------------------------------
+
+To perform security checks, the Linux implementation splits processes
+into two categories [6]_ : a) privileged processes (whose effective user
+ID is 0, referred to as superuser or root), and b) unprivileged
+processes (whose effective UID is nonzero). Privileged processes bypass
+all kernel security permission checks so perf_events performance
+monitoring is fully available to privileged processes without access,
+scope and resource restrictions.
+
+Unprivileged processes are subject to a full security permission check
+based on the process's credentials [5]_ (usually: effective UID,
+effective GID, and supplementary group list).
+
+Linux divides the privileges traditionally associated with superuser
+into distinct units, known as capabilities [6]_ , which can be
+independently enabled and disabled on per-thread basis for processes and
+files of unprivileged users.
+
+Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated
+as privileged processes with respect to perf_events performance
+monitoring and bypass *scope* permissions checks in the kernel.
+
+Unprivileged processes using the perf_events system call API are also subject
+to a PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose
+outcome determines whether monitoring is permitted. So unprivileged
+processes provided with CAP_SYS_PTRACE capability are effectively
+permitted to pass the check.
+
+Other capabilities being granted to unprivileged processes can
+effectively enable capturing of additional data required for later
+performance analysis of monitored processes or a system. For example,
+CAP_SYSLOG capability permits reading kernel space memory addresses from
+/proc/kallsyms file.
+
+perf_events/Perf privileged users
+---------------------------------
+
+Mechanisms of capabilities, privileged capability-dumb files [6]_ and
+file system ACLs [10]_ can be used to create a dedicated group of
+perf_events/Perf privileged users who are permitted to execute
+performance monitoring without scope limits. The following steps can be
+taken to create such a group of privileged Perf users.
+
+1. Create perf_users group of privileged Perf users, assign perf_users
+ group to Perf tool executable and limit access to the executable for
+ other users in the system who are not in the perf_users group:
+
+::
+
+ # groupadd perf_users
+ # ls -alhF
+ -rwxr-xr-x 2 root root 11M Oct 19 15:12 perf
+ # chgrp perf_users perf
+ # ls -alhF
+ -rwxr-xr-x 2 root perf_users 11M Oct 19 15:12 perf
+ # chmod o-rwx perf
+ # ls -alhF
+ -rwxr-x--- 2 root perf_users 11M Oct 19 15:12 perf
+
+2. Assign the required capabilities to the Perf tool executable file and
+ enable members of perf_users group with performance monitoring
+ privileges [6]_ :
+
+::
+
+ # setcap "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap -v "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
+ perf: OK
+ # getcap perf
+ perf = cap_sys_ptrace,cap_sys_admin,cap_syslog+ep
+
+As a result, members of perf_users group are capable of conducting
+performance monitoring by using functionality of the configured Perf
+tool executable that, when executed, passes perf_events subsystem scope
+checks.
+
+This specific access control management is only available to superuser
+or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
+capabilities.
+
+perf_events/Perf unprivileged users
+-----------------------------------
+
+perf_events/Perf *scope* and *access* control for unprivileged processes
+is governed by perf_event_paranoid [2]_ setting:
+
+-1:
+ Impose no *scope* and *access* restrictions on using perf_events
+ performance monitoring. Per-user per-cpu perf_event_mlock_kb [2]_
+ locking limit is ignored when allocating memory buffers for storing
+ performance data. This is the least secure mode since allowed
+ monitored *scope* is maximized and no perf_events specific limits
+ are imposed on *resources* allocated for performance monitoring.
+
+>=0:
+ *scope* includes per-process and system wide performance monitoring
+ but excludes raw tracepoints and ftrace function tracepoints
+ monitoring. CPU and system events that happened when executing either in
+ user or in kernel space can be monitored and captured for later
+ analysis. Per-user per-cpu perf_event_mlock_kb locking limit is
+ imposed but ignored for unprivileged processes with CAP_IPC_LOCK
+ [6]_ capability.
+
+>=1:
+ *scope* includes per-process performance monitoring only and
+ excludes system wide performance monitoring. CPU and system events
+ that happened when executing either in user or in kernel space can be
+ monitored and captured for later analysis. Per-user per-cpu
+ perf_event_mlock_kb locking limit is imposed but ignored for
+ unprivileged processes with CAP_IPC_LOCK capability.
+
+>=2:
+ *scope* includes per-process performance monitoring only. CPU and
+ system events that happened when executing in user space only can be
+ monitored and captured for later analysis. Per-user per-cpu
+ perf_event_mlock_kb locking limit is imposed but ignored for
+ unprivileged processes with CAP_IPC_LOCK capability.
+
+perf_events/Perf resource control
+---------------------------------
+
+Open file descriptors
++++++++++++++++++++++
+
+The perf_events system call API [2]_ allocates file descriptors for
+every configured PMU event. Open file descriptors are a per-process
+accountable resource governed by the RLIMIT_NOFILE [11]_ limit
+(ulimit -n), which is usually derived from the login shell process. When
+configuring Perf collection for a long list of events on a large server
+system, this limit can be easily hit, preventing the required monitoring
+configuration. The RLIMIT_NOFILE limit can be increased on a per-user basis
+by modifying the content of the limits.conf file [12]_ . Ordinarily, a Perf
+sampling session (perf record) requires an amount of open perf_event
+file descriptors that is not less than the number of monitored events
+multiplied by the number of monitored CPUs.
+
+Memory allocation
++++++++++++++++++
+
+The amount of memory available to user processes for capturing
+performance monitoring data is governed by the perf_event_mlock_kb [2]_
+setting. This perf_event specific resource setting defines overall
+per-cpu limits of memory allowed for mapping by the user processes to
+execute performance monitoring. The setting essentially extends the
+RLIMIT_MEMLOCK [11]_ limit, but only for memory regions mapped
+specifically for capturing monitored performance events and related data.
+
+For example, if a machine has eight cores and perf_event_mlock_kb limit
+is set to 516 KiB, then a user process is provided with 516 KiB * 8 =
+4128 KiB of memory above the RLIMIT_MEMLOCK limit (ulimit -l) for
+perf_event mmap buffers. In particular, this means that, if the user
+wants to start two or more performance monitoring processes, the user is
+required to manually distribute the available 4128 KiB between the
+monitoring processes, for example, using the --mmap-pages Perf record
+mode option. Otherwise, the first started performance monitoring process
+allocates all available 4128 KiB and the other processes will fail to
+proceed due to the lack of memory.
+
+RLIMIT_MEMLOCK and perf_event_mlock_kb resource constraints are ignored
+for processes with the CAP_IPC_LOCK capability. Thus, perf_events/Perf
+privileged users can be provided with memory above the constraints for
+perf_events/Perf performance monitoring purpose by providing the Perf
+executable with CAP_IPC_LOCK capability.
+
+Bibliography
+------------
+
+.. [1] `<https://lwn.net/Articles/337493/>`_
+.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
+.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
+.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
+.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
+.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
+.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
+.. [8] `<https://en.wikipedia.org/wiki/Hardware_performance_counter>`_
+.. [9] `<https://en.wikipedia.org/wiki/Model-specific_register>`_
+.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
+.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
+.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
+
diff --git a/Documentation/admin-guide/perf/arm-ccn.rst b/Documentation/admin-guide/perf/arm-ccn.rst
new file mode 100644
index 0000000..832b0c6
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm-ccn.rst
@@ -0,0 +1,61 @@
+==========================
+ARM Cache Coherent Network
+==========================
+
+CCN-504 is a ring-bus interconnect consisting of 11 crosspoints
+(XPs), with each crosspoint supporting up to two device ports,
+so nodes (devices) 0 and 1 are connected to crosspoint 0,
+nodes 2 and 3 to crosspoint 1 etc.
+
+PMU (perf) driver
+-----------------
+
+The CCN driver registers a perf PMU driver, which provides
+description of available events and configuration options
+in sysfs, see /sys/bus/event_source/devices/ccn*.
+
+The "format" directory describes the format of the config, config1
+and config2 fields of the perf_event_attr structure. The "events"
+directory provides configuration templates for all documented
+events that can be used with the perf tool. For example, "xp_valid_flit"
+is equivalent to "type=0x8,event=0x4". Other parameters must be
+explicitly specified.
+
+For events originating from device, "node" defines its index.
+
+Crosspoint PMU events require "xp" (index), "bus" (bus number)
+and "vc" (virtual channel ID).
+
+Crosspoint watchpoint-based events (special "event" value 0xfe)
+require "xp" and "vc" as above plus "port" (device port index),
+"dir" (transmit/receive direction), comparator values ("cmp_l"
+and "cmp_h") and "mask", being index of the comparator mask.
+
+Masks are defined separately from the event description
+(due to limited number of the config values) in the "cmp_mask"
+directory, with first 8 configurable by user and additional
+4 hardcoded for the most frequent use cases.
+
+Cycle counter is described by a "type" value 0xff and does
+not require any other settings.
+
+The driver also provides a "cpumask" sysfs attribute, which contains
+a single CPU ID, of the processor which will be used to handle all
+the CCN PMU events. It is recommended that the user space tools
+request the events on this processor (if not, the perf_event->cpu value
+will be overwritten anyway). In case of this processor being offlined,
+the events are migrated to another one and the attribute is updated.
+
+Example of perf tool use::
+
+ / # perf list | grep ccn
+ ccn/cycles/ [Kernel PMU event]
+ <...>
+ ccn/xp_valid_flit,xp=?,port=?,vc=?,dir=?/ [Kernel PMU event]
+ <...>
+
+ / # perf stat -a -e ccn/cycles/,ccn/xp_valid_flit,xp=1,port=0,vc=1,dir=1/ \
+ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/arm_dsu_pmu.rst b/Documentation/admin-guide/perf/arm_dsu_pmu.rst
new file mode 100644
index 0000000..7fd34db
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm_dsu_pmu.rst
@@ -0,0 +1,29 @@
+==================================
+ARM DynamIQ Shared Unit (DSU) PMU
+==================================
+
+ARM DynamIQ Shared Unit integrates one or more cores with an L3 memory system,
+control logic and external interfaces to form a multicore cluster. The PMU
+allows counting the various events related to the L3 cache, Snoop Control Unit
+etc, using 32bit independent counters. It also provides a 64bit cycle counter.
+
+The PMU can only be accessed via CPU system registers and is common to the
+cores connected to the same DSU. Like most of the other uncore PMUs, DSU
+PMU doesn't support process specific events and cannot be used in sampling mode.
+
+The DSU provides a bitmap for a subset of implemented events via hardware
+registers. There is no way for the driver to determine if the other events
+are available or not. Hence the driver exposes only those events advertised
+by the DSU, in "events" directory under::
+
+ /sys/bus/event_source/devices/arm_dsu_<N>/
+
+The user should refer to the TRM of the product to figure out the supported events
+and use the raw event code for the unlisted events.
+
+The driver also exposes the CPUs connected to the DSU instance in "associated_cpus".
+
+
+Example usage::
+
+ perf stat -a -e arm_dsu_0/cycles/
diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst
new file mode 100644
index 0000000..404a5c3
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pmu.rst
@@ -0,0 +1,60 @@
+======================================================
+HiSilicon SoC uncore Performance Monitoring Unit (PMU)
+======================================================
+
+The HiSilicon SoC chip includes various independent system device PMUs
+such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
+independent and have hardware logic to gather statistics and performance
+information.
+
+The HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster
+(CCL) is made up of 4 CPU cores sharing one L3 cache; each CPU die is
+called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has
+two HHAs (0 - 1) and four DDRCs (0 - 3), respectively.
+
+HiSilicon SoC uncore PMU driver
+-------------------------------
+
+Each device PMU has separate registers for event counting, control and
+interrupt, and the PMU driver shall register perf PMU drivers like L3C,
+HHA and DDRC etc. The available events and configuration options shall
+be described in the sysfs, see:
+
+/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or
+/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>.
+The "perf list" command shall list the available events from sysfs.
+
+Each L3C, HHA and DDRC is registered as a separate PMU with perf. The PMU
+name will appear in the event listing as hisi_sccl<sccl-id>_module<index-id>,
+where "sccl-id" is the identifier of the SCCL and "index-id" is the index of
+the module.
+
+e.g. hisi_sccl3_l3c0/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 in
+SCCL ID #3.
+
+e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in
+SCCL ID #1.
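+
+The naming scheme can be illustrated with a small parser. This is a sketch
+under assumptions: the function name is hypothetical, the module names are
+limited to the three mentioned in this document (l3c, hha, ddrc), and the
+pattern is inferred from the two examples above rather than from the driver.
+

```python
import re

# hisi_sccl<sccl-id>_<module><index-id>/<event>, with an optional trailing "/"
_HISI_EVENT_RE = re.compile(
    r"^hisi_sccl(?P<sccl>\d+)_(?P<module>l3c|hha|ddrc)(?P<index>\d+)"
    r"/(?P<event>\w+)/?$"
)


def parse_hisi_event(spec: str):
    """Split an event spec into (sccl_id, module, index_id, event_name)."""
    m = _HISI_EVENT_RE.match(spec)
    if m is None:
        raise ValueError(f"not a HiSilicon uncore event spec: {spec!r}")
    return int(m["sccl"]), m["module"], int(m["index"]), m["event"]


if __name__ == "__main__":
    print(parse_hisi_event("hisi_sccl3_l3c0/rd_hit_cpipe"))
    print(parse_hisi_event("hisi_sccl1_hha0/rx_operations"))
```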
+
+The driver also provides a "cpumask" sysfs attribute, which shows the CPU core
+ID used to count the uncore PMU event.
+
+Example usage of perf::
+
+ $# perf list
+ hisi_sccl3_l3c0/rd_hit_cpipe/ [kernel PMU event]
+ ------------------------------------------
+ hisi_sccl3_l3c0/wr_hit_cpipe/ [kernel PMU event]
+ ------------------------------------------
+ hisi_sccl1_l3c0/rd_hit_cpipe/ [kernel PMU event]
+ ------------------------------------------
+ hisi_sccl1_l3c0/wr_hit_cpipe/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -a -e hisi_sccl3_l3c0/rd_hit_cpipe/ sleep 5
+ $# perf stat -a -e hisi_sccl3_l3c0/config=0x02/ sleep 5
+
+The current driver does not support sampling, so "perf record" is unsupported.
+Attaching to a task is also unsupported, as the events are all uncore.
+
+Note: Please contact the maintainer for a complete list of events supported for
+the PMU devices in the SoC and its information if needed.
diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst
new file mode 100644
index 0000000..517a205
--- /dev/null
+++ b/Documentation/admin-guide/perf/imx-ddr.rst
@@ -0,0 +1,52 @@
+=====================================================
+Freescale i.MX8 DDR Performance Monitoring Unit (PMU)
+=====================================================
+
+There are no performance counters inside the DRAM controller, so performance
+signals are brought out to the edge of the controller where a set of 4 x 32 bit
+counters is implemented. This is controlled by the CSV modes programmed in the counter
+control register which causes a large number of PERF signals to be generated.
+
+Selection of the value for each counter is done via the config registers. There
+is one register for each counter. Counter 0 is special in that it always counts
+“time” and when expired causes a lock on itself and the other counters and an
+interrupt is raised. If any other counter overflows, it continues counting, and
+no interrupt is raised.
+
+The "format" directory describes the format of the config (event ID) and config1
+(AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/
+devices/imx8_ddr0/format/. The "events" directory describes the event types
+supported by the hardware that can be used with the perf tool, see /sys/bus/event_source/
+devices/imx8_ddr0/events/.
+ e.g.::
+ perf stat -a -e imx8_ddr0/cycles/ cmd
+ perf stat -a -e imx8_ddr0/read/,imx8_ddr0/write/ cmd
+
+AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write)
+to count reads or writes that match the filter setting. The filter setting varies
+between DRAM controller implementations, which are distinguished by quirks
+in the driver.
+
+* With DDR_CAP_AXI_ID_FILTER quirk.
+ Filter is defined with two configuration parts:
+ --AXI_ID defines AxID matching value.
+ --AXI_MASKING defines which bits of AxID are meaningful for the matching.
+ 0:corresponding bit is masked.
+ 1: corresponding bit is not masked, i.e. used to do the matching.
+
+ AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter.
+ When non-masked bits are matching corresponding AXI_ID bits then counter is
+ incremented. Perf counter is incremented if
+ AxID && AXI_MASKING == AXI_ID && AXI_MASKING
+
+ This filter doesn't support filter different AXI ID for axid-read and axid-write
+ event at the same time as this filter is shared between counters.
+ e.g.::
+ perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd
+ perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd
+
+ NOTE: axi_mask is inverted in userspace (i.e. set bits are bits to mask), and
+ it will be reverted in the driver automatically, so that the user can just specify
+ axi_id to monitor a specific ID, rather than having to specify axi_mask.
+ e.g.::
+ perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12
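+
+The matching rule for the DDR_CAP_AXI_ID_FILTER quirk can be modelled as
+below. This is a sketch, not driver code: the function names are hypothetical,
+and the 16-bit field width is an assumption suggested by the 0xMMMM and 0xDDDD
+placeholders in the examples above.
+

```python
AXI_FIELD_BITS = 16                      # assumed width of axi_id / axi_mask
AXI_FIELD_MASK = (1 << AXI_FIELD_BITS) - 1


def hw_masking_from_user(axi_mask: int) -> int:
    """Invert the userspace axi_mask (set bit = ignore this bit) into the
    AXI_MASKING form described above (set bit = compare this bit)."""
    return ~axi_mask & AXI_FIELD_MASK


def axid_matches(axid: int, axi_id: int, axi_masking: int) -> bool:
    """True if the non-masked bits of AxID equal those of AXI_ID."""
    return (axid & axi_masking) == (axi_id & axi_masking)


if __name__ == "__main__":
    exact = hw_masking_from_user(0x0)        # default: compare every bit
    print(axid_matches(0x12, 0x12, exact))   # True
    print(axid_matches(0x13, 0x12, exact))   # False
    loose = hw_masking_from_user(0x1)        # ignore bit 0
    print(axid_matches(0x13, 0x12, loose))   # True
```

+
+With axi_mask left at zero, every bit is compared, which is why specifying
+axi_id alone is enough to monitor a single ID.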
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
new file mode 100644
index 0000000..ee4bfd2
--- /dev/null
+++ b/Documentation/admin-guide/perf/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+Performance monitor support
+===========================
+
+.. toctree::
+ :maxdepth: 1
+
+ hisi-pmu
+ qcom_l2_pmu
+ qcom_l3_pmu
+ arm-ccn
+ xgene-pmu
+ arm_dsu_pmu
+ thunderx2-pmu
diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
new file mode 100644
index 0000000..c130178
--- /dev/null
+++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
@@ -0,0 +1,39 @@
+=====================================================================
+Qualcomm Technologies Level-2 Cache Performance Monitoring Unit (PMU)
+=====================================================================
+
+This driver supports the L2 cache clusters found in Qualcomm Technologies
+Centriq SoCs. There are multiple physical L2 cache clusters, each with their
+own PMU. Each cluster has one or more CPUs associated with it.
+
+There is one logical L2 PMU exposed, which aggregates the results from
+the physical PMUs.
+
+The driver provides a description of its available events and configuration
+options in sysfs, see /sys/devices/l2cache_0.
+
+The "format" directory describes the format of the events.
+
+Events can be envisioned as a 2-dimensional array. Each column represents
+a group of events. There are 8 groups. Only one entry from each
+group can be in use at a time. If multiple events from the same group
+are specified, the conflicting events cannot be counted at the same time.
+
+Events are specified as 0xCCG, where CC is 2 hex digits specifying
+the code (array row) and G specifies the group (column) 0-7.
+
+In addition there is a cycle counter event specified by the value 0xFE
+which is outside the above scheme.
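+
+The 0xCCG encoding can be illustrated with a few helpers. The function names
+are hypothetical and the conflict check only restates the rule above (one
+event per group at a time); it is not taken from the driver.
+

```python
CYCLE_COUNTER = 0xFE  # special event outside the code/group scheme


def encode_event(code: int, group: int) -> int:
    """Build a 0xCCG config value from a code (row) and a group (column)."""
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in two hex digits")
    if not 0 <= group <= 7:
        raise ValueError("group must be in the range 0-7")
    return (code << 4) | group


def decode_event(config: int):
    """Split a 0xCCG config value into (code, group)."""
    return config >> 4, config & 0xF


def conflicting_groups(configs):
    """Return the groups that are requested by more than one event."""
    seen, clashes = set(), set()
    for config in configs:
        if config == CYCLE_COUNTER:
            continue
        _, group = decode_event(config)
        if group in seen:
            clashes.add(group)
        seen.add(group)
    return clashes


if __name__ == "__main__":
    print(hex(encode_event(0x04, 2)))          # 0x42, i.e. config=0x042
    print(decode_event(0x001))                 # (0, 1)
    print(conflicting_groups([0x001, 0x042]))  # set(): groups 1 and 2
    print(conflicting_groups([0x012, 0x042]))  # {2}: both use group 2
```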
+
+The driver provides a "cpumask" sysfs attribute which contains a mask
+consisting of one CPU per cluster which will be used to handle all the PMU
+events on that cluster.
+
+Examples for use with perf::
+
+ perf stat -e l2cache_0/config=0x001/,l2cache_0/config=0x042/ -a sleep 1
+
+ perf stat -e l2cache_0/config=0xfe/ -C 2 sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
new file mode 100644
index 0000000..a3d014a
--- /dev/null
+++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
@@ -0,0 +1,26 @@
+===========================================================================
+Qualcomm Datacenter Technologies L3 Cache Performance Monitoring Unit (PMU)
+===========================================================================
+
+This driver supports the L3 cache PMUs found in Qualcomm Datacenter Technologies
+Centriq SoCs. The L3 cache on these SOCs is composed of multiple slices, shared
+by all cores within a socket. Each slice is exposed as a separate uncore perf
+PMU with device name l3cache_<socket>_<instance>. User space is responsible
+for aggregating across slices.
+
+The driver provides a description of its available events and configuration
+options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs
+the driver also exposes a "cpumask" sysfs attribute which contains a mask
+consisting of one CPU per socket which will be used to handle all the PMU
+events on that socket.
+
+The hardware implements 32bit event counters and has a flat 8bit event space
+exposed via the "event" format attribute. In addition to the 32bit physical
+counters the driver supports virtual 64bit hardware counters by using hardware
+counter chaining. This feature is exposed via the "lc" (long counter) format
+flag. E.g.::
+
+ perf stat -e l3cache_0_0/read-miss,lc/
+
+Given that these are uncore PMUs the driver does not support sampling, therefore
+"perf record" will not work. Per-task perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst
new file mode 100644
index 0000000..08e3367
--- /dev/null
+++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst
@@ -0,0 +1,42 @@
+=============================================================
+Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
+=============================================================
+
+The ThunderX2 SoC PMU consists of independent, system-wide, per-socket
+PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC).
+
+The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles.
+Events are counted for the default channel (i.e. channel 0) and prorated
+to the total number of channels/tiles.
+
+The DMC and L3C support up to 4 counters. Counters are independently
+programmable and can be started and stopped individually. Each counter
+can be set to a different event. Counters are 32-bit and do not support
+an overflow interrupt; they are read every 2 seconds.
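+
+Without an overflow interrupt, a 32-bit counter has to be polled and its
+wraparound handled in software. The sketch below shows the usual modular
+delta technique; it is an illustration of the idea, not the driver's code,
+and the class name is hypothetical. It is only correct if the counter wraps
+at most once between two reads.
+

```python
COUNTER_MASK = 0xFFFFFFFF  # 32-bit hardware counter


class WrappingCounter:
    """Accumulate a 32-bit free-running counter into an unbounded total."""

    def __init__(self, initial_raw: int = 0):
        self.prev = initial_raw & COUNTER_MASK
        self.total = 0

    def update(self, raw: int) -> int:
        """Fold a new raw reading into the total and return the total."""
        raw &= COUNTER_MASK
        # Modular subtraction yields the right delta across one wraparound.
        self.total += (raw - self.prev) & COUNTER_MASK
        self.prev = raw
        return self.total


if __name__ == "__main__":
    counter = WrappingCounter(initial_raw=0xFFFFFFF0)
    print(hex(counter.update(0x10)))  # 0x20: wrapped past 2**32
    print(hex(counter.update(0x30)))  # 0x40
```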
+
+PMU UNCORE (perf) driver:
+
+The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and
+L3C devices. Each PMU can be used to count up to 4 events
+simultaneously. The PMUs provide a description of their available events
+and configuration options under sysfs, see
+/sys/devices/uncore_<l3c_S/dmc_S/>; S is the socket id.
+
+The driver does not support sampling, therefore "perf record" will not
+work. Per-task perf sessions are also not supported.
+
+Examples::
+
+ # perf stat -a -e uncore_dmc_0/cnt_cycles/ sleep 1
+
+ # perf stat -a -e \
+ uncore_dmc_0/cnt_cycles/,\
+ uncore_dmc_0/data_transfers/,\
+ uncore_dmc_0/read_txns/,\
+ uncore_dmc_0/write_txns/ sleep 1
+
+ # perf stat -a -e \
+ uncore_l3c_0/read_request/,\
+ uncore_l3c_0/read_hit/,\
+ uncore_l3c_0/inv_request/,\
+ uncore_l3c_0/inv_hit/ sleep 1
diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst
new file mode 100644
index 0000000..644f8ed
--- /dev/null
+++ b/Documentation/admin-guide/perf/xgene-pmu.rst
@@ -0,0 +1,49 @@
+================================================
+APM X-Gene SoC Performance Monitoring Unit (PMU)
+================================================
+
+X-Gene SoC PMU consists of various independent system device PMUs such as
+L3 cache(s), I/O bridge(s), memory controller bridge(s) and memory
+controller(s). These PMU devices are loosely architected to follow the
+same model as the PMU for ARM cores. The PMUs share the same top level
+interrupt and status CSR region.
+
+PMU (perf) driver
+-----------------
+
+The xgene-pmu driver registers several perf PMU drivers. Each of the perf
+drivers provides a description of its available events and configuration options
+in sysfs, see /sys/devices/<l3cX/iobX/mcbX/mcX>/.
+
+The "format" directory describes the format of the config (event ID) and
+config1 (agent ID) fields of the perf_event_attr structure. The "events"
+directory provides configuration templates for all supported event types that
+can be used with the perf tool. For example, "l3c0/bank-fifo-full/" is
+equivalent to "l3c0/config=0x0b/".
+
+Most of the SoC PMUs have a specific list of agent IDs used for monitoring
+performance of a specific datapath. For example, agents of an L3 cache can be
+a specific CPU or an I/O bridge. Each PMU has a set of 2 registers capable of
+masking the agents from which the requests come. If the bit with
+the bit number corresponding to the agent is set, the event is counted only if
+it is caused by a request from that agent. Each agent ID bit is inversely mapped
+to a corresponding bit in "config1" field. By default, the event will be
+counted for all agent requests (config1 = 0x0). For all the supported agents of
+each PMU, please refer to APM X-Gene User Manual.
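+
+One reading of the inverse mapping is sketched below. The helper name is
+hypothetical, and the 64-bit width is an assumption based on the
+0xfffffffffffffffe value used in the perf example in this file; the actual
+agent IDs of each PMU must be taken from the APM X-Gene User Manual.
+

```python
CONFIG1_BITS = 64                        # assumed width of the config1 field
CONFIG1_MASK = (1 << CONFIG1_BITS) - 1


def config1_for_agents(agent_ids) -> int:
    """Return a config1 value that counts only requests from agent_ids.

    A set hardware mask bit selects an agent; config1 holds the inverse,
    so selected agents have their config1 bit cleared.
    """
    hw_mask = 0
    for agent in agent_ids:
        if not 0 <= agent < CONFIG1_BITS:
            raise ValueError(f"agent ID out of range: {agent}")
        hw_mask |= 1 << agent
    return ~hw_mask & CONFIG1_MASK


if __name__ == "__main__":
    print(hex(config1_for_agents([0])))          # 0xfffffffffffffffe
    print(hex(config1_for_agents(range(64))))    # 0x0, the count-all default
```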
+
+Each perf driver also provides a "cpumask" sysfs attribute, which contains a
+single CPU ID of the processor which will be used to handle all the PMU events.
+
+Example for perf tool use::
+
+ / # perf list | grep -e l3c -e iob -e mcb -e mc
+ l3c0/ackq-full/ [Kernel PMU event]
+ <...>
+ mcb1/mcb-csw-stall/ [Kernel PMU event]
+
+ / # perf stat -a -e l3c0/read-miss/,mcb1/csw-write-request/ sleep 1
+
+ / # perf stat -a -e l3c0/read-miss,config1=0xfffffffffffffffe/ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 47153e6..0c74a77 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`
@@ -5,9 +8,10 @@
CPU Performance Scaling
=======================
-::
+:Copyright: |copy| 2017 Intel Corporation
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
The Concept of CPU Performance Scaling
======================================
@@ -150,7 +154,7 @@
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.
-That callback it expected to register per-CPU utilization update callbacks for
+That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
@@ -396,8 +400,8 @@
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
-given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
-LWN.net article for a description of the PELT mechanism). Then, the new
+given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
+LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
CPU frequency to apply is computed in accordance with the formula
f = 1.25 * ``f_0`` * ``util`` / ``max``
@@ -698,4 +702,8 @@
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
-.. _Per-entity load tracking: https://lwn.net/Articles/531853/
+References
+==========
+
+.. [1] Jonathan Corbet, *Per-entity load tracking*,
+ https://lwn.net/Articles/531853/
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
new file mode 100644
index 0000000..e70b365
--- /dev/null
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -0,0 +1,723 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
+.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
+
+========================
+CPU Idle Time Management
+========================
+
+:Copyright: |copy| 2018 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
+Concepts
+========
+
+Modern processors are generally able to enter states in which the execution of
+a program is suspended and instructions belonging to it are not fetched from
+memory or executed. Those states are the *idle* states of the processor.
+
+Since part of the processor hardware is not used in idle states, entering them
+generally allows power drawn by the processor to be reduced and, in consequence,
+it is an opportunity to save energy.
+
+CPU idle time management is an energy-efficiency feature concerned with using
+the idle states of processors for this purpose.
+
+Logical CPUs
+------------
+
+CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
+is the part of the kernel responsible for the distribution of computational
+work in the system). In its view, CPUs are *logical* units. That is, they need
+not be separate physical entities and may just be interfaces appearing to
+software as individual single-core processors. In other words, a CPU is an
+entity which appears to be fetching instructions that belong to one sequence
+(program) from memory and executing them, but it need not work this way
+physically. Generally, three different cases can be considered here.
+
+First, if the whole processor can only follow one sequence of instructions (one
+program) at a time, it is a CPU. In that case, if the hardware is asked to
+enter an idle state, that applies to the processor as a whole.
+
+Second, if the processor is multi-core, each core in it is able to follow at
+least one program at a time. The cores need not be entirely independent of each
+other (for example, they may share caches), but still most of the time they
+work physically in parallel with each other, so if each of them executes only
+one program, those programs run mostly independently of each other at the same
+time. The entire cores are CPUs in that case and if the hardware is asked to
+enter an idle state, that applies to the core that asked for it in the first
+place, but it also may apply to a larger unit (say a "package" or a "cluster")
+that the core belongs to (in fact, it may apply to an entire hierarchy of larger
+units containing the core). Namely, if all of the cores in the larger unit
+except for one have been put into idle states at the "core level" and the
+remaining core asks the processor to enter an idle state, that may trigger it
+to put the whole larger unit into an idle state which also will affect the
+other cores in that unit.
+
+Finally, each core in a multi-core processor may be able to follow more than one
+program in the same time frame (that is, each core may be able to fetch
+instructions from multiple locations in memory and execute them in the same time
+frame, but not necessarily entirely in parallel with each other). In that case
+the cores present themselves to software as "bundles" each consisting of
+multiple individual single-core "processors", referred to as *hardware threads*
+(or hyper-threads specifically on Intel hardware), that each can follow one
+sequence of instructions. Then, the hardware threads are CPUs from the CPU idle
+time management perspective and if the processor is asked to enter an idle state
+by one of them, the hardware thread (or CPU) that asked for it is stopped, but
+nothing more happens, unless all of the other hardware threads within the same
+core also have asked the processor to enter an idle state. In that situation,
+the core may be put into an idle state individually or a larger unit containing
+it may be put into an idle state as a whole (if the other cores within the
+larger unit are in idle states already).
+
+Idle CPUs
+---------
+
+Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
+*idle* by the Linux kernel when there are no tasks to run on them except for the
+special "idle" task.
+
+Tasks are the CPU scheduler's representation of work. Each task consists of a
+sequence of instructions to execute, or code, data to be manipulated while
+running that code, and some context information that needs to be loaded into the
+processor every time the task's code is run by a CPU. The CPU scheduler
+distributes work by assigning tasks to run to the CPUs present in the system.
+
+Tasks can be in various states. In particular, they are *runnable* if there are
+no specific conditions preventing their code from being run by a CPU as long as
+there is a CPU available for that (for example, they are not waiting for any
+events to occur or similar). When a task becomes runnable, the CPU scheduler
+assigns it to one of the available CPUs to run and if there are no more runnable
+tasks assigned to it, the CPU will load the given task's context and run its
+code (from the instruction following the last one executed so far, possibly by
+another CPU). [If there are multiple runnable tasks assigned to one CPU
+simultaneously, they will be subject to prioritization and time sharing in order
+to allow them to make some progress over time.]
+
+The special "idle" task becomes runnable if there are no other runnable tasks
+assigned to the given CPU and the CPU is then regarded as idle. In other words,
+in Linux idle CPUs run the code of the "idle" task called *the idle loop*. That
+code may cause the processor to be put into one of its idle states, if they are
+supported, in order to save energy, but if the processor does not support any
+idle states, or there is not enough time to spend in an idle state before the
+next wakeup event, or there are strict latency constraints preventing any of the
+available idle states from being used, the CPU will simply execute more or less
+useless instructions in a loop until it is assigned a new task to run.
+
+
+.. _idle-loop:
+
+The Idle Loop
+=============
+
+The idle loop code takes two major steps in every iteration of it. First, it
+calls into a code module referred to as the *governor* that belongs to the CPU
+idle time management subsystem called ``CPUIdle`` to select an idle state for
+the CPU to ask the hardware to enter. Second, it invokes another code module
+from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
+processor hardware to enter the idle state selected by the governor.
+
+The role of the governor is to find an idle state most suitable for the
+conditions at hand. For this purpose, idle states that the hardware can be
+asked to enter by logical CPUs are represented in an abstract way independent of
+the platform or the processor architecture and organized in a one-dimensional
+(linear) array. That array has to be prepared and supplied by the ``CPUIdle``
+driver matching the platform the kernel is running on at the initialization
+time. This allows ``CPUIdle`` governors to be independent of the underlying
+hardware and to work with any platforms that the Linux kernel can run on.
+
+Each idle state present in that array is characterized by two parameters to be
+taken into account by the governor, the *target residency* and the (worst-case)
+*exit latency*. The target residency is the minimum time the hardware must
+spend in the given state, including the time needed to enter it (which may be
+substantial), in order to save more energy than it would save by entering one of
+the shallower idle states instead. [The "depth" of an idle state roughly
+corresponds to the power drawn by the processor in that state.] The exit
+latency, in turn, is the maximum time it will take a CPU asking the processor
+hardware to enter an idle state to start executing the first instruction after a
+wakeup from that state. Note that in general the exit latency also must cover
+the time needed to enter the given state in case the wakeup occurs when the
+hardware is entering it and it must be entered completely to be exited in an
+ordered manner.
+
+There are two types of information that can influence the governor's decisions.
+First of all, the governor knows the time until the closest timer event. That
+time is known exactly, because the kernel programs timers and it knows exactly
+when they will trigger, and it is the maximum time the hardware that the given
+CPU depends on can spend in an idle state, including the time necessary to enter
+and exit it. However, the CPU may be woken up by a non-timer event at any time
+(in particular, before the closest timer triggers) and it generally is not known
+when that may happen. The governor can only see how much time the CPU actually
+was idle after it has been woken up (that time will be referred to as the *idle
+duration* from now on) and it can use that information somehow along with the
+time until the closest timer to estimate the idle duration in future. How the
+governor uses that information depends on what algorithm is implemented by it
+and that is the primary reason for having more than one governor in the
+``CPUIdle`` subsystem.
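+
+How the two state parameters and the two kinds of information interact can be
+shown with a toy selection routine. It is deliberately simplistic and does not
+correspond to any of the in-kernel governors: the type and function names, the
+state values, the prediction heuristic and the assumption that the array is
+ordered from the shallowest to the deepest state are all illustrative.
+

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IdleState:
    name: str
    target_residency_us: int
    exit_latency_us: int


def predict_idle_us(next_timer_us: int, recent_idle_us: Sequence[int]) -> int:
    """Assumed heuristic: never predict beyond the closest timer, and cap the
    prediction at the average of recently observed idle durations."""
    if not recent_idle_us:
        return next_timer_us
    return min(next_timer_us, sum(recent_idle_us) // len(recent_idle_us))


def select_state(states: Sequence[IdleState], predicted_idle_us: int,
                 latency_limit_us: int) -> int:
    """Index of the deepest state whose target residency fits the predicted
    idle duration and whose exit latency fits the latency limit."""
    choice = 0  # fall back to the shallowest state
    for index, state in enumerate(states):
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            choice = index
    return choice


if __name__ == "__main__":
    states = [IdleState("shallow", 0, 0),
              IdleState("medium", 2, 2),
              IdleState("deep", 600, 100)]   # made-up values
    predicted = predict_idle_us(5000, [40, 60, 50])
    print(states[select_state(states, predicted, 1000)].name)  # medium
    print(states[select_state(states, 1000, 1000)].name)       # deep
    print(states[select_state(states, 1000, 50)].name)         # medium
```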
+
+There are three ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_
+and ``ladder``. Which of them is used by default depends on the configuration
+of the kernel and in particular on whether or not the scheduler tick can be
+`stopped by the idle loop <idle-cpus-and-tick_>`_. It is possible to change the
+governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has
+been passed to the kernel, but that is not safe in general, so it should not be
+done on production systems (that may change in the future, though). The name of
+the ``CPUIdle`` governor currently used by the kernel can be read from the
+:file:`current_governor_ro` (or :file:`current_governor` if
+``cpuidle_sysfs_switch`` is present in the kernel command line) file under
+:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
+
+Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
+platform the kernel is running on, but there are platforms with more than one
+matching driver. For example, there are two drivers that can work with the
+majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
+hardcoded idle states information and the other able to read that information
+from the system's ACPI tables, respectively. Still, even in those cases, the
+driver chosen at the system initialization time cannot be replaced later, so the
+decision on which one of them to use has to be made early (on Intel platforms
+the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
+reason or if it does not recognize the processor). The name of the ``CPUIdle``
+driver currently used by the kernel can be read from the :file:`current_driver`
+file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.
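+
+The ``sysfs`` files named above can be read with a few lines of code. The
+following is a minimal sketch with a hypothetical helper name; it tries
+:file:`current_governor_ro` first and falls back to :file:`current_governor`,
+and it reports ``None`` for anything it cannot read.
+

```python
from pathlib import Path

CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")


def read_cpuidle_info(base: Path = CPUIDLE_DIR) -> dict:
    """Return the current CPUIdle governor and driver names, if readable."""

    def read_first(names):
        # Return the stripped contents of the first readable file.
        for name in names:
            try:
                return (base / name).read_text().strip()
            except OSError:
                continue
        return None

    return {
        "governor": read_first(("current_governor_ro", "current_governor")),
        "driver": read_first(("current_driver",)),
    }


if __name__ == "__main__":
    print(read_cpuidle_info())
```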
+
+
+.. _idle-cpus-and-tick:
+
+Idle CPUs and The Scheduler Tick
+================================
+
+The scheduler tick is a timer that triggers periodically in order to implement
+the time sharing strategy of the CPU scheduler. Of course, if there are
+multiple runnable tasks assigned to one CPU at the same time, the only way to
+allow them to make reasonable progress in a given time frame is to make them
+share the available CPU time. Namely, in rough approximation, each task is
+given a slice of the CPU time to run its code, subject to the scheduling class,
+prioritization and so on and when that time slice is used up, the CPU should be
+switched over to running (the code of) another task. The currently running task
+may not want to give the CPU away voluntarily, however, and the scheduler tick
+is there to make the switch happen regardless. That is not the only role of the
+tick, but it is the primary reason for using it.
+
+The scheduler tick is problematic from the CPU idle time management perspective,
+because it triggers periodically and relatively often (depending on the kernel
+configuration, the length of the tick period is between 1 ms and 10 ms).
+Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
+for them to ask the hardware to enter idle states with target residencies above
+the tick period length. Moreover, in that case the idle duration of any CPU
+will never exceed the tick period length and the energy used for entering and
+exiting idle states due to the tick wakeups on idle CPUs will be wasted.
+
+Fortunately, it is not really necessary to allow the tick to trigger on idle
+CPUs, because (by definition) they have no tasks to run except for the special
+"idle" one. In other words, from the CPU scheduler perspective, the only user
+of the CPU time on them is the idle loop. Since the time of an idle CPU need
+not be shared between multiple runnable tasks, the primary reason for using the
+tick goes away if the given CPU is idle. Consequently, it is possible to stop
+the scheduler tick entirely on idle CPUs in principle, even though that may not
+always be worth the effort.
+
+Whether or not it makes sense to stop the scheduler tick in the idle loop
+depends on what is expected by the governor. First, if there is another
+(non-tick) timer due to trigger within the tick range, stopping the tick clearly
+would be a waste of time, even though the timer hardware may not need to be
+reprogrammed in that case. Second, if the governor is expecting a non-timer
+wakeup within the tick range, stopping the tick is not necessary and it may even
+be harmful. Namely, in that case the governor will select an idle state with
+the target residency within the time until the expected wakeup, so that state is
+going to be relatively shallow. The governor really cannot select a deep idle
+state then, as that would contradict its own expectation of a wakeup in short
+order. Now, if the wakeup really occurs shortly, stopping the tick would be a
+waste of time and in this case the timer hardware would need to be reprogrammed,
+which is expensive. On the other hand, if the tick is stopped and the wakeup
+does not occur any time soon, the hardware may spend an indefinite amount of time
+in the shallow idle state selected by the governor, which will be a waste of
+energy. Hence, if the governor is expecting a wakeup of any kind within the
+tick range, it is better to allow the tick to trigger. Otherwise, however, the
+governor will select a relatively deep idle state, so the tick should be stopped
+so that it does not wake up the CPU too early.
+
+In any case, the governor knows what it is expecting and the decision on whether
+or not to stop the scheduler tick belongs to it. Still, if the tick has been
+stopped already (in one of the previous iterations of the loop), it is better
+to leave it as is and the governor needs to take that into account.
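+
+The reasoning above boils down to a small decision rule. The following
+Python sketch is a simplification for illustration only, not the kernel's
+code: the function name and its parameters are hypothetical, and "within the
+tick range" is modeled as "shorter than one tick period".

```python
def should_stop_tick(expected_idle_us, tick_period_us, tick_already_stopped):
    """Decide whether the governor should ask for the tick to be stopped.

    expected_idle_us is the time until the wakeup (of any kind) that the
    governor expects; all times are in microseconds.
    """
    if tick_already_stopped:
        # Stopped in a previous iteration of the idle loop: leave it as is.
        return True
    # A wakeup expected within the tick range means a shallow state will be
    # selected anyway, so keeping the tick bounds the time spent in it.
    return expected_idle_us >= tick_period_us
```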
+
+The kernel can be configured to disable stopping the scheduler tick in the idle
+loop altogether. That can be done through its build-time configuration
+(by unsetting the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing
+``nohz=off`` to it in the command line. In both cases, as the stopping of the
+scheduler tick is disabled, the governor's decisions regarding it are simply
+ignored by the idle loop code and the tick is never stopped.
+
+The systems that run kernels configured to allow the scheduler tick to be
+stopped on idle CPUs are referred to as *tickless* systems and they are
+generally regarded as more energy-efficient than the systems running kernels in
+which the tick cannot be stopped. If the given system is tickless, it will use
+the ``menu`` governor by default and if it is not tickless, the default
+``CPUIdle`` governor on it will be ``ladder``.
+
+
+.. _menu-gov:
+
+The ``menu`` Governor
+=====================
+
+The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
+It is quite complex, but the basic principle of its design is straightforward.
+Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
+the CPU will ask the processor hardware to enter), it attempts to predict the
+idle duration and uses the predicted value for idle state selection.
+
+It first obtains the time until the closest timer event with the assumption
+that the scheduler tick will be stopped. That time, referred to as the *sleep
+length* in what follows, is the upper bound on the time before the next CPU
+wakeup. It is used to determine the sleep length range, which in turn is needed
+to get the sleep length correction factor.
+
+The ``menu`` governor maintains two arrays of sleep length correction factors.
+One of them is used when tasks previously running on the given CPU are waiting
+for some I/O operations to complete and the other one is used when that is not
+the case. Each array contains several correction factor values that correspond
+to different sleep length ranges organized so that each range represented in the
+array is approximately 10 times wider than the previous one.
+
+The correction factor for the given sleep length range (determined before
+selecting the idle state for the CPU) is updated after the CPU has been woken
+up and the closer the sleep length is to the observed idle duration, the closer
+to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
+The sleep length is multiplied by the correction factor for the range that it
+falls into to obtain the first approximation of the predicted idle duration.
+
+Next, the governor uses a simple pattern recognition algorithm to refine its
+idle duration prediction. Namely, it saves the last 8 observed idle duration
+values and, when predicting the idle duration next time, it computes the average
+and variance of them. If the variance is small (smaller than 400 square
+milliseconds) or it is small relative to the average (the average is greater
+than 6 times the standard deviation), the average is regarded as the "typical
+interval" value. Otherwise, the longest of the saved observed idle duration
+values is discarded and the computation is repeated for the remaining ones.
+Again, if the variance of them is small (in the above sense), the average is
+taken as the "typical interval" value and so on, until either the "typical
+interval" is determined or too many data points are disregarded, in which case
+the "typical interval" is assumed to equal "infinity" (the maximum unsigned
+integer value). The "typical interval" computed this way is compared with the
+sleep length multiplied by the correction factor and the minimum of the two is
+taken as the predicted idle duration.
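+
+The pattern recognition step can be sketched as follows. This is a minimal
+Python illustration under stated assumptions, not the kernel implementation:
+the text does not say how many data points may be discarded, so the
+``MIN_SAMPLES`` cutoff is an assumption, as is the 32-bit width of
+``INFINITY``. The thresholds (400 and 6) are the ones quoted above and are
+applied in whatever time unit the samples use.

```python
import math

INFINITY = 2**32 - 1  # "maximum unsigned integer value"; 32 bits assumed
MIN_SAMPLES = 6       # assumption: give up after discarding 3 of 8 samples


def typical_interval(samples, variance_limit=400):
    """Return the "typical interval" for the saved idle duration values."""
    values = sorted(samples)
    while len(values) >= MIN_SAMPLES:
        average = sum(values) / len(values)
        variance = sum((v - average) ** 2 for v in values) / len(values)
        # Small variance, or small relative to the average.
        if variance < variance_limit or average > 6 * math.sqrt(variance):
            return average
        values.pop()  # discard the longest remaining value and retry
    return INFINITY


def predict_idle_duration(sleep_length, correction_factor, samples):
    """Minimum of the typical interval and the corrected sleep length."""
    return min(typical_interval(samples), sleep_length * correction_factor)
```

+
+With eight identical samples the average is accepted immediately, while a
+single outlier among otherwise stable samples is dropped on the second pass.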
+
+Then, the governor computes an extra latency limit to help "interactive"
+workloads. It uses the observation that if the exit latency of the selected
+idle state is comparable with the predicted idle duration, the total time spent
+in that state probably will be very short and the amount of energy to save by
+entering it will be relatively small, so likely it is better to avoid the
+overhead related to entering that state and exiting it. Thus selecting a
+shallower state is likely to be a better option then. The first approximation
+of the extra latency limit is the predicted idle duration itself which
+additionally is divided by a value depending on the number of tasks that
+previously ran on the given CPU and are now waiting for I/O operations to
+complete. The result of that division is compared with the latency limit coming
+from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
+framework and the minimum of the two is taken as the limit for the idle states'
+exit latency.
+
+Now, the governor is ready to walk the list of idle states and choose one of
+them. For this purpose, it compares the target residency of each state with
+the predicted idle duration and the exit latency of it with the computed latency
+limit. It selects the state with the target residency closest to the predicted
+idle duration, but still below it, and exit latency that does not exceed the
+limit.
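+
+The walk over the list of idle states can be illustrated with the Python
+sketch below. It assumes that the states are ordered from the shallowest to
+the deepest one, treats a target residency equal to the predicted idle
+duration as acceptable, and falls back to the first state if nothing fits;
+the ``IdleState`` type and the function name are hypothetical.

```python
from typing import NamedTuple


class IdleState(NamedTuple):
    name: str
    target_residency_us: int
    exit_latency_us: int


def select_state(states, predicted_us, latency_limit_us):
    """Return the index of the deepest idle state that fits both limits."""
    chosen = 0  # assumption: fall back to the shallowest state
    for index, state in enumerate(states):
        if (state.target_residency_us <= predicted_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = index
    return chosen
```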
+
+In the final step the governor may still need to refine the idle state selection
+if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
+happens if the idle duration predicted by it is less than the tick period and
+the tick has not been stopped already (in a previous iteration of the idle
+loop). Then, the sleep length used in the previous computations may not reflect
+the real time until the closest timer event and if it really is greater than
+that time, the governor may need to select a shallower state with a suitable
+target residency.
+
+
+.. _teo-gov:
+
+The Timer Events Oriented (TEO) Governor
+========================================
+
+The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
+for tickless systems. It follows the same basic strategy as the ``menu`` `one
+<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
+given conditions. However, it applies a different approach to that problem.
+
+First, it does not use sleep length correction factors, but instead it attempts
+to correlate the observed idle duration values with the available idle states
+and use that information to pick up the idle state that is most likely to
+"match" the upcoming CPU idle interval. Second, it does not take into account
+at all the tasks that were running on the given CPU in the past and are now
+waiting on some I/O operations to complete (there is no guarantee that they
+will run on the same CPU when they become runnable again) and the pattern
+detection code in it avoids taking timer wakeups into account. It also only
+uses idle duration values less than the current time till the closest timer
+(with the scheduler tick excluded) for that purpose.
+
+Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
+the *sleep length*, which is the time until the closest timer event with the
+assumption that the scheduler tick will be stopped (that also is the upper bound
+on the time until the next CPU wakeup). That value is then used to preselect an
+idle state on the basis of three metrics maintained for each idle state provided
+by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.
+
+The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
+state will "match" the observed (post-wakeup) idle duration if it "matches" the
+sleep length. They both are subject to decay (after a CPU wakeup) every time
+the target residency of the idle state corresponding to them is less than or
+equal to the sleep length and the target residency of the next idle state is
+greater than the sleep length (that is, when the idle state corresponding to
+them "matches" the sleep length). The ``hits`` metric is increased if the
+former condition is satisfied and the target residency of the given idle state
+is less than or equal to the observed idle duration and the target residency of
+the next idle state is greater than the observed idle duration at the same time
+(that is, it is increased when the given idle state "matches" both the sleep
+length and the observed idle duration). In turn, the ``misses`` metric is
+increased when the given idle state "matches" the sleep length only and the
+observed idle duration is too short for its target residency.
+
+The ``early_hits`` metric measures the likelihood that a given idle state will
+"match" the observed (post-wakeup) idle duration if it does not "match" the
+sleep length. It is subject to decay on every CPU wakeup and it is increased
+when the idle state corresponding to it "matches" the observed (post-wakeup)
+idle duration and the target residency of the next idle state is less than or
+equal to the sleep length (i.e. the idle state "matching" the sleep length is
+deeper than the given one).
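+
+The bookkeeping of the three metrics can be sketched in Python as shown
+below. This is an illustration of the rules stated above rather than the
+governor's code: the decay step (``DECAY_SHIFT``) and the increment
+(``PULSE``) are assumed values, the helper names are hypothetical, and the
+list of target residencies is assumed to be sorted in ascending order and to
+start with 0.

```python
DECAY_SHIFT = 3  # assumption: each decay removes 1/8 of the value
PULSE = 1024     # assumption: amount added on every increase


def new_metrics(n_states):
    return [{"hits": 0, "misses": 0, "early_hits": 0} for _ in range(n_states)]


def matching_state(residencies, duration):
    """Index of the deepest state whose target residency fits duration."""
    index = 0
    for i, residency in enumerate(residencies):
        if residency <= duration:
            index = i
    return index


def decay(value):
    return value - (value >> DECAY_SHIFT)


def update_metrics(metrics, residencies, sleep_length, observed):
    """Update the metrics after a CPU wakeup."""
    sleep_idx = matching_state(residencies, sleep_length)
    obs_idx = matching_state(residencies, observed)
    for m in metrics:
        m["early_hits"] = decay(m["early_hits"])  # decays on every wakeup
    target = metrics[sleep_idx]  # the state "matching" the sleep length
    target["hits"] = decay(target["hits"])
    target["misses"] = decay(target["misses"])
    if obs_idx >= sleep_idx:
        target["hits"] += PULSE  # "matches" the observed duration as well
    else:
        target["misses"] += PULSE  # observed idle duration was too short
        metrics[obs_idx]["early_hits"] += PULSE
```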
+
+The governor walks the list of idle states provided by the ``CPUIdle`` driver
+and finds the last (deepest) one with the target residency less than or equal
+to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
+state are compared with each other and it is preselected if the ``hits`` one is
+greater (which means that that idle state is likely to "match" the observed idle
+duration after CPU wakeup). If the ``misses`` one is greater, the governor
+preselects the shallower idle state with the maximum ``early_hits`` metric
+(or if there are multiple shallower idle states with equal ``early_hits``
+metric which also is the maximum, the shallowest of them will be preselected).
+[If there is a wakeup latency constraint coming from the `PM QoS framework
+<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
+target residency within the sleep length, the deepest idle state with the exit
+latency within the constraint is preselected without consulting the ``hits``,
+``misses`` and ``early_hits`` metrics.]
+
+Next, the governor takes several idle duration values observed most recently
+into consideration and if at least a half of them are greater than or equal to
+the target residency of the preselected idle state, that idle state becomes the
+final candidate to ask for. Otherwise, the average of the most recent idle
+duration values below the target residency of the preselected idle state is
+computed and the governor walks the idle states shallower than the preselected
+one and finds the deepest of them with the target residency within that average.
+That idle state is then taken as the final candidate to ask for.
+
+Still, at this point the governor may need to refine the idle state selection if
+it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
+generally happens if the target residency of the idle state selected so far is
+less than the tick period and the tick has not been stopped already (in a
+previous iteration of the idle loop). Then, like in the ``menu`` governor
+`case <menu-gov_>`_, the sleep length used in the previous computations may not
+reflect the real time until the closest timer event and if it really is greater
+than that time, a shallower state with a suitable target residency may need to
+be selected.
+
+
+.. _idle-states-representation:
+
+Representation of Idle States
+=============================
+
+For the CPU idle time management purposes all of the physical idle states
+supported by the processor have to be represented as a one-dimensional array of
+|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
+the processor hardware to enter an idle state of certain properties. If there
+is a hierarchy of units in the processor, one |struct cpuidle_state| object can
+cover a combination of idle states supported by the units at different levels of
+the hierarchy. In that case, the `target residency and exit latency parameters
+of it <idle-loop_>`_, must reflect the properties of the idle state at the
+deepest level (i.e. the idle state of the unit containing all of the other
+units).
+
+For example, take a processor with two cores in a larger unit referred to as
+a "module" and suppose that asking the hardware to enter a specific idle state
+(say "X") at the "core" level by one core will trigger the module to try to
+enter a specific idle state of its own (say "MX") if the other core is in idle
+state "X" already. In other words, asking for idle state "X" at the "core"
+level gives the hardware a license to go as deep as to idle state "MX" at the
+"module" level, but there is no guarantee that this is going to happen (the core
+asking for idle state "X" may just end up in that state by itself instead).
+Then, the target residency of the |struct cpuidle_state| object representing
+idle state "X" must reflect the minimum time to spend in idle state "MX" of
+the module (including the time needed to enter it), because that is the minimum
+time the CPU needs to be idle to save any energy in case the hardware enters
+that state. Analogously, the exit latency parameter of that object must cover
+the exit time of idle state "MX" of the module (and usually its entry time too),
+because that is the maximum delay between a wakeup signal and the time the CPU
+will start to execute the first new instruction (assuming that both cores in the
+module will always be ready to execute instructions as soon as the module
+becomes operational as a whole).
+
+There are processors without direct coordination between different levels of the
+hierarchy of units inside them, however. In those cases asking for an idle
+state at the "core" level does not automatically affect the "module" level, for
+example, in any way and the ``CPUIdle`` driver is responsible for the entire
+handling of the hierarchy. Then, the definition of the idle state objects is
+entirely up to the driver, but still the physical properties of the idle state
+that the processor hardware finally goes into must always follow the parameters
+used by the governor for idle state selection (for instance, the actual exit
+latency of that idle state must not exceed the exit latency parameter of the
+idle state object selected by the governor).
+
+In addition to the target residency and exit latency idle state parameters
+discussed above, the objects representing idle states each contain a few other
+parameters describing the idle state and a pointer to the function to run in
+order to ask the hardware to enter that state. Also, for each
+|struct cpuidle_state| object, there is a corresponding
+:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
+statistics of the given idle state. That information is exposed by the kernel
+via ``sysfs``.
+
+For each CPU in the system, there is a
+:file:`/sys/devices/system/cpu/cpu<N>/cpuidle/`
+directory in ``sysfs``, where the number ``<N>`` is assigned to the given
+CPU at the initialization time. That directory contains a set of subdirectories
+called :file:`state0`, :file:`state1` and so on, up to the number of idle state
+objects defined for the given CPU minus one. Each of these directories
+corresponds to one idle state object and the larger the number in its name, the
+deeper the (effective) idle state represented by it. Each of them contains
+a number of files (attributes) representing the properties of the idle state
+object corresponding to it, as follows:
+
+``above``
+ Total number of times this idle state had been asked for, but the
+ observed idle duration was certainly too short to match its target
+ residency.
+
+``below``
+	Total number of times this idle state had been asked for, but certainly
+ a deeper idle state would have been a better match for the observed idle
+ duration.
+
+``desc``
+ Description of the idle state.
+
+``disable``
+ Whether or not this idle state is disabled.
+
+``latency``
+ Exit latency of the idle state in microseconds.
+
+``name``
+ Name of the idle state.
+
+``power``
+ Power drawn by hardware in this idle state in milliwatts (if specified,
+ 0 otherwise).
+
+``residency``
+ Target residency of the idle state in microseconds.
+
+``time``
+ Total time spent in this idle state by the given CPU (as measured by the
+ kernel) in microseconds.
+
+``usage``
+ Total number of times the hardware has been asked by the given CPU to
+ enter this idle state.
+
+The :file:`desc` and :file:`name` files both contain strings. The difference
+between them is that the name is expected to be more concise, while the
+description may be longer and it may contain white space or special characters.
+The other files listed above contain integer numbers.
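+
+As an illustration, the attributes listed above can be collected for one CPU
+with a short Python helper. The function name and the ``cpu_root`` parameter
+are hypothetical; only the directory layout and the attribute names come
+from the text, and attributes missing on a given kernel are simply skipped.

```python
from pathlib import Path

STRING_ATTRS = ("name", "desc")
INT_ATTRS = ("above", "below", "disable", "latency", "power",
             "residency", "time", "usage")


def read_idle_states(cpu, cpu_root="/sys/devices/system/cpu"):
    """Return one dict of attributes per idle state, shallowest first."""
    cpuidle = Path(cpu_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        info = {}
        for attr in STRING_ATTRS + INT_ATTRS:
            path = state_dir / attr
            if not path.is_file():
                continue  # attribute not provided here
            text = path.read_text().strip()
            info[attr] = text if attr in STRING_ATTRS else int(text)
        states.append(info)
    return states
```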
+
+The :file:`disable` attribute is the only writeable one. If it contains 1, the
+given idle state is disabled for this particular CPU, which means that the
+governor will never select it for this particular CPU and the ``CPUIdle``
+driver will never ask the hardware to enter it for that CPU as a result.
+However, disabling an idle state for one CPU does not prevent it from being
+asked for by the other CPUs, so it must be disabled for all of them in order to
+never be asked for by any of them. [Note that, due to the way the ``ladder``
+governor is implemented, disabling an idle state prevents that governor from
+selecting any idle states deeper than the disabled one too.]
+
+If the :file:`disable` attribute contains 0, the given idle state is enabled for
+this particular CPU, but it still may be disabled for some or all of the other
+CPUs in the system at the same time. Writing 1 to it causes the idle state to
+be disabled for this particular CPU and writing 0 to it allows the governor to
+take it into consideration for the given CPU and the driver to ask for it,
+unless that state was disabled globally in the driver (in which case it cannot
+be used at all).
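+
+Because the :file:`disable` attribute is per CPU, keeping an idle state from
+ever being asked for means writing to it for every CPU. A minimal Python
+sketch of that loop follows; the function name and the ``cpu_root``
+parameter are hypothetical, and writing to the real files requires
+sufficient privileges.

```python
from pathlib import Path


def set_state_disabled(state_index, disabled,
                       cpu_root="/sys/devices/system/cpu"):
    """Write 1 or 0 to the disable attribute of one state on every CPU."""
    value = "1" if disabled else "0"
    pattern = f"cpu[0-9]*/cpuidle/state{state_index}/disable"
    written = []
    for path in sorted(Path(cpu_root).glob(pattern)):
        path.write_text(value)
        written.append(path)
    return written  # the files that were updated
```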
+
+The :file:`power` attribute is not defined very well, especially for idle state
+objects representing combinations of idle states at different levels of the
+hierarchy of units in the processor, and it generally is hard to obtain idle
+state power numbers for complex hardware, so :file:`power` often contains 0 (not
+available) and if it contains a nonzero number, that number may not be very
+accurate and it should not be relied on for anything meaningful.
+
+The number in the :file:`time` file generally may be greater than the total time
+really spent by the given CPU in the given idle state, because it is measured by
+the kernel and it may not cover the cases in which the hardware refused to enter
+this idle state and entered a shallower one instead (or did not even
+enter any idle state at all). The kernel can only measure the time span between
+asking the hardware to enter an idle state and the subsequent wakeup of the CPU
+and it cannot say what really happened in the meantime at the hardware level.
+Moreover, if the idle state object in question represents a combination of idle
+states at different levels of the hierarchy of units in the processor,
+the kernel can never say how deep the hardware went down the hierarchy in any
+particular case. For these reasons, the only reliable way to find out how
+much time has been spent by the hardware in different idle states supported by
+it is to use idle state residency counters in the hardware, if available.
+
+
+.. _cpu-pm-qos:
+
+Power Management Quality of Service for CPUs
+============================================
+
+The power management quality of service (PM QoS) framework in the Linux kernel
+allows kernel code and user space processes to set constraints on various
+energy-efficiency features of the kernel to prevent performance from dropping
+below a required level. The PM QoS constraints can be set globally, in
+predefined categories referred to as PM QoS classes, or against individual
+devices.
+
+CPU idle time management can be affected by PM QoS in two ways, through the
+global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the
+resume latency constraints for individual CPUs. Kernel code (e.g. device
+drivers) can set both of them with the help of special internal interfaces
+provided by the PM QoS framework. User space can modify the former by opening
+the :file:`cpu_dma_latency` special device file under :file:`/dev/` and writing
+a binary value (interpreted as a signed 32-bit integer) to it. In turn, the
+resume latency constraint for a CPU can be modified by user space by writing a
+string (representing a signed 32-bit integer) to the
+:file:`power/pm_qos_resume_latency_us` file under
+:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
+``<N>`` is allocated at the system initialization time. Negative values
+will be rejected in both cases and, also in both cases, the written integer
+number will be interpreted as a requested PM QoS constraint in microseconds.
+
+The requested value is not automatically applied as a new constraint, however,
+as it may be less restrictive (greater in this particular case) than another
+constraint previously requested by someone else. For this reason, the PM QoS
+framework maintains a list of requests that have been made so far in each
+global class and for each device, aggregates them and applies the effective
+(minimum in this particular case) value as the new constraint.
+
+In fact, opening the :file:`cpu_dma_latency` special device file causes a new
+PM QoS request to be created and added to the priority list of requests in the
+``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the
+"open" operation represents that request. If that file descriptor is then
+used for writing, the number written to it will be associated with the PM QoS
+request represented by it as a new requested constraint value. Next, the
+priority list mechanism will be used to determine the new effective value of
+the entire list of requests and that effective value will be set as a new
+constraint. Thus setting a new requested constraint value will only change the
+real constraint if the effective "list" value is affected by it. In particular,
+for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if
+it is the minimum of the requested constraints in the list. The process holding
+a file descriptor obtained by opening the :file:`cpu_dma_latency` special device
+file controls the PM QoS request associated with that file descriptor, but it
+controls this particular PM QoS request only.
+
+Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
+file descriptor obtained while opening it, causes the PM QoS request associated
+with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY``
+class priority list and destroyed. If that happens, the priority list mechanism
+will be used, again, to determine the new effective value for the whole list
+and that value will become the new real constraint.
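+
+Since the request lives exactly as long as the file descriptor, a user space
+program has to keep the file open for as long as it needs the constraint. A
+minimal Python sketch of that pattern is shown below. The class name is
+hypothetical and the value is packed as a 32-bit signed integer in native
+byte order, which is an assumption about the expected binary layout.

```python
import os
import struct


class CpuDmaLatencyRequest:
    """Hold a PM QoS request for as long as the object stays open."""

    def __init__(self, latency_us, path="/dev/cpu_dma_latency"):
        self._fd = None
        if latency_us < 0:
            raise ValueError("negative values are rejected")
        fd = os.open(path, os.O_WRONLY)  # creates the request
        try:
            os.write(fd, struct.pack("=i", latency_us))
        except OSError:
            os.close(fd)
            raise
        self._fd = fd

    def close(self):
        """Close the descriptor, which removes the request."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

+
+Used as ``with CpuDmaLatencyRequest(20): ...``, the request is dropped
+automatically when the block is left.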
+
+In turn, for each CPU there is only one resume latency PM QoS request
+associated with the :file:`power/pm_qos_resume_latency_us` file under
+:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
+this single PM QoS request to be updated regardless of which user space
+process does that. In other words, this PM QoS request is shared by the entire
+user space, so access to the file associated with it needs to be arbitrated
+to avoid confusion. [Arguably, the only legitimate use of this mechanism in
+practice is to pin a process to the CPU in question and let it use the
+``sysfs`` interface to control the resume latency constraint for it.] It
+still only is a request, however. It is a member of a priority list used to
+determine the effective value to be set as the resume latency constraint for the
+CPU in question every time the list of requests is updated this way or another
+(there may be other requests coming from kernel code in that list).
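+
+Updating the per-CPU request is a plain string write. The Python sketch
+below illustrates it; the function name and the ``cpu_root`` parameter are
+hypothetical, and no arbitration between competing writers is attempted.

```python
from pathlib import Path


def set_resume_latency(cpu, latency_us, cpu_root="/sys/devices/system/cpu"):
    """Request a resume latency limit (in microseconds) for one CPU."""
    if latency_us < 0:
        raise ValueError("negative values are rejected")
    path = (Path(cpu_root) / f"cpu{cpu}" / "power"
            / "pm_qos_resume_latency_us")
    # The file takes a string representing a signed 32-bit integer.
    path.write_text(str(latency_us))
    return path
```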
+
+CPU idle time governors are expected to regard the minimum of the global
+effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective
+resume latency constraint for the given CPU as the upper limit for the exit
+latency of the idle states they can select for that CPU. They should never
+select any idle states with exit latency beyond that limit.
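+
+The aggregation rule can be written down in a few lines. The Python sketch
+below is illustrative only: the function names are hypothetical and an empty
+list of requests is modeled as "no constraint" (infinity), which is an
+assumption made for the example.

```python
def effective_constraint(requests, no_constraint=float("inf")):
    """The most restrictive (minimum) of the requested latency values."""
    return min(requests, default=no_constraint)


def exit_latency_limit(global_requests, cpu_requests):
    """Upper limit on exit latency for the idle states of one CPU."""
    return min(effective_constraint(global_requests),
               effective_constraint(cpu_requests))


def allowed_states(exit_latencies, limit):
    """Indices of the idle states a governor may still select."""
    return [i for i, latency in enumerate(exit_latencies) if latency <= limit]
```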
+
+
+Idle States Control Via Kernel Command Line
+===========================================
+
+In addition to the ``sysfs`` interface allowing individual idle states to be
+`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
+command line parameters affecting CPU idle time management.
+
+The ``cpuidle.off=1`` kernel command line option can be used to disable the
+CPU idle time management entirely. It does not prevent the idle loop from
+running on idle CPUs, but it prevents the CPU idle time governors and drivers
+from being invoked. If it is added to the kernel command line, the idle loop
+will ask the hardware to enter idle states on idle CPUs via the CPU architecture
+support code that is expected to provide a default mechanism for this purpose.
+That default mechanism usually is the least common denominator for all of the
+processors implementing the architecture (i.e. CPU instruction set) in question,
+however, so it is rather crude and not very energy-efficient. For this reason,
+it is not recommended for production use.
+
+The ``cpuidle.governor=`` kernel command line switch selects the ``CPUIdle``
+governor to use. It has to be followed by a string matching
+the name of an available governor (e.g. ``cpuidle.governor=menu``) and that
+governor will be used instead of the default one. It is possible to force
+the ``menu`` governor to be used on the systems that use the ``ladder`` governor
+by default this way, for example.
+
+The other kernel command line parameters controlling CPU idle time management
+described below are only relevant for the *x86* architecture and some of
+them affect Intel processors only.
+
+The *x86* architecture support code recognizes three kernel command line
+options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
+and ``idle=nomwait``. The first two of them disable the ``acpi_idle`` and
+``intel_idle`` drivers altogether, which effectively causes the entire
+``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
+architecture support code to deal with idle CPUs. How it does that depends on
+which of the two parameters is added to the kernel command line. In the
+``idle=halt`` case, the architecture support code will use the ``HLT``
+instruction of the CPUs (which, as a rule, suspends the execution of the program
+and causes the hardware to attempt to enter the shallowest available idle state)
+for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
+more or less "lightweight" sequence of instructions in a tight loop. [Note
+that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
+CPUs from saving almost any energy at all may not be the only effect of it.
+For example, on Intel hardware it effectively prevents CPUs from using
+P-states (see |cpufreq|) that require any number of CPUs in a package to be
+idle, so it very well may hurt single-thread computations performance as well as
+energy-efficiency. Thus using it for performance reasons may not be a good idea
+at all.]
+
+The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
+``acpi_idle`` to be used (as long as all of the information needed by it is
+there in the system's ACPI tables), but it is not allowed to use the
+``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
+
+In addition to the architecture-level kernel command line options affecting CPU
+idle time management, there are parameters affecting individual ``CPUIdle``
+drivers that can be passed to them via the kernel command line. Specifically,
+the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
+where ``<n>`` is an idle state index also used in the name of the given
+state's directory in ``sysfs`` (see
+`Representation of Idle States <idle-states-representation_>`_), cause the
+``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
+idle states deeper than idle state ``<n>``. In that case, they will never ask
+for any of those idle states or expose them to the governor. [The behavior of
+the two drivers is different for ``<n>`` equal to ``0``. Adding
+``intel_idle.max_cstate=0`` to the kernel command line disables the
+``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
+``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
+Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
+can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
+parameter when it is loaded.]
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
index 49237ac..39f8f9f 100644
--- a/Documentation/admin-guide/pm/index.rst
+++ b/Documentation/admin-guide/pm/index.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
================
Power Management
================
diff --git a/Documentation/admin-guide/pm/intel_epb.rst b/Documentation/admin-guide/pm/intel_epb.rst
new file mode 100644
index 0000000..0051211
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_epb.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+======================================
+Intel Performance and Energy Bias Hint
+======================================
+
+:Copyright: |copy| 2019 Intel Corporation
+
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+
+.. kernel-doc:: arch/x86/kernel/cpu/intel_epb.c
+ :doc: overview
+
+Intel Performance and Energy Bias Attribute in ``sysfs``
+========================================================
+
+The Intel Performance and Energy Bias Hint (EPB) value for a given (logical) CPU
+can be checked or updated through a ``sysfs`` attribute (file) under
+:file:`/sys/devices/system/cpu/cpu<N>/power/`, where the CPU number ``<N>``
+is allocated at the system initialization time:
+
+``energy_perf_bias``
+ Shows the current EPB value for the CPU on a sliding scale of 0 - 15, where
+ a value of 0 corresponds to a hint preference for highest performance
+ and a value of 15 corresponds to the maximum energy savings.
+
+ In order to update the EPB value for the CPU, this attribute can be
+ written to, either with a number in the 0 - 15 sliding scale above, or
+ with one of the strings: "performance", "balance-performance", "normal",
+ "balance-power", "power" that represent values reflected by their
+ meaning.
+
+ This attribute is present for all online CPUs supporting the EPB
+ feature.
+
+Note that while the EPB interface to the processor is defined at the logical CPU
+level, the physical register backing it may be shared by multiple CPUs (for
+example, SMT siblings or cores in one package). For this reason, updating the
+EPB value for one CPU may cause the EPB values for other CPUs to change.
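The input rule that intel_epb.rst states for ``energy_perf_bias`` (a number on the 0 - 15 scale, or one of five strings) can be sketched in user space as below. This is only an illustration under stated assumptions: ``epb_input_is_valid()`` is a hypothetical helper name, not a kernel or libc function, and the kernel's own parsing may differ in details such as whitespace handling.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): returns true if 's' looks like
 * something the energy_perf_bias attribute is documented to accept, that is
 * one of the five named strings or a decimal number from 0 to 15.
 */
static bool epb_input_is_valid(const char *s)
{
	static const char *const names[] = {
		"performance", "balance-performance", "normal",
		"balance-power", "power",
	};
	char *end;
	long val;
	size_t i;

	/* First try the symbolic names listed in the documentation. */
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (strcmp(s, names[i]) == 0)
			return true;

	/* An empty string is neither a name nor a number. */
	if (*s == '\0')
		return false;

	/* Otherwise require a complete decimal number within 0..15. */
	val = strtol(s, &end, 10);
	return *end == '\0' && val >= 0 && val <= 15;
}
```

A caller could run this check before writing the string to the ``energy_perf_bias`` file of a given CPU, so that obviously malformed input is rejected before the write is attempted.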
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 8f1d3de..67e414e 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -1,10 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===============================================
``intel_pstate`` CPU Performance Scaling Driver
===============================================
-::
+:Copyright: |copy| 2017 Intel Corporation
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
General Information
@@ -20,11 +23,10 @@
For the processors supported by ``intel_pstate``, the P-state concept is broader
than just an operating frequency or an operating performance point (see the
-`LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more
+LinuxCon Europe 2015 presentation by Kristen Accardi [1]_ for more
information about that). For this reason, the representation of P-states used
by ``intel_pstate`` internally follows the hardware specification (for details
-refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual
-Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core
+refer to Intel Software Developer’s Manual [2]_). However, the ``CPUFreq`` core
uses frequencies for identifying operating performance points of CPUs and
frequencies are involved in the user space interface exposed by it, so
``intel_pstate`` maps its internal representation of P-states to frequencies too
@@ -465,6 +467,13 @@
policy for the time interval between the last two invocations of the
driver's utilization update callback by the CPU scheduler for that CPU.
+One more policy attribute is present if the `HWP feature is enabled in the
+processor <Active Mode With HWP_>`_:
+
+``base_frequency``
+ Shows the base frequency of the CPU. Any frequency above this will be
+ in the turbo frequency range.
+
The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the
same as for other scaling drivers.
@@ -488,7 +497,15 @@
2. Each individual CPU is affected by its own per-policy limits (that is, it
cannot be requested to run faster than its own per-policy maximum and it
- cannot be requested to run slower than its own per-policy minimum).
+ cannot be requested to run slower than its own per-policy minimum). The
+ effective performance depends on whether the platform supports per core
+ P-states, whether hyper-threading is enabled, and on current performance
+ requests from other CPUs. When the platform doesn't support per core P-states, the
+ effective performance can be more than the policy limits set on a CPU, if
+ other CPUs are requesting higher performance at that moment. Even with per
+ core P-states support, when hyper-threading is enabled, if the sibling CPU
+ is requesting higher performance, the other siblings will get higher
+ performance than their policy limits.
3. The global and per-policy limits can be set independently.
@@ -546,9 +563,9 @@
On the majority of systems supported by ``intel_pstate``, the ACPI tables
provided by the platform firmware contain ``_PSS`` objects returning information
-that can be used for CPU performance scaling (refer to the `ACPI specification`_
-for details on the ``_PSS`` objects and the format of the information returned
-by them).
+that can be used for CPU performance scaling (refer to the ACPI specification
+[3]_ for details on the ``_PSS`` objects and the format of the information
+returned by them).
The information returned by the ACPI ``_PSS`` objects is used by the
``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
@@ -713,6 +730,14 @@
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
-.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
-.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
-.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
+References
+==========
+
+.. [1] Kristen Accardi, *Balancing Power and Performance in the Linux Kernel*,
+ http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
+
+.. [2] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide*,
+ http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
+
+.. [3] *Advanced Configuration and Power Interface Specification*,
+ https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
index dbf5acd..cd3a28c 100644
--- a/Documentation/admin-guide/pm/sleep-states.rst
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -1,10 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===================
System Sleep States
===================
-::
+:Copyright: |copy| 2017 Intel Corporation
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
Sleep states are global low-power states of the entire system in which user
space code cannot be executed and the overall system activity is significantly
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
index afe4d3f..dd0362e 100644
--- a/Documentation/admin-guide/pm/strategies.rst
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -1,10 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
===========================
Power Management Strategies
===========================
-::
+:Copyright: |copy| 2017 Intel Corporation
- Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
The Linux kernel supports two major high-level power management strategies.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
index 0c81e4c..2b1f987 100644
--- a/Documentation/admin-guide/pm/system-wide.rst
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
============================
System-Wide Power Management
============================
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index fa01bf0..fc298eb 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
==============================
Working-State Power Management
==============================
@@ -5,5 +7,7 @@
.. toctree::
:maxdepth: 2
+ cpuidle
cpufreq
intel_pstate
+ intel_epb
diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst
new file mode 100644
index 0000000..bab2d10
--- /dev/null
+++ b/Documentation/admin-guide/pnp.rst
@@ -0,0 +1,292 @@
+=================================
+Linux Plug and Play Documentation
+=================================
+
+:Author: Adam Belay <ambx1@neo.rr.com>
+:Last updated: Oct. 16, 2002
+
+
+Overview
+--------
+
+Plug and Play provides a means of detecting and setting resources for legacy or
+otherwise unconfigurable devices. The Linux Plug and Play Layer provides these
+services to compatible drivers.
+
+
+The User Interface
+------------------
+
+The Linux Plug and Play user interface provides a means to activate PnP devices
+for legacy and user level drivers that do not support Linux Plug and Play. The
+user interface is integrated into sysfs.
+
+In addition to the standard sysfs files the following are created in each
+device's directory:
+- id - displays a list of supported EISA IDs
+- options - displays possible resource configurations
+- resources - displays currently allocated resources and allows resource changes
+
+activating a device
+^^^^^^^^^^^^^^^^^^^
+
+::
+
+ # echo "auto" > resources
+
+This invokes the automatic resource configuration system to activate the device.
+
+manually activating a device
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ # echo "manual <depnum> <mode>" > resources
+
+ <depnum> - the configuration number
+ <mode> - static or dynamic
+ static = for next boot
+ dynamic = now
+
+disabling a device
+^^^^^^^^^^^^^^^^^^
+
+::
+
+ # echo "disable" > resources
+
+
+EXAMPLE:
+
+Suppose you need to activate the floppy disk controller.
+
+1. change to the proper directory, in my case it is
+ /driver/bus/pnp/devices/00:0f::
+
+ # cd /driver/bus/pnp/devices/00:0f
+ # cat name
+ PC standard floppy disk controller
+
+2. check if the device is already active::
+
+ # cat resources
+ DISABLED
+
+ - Notice the string "DISABLED". This means the device is not active.
+
+3. check the device's possible configurations (optional)::
+
+ # cat options
+ Dependent: 01 - Priority acceptable
+ port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding
+ port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding
+ irq 6
+ dma 2 8-bit compatible
+ Dependent: 02 - Priority acceptable
+ port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding
+ port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding
+ irq 6
+ dma 2 8-bit compatible
+
+4. now activate the device::
+
+ # echo "auto" > resources
+
+5. finally check if the device is active::
+
+ # cat resources
+ io 0x3f0-0x3f5
+ io 0x3f7-0x3f7
+ irq 6
+ dma 2
+
+There is also a series of kernel parameters::
+
+ pnp_reserve_irq=irq1[,irq2] ....
+ pnp_reserve_dma=dma1[,dma2] ....
+ pnp_reserve_io=io1,size1[,io2,size2] ....
+ pnp_reserve_mem=mem1,size1[,mem2,size2] ....
+
+
+
+The Unified Plug and Play Layer
+-------------------------------
+
+All Plug and Play drivers, protocols, and services meet at a central location
+called the Plug and Play Layer. This layer is responsible for the exchange of
+information between PnP drivers and PnP protocols. Thus it automatically
+forwards commands to the proper protocol. This makes writing PnP drivers
+significantly easier.
+
+The following functions are available from the Plug and Play Layer:
+
+pnp_get_protocol
+ increments the number of uses by one
+
+pnp_put_protocol
+ decrements the number of uses by one
+
+pnp_register_protocol
+ use this to register a new PnP protocol
+
+pnp_unregister_protocol
+ use this function to remove a PnP protocol from the Plug and Play Layer
+
+pnp_register_driver
+ adds a PnP driver to the Plug and Play Layer
+
+ this includes driver model integration
+ returns zero for success or a negative error number for failure; count
+ calls to the .add() method if you need to know how many devices bind to
+ the driver
+
+pnp_unregister_driver
+ removes a PnP driver from the Plug and Play Layer
+
+
+
+Plug and Play Protocols
+-----------------------
+
+This section contains information for PnP protocol developers.
+
+The following Protocols are currently available in the computing world:
+
+- PNPBIOS:
+ used for system devices such as serial and parallel ports.
+- ISAPNP:
+ provides PnP support for the ISA bus
+- ACPI:
+ among its many uses, ACPI provides information about system level
+ devices.
+
+It is meant to replace the PNPBIOS. It is not currently supported by Linux
+Plug and Play but it is planned to be in the near future.
+
+
+Requirements for a Linux PnP protocol:
+1. the protocol must use EISA IDs
+2. the protocol must inform the PnP Layer of a device's current configuration
+
+- the ability to set resources is optional but preferred.
+
+The following are PnP protocol related functions:
+
+pnp_add_device
+ use this function to add a PnP device to the PnP layer
+
+ only call this function when all wanted values are set in the pnp_dev
+ structure
+
+pnp_init_device
+ call this to initialize the PnP structure
+
+pnp_remove_device
+ call this to remove a device from the Plug and Play Layer.
+ it will fail if the device is still in use.
+ it automatically frees memory used by the device and related structures
+
+pnp_add_id
+ adds an EISA ID to the list of supported IDs for the specified device
+
+For more information consult the source of a protocol such as
+/drivers/pnp/pnpbios/core.c.
+
+
+
+Linux Plug and Play Drivers
+---------------------------
+
+This section contains information for Linux PnP driver developers.
+
+The New Way
+^^^^^^^^^^^
+
+1. first make a list of supported EISA IDs
+
+ ex::
+
+ static const struct pnp_id pnp_dev_table[] = {
+ /* Standard LPT Printer Port */
+ {.id = "PNP0400", .driver_data = 0},
+ /* ECP Printer Port */
+ {.id = "PNP0401", .driver_data = 0},
+ {.id = ""}
+ };
+
+ Please note that the character 'X' can be used as a wild card in the function
+ portion (last four characters).
+
+ ex::
+
+ /* Unknown PnP modems */
+ { "PNPCXXX", UNKNOWN_DEV },
+
+ Supported PnP card IDs can optionally be defined.
+ ex::
+
+ static const struct pnp_id pnp_card_table[] = {
+ { "ANYDEVS", 0 },
+ { "", 0 }
+ };
+
+2. Optionally define probe and remove functions. It may make sense not to
+ define these functions if the driver already has a reliable method of detecting
+ the resources, such as the parport_pc driver.
+
+ ex::
+
+ static int
+ serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const
+ struct pnp_id *dev_id)
+ {
+ . . .
+
+ ex::
+
+ static void serial_pnp_remove(struct pnp_dev * dev)
+ {
+ . . .
+
+ Consult /drivers/serial/8250_pnp.c for more information.
+
+3. create a driver structure
+
+ ex::
+
+ static struct pnp_driver serial_pnp_driver = {
+ .name = "serial",
+ .card_id_table = pnp_card_table,
+ .id_table = pnp_dev_table,
+ .probe = serial_pnp_probe,
+ .remove = serial_pnp_remove,
+ };
+
+ * name and id_table cannot be NULL.
+
+4. register the driver
+
+ ex::
+
+ static int __init serial8250_pnp_init(void)
+ {
+ return pnp_register_driver(&serial_pnp_driver);
+ }
+
+The Old Way
+^^^^^^^^^^^
+
+A series of compatibility functions have been created to make it easy to convert
+ISAPNP drivers. They should serve as a temporary solution only.
+
+They are as follows::
+
+ struct pnp_card *pnp_find_card(unsigned short vendor,
+ unsigned short device,
+ struct pnp_card *from)
+
+ struct pnp_dev *pnp_find_dev(struct pnp_card *card,
+ unsigned short vendor,
+ unsigned short function,
+ struct pnp_dev *from)
+
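The 'X' wildcard rule that pnp.rst describes for ID tables (an 'X' in the last four characters matches anything) can be illustrated with a small user-space sketch. The function name ``pnp_id_matches()`` is hypothetical and this is not the kernel's matching code. It assumes 7-character IDs, an exact match on the 3-character vendor prefix, and a case-insensitive comparison of the remaining characters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch, not the kernel implementation: compare a table
 * entry such as "PNPCXXX" against a device ID such as "PNPC123".
 */
static bool pnp_id_matches(const char *pattern, const char *id)
{
	size_t i;

	/* Assumption: both IDs are 3 vendor characters plus 4 function characters. */
	if (strlen(pattern) != 7 || strlen(id) != 7)
		return false;

	/* The vendor prefix must match exactly. */
	if (memcmp(pattern, id, 3) != 0)
		return false;

	/* In the function portion, 'X' in the pattern matches any character. */
	for (i = 3; i < 7; i++) {
		if (pattern[i] == 'X')
			continue;
		if (toupper((unsigned char)pattern[i]) !=
		    toupper((unsigned char)id[i]))
			return false;
	}
	return true;
}
```

With this sketch, the "PNPCXXX" entry from the modem example would match any ID that starts with "PNPC", while "PNP0400" would match only itself.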
diff --git a/Documentation/admin-guide/rapidio.rst b/Documentation/admin-guide/rapidio.rst
new file mode 100644
index 0000000..71ff658
--- /dev/null
+++ b/Documentation/admin-guide/rapidio.rst
@@ -0,0 +1,107 @@
+=======================
+RapidIO Subsystem Guide
+=======================
+
+:Author: Matt Porter
+
+Introduction
+============
+
+RapidIO is a high speed switched fabric interconnect with features aimed
+at the embedded market. RapidIO provides support for memory-mapped I/O
+as well as message-based transactions over the switched fabric network.
+RapidIO has a standardized discovery mechanism not unlike the PCI bus
+standard that allows simple detection of devices in a network.
+
+This documentation is provided for developers intending to support
+RapidIO on new architectures, write new drivers, or understand the
+subsystem internals.
+
+Known Bugs and Limitations
+==========================
+
+Bugs
+----
+
+None. ;)
+
+Limitations
+-----------
+
+1. Access/management of RapidIO memory regions is not supported
+
+2. Multiple host enumeration is not supported
+
+RapidIO driver interface
+========================
+
+Drivers are provided a set of calls in order to interface with the
+subsystem to gather info on devices, request/map memory region
+resources, and manage mailboxes/doorbells.
+
+Functions
+---------
+
+.. kernel-doc:: include/linux/rio_drv.h
+ :internal:
+
+.. kernel-doc:: drivers/rapidio/rio-driver.c
+ :export:
+
+.. kernel-doc:: drivers/rapidio/rio.c
+ :export:
+
+Internals
+=========
+
+This chapter contains the autogenerated documentation of the RapidIO
+subsystem.
+
+Structures
+----------
+
+.. kernel-doc:: include/linux/rio.h
+ :internal:
+
+Enumeration and Discovery
+-------------------------
+
+.. kernel-doc:: drivers/rapidio/rio-scan.c
+ :internal:
+
+Driver functionality
+--------------------
+
+.. kernel-doc:: drivers/rapidio/rio.c
+ :internal:
+
+.. kernel-doc:: drivers/rapidio/rio-access.c
+ :internal:
+
+Device model support
+--------------------
+
+.. kernel-doc:: drivers/rapidio/rio-driver.c
+ :internal:
+
+PPC32 support
+-------------
+
+.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c
+ :internal:
+
+Credits
+=======
+
+The following people have contributed to the RapidIO subsystem directly
+or indirectly:
+
+1. Matt Porter\ mporter@kernel.crashing.org
+
+2. Randy Vinson\ rvinson@mvista.com
+
+3. Dan Malek\ dan@embeddedalley.com
+
+The following people have contributed to this document:
+
+1. Matt Porter\ mporter@kernel.crashing.org
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst
index 1978967..2b20f5f 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/ras.rst
@@ -54,7 +54,7 @@
Types of errors
---------------
-Most mechanisms used on modern systems use use technologies like Hamming
+Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above, those mechanisms
can indicate with a high degree of confidence that an error happened, but
@@ -199,7 +199,7 @@
mode).
.. [#f3] For more details about the Machine Check Architecture (MCA),
- please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
+ please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.
EDAC - Error Detection And Correction
*************************************
diff --git a/Documentation/admin-guide/reporting-bugs.rst b/Documentation/admin-guide/reporting-bugs.rst
index 4650edb..49ac8dc 100644
--- a/Documentation/admin-guide/reporting-bugs.rst
+++ b/Documentation/admin-guide/reporting-bugs.rst
@@ -67,7 +67,7 @@
a bug in kernel.org bugzilla and send email to
linux-kernel@vger.kernel.org, referencing the bugzilla URL. (For more
information on the linux-kernel mailing list see
-http://www.tux.org/lkml/).
+http://vger.kernel.org/lkml/).
Tips for reporting bugs
diff --git a/Documentation/admin-guide/rtc.rst b/Documentation/admin-guide/rtc.rst
new file mode 100644
index 0000000..688c95b
--- /dev/null
+++ b/Documentation/admin-guide/rtc.rst
@@ -0,0 +1,140 @@
+=======================================
+Real Time Clock (RTC) Drivers for Linux
+=======================================
+
+When Linux developers talk about a "Real Time Clock", they usually mean
+something that tracks wall clock time and is battery backed so that it
+works even with system power off. Such clocks will normally not track
+the local time zone or daylight savings time -- unless they dual boot
+with MS-Windows -- but will instead be set to Coordinated Universal Time
+(UTC, formerly "Greenwich Mean Time").
+
+The newest non-PC hardware tends to just count seconds, like the time(2)
+system call reports, but RTCs also very commonly represent time using
+the Gregorian calendar and 24 hour time, as reported by gmtime(3).
+
+Linux has two largely-compatible userspace RTC API families you may
+need to know about:
+
+ * /dev/rtc ... is the RTC provided by PC compatible systems,
+ so it's not very portable to non-x86 systems.
+
+ * /dev/rtc0, /dev/rtc1 ... are part of a framework that's
+ supported by a wide variety of RTC chips on all systems.
+
+Programmers need to understand that the PC/AT functionality is not
+always available, and some systems can do much more. That is, the
+RTCs use the same API to make requests in both RTC frameworks (using
+different filenames of course), but the hardware may not offer the
+same functionality. For example, not every RTC is hooked up to an
+IRQ, so they can't all issue alarms; and where standard PC RTCs can
+only issue an alarm up to 24 hours in the future, other hardware may
+be able to schedule one any time in the upcoming century.
+
+
+Old PC/AT-Compatible driver: /dev/rtc
+--------------------------------------
+
+All PCs (even Alpha machines) have a Real Time Clock built into them.
+Usually they are built into the chipset of the computer, but some may
+actually have a Motorola MC146818 (or clone) on the board. This is the
+clock that keeps the date and time while your computer is turned off.
+
+ACPI has standardized that MC146818 functionality, and extended it in
+a few ways (enabling longer alarm periods, and wake-from-hibernate).
+That functionality is NOT exposed in the old driver.
+
+However it can also be used to generate signals from a slow 2Hz to a
+relatively fast 8192Hz, in increments of powers of two. These signals
+are reported by interrupt number 8. (Oh! So *that* is what IRQ 8 is
+for...) It can also function as a 24hr alarm, raising IRQ 8 when the
+alarm goes off. The alarm can also be programmed to only check any
+subset of the three programmable values, meaning that it could be set to
+ring on the 30th second of the 30th minute of every hour, for example.
+The clock can also be set to generate an interrupt upon every clock
+update, thus generating a 1Hz signal.
+
+The interrupts are reported via /dev/rtc (major 10, minor 135, read only
+character device) in the form of an unsigned long. The low byte contains
+the type of interrupt (update-done, alarm-rang, or periodic) that was
+raised, and the remaining bytes contain the number of interrupts since
+the last read. Status information is reported through the pseudo-file
+/proc/driver/rtc if the /proc filesystem was enabled. The driver has
+built in locking so that only one process is allowed to have the /dev/rtc
+interface open at a time.
+
+A user process can monitor these interrupts by doing a read(2) or a
+select(2) on /dev/rtc -- either will block/stop the user process until
+the next interrupt is received. This is useful for things like
+reasonably high frequency data acquisition where one doesn't want to
+burn up 100% CPU by polling gettimeofday() and the like.
+
+At high frequencies, or under high loads, the user process should check
+the number of interrupts received since the last read to determine if
+there has been any interrupt "pileup" so to speak. Just for reference, a
+typical 486-33 running a tight read loop on /dev/rtc will start to suffer
+occasional interrupt pileup (i.e. > 1 IRQ event since last read) for
+frequencies above 1024Hz. So you really should check the high bytes
+of the value you read, especially at frequencies above that of the
+normal timer interrupt, which is 100Hz.
+
+Programming and/or enabling interrupt frequencies greater than 64Hz is
+only allowed for root. This is perhaps a bit conservative, but we don't want
+an evil user generating lots of IRQs on a slow 386sx-16, where it might have
+a negative impact on performance. This 64Hz limit can be changed by writing
+a different value to /proc/sys/dev/rtc/max-user-freq. Note that the
+interrupt handler is only a few lines of code to minimize any possibility
+of this effect.
+
+Also, if the kernel time is synchronized with an external source, the
+kernel will write the time back to the CMOS clock every 11 minutes. In
+the process of doing this, the kernel briefly turns off RTC periodic
+interrupts, so be aware of this if you are doing serious work. If you
+don't synchronize the kernel time with an external source (via ntp or
+whatever) then the kernel will keep its hands off the RTC, allowing you
+exclusive access to the device for your applications.
+
+The alarm and/or interrupt frequency are programmed into the RTC via
+various ioctl(2) calls as listed in ./include/linux/rtc.h.
+Rather than write 50 pages describing the ioctl() and so on, it is
+perhaps more useful to include a small test program that demonstrates
+how to use them, and demonstrates the features of the driver. This is
+probably a lot more useful to people interested in writing applications
+that will be using this driver. See the pointer at the end of this document.
+
+(The original /dev/rtc driver was written by Paul Gortmaker.)
+
+
+New portable "RTC Class" drivers: /dev/rtcN
+--------------------------------------------
+
+Because Linux supports many non-ACPI and non-PC platforms, some of which
+have more than one RTC style clock, it needed a more portable solution
+than expecting a single battery-backed MC146818 clone on every system.
+Accordingly, a new "RTC Class" framework has been defined. It offers
+three different userspace interfaces:
+
+ * /dev/rtcN ... much the same as the older /dev/rtc interface
+
+ * /sys/class/rtc/rtcN ... sysfs attributes support read-only
+ access to some RTC attributes.
+
+ * /proc/driver/rtc ... the system clock RTC may expose itself
+ using a procfs interface. If there is no RTC for the system clock,
+ rtc0 is used by default. More information is (currently) shown
+ here than through sysfs.
+
+The RTC Class framework supports a wide variety of RTCs, ranging from those
+integrated into embeddable system-on-chip (SOC) processors to discrete chips
+using I2C, SPI, or some other bus to communicate with the host CPU. There's
+even support for PC-style RTCs ... including the features exposed on newer PCs
+through ACPI.
+
+The new framework also removes the "one RTC per system" restriction. For
+example, maybe the low-power battery-backed RTC is a discrete I2C chip, but
+a high functionality RTC is integrated into the SOC. That system might read
+the system clock from the discrete RTC, but use the integrated one for all
+other tasks, because of its greater functionality.
+
+Check out tools/testing/selftests/rtc/rtctest.c for an example usage of the
+ioctl interface.
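The layout of the value read from /dev/rtc, as rtc.rst describes it (interrupt type in the low byte, interrupt count in the remaining bytes), can be sketched as follows. The struct and function names are hypothetical and chosen only for illustration. The sketch decodes a value that has already been read and does not open the device itself.

```c
#include <stdbool.h>

/* Hypothetical container for the two fields packed into one read() result. */
struct rtc_irq_info {
	unsigned int type;    /* low byte: which kind of interrupt fired */
	unsigned long count;  /* remaining bytes: interrupts since last read */
};

/* Split the unsigned long returned by read(2) on /dev/rtc into its parts. */
static struct rtc_irq_info rtc_decode_irq_data(unsigned long data)
{
	struct rtc_irq_info info;

	info.type = (unsigned int)(data & 0xffUL);
	info.count = data >> 8;
	return info;
}

/* "Pileup" in the sense used above: more than one interrupt per read. */
static bool rtc_irq_pileup(unsigned long data)
{
	return rtc_decode_irq_data(data).count > 1;
}
```

A reader loop could call ``rtc_irq_pileup()`` on every value it reads and log or compensate whenever it returns true, which is the check the text recommends at high frequencies or under heavy load.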
diff --git a/Documentation/admin-guide/security-bugs.rst b/Documentation/admin-guide/security-bugs.rst
index 30187d4..dcd6c93 100644
--- a/Documentation/admin-guide/security-bugs.rst
+++ b/Documentation/admin-guide/security-bugs.rst
@@ -44,7 +44,7 @@
the logistics of QA and large scale rollouts which require release
coordination.
-Whilst embargoed information may be shared with trusted individuals in
+While embargoed information may be shared with trusted individuals in
order to develop a fix, such information will not be published alongside
the fix or on any other disclosure channel without the permission of the
reporter. This includes but is not limited to the original bug report
diff --git a/Documentation/admin-guide/svga.rst b/Documentation/admin-guide/svga.rst
new file mode 100644
index 0000000..b6c2f9a
--- /dev/null
+++ b/Documentation/admin-guide/svga.rst
@@ -0,0 +1,249 @@
+.. include:: <isonum.txt>
+
+=================================
+Video Mode Selection Support 2.13
+=================================
+
+:Copyright: |copy| 1995--1999 Martin Mares, <mj@ucw.cz>
+
+Intro
+~~~~~
+
+This small document describes the "Video Mode Selection" feature which
+allows the use of various special video modes supported by the video BIOS. Due
+to usage of the BIOS, the selection is limited to boot time (before the
+kernel decompression starts) and works only on 80X86 machines.
+
+.. note::
+
+ Short intro for the impatient: Just use vga=ask for the first time,
+ enter ``scan`` on the video mode prompt, pick the mode you want to use,
+ remember its mode ID (the four-digit hexadecimal number) and then
+ set the vga parameter to this number (converted to decimal first).
+
+The video mode to be used is selected by a kernel parameter which can be
+specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..."
+option of LILO (or some other boot loader you use) or by the "vidmode" utility
+(present in standard Linux utility packages). You can use the following values
+of this parameter::
+
+ NORMAL_VGA - Standard 80x25 mode available on all display adapters.
+
+ EXTENDED_VGA - Standard 8-pixel font mode: 80x43 on EGA, 80x50 on VGA.
+
+ ASK_VGA - Display a video mode menu upon startup (see below).
+
+ 0..35 - Menu item number (when you have used the menu to view the list of
+ modes available on your adapter, you can specify the menu item you want
+ to use). 0..9 correspond to "0".."9", 10..35 to "a".."z". Warning: the
+ mode list displayed may vary as the kernel version changes, because the
+ modes are listed in a "first detected -- first displayed" manner. It's
+ better to use absolute mode numbers instead.
+
+ 0x.... - Hexadecimal video mode ID (also displayed on the menu, see below
+ for exact meaning of the ID). Warning: rdev and LILO don't support
+ hexadecimal numbers -- you have to convert it to decimal manually.
+
+Menu
+~~~~
+
+The ASK_VGA mode causes the kernel to offer a video mode menu upon
+bootup. It displays a "Press <RETURN> to see video modes available, <SPACE>
+to continue or wait 30 secs" message. If you press <RETURN>, you enter the
+menu; if you press <SPACE> or wait 30 seconds, the kernel will boot up in
+the standard 80x25 mode.
+
+The menu looks like::
+
+ Video adapter: <name-of-detected-video-adapter>
+ Mode: COLSxROWS:
+ 0 0F00 80x25
+ 1 0F01 80x50
+ 2 0F02 80x43
+ 3 0F03 80x28
+ ....
+ Enter mode number or ``scan``: <flashing-cursor-here>
+
+<name-of-detected-video-adapter> tells you which video adapter Linux detected
+-- it's either a generic adapter name (MDA, CGA, HGC, EGA, VGA, VESA VGA [a VGA
+with VESA-compliant BIOS]) or a chipset name (e.g., Trident). Direct detection
+of chipsets is turned off by default as it's inherently unreliable due to
+absolutely insane PC design.
+
+"0 0F00 80x25" means that the first menu item (the menu items are numbered
+from "0" to "9" and from "a" to "z") is an 80x25 mode with ID=0x0f00 (see the
+next section for a description of mode IDs).
+
+<flashing-cursor-here> encourages you to enter the item number or mode ID
+you wish to set and press <RETURN>. If the computer complains about
+"Unknown mode ID", it is trying to tell you that it isn't possible to set such
+a mode. It's also possible to press only <RETURN> which leaves the current mode.
+
+The mode list usually contains a few basic modes and some VESA modes. In
+case your chipset has been detected, some chipset-specific modes are shown as
+well (some of these might be missing or unusable on your machine as different
+BIOSes are often shipped with the same card and the mode numbers depend purely
+on the VGA BIOS).
+
+The modes displayed on the menu are partially sorted: The list starts with
+the standard modes (80x25 and 80x50) followed by "special" modes (80x28 and
+80x43), local modes (if the local modes feature is enabled), VESA modes and
+finally SVGA modes for the auto-detected adapter.
+
+If you are not happy with the mode list offered (e.g., if you think your card
+is able to do more), you can enter "scan" instead of an item number / mode ID.
+The program will try to ask the BIOS for all possible video mode numbers and
+test what happens then. The screen will probably flash wildly for some time,
+strange noises may come from inside the monitor, and then all consistent video
+modes supported by your BIOS will appear (plus maybe some "ghost modes"). If
+you are afraid this could damage your monitor, don't use this function.
+
+After scanning, the mode ordering is a bit different: the auto-detected SVGA
+modes are not listed at all and the modes revealed by ``scan`` are shown before
+all VESA modes.
+
+Mode IDs
+~~~~~~~~
+
+Because of the complexity of all the video stuff, the video mode IDs
+used here are also a bit complex. A video mode ID is a 16-bit number usually
+expressed in a hexadecimal notation (starting with "0x"). You can set a mode
+by entering its ID directly if you know it, even if it isn't shown on the menu.
+
+The ID numbers can be divided into the following regions::
+
+ 0x0000 to 0x00ff - menu item references. 0x0000 is the first item. Don't use
+ outside the menu as this can change from boot to boot (especially if you
+ have used the ``scan`` feature).
+
+ 0x0100 to 0x017f - standard BIOS modes. The ID is a BIOS video mode number
+ (as presented to INT 10, function 00) increased by 0x0100.
+
+ 0x0200 to 0x08ff - VESA BIOS modes. The ID is a VESA mode ID increased by
+ 0x0100. All VESA modes should be autodetected and shown on the menu.
+
+ 0x0900 to 0x09ff - Video7 special modes. Set by calling INT 0x10, AX=0x6f05.
+ (Usually 940=80x43, 941=132x25, 942=132x44, 943=80x60, 944=100x60,
+ 945=132x28 for the standard Video7 BIOS)
+
+ 0x0f00 to 0x0fff - special modes (they are set by various tricks -- usually
+ by modifying one of the standard modes). Currently available:
+ 0x0f00 standard 80x25, don't reset mode if already set (=FFFF)
+ 0x0f01 standard with 8-point font: 80x43 on EGA, 80x50 on VGA
+ 0x0f02 VGA 80x43 (VGA switched to 350 scanlines with an 8-point font)
+ 0x0f03 VGA 80x28 (standard VGA scans, but 14-point font)
+ 0x0f04 leave current video mode
+ 0x0f05 VGA 80x30 (480 scans, 16-point font)
+ 0x0f06 VGA 80x34 (480 scans, 14-point font)
+ 0x0f07 VGA 80x60 (480 scans, 8-point font)
+ 0x0f08 Graphics hack (see the VIDEO_GFX_HACK paragraph below)
+
+ 0x1000 to 0x7fff - modes specified by resolution. The code has a "0xRRCC"
+ form where RR is a number of rows and CC is a number of columns.
+ E.g., 0x1950 corresponds to an 80x25 mode, 0x2b84 to 132x43, etc.
+ This is the only fully portable way to refer to a non-standard mode,
+ but it relies on the mode being found and displayed on the menu
+ (remember that mode scanning is not done automatically).
+
+ 0xff00 to 0xffff - aliases for backward compatibility:
+ 0xffff equivalent to 0x0f00 (standard 80x25)
+ 0xfffe equivalent to 0x0f01 (EGA 80x43 or VGA 80x50)
+
+If you add 0x8000 to the mode ID, the program will try to recalculate
+vertical display timing according to mode parameters, which can be used to
+eliminate some annoying bugs of certain VGA BIOSes (usually those used for
+cards with S3 chipsets and old Cirrus Logic BIOSes) -- mainly extra lines at the
+end of the display.
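+
+As a rough illustration of the regions above, the following Python sketch
+classifies a mode ID and decodes the "0xRRCC" form. The function name
+``describe_mode_id`` is hypothetical, and checking the 0xff00-0xffff aliases
+before testing the 0x8000 flag is an assumption of this sketch, not a
+description of how the boot code orders its checks.
+
```python
def describe_mode_id(mode_id):
    """Return (region, recalc_flag) for a 16-bit video mode ID.

    Follows the ID regions listed in the text; illustration only.
    """
    # Assumption: backward-compatibility aliases are recognised first,
    # because they also have bit 15 set.
    if 0xff00 <= mode_id <= 0xffff:
        return ("compatibility alias", False)

    recalc = bool(mode_id & 0x8000)  # "recalculate vertical timing" flag
    base = mode_id & 0x7fff

    if base <= 0x00ff:
        region = "menu item %d" % base
    elif 0x0100 <= base <= 0x017f:
        region = "standard BIOS mode 0x%02x" % (base - 0x0100)
    elif 0x0200 <= base <= 0x08ff:
        region = "VESA BIOS mode 0x%03x" % (base - 0x0100)
    elif 0x0900 <= base <= 0x09ff:
        region = "Video7 special mode"
    elif 0x0f00 <= base <= 0x0fff:
        region = "special mode"
    elif base >= 0x1000:
        rows, cols = base >> 8, base & 0xff  # 0xRRCC: rows, then columns
        region = "%dx%d by resolution" % (cols, rows)
    else:
        region = "unassigned"
    return (region, recalc)
```
+
+For example, ``describe_mode_id(0x1950)`` reports an 80x25 mode selected by
+resolution, and ``int("0f04", 16)`` yields 3844, the decimal form needed by
+boot loaders that do not accept hexadecimal numbers.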
+
+Options
+~~~~~~~
+
+Build options for arch/x86/boot/* are selected by the kernel kconfig
+utility and the kernel .config file.
+
+VIDEO_GFX_HACK - includes a special hack for setting graphics modes
+to be used later by special drivers.
+It allows setting _any_ BIOS mode, including graphics ones, and forcing a
+specific text screen resolution instead of reading it from BIOS variables.
+Don't use it unless you think you know what you're doing. To activate this
+setup, use mode number 0x0f08 (see the Mode IDs section above).
+
+Still doesn't work?
+~~~~~~~~~~~~~~~~~~~
+
+When the mode detection doesn't work (e.g., the mode list is incorrect or
+the machine hangs instead of displaying the menu), try to switch off some of
+the configuration options listed under "Options". If it fails, you can still use
+your kernel with the video mode set directly via the kernel parameter.
+
+In either case, please send me a bug report describing what _exactly_
+happens and how the configuration switches affect the behaviour of the bug.
+
+If you start Linux from M$-DOS, you might also use some DOS tools for
+video mode setting. In this case, you must specify the 0x0f04 mode ("leave
+current settings") to Linux, because if you don't and you use any non-standard
+mode, Linux will switch to 80x25 automatically.
+
+If you set some extended mode and there are one or more extra lines on the
+bottom of the display containing already scrolled-out text, your VGA BIOS
+contains the most common video BIOS bug called "incorrect vertical display
+end setting". Adding 0x8000 to the mode ID might fix the problem. Unfortunately,
+this must be done manually -- no autodetection mechanisms are available.
+
+History
+~~~~~~~
+
+=============== ================================================================
+1.0 (??-Nov-95) First version supporting all adapters supported by the old
+ setup.S + Cirrus Logic 54XX. Present in some 1.3.4? kernels
+ and then removed due to instability on some machines.
+2.0 (28-Jan-96) Rewritten from scratch. Cirrus Logic 64XX support added, almost
+ everything is configurable, the VESA support should be much more
+ stable, explicit mode numbering allowed, "scan" implemented etc.
+2.1 (30-Jan-96) VESA modes moved to 0x200-0x3ff. Mode selection by resolution
+ supported. Few bugs fixed. VESA modes are listed prior to
+ modes supplied by SVGA autodetection as they are more reliable.
+ CLGD autodetect works better. Doesn't depend on 80x25 being
+ active when started. Scanning fixed. 80x43 (any VGA) added.
+ Code cleaned up.
+2.2 (01-Feb-96) EGA 80x43 fixed. VESA extended to 0x200-0x4ff (non-standard 02XX
+ VESA modes work now). Display end bug workaround supported.
+ Special modes renumbered to allow adding of the "recalculate"
+ flag, 0xffff and 0xfffe became aliases instead of real IDs.
+ Screen contents retained during mode changes.
+2.3 (15-Mar-96) Changed to work with 1.3.74 kernel.
+2.4 (18-Mar-96) Added patches by Hans Lermen fixing a memory overwrite problem
+ with some boot loaders. Memory management rewritten to reflect
+ these changes. Unfortunately, screen contents retaining works
+ only with some loaders now.
+ Added a Tseng 132x60 mode.
+2.5 (19-Mar-96) Fixed a VESA mode scanning bug introduced in 2.4.
+2.6 (25-Mar-96) Some VESA BIOS errors not reported -- it fixes error reports on
+ several cards with broken VESA code (e.g., ATI VGA).
+2.7 (09-Apr-96) - Accepted all VESA modes in range 0x100 to 0x7ff, because some
+ cards use very strange mode numbers.
+ - Added Realtek VGA modes (thanks to Gonzalo Tornaria).
+ - Hardware testing order slightly changed, tests based on ROM
+ contents done as first.
+ - Added support for special Video7 mode switching functions
+ (thanks to Tom Vander Aa).
+ - Added 480-scanline modes (especially useful for notebooks,
+ original version written by hhanemaa@cs.ruu.nl, patched by
+ Jeff Chua, rewritten by me).
+ - Screen store/restore fixed.
+2.8 (14-Apr-96) - Previous release was not compilable without CONFIG_VIDEO_SVGA.
+ - Better recognition of text modes during mode scan.
+2.9 (12-May-96) - Ignored VESA modes 0x80 - 0xff (more VESA BIOS bugs!)
+2.10(11-Nov-96) - The whole thing made optional.
+ - Added the CONFIG_VIDEO_400_HACK switch.
+ - Added the CONFIG_VIDEO_GFX_HACK switch.
+ - Code cleanup.
+2.11(03-May-97) - Yet another cleanup, now including also the documentation.
+ - Direct testing of SVGA adapters turned off by default, ``scan``
+ offered explicitly on the prompt line.
+ - Removed the doc section describing adding of new probing
+ functions as I try to get rid of _all_ hardware probing here.
+2.12(25-May-98) Added support for VESA frame buffer graphics.
+2.13(14-May-99) Minor documentation fixes.
+=============== ================================================================
diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst
new file mode 100644
index 0000000..599bcde
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/abi.rst
@@ -0,0 +1,67 @@
+================================
+Documentation for /proc/sys/abi/
+================================
+
+kernel version 2.6.0.test2
+
+Copyright (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net>
+
+For general info: index.rst.
+
+------------------------------------------------------------------------------
+
+This path relates to binary emulation, also known as personality types or ABI.
+When a process is executed, it's linked to an exec_domain whose
+personality is defined using values available from /proc/sys/abi.
+You can find further details about abi in include/linux/personality.h.
+
+Here are the files present in the 2.6 kernel:
+
+- defhandler_coff
+- defhandler_elf
+- defhandler_lcall7
+- defhandler_libcso
+- fake_utsname
+- trace
+
+defhandler_coff
+---------------
+
+defined value:
+ PER_SCOSVR3::
+
+ 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
+
+defhandler_elf
+--------------
+
+defined value:
+ PER_LINUX::
+
+ 0
+
+defhandler_lcall7
+-----------------
+
+defined value:
+ PER_SVR4::
+
+ 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO
+
+defhandler_libcso
+-----------------
+
+defined value:
+ PER_SVR4::
+
+ 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO
+
+fake_utsname
+------------
+
+Unused
+
+trace
+-----
+
+Unused
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
new file mode 100644
index 0000000..2a45119
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -0,0 +1,384 @@
+===============================
+Documentation for /proc/sys/fs/
+===============================
+
+kernel version 2.2.10
+
+Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+
+Copyright (c) 2009, Shen Feng <shen@cn.fujitsu.com>
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------
+
+This file contains documentation for the sysctl files in
+/proc/sys/fs/ and is valid for Linux kernel version 2.2.
+
+The files in this directory can be used to tune and monitor
+miscellaneous and general things in the operation of the Linux
+kernel. Since some of the files _can_ be used to screw up your
+system, it is advisable to read both documentation and source
+before actually making adjustments.
+
+1. /proc/sys/fs
+===============
+
+Currently, these files are in /proc/sys/fs:
+
+- aio-max-nr
+- aio-nr
+- dentry-state
+- dquot-max
+- dquot-nr
+- file-max
+- file-nr
+- inode-max
+- inode-nr
+- inode-state
+- nr_open
+- overflowuid
+- overflowgid
+- pipe-user-pages-hard
+- pipe-user-pages-soft
+- protected_fifos
+- protected_hardlinks
+- protected_regular
+- protected_symlinks
+- suid_dumpable
+- super-max
+- super-nr
+
+
+aio-nr & aio-max-nr
+-------------------
+
+aio-nr is the running total of the number of events specified on the
+io_setup system call for all currently active aio contexts. If aio-nr
+reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
+raising aio-max-nr does not result in the pre-allocation or re-sizing
+of any kernel data structures.
+
+
+dentry-state
+------------
+
+From linux/include/linux/dcache.h::
+
+ struct dentry_stat_t dentry_stat {
+ int nr_dentry;
+ int nr_unused;
+ int age_limit; /* age in seconds */
+ int want_pages; /* pages requested by system */
+ int nr_negative; /* # of unused negative dentries */
+ int dummy; /* Reserved for future use */
+ };
+
+Dentries are dynamically allocated and deallocated.
+
+nr_dentry shows the total number of dentries allocated (active
++ unused). nr_unused shows the number of dentries that are not
+actively used, but are saved in the LRU list for future reuse.
+
+age_limit is the age in seconds after which dcache entries
+can be reclaimed when memory is short. want_pages is
+nonzero when shrink_dcache_pages() has been called and the
+dcache isn't pruned yet.
+
+nr_negative shows the number of unused dentries that are also
+negative dentries, which do not map to any files. Instead,
+they help speed up rejection of lookups for non-existent files
+requested by users.
+
+
+dquot-max & dquot-nr
+--------------------
+
+The file dquot-max shows the maximum number of cached disk
+quota entries.
+
+The file dquot-nr shows the number of allocated disk quota
+entries and the number of free disk quota entries.
+
+If the number of free cached disk quotas is very low and
+you have some awesome number of simultaneous system users,
+you might want to raise the limit.
+
+
+file-max & file-nr
+------------------
+
+The value in file-max denotes the maximum number of file
+handles that the Linux kernel will allocate. When you get lots
+of error messages about running out of file handles, you might
+want to increase this limit.
+
+Historically, the kernel was able to allocate file handles
+dynamically, but not to free them again. The three values in
+file-nr denote the number of allocated file handles, the number
+of allocated but unused file handles, and the maximum number of
+file handles. Linux 2.6 always reports 0 as the number of free
+file handles -- this is not an error, it just means that the
+number of allocated file handles exactly matches the number of
+used file handles.
+
+Attempts to allocate more file descriptors than file-max are
+reported with printk; look for "VFS: file-max limit <number>
+reached".
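+
+The three fields of file-nr can be split apart as in the short Python sketch
+below. The helper name ``parse_file_nr`` and the sample line are made up for
+illustration; on a live system the text would come from reading
+/proc/sys/fs/file-nr.
+
```python
def parse_file_nr(text):
    """Split the contents of file-nr into its three documented fields."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,  # allocated file handles
        "unused": unused,        # allocated but unused handles (0 on 2.6+)
        "max": maximum,          # same limit as file-max
    }


# Hypothetical sample contents, not taken from a real machine.
sample = "1952\t0\t813405\n"
info = parse_file_nr(sample)
headroom = info["max"] - info["allocated"]
```
+
+Here ``headroom`` is the number of further file handles that could be
+allocated before the file-max limit is reached.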
+
+
+nr_open
+-------
+
+This denotes the maximum number of file handles a process can
+allocate. The default value is 1024*1024 (1048576), which should be
+enough for most machines. The actual limit depends on the RLIMIT_NOFILE
+resource limit.
+
+
+inode-max, inode-nr & inode-state
+---------------------------------
+
+As with file handles, the kernel allocates the inode structures
+dynamically, but can't free them yet.
+
+The value in inode-max denotes the maximum number of inode
+handlers. This value should be 3-4 times larger than the value
+in file-max, since stdin, stdout and network sockets also
+need an inode struct to handle them. When you regularly run
+out of inodes, you need to increase this value.
+
+The file inode-nr contains the first two items from
+inode-state, so we'll skip to that file...
+
+Inode-state contains three actual numbers and four dummies.
+The actual numbers are, in order of appearance, nr_inodes,
+nr_free_inodes and preshrink.
+
+Nr_inodes stands for the number of inodes the system has
+allocated. This can be slightly more than inode-max because
+Linux allocates them one pageful at a time.
+
+Nr_free_inodes represents the number of free inodes (?) and
+preshrink is nonzero when the nr_inodes > inode-max and the
+system needs to prune the inode list instead of allocating
+more.
+
+
+overflowgid & overflowuid
+-------------------------
+
+Some filesystems only support 16-bit UIDs and GIDs, although in Linux
+UIDs and GIDs are 32 bits. When one of these filesystems is mounted
+with writes enabled, any UID or GID that would exceed 65535 is translated
+to a fixed value before being written to disk.
+
+These sysctls allow you to change the value of the fixed UID and GID.
+The default is 65534.
+
+
+pipe-user-pages-hard
+--------------------
+
+Maximum total number of pages a non-privileged user may allocate for pipes.
+Once this limit is reached, no new pipes may be allocated until usage goes
+below the limit again. When set to 0, no limit is applied, which is the default
+setting.
+
+
+pipe-user-pages-soft
+--------------------
+
+Maximum total number of pages a non-privileged user may allocate for pipes
+before the pipe size gets limited to a single page. Once this limit is reached,
+new pipes will be limited to a single page in size for this user in order to
+limit total memory usage, and trying to increase them using fcntl() will be
+denied until usage goes below the limit again. The default value allows
+allocating up to 1024 pipes at their default size. When set to 0, no limit is
+applied.
+
+
+protected_fifos
+---------------
+
+The intent of this protection is to avoid unintentional writes to
+an attacker-controlled FIFO, where a program expected to create a regular
+file.
+
+When set to "0", writing to FIFOs is unrestricted.
+
+When set to "1" don't allow O_CREAT open on FIFOs that we don't own
+in world writable sticky directories, unless they are owned by the
+owner of the directory.
+
+When set to "2" it also applies to group writable sticky directories.
+
+This protection is based on the restrictions in Openwall.
+
+
+protected_hardlinks
+--------------------
+
+A long-standing class of security issues is the hardlink-based
+time-of-check-time-of-use race, most commonly seen in world-writable
+directories like /tmp. The common method of exploitation of this flaw
+is to cross privilege boundaries when following a given hardlink (i.e. a
+root process follows a hardlink created by another user). Additionally,
+on systems without separated partitions, this stops unauthorized users
+from "pinning" vulnerable setuid/setgid files against being upgraded by
+the administrator, or linking to special files.
+
+When set to "0", hardlink creation behavior is unrestricted.
+
+When set to "1" hardlinks cannot be created by users if they do not
+already own the source file, or do not have read/write access to it.
+
+This protection is based on the restrictions in Openwall and grsecurity.
+
+
+protected_regular
+-----------------
+
+This protection is similar to protected_fifos, but it
+avoids writes to an attacker-controlled regular file, where a program
+expected to create one.
+
+When set to "0", writing to regular files is unrestricted.
+
+When set to "1" don't allow O_CREAT open on regular files that we
+don't own in world writable sticky directories, unless they are
+owned by the owner of the directory.
+
+When set to "2" it also applies to group writable sticky directories.
+
+
+protected_symlinks
+------------------
+
+A long-standing class of security issues is the symlink-based
+time-of-check-time-of-use race, most commonly seen in world-writable
+directories like /tmp. The common method of exploitation of this flaw
+is to cross privilege boundaries when following a given symlink (i.e. a
+root process follows a symlink belonging to another user). For a likely
+incomplete list of hundreds of examples across the years, please see:
+http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
+
+When set to "0", symlink following behavior is unrestricted.
+
+When set to "1" symlinks are permitted to be followed only when outside
+a sticky world-writable directory, or when the uid of the symlink and
+follower match, or when the directory owner matches the symlink's owner.
+
+This protection is based on the restrictions in Openwall and grsecurity.
+
+
+suid_dumpable:
+--------------
+
+This value can be used to query and set the core dump mode for setuid
+or otherwise protected/tainted binaries. The modes are:
+
+= ========== ===============================================================
+0 (default) traditional behaviour. Any process which has changed
+ privilege levels or is execute only will not be dumped.
+1 (debug) all processes dump core when possible. The core dump is
+ owned by the current user and no security is applied. This is
+ intended for system debugging situations only.
+ Ptrace is unchecked.
+ This is insecure as it allows regular users to examine the
+ memory contents of privileged processes.
+2 (suidsafe) any binary which normally would not be dumped is dumped
+ anyway, but only if the "core_pattern" kernel sysctl is set to
+ either a pipe handler or a fully qualified path. (For more
+ details on this limitation, see CVE-2006-2451.) This mode is
+ appropriate when administrators are attempting to debug
+ problems in a normal environment, and either have a core dump
+ pipe handler that knows to treat privileged core dumps with
+ care, or a specific directory defined for catching core dumps.
+ If a core dump happens without a pipe handler or fully
+ qualified path, a message will be emitted to syslog warning
+ about the lack of a correct setting.
+= ========== ===============================================================
+
+
+super-max & super-nr
+--------------------
+
+These numbers control the maximum number of superblocks, and
+thus the maximum number of mounted filesystems the kernel
+can have. You only need to increase super-max if you need to
+mount more filesystems than the current value in super-max
+allows you to.
+
+
+aio-nr & aio-max-nr
+-------------------
+
+aio-nr shows the current system-wide number of asynchronous io
+requests. aio-max-nr allows you to change the maximum value
+aio-nr can grow to.
+
+
+mount-max
+---------
+
+This denotes the maximum number of mounts that may exist
+in a mount namespace.
+
+
+
+2. /proc/sys/fs/binfmt_misc
+===========================
+
+Documentation for the files in /proc/sys/fs/binfmt_misc is
+in Documentation/admin-guide/binfmt-misc.rst.
+
+
+3. /proc/sys/fs/mqueue - POSIX message queues filesystem
+========================================================
+
+
+The "mqueue" filesystem provides the necessary kernel features to enable the
+creation of a user space library that implements the POSIX message queues
+API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
+Interfaces specification.)
+
+The "mqueue" filesystem contains values for determining/setting the amount of
+resources used by the file system.
+
+/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
+maximum number of message queues allowed on the system.
+
+/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
+maximum number of messages in a queue. In fact, it is the upper bound
+for another (user) limit which is set in the mq_open invocation. This attribute
+of a queue must be less than or equal to msg_max.
+
+/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
+maximum message size value (it is every message queue's attribute set during
+its creation).
+
+/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
+default number of messages in a queue if the attr parameter of mq_open(2) is
+NULL. If it exceeds msg_max, the default value is initialized to msg_max.
+
+/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
+the default message size if the attr parameter of mq_open(2) is NULL. If it
+exceeds msgsize_max, the default value is initialized to msgsize_max.
+
+4. /proc/sys/fs/epoll - Configuration options for the epoll interface
+=====================================================================
+
+This directory contains configuration options for the epoll(7) interface.
+
+max_user_watches
+----------------
+
+Every epoll file descriptor can store a number of files to be monitored
+for event readiness. Each one of these monitored files constitutes a "watch".
+This configuration option sets the maximum number of "watches" that are
+allowed for each user.
+Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
+on a 64bit one.
+The current default value for max_user_watches is 1/32 of the available
+low memory, divided by the "watch" cost in bytes.
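+
+The arithmetic can be sketched in Python as follows. This is only an
+approximation under stated assumptions: the helper name is hypothetical, the
+1 GiB low-memory figure is an arbitrary example, and 90 and 160 bytes are the
+rough per-watch costs quoted above rather than exact structure sizes.
+
```python
def default_max_user_watches(lowmem_bytes, watch_cost_bytes):
    """Approximate the default: 1/32 of low memory divided by the watch cost."""
    return (lowmem_bytes // 32) // watch_cost_bytes


ONE_GIB = 1 << 30  # example low-memory size, chosen for illustration

watches_64bit = default_max_user_watches(ONE_GIB, 160)  # ~160 bytes per watch
watches_32bit = default_max_user_watches(ONE_GIB, 90)   # ~90 bytes per watch
```
+
+With these example inputs the estimate comes to roughly 210 thousand watches
+per user on a 64bit kernel and roughly 373 thousand on a 32bit one.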
diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst
new file mode 100644
index 0000000..03346f9
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/index.rst
@@ -0,0 +1,98 @@
+===========================
+Documentation for /proc/sys
+===========================
+
+Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+
+------------------------------------------------------------------------------
+
+'Why', I hear you ask, 'would anyone even _want_ documentation
+for them sysctl files? If anybody really needs it, it's all in
+the source...'
+
+Well, this documentation is written because some people either
+don't know they need to tweak something, or because they don't
+have the time or knowledge to read the source code.
+
+Furthermore, the programmers who built sysctl have built it to
+be actually used, not just for the fun of programming it :-)
+
+------------------------------------------------------------------------------
+
+Legal blurb:
+
+As usual, there are two main things to consider:
+
+1. you get what you pay for
+2. it's free
+
+The consequences are that I won't guarantee the correctness of
+this document, and if you come to me complaining about how you
+screwed up your system because of wrong documentation, I won't
+feel sorry for you. I might even laugh at you...
+
+But of course, if you _do_ manage to screw up your system using
+only the sysctl options used in this file, I'd like to hear of
+it. Not only to have a great laugh, but also to make sure that
+you're the last RTFMing person to screw up.
+
+In short, e-mail your suggestions, corrections and / or horror
+stories to: <riel@nl.linux.org>
+
+Rik van Riel.
+
+--------------------------------------------------------------
+
+Introduction
+============
+
+Sysctl is a means of configuring certain aspects of the kernel
+at run-time, and the /proc/sys/ directory is there so that you
+don't even need special tools to do it!
+In fact, there are only four things needed to use these config
+facilities:
+
+- a running Linux system
+- root access
+- common sense (this is especially hard to come by these days)
+- knowledge of what all those values mean
+
+As a quick 'ls /proc/sys' will show, the directory consists of
+several (arch-dependent?) subdirs. Each subdir is mainly about
+one part of the kernel, so you can do configuration on a piece
+by piece basis, or just some 'thematic frobbing'.
+
+This documentation is about:
+
+=============== ===============================================================
+abi/ execution domains & personalities
+debug/ <empty>
+dev/ device specific information (eg dev/cdrom/info)
+fs/ specific filesystems
+ filehandle, inode, dentry and quota tuning
+ binfmt_misc <Documentation/admin-guide/binfmt-misc.rst>
+kernel/ global kernel info / tuning
+ miscellaneous stuff
+net/ networking stuff, for documentation look in:
+ <Documentation/networking/>
+proc/ <empty>
+sunrpc/ SUN Remote Procedure Call (NFS)
+vm/ memory management tuning
+ buffer and cache management
+user/ Per-user, per-user-namespace limits
+=============== ===============================================================
+
+These are the subdirs I have on my system. There might be more
+or other subdirs in another setup. If you see another dir, I'd
+really like to hear about it :-)
+
+.. toctree::
+ :maxdepth: 1
+
+ abi
+ fs
+ kernel
+ net
+ sunrpc
+ user
+ vm
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
new file mode 100644
index 0000000..032c7cd
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -0,0 +1,1177 @@
+===================================
+Documentation for /proc/sys/kernel/
+===================================
+
+kernel version 2.2.10
+
+Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+
+Copyright (c) 2009, Shen Feng <shen@cn.fujitsu.com>
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------
+
+This file contains documentation for the sysctl files in
+/proc/sys/kernel/ and is valid for Linux kernel version 2.2.
+
+The files in this directory can be used to tune and monitor
+miscellaneous and general things in the operation of the Linux
+kernel. Since some of the files _can_ be used to screw up your
+system, it is advisable to read both documentation and source
+before actually making adjustments.
+
+Currently, these files might (depending on your configuration)
+show up in /proc/sys/kernel:
+
+- acct
+- acpi_video_flags
+- auto_msgmni
+- bootloader_type [ X86 only ]
+- bootloader_version [ X86 only ]
+- cap_last_cap
+- core_pattern
+- core_pipe_limit
+- core_uses_pid
+- ctrl-alt-del
+- dmesg_restrict
+- domainname
+- hostname
+- hotplug
+- hardlockup_all_cpu_backtrace
+- hardlockup_panic
+- hung_task_panic
+- hung_task_check_count
+- hung_task_timeout_secs
+- hung_task_check_interval_secs
+- hung_task_warnings
+- hyperv_record_panic_msg
+- kexec_load_disabled
+- kptr_restrict
+- l2cr [ PPC only ]
+- modprobe ==> Documentation/debugging-modules.txt
+- modules_disabled
+- msg_next_id [ sysv ipc ]
+- msgmax
+- msgmnb
+- msgmni
+- nmi_watchdog
+- osrelease
+- ostype
+- overflowgid
+- overflowuid
+- panic
+- panic_on_oops
+- panic_on_stackoverflow
+- panic_on_unrecovered_nmi
+- panic_on_warn
+- panic_print
+- panic_on_rcu_stall
+- perf_cpu_time_max_percent
+- perf_event_paranoid
+- perf_event_max_stack
+- perf_event_mlock_kb
+- perf_event_max_contexts_per_stack
+- pid_max
+- powersave-nap [ PPC only ]
+- printk
+- printk_delay
+- printk_ratelimit
+- printk_ratelimit_burst
+- pty ==> Documentation/filesystems/devpts.txt
+- randomize_va_space
+- real-root-dev ==> Documentation/admin-guide/initrd.rst
+- reboot-cmd [ SPARC only ]
+- rtsig-max
+- rtsig-nr
+- sched_energy_aware
+- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst
+- sem
+- sem_next_id [ sysv ipc ]
+- sg-big-buff [ generic SCSI device (sg) ]
+- shm_next_id [ sysv ipc ]
+- shm_rmid_forced
+- shmall
+- shmmax [ sysv ipc ]
+- shmmni
+- softlockup_all_cpu_backtrace
+- soft_watchdog
+- stack_erasing
+- stop-a [ SPARC only ]
+- sysrq ==> Documentation/admin-guide/sysrq.rst
+- sysctl_writes_strict
+- tainted ==> Documentation/admin-guide/tainted-kernels.rst
+- threads-max
+- unknown_nmi_panic
+- watchdog
+- watchdog_thresh
+- version
+
+
+acct:
+=====
+
+highwater lowwater frequency
+
+If BSD-style process accounting is enabled, these values control
+its behaviour. If free space on the filesystem where the log lives
+goes below <lowwater>%, accounting suspends. If free space gets
+above <highwater>%, accounting resumes. <frequency> determines
+how often the amount of free space is checked (value is in
+seconds). Default::
+
+  4 2 30
+
+That is, suspend accounting if <= 2% of the space is free; resume it
+if >= 4% is free; consider information about the amount of free space
+valid for 30 seconds.
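+
+The suspend/resume rule is a simple hysteresis. The sketch below is an
+illustration only, not the kernel's implementation: the function name
+and the inclusive comparisons (taken from the "<= 2%" and ">= 4%"
+wording of the example) are assumptions.
+
```python
def accounting_active(active, free_percent, highwater=4, lowwater=2):
    """Return the new accounting state for a free-space reading.

    Hypothetical helper: models only the highwater/lowwater rule,
    using the default values 4 and 2 from the text.
    """
    if active and free_percent <= lowwater:
        return False  # too little free space: suspend accounting
    if not active and free_percent >= highwater:
        return True   # enough free space again: resume accounting
    return active     # between the two marks: keep the current state


state = True
history = []
for free in (10, 3, 2, 3, 4, 1):
    state = accounting_active(state, free)
    history.append(state)
print(history)
```
+
+Note that a reading of 3% leaves the state unchanged in both
+directions, which is the point of having two separate thresholds.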
+
+
+acpi_video_flags:
+=================
+
+flags
+
+See Doc*/kernel/power/video.txt; it allows the video boot mode to be
+set at run time.
+
+
+auto_msgmni:
+============
+
+This variable has no effect and may be removed in future kernel
+releases. Reading it always returns 0.
+Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni
+upon memory add/remove or upon ipc namespace creation/removal.
+Echoing "1" into this file enabled msgmni automatic recomputing.
+Echoing "0" turned it off. auto_msgmni default value was 1.
+
+
+bootloader_type:
+================
+
+x86 bootloader identification
+
+This gives the bootloader type number as indicated by the bootloader,
+shifted left by 4, and OR'd with the low four bits of the bootloader
+version. The reason for this encoding is that this used to match the
+type_of_loader field in the kernel header; the encoding is kept for
+backwards compatibility. That is, if the full bootloader type number
+is 0x15 and the full version number is 0x234, this file will contain
+the value 340 = 0x154.
+
+See the type_of_loader and ext_loader_type fields in
+Documentation/x86/boot.rst for additional information.
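+
+The encoding described above can be reproduced with two bit operations.
+This is a sketch for illustration; the function name is hypothetical.
+
```python
def bootloader_type_value(full_type, full_version):
    """Combine type and version as described for bootloader_type.

    The full type number is shifted left by 4 and OR'd with the low
    four bits of the full version number.
    """
    return (full_type << 4) | (full_version & 0xF)


value = bootloader_type_value(0x15, 0x234)
print(value, hex(value))  # 340 0x154, matching the example in the text
```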
+
+
+bootloader_version:
+===================
+
+x86 bootloader version
+
+The complete bootloader version number. In the example above, this
+file will contain the value 564 = 0x234.
+
+See the type_of_loader and ext_loader_ver fields in
+Documentation/x86/boot.rst for additional information.
+
+
+cap_last_cap:
+=============
+
+Highest valid capability of the running kernel. Exports
+CAP_LAST_CAP from the kernel.
+
+
+core_pattern:
+=============
+
+core_pattern is used to specify a core dumpfile pattern name.
+
+* max length 127 characters; default value is "core"
+* core_pattern is used as a pattern template for the output filename;
+ certain string patterns (beginning with '%') are substituted with
+ their actual values.
+* backward compatibility with core_uses_pid:
+
+ If core_pattern does not include "%p" (default does not)
+ and core_uses_pid is set, then .PID will be appended to
+ the filename.
+
+* corename format specifiers::
+
+    %<NUL>    '%' is dropped
+    %%        output one '%'
+    %p        pid
+    %P        global pid (init PID namespace)
+    %i        tid
+    %I        global tid (init PID namespace)
+    %u        uid (in initial user namespace)
+    %g        gid (in initial user namespace)
+    %d        dump mode, matches PR_SET_DUMPABLE and
+              /proc/sys/fs/suid_dumpable
+    %s        signal number
+    %t        UNIX time of dump
+    %h        hostname
+    %e        executable filename (may be shortened)
+    %E        executable path
+    %<OTHER>  both are dropped
+
+* If the first character of the pattern is a '|', the kernel will treat
+ the rest of the pattern as a command to run. The core dump will be
+ written to the standard input of that program instead of to a file.
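+
+As a rough model of the substitution rules listed above, the following
+sketch expands a pattern from a dictionary of specifier values. It is
+not the kernel's code: the function name and the ``values`` mapping are
+hypothetical, pipe patterns (leading '|') are not handled, and a
+specifier missing from ``values`` is treated like an unknown one.
+
```python
def expand_core_pattern(pattern, values, core_uses_pid=False):
    """Expand '%' specifiers in pattern using the values mapping.

    values maps a specifier letter (e.g. "p", "e") to its replacement.
    """
    out = []
    pid_used = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        if i + 1 == len(pattern):
            break  # '%' at the very end of the pattern is dropped
        spec = pattern[i + 1]
        if spec == "%":
            out.append("%")  # "%%" outputs a single '%'
        elif spec in values:
            out.append(str(values[spec]))
            pid_used = pid_used or spec == "p"
        # any other specifier: both characters are dropped
        i += 2
    name = "".join(out)
    if core_uses_pid and not pid_used:
        name += "." + str(values["p"])  # backward compatibility rule
    return name


print(expand_core_pattern("core", {"p": 1234}, core_uses_pid=True))
print(expand_core_pattern("core.%e.%p", {"p": 1234, "e": "myprog"}))
```
+
+With these inputs the default pattern yields ``core.1234`` when
+core_uses_pid is set, and the second call yields ``core.myprog.1234``.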
+
+
+core_pipe_limit:
+================
+
+This sysctl is only applicable when core_pattern is configured to pipe
+core files to a user space helper (when the first character of
+core_pattern is a '|', see above). When collecting cores via a pipe
+to an application, it is occasionally useful for the collecting
+application to gather data about the crashing process from its
+/proc/pid directory. In order to do this safely, the kernel must wait
+for the collecting process to exit, so as not to remove the crashing
+process's proc files prematurely. This in turn creates the
+possibility that a misbehaving userspace collecting process can block
+the reaping of a crashed process simply by never exiting. This sysctl
+defends against that. It defines how many concurrent crashing
+processes may be piped to user space applications in parallel. If
+this value is exceeded, then those crashing processes above that value
+are noted via the kernel log and their cores are skipped. 0 is a
+special value, indicating that unlimited processes may be captured in
+parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This
+value defaults to 0.
+
+
+core_uses_pid:
+==============
+
+The default coredump filename is "core". By setting
+core_uses_pid to 1, the coredump filename becomes core.PID.
+If core_pattern does not include "%p" (default does not)
+and core_uses_pid is set, then .PID will be appended to
+the filename.
+
+
+ctrl-alt-del:
+=============
+
+When the value in this file is 0, ctrl-alt-del is trapped and
+sent to the init(1) program to handle a graceful restart.
+When, however, the value is > 0, Linux's reaction to a Vulcan
+Nerve Pinch (tm) will be an immediate reboot, without even
+syncing its dirty buffers.
+
+Note:
+ when a program (like dosemu) has the keyboard in 'raw'
+ mode, the ctrl-alt-del is intercepted by the program before it
+ ever reaches the kernel tty layer, and it's up to the program
+ to decide what to do with it.
+
+
+dmesg_restrict:
+===============
+
+This toggle indicates whether unprivileged users are prevented
+from using dmesg(8) to view messages from the kernel's log buffer.
+When dmesg_restrict is set to (0) there are no restrictions. When
+dmesg_restrict is set to (1), users must have CAP_SYSLOG to use
+dmesg(8).
+
+The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the
+default value of dmesg_restrict.
+
+
+domainname & hostname:
+======================
+
+These files can be used to set the NIS/YP domainname and the
+hostname of your box in exactly the same way as the commands
+domainname and hostname, i.e.::
+
+ # echo "darkstar" > /proc/sys/kernel/hostname
+ # echo "mydomain" > /proc/sys/kernel/domainname
+
+has the same effect as::
+
+ # hostname "darkstar"
+ # domainname "mydomain"
+
+Note, however, that the classic darkstar.frop.org has the
+hostname "darkstar" and DNS (Internet Domain Name Server)
+domainname "frop.org", not to be confused with the NIS (Network
+Information Service) or YP (Yellow Pages) domainname. These two
+domain names are in general different. For a detailed discussion
+see the hostname(1) man page.
+
+
+hardlockup_all_cpu_backtrace:
+=============================
+
+This value controls the hard lockup detector behavior when a hard
+lockup condition is detected as to whether or not to gather further
+debug information. If enabled, arch-specific all-CPU stack dumping
+will be initiated.
+
+0: do nothing. This is the default behavior.
+
+1: on detection capture more debug information.
+
+
+hardlockup_panic:
+=================
+
+This parameter can be used to control whether the kernel panics
+when a hard lockup is detected.
+
+ 0 - don't panic on hard lockup
+ 1 - panic on hard lockup
+
+See Documentation/admin-guide/lockup-watchdogs.rst for more information. This can
+also be set using the nmi_watchdog kernel parameter.
+
+
+hotplug:
+========
+
+Path for the hotplug policy agent.
+Default value is "/sbin/hotplug".
+
+
+hung_task_panic:
+================
+
+Controls the kernel's behavior when a hung task is detected.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+0: continue operation. This is the default behavior.
+
+1: panic immediately.
+
+
+hung_task_check_count:
+======================
+
+The upper bound on the number of tasks that are checked.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+
+hung_task_timeout_secs:
+=======================
+
+When a task in D state has not been scheduled
+for more than this many seconds, a warning is reported.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+0: means infinite timeout - no checking done.
+
+Possible values to set are in range {0..LONG_MAX/HZ}.
+
+
+hung_task_check_interval_secs:
+==============================
+
+Hung task check interval. If hung task checking is enabled
+(see hung_task_timeout_secs), the check is done every
+hung_task_check_interval_secs seconds.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+0 (default): means use hung_task_timeout_secs as checking interval.
+Possible values to set are in range {0..LONG_MAX/HZ}.
+
+
+hung_task_warnings:
+===================
+
+The maximum number of warnings to report. During a check interval
+if a hung task is detected, this value is decreased by 1.
+When this value reaches 0, no more warnings will be reported.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+-1: report an infinite number of warnings.
+
+
+hyperv_record_panic_msg:
+========================
+
+Controls whether the panic kmsg data should be reported to Hyper-V.
+
+0: do not report panic kmsg data.
+
+1: report the panic kmsg data. This is the default behavior.
+
+
+kexec_load_disabled:
+====================
+
+A toggle indicating if the kexec_load syscall has been disabled. This
+value defaults to 0 (false: kexec_load enabled), but can be set to 1
+(true: kexec_load disabled). Once true, kexec can no longer be used, and
+the toggle cannot be set back to false. This allows a kexec image to be
+loaded before disabling the syscall, allowing a system to set up (and
+later use) an image without it being altered. Generally used together
+with the "modules_disabled" sysctl.
+
+
+kptr_restrict:
+==============
+
+This toggle indicates whether restrictions are placed on
+exposing kernel addresses via /proc and other interfaces.
+
+When kptr_restrict is set to 0 (the default) the address is hashed before
+printing. (This is the equivalent to %p.)
+
+When kptr_restrict is set to (1), kernel pointers printed using the %pK
+format specifier will be replaced with 0's unless the user has CAP_SYSLOG
+and effective user and group ids are equal to the real ids. This is
+because %pK checks are done at read() time rather than open() time, so
+if permissions are elevated between the open() and the read() (e.g. via
+a setuid binary) then %pK will not leak kernel pointers to unprivileged
+users. Note, this is a temporary solution only. The correct long-term
+solution is to do the permission checks at open() time. Consider removing
+world read permissions from files that use %pK, and using dmesg_restrict
+to protect against uses of %pK in dmesg(8) if leaking kernel pointer
+values to unprivileged users is a concern.
+
+When kptr_restrict is set to (2), kernel pointers printed using
+%pK will be replaced with 0's regardless of privileges.
+
+
+l2cr: (PPC only)
+================
+
+This flag controls the L2 cache of G3 processor boards. If
+0, the cache is disabled. Enabled if nonzero.
+
+
+modules_disabled:
+=================
+
+A toggle value indicating if modules are allowed to be loaded
+in an otherwise modular kernel. This toggle defaults to off
+(0), but can be set true (1). Once true, modules can be
+neither loaded nor unloaded, and the toggle cannot be set back
+to false. Generally used with the "kexec_load_disabled" toggle.
+
+
+msg_next_id, sem_next_id, and shm_next_id:
+==========================================
+
+These three toggles allow the desired id to be specified for the next
+allocated IPC object: message, semaphore or shared memory respectively.
+
+By default they are equal to -1, which means generic allocation logic.
+Possible values to set are in range {0..INT_MAX}.
+
+Notes:
+  1) The kernel doesn't guarantee that the new object will have the
+     desired id. So it's up to userspace how to handle an object with
+     a "wrong" id.
+  2) A toggle with a non-default value will be set back to -1 by the
+     kernel after successful IPC object allocation. If an IPC object
+     allocation syscall fails, it is undefined if the value remains
+     unmodified or is reset to -1.
+
+
+nmi_watchdog:
+=============
+
+This parameter can be used to control the NMI watchdog
+(i.e. the hard lockup detector) on x86 systems.
+
+0 - disable the hard lockup detector
+
+1 - enable the hard lockup detector
+
+The hard lockup detector monitors each CPU for its ability to respond to
+timer interrupts. The mechanism utilizes CPU performance counter registers
+that are programmed to generate Non-Maskable Interrupts (NMIs) periodically
+while a CPU is busy. Hence, the alternative name 'NMI watchdog'.
+
+The NMI watchdog is disabled by default if the kernel is running as a guest
+in a KVM virtual machine. This default can be overridden by adding::
+
+ nmi_watchdog=1
+
+to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst).
+
+
+numa_balancing:
+===============
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+On NUMA machines, there is a performance penalty if remote memory is
+accessed by a CPU. When this feature is enabled the kernel samples which
+task thread is accessing memory by periodically unmapping pages and later
+trapping a page fault. At the time of the page fault, it is determined
+if the data being accessed should be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
+===============================================================================================================================
+
+
+Automatic NUMA balancing scans a task's address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a task's memory is migrated to a local node if the
+workload pattern changes, which minimises the performance impact of remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a task's virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a task's virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+
+osrelease, ostype & version:
+============================
+
+::
+
+ # cat osrelease
+ 2.1.88
+ # cat ostype
+ Linux
+ # cat version
+ #5 Wed Feb 25 21:49:24 MET 1998
+
+The files osrelease and ostype should be clear enough. Version
+needs a little more clarification however. The '#5' means that
+this is the fifth kernel built from this source base and the
+date behind it indicates the time the kernel was built.
+The only way to tune these values is to rebuild the kernel :-)
+
+
+overflowgid & overflowuid:
+==========================
+
+If your architecture did not always support 32-bit UIDs (i.e. arm,
+i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to
+applications that use the old 16-bit UID/GID system calls, if the
+actual UID or GID would exceed 65535.
+
+These sysctls allow you to change the value of the fixed UID and GID.
+The default is 65534.
+
+
+panic:
+======
+
+The value in this file represents the number of seconds the kernel
+waits before rebooting on a panic. When you use the software watchdog,
+the recommended setting is 60.
+
+
+panic_on_io_nmi:
+================
+
+Controls the kernel's behavior when a CPU receives an NMI caused by
+an IO error.
+
+0: try to continue operation (default)
+
+1: panic immediately. The IO error triggered an NMI. This indicates a
+ serious system condition which could result in IO data corruption.
+ Rather than continuing, panicking might be a better choice. Some
+ servers issue this sort of NMI when the dump button is pushed,
+ and you can use this option to take a crash dump.
+
+
+panic_on_oops:
+==============
+
+Controls the kernel's behaviour when an oops or BUG is encountered.
+
+0: try to continue operation
+
+1: panic immediately. If the `panic` sysctl is also non-zero then the
+ machine will be rebooted.
+
+
+panic_on_stackoverflow:
+=======================
+
+Controls the kernel's behavior when an overflow of the kernel, IRQ or
+exception stacks (but not of a user stack) is detected.
+This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
+
+0: try to continue operation.
+
+1: panic immediately.
+
+
+panic_on_unrecovered_nmi:
+=========================
+
+The default Linux behaviour on an NMI of either memory or unknown
+origin is to continue operation. For many environments such as scientific
+computing it is preferable that the box is taken out and the error
+dealt with than that an uncorrected parity/ECC error gets propagated.
+
+A small number of systems do generate NMIs for bizarre random reasons
+such as power management, so the default is off. This sysctl works like
+the existing panic controls already in this directory.
+
+
+panic_on_warn:
+==============
+
+Calls panic() in the WARN() path when set to 1. This is useful to avoid
+a kernel rebuild when attempting to kdump at the location of a WARN().
+
+0: only WARN(), default behaviour.
+
+1: call panic() after printing out WARN() location.
+
+
+panic_print:
+============
+
+Bitmask for printing system info when a panic happens. The user can choose
+a combination of the following bits:
+
+===== ========================================
+bit 0 print all tasks info
+bit 1 print system memory info
+bit 2 print timer info
+bit 3 print locks info if CONFIG_LOCKDEP is on
+bit 4 print ftrace buffer
+===== ========================================
+
+So, for example, to print tasks and memory info on panic, the user can::
+
+ echo 3 > /proc/sys/kernel/panic_print
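+
+The value to write is the OR of the selected bits. The following sketch
+builds and decodes such a mask; the short names in the table and both
+helper functions are made up for this illustration.
+
```python
# Bit positions from the table above; the key names are hypothetical labels.
PANIC_PRINT_BITS = {
    "tasks": 0,
    "memory": 1,
    "timers": 2,
    "locks": 3,
    "ftrace": 4,
}


def panic_print_mask(*names):
    """Return the panic_print value that selects the named items."""
    mask = 0
    for name in names:
        mask |= 1 << PANIC_PRINT_BITS[name]
    return mask


def panic_print_names(mask):
    """Return the labels of the bits that are set in mask."""
    return [n for n, bit in PANIC_PRINT_BITS.items() if mask & (1 << bit)]


print(panic_print_mask("tasks", "memory"))   # 3, as in the echo example
print(panic_print_names(20))                 # ['timers', 'ftrace']
```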
+
+
+panic_on_rcu_stall:
+===================
+
+When set to 1, calls panic() after RCU stall detection messages. This
+is useful for identifying the root cause of RCU stalls using a vmcore.
+
+0: do not panic() when RCU stall takes place, default behavior.
+
+1: panic() after printing RCU stall messages.
+
+
+perf_cpu_time_max_percent:
+==========================
+
+Hints to the kernel how much CPU time it should be allowed to
+use to handle perf sampling events. If the perf subsystem
+is informed that its samples are exceeding this limit, it
+will drop its sampling frequency to attempt to reduce its CPU
+usage.
+
+Some perf sampling happens in NMIs. If these samples
+unexpectedly take too long to execute, the NMIs can become
+stacked up next to each other so much that nothing else is
+allowed to execute.
+
+0:
+	disable the mechanism. Do not monitor or correct perf's
+	sampling rate no matter how much CPU time it takes.
+
+1-100:
+ attempt to throttle perf's sample rate to this
+ percentage of CPU. Note: the kernel calculates an
+ "expected" length of each sample event. 100 here means
+ 100% of that expected length. Even if this is set to
+ 100, you may still see sample throttling if this
+ length is exceeded. Set to 0 if you truly do not care
+ how much CPU is consumed.
+
+
+perf_event_paranoid:
+====================
+
+Controls use of the performance events system by unprivileged
+users (without CAP_SYS_ADMIN). The default value is 2.
+
+=== ==================================================================
+ -1 Allow use of (almost) all events by all users
+
+    Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
+
+>=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
+
+    Disallow raw tracepoint access by users without CAP_SYS_ADMIN
+
+>=1 Disallow CPU event access by users without CAP_SYS_ADMIN
+
+>=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
+=== ==================================================================
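+
+The levels are cumulative: each restriction applies at its threshold
+and at every higher value. A small sketch of that reading of the table
+(the function and list names are hypothetical, and the strings are
+paraphrased from the table rows):
+
```python
# (threshold, what is disallowed for users without CAP_SYS_ADMIN)
RESTRICTIONS = [
    (0, "ftrace function tracepoint"),
    (0, "raw tracepoint access"),
    (1, "CPU event access"),
    (2, "kernel profiling"),
]


def disallowed_without_cap_sys_admin(level):
    """List what an unprivileged user may not do at this paranoid level."""
    return [what for threshold, what in RESTRICTIONS if level >= threshold]


for level in (-1, 0, 1, 2):
    print(level, disallowed_without_cap_sys_admin(level))
```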
+
+
+perf_event_max_stack:
+=====================
+
+Controls maximum number of stack frames to copy for (attr.sample_type &
+PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using
+'perf record -g' or 'perf trace --call-graph fp'.
+
+This can only be done when no events are in use that have callchains
+enabled, otherwise writing to this file will return -EBUSY.
+
+The default value is 127.
+
+
+perf_event_mlock_kb:
+====================
+
+Controls the size of the per-cpu ring buffer not counted against the
+mlock limit.
+
+The default value is 512 + 1 page.
+
+
+perf_event_max_contexts_per_stack:
+==================================
+
+Controls maximum number of stack frame context entries for
+(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for
+instance, when using 'perf record -g' or 'perf trace --call-graph fp'.
+
+This can only be done when no events are in use that have callchains
+enabled, otherwise writing to this file will return -EBUSY.
+
+The default value is 8.
+
+
+pid_max:
+========
+
+PID allocation wrap value. When the kernel's next PID value
+reaches this value, it wraps back to a minimum PID value.
+PIDs of value pid_max or larger are not allocated.
+
+
+ns_last_pid:
+============
+
+The last pid allocated in the current pid namespace (the one that the task
+using this sysctl lives in). When selecting a pid for the next task on fork,
+the kernel tries to allocate a number starting from this one.
+
+
+powersave-nap: (PPC only)
+=========================
+
+If set, Linux-PPC will use the 'nap' mode of powersaving,
+otherwise the 'doze' mode will be used.
+
+
+printk:
+=======
+
+The four values in printk denote: console_loglevel,
+default_message_loglevel, minimum_console_loglevel and
+default_console_loglevel respectively.
+
+These values influence printk() behavior when printing or
+logging error messages. See 'man 2 syslog' for more info on
+the different loglevels.
+
+- console_loglevel:
+ messages with a higher priority than
+ this will be printed to the console
+- default_message_loglevel:
+ messages without an explicit priority
+ will be printed with this priority
+- minimum_console_loglevel:
+ minimum (highest) value to which
+ console_loglevel can be set
+- default_console_loglevel:
+ default value for console_loglevel
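+
+The four fields can be read positionally. The sketch below parses a
+sample string in the format of this file and applies the console rule;
+the sample values "4 4 1 7" are only an illustration, and the helper
+names are hypothetical. It relies on the syslog(2) convention that a
+numerically lower level means a higher priority.
+
```python
from collections import namedtuple

PrintkLevels = namedtuple(
    "PrintkLevels",
    "console_loglevel default_message_loglevel "
    "minimum_console_loglevel default_console_loglevel",
)


def parse_printk(text):
    """Parse the four whitespace-separated integers of the printk file."""
    fields = [int(field) for field in text.split()]
    if len(fields) != 4:
        raise ValueError("expected four values, got %d" % len(fields))
    return PrintkLevels(*fields)


def reaches_console(message_level, levels):
    """True if a message of this level has higher priority (a lower
    number) than console_loglevel and is therefore printed."""
    return message_level < levels.console_loglevel


levels = parse_printk("4\t4\t1\t7")  # sample content, not a guaranteed default
print(levels)
print(reaches_console(3, levels), reaches_console(4, levels))
```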
+
+
+printk_delay:
+=============
+
+Delay each printk message by printk_delay milliseconds.
+
+Values from 0 to 10000 are allowed.
+
+
+printk_ratelimit:
+=================
+
+Some warning messages are rate limited. printk_ratelimit specifies
+the minimum length of time between these messages (in seconds); by
+default we allow one every 5 seconds.
+
+A value of 0 will disable rate limiting.
+
+
+printk_ratelimit_burst:
+=======================
+
+While long term we enforce one message per printk_ratelimit
+seconds, we do allow a burst of messages to pass through.
+printk_ratelimit_burst specifies the number of messages we can
+send before ratelimiting kicks in.
+
+
+printk_devkmsg:
+===============
+
+Control the logging to /dev/kmsg from userspace:
+
+ratelimit:
+ default, ratelimited
+
+on: unlimited logging to /dev/kmsg from userspace
+
+off: logging to /dev/kmsg disabled
+
+The kernel command line parameter printk.devkmsg= overrides this and is
+a one-time setting until next reboot: once set, it cannot be changed by
+this sysctl interface anymore.
+
+
+randomize_va_space:
+===================
+
+This option can be used to select the type of process address
+space randomization that is used in the system, for architectures
+that support this feature.
+
+== ===========================================================================
+0  Turn the process address space randomization off. This is the
+   default for architectures that do not support this feature anyways,
+   and kernels that are booted with the "norandmaps" parameter.
+
+1  Make the addresses of mmap base, stack and VDSO page randomized.
+   This, among other things, implies that shared libraries will be
+   loaded to random addresses. Also for PIE-linked binaries, the
+   location of code start is randomized. This is the default if the
+   CONFIG_COMPAT_BRK option is enabled.
+
+2  Additionally enable heap randomization. This is the default if
+   CONFIG_COMPAT_BRK is disabled.
+
+   There are a few legacy applications out there (such as some ancient
+   versions of libc.so.5 from 1996) that assume that brk area starts
+   just after the end of the code+bss. These applications break when
+   start of the brk area is randomized. There are however no known
+   non-legacy applications that would be broken this way, so for most
+   systems it is safe to choose full randomization.
+
+   Systems with ancient and/or broken binaries should be configured
+   with CONFIG_COMPAT_BRK enabled, which excludes the heap from process
+   address space randomization.
+== ===========================================================================
+
+
+reboot-cmd: (Sparc only)
+========================
+
+??? This seems to be a way to give an argument to the Sparc
+ROM/Flash boot loader. Maybe to tell it what to do after
+rebooting. ???
+
+
+rtsig-max & rtsig-nr:
+=====================
+
+The file rtsig-max can be used to tune the maximum number
+of POSIX realtime (queued) signals that can be outstanding
+in the system.
+
+rtsig-nr shows the number of RT signals currently queued.
+
+
+sched_energy_aware:
+===================
+
+Enables/disables Energy Aware Scheduling (EAS). EAS starts
+automatically on platforms where it can run (that is,
+platforms with asymmetric CPU topologies and having an Energy
+Model available). If your platform happens to meet the
+requirements for EAS but you do not want to use it, change
+this value to 0.
+
+
+sched_schedstats:
+=================
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+
+sg-big-buff:
+============
+
+This file shows the size of the generic SCSI (sg) buffer.
+You can't tune it just yet, but you could change it on
+compile time by editing include/scsi/sg.h and changing
+the value of SG_BIG_BUFF.
+
+There shouldn't be any reason to change this value. If
+you can come up with one, you probably know what you
+are doing anyway :)
+
+
+shmall:
+=======
+
+This parameter sets the total amount of shared memory pages that
+can be used system wide. Hence, SHMALL should always be at least
+ceil(shmmax/PAGE_SIZE).
+
+If you are not sure what the default PAGE_SIZE is on your Linux
+system, you can run the following command::
+
+ # getconf PAGE_SIZE
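+
+The lower bound ceil(shmmax/PAGE_SIZE) can be computed with integer
+arithmetic. In this sketch the function name is hypothetical and the
+page size of 4096 bytes is an assumption; use the value reported by
+getconf on the system in question.
+
```python
def min_shmall_pages(shmmax_bytes, page_size):
    """Smallest shmall (in pages) that can hold one segment of
    shmmax_bytes, i.e. ceil(shmmax / PAGE_SIZE) without floats."""
    return -(-shmmax_bytes // page_size)


PAGE_SIZE = 4096  # assumed; check with `getconf PAGE_SIZE`
print(min_shmall_pages(1 << 30, PAGE_SIZE))  # 1 GiB segment
print(min_shmall_pages(4097, PAGE_SIZE))     # one byte over a page
```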
+
+
+shmmax:
+=======
+
+This value can be used to query and set the run time limit
+on the maximum shared memory segment size that can be created.
+Shared memory segments up to 1Gb are now supported in the
+kernel. This value defaults to SHMMAX.
+
+
+shm_rmid_forced:
+================
+
+Linux lets you set resource limits, including how much memory one
+process can consume, via setrlimit(2). Unfortunately, shared memory
+segments are allowed to exist without association with any process, and
+thus might not be counted against any resource limits. If enabled,
+shared memory segments are automatically destroyed when their attach
+count becomes zero after a detach or a process termination. It will
+also destroy segments that were created, but never attached to, on exit
+from the process. The only use left for IPC_RMID is to immediately
+destroy an unattached segment. Of course, this breaks the way things are
+defined, so some applications might stop working. Note that this
+feature will do you no good unless you also configure your resource
+limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't
+need this.
+
+Note that if you change this from 0 to 1, already created segments
+without users and with a dead originative process will be destroyed.
+
+
+sysctl_writes_strict:
+=====================
+
+Control how file position affects the behavior of updating sysctl values
+via the /proc/sys interface:
+
+ == ======================================================================
+ -1 Legacy per-write sysctl value handling, with no printk warnings.
+    Each write syscall must fully contain the sysctl value to be
+    written, and multiple writes on the same sysctl file descriptor
+    will rewrite the sysctl value, regardless of file position.
+  0 Same behavior as above, but warn about processes that perform writes
+    to a sysctl file descriptor when the file position is not 0.
+  1 (default) Respect file position when writing sysctl strings. Multiple
+    writes will append to the sysctl value buffer. Anything past the max
+    length of the sysctl value buffer will be ignored. Writes to numeric
+    sysctl entries must always be at file position 0 and the value must
+    be fully contained in the buffer sent in the write syscall.
+ == ======================================================================
+
+
+softlockup_all_cpu_backtrace:
+=============================
+
+This value controls the soft lockup detector thread's behavior
+when a soft lockup condition is detected as to whether or not
+to gather further debug information. If enabled, each cpu will
+be issued an NMI and instructed to capture stack trace.
+
+This feature is only applicable for architectures which support
+NMI.
+
+0: do nothing. This is the default behavior.
+
+1: on detection capture more debug information.
+
+
+soft_watchdog:
+==============
+
+This parameter can be used to control the soft lockup detector.
+
+ 0 - disable the soft lockup detector
+
+ 1 - enable the soft lockup detector
+
+The soft lockup detector monitors CPUs for threads that are hogging the CPUs
+without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
+from running. The mechanism depends on the CPU's ability to respond to timer
+interrupts which are needed for the 'watchdog/N' threads to be woken up by
+the watchdog timer function, otherwise the NMI watchdog - if enabled - can
+detect a hard lockup condition.
+
+
+stack_erasing:
+==============
+
+This parameter can be used to control kernel stack erasing at the end
+of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK.
+
+That erasing reduces the information which kernel stack leak bugs
+can reveal and blocks some uninitialized stack variable attacks.
+The tradeoff is the performance impact: on a single CPU system kernel
+compilation sees a 1% slowdown, other systems and workloads may vary.
+
+ 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated.
+
+ 1: kernel stack erasing is enabled (default), it is performed before
+ returning to the userspace at the end of syscalls.
+
+
+tainted
+=======
+
+Non-zero if the kernel has been tainted. Numeric values, which can be
+ORed together. The letters are seen in "Tainted" line of Oops reports.
+
+====== ===== ==============================================================
+     1 `(P)` proprietary module was loaded
+     2 `(F)` module was force loaded
+     4 `(S)` SMP kernel oops on an officially SMP incapable processor
+     8 `(R)` module was force unloaded
+    16 `(M)` processor reported a Machine Check Exception (MCE)
+    32 `(B)` bad page referenced or some unexpected page flags
+    64 `(U)` taint requested by userspace application
+   128 `(D)` kernel died recently, i.e. there was an OOPS or BUG
+   256 `(A)` an ACPI table was overridden by user
+   512 `(W)` kernel issued warning
+  1024 `(C)` staging driver was loaded
+  2048 `(I)` workaround for bug in platform firmware applied
+  4096 `(O)` externally-built ("out-of-tree") module was loaded
+  8192 `(E)` unsigned module was loaded
+ 16384 `(L)` soft lockup occurred
+ 32768 `(K)` kernel has been live patched
+ 65536 `(X)` Auxiliary taint, defined for and used by distros
+131072 `(T)` The kernel was built with the struct randomization plugin
+====== ===== ==============================================================
+
+See Documentation/admin-guide/tainted-kernels.rst for more information.
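+
+As an illustrative sketch (not part of the original text), the set bits can
+be listed from a shell using arithmetic expansion. The loop bound of 17
+assumes only the eighteen bits listed in the table above::
+
+  t=$(cat /proc/sys/kernel/tainted)
+  for bit in $(seq 0 17); do
+    if [ $(( (t >> bit) & 1 )) -eq 1 ]; then
+      echo "bit $bit is set (value $(( 1 << bit )))"
+    fi
+  done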
+
+
+threads-max:
+============
+
+This value controls the maximum number of threads that can be created
+using fork().
+
+During initialization the kernel sets this value such that even if the
+maximum number of threads is created, the thread structures occupy only
+a part (1/8th) of the available RAM pages.
+
+The minimum value that can be written to threads-max is 20.
+
+The maximum value that can be written to threads-max is given by the
+constant FUTEX_TID_MASK (0x3fffffff).
+
+If a value outside of this range is written to threads-max, the write
+fails with EINVAL.
+
+The value written is checked against the available RAM pages. If the
+thread structures would occupy too much (more than 1/8th) of the
+available RAM pages, threads-max is reduced accordingly.
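+
+For illustration only, and assuming a root shell, the lower bound described
+above can be observed as follows. The exact error text printed depends on
+the shell in use::
+
+  cat /proc/sys/kernel/threads-max
+  echo 10 > /proc/sys/kernel/threads-max    # below the minimum of 20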
+
+
+unknown_nmi_panic:
+==================
+
+The value in this file affects how NMIs are handled. When the
+value is non-zero, an unknown NMI is trapped and the kernel panics. At
+that time, kernel debugging information is displayed on the console.
+
+For example, the NMI switch found on most IA32 servers raises an
+unknown NMI. If a system hangs, try pressing the NMI switch.
+
+
+watchdog:
+=========
+
+This parameter can be used to disable or enable the soft lockup detector
+_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time.
+
+ 0 - disable both lockup detectors
+
+ 1 - enable both lockup detectors
+
+The soft lockup detector and the NMI watchdog can also be disabled or
+enabled individually, using the soft_watchdog and nmi_watchdog parameters.
+If the watchdog parameter is read, for example by executing::
+
+ cat /proc/sys/kernel/watchdog
+
+the output of this command (0 or 1) shows the logical OR of soft_watchdog
+and nmi_watchdog.
+
+
+watchdog_cpumask:
+=================
+
+This value can be used to control on which cpus the watchdog may run.
+The default cpumask is all possible cores, but if NO_HZ_FULL is
+enabled in the kernel config, and cores are specified with the
+nohz_full= boot argument, those cores are excluded by default.
+Offline cores can be included in this mask, and if the core is later
+brought online, the watchdog will be started based on the mask value.
+
+Typically this value would only be touched in the nohz_full case
+to re-enable cores that by default were not running the watchdog,
+if a kernel lockup was suspected on those cores.
+
+The argument value is the standard cpulist format for cpumasks,
+so for example to enable the watchdog on cores 0, 2, 3, and 4 you
+might say::
+
+ echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask
+
+
+watchdog_thresh:
+================
+
+This value can be used to control the frequency of hrtimer and NMI
+events and the soft and hard lockup thresholds. The default threshold
+is 10 seconds.
+
+The softlockup threshold is (2 * watchdog_thresh). Setting this
+tunable to zero will disable lockup detection altogether.
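+
+As a worked example of the relationship above (illustrative, assuming a
+root shell), raising the threshold to 20 seconds gives a soft lockup
+threshold of 2 * 20 = 40 seconds::
+
+  echo 20 > /proc/sys/kernel/watchdog_thresh
+  cat /proc/sys/kernel/watchdog_thresh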
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
new file mode 100644
index 0000000..287b987
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -0,0 +1,434 @@
+================================
+Documentation for /proc/sys/net/
+================================
+
+Copyright
+
+Copyright (c) 1999
+
+ - Terrehon Bowden <terrehon@pacbell.net>
+ - Bodo Bauer <bb@ricochet.net>
+
+Copyright (c) 2000
+
+ - Jorge Nerin <comandante@zaralinux.com>
+
+Copyright (c) 2009
+
+ - Shen Feng <shen@cn.fujitsu.com>
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------
+
+This file contains the documentation for the sysctl files in
+/proc/sys/net
+
+The interface to the networking parts of the kernel is located in
+/proc/sys/net. The following table shows all possible subdirectories. You may
+see only some of them, depending on your kernel's configuration.
+
+
+Table : Subdirectories in /proc/sys/net
+
+ ========= =================== = ========== ==================
+ Directory Content               Directory  Content
+ ========= =================== = ========== ==================
+ core      General parameter     appletalk  Appletalk protocol
+ unix      Unix domain sockets   netrom     NET/ROM
+ 802       E802 protocol         ax25       AX25
+ ethernet  Ethernet protocol     rose       X.25 PLP layer
+ ipv4      IP version 4          x25        X.25 protocol
+ bridge    Bridging              decnet     DEC net
+ ipv6      IP version 6          tipc       TIPC
+ ========= =================== = ========== ==================
+
+1. /proc/sys/net/core - Network core options
+============================================
+
+bpf_jit_enable
+--------------
+
+This enables the BPF Just in Time (JIT) compiler. BPF is a flexible
+and efficient infrastructure that allows bytecode to be executed at various
+hook points. It is used in a number of Linux kernel subsystems such
+as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints)
+and security (e.g. seccomp). LLVM has a BPF back end that can compile
+restricted C into a sequence of BPF instructions. After program load
+through bpf(2) and passing a verifier in the kernel, a JIT will then
+translate these BPF proglets into native CPU instructions. There are
+two flavors of JITs. The newer eBPF JIT is currently supported on:
+
+ - x86_64
+ - x86_32
+ - arm64
+ - arm32
+ - ppc64
+ - sparc64
+ - mips64
+ - s390x
+ - riscv
+
+The older cBPF JIT is supported on the following archs:
+
+ - mips
+ - ppc
+ - sparc
+
+eBPF JITs are a superset of cBPF JITs, meaning the kernel will
+migrate cBPF instructions into eBPF instructions and then JIT
+compile them transparently. Older cBPF JITs can only translate
+tcpdump filters, seccomp rules, etc., but not the eBPF programs
+loaded through bpf(2) mentioned above.
+
+Values:
+
+ - 0 - disable the JIT (default value)
+ - 1 - enable the JIT
+ - 2 - enable the JIT and ask the compiler to emit traces to the kernel log.
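+
+For example, assuming a root shell and an architecture that has a JIT,
+the JIT could be enabled with either of the following equivalent commands
+(a sketch, not a recommendation)::
+
+  echo 1 > /proc/sys/net/core/bpf_jit_enable
+  sysctl -w net.core.bpf_jit_enable=1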
+
+bpf_jit_harden
+--------------
+
+This enables hardening for the BPF JIT compiler. Supported are eBPF
+JIT backends. Enabling hardening trades off performance, but can
+mitigate JIT spraying.
+
+Values:
+
+ - 0 - disable JIT hardening (default value)
+ - 1 - enable JIT hardening for unprivileged users only
+ - 2 - enable JIT hardening for all users
+
+bpf_jit_kallsyms
+----------------
+
+When the BPF JIT compiler is enabled, compiled images are unknown
+addresses to the kernel, meaning they neither show up in traces nor
+in /proc/kallsyms. This enables export of these addresses, which can
+be used for debugging/tracing. If bpf_jit_harden is enabled, this
+feature is disabled.
+
+Values:
+
+ - 0 - disable JIT kallsyms export (default value)
+ - 1 - enable JIT kallsyms export for privileged users only
+
+bpf_jit_limit
+-------------
+
+This enforces a global limit for memory allocations to the BPF JIT
+compiler in order to reject unprivileged JIT requests once it has
+been surpassed. bpf_jit_limit contains the value of the global limit
+in bytes.
+
+dev_weight
+----------
+
+The maximum number of packets that the kernel can handle on a NAPI interrupt.
+It is a per-CPU variable. For drivers that support LRO or GRO_HW, a hardware
+aggregated packet is counted as one packet in this context.
+
+Default: 64
+
+dev_weight_rx_bias
+------------------
+
+RPS (e.g. RFS, aRFS) processing competes with the registered NAPI poll function
+of the driver for the per softirq cycle netdev_budget. This parameter influences
+the proportion of the configured netdev_budget that is spent on RPS based packet
+processing during RX softirq cycles. It is further meant for making current
+dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack
+(see dev_weight_tx_bias). It is effective on a per CPU basis. Determination is based
+on dev_weight and is calculated multiplicatively (dev_weight * dev_weight_rx_bias).
+
+Default: 1
+
+dev_weight_tx_bias
+------------------
+
+Scales the maximum number of packets that can be processed during a TX softirq cycle.
+Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric
+net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog.
+
+Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias).
+
+Default: 1
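+
+As a worked example with illustrative values: with the default dev_weight
+of 64, setting dev_weight_tx_bias to 2 gives 64 * 2 = 128 packets per TX
+softirq cycle on each CPU. Assuming a root shell, this could be set with::
+
+  sysctl -w net.core.dev_weight_tx_bias=2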
+
+default_qdisc
+-------------
+
+The default queuing discipline to use for network devices. This allows
+overriding the default of pfifo_fast with an alternative. Since the default
+queuing discipline is created without additional parameters, it is best suited
+to queuing disciplines that work well without configuration, like stochastic
+fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use
+queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin
+which require setting up classes and bandwidths. Note that physical multiqueue
+interfaces still use mq as root qdisc, which in turn uses this default for its
+leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead
+default to noqueue.
+
+Default: pfifo_fast
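+
+A minimal sketch, assuming a root shell and that the fq_codel qdisc is
+available in the running kernel. The second command reads the value back::
+
+  sysctl -w net.core.default_qdisc=fq_codel
+  sysctl net.core.default_qdisc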
+
+busy_read
+---------
+
+Low latency busy poll timeout for socket reads (needs CONFIG_NET_RX_BUSY_POLL).
+Approximate time in microseconds to busy loop waiting for packets on the device queue.
+This sets the default value of the SO_BUSY_POLL socket option.
+Can be set or overridden per socket by setting socket option SO_BUSY_POLL,
+which is the preferred method of enabling. If you need to enable the feature
+globally via sysctl, a value of 50 is recommended.
+
+Will increase power usage.
+
+Default: 0 (off)
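+
+If the feature does need to be enabled globally, the value recommended
+above could be applied as follows (a sketch, assuming a root shell and a
+kernel built with CONFIG_NET_RX_BUSY_POLL)::
+
+  sysctl -w net.core.busy_read=50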
+
+busy_poll
+---------
+Low latency busy poll timeout for poll and select (needs CONFIG_NET_RX_BUSY_POLL).
+Approximate time in microseconds to busy loop waiting for events.
+The recommended value depends on the number of sockets you poll on:
+50 for several sockets, 100 for several hundred.
+For more than that you probably want to use epoll.
+Note that only sockets with SO_BUSY_POLL set will be busy polled,
+so you want to either selectively set SO_BUSY_POLL on those sockets or set
+the net.core.busy_read sysctl globally.
+
+Will increase power usage.
+
+Default: 0 (off)
+
+rmem_default
+------------
+
+The default setting of the socket receive buffer in bytes.
+
+rmem_max
+--------
+
+The maximum receive socket buffer size in bytes.
+
+tstamp_allow_data
+-----------------
+Allow processes to receive tx timestamps looped together with the original
+packet contents. If disabled, transmit timestamp requests from unprivileged
+processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set.
+
+Default: 1 (on)
+
+
+wmem_default
+------------
+
+The default setting (in bytes) of the socket send buffer.
+
+wmem_max
+--------
+
+The maximum send socket buffer size in bytes.
+
+message_burst and message_cost
+------------------------------
+
+These parameters are used to limit the warning messages written to the kernel
+log from the networking code. They enforce a rate limit to make a
+denial-of-service attack impossible. A higher message_cost factor results in
+fewer messages being written. message_burst controls when messages will
+be dropped. The default settings limit warning messages to one every five
+seconds.
+
+warnings
+--------
+
+This sysctl is now unused.
+
+This was used to control console messages from the networking stack that
+occur because of problems on the network like duplicate address or bad
+checksums.
+
+These messages are now emitted at KERN_DEBUG and can generally be enabled
+and controlled by the dynamic_debug facility.
+
+netdev_budget
+-------------
+
+Maximum number of packets taken from all interfaces in one polling cycle (NAPI
+poll). In one polling cycle interfaces which are registered to polling are
+probed in a round-robin manner. Also, a polling cycle may not exceed
+netdev_budget_usecs microseconds, even if netdev_budget has not been
+exhausted.
+
+netdev_budget_usecs
+---------------------
+
+Maximum number of microseconds in one NAPI polling cycle. Polling
+will exit when either netdev_budget_usecs have elapsed during the
+poll cycle or the number of packets processed reaches netdev_budget.
+
+netdev_max_backlog
+------------------
+
+Maximum number of packets queued on the INPUT side when the interface
+receives packets faster than the kernel can process them.
+
+netdev_rss_key
+--------------
+
+RSS (Receive Side Scaling) enabled drivers use a 40-byte host key that is
+randomly generated.
+Some user space programs might need to read its content even if drivers do not
+provide ethtool -x support yet.
+
+::
+
+ myhost:~# cat /proc/sys/net/core/netdev_rss_key
+ 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total)
+
+The file contains nul bytes if no driver has ever called the netdev_rss_key_fill() function.
+
+Note:
+ /proc/sys/net/core/netdev_rss_key contains 52 bytes of key,
+ but most drivers only use 40 bytes of it.
+
+::
+
+ myhost:~# ethtool -x eth0
+ RX flow hash indirection table for eth0 with 8 RX ring(s):
+ 0: 0 1 2 3 4 5 6 7
+ RSS hash key:
+ 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89
+
+netdev_tstamp_prequeue
+----------------------
+
+If set to 0, RX packet timestamps can be sampled after RPS processing, when
+the target CPU processes packets. This might delay timestamps somewhat, but
+allows the load to be distributed across several CPUs.
+
+If set to 1 (default), timestamps are sampled as soon as possible, before
+queueing.
+
+optmem_max
+----------
+
+Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence
+of struct cmsghdr structures with appended data.
+
+fb_tunnels_only_for_init_net
+----------------------------
+
+Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0,
+sit0, ip6tnl0, ip6gre0) are automatically created when a new
+network namespace is created, if the corresponding tunnel is present
+in the initial network namespace.
+If set to 1, these devices are not automatically created, and
+user space is responsible for creating them if needed.
+
+Default : 0 (for compatibility reasons)
+
+devconf_inherit_init_net
+------------------------
+
+Controls if a new network namespace should inherit all current
+settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
+default, we keep the current behavior: for IPv4 we inherit all current
+settings from init_net and for IPv6 we reset all settings to default.
+
+If set to 1, both IPv4 and IPv6 settings are forced to inherit from
+current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
+forced to reset to their default values.
+
+Default : 0 (for compatibility reasons)
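+
+As a sketch, assuming a root shell in the initial network namespace, the
+following makes network namespaces created afterwards inherit both the
+IPv4 and IPv6 settings currently in use in init_net::
+
+  sysctl -w net.core.devconf_inherit_init_net=1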
+
+2. /proc/sys/net/unix - Parameters for Unix domain sockets
+==========================================================
+
+There is only one file in this directory.
+unix_dgram_qlen limits the max number of datagrams queued in a Unix domain
+socket's buffer. It will not take effect unless the PF_UNIX flag is specified.
+
+
+3. /proc/sys/net/ipv4 - IPv4 settings
+=====================================
+Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for
+descriptions of these entries.
+
+
+4. Appletalk
+============
+
+The /proc/sys/net/appletalk directory holds the Appletalk configuration data
+when Appletalk is loaded. The configurable parameters are:
+
+aarp-expiry-time
+----------------
+
+The amount of time we keep an ARP entry before expiring it. Used to age out
+old hosts.
+
+aarp-resolve-time
+-----------------
+
+The amount of time we will spend trying to resolve an Appletalk address.
+
+aarp-retransmit-limit
+---------------------
+
+The number of times we will retransmit a query before giving up.
+
+aarp-tick-time
+--------------
+
+Controls the rate at which expiries are checked.
+
+The directory /proc/net/appletalk holds the list of active Appletalk sockets
+on a machine.
+
+The fields indicate the DDP type, the local address (in network:node format),
+the remote address, the size of the transmit pending queue, the size of the
+received queue (bytes waiting for applications to read), the state and the uid
+owning the socket.
+
+/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It
+shows the name of the interface, its Appletalk address, the network range on
+that address (or network number for phase 1 networks), and the status of the
+interface.
+
+/proc/net/atalk_route lists each known network route. It lists the target
+(network) that the route leads to, the router (may be directly connected), the
+route flags, and the device the route is using.
+
+5. TIPC
+=======
+
+tipc_rmem
+---------
+
+The TIPC protocol now has a tunable for the receive memory, similar to
+tcp_rmem, i.e. a vector of 3 INTEGERs: (min, default, max)
+
+::
+
+ # cat /proc/sys/net/tipc/tipc_rmem
+ 4252725 34021800 68043600
+ #
+
+The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
+are scaled (shifted) versions of that same value. Note that the min value
+is not at this point in time used in any meaningful way, but the triplet is
+preserved in order to be consistent with things like tcp_rmem.
+
+named_timeout
+-------------
+
+TIPC name table updates are distributed asynchronously in a cluster, without
+any form of transaction handling. This means that different race scenarios are
+possible. One such is that a name withdrawal sent out by one node and received
+by another node may arrive after a second, overlapping name publication already
+has been accepted from a third node, although the conflicting updates
+originally may have been issued in the correct sequential order.
+If named_timeout is nonzero, failed topology updates will be placed on a defer
+queue until another event arrives that clears the error, or until the timeout
+expires. Value is in milliseconds.
diff --git a/Documentation/admin-guide/sysctl/sunrpc.rst b/Documentation/admin-guide/sysctl/sunrpc.rst
new file mode 100644
index 0000000..09780a6
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/sunrpc.rst
@@ -0,0 +1,25 @@
+===================================
+Documentation for /proc/sys/sunrpc/
+===================================
+
+kernel version 2.2.10
+
+Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------
+
+This file contains the documentation for the sysctl files in
+/proc/sys/sunrpc and is valid for Linux kernel version 2.2.
+
+The files in this directory can be used to (re)set the debug
+flags of the SUN Remote Procedure Call (RPC) subsystem in
+the Linux kernel. This stuff is used for NFS, KNFSD and
+maybe a few other things as well.
+
+The files in there are used to control the debugging flags:
+rpc_debug, nfs_debug, nfsd_debug and nlm_debug.
+
+These flags are for kernel hackers only. You should read the
+source code in net/sunrpc/ for more information.
diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst
new file mode 100644
index 0000000..650eaa0
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/user.rst
@@ -0,0 +1,78 @@
+=================================
+Documentation for /proc/sys/user/
+=================================
+
+kernel version 4.9.0
+
+Copyright (c) 2016 Eric Biederman <ebiederm@xmission.com>
+
+------------------------------------------------------------------------------
+
+This file contains the documentation for the sysctl files in
+/proc/sys/user.
+
+The files in this directory can be used to override the default
+limits on the number of namespaces and other objects that have
+per-user, per-user-namespace limits.
+
+The primary purpose of these limits is to stop programs that
+malfunction and attempt to create a ridiculous number of objects,
+before the malfunction becomes a system wide problem. It is the
+intention that the defaults of these limits are set high enough that
+no program in normal operation should run into these limits.
+
+The creation of per-user, per-user-namespace objects is charged to
+the user in the user namespace who created the object and
+verified to be below the per user limit in that user namespace.
+
+The creation of objects is also charged to all of the users
+who created the user namespaces in which the object is created
+(user namespaces can be nested) and verified to be below the per user
+limits in the user namespaces of those users.
+
+This recursive counting of created objects ensures that creating a
+user namespace does not allow a user to escape their current limits.
+
+Currently, these files are in /proc/sys/user:
+
+max_cgroup_namespaces
+=====================
+
+ The maximum number of cgroup namespaces that any user in the current
+ user namespace may create.
+
+max_ipc_namespaces
+==================
+
+ The maximum number of ipc namespaces that any user in the current
+ user namespace may create.
+
+max_mnt_namespaces
+==================
+
+ The maximum number of mount namespaces that any user in the current
+ user namespace may create.
+
+max_net_namespaces
+==================
+
+ The maximum number of network namespaces that any user in the
+ current user namespace may create.
+
+max_pid_namespaces
+==================
+
+ The maximum number of pid namespaces that any user in the current
+ user namespace may create.
+
+max_user_namespaces
+===================
+
+ The maximum number of user namespaces that any user in the current
+ user namespace may create.
+
+max_uts_namespaces
+==================
+
+ The maximum number of uts namespaces that any user in the current
+ user namespace may create.
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
new file mode 100644
index 0000000..64aeee1
--- /dev/null
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -0,0 +1,964 @@
+===============================
+Documentation for /proc/sys/vm/
+===============================
+
+kernel version 2.6.29
+
+Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+
+Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>
+
+For general info and legal blurb, please look in index.rst.
+
+------------------------------------------------------------------------------
+
+This file contains the documentation for the sysctl files in
+/proc/sys/vm and is valid for Linux kernel version 2.6.29.
+
+The files in this directory can be used to tune the operation
+of the virtual memory (VM) subsystem of the Linux kernel and
+the writeout of dirty data to disk.
+
+Default values and initialization routines for most of these
+files can be found in mm/swap.c.
+
+Currently, these files are in /proc/sys/vm:
+
+- admin_reserve_kbytes
+- block_dump
+- compact_memory
+- compact_unevictable_allowed
+- dirty_background_bytes
+- dirty_background_ratio
+- dirty_bytes
+- dirty_expire_centisecs
+- dirty_ratio
+- dirtytime_expire_seconds
+- dirty_writeback_centisecs
+- drop_caches
+- extfrag_threshold
+- hugetlb_shm_group
+- laptop_mode
+- legacy_va_layout
+- lowmem_reserve_ratio
+- max_map_count
+- memory_failure_early_kill
+- memory_failure_recovery
+- min_free_kbytes
+- min_slab_ratio
+- min_unmapped_ratio
+- mmap_min_addr
+- mmap_rnd_bits
+- mmap_rnd_compat_bits
+- nr_hugepages
+- nr_hugepages_mempolicy
+- nr_overcommit_hugepages
+- nr_trim_pages (only if CONFIG_MMU=n)
+- numa_zonelist_order
+- oom_dump_tasks
+- oom_kill_allocating_task
+- overcommit_kbytes
+- overcommit_memory
+- overcommit_ratio
+- page-cluster
+- panic_on_oom
+- percpu_pagelist_fraction
+- stat_interval
+- stat_refresh
+- numa_stat
+- swappiness
+- unprivileged_userfaultfd
+- user_reserve_kbytes
+- vfs_cache_pressure
+- watermark_boost_factor
+- watermark_scale_factor
+- zone_reclaim_mode
+
+
+admin_reserve_kbytes
+====================
+
+The amount of free memory in the system that should be reserved for users
+with the capability cap_sys_admin.
+
+admin_reserve_kbytes defaults to min(3% of free pages, 8MB)
+
+That should provide enough for the admin to log in and kill a process,
+if necessary, under the default overcommit 'guess' mode.
+
+Systems running under overcommit 'never' should increase this to account
+for the full Virtual Memory Size of programs used to recover. Otherwise,
+root may not be able to log in to recover the system.
+
+How do you calculate a minimum useful reserve?
+
+sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
+
+For overcommit 'guess', we can sum resident set sizes (RSS).
+On x86_64 this is about 8MB.
+
+For overcommit 'never', we can take the max of their virtual sizes (VSZ)
+and add the sum of their RSS.
+On x86_64 this is about 128MB.
+
+Changing this takes effect whenever an application requests memory.
+
+
+block_dump
+==========
+
+block_dump enables block I/O debugging when set to a nonzero value. More
+information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
+
+
+compact_memory
+==============
+
+Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
+all zones are compacted such that free memory is available in contiguous
+blocks where possible. This can be important, for example, in the allocation of
+huge pages, although processes will also directly compact memory as required.
+
+
+compact_unevictable_allowed
+===========================
+
+Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
+allowed to examine the unevictable lru (mlocked pages) for pages to compact.
+This should be used on systems where stalls for minor page faults are an
+acceptable trade for large contiguous free memory. Set to 0 to prevent
+compaction from moving pages that are unevictable. Default value is 1.
+
+
+dirty_background_bytes
+======================
+
+Contains the amount of dirty memory at which the background kernel
+flusher threads will start writeback.
+
+Note:
+ dirty_background_bytes is the counterpart of dirty_background_ratio. Only
+ one of them may be specified at a time. When one sysctl is written it is
+ immediately taken into account to evaluate the dirty memory limits and the
+ other appears as 0 when read.
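+
+The behavior described in the note can be observed with a sketch like the
+following, assuming a root shell. The value 16777216 (16 MiB) is an
+arbitrary example, not a recommendation::
+
+  echo 16777216 > /proc/sys/vm/dirty_background_bytes
+  cat /proc/sys/vm/dirty_background_ratio    # the counterpart now reads 0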
+
+
+dirty_background_ratio
+======================
+
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which the background kernel
+flusher threads will start writing out dirty data.
+
+The total available memory is not equal to total system memory.
+
+
+dirty_bytes
+===========
+
+Contains the amount of dirty memory at which a process generating disk writes
+will itself start writeback.
+
+Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
+specified at a time. When one sysctl is written it is immediately taken into
+account to evaluate the dirty memory limits and the other appears as 0 when
+read.
+
+Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
+value lower than this limit will be ignored and the old configuration will be
+retained.
+
+
+dirty_expire_centisecs
+======================
+
+This tunable is used to define when dirty data is old enough to be eligible
+for writeout by the kernel flusher threads. It is expressed in hundredths
+of a second. Data which has been dirty in-memory for longer than this
+interval will be written out next time a flusher thread wakes up.
+
+
+dirty_ratio
+===========
+
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which a process which is
+generating disk writes will itself start writing out dirty data.
+
+The total available memory is not equal to total system memory.
+
+
+dirtytime_expire_seconds
+========================
+
+When a lazytime inode is constantly having its pages dirtied, the inode with
+an updated timestamp will never get a chance to be written out. And, if the
+only thing that has happened on the file system is a dirtytime inode caused
+by an atime update, a worker will be scheduled to make sure that inode
+eventually gets pushed out to disk. This tunable is used to define when a dirty
+inode is old enough to be eligible for writeback by the kernel flusher threads.
+It is also used as the wakeup interval of the dirtytime_writeback thread.
+
+
+dirty_writeback_centisecs
+=========================
+
+The kernel flusher threads will periodically wake up and write `old` data
+out to disk. This tunable expresses the interval between those wakeups, in
+hundredths of a second.
+
+Setting this to zero disables periodic writeback altogether.
+
+
+drop_caches
+===========
+
+Writing to this will cause the kernel to drop clean caches, as well as
+reclaimable slab objects like dentries and inodes. Once dropped, their
+memory becomes free.
+
+To free pagecache::
+
+ echo 1 > /proc/sys/vm/drop_caches
+
+To free reclaimable slab objects (includes dentries and inodes)::
+
+ echo 2 > /proc/sys/vm/drop_caches
+
+To free slab objects and pagecache::
+
+ echo 3 > /proc/sys/vm/drop_caches
+
+This is a non-destructive operation and will not free any dirty objects.
+To increase the number of objects freed by this operation, the user may run
+`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
+number of dirty objects on the system and create more candidates to be
+dropped.
+
+This file is not a means to control the growth of the various kernel caches
+(inodes, dentries, pagecache, etc.). These objects are automatically
+reclaimed by the kernel when memory is needed elsewhere on the system.
+
+Use of this file can cause performance problems. Since it discards cached
+objects, it may cost a significant amount of I/O and CPU to recreate the
+dropped objects, especially if they were under heavy use. Because of this,
+use outside of a testing or debugging environment is not recommended.
+
+You may see informational messages in your kernel log when this file is
+used::
+
+ cat (1234): drop_caches: 3
+
+These are informational only. They do not mean that anything is wrong
+with your system. To disable them, echo 4 (bit 2) into drop_caches.
+
+
+extfrag_threshold
+=================
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
+debugfs shows what the fragmentation index for each order is in each zone in
+the system. Values tending towards 0 imply allocations would fail due to lack
+of memory, values towards 1000 imply failures are due to fragmentation and -1
+implies that the allocation will succeed as long as watermarks are met.
+
+The kernel will not compact memory in a zone if the
+fragmentation index is <= extfrag_threshold. The default value is 500.
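+
+To compare the two values, a sketch such as the following could be used.
+It assumes a root shell and that debugfs is mounted at /sys/kernel/debug,
+which is conventional but not guaranteed::
+
+  cat /sys/kernel/debug/extfrag/extfrag_index
+  cat /proc/sys/vm/extfrag_threshold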
+
+
+highmem_is_dirtyable
+====================
+
+Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).
+
+This parameter controls whether the high memory is considered for dirty
+writers throttling. This is not the case by default, which means that
+only the amount of memory directly visible/usable by the kernel can
+be dirtied. As a result, on systems with a large amount of memory and
+lowmem basically depleted, writers might be throttled too early and
+streaming writes can get very slow.
+
+Changing the value to non-zero allows more memory to be dirtied
+and thus allows writers to write more data which can be flushed to the
+storage more effectively. Note this also comes with a risk of premature
+OOM killer invocation because some writers (e.g. direct block device writes)
+can only use the low memory and they can fill it up with dirty data without
+any throttling.
+
+
+hugetlb_shm_group
+=================
+
+hugetlb_shm_group contains the group ID that is allowed to create SysV
+shared memory segments using hugetlb pages.
+
+
+laptop_mode
+===========
+
+laptop_mode is a knob that controls "laptop mode". All the things that are
+controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.
+
+
+legacy_va_layout
+================
+
+If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
+will use the legacy (2.4) layout for all processes.
+
+
+lowmem_reserve_ratio
+====================
+
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone. This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
+
+And on large highmem machines this lack of reclaimable lowmem memory
+can be fatal.
+
+So the Linux page allocator has a mechanism which prevents allocations
+which *could* use highmem from using too much lowmem. This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
+
+(The same argument applies to the old 16 megabyte ISA DMA region. This
+mechanism will also defend that region from allocations which could use
+highmem or lowmem).
+
+The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
+in defending these lower zones.
+
+If you have a machine which uses highmem or ISA DMA and your
+applications are using mlock(), or if you are running with no swap then
+you probably should change the lowmem_reserve_ratio setting.
+
+The lowmem_reserve_ratio is an array. You can see its values by reading this file::
+
+ % cat /proc/sys/vm/lowmem_reserve_ratio
+ 256 256 32
+
+However, these values are not used directly. The kernel calculates the number
+of protection pages for each zone from them. These are shown as an array of
+protection pages in /proc/zoneinfo, as in the following example from an x86-64
+machine. Each zone has an array of protection pages like this::
+
+ Node 0, zone DMA
+ pages free 1355
+ min 3
+ low 3
+ high 4
+ :
+ :
+ numa_other 0
+ protection: (0, 2004, 2004, 2004)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ pagesets
+ cpu: 0 pcp: 0
+ :
+
+These protections are added to the watermark when judging whether this zone
+should be used for page allocation or should be reclaimed.
+
+In this example, if normal pages (index=2) are requested from this DMA zone and
+watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this zone
+should not be used because pages_free (1355) is smaller than watermark +
+protection[2] (4 + 2004 = 2008). If this protection value were 0, this zone
+would be used for normal page requests. If the request is for the DMA zone
+(index=0), protection[0] (=0) is used.
+
+zone[i]'s protection[j] is calculated by the following expression::
+
+ (i < j):
+ zone[i]->protection[j]
+ = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
+ / lowmem_reserve_ratio[i];
+ (i = j):
+ (should not be protected) = 0;
+ (i > j):
+ (not necessary, but looks 0)
+
+The default values of lowmem_reserve_ratio[i] are
+
+ === ====================================
+ 256 (if zone[i] means DMA or DMA32 zone)
+ 32 (others)
+ === ====================================
+
+As the expression above shows, these values are the reciprocals of the ratios.
+256 means 1/256. The number of protection pages becomes about 0.39% of the
+total managed pages of the higher zones on the node.
+
+If you would like to protect more pages, use smaller values.
+The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
+disables protection of the pages.
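+
+As a worked example of the expression above, the following figures reproduce
+the protection value of 2004 shown in the zoneinfo excerpt. The managed page
+count is a hypothetical number chosen for illustration, not a measured one::
+
+    managed pages in the higher zones on the node = 513024
+    lowmem_reserve_ratio[i] for the DMA zone      = 256
+    protection = 513024 / 256                     = 2004 pages
+
+With the same page count, lowering the ratio to 128 would double the
+protection to 513024 / 128 = 4008 pages.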
+
+
+max_map_count
+=============
+
+This file contains the maximum number of memory map areas a process
+may have. Memory map areas are used as a side-effect of calling
+malloc, directly by mmap, mprotect, and madvise, and also when loading
+shared libraries.
+
+While most applications need less than a thousand maps, certain
+programs, particularly malloc debuggers, may consume lots of them,
+e.g., up to one or two maps per allocation.
+
+The default value is 65536.
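+
+To see how close a process is to the limit, count the lines in its maps file,
+since each line describes one memory map area. This is a minimal sketch in
+which ``$$`` (the PID of the current shell) stands in for the process of
+interest::
+
+    # current limit
+    cat /proc/sys/vm/max_map_count
+    # number of map areas in use by the current shell
+    wc -l < /proc/$$/maps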
+
+
+memory_failure_early_kill
+=========================
+
+Control how to kill processes when an uncorrected memory error (typically
+a 2-bit error in a memory module) that cannot be handled by the kernel is
+detected in the background by hardware. In some cases (like the page
+still having a valid copy on disk) the kernel will handle the failure
+transparently without affecting any applications. But if there is
+no other up-to-date copy of the data, it will kill processes to prevent any
+data corruption from propagating.
+
+1: Kill all processes that have the corrupted and not reloadable page mapped
+as soon as the corruption is detected. Note this is not supported
+for a few types of pages, like kernel internally allocated data or
+the swap cache, but works for the majority of user pages.
+
+0: Only unmap the corrupted page from all processes and only kill a process
+that tries to access it.
+
+The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
+handle this if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
+Applications can override this setting individually with the PR_MCE_KILL prctl.
+
+
+memory_failure_recovery
+=======================
+
+Enable memory failure recovery (when supported by the platform).
+
+1: Attempt recovery.
+
+0: Always panic on a memory failure.
+
+
+min_free_kbytes
+===============
+
+This is used to force the Linux VM to keep a minimum number
+of kilobytes free. The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.
+
+Some minimal amount of memory is needed to satisfy PF_MEMALLOC
+allocations; if you set this to lower than 1024KB, your system will
+become subtly broken, and prone to deadlock under high loads.
+
+Setting this too high will OOM your machine instantly.
+
+
+min_slab_ratio
+==============
+
+This is available only on NUMA kernels.
+
+A percentage of the total pages in each zone. On zone reclaim
+(when fallback from the local zone occurs), slabs will be reclaimed if more
+than this percentage of pages in a zone are reclaimable slab pages.
+This ensures that slab growth stays under control even in NUMA
+systems that rarely perform global reclaim.
+
+The default is 5 percent.
+
+Note that slab reclaim is triggered in a per zone / node fashion.
+The process of reclaiming slab memory is currently not node specific
+and may not be fast.
+
+
+min_unmapped_ratio
+==================
+
+This is available only on NUMA kernels.
+
+This is a percentage of the total pages in each zone. Zone reclaim will
+only occur if more than this percentage of pages are in a state that
+zone_reclaim_mode allows to be reclaimed.
+
+If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
+against all file-backed unmapped pages including swapcache pages and tmpfs
+files. Otherwise, only unmapped pages backed by normal files but not tmpfs
+files and similar are considered.
+
+The default is 1 percent.
+
+
+mmap_min_addr
+=============
+
+This file indicates the amount of address space which a user process will
+be restricted from mmapping. Since kernel null dereference bugs could
+accidentally operate based on the information in the first couple of pages
+of memory, userspace processes should not be allowed to write to them. By
+default this value is set to 0 and no protections will be enforced by the
+security module. Setting this value to something like 64k will allow the
+vast majority of applications to work correctly and provide defense in depth
+against future potential kernel bugs.
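+
+A minimal sketch of applying the 64k suggestion above, to be run as root. It
+assumes the value is given in bytes, so that 64k is written as 65536::
+
+    echo 65536 > /proc/sys/vm/mmap_min_addr
+    # confirm the value now in effect
+    cat /proc/sys/vm/mmap_min_addr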
+
+
+mmap_rnd_bits
+=============
+
+This value can be used to select the number of bits to use to
+determine the random offset to the base address of vma regions
+resulting from mmap allocations on architectures which support
+tuning address space randomization. This value will be bounded
+by the architecture's minimum and maximum supported values.
+
+This value can be changed after boot using the
+/proc/sys/vm/mmap_rnd_bits tunable.
+
+
+mmap_rnd_compat_bits
+====================
+
+This value can be used to select the number of bits to use to
+determine the random offset to the base address of vma regions
+resulting from mmap allocations for applications run in
+compatibility mode on architectures which support tuning address
+space randomization. This value will be bounded by the
+architecture's minimum and maximum supported values.
+
+This value can be changed after boot using the
+/proc/sys/vm/mmap_rnd_compat_bits tunable.
+
+
+nr_hugepages
+============
+
+Change the minimum size of the hugepage pool.
+
+See Documentation/admin-guide/mm/hugetlbpage.rst
+
+
+nr_hugepages_mempolicy
+======================
+
+Change the size of the hugepage pool at run-time on a specific
+set of NUMA nodes.
+
+See Documentation/admin-guide/mm/hugetlbpage.rst
+
+
+nr_overcommit_hugepages
+=======================
+
+Change the maximum size of the hugepage pool. The maximum is
+nr_hugepages + nr_overcommit_hugepages.
+
+See Documentation/admin-guide/mm/hugetlbpage.rst
+
+
+nr_trim_pages
+=============
+
+This is available only on NOMMU kernels.
+
+This value adjusts the excess page trimming behaviour of power-of-2 aligned
+NOMMU mmap allocations.
+
+A value of 0 disables trimming of allocations entirely, while a value of 1
+trims excess pages aggressively. Any value >= 1 acts as the watermark where
+trimming of allocations is initiated.
+
+The default value is 1.
+
+See Documentation/nommu-mmap.txt for more information.
+
+
+numa_zonelist_order
+===================
+
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
+'where the memory is allocated from' is controlled by zonelists.
+
+(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simplicity.
+You may be able to read ZONE_DMA as ZONE_DMA32...)
+
+In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows:
+ZONE_NORMAL -> ZONE_DMA.
+This means that a memory allocation request for GFP_KERNEL will
+get memory from ZONE_DMA only when ZONE_NORMAL is not available.
+
+In the NUMA case, you can think of the following 2 types of order.
+Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::
+
+ (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
+ (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
+
+Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
+will be used before ZONE_NORMAL exhaustion. This increases the possibility of
+out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.
+
+Type(B) cannot offer the best locality but is more robust against OOM of
+the DMA zone.
+
+Type(A) is called "Node" order. Type(B) is called "Zone" order.
+
+"Node order" orders the zonelists by node, then by zone within each node.
+Specify "[Nn]ode" for node order.
+
+"Zone Order" orders the zonelists by zone type, then by node within each
+zone. Specify "[Zz]one" for zone order.
+
+Specify "[Dd]efault" to request automatic configuration.
+
+On 32-bit, the Normal zone needs to be preserved for allocations accessible
+by the kernel, so "zone" order will be selected.
+
+On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
+order will be selected.
+
+Default order is recommended unless this is causing problems for your
+system/application.
+
+
+oom_dump_tasks
+==============
+
+Enables a system-wide task dump (excluding kernel threads) to be produced
+when the kernel performs an OOM kill and includes such information as
+pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
+score, and name. This is helpful to determine why the OOM killer was
+invoked, to identify the rogue task that caused it, and to determine why
+the OOM killer chose the task it did to kill.
+
+If this is set to zero, this information is suppressed. On very
+large systems with thousands of tasks it may not be feasible to dump
+the memory state information for each one. Such systems should not
+be forced to incur a performance penalty in OOM conditions when the
+information may not be desired.
+
+If this is set to non-zero, this information is shown whenever the
+OOM killer actually kills a memory-hogging task.
+
+The default value is 1 (enabled).
+
+
+oom_kill_allocating_task
+========================
+
+This enables or disables killing the OOM-triggering task in
+out-of-memory situations.
+
+If this is set to zero, the OOM killer will scan through the entire
+tasklist and select a task based on heuristics to kill. This normally
+selects a rogue memory-hogging task that frees up a large amount of
+memory when killed.
+
+If this is set to non-zero, the OOM killer simply kills the task that
+triggered the out-of-memory condition. This avoids the expensive
+tasklist scan.
+
+If panic_on_oom is selected, it takes precedence over whatever value
+is used in oom_kill_allocating_task.
+
+The default value is 0.
+
+
+overcommit_kbytes
+=================
+
+When overcommit_memory is set to 2, the committed address space is not
+permitted to exceed swap plus this amount of physical RAM. See below.
+
+Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
+of them may be specified at a time. Setting one disables the other (which
+then appears as 0 when read).
+
+
+overcommit_memory
+=================
+
+This value contains a flag that enables memory overcommitment.
+
+When this flag is 0, the kernel attempts to estimate the amount
+of free memory left when userspace requests more memory.
+
+When this flag is 1, the kernel pretends there is always enough
+memory until it actually runs out.
+
+When this flag is 2, the kernel uses a "never overcommit"
+policy that attempts to prevent any overcommit of memory.
+Note that user_reserve_kbytes affects this policy.
+
+This feature can be very useful because there are a lot of
+programs that malloc() huge amounts of memory "just-in-case"
+and don't use much of it.
+
+The default value is 0.
+
+See Documentation/vm/overcommit-accounting.rst and
+mm/util.c::__vm_enough_memory() for more information.
+
+
+overcommit_ratio
+================
+
+When overcommit_memory is set to 2, the committed address
+space is not permitted to exceed swap plus this percentage
+of physical RAM. See above.
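+
+As a worked example with hypothetical figures, a machine with 16 GiB of RAM,
+4 GiB of swap and overcommit_ratio set to 50 would get the following limit.
+This sketch follows the description above and ignores any further adjustments
+the kernel may make::
+
+    limit = swap + RAM * overcommit_ratio / 100
+          = 4 GiB + 16 GiB * 50 / 100
+          = 12 GiB
+
+The limit currently in effect is reported as CommitLimit in /proc/meminfo.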
+
+
+page-cluster
+============
+
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
+
+It is a logarithmic value - setting it to zero means "1 page", setting
+it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
+
+The default value is three (eight pages at a time). There may be some
+small benefits in tuning this to a different value if your workload is
+swap-intensive.
+
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+the consecutive pages that readahead would have brought in.
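+
+To see how many pages the current setting reads per attempt, the logarithmic
+value can be expanded with shell arithmetic. A minimal sketch::
+
+    # 2 to the power of page-cluster, e.g. 8 for the default value of 3
+    echo $((1 << $(cat /proc/sys/vm/page-cluster)))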
+
+
+panic_on_oom
+============
+
+This enables or disables the panic-on-out-of-memory feature.
+
+If this is set to 0, the kernel will kill some rogue process using the
+OOM killer. Usually, the OOM killer can kill rogue processes and the
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits its node usage with mempolicy/cpusets,
+and those nodes run out of memory, one process
+may be killed by the OOM killer. No panic occurs in this case,
+because other nodes' memory may be free. This means the system as a whole
+may not be in a fatal state yet.
+
+If this is set to 2, the kernel always panics, even in the
+case described above. Even if OOM happens under a memory cgroup, the whole
+system panics.
+
+The default value is 0.
+
+1 and 2 are for failover in clusters. Please select either
+according to your failover policy.
+
+panic_on_oom=2 combined with kdump gives you a very strong tool to investigate
+why OOM happens. You can get a snapshot.
+
+
+percpu_pagelist_fraction
+========================
+
+This is the maximum fraction of pages (the high mark, pcp->high) in each zone
+that are allocated for each per cpu page list. The min value for this is 8. It
+means that we don't allow more than 1/8th of pages in each zone to be
+allocated in any single per_cpu_pagelist. This entry only changes the value
+of hot per cpu pagelists. A user can specify a number like 100 to allocate
+1/100th of each zone to each per cpu page list.
+
+The batch value of each per cpu pagelist is also updated as a result. It is
+set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).
+
+The initial value is zero. The kernel does not use this value at boot time to set
+the high water marks for each per cpu page list. If the user writes '0' to this
+sysctl, it will revert to this default behavior.
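+
+As a worked example of the rules above, take a hypothetical zone of 1000000
+pages on a system where PAGE_SHIFT is assumed to be 12 (4 KiB pages), with
+percpu_pagelist_fraction set to 100::
+
+    pcp->high         = 1000000 / 100  = 10000 pages
+    pcp->high / 4     = 10000 / 4      = 2500
+    batch upper limit = PAGE_SHIFT * 8 = 12 * 8 = 96
+    batch             = 96 (2500 exceeds the upper limit)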
+
+
+stat_interval
+=============
+
+The time interval between which vm statistics are updated. The default
+is 1 second.
+
+
+stat_refresh
+============
+
+Any read or write (by root only) flushes all the per-cpu vm statistics
+into their global totals, for more accurate reports when testing
+e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
+
+As a side-effect, it also checks for negative totals (elsewhere reported
+as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
+(At time of writing, a few stats are known sometimes to be found negative,
+with no ill effects: errors and warnings on these stats are suppressed.)
+
+
+numa_stat
+=========
+
+This interface allows runtime configuration of numa statistics.
+
+When page allocation performance becomes a bottleneck and you can tolerate
+some possible tool breakage and decreased numa counter precision, you can
+do::
+
+ echo 0 > /proc/sys/vm/numa_stat
+
+When page allocation performance is not a bottleneck and you want all
+tooling to work, you can do::
+
+ echo 1 > /proc/sys/vm/numa_stat
+
+
+swappiness
+==========
+
+This control is used to define how aggressively the kernel will swap
+memory pages. Higher values will increase aggressiveness, lower values
+decrease the amount of swap. A value of 0 instructs the kernel not to
+initiate swap until the amount of free and file-backed pages is less
+than the high water mark in a zone.
+
+The default value is 60.
+
+
+unprivileged_userfaultfd
+========================
+
+This flag controls whether unprivileged users can use the userfaultfd
+system calls. Set this to 1 to allow unprivileged users to use the
+userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
+privileged users (with CAP_SYS_PTRACE capability).
+
+The default value is 1.
+
+
+user_reserve_kbytes
+===================
+
+When overcommit_memory is set to 2, "never overcommit" mode, reserve
+min(3% of current process size, user_reserve_kbytes) of free memory.
+This is intended to prevent a user from starting a single memory hogging
+process, such that they cannot recover (kill the hog).
+
+user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
+
+If this is reduced to zero, then the user will be allowed to allocate
+all free memory with a single process, minus admin_reserve_kbytes.
+Any subsequent attempts to execute a command will result in
+"fork: Cannot allocate memory".
+
+Changing this takes effect whenever an application requests memory.
+
+
+vfs_cache_pressure
+==================
+
+This percentage value controls the tendency of the kernel to reclaim
+the memory which is used for caching of directory and inode objects.
+
+At the default value of vfs_cache_pressure=100 the kernel will attempt to
+reclaim dentries and inodes at a "fair" rate with respect to pagecache and
+swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
+to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
+never reclaim dentries and inodes due to memory pressure and this can easily
+lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
+causes the kernel to prefer to reclaim dentries and inodes.
+
+Increasing vfs_cache_pressure significantly beyond 100 may have negative
+performance impact. Reclaim code needs to take various locks to find freeable
+directory and inode objects. With vfs_cache_pressure=1000, it will look for
+ten times more freeable objects than there are.
+
+
+watermark_boost_factor
+======================
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor
+parameter, the unit is in fractions of 10,000. The default value of
+15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
+watermark will be reclaimed in the event of a pageblock being mixed due
+to fragmentation. The level of reclaim is determined by the number of
+fragmentation events that occurred in the recent past. If this value is
+smaller than a pageblock then a pageblock's worth of pages will be reclaimed
+(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+
+
+watermark_scale_factor
+======================
+
+This factor controls the aggressiveness of kswapd. It defines the
+amount of memory left in a node/system before kswapd is woken up and
+how much memory needs to be free before kswapd goes back to sleep.
+
+The unit is in fractions of 10,000. The default value of 10 means the
+distances between watermarks are 0.1% of the available memory in the
+node/system. The maximum value is 1000, or 10% of memory.
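+
+As a worked example, assume a hypothetical node with 16384 MiB of available
+memory. The figures below follow only the description above and ignore any
+other limits the kernel applies when computing watermarks::
+
+    distance = memory * watermark_scale_factor / 10000
+
+    default (10):   16384 MiB * 10 / 10000   = about 16.4 MiB
+    maximum (1000): 16384 MiB * 1000 / 10000 = about 1638.4 MiB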
+
+A high rate of threads entering direct reclaim (allocstall) or kswapd
+going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
+that the number of free pages kswapd maintains for latency reasons is
+too small for the allocation bursts occurring in the system. This knob
+can then be used to tune kswapd aggressiveness accordingly.
+
+
+zone_reclaim_mode
+=================
+
+Zone_reclaim_mode allows someone to set more or less aggressive approaches to
+reclaim memory when a zone runs out of memory. If it is set to zero then no
+zone reclaim occurs. Allocations will be satisfied from other zones / nodes
+in the system.
+
+This value is the bitwise OR of the following:
+
+= ===================================
+1 Zone reclaim on
+2 Zone reclaim writes dirty pages out
+4 Zone reclaim swaps pages
+= ===================================
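+
+For example, to turn zone reclaim on and also let it write out dirty pages,
+combine the first two values from the table (1 + 2 = 3). A minimal sketch, to
+be run as root::
+
+    echo 3 > /proc/sys/vm/zone_reclaim_mode
+    # confirm the mode now in effect
+    cat /proc/sys/vm/zone_reclaim_mode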
+
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
+data locality.
+
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
+Allowing zone reclaim to write out pages stops processes that are
+writing large amounts of data from dirtying pages on other nodes. Zone
+reclaim will write out dirty pages if a zone fills up and so effectively
+throttle the process. This may decrease the performance of a single process
+since it cannot use all of system memory to buffer the outgoing writes
+anymore but it preserves the memory on other nodes so that the performance
+of other processes running on other nodes will not be affected.
+
+Allowing regular swap effectively restricts allocations to the local
+node unless explicitly overridden by memory policies or cpuset
+configurations.
diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst
index 7b9035c..72b2cfb 100644
--- a/Documentation/admin-guide/sysrq.rst
+++ b/Documentation/admin-guide/sysrq.rst
@@ -171,22 +171,20 @@
useful when you want to exit a program that will not let you switch consoles.
(For example, X or a svgalib program.)
-``reboot(b)`` is good when you're unable to shut down. But you should also
-``sync(s)`` and ``umount(u)`` first.
+``reboot(b)`` is good when you're unable to shut down; it is equivalent
+to pressing the "reset" button.
``crash(c)`` can be used to manually trigger a crashdump when the system is hung.
Note that this just triggers a crash if there is no dump mechanism available.
-``sync(s)`` is great when your system is locked up, it allows you to sync your
-disks and will certainly lessen the chance of data loss and fscking. Note
-that the sync hasn't taken place until you see the "OK" and "Done" appear
-on the screen. (If the kernel is really in strife, you may not ever get the
-OK or Done message...)
+``sync(s)`` is handy before yanking a removable medium or after using a rescue
+shell that provides no graceful shutdown -- it will ensure your data is
+safely written to the disk. Note that the sync hasn't taken place until you see
+the "OK" and "Done" appear on the screen.
-``umount(u)`` is basically useful in the same ways as ``sync(s)``. I generally
-``sync(s)``, ``umount(u)``, then ``reboot(b)`` when my system locks. It's saved
-me many a fsck. Again, the unmount (remount read-only) hasn't taken place until
-you see the "OK" and "Done" message appear on the screen.
+``umount(u)`` can be used to mark filesystems as properly unmounted. From the
+running system's point of view, they will be remounted read-only. The remount
+isn't complete until you see the "OK" and "Done" message appear on the screen.
The loglevels ``0``-``9`` are useful when your console is being flooded with
kernel messages you do not want to see. Selecting ``0`` will prevent all but
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 28a869c..71e9184 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -1,59 +1,164 @@
Tainted kernels
---------------
-Some oops reports contain the string **'Tainted: '** after the program
-counter. This indicates that the kernel has been tainted by some
-mechanism. The string is followed by a series of position-sensitive
-characters, each representing a particular tainted value.
+The kernel will mark itself as 'tainted' when something occurs that might be
+relevant later when investigating problems. Don't worry too much about this,
+most of the time it's not a problem to run a tainted kernel; the information is
+mainly of interest once someone wants to investigate some problem, as its real
+cause might be the event that got the kernel tainted. That's why bug reports
+from tainted kernels will often be ignored by developers, hence try to reproduce
+problems with an untainted kernel.
- 1) ``G`` if all modules loaded have a GPL or compatible license, ``P`` if
+Note the kernel will remain tainted even after you undo what caused the taint
+(e.g. unload a proprietary kernel module), to indicate the kernel remains not
+trustworthy. That's also why the kernel will print the tainted state when it
+notices an internal problem (a 'kernel bug'), a recoverable error
+('kernel oops') or a non-recoverable error ('kernel panic') and writes debug
+information about this to the logs ``dmesg`` outputs. It's also possible to
+check the tainted state at runtime through a file in ``/proc/``.
+
+
+Tainted flag in bugs, oops or panics messages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You find the tainted state near the top in a line starting with 'CPU:'; if or
+why the kernel was tainted is shown after the Process ID ('PID:') and a shortened
+name of the command ('Comm:') that triggered the event::
+
+ BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
+ Oops: 0002 [#1] SMP PTI
+ CPU: 0 PID: 4424 Comm: insmod Tainted: P W O 4.20.0-0.rc6.fc30 #1
+ Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
+ RIP: 0010:my_oops_init+0x13/0x1000 [kpanic]
+ [...]
+
+You'll find a 'Not tainted: ' there if the kernel was not tainted at the
+time of the event; if it was, then it will print 'Tainted: ' followed by
+characters that are either letters or blanks. In the above example it looks like this::
+
+ Tainted: P W O
+
+The meaning of those characters is explained in the table below. In this case
+the kernel got tainted earlier because a proprietary module (``P``) was loaded,
+a warning occurred (``W``), and an externally-built module was loaded (``O``).
+To decode other letters use the table below.
+
+
+Decoding tainted state at runtime
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At runtime, you can query the tainted state by running
+``cat /proc/sys/kernel/tainted``. If that returns ``0``, the kernel is not
+tainted; any other number indicates the reasons why it is. The easiest way to
+decode that number is the script ``tools/debugging/kernel-chktaint``, which your
+distribution might ship as part of a package called ``linux-tools`` or
+``kernel-tools``; if it doesn't you can download the script from
+`git.kernel.org <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/debugging/kernel-chktaint>`_
+and execute it with ``sh kernel-chktaint``, which would print something like
+this on the machine that had the statements in the logs that were quoted earlier::
+
+ Kernel is Tainted for following reasons:
+ * Proprietary module was loaded (#0)
+ * Kernel issued warning (#9)
+ * Externally-built ('out-of-tree') module was loaded (#12)
+ See Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel or
+ https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html for
+ a more details explanation of the various taint flags.
+ Raw taint value as int/string: 4609/'P W O '
+
+You can try to decode the number yourself. That's easy if there was only one
+reason that got your kernel tainted, as in this case you can find the number
+with the table below. If there were multiple reasons you need to decode the
+number, as it is a bitfield, where each bit indicates the absence or presence of
+a particular type of taint. It's best to leave that to the aforementioned
+script, but if you need something quick you can use this shell command to check
+which bits are set::
+
+ $ for i in $(seq 18); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
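+
+As a worked example, the raw value 4609 from the script output quoted earlier
+splits into powers of two as follows. The bit numbers and letters correspond to
+the rows of the table below::
+
+    4609 = 4096 + 512 + 1
+         = bit 12 (O) + bit 9 (W) + bit 0 (P)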
+
+Table for decoding tainted state
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+=== === ====== ========================================================
+Bit Log Number Reason that got the kernel tainted
+=== === ====== ========================================================
+ 0 G/P 1 proprietary module was loaded
+ 1 _/F 2 module was force loaded
+ 2 _/S 4 SMP kernel oops on an officially SMP incapable processor
+ 3 _/R 8 module was force unloaded
+ 4 _/M 16 processor reported a Machine Check Exception (MCE)
+ 5 _/B 32 bad page referenced or some unexpected page flags
+ 6 _/U 64 taint requested by userspace application
+ 7 _/D 128 kernel died recently, i.e. there was an OOPS or BUG
+ 8 _/A 256 ACPI table overridden by user
+ 9 _/W 512 kernel issued warning
+ 10 _/C 1024 staging driver was loaded
+ 11 _/I 2048 workaround for bug in platform firmware applied
+ 12 _/O 4096 externally-built ("out-of-tree") module was loaded
+ 13 _/E 8192 unsigned module was loaded
+ 14 _/L 16384 soft lockup occurred
+ 15 _/K 32768 kernel has been live patched
+ 16 _/X 65536 auxiliary taint, defined for and used by distros
+ 17 _/T 131072 kernel was built with the struct randomization plugin
+=== === ====== ========================================================
+
+Note: The character ``_`` represents a blank in this table to make reading
+easier.
+
+More detailed explanation for tainting
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ 0) ``G`` if all modules loaded have a GPL or compatible license, ``P`` if
any proprietary module has been loaded. Modules without a
MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by
insmod as GPL compatible are assumed to be proprietary.
- 2) ``F`` if any module was force loaded by ``insmod -f``, ``' '`` if all
+ 1) ``F`` if any module was force loaded by ``insmod -f``, ``' '`` if all
modules were loaded normally.
- 3) ``S`` if the oops occurred on an SMP kernel running on hardware that
+ 2) ``S`` if the oops occurred on an SMP kernel running on hardware that
hasn't been certified as safe to run multiprocessor.
Currently this occurs only on various Athlons that are not
SMP capable.
- 4) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
+ 3) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
modules were unloaded normally.
- 5) ``M`` if any processor has reported a Machine Check Exception,
+ 4) ``M`` if any processor has reported a Machine Check Exception,
``' '`` if no Machine Check Exceptions have occurred.
- 6) ``B`` if a page-release function has found a bad page reference or
- some unexpected page flags.
+ 5) ``B`` If a page-release function has found a bad page reference or some
+ unexpected page flags. This indicates a hardware problem or a kernel bug;
+ there should be other information in the log indicating why this tainting
+ occurred.
- 7) ``U`` if a user or user application specifically requested that the
+ 6) ``U`` if a user or user application specifically requested that the
Tainted flag be set, ``' '`` otherwise.
- 8) ``D`` if the kernel has died recently, i.e. there was an OOPS or BUG.
+ 7) ``D`` if the kernel has died recently, i.e. there was an OOPS or BUG.
- 9) ``A`` if the ACPI table has been overridden.
+ 8) ``A`` if an ACPI table has been overridden.
- 10) ``W`` if a warning has previously been issued by the kernel.
+ 9) ``W`` if a warning has previously been issued by the kernel.
(Though some warnings may set more specific taint flags.)
- 11) ``C`` if a staging driver has been loaded.
+ 10) ``C`` if a staging driver has been loaded.
- 12) ``I`` if the kernel is working around a severe bug in the platform
+ 11) ``I`` if the kernel is working around a severe bug in the platform
firmware (BIOS or similar).
- 13) ``O`` if an externally-built ("out-of-tree") module has been loaded.
+ 12) ``O`` if an externally-built ("out-of-tree") module has been loaded.
- 14) ``E`` if an unsigned module has been loaded in a kernel supporting
+ 13) ``E`` if an unsigned module has been loaded in a kernel supporting
module signature.
- 15) ``L`` if a soft lockup has previously occurred on the system.
+ 14) ``L`` if a soft lockup has previously occurred on the system.
- 16) ``K`` if the kernel has been live patched.
+ 15) ``K`` if the kernel has been live patched.
-The primary reason for the **'Tainted: '** string is to tell kernel
-debuggers if this is a clean kernel or if anything unusual has
-occurred. Tainting is permanent: even if an offending module is
-unloaded, the tainted value remains to indicate that the kernel is not
-trustworthy.
+ 16) ``X`` Auxiliary taint, defined for and used by Linux distributors.
+
+ 17) ``T`` Kernel was built with the randstruct plugin, which can intentionally
+ produce extremely unusual kernel structure layouts (even performance
+ pathological ones), which is important to know when debugging. Set at
+ build time.
diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst
index 35fccba..898ad78 100644
--- a/Documentation/admin-guide/thunderbolt.rst
+++ b/Documentation/admin-guide/thunderbolt.rst
@@ -133,6 +133,26 @@
the device without a key or write a new key and write 1 to the
``authorized`` file to get the new key stored on the device NVM.
+DMA protection utilizing IOMMU
+------------------------------
+Recent systems from 2018 and forward with Thunderbolt ports may natively
+support IOMMU. This means that Thunderbolt security is handled by an IOMMU
+so connected devices cannot access memory regions outside of what is
+allocated for them by drivers. When Linux is running on such a system it
+automatically enables IOMMU if not enabled by the user already. These
+systems can be identified by reading ``1`` from
+``/sys/bus/thunderbolt/devices/domainX/iommu_dma_protection`` attribute.
+
+The driver does not do anything special in this case but because DMA
+protection is handled by the IOMMU, security levels (if set) are
+redundant. For this reason some systems ship with security level set to
+``none``. Other systems have security level set to ``user`` in order to
+support downgrading to an older OS, so users who want to automatically
+authorize devices when IOMMU DMA protection is enabled can use the
+following ``udev`` rule::
+
+ ACTION=="add", SUBSYSTEM=="thunderbolt", ATTRS{iommu_dma_protection}=="1", ATTR{authorized}=="0", ATTR{authorized}="1"
+
Upgrading NVM on Thunderbolt device or host
-------------------------------------------
Since most of the functionality is handled in firmware running on a
diff --git a/Documentation/admin-guide/ufs.rst b/Documentation/admin-guide/ufs.rst
new file mode 100644
index 0000000..55d1529
--- /dev/null
+++ b/Documentation/admin-guide/ufs.rst
@@ -0,0 +1,68 @@
+=========
+Using UFS
+=========
+
+mount -t ufs -o ufstype=type_of_ufs device dir
+
+
+UFS Options
+===========
+
+ufstype=type_of_ufs
+ UFS is a file system widely used in different operating systems.
+ The problem is the differences among implementations. Features of
+ some implementations are undocumented, so it's hard to recognize
+ the type of ufs automatically. That's why the user must specify the
+ type of ufs manually with the mount option ufstype. Possible values are:
+
+ old
+ old format of ufs
+ default value, supported as read-only
+
+ 44bsd
+ used in FreeBSD, NetBSD, OpenBSD
+ supported as read-write
+
+ ufs2
+ used in FreeBSD 5.x
+ supported as read-write
+
+ 5xbsd
+ synonym for ufs2
+
+ sun
+ used in SunOS (Solaris)
+ supported as read-write
+
+ sunx86
+ used in SunOS for Intel (Solarisx86)
+ supported as read-write
+
+ hp
+ used in HP-UX
+ supported as read-only
+
+ nextstep
+ used in NextStep
+ supported as read-only
+
+ nextstep-cd
+ used for NextStep CDROMs (block_size == 2048)
+ supported as read-only
+
+ openstep
+ used in OpenStep
+ supported as read-only
+
+
+Possible Problems
+-----------------
+
+See next section, if you have any.
+
+
+Bug Reports
+-----------
+
+Send any ufs bug reports to daniel.pirkl@email.cz or
+to dushistov@mail.ru (do not send partition table bug reports).
diff --git a/Documentation/admin-guide/video-output.rst b/Documentation/admin-guide/video-output.rst
new file mode 100644
index 0000000..56d6fa2
--- /dev/null
+++ b/Documentation/admin-guide/video-output.rst
@@ -0,0 +1,34 @@
+Video Output Switcher Control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+2006 luming.yu@intel.com
+
+The output sysfs class driver provides an abstract video output layer that
+can be used to hook platform specific methods to enable/disable video output
+devices through a common sysfs interface. For example, on my IBM ThinkPad T42
+laptop, the ACPI video driver registered its output devices and read/write
+method for 'state' with the output sysfs class. The user interface under sysfs is::
+
+ linux:/sys/class/video_output # tree .
+ .
+ |-- CRT0
+ | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
+ | |-- state
+ | |-- subsystem -> ../../../class/video_output
+ | `-- uevent
+ |-- DVI0
+ | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
+ | |-- state
+ | |-- subsystem -> ../../../class/video_output
+ | `-- uevent
+ |-- LCD0
+ | |-- device -> ../../../devices/pci0000:00/0000:00:01.0
+ | |-- state
+ | |-- subsystem -> ../../../class/video_output
+ | `-- uevent
+ `-- TV0
+ |-- device -> ../../../devices/pci0000:00/0000:00:01.0
+ |-- state
+ |-- subsystem -> ../../../class/video_output
+ `-- uevent
+
diff --git a/Documentation/admin-guide/wimax/i2400m.rst b/Documentation/admin-guide/wimax/i2400m.rst
new file mode 100644
index 0000000..194388c
--- /dev/null
+++ b/Documentation/admin-guide/wimax/i2400m.rst
@@ -0,0 +1,283 @@
+.. include:: <isonum.txt>
+
+====================================================
+Driver for the Intel Wireless Wimax Connection 2400m
+====================================================
+
+:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com >
+
+ This provides a driver for the Intel Wireless WiMAX Connection 2400m
+ and a basic Linux kernel WiMAX stack.
+
+1. Requirements
+===============
+
+ * Linux installation with Linux kernel 2.6.22 or newer (if building
+ from a separate tree)
+ * Intel i2400m Echo Peak or Baxter Peak; this includes the Intel
+ Wireless WiMAX/WiFi Link 5x50 series.
+ * build tools:
+
+ + Linux kernel development package for the target kernel; to
+ build against your currently running kernel, you need to have
+ the kernel development package corresponding to the running
+ image installed (usually if your kernel is named
+ linux-VERSION, the development package is called
+ linux-dev-VERSION or linux-headers-VERSION).
+ + GNU C Compiler, make
+
+2. Compilation and installation
+===============================
+
+2.1. Compilation of the drivers included in the kernel
+------------------------------------------------------
+
+ Configure the kernel; to enable the WiMAX drivers select Drivers >
+ Networking Drivers > WiMAX device support. Enable all of them as
+ modules (easier).
+
+ If USB or SDIO are not enabled in the kernel configuration, the options
+ to build the i2400m USB or SDIO drivers will not show. Enable said
+ subsystems and go back to the WiMAX menu to enable the drivers.
+
+ Compile and install your kernel as usual.
+
+2.2. Compilation of the drivers distributed as a standalone module
+------------------------------------------------------------------
+
+ To compile::
+
+ $ cd source/directory
+ $ make
+
+ Once built you can load and unload using the provided load.sh script;
+ load.sh will load the modules, load.sh u will unload them.
+
+ To install in the default kernel directories (and enable auto loading
+ when the device is plugged)::
+
+ $ make install
+ $ depmod -a
+
+ If your kernel development files are located in a non standard
+ directory or if you want to build for a kernel that is not the
+ currently running one, set KDIR to the right location::
+
+ $ make KDIR=/path/to/kernel/dev/tree
+
+ For more information, please contact linux-wimax@intel.com.
+
+3. Installing the firmware
+==========================
+
+ The firmware can be obtained from http://linuxwimax.org or might have
+ been supplied with your hardware.
+
+ It has to be installed in the target system::
+
+ $ cp FIRMWAREFILE.sbcf /lib/firmware/i2400m-fw-BUSTYPE-1.3.sbcf
+
+ * NOTE: if your firmware came in an .rpm or .deb file, just install
+ it as normal, with the rpm (rpm -i FIRMWARE.rpm) or dpkg
+ (dpkg -i FIRMWARE.deb) commands. No further action is needed.
+ * BUSTYPE will be usb or sdio, depending on the hardware you have.
+ Each hardware type comes with its own firmware and will not work
+ with other types.
+
+4. Design
+=========
+
+ This package contains two major parts: a WiMAX kernel stack and a
+ driver for the Intel i2400m.
+
+ The WiMAX stack is designed to provide for common WiMAX control
+ services to current and future WiMAX devices from any vendor; please
+ see README.wimax for details.
+
+ The i2400m kernel driver is broken up in two main parts: the bus
+ generic driver and the bus-specific drivers. The bus generic driver
+ forms the driver core and contains no knowledge of the actual method we
+ use to connect to the device. The bus specific drivers are just the
+ glue to connect the bus-generic driver and the device. Currently only
+ USB and SDIO are supported. See drivers/net/wimax/i2400m/i2400m.h for
+ more information.
+
+ The bus generic driver is logically broken up in two parts: OS-glue and
+ hardware-glue. The OS-glue interfaces with Linux. The hardware-glue
+ interfaces with the device using an interface provided by the
+ bus-specific driver. The reason for this breakup is to be able to
+ easily reuse the hardware-glue to write drivers for other OSes; note
+ the hardware glue part is written as a native Linux driver; no
+ abstraction layers are used, so to port to another OS, the Linux kernel
+ API calls should be replaced with the target OS's.
+
+5. Usage
+========
+
+ To load the driver, follow the instructions in the install section;
+ once the driver is loaded, plug in the device (unless it is permanently
+ plugged in). The driver will enumerate the device, upload the firmware
+ and output messages in the kernel log (dmesg, /var/log/messages or
+ /var/log/kern.log) such as::
+
+ ...
+ i2400m_usb 5-4:1.0: firmware interface version 8.0.0
+ i2400m_usb 5-4:1.0: WiMAX interface wmx0 (00:1d:e1:01:94:2c) ready
+
+ At this point the device is ready to work.
+
+ Current versions require the Intel WiMAX Network Service in userspace
+ to make things work. See the network service's README for instructions
+ on how to scan, connect and disconnect.
+
+5.1. Module parameters
+----------------------
+
+ Module parameters can be set at kernel or module load time or by
+ echoing values::
+
+ $ echo VALUE > /sys/module/MODULENAME/parameters/PARAMETERNAME
+
+ To make changes permanent, for example, for the i2400m module, you can
+ also create a file named /etc/modprobe.d/i2400m containing::
+
+ options i2400m idle_mode_disabled=1
+
+ To find which parameters are supported by a module, run::
+
+ $ modinfo path/to/module.ko
+
+ During kernel bootup (if the driver is linked in the kernel), specify
+ the following on the kernel command line::
+
+ i2400m.PARAMETER=VALUE
+
+5.1.1. i2400m: idle_mode_disabled
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The i2400m module supports a parameter to disable idle mode. This
+ parameter, once set, will take effect only when the device is
+ reinitialized by the driver (eg: following a reset or a reconnect).
+
+5.2. Debug operations: debugfs entries
+--------------------------------------
+
+ The driver will register debugfs entries that allow the user to tweak
+ debug settings. There are three main container directories where
+ entries are placed, which correspond to the three blocks an i2400m WiMAX
+ driver has:
+
+ * /sys/kernel/debug/wimax:DEVNAME/ for the generic WiMAX stack
+ controls
+ * /sys/kernel/debug/wimax:DEVNAME/i2400m for the i2400m generic
+ driver controls
+ * /sys/kernel/debug/wimax:DEVNAME/i2400m-usb (or -sdio) for the
+ bus-specific i2400m-usb or i2400m-sdio controls.
+
+ Of course, if debugfs is mounted in a directory other than
+ /sys/kernel/debug, those paths will change.
+
+5.2.1. Increasing debug output
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The files named *dl_* indicate knobs for controlling the debug output
+ of different submodules::
+
+ # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\*
+ /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_tx
+ /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_rx
+ /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_notif
+ /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_fw
+ /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_usb
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_rx
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_rfkill
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_netdev
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_fw
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_debugfs
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_driver
+ /sys/kernel/debug/wimax:wmx0/i2400m/dl_control
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_stack
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs
+
+ By reading the file you can obtain the current value of said debug
+ level; by writing to it, you can set it.
+
+ To increase the debug level of, for example, the i2400m's generic TX
+ engine, just write::
+
+ $ echo 3 > /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx
+
+ Increasing numbers yield increasing debug information; for details of
+ what is printed and the available levels, check the source. The code
+ uses 0 for disabled and increasing values until 8.
+
+5.2.2. RX and TX statistics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ The i2400m/rx_stats and i2400m/tx_stats provide statistics about the
+ data reception/delivery from the device::
+
+ $ cat /sys/kernel/debug/wimax:wmx0/i2400m/rx_stats
+ 45 1 3 34 3104 48 480
+
+ The numbers reported are:
+
+ * packets/RX-buffer: total, min, max
+ * RX-buffers: total RX buffers received, accumulated RX buffer size
+ in bytes, min size received, max size received
+
+ Thus, to find the average buffer size received, divide accumulated
+ RX-buffer / total RX-buffers.
+
+ To clear the statistics back to 0, write anything to the rx_stats file::
+
+ $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m/rx_stats
+
+ Likewise for TX.
+
+ Note the packets this debug file refers to are not network packets, but
+ packets in the sense of the device-specific protocol for communication
+ to the host. See drivers/net/wimax/i2400m/tx.c.
+
+5.2.3. Tracing messages received from user space
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ To echo messages received from user space into the trace pipe that the
+ i2400m driver creates, set the debug file i2400m/trace_msg_from_user to
+ 1::
+
+ $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m/trace_msg_from_user
+
+5.2.4. Performing a device reset
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ By writing a 0, a 1 or a 2 to the file
+ /sys/kernel/debug/wimax:wmx0/reset, the driver performs a warm (without
+ disconnecting from the bus), cold (disconnecting from the bus) or bus
+ (bus specific) reset on the device.
+
+5.2.5. Asking the device to enter power saving mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ By writing any value to the /sys/kernel/debug/wimax:wmx0 file, the
+ device will attempt to enter power saving mode.
+
+6. Troubleshooting
+==================
+
+6.1. Driver complains about ``i2400m-fw-usb-1.3.sbcf: request failed``
+----------------------------------------------------------------------
+
+ If upon connecting the device, the following is output in the kernel
+ log::
+
+ i2400m_usb 5-4:1.0: fw i2400m-fw-usb-1.3.sbcf: request failed: -2
+
+ This means that the driver cannot locate the firmware file named
+ /lib/firmware/i2400m-fw-usb-1.3.sbcf. Check that the file is present in
+ the right location.
diff --git a/Documentation/admin-guide/wimax/index.rst b/Documentation/admin-guide/wimax/index.rst
new file mode 100644
index 0000000..fdf7c1f
--- /dev/null
+++ b/Documentation/admin-guide/wimax/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+WiMAX subsystem
+===============
+
+.. toctree::
+ :maxdepth: 2
+
+ wimax
+
+ i2400m
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/admin-guide/wimax/wimax.rst b/Documentation/admin-guide/wimax/wimax.rst
new file mode 100644
index 0000000..817ee8b
--- /dev/null
+++ b/Documentation/admin-guide/wimax/wimax.rst
@@ -0,0 +1,89 @@
+.. include:: <isonum.txt>
+
+========================
+Linux kernel WiMAX stack
+========================
+
+:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com >
+
+ This provides a basic Linux kernel WiMAX stack to provide a common
+ control API for WiMAX devices, usable from kernel and user space.
+
+1. Design
+=========
+
+ The WiMAX stack is designed to provide for common WiMAX control
+ services to current and future WiMAX devices from any vendor.
+
+ Because there is currently only one such device and we don't know what
+ the common services would be, the APIs it currently provides are minimal.
+ However, it is done in such a way that it is easily extensible to
+ accommodate future requirements.
+
+ The stack works by embedding a struct wimax_dev in your device's
+ control structures. This provides a set of callbacks that the WiMAX
+ stack will call in order to implement control operations requested by
+ the user. As well, the stack provides API functions that the driver
+ calls to notify about changes of state in the device.
+
+ The stack exports the API calls needed to control the device to user
+ space using generic netlink as a marshalling mechanism. You can access
+ them using your own code or use the wrappers provided for your
+ convenience in libwimax (in the wimax-tools package).
+
+ For detailed information on the stack, please see
+ include/linux/wimax.h.
+
+2. Usage
+========
+
+ For usage in a driver (registration, API, etc) please refer to the
+ instructions in the header file include/linux/wimax.h.
+
+ When a device is registered with the WiMAX stack, a set of debugfs
+ files will appear in /sys/kernel/debug/wimax:wmxX that you can tweak
+ for control.
+
+2.1. Obtaining debug information: debugfs entries
+-------------------------------------------------
+
+ The WiMAX stack is compiled, by default, with debug messages that can
+ be used to diagnose issues. By default, said messages are disabled.
+
+ The drivers will register debugfs entries that allow the user to tweak
+ debug settings.
+
+ Each driver, when registering with the stack, will cause a debugfs
+ directory named wimax:DEVICENAME to be created; optionally, it might
+ create more subentries below it.
+
+2.1.1. Increasing debug output
+------------------------------
+
+ The files named *dl_* indicate knobs for controlling the debug output
+ of different submodules of the WiMAX stack::
+
+ # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\*
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_stack
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
+ /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs
+ /sys/kernel/debug/wimax:wmx0/.... # other driver specific files
+
+ NOTE:
+ Of course, if debugfs is mounted in a directory other than
+ /sys/kernel/debug, those paths will change.
+
+ By reading the file you can obtain the current value of said debug
+ level; by writing to it, you can set it.
+
+ To increase the debug level of, for example, the id-table submodule,
+ just write::
+
+ $ echo 3 > /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table
+
+ Increasing numbers yield increasing debug information; for details of
+ what is printed and the available levels, check the source. The code
+ uses 0 for disabled and increasing values until 8.
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
new file mode 100644
index 0000000..fb5b39f
--- /dev/null
+++ b/Documentation/admin-guide/xfs.rst
@@ -0,0 +1,467 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+The SGI XFS Filesystem
+======================
+
+XFS is a high performance journaling filesystem which originated
+on the SGI IRIX platform. It is completely multi-threaded, can
+support large files and large filesystems, extended attributes,
+variable block sizes, is extent based, and makes extensive use of
+Btrees (directories, extents, free space) to aid both performance
+and scalability.
+
+Refer to the documentation at https://xfs.wiki.kernel.org/
+for further details. This implementation is on-disk compatible
+with the IRIX version of XFS.
+
+
+Mount Options
+=============
+
+When mounting an XFS filesystem, the following options are accepted.
+
+ allocsize=size
+ Sets the buffered I/O end-of-file preallocation size when
+ doing delayed allocation writeout (default size is 64KiB).
+ Valid values for this option are page size (typically 4KiB)
+ through to 1GiB, inclusive, in power-of-2 increments.
+
+ The default behaviour is for dynamic end-of-file
+ preallocation size, which uses a set of heuristics to
+ optimise the preallocation size based on the current
+ allocation patterns within the file and the access patterns
+ to the file. Specifying a fixed ``allocsize`` value turns off
+ the dynamic behaviour.
+
+ attr2 or noattr2
+ These options enable/disable an "opportunistic" improvement to
+ be made in the way inline extended attributes are stored
+ on-disk. When the new form is used for the first time when
+ ``attr2`` is selected (either when setting or removing extended
+ attributes) the on-disk superblock feature bit field will be
+ updated to reflect this format being in use.
+
+ The default behaviour is determined by the on-disk feature
+ bit indicating that ``attr2`` behaviour is active. If either
+ mount option is set, then that becomes the new default used
+ by the filesystem.
+
+ CRC enabled filesystems always use the ``attr2`` format, and so
+ will reject the ``noattr2`` mount option if it is set.
+
+ discard or nodiscard (default)
+ Enable/disable the issuing of commands to let the block
+ device reclaim space freed by the filesystem. This is
+ useful for SSD devices, thinly provisioned LUNs and virtual
+ machine images, but may have a performance impact.
+
+ Note: It is currently recommended that you use the ``fstrim``
+ application to ``discard`` unused blocks rather than the ``discard``
+ mount option because the performance impact of this option
+ is quite severe.
+
+ grpid/bsdgroups or nogrpid/sysvgroups (default)
+ These options define what group ID a newly created file
+ gets. When ``grpid`` is set, it takes the group ID of the
+ directory in which it is created; otherwise it takes the
+ ``fsgid`` of the current process, unless the directory has the
+ ``setgid`` bit set, in which case it takes the ``gid`` from the
+ parent directory, and also gets the ``setgid`` bit set if it is
+ a directory itself.
+
+ filestreams
+ Make the data allocator use the filestreams allocation mode
+ across the entire filesystem rather than just on directories
+ configured to use it.
+
+ ikeep or noikeep (default)
+ When ``ikeep`` is specified, XFS does not delete empty inode
+ clusters and keeps them around on disk. When ``noikeep`` is
+ specified, empty inode clusters are returned to the free
+ space pool.
+
+ inode32 or inode64 (default)
+ When ``inode32`` is specified, it indicates that XFS limits
+ inode creation to locations which will not result in inode
+ numbers with more than 32 bits of significance.
+
+ When ``inode64`` is specified, it indicates that XFS is allowed
+ to create inodes at any location in the filesystem,
+ including those which will result in inode numbers occupying
+ more than 32 bits of significance.
+
+ ``inode32`` is provided for backwards compatibility with older
+ systems and applications, since 64-bit inode numbers might
+ cause problems for some applications that cannot handle
+ large inode numbers. If applications are in use which do
+ not handle inode numbers bigger than 32 bits, the ``inode32``
+ option should be specified.
+
+ largeio or nolargeio (default)
+ If ``nolargeio`` is specified, the optimal I/O reported in
+ ``st_blksize`` by **stat(2)** will be as small as possible to allow
+ user applications to avoid inefficient read/modify/write
+ I/O. This is typically the page size of the machine, as
+ this is the granularity of the page cache.
+
+ If ``largeio`` is specified, a filesystem that was created with a
+ ``swidth`` specified will return the ``swidth`` value (in bytes)
+ in ``st_blksize``. If the filesystem does not have a ``swidth``
+ specified but does specify an ``allocsize`` then ``allocsize``
+ (in bytes) will be returned instead. Otherwise the behaviour
+ is the same as if ``nolargeio`` was specified.
+
+ logbufs=value
+ Set the number of in-memory log buffers. Valid numbers
+ range from 2-8 inclusive.
+
+ The default value is 8 buffers.
+
+ If the memory cost of 8 log buffers is too high on small
+ systems, then it may be reduced at some cost to performance
+ on metadata intensive workloads. The ``logbsize`` option below
+ controls the size of each buffer and so is also relevant to
+ this case.
+
+ logbsize=value
+ Set the size of each in-memory log buffer. The size may be
+ specified in bytes, or in kilobytes with a "k" suffix.
+ Valid sizes for version 1 and version 2 logs are 16384 (16k)
+ and 32768 (32k). Valid sizes for version 2 logs also
+ include 65536 (64k), 131072 (128k) and 262144 (256k). The
+ logbsize must be an integer multiple of the log
+ stripe unit configured at **mkfs(8)** time.
+
+ The default value for version 1 logs is 32768, while the
+ default value for version 2 logs is MAX(32768, log_sunit).
+
+ logdev=device and rtdev=device
+ Use an external log (metadata journal) and/or real-time device.
+ An XFS filesystem has up to three parts: a data section, a log
+ section, and a real-time section. The real-time section is
+ optional, and the log section can be separate from the data
+ section or contained within it.
+
+ noalign
+ Data allocations will not be aligned at stripe unit
+ boundaries. This is only relevant to filesystems created
+ with non-zero data alignment parameters (``sunit``, ``swidth``) by
+ **mkfs(8)**.
+
+ norecovery
+ The filesystem will be mounted without running log recovery.
+ If the filesystem was not cleanly unmounted, it is likely to
+ be inconsistent when mounted in ``norecovery`` mode.
+ Some files or directories may not be accessible because of this.
+ Filesystems mounted ``norecovery`` must be mounted read-only or
+ the mount will fail.
+
+ nouuid
+ Don't check for double mounted file systems using the file
+ system ``uuid``. This is useful to mount LVM snapshot volumes,
+ and often used in combination with ``norecovery`` for mounting
+ read-only snapshots.
+
+ noquota
+ Forcibly turns off all quota accounting and enforcement
+ within the filesystem.
+
+ uquota/usrquota/uqnoenforce/quota
+ User disk quota accounting enabled, and limits (optionally)
+ enforced. Refer to **xfs_quota(8)** for further details.
+
+ gquota/grpquota/gqnoenforce
+ Group disk quota accounting enabled and limits (optionally)
+ enforced. Refer to **xfs_quota(8)** for further details.
+
+ pquota/prjquota/pqnoenforce
+ Project disk quota accounting enabled and limits (optionally)
+ enforced. Refer to **xfs_quota(8)** for further details.
+
+ sunit=value and swidth=value
+ Used to specify the stripe unit and width for a RAID device
+ or a stripe volume. "value" must be specified in 512-byte
+ block units. These options are only relevant to filesystems
+ that were created with non-zero data alignment parameters.
+
+ The ``sunit`` and ``swidth`` parameters specified must be compatible
+ with the existing filesystem alignment characteristics. In
+ general, that means the only valid changes to ``sunit`` are
+ increasing it by a power-of-2 multiple. Valid ``swidth`` values
+ are any integer multiple of a valid ``sunit`` value.
+
+ Typically the only time these mount options are necessary is
+ after an underlying RAID device has had its geometry
+ modified, such as adding a new disk to a RAID5 lun and
+ reshaping it.
+
+ swalloc
+ Data allocations will be rounded up to stripe width boundaries
+ when the current end of file is being extended and the file
+ size is larger than the stripe width size.
+
+ wsync
+ When specified, all filesystem namespace operations are
+ executed synchronously. This ensures that when the namespace
+ operation (create, unlink, etc) completes, the change to the
+ namespace is on stable storage. This is useful in HA setups
+ where failover must not result in clients seeing
+ inconsistent namespace presentation during or after a
+ failover event.
+
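+As a sketch of how the options above combine, a read-only LVM snapshot can be
+mounted without log recovery and without the ``uuid`` check. The device and
+mount point names below are hypothetical::
+
+    mount -t xfs -o ro,norecovery,nouuid /dev/vg0/data-snap /mnt/snap
+
+Similarly, assuming a RAID array that now has a 64 KiB stripe unit across 4
+data disks, ``sunit`` is 65536 / 512 = 128 and ``swidth`` is 4 * 128 = 512
+(again with a hypothetical device and mount point)::
+
+    mount -t xfs -o sunit=128,swidth=512 /dev/md0 /srv/data
+
+These values must still be compatible with the filesystem's existing
+alignment, as described for ``sunit`` and ``swidth`` above.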
+
+Deprecated Mount Options
+========================
+
+=========================== ================
+ Name Removal Schedule
+=========================== ================
+=========================== ================
+
+
+Removed Mount Options
+=====================
+
+=========================== =======
+ Name Removed
+=========================== =======
+ delaylog/nodelaylog v4.0
+ ihashsize v4.0
+ irixsgid v4.0
+ osyncisdsync/osyncisosync v4.0
+ barrier v4.19
+ nobarrier v4.19
+=========================== =======
+
+sysctls
+=======
+
+The following sysctls are available for the XFS filesystem:
+
+ fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)
+ Setting this to "1" clears accumulated XFS statistics
+ in /proc/fs/xfs/stat. It then immediately resets to "0".
+
+ fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
+ The interval at which the filesystem flushes metadata
+ out to disk and runs internal cache cleanup routines.
+
+ fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000)
+ The interval at which the filesystem ages filestreams cache
+ references and returns timed-out AGs back to the free stream
+ pool.
+
+ fs.xfs.speculative_prealloc_lifetime
+ (Units: seconds Min: 1 Default: 300 Max: 86400)
+ The interval at which the background scanning for inodes
+ with unused speculative preallocation runs. The scan
+ removes unused preallocation from clean inodes and releases
+ the unused space back to the free pool.
+
+ fs.xfs.error_level (Min: 0 Default: 3 Max: 11)
+ A volume knob for error reporting when internal errors occur.
+ This will generate detailed messages & backtraces for filesystem
+ shutdowns, for example. Current threshold values are:
+
+ XFS_ERRLEVEL_OFF: 0
+ XFS_ERRLEVEL_LOW: 1
+ XFS_ERRLEVEL_HIGH: 5
+
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256)
+ Causes certain error conditions to call BUG(). Value is a bitmask;
+ OR together the tags which represent errors which should cause panics:
+
+ XFS_NO_PTAG 0
+ XFS_PTAG_IFLUSH 0x00000001
+ XFS_PTAG_LOGRES 0x00000002
+ XFS_PTAG_AILDELETE 0x00000004
+ XFS_PTAG_ERROR_REPORT 0x00000008
+ XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010
+ XFS_PTAG_SHUTDOWN_IOERROR 0x00000020
+ XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040
+ XFS_PTAG_FSBLOCK_ZERO 0x00000080
+ XFS_PTAG_VERIFIER_ERROR 0x00000100
+
+ This option is intended for debugging only.
+
+ fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)
+ Controls whether symlinks are created with mode 0777 (default)
+ or whether their mode is affected by the umask (irix mode).
+
+ fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)
+ Controls files created in SGID directories.
+ If the group ID of the new file does not match the effective group
+ ID or one of the supplementary group IDs of the parent dir, the
+ ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
+ is set.
+
+ fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "sync" flag set
+ by the **xfs_io(8)** chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nodump" flag set
+ by the **xfs_io(8)** chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "noatime" flag set
+ by the **xfs_io(8)** chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nosymlinks" flag set
+ by the **xfs_io(8)** chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1)
+ Setting this to "1" will cause the "nodefrag" flag set
+ by the **xfs_io(8)** chattr command on a directory to be
+ inherited by files in that directory.
+
+ fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)
+ In "inode32" allocation mode, this option determines how many
+ files the allocator attempts to allocate in the same allocation
+ group before moving to the next allocation group. The intent
+ is to control the rate at which the allocator moves between
+ allocation groups when allocating extents for new files.
+
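+For illustration, these sysctls can be changed at run time with **sysctl(8)**.
+The values below are examples, not recommendations; the ``panic_mask`` value is
+the OR of XFS_PTAG_SHUTDOWN_CORRUPT (0x10) and XFS_PTAG_SHUTDOWN_IOERROR
+(0x20), i.e. 48::
+
+    sysctl -w fs.xfs.error_level=5
+    sysctl -w fs.xfs.panic_mask=$((0x10 | 0x20))
+    sysctl fs.xfs.error_level fs.xfs.panic_mask
+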
+Deprecated Sysctls
+==================
+
+None at present.
+
+
+Removed Sysctls
+===============
+
+============================= =======
+ Name Removed
+============================= =======
+ fs.xfs.xfsbufd_centisec v4.0
+ fs.xfs.age_buffer_centisecs v4.0
+============================= =======
+
+Error handling
+==============
+
+XFS can act differently according to the type of error found during its
+operation. The implementation introduces the following concepts to the error
+handler:
+
+ -failure speed:
+ Defines how fast XFS should propagate an error upwards when a specific
+ error is found during the filesystem operation. It can propagate
+ immediately, after a defined number of retries, after a set time period,
+ or simply retry forever.
+
+ -error classes:
+ Specifies the subsystem the error configuration will apply to, such as
+ metadata IO or memory allocation. Different subsystems will have
+ different error handlers for which behaviour can be configured.
+
+ -error handlers:
+ Defines the behavior for a specific error.
+
+The filesystem behavior during an error can be set via ``sysfs`` files. Each
+error handler works independently - the first condition met by an error handler
+for a specific class will cause the error to be propagated rather than reset and
+retried.
+
+The action taken by the filesystem when the error is propagated is context
+dependent - it may cause a shut down in the case of an unrecoverable error,
+it may be reported back to userspace, or it may even be ignored because
+there's nothing useful we can do with the error or anyone we can report it to (e.g.
+during unmount).
+
+The configuration files are organized into the following hierarchy for each
+mounted filesystem:
+
+ /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+Where:
+ <dev>
+ The short device name of the mounted filesystem. This is the same device
+ name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
+
+ <class>
+ The subsystem the error configuration belongs to. As of 4.9, the defined
+ classes are:
+
+ - "metadata": applies to metadata buffer write IO
+
+ <error>
+ The individual error handler configurations.
+
+
+Each filesystem has "global" error configuration options defined in its
+top-level directory:
+
+ /sys/fs/xfs/<dev>/error/
+
+ fail_at_unmount (Min: 0 Default: 1 Max: 1)
+ Defines the filesystem error behavior at unmount time.
+
+ If set to a value of 1, XFS will override all other error configurations
+ during unmount and replace them with "immediate fail" characteristics,
+ i.e. no retries and no retry timeout. This will always allow unmount to
+ succeed when there are persistent errors present.
+
+ If set to 0, the configured retry behaviour will continue until all
+ retries and/or timeouts have been exhausted. This will delay unmount
+ completion when there are persistent errors, and it may prevent the
+ filesystem from ever unmounting fully in the case of "retry forever"
+ handler configurations.
+
+ Note: there is no guarantee that fail_at_unmount can be set while an
+ unmount is in progress. It is possible that the ``sysfs`` entries are
+ removed by the unmounting filesystem before a "retry forever" error
+ handler configuration causes unmount to hang, and hence the filesystem
+ must be configured appropriately before unmount begins to prevent
+ unmount hangs.
+
+Each filesystem has specific error class handlers that define the error
+propagation behaviour for specific errors. There is also a "default" error
+handler defined, which defines the behaviour for all errors that don't have
+specific handlers defined. Where multiple retry constraints are configured for
+a single error, the first retry configuration that expires will cause the error
+to be propagated. The handler configurations are found in the directory:
+
+ /sys/fs/xfs/<dev>/error/<class>/<error>/
+
+ max_retries (Min: -1 Default: Varies Max: INTMAX)
+ Defines the allowed number of retries of a specific error before
+ the filesystem will propagate the error. The retry count for a given
+ error context (e.g. a specific metadata buffer) is reset every time
+ there is a successful completion of the operation.
+
+ Setting the value to "-1" will cause XFS to retry forever for this
+ specific error.
+
+ Setting the value to "0" will cause XFS to fail immediately when the
+ specific error is reported.
+
+ Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+ operation "N" times before propagating the error.
+
+ retry_timeout_seconds (Min: -1 Default: Varies Max: 1 day)
+ Defines the amount of time (in seconds) that the filesystem is
+ allowed to retry its operations when the specific error is
+ found.
+
+ Setting the value to "-1" will allow XFS to retry forever for this
+ specific error.
+
+ Setting the value to "0" will cause XFS to fail immediately when the
+ specific error is reported.
+
+ Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
+ operation for up to "N" seconds before propagating the error.
+
+**Note:** The default behaviour for a specific error handler is dependent on both
+the class and error context. For example, the default values for
+"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
+to "fail immediately" behaviour. This is done because ENODEV is a fatal,
+unrecoverable error no matter how many times the metadata IO is retried.
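+
+As a sketch, the files above can be inspected and changed from a shell. The
+device name ``sda1`` is hypothetical; the handler directory used here is the
+"default" handler of the "metadata" class::
+
+    cat /sys/fs/xfs/sda1/error/fail_at_unmount
+    echo 5 > /sys/fs/xfs/sda1/error/metadata/default/max_retries
+    echo 30 > /sys/fs/xfs/sda1/error/metadata/default/retry_timeout_seconds
+
+With this configuration, an error handled by the "default" handler is
+propagated after 5 retries or after 30 seconds of retrying, whichever limit is
+reached first.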