Update Linux to v5.10.109
Sourced from [1]
[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.109.tar.xz
Change-Id: I19bca9fc6762d4e63bcf3e4cba88bbe560d9c76c
Signed-off-by: Olivier Deprez <olivier.deprez@arm.com>
diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst
index 08a2f10..abb9fc1 100644
--- a/Documentation/x86/boot.rst
+++ b/Documentation/x86/boot.rst
@@ -68,8 +68,26 @@
Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in
xloadflags to support booting a 64-bit kernel from 32-bit
EFI
+
+Protocol 2.14 BURNT BY INCORRECT COMMIT
+ ae7e1238e68f2a472a125673ab506d49158c1889
+ (x86/boot: Add ACPI RSDP address to setup_header)
+ DO NOT USE!!! ASSUME SAME AS 2.13.
+
+Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
============= ============================================================
+.. note::
+ The protocol version number should be changed only if the setup header
+ is changed. There is no need to update the version number if boot_params
+ or kernel_info are changed. Additionally, it is recommended to use
+ xloadflags (in this case the protocol version number should not be
+ updated either) or kernel_info to communicate supported Linux kernel
+ features to the boot loader. Due to very limited space available in
+ the original setup header every update to it should be considered
+ with great care. Starting from the protocol 2.15 the primary way to
+ communicate things to the boot loader is the kernel_info.
+
Memory Layout
=============
@@ -207,6 +225,7 @@
0258/8 2.10+ pref_address Preferred loading address
0260/4 2.10+ init_size Linear memory required during initialization
0264/4 2.11+ handover_offset Offset of handover entry point
+0268/4 2.15+ kernel_info_offset Offset of the kernel_info
=========== ======== ===================== ============================================
.. note::
@@ -233,7 +252,7 @@
supported by the protocol version in use.
-Details of Harder Fileds
+Details of Header Fields
========================
For each field, some are information from the kernel to the bootloader
@@ -471,15 +490,11 @@
kernel) to not write early messages that require
accessing the display hardware directly.
- Bit 6 (write): KEEP_SEGMENTS
+ Bit 6 (obsolete): KEEP_SEGMENTS
Protocol: 2.07+
- - If 0, reload the segment registers in the 32bit entry point.
- - If 1, do not reload the segment registers in the 32bit entry point.
-
- Assume that %cs %ds %ss %es are all set to flat segments with
- a base of 0 (or the equivalent for their environment).
+ - This flag is obsolete.
Bit 7 (write): CAN_USE_HEAP
@@ -767,9 +782,9 @@
uncompressed data should be determined using the standard magic
numbers. The currently supported compression formats are gzip
(magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
- (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number
- 02 21). The uncompressed payload is currently always ELF (magic
- number 7F 45 4C 46).
+ (magic number 5D 00), XZ (magic number FD 37), LZ4 (magic number
+ 02 21) and ZSTD (magic number 28 B5). The uncompressed payload is
+ currently always ELF (magic number 7F 45 4C 46).
============ ==============
Field name: payload_length
@@ -809,6 +824,47 @@
sure to consider the case where the linked list already contains
entries.
+ The setup_data is a bit awkward to use for extremely large data objects,
+ both because the setup_data header has to be adjacent to the data object
+ and because it has a 32-bit length field. However, it is important that
+ intermediate stages of the boot process have a way to identify which
+ chunks of memory are occupied by kernel data.
+
+ Thus setup_indirect struct and SETUP_INDIRECT type were introduced in
+ protocol 2.15::
+
+ struct setup_indirect {
+ __u32 type;
+ __u32 reserved; /* Reserved, must be set to zero. */
+ __u64 len;
+ __u64 addr;
+ };
+
+ The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be
+ SETUP_INDIRECT itself since making the setup_indirect a tree structure
+ could require a lot of stack space in something that needs to parse it
+ and stack space can be limited in boot contexts.
+
+ Let's give an example how to point to SETUP_E820_EXT data using setup_indirect.
+ In this case setup_data and setup_indirect will look like this::
+
+ struct setup_data {
+ __u64 next = 0 or <addr_of_next_setup_data_struct>;
+ __u32 type = SETUP_INDIRECT;
+ __u32 len = sizeof(setup_data);
+ __u8 data[sizeof(setup_indirect)] = struct setup_indirect {
+ __u32 type = SETUP_INDIRECT | SETUP_E820_EXT;
+ __u32 reserved = 0;
+ __u64 len = <len_of_SETUP_E820_EXT_data>;
+ __u64 addr = <addr_of_SETUP_E820_EXT_data>;
+ }
+ }
+
+.. note::
+ SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished
+ from SETUP_INDIRECT itself. So, this kind of objects cannot be provided
+ by the bootloaders.
+
============ ============
Field name: pref_address
Type: read (reloc)
@@ -855,6 +911,121 @@
See EFI HANDOVER PROTOCOL below for more details.
+============ ==================
+Field name: kernel_info_offset
+Type: read
+Offset/size: 0x268/4
+Protocol: 2.15+
+============ ==================
+
+ This field is the offset from the beginning of the kernel image to the
+ kernel_info. The kernel_info structure is embedded in the Linux image
+ in the uncompressed protected mode region.
+
+
+The kernel_info
+===============
+
+The relationships between the headers are analogous to the various data
+sections:
+
+ setup_header = .data
+ boot_params/setup_data = .bss
+
+What is missing from the above list? That's right:
+
+ kernel_info = .rodata
+
+We have been (ab)using .data for things that could go into .rodata or .bss for
+a long time, for lack of alternatives and -- especially early on -- inertia.
+Also, the BIOS stub is responsible for creating boot_params, so it isn't
+available to a BIOS-based loader (setup_data is, though).
+
+setup_header is permanently limited to 144 bytes due to the reach of the
+2-byte jump field, which doubles as a length field for the structure, combined
+with the size of the "hole" in struct boot_params that a protected-mode loader
+or the BIOS stub has to copy it into. It is currently 119 bytes long, which
+leaves us with 25 very precious bytes. This isn't something that can be fixed
+without revising the boot protocol entirely, breaking backwards compatibility.
+
+boot_params proper is limited to 4096 bytes, but can be arbitrarily extended
+by adding setup_data entries. It cannot be used to communicate properties of
+the kernel image, because it is .bss and has no image-provided content.
+
+kernel_info solves this by providing an extensible place for information about
+the kernel image. It is readonly, because the kernel cannot rely on a
+bootloader copying its contents anywhere, but that is OK; if it becomes
+necessary it can still contain data items that an enabled bootloader would be
+expected to copy into a setup_data chunk.
+
+All kernel_info data should be part of this structure. Fixed size data have to
+be put before kernel_info_var_len_data label. Variable size data have to be put
+after kernel_info_var_len_data label. Each chunk of variable size data has to
+be prefixed with header/magic and its size, e.g.::
+
+ kernel_info:
+ .ascii "LToP" /* Header, Linux top (structure). */
+ .long kernel_info_var_len_data - kernel_info
+ .long kernel_info_end - kernel_info
+ .long 0x01234567 /* Some fixed size data for the bootloaders. */
+ kernel_info_var_len_data:
+ example_struct: /* Some variable size data for the bootloaders. */
+ .ascii "0123" /* Header/Magic. */
+ .long example_struct_end - example_struct
+ .ascii "Struct"
+ .long 0x89012345
+ example_struct_end:
+ example_strings: /* Some variable size data for the bootloaders. */
+ .ascii "ABCD" /* Header/Magic. */
+ .long example_strings_end - example_strings
+ .asciz "String_0"
+ .asciz "String_1"
+ example_strings_end:
+ kernel_info_end:
+
+This way the kernel_info is self-contained blob.
+
+.. note::
+ Each variable size data header/magic can be any 4-character string,
+ without \0 at the end of the string, which does not collide with
+ existing variable length data headers/magics.
+
+
+Details of the kernel_info Fields
+=================================
+
+============ ========
+Field name: header
+Offset/size: 0x0000/4
+============ ========
+
+ Contains the magic number "LToP" (0x506f544c).
+
+============ ========
+Field name: size
+Offset/size: 0x0004/4
+============ ========
+
+ This field contains the size of the kernel_info including kernel_info.header.
+ It does not count kernel_info.kernel_info_var_len_data size. This field should be
+ used by the bootloaders to detect supported fixed size fields in the kernel_info
+ and beginning of kernel_info.kernel_info_var_len_data.
+
+============ ========
+Field name: size_total
+Offset/size: 0x0008/4
+============ ========
+
+ This field contains the size of the kernel_info including kernel_info.header
+ and kernel_info.kernel_info_var_len_data.
+
+============ ==============
+Field name: setup_type_max
+Offset/size: 0x000c/4
+============ ==============
+
+ This field contains maximal allowed type for setup_data and setup_indirect structs.
+
The Image Checksum
==================
@@ -1171,8 +1342,8 @@
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
-also fill the additional fields of the struct boot_params as that
-described in zero-page.txt.
+also fill the additional fields of the struct boot_params as
+described in chapter :doc:`zero-page`.
After setting up the struct boot_params, the boot loader can load the
32/64-bit kernel in the same way as that of 16-bit boot protocol.
@@ -1208,7 +1379,7 @@
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as described
-in zero-page.txt.
+in chapter :doc:`zero-page`.
After setting up the struct boot_params, the boot loader can load
64-bit kernel in the same way as that of 16-bit boot protocol, but
@@ -1228,8 +1399,8 @@
must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base
address of the struct boot_params.
-EFI Handover Protocol
-=====================
+EFI Handover Protocol (deprecated)
+==================================
This protocol allows boot loaders to defer initialisation to the EFI
boot stub. The boot loader is required to load the kernel/initrd(s)
@@ -1237,6 +1408,12 @@
which is hdr->handover_offset bytes from the beginning of
startup_{32,64}.
+The boot loader MUST respect the kernel's PE/COFF metadata when it comes
+to section alignment, the memory footprint of the executable image beyond
+the size of the file itself, and any other aspect of the PE/COFF header
+that may affect correct operation of the image as a PE/COFF binary in the
+execution context provided by the EFI firmware.
+
The function prototype for the handover entry point looks like this::
efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
@@ -1248,9 +1425,18 @@
The boot loader *must* fill out the following fields in bp::
- - hdr.code32_start
- hdr.cmd_line_ptr
- hdr.ramdisk_image (if applicable)
- hdr.ramdisk_size (if applicable)
All other fields should be zero.
+
+NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
+ entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
+ loading protocol (refer to [0] for an example of the bootloader side of
+ this), which removes the need for any knowledge on the part of the EFI
+ bootloader regarding the internal representation of boot_params or any
+ requirements/limitations regarding the placement of the command line
+ and ramdisk in memory, or the placement of the kernel image itself.
+
+[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0
diff --git a/Documentation/x86/booting-dt.rst b/Documentation/x86/booting-dt.rst
new file mode 100644
index 0000000..965a374
--- /dev/null
+++ b/Documentation/x86/booting-dt.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+DeviceTree Booting
+------------------
+
+ There is one single 32bit entry point to the kernel at code32_start,
+ the decompressor (the real mode entry point goes to the same 32bit
+ entry point once it switched into protected mode). That entry point
+ supports one calling convention which is documented in
+ Documentation/x86/boot.rst
+ The physical pointer to the device-tree block is passed via setup_data
+ which requires at least boot protocol 2.09.
+ The type filed is defined as
+
+ #define SETUP_DTB 2
+
+ This device-tree is used as an extension to the "boot page". As such it
+ does not parse / consider data which is already covered by the boot
+ page. This includes memory size, reserved ranges, command line arguments
+ or initrd address. It simply holds information which can not be retrieved
+ otherwise like interrupt routing or a list of devices behind an I2C bus.
diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/x86/cpuinfo.rst
new file mode 100644
index 0000000..5d54c39
--- /dev/null
+++ b/Documentation/x86/cpuinfo.rst
@@ -0,0 +1,155 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+x86 Feature Flags
+=================
+
+Introduction
+============
+
+On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition
+in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature
+or KVM want to expose the feature to a KVM guest, it can and should have
+an X86_FEATURE_* defined. These flags represent hardware features as
+well as software features.
+
+If users want to know if a feature is available on a given system, they
+try to find the flag in /proc/cpuinfo. If a given flag is present, it
+means that the kernel supports it and is currently making it available.
+If such flag represents a hardware feature, it also means that the
+hardware supports it.
+
+If the expected flag does not appear in /proc/cpuinfo, things are murkier.
+Users need to find out the reason why the flag is missing and find the way
+how to enable it, which is not always easy. There are several factors that
+can explain missing flags: the expected feature failed to enable, the feature
+is missing in hardware, platform firmware did not enable it, the feature is
+disabled at build or run time, an old kernel is in use, or the kernel does
+not support the feature and thus has not enabled it. In general, /proc/cpuinfo
+shows features which the kernel supports. For a full list of CPUID flags
+which the CPU supports, use tools/arch/x86/kcpuid.
+
+How are feature flags created?
+==============================
+
+a: Feature flags can be derived from the contents of CPUID leaves.
+------------------------------------------------------------------
+These feature definitions are organized mirroring the layout of CPUID
+leaves and grouped in words with offsets as mapped in enum cpuid_leafs
+in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details).
+If a feature is defined with a X86_FEATURE_<name> definition in
+cpufeatures.h, and if it is detected at run time, the flags will be
+displayed accordingly in /proc/cpuinfo. For example, the flag "avx2"
+comes from X86_FEATURE_AVX2 in cpufeatures.h.
+
+b: Flags can be from scattered CPUID-based features.
+----------------------------------------------------
+Hardware features enumerated in sparsely populated CPUID leaves get
+software-defined values. Still, CPUID needs to be queried to determine
+if a given feature is present. This is done in init_scattered_cpuid_features().
+For instance, X86_FEATURE_CQM_LLC is defined as 11*32 + 0 and its presence is
+checked at runtime in the respective CPUID leaf [EAX=f, ECX=0] bit EDX[1].
+
+The intent of scattering CPUID leaves is to not bloat struct
+cpuinfo_x86.x86_capability[] unnecessarily. For instance, the CPUID leaf
+[EAX=7, ECX=0] has 30 features and is dense, but the CPUID leaf [EAX=7, EAX=1]
+has only one feature and would waste 31 bits of space in the x86_capability[]
+array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted
+memory is not trivial.
+
+c: Flags can be created synthetically under certain conditions for hardware features.
+-------------------------------------------------------------------------------------
+Examples of conditions include whether certain features are present in
+MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed
+conditions are met, the features are enabled by the set_cpu_cap or
+setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS,
+the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and
+"split_lock_detect" will be displayed. The flag "ring3mwait" will be
+displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors.
+
+d: Flags can represent purely software features.
+------------------------------------------------
+These flags do not represent hardware features. Instead, they represent a
+software feature implemented in the kernel. For example, Kernel Page Table
+Isolation is purely software feature and its feature flag X86_FEATURE_PTI is
+also defined in cpufeatures.h.
+
+Naming of Flags
+===============
+
+The script arch/x86/kernel/cpu/mkcapflags.sh processes the
+#define X86_FEATURE_<name> from cpufeatures.h and generates the
+x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the
+resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming
+of flags in the x86_cap/bug_flags[] are as follows:
+
+a: The name of the flag is from the string in X86_FEATURE_<name> by default.
+----------------------------------------------------------------------------
+By default, the flag <name> in /proc/cpuinfo is extracted from the respective
+X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from
+X86_FEATURE_AVX2.
+
+b: The naming can be overridden.
+--------------------------------
+If the comment on the line for the #define X86_FEATURE_* starts with a
+double-quote character (""), the string inside the double-quote characters
+will be the name of the flags. For example, the flag "sse4_1" comes from
+the comment "sse4_1" following the X86_FEATURE_XMM4_1 definition.
+
+There are situations in which overriding the displayed name of the flag is
+needed. For instance, /proc/cpuinfo is a userspace interface and must remain
+constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one
+shall override the new naming with the name already used in /proc/cpuinfo.
+
+c: The naming override can be "", which means it will not appear in /proc/cpuinfo.
+----------------------------------------------------------------------------------
+The feature shall be omitted from /proc/cpuinfo if it does not make sense for
+the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is
+defined in cpufeatures.h but that flag is an internal kernel feature used
+in the alternative runtime patching functionality. So, its name is overridden
+with "". Its flag will not appear in /proc/cpuinfo.
+
+Flags are missing when one or more of these happen
+==================================================
+
+a: The hardware does not enumerate support for it.
+--------------------------------------------------
+For example, when a new kernel is running on old hardware or the feature is
+not enabled by boot firmware. Even if the hardware is new, there might be a
+problem enabling the feature at run time, the flag will not be displayed.
+
+b: The kernel does not know about the flag.
+-------------------------------------------
+For example, when an old kernel is running on new hardware.
+
+c: The kernel disabled support for it at compile-time.
+------------------------------------------------------
+For example, if 5-level-paging is not enabled when building (i.e.,
+CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_.
+Even though the feature will still be detected via CPUID, the kernel disables
+it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57).
+
+d: The feature is disabled at boot-time.
+----------------------------------------
+A feature can be disabled either using a command-line parameter or because
+it failed to be enabled. The command-line parameter clearcpuid= can be used
+to disable features using the feature number as defined in
+/arch/x86/include/asm/cpufeatures.h. For instance, User Mode Instruction
+Protection can be disabled using clearcpuid=514. The number 514 is calculated
+from #define X86_FEATURE_UMIP (16*32 + 2).
+
+In addition, there exists a variety of custom command-line parameters that
+disable specific features. The list of parameters includes, but is not limited
+to, nofsgsbase, nosmap, and nosmep. 5-level paging can also be disabled using
+"no5lvl". SMAP and SMEP are disabled with the aforementioned parameters,
+respectively.
+
+e: The feature was known to be non-functional.
+----------------------------------------------
+The feature was known to be non-functional because a dependency was
+missing at runtime. For example, AVX flags will not show up if XSAVE feature
+is disabled since they depend on XSAVE feature. Another example would be broken
+CPUs and them missing microcode patches. Due to that, the kernel decides not to
+enable a feature.
+
+.. [#f1] 5-level paging uses linear address of 57 bits.
diff --git a/Documentation/x86/earlyprintk.rst b/Documentation/x86/earlyprintk.rst
index 1130737..51ef11e 100644
--- a/Documentation/x86/earlyprintk.rst
+++ b/Documentation/x86/earlyprintk.rst
@@ -8,7 +8,7 @@
USB2 Debug port key and a debug cable, on x86 systems.
You need two computers, the 'USB debug key' special gadget and
-and two USB cables, connected like this::
+two USB cables, connected like this::
[host/target] <-------> [USB debug key] <-------> [client/console]
diff --git a/Documentation/x86/exception-tables.rst b/Documentation/x86/exception-tables.rst
index ed6d4b0..de58110 100644
--- a/Documentation/x86/exception-tables.rst
+++ b/Documentation/x86/exception-tables.rst
@@ -257,6 +257,9 @@
the original assembly code: > 3: movl $-14,%eax
and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
+If the fixup was able to handle the exception, control flow may be returned
+to the instruction after the one that triggered the fault, ie. local label 2b.
+
The assembly code::
> .section __ex_table,"a"
@@ -337,10 +340,15 @@
entry->insn. It is used to distinguish page faults from machine
check.
-3) ``int ex_handler_ext(const struct exception_table_entry *fixup)``
- This case is used for uaccess_err ... we need to set a flag
- in the task structure. Before the handler functions existed this
- case was handled by adding a large offset to the fixup to tag
- it as special.
-
More functions can easily be added.
+
+CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post
+link of the kernel image, via a host utility scripts/sorttable. It will set the
+symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section
+at boot time. With the exception table sorted, at runtime when an exception
+occurs we can quickly lookup the __ex_table entry via binary search.
+
+This is not just a boot time optimization, some architectures require this
+table to be sorted in order to handle exceptions relatively early in the boot
+process. For example, i386 makes use of this form of exception handling before
+paging support is even enabled!
diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index a8de2fb..b224d12 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -9,6 +9,8 @@
:numbered:
boot
+ booting-dt
+ cpuinfo
topology
exception-tables
kernel-stacks
@@ -19,7 +21,6 @@
tlb
mtrr
pat
- intel_mpx
intel-iommu
intel_txt
amd-memory-encryption
@@ -31,3 +32,4 @@
usb-legacy-support
i386/index
x86_64/index
+ sva
diff --git a/Documentation/x86/intel-iommu.rst b/Documentation/x86/intel-iommu.rst
index 9dae6b4..099f13d 100644
--- a/Documentation/x86/intel-iommu.rst
+++ b/Documentation/x86/intel-iommu.rst
@@ -95,9 +95,10 @@
When DMAR is enabled for use, you will notice..
PCI-DMA: Using DMAR IOMMU
+-------------------------
Fault reporting
----------------
+^^^^^^^^^^^^^^^
::
diff --git a/Documentation/x86/intel_mpx.rst b/Documentation/x86/intel_mpx.rst
deleted file mode 100644
index 387a640..0000000
--- a/Documentation/x86/intel_mpx.rst
+++ /dev/null
@@ -1,252 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================================
-Intel(R) Memory Protection Extensions (MPX)
-===========================================
-
-Intel(R) MPX Overview
-=====================
-
-Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
-introduced into Intel Architecture. Intel MPX provides hardware features
-that can be used in conjunction with compiler changes to check memory
-references, for those references whose compile-time normal intentions are
-usurped at runtime due to buffer overflow or underflow.
-
-You can tell if your CPU supports MPX by looking in /proc/cpuinfo::
-
- cat /proc/cpuinfo | grep ' mpx '
-
-For more information, please refer to Intel(R) Architecture Instruction
-Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
-Extensions.
-
-Note: As of December 2014, no hardware with MPX is available but it is
-possible to use SDE (Intel(R) Software Development Emulator) instead, which
-can be downloaded from
-http://software.intel.com/en-us/articles/intel-software-development-emulator
-
-
-How to get the advantage of MPX
-===============================
-
-For MPX to work, changes are required in the kernel, binutils and compiler.
-No source changes are required for applications, just a recompile.
-
-There are a lot of moving parts of this to all work right. The following
-is how we expect the compiler, application and kernel to work together.
-
-1) Application developer compiles with -fmpx. The compiler will add the
- instrumentation as well as some setup code called early after the app
- starts. New instruction prefixes are noops for old CPUs.
-2) That setup code allocates (virtual) space for the "bounds directory",
- points the "bndcfgu" register to the directory (must also set the valid
- bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
- that the app will be using MPX. The app must be careful not to access
- the bounds tables between the time when it populates "bndcfgu" and
- when it calls the prctl(). This might be hard to guarantee if the app
- is compiled with MPX. You can add "__attribute__((bnd_legacy))" to
- the function to disable MPX instrumentation to help guarantee this.
- Also be careful not to call out to any other code which might be
- MPX-instrumented.
-3) The kernel detects that the CPU has MPX, allows the new prctl() to
- succeed, and notes the location of the bounds directory. Userspace is
- expected to keep the bounds directory at that location. We note it
- instead of reading it each time because the 'xsave' operation needed
- to access the bounds directory register is an expensive operation.
-4) If the application needs to spill bounds out of the 4 registers, it
- issues a bndstx instruction. Since the bounds directory is empty at
- this point, a bounds fault (#BR) is raised, the kernel allocates a
- bounds table (in the user address space) and makes the relevant entry
- in the bounds directory point to the new table.
-5) If the application violates the bounds specified in the bounds registers,
- a separate kind of #BR is raised which will deliver a signal with
- information about the violation in the 'struct siginfo'.
-6) Whenever memory is freed, we know that it can no longer contain valid
- pointers, and we attempt to free the associated space in the bounds
- tables. If an entire table becomes unused, we will attempt to free
- the table and remove the entry in the directory.
-
-To summarize, there are essentially three things interacting here:
-
-GCC with -fmpx:
- * enables annotation of code with MPX instructions and prefixes
- * inserts code early in the application to call in to the "gcc runtime"
-GCC MPX Runtime:
- * Checks for hardware MPX support in cpuid leaf
- * allocates virtual space for the bounds directory (malloc() essentially)
- * points the hardware BNDCFGU register at the directory
- * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
- start managing the bounds directories
-Kernel MPX Code:
- * Checks for hardware MPX support in cpuid leaf
- * Handles #BR exceptions and sends SIGSEGV to the app when it violates
- bounds, like during a buffer overflow.
- * When bounds are spilled in to an unallocated bounds table, the kernel
- notices in the #BR exception, allocates the virtual space, then
- updates the bounds directory to point to the new table. It keeps
- special track of the memory with a VM_MPX flag.
- * Frees unused bounds tables at the time that the memory they described
- is unmapped.
-
-
-How does MPX kernel code work
-=============================
-
-Handling #BR faults caused by MPX
----------------------------------
-
-When MPX is enabled, there are 2 new situations that can generate
-#BR faults.
-
- * new bounds tables (BT) need to be allocated to save bounds.
- * bounds violation caused by MPX instructions.
-
-We hook #BR handler to handle these two new situations.
-
-On-demand kernel allocation of bounds tables
---------------------------------------------
-
-MPX only has 4 hardware registers for storing bounds information. If
-MPX-enabled code needs more than these 4 registers, it needs to spill
-them somewhere. It has two special instructions for this which allow
-the bounds to be moved between the bounds registers and some new "bounds
-tables".
-
-#BR exceptions are a new class of exceptions just for MPX. They are
-similar conceptually to a page fault and will be raised by the MPX
-hardware during both bounds violations or when the tables are not
-present. The kernel handles those #BR exceptions for not-present tables
-by carving the space out of the normal processes address space and then
-pointing the bounds-directory over to it.
-
-The tables need to be accessed and controlled by userspace because
-the instructions for moving bounds in and out of them are extremely
-frequent. They potentially happen every time a register points to
-memory. Any direct kernel involvement (like a syscall) to access the
-tables would obviously destroy performance.
-
-Why not do this in userspace? MPX does not strictly require anything in
-the kernel. It can theoretically be done completely from userspace. Here
-are a few ways this could be done. We don't think any of them are practical
-in the real-world, but here they are.
-
-:Q: Can virtual space simply be reserved for the bounds tables so that we
- never have to allocate them?
-:A: MPX-enabled application will possibly create a lot of bounds tables in
- process address space to save bounds information. These tables can take
- up huge swaths of memory (as much as 80% of the memory on the system)
- even if we clean them up aggressively. In the worst-case scenario, the
- tables can be 4x the size of the data structure being tracked. IOW, a
- 1-page structure can require 4 bounds-table pages. An X-GB virtual
- area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
- If we were to preallocate them for the 128TB of user virtual address
- space, we would need to reserve 512TB+2GB, which is larger than the
- entire virtual address space today. This means they can not be reserved
- ahead of time. Also, a single process's pre-populated bounds directory
- consumes 2GB of virtual *AND* physical memory. IOW, it's completely
- infeasible to prepopulate bounds directories.
-
-:Q: Can we preallocate bounds table space at the same time memory is
- allocated which might contain pointers that might eventually need
- bounds tables?
-:A: This would work if we could hook the site of each and every memory
- allocation syscall. This can be done for small, constrained applications.
- But, it isn't practical at a larger scale since a given app has no
- way of controlling how all the parts of the app might allocate memory
- (think libraries). The kernel is really the only place to intercept
- these calls.
-
-:Q: Could a bounds fault be handed to userspace and the tables allocated
- there in a signal handler instead of in the kernel?
-:A: mmap() is not on the list of safe async handler functions and even
- if mmap() would work it still requires locking or nasty tricks to
- keep track of the allocation state there.
-
-Having ruled out all of the userspace-only approaches for managing
-bounds tables that we could think of, we create them on demand in
-the kernel.
-
-Decoding MPX instructions
--------------------------
-
-If a #BR is generated due to a bounds violation caused by MPX.
-We need to decode MPX instructions to get violation address and
-set this address into extended struct siginfo.
-
-The _sigfault field of struct siginfo is extended as follow::
-
- 87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
- 88 struct {
- 89 void __user *_addr; /* faulting insn/memory ref. */
- 90 #ifdef __ARCH_SI_TRAPNO
- 91 int _trapno; /* TRAP # which caused the signal */
- 92 #endif
- 93 short _addr_lsb; /* LSB of the reported address */
- 94 struct {
- 95 void __user *_lower;
- 96 void __user *_upper;
- 97 } _addr_bnd;
- 98 } _sigfault;
-
-The '_addr' field refers to violation address, and new '_addr_and'
-field refers to the upper/lower bounds when a #BR is caused.
-
-Glibc will be also updated to support this new siginfo. So user
-can get violation address and bounds when bounds violations occur.
-
-Cleanup unused bounds tables
-----------------------------
-
-When a BNDSTX instruction attempts to save bounds to a bounds directory
-entry marked as invalid, a #BR is generated. This is an indication that
-no bounds table exists for this entry. In this case the fault handler
-will allocate a new bounds table on demand.
-
-Since the kernel allocated those tables on-demand without userspace
-knowledge, it is also responsible for freeing them when the associated
-mappings go away.
-
-Here, the solution for this issue is to hook do_munmap() to check
-whether one process is MPX enabled. If yes, those bounds tables covered
-in the virtual address region which is being unmapped will be freed also.
-
-Adding new prctl commands
--------------------------
-
-Two new prctl commands are added to enable and disable MPX bounds tables
-management in kernel.
-::
-
- 155 #define PR_MPX_ENABLE_MANAGEMENT 43
- 156 #define PR_MPX_DISABLE_MANAGEMENT 44
-
-Runtime library in userspace is responsible for allocation of bounds
-directory. So kernel have to use XSAVE instruction to get the base
-of bounds directory from BNDCFG register.
-
-But XSAVE is expected to be very expensive. In order to do performance
-optimization, we have to get the base of bounds directory and save it
-into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT
-command execution.
-
-
-Special rules
-=============
-
-1) If userspace is requesting help from the kernel to do the management
-of bounds tables, it may not create or modify entries in the bounds directory.
-
-Certainly users can allocate bounds tables and forcibly point the bounds
-directory at them through XSAVE instruction, and then set valid bit
-of bounds entry to have this entry valid. But, the kernel will decline
-to assist in managing these tables.
-
-2) Userspace may not take multiple bounds directory entries and point
-them at the same bounds table.
-
-This is allowed architecturally. See more information "Intel(R) Architecture
-Instruction Set Extensions Programming Reference" (9.3.4).
-
-However, if users did this, the kernel might be fooled in to unmapping an
-in-use bounds table since it does not recognize sharing.
diff --git a/Documentation/x86/pat.rst b/Documentation/x86/pat.rst
index 9a298fd..5d90177 100644
--- a/Documentation/x86/pat.rst
+++ b/Documentation/x86/pat.rst
@@ -44,8 +44,6 @@
+------------------------+----------+--------------+------------------+
| ioremap_uc | -- | UC | UC |
+------------------------+----------+--------------+------------------+
-| ioremap_nocache | -- | UC- | UC- |
-+------------------------+----------+--------------+------------------+
| ioremap_wc | -- | -- | WC |
+------------------------+----------+--------------+------------------+
| ioremap_wt | -- | -- | WT |
diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst
index 5368ced..e59b7b9 100644
--- a/Documentation/x86/resctrl_ui.rst
+++ b/Documentation/x86/resctrl_ui.rst
@@ -138,6 +138,18 @@
non-linear. This field is purely informational
only.
+"thread_throttle_mode":
+ Indicator on Intel systems of how tasks running on threads
+ of a physical core are throttled in cases where they
+ request different memory bandwidth percentages:
+
+ "max":
+ the smallest percentage is applied
+ to all threads
+ "per-thread":
+ bandwidth percentages are directly applied to
+ the threads running on the core
+
If RDT monitoring is available there will be an "L3_MON" directory
with the following files:
@@ -364,8 +376,10 @@
The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
-sharing a core will result in both threads being throttled to use the
-low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory bandwidth allocation(MBA) may be a core
specific mechanism where as memory bandwidth monitoring(MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
diff --git a/Documentation/x86/sva.rst b/Documentation/x86/sva.rst
new file mode 100644
index 0000000..076efd5
--- /dev/null
+++ b/Documentation/x86/sva.rst
@@ -0,0 +1,257 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Shared Virtual Addressing (SVA) with ENQCMD
+===========================================
+
+Background
+==========
+
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses avoiding the need for software to translate virtual
+addresses to physical addresses. SVA is what PCIe calls Shared Virtual
+Memory (SVM).
+
+In addition to the convenience of using application virtual addresses
+by the device, it also doesn't require pinning pages for DMA.
+PCIe Address Translation Services (ATS) along with Page Request Interface
+(PRI) allow devices to function much the same way as the CPU handling
+application page-faults. For more information please refer to the PCIe
+specification Chapter 10: ATS Specification.
+
+Use of SVA requires IOMMU support in the platform. IOMMU is also
+required to support the PCIe features ATS and PRI. ATS allows devices
+to cache translations for virtual addresses. The IOMMU driver uses the
+mmu_notifier() support to keep the device TLB cache and the CPU cache in
+sync. When an ATS lookup fails for a virtual address, the device should
+use the PRI in order to request the virtual address to be paged into the
+CPU page tables. The device must use ATS again in order the fetch the
+translation before use.
+
+Shared Hardware Workqueues
+==========================
+
+Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
+the use of Shared Work Queues (SWQ) by both applications and Virtual
+Machines (VM's). This allows better hardware utilization vs. hard
+partitioning resources that could result in under utilization. In order to
+allow the hardware to distinguish the context for which work is being
+executed in the hardware by SWQ interface, SIOV uses Process Address Space
+ID (PASID), which is a 20-bit number defined by the PCIe SIG.
+
+PASID value is encoded in all transactions from the device. This allows the
+IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
+Resource Identifier (RID) which is the Bus/Device/Function.
+
+
+ENQCMD
+======
+
+ENQCMD is a new instruction on Intel platforms that atomically submits a
+work descriptor to a device. The descriptor includes the operation to be
+performed, virtual addresses of all parameters, virtual address of a completion
+record, and the PASID (process address space ID) of the current process.
+
+ENQCMD works with non-posted semantics and carries a status back if the
+command was accepted by hardware. This allows the submitter to know if the
+submission needs to be retried or other device specific mechanisms to
+implement fairness or ensure forward progress should be provided.
+
+ENQCMD is the glue that ensures applications can directly submit commands
+to the hardware and also permits hardware to be aware of application context
+to perform I/O operations via use of PASID.
+
+Process Address Space Tagging
+=============================
+
+A new thread-scoped MSR (IA32_PASID) provides the connection between
+user processes and the rest of the hardware. When an application first
+accesses an SVA-capable device, this MSR is initialized with a newly
+allocated PASID. The driver for the device calls an IOMMU-specific API
+that sets up the routing for DMA and page-requests.
+
+For example, the Intel Data Streaming Accelerator (DSA) uses
+iommu_sva_bind_device(), which will do the following:
+
+- Allocate the PASID, and program the process page-table (%cr3 register) in the
+ PASID context entries.
+- Register for mmu_notifier() to track any page-table invalidations to keep
+ the device TLB in sync. For example, when a page-table entry is invalidated,
+ the IOMMU propagates the invalidation to the device TLB. This will force any
+ future access by the device to this virtual address to participate in
+ ATS. If the IOMMU responds with proper response that a page is not
+ present, the device would request the page to be paged in via the PCIe PRI
+ protocol before performing I/O.
+
+This MSR is managed with the XSAVE feature set as "supervisor state" to
+ensure the MSR is updated during context switch.
+
+PASID Management
+================
+
+The kernel must allocate a PASID on behalf of each process which will use
+ENQCMD and program it into the new MSR to communicate the process identity to
+platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests
+from this process. When a user submits a work descriptor to a device using the
+ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
+value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
+with the same PASID. The platform IOMMU uses the PASID in the transaction to
+perform address translation. The IOMMU APIs setup the corresponding PASID
+entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
+x86).
+
+The MSR must be configured on each logical CPU before any application
+thread can interact with a device. Threads that belong to the same
+process share the same page tables, thus the same MSR value.
+
+PASID is cleared when a process is created. The PASID allocation and MSR
+programming may occur long after a process and its threads have been created.
+One thread must call iommu_sva_bind_device() to allocate the PASID for the
+process. If a thread uses ENQCMD without the MSR first being populated, a #GP
+will be raised. The kernel will update the PASID MSR with the PASID for all
+threads in the process. A single process PASID can be used simultaneously
+with multiple devices since they all share the same address space.
+
+One thread can call iommu_sva_unbind_device() to free the allocated PASID.
+The kernel will clear the PASID MSR for all threads belonging to the process.
+
+New threads inherit the MSR value from the parent.
+
+Relationships
+=============
+
+ * Each process has many threads, but only one PASID.
+ * Devices have a limited number (~10's to 1000's) of hardware workqueues.
+ The device driver manages allocating hardware workqueues.
+ * A single mmap() maps a single hardware workqueue as a "portal" and
+ each portal maps down to a single workqueue.
+ * For each device with which a process interacts, there must be
+ one or more mmap()'d portals.
+ * Many threads within a process can share a single portal to access
+ a single device.
+ * Multiple processes can separately mmap() the same portal, in
+ which case they still share one device hardware workqueue.
+ * The single process-wide PASID is used by all threads to interact
+ with all devices. There is not, for instance, a PASID for each
+ thread or each thread<->device pair.
+
+FAQ
+===
+
+* What is SVA/SVM?
+
+Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
+work in the same address space, i.e., to share it. Some call it Shared
+Virtual Memory (SVM), but Linux community wanted to avoid confusing it with
+POSIX Shared Memory and Secure Virtual Machines which were terms already in
+circulation.
+
+* What is a PASID?
+
+A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
+(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
+PASID is included in all transactions between the platform and the device.
+
+* How are shared workqueues different?
+
+Traditionally, in order for userspace applications to interact with hardware,
+there is a separate hardware instance required per process. For example,
+consider doorbells as a mechanism of informing hardware about work to process.
+Each doorbell is required to be spaced 4k (or page-size) apart for process
+isolation. This requires hardware to provision that space and reserve it in
+MMIO. This doesn't scale as the number of threads becomes quite large. The
+hardware also manages the queue depth for Shared Work Queues (SWQ), and
+consumers don't need to track queue depth. If there is no space to accept
+a command, the device will return an error indicating retry.
+
+A user should check Deferrable Memory Write (DMWr) capability on the device
+and only submits ENQCMD when the device supports it. In the new DMWr PCIe
+terminology, devices need to support DMWr completer capability. In addition,
+it requires all switch ports to support DMWr routing and must be enabled by
+the PCIe subsystem, much like how PCIe atomic operations are managed for
+instance.
+
+SWQ allows hardware to provision just a single address in the device. When
+used with ENQCMD to submit work, the device can distinguish the process
+submitting the work since it will include the PASID assigned to that
+process. This helps the device scale to a large number of processes.
+
+* Is this the same as a user space device driver?
+
+Communicating with the device via the shared workqueue is much simpler
+than a full blown user space driver. The kernel driver does all the
+initialization of the hardware. User space only needs to worry about
+submitting work and processing completions.
+
+* Is this the same as SR-IOV?
+
+Single Root I/O Virtualization (SR-IOV) focuses on providing independent
+hardware interfaces for virtualizing hardware. Hence, it's required to be
+almost fully functional interface to software supporting the traditional
+BARs, space for interrupts via MSI-X, its own register layout.
+Virtual Functions (VFs) are assisted by the Physical Function (PF)
+driver.
+
+Scalable I/O Virtualization builds on the PASID concept to create device
+instances for virtualization. SIOV requires host software to assist in
+creating virtual devices; each virtual device is represented by a PASID
+along with the bus/device/function of the device. This allows device
+hardware to optimize device resource creation and can grow dynamically on
+demand. SR-IOV creation and management is very static in nature. Consult
+references below for more details.
+
+* Why not just create a virtual function for each app?
+
+Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
+duplicated hardware for PCI config space and interrupts such as MSI-X.
+Resources such as interrupts have to be hard partitioned between VFs at
+creation time, and cannot scale dynamically on demand. The VFs are not
+completely independent from the Physical Function (PF). Most VFs require
+some communication and assistance from the PF driver. SIOV, in contrast,
+creates a software-defined device where all the configuration and control
+aspects are mediated via the slow path. The work submission and completion
+happen without any mediation.
+
+* Does this support virtualization?
+
+ENQCMD can be used from within a guest VM. In these cases, the VMM helps
+with setting up a translation table to translate from Guest PASID to Host
+PASID. Please consult the ENQCMD instruction set reference for more
+details.
+
+* Does memory need to be pinned?
+
+When devices support SVA along with platform hardware such as IOMMU
+supporting such devices, there is no need to pin memory for DMA purposes.
+Devices that support SVA also support other PCIe features that remove the
+pinning requirement for memory.
+
+Device TLB support - Device requests the IOMMU to lookup an address before
+use via Address Translation Service (ATS) requests. If the mapping exists
+but there is no page allocated by the OS, IOMMU hardware returns that no
+mapping exists.
+
+Device requests the virtual address to be mapped via Page Request
+Interface (PRI). Once the OS has successfully completed the mapping, it
+returns the response back to the device. The device requests again for
+a translation and continues.
+
+IOMMU works with the OS in managing consistency of page-tables with the
+device. When removing pages, it interacts with the device to remove any
+device TLB entry that might have been cached before removing the mappings from
+the OS.
+
+References
+==========
+
+VT-D:
+https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d
+
+SIOV:
+https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
+
+ENQCMD in ISE:
+https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+DSA spec:
+https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
index 2b98efb..324ceff 100644
--- a/Documentation/x86/x86_64/boot-options.rst
+++ b/Documentation/x86/x86_64/boot-options.rst
@@ -173,6 +173,10 @@
numa=noacpi
Don't parse the SRAT table for NUMA setup
+ numa=nohmat
+ Don't parse the HMAT table for NUMA setup, or soft-reserved memory
+ partitioning.
+
numa=fake=<size>[MG]
If given as a memory unit, fills all system RAM with nodes of
size interleaved over physical nodes.
diff --git a/Documentation/x86/x86_64/fsgs.rst b/Documentation/x86/x86_64/fsgs.rst
new file mode 100644
index 0000000..50960e0
--- /dev/null
+++ b/Documentation/x86/x86_64/fsgs.rst
@@ -0,0 +1,199 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Using FS and GS segments in user space applications
+===================================================
+
+The x86 architecture supports segmentation. Instructions which access
+memory can use segment register based addressing mode. The following
+notation is used to address a byte within a segment:
+
+ Segment-register:Byte-address
+
+The segment base address is added to the Byte-address to compute the
+resulting virtual address which is accessed. This allows to access multiple
+instances of data with the identical Byte-address, i.e. the same code. The
+selection of a particular instance is purely based on the base-address in
+the segment register.
+
+In 32-bit mode the CPU provides 6 segments, which also support segment
+limits. The limits can be used to enforce address space protections.
+
+In 64-bit mode the CS/SS/DS/ES segments are ignored and the base address is
+always 0 to provide a full 64bit address space. The FS and GS segments are
+still functional in 64-bit mode.
+
+Common FS and GS usage
+------------------------------
+
+The FS segment is commonly used to address Thread Local Storage (TLS). FS
+is usually managed by runtime code or a threading library. Variables
+declared with the '__thread' storage class specifier are instantiated per
+thread and the compiler emits the FS: address prefix for accesses to these
+variables. Each thread has its own FS base address so common code can be
+used without complex address offset calculations to access the per thread
+instances. Applications should not use FS for other purposes when they use
+runtimes or threading libraries which manage the per thread FS.
+
+The GS segment has no common use and can be used freely by
+applications. GCC and Clang support GS based addressing via address space
+identifiers.
+
+Reading and writing the FS/GS base address
+------------------------------------------
+
+There exist two mechanisms to read and write the FS/GS base address:
+
+ - the arch_prctl() system call
+
+ - the FSGSBASE instruction family
+
+Accessing FS/GS base with arch_prctl()
+--------------------------------------
+
+ The arch_prctl(2) based mechanism is available on all 64-bit CPUs and all
+ kernel versions.
+
+ Reading the base:
+
+ arch_prctl(ARCH_GET_FS, &fsbase);
+ arch_prctl(ARCH_GET_GS, &gsbase);
+
+ Writing the base:
+
+ arch_prctl(ARCH_SET_FS, fsbase);
+ arch_prctl(ARCH_SET_GS, gsbase);
+
+ The ARCH_SET_GS prctl may be disabled depending on kernel configuration
+ and security settings.
+
+Accessing FS/GS base with the FSGSBASE instructions
+---------------------------------------------------
+
+ With the Ivy Bridge CPU generation Intel introduced a new set of
+ instructions to access the FS and GS base registers directly from user
+ space. These instructions are also supported on AMD Family 17H CPUs. The
+ following instructions are available:
+
+ =============== ===========================
+ RDFSBASE %reg Read the FS base register
+ RDGSBASE %reg Read the GS base register
+ WRFSBASE %reg Write the FS base register
+ WRGSBASE %reg Write the GS base register
+ =============== ===========================
+
+ The instructions avoid the overhead of the arch_prctl() syscall and allow
+ more flexible usage of the FS/GS addressing modes in user space
+ applications. This does not prevent conflicts between threading libraries
+ and runtimes which utilize FS and applications which want to use it for
+ their own purpose.
+
+FSGSBASE instructions enablement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ The instructions are enumerated in CPUID leaf 7, bit 0 of EBX. If
+ available /proc/cpuinfo shows 'fsgsbase' in the flag entry of the CPUs.
+
+ The availability of the instructions does not enable them
+ automatically. The kernel has to enable them explicitly in CR4. The
+ reason for this is that older kernels make assumptions about the values in
+ the GS register and enforce them when GS base is set via
+ arch_prctl(). Allowing user space to write arbitrary values to GS base
+ would violate these assumptions and cause malfunction.
+
+ On kernels which do not enable FSGSBASE the execution of the FSGSBASE
+ instructions will fault with a #UD exception.
+
+ The kernel provides reliable information about the enabled state in the
+ ELF AUX vector. If the HWCAP2_FSGSBASE bit is set in the AUX vector, the
+ kernel has FSGSBASE instructions enabled and applications can use them.
+ The following code example shows how this detection works::
+
+ #include <sys/auxv.h>
+ #include <elf.h>
+
+ /* Will be eventually in asm/hwcap.h */
+ #ifndef HWCAP2_FSGSBASE
+ #define HWCAP2_FSGSBASE (1 << 1)
+ #endif
+
+ ....
+
+ unsigned val = getauxval(AT_HWCAP2);
+
+ if (val & HWCAP2_FSGSBASE)
+ printf("FSGSBASE enabled\n");
+
+FSGSBASE instructions compiler support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
+instructions. Clang 5 supports them as well.
+
+ =================== ===========================
+ _readfsbase_u64() Read the FS base register
+ _readfsbase_u64() Read the GS base register
+ _writefsbase_u64() Write the FS base register
+ _writegsbase_u64() Write the GS base register
+ =================== ===========================
+
+To utilize these instrinsics <immintrin.h> must be included in the source
+code and the compiler option -mfsgsbase has to be added.
+
+Compiler support for FS/GS based addressing
+-------------------------------------------
+
+GCC version 6 and newer provide support for FS/GS based addressing via
+Named Address Spaces. GCC implements the following address space
+identifiers for x86:
+
+ ========= ====================================
+ __seg_fs Variable is addressed relative to FS
+ __seg_gs Variable is addressed relative to GS
+ ========= ====================================
+
+The preprocessor symbols __SEG_FS and __SEG_GS are defined when these
+address spaces are supported. Code which implements fallback modes should
+check whether these symbols are defined. Usage example::
+
+ #ifdef __SEG_GS
+
+ long data0 = 0;
+ long data1 = 1;
+
+ long __seg_gs *ptr;
+
+ /* Check whether FSGSBASE is enabled by the kernel (HWCAP2_FSGSBASE) */
+ ....
+
+ /* Set GS base to point to data0 */
+ _writegsbase_u64(&data0);
+
+ /* Access offset 0 of GS */
+ ptr = 0;
+ printf("data0 = %ld\n", *ptr);
+
+ /* Set GS base to point to data1 */
+ _writegsbase_u64(&data1);
+ /* ptr still addresses offset 0! */
+ printf("data1 = %ld\n", *ptr);
+
+
+Clang does not provide the GCC address space identifiers, but it provides
+address spaces via an attribute based mechanism in Clang 2.6 and newer
+versions:
+
+ ==================================== =====================================
+ __attribute__((address_space(256)) Variable is addressed relative to GS
+ __attribute__((address_space(257)) Variable is addressed relative to FS
+ ==================================== =====================================
+
+FS/GS based addressing with inline assembly
+-------------------------------------------
+
+In case the compiler does not support address spaces, inline assembly can
+be used for FS/GS based addressing mode::
+
+ mov %fs:offset, %reg
+ mov %gs:offset, %reg
+
+ mov %reg, %fs:offset
+ mov %reg, %gs:offset
diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst
index d6eaaa5..a56070f 100644
--- a/Documentation/x86/x86_64/index.rst
+++ b/Documentation/x86/x86_64/index.rst
@@ -14,3 +14,4 @@
fake-numa-for-cpusets
cpu-hotplug-spec
machinecheck
+ fsgs
diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
index e189168..b402e04 100644
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ b/Documentation/x86/x86_64/machinecheck.rst
@@ -81,5 +81,5 @@
For more details about the x86 machine check architecture
see the Intel and AMD architecture manuals from their developer websites.
-For more details about the architecture see
+For more details about the architecture
see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst
index 267fc48..ede1875 100644
--- a/Documentation/x86/x86_64/mm.rst
+++ b/Documentation/x86/x86_64/mm.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-================
-Memory Managment
-================
+=================
+Memory Management
+=================
Complete virtual memory map with 4-level page tables
====================================================
@@ -19,7 +19,7 @@
Note that as we get closer to the top of the address space, the notation changes
from TB to GB and then MB/KB.
- - "16M TB" might look weird at first sight, but it's an easier to visualize size
+ - "16M TB" might look weird at first sight, but it's an easier way to visualize size
notation than "16 EB", which few will recognize at first sight as 16 exabytes.
It also shows it nicely how incredibly large 64-bit address space is.
diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst
index 88c3ba3..3b89410 100644
--- a/Documentation/x86/x86_64/uefi.rst
+++ b/Documentation/x86/x86_64/uefi.rst
@@ -36,7 +36,7 @@
elilo bootloader with x86_64 support, elilo configuration file,
kernel image built in first step and corresponding
- initrd. Instructions on building elilo and its dependencies
+ initrd. Instructions on building elilo and its dependencies
can be found in the elilo sourceforge project.
- Boot to EFI shell and invoke elilo choosing the kernel image built