.. SPDX-License-Identifier: BSD-3-Clause
.. SPDX-FileCopyrightText: Copyright TF-RMM Contributors.

MMU setup and memory management design in RMM
=============================================

This document describes how the MMU is set up and how memory is managed
by the |RMM| implementation.

Physical Address Space
----------------------

The Realm Management Extension (``FEAT_RME``) defines four Physical Address
Spaces (PAS):

- Non-secure
- Secure
- Realm
- Root

|RMM| code and |RMM| data are in Realm PAS memory, loaded and allocated to
Realm PAS at boot time by the EL3 Firmware. This is a static carveout and it
is never changed during the lifetime of the system.

The size of the |RMM| data is fixed at build time. The majority of this is the
granules array (see `Granule state tracking`_ below), whose size is
configurable and proportional to the maximum amount of delegable DRAM supported
by the system.

Realm data and metadata are in Realm PAS memory, which is delegated to the
Realm PAS by the Host at runtime. The |RMM| ABI ensures that this memory cannot
be returned to Non-secure PAS ("undelegated") while it is in use by the
|RMM| or by a Realm.

NS data is in Non-secure PAS memory. The Host is able to change the PAS
of this memory while it is being accessed by the |RMM|. Consequently, the
|RMM| must be able to handle a Granule Protection Fault (GPF) while accessing
NS data as part of RMI handling.
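
One common way to structure such accesses is to perform all NS reads and
writes through a helper that reports failure instead of faulting fatally. The
sketch below illustrates the idea only; the helper name, signature and error
codes are assumptions, not the actual |RMM| interfaces.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical status codes and NS-copy helper, for illustration only. */
    #define RMI_SUCCESS     0UL
    #define RMI_ERROR_INPUT 1UL

    /* Copies from NS memory; returns false if a GPF occurred mid-access. */
    bool ns_read(void *dest, const void *ns_src, size_t size);

    unsigned long rmi_handler_example(const void *ns_args_va)
    {
        unsigned long args[4];

        if (!ns_read(args, ns_args_va, sizeof(args))) {
            /* The Host changed the PAS of the buffer under our feet. */
            return RMI_ERROR_INPUT;
        }

        /* ... continue handling the command using the local copy ... */
        return RMI_SUCCESS;
    }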

Granule state tracking
----------------------

The |RMM| manages a data structure called the `granules` array, which is
stored in |RMM| data memory.

The `granules` array contains one entry for every Granule of physical
memory which was in Non-secure PAS at |RMM| boot and can be delegated.

Each entry in the `granules` array contains a field `granule_state`, which
records the *state* of the Granule. The state can be one of the following:

- NS: Not Realm PAS (i.e. Non-secure PAS, Root PAS or Secure PAS)
- Delegated: Realm PAS, but not yet assigned a purpose as either Realm
  data or Realm metadata
- RD: Realm Descriptor
- REC: Realm Execution Context
- REC aux: Auxiliary storage for REC
- Data: Realm data
- RTT: Realm Stage 2 translation tables

As part of RMI SMC handling, the state of a granule may be checked as a
pre-condition for the command and may transition to a new state. For more
details on the various granule states and their transitions, please refer to
the `Realm Management Monitor (RMM) Specification`_.

For further details, see:

- ``enum granule_state``
- ``struct granule``
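
A simplified, illustrative sketch of these structures is shown below; the
enumerator names, field layout and array sizing are assumptions for the
example rather than the actual definitions.

.. code-block:: c

    /* Illustrative sketch only; see the RMM sources for the real definitions. */
    enum granule_state {
        GRANULE_STATE_NS,        /* Not in Realm PAS */
        GRANULE_STATE_DELEGATED, /* Realm PAS, no purpose assigned yet */
        GRANULE_STATE_RD,        /* Realm Descriptor */
        GRANULE_STATE_REC,       /* Realm Execution Context */
        GRANULE_STATE_REC_AUX,   /* Auxiliary storage for a REC */
        GRANULE_STATE_DATA,      /* Realm data */
        GRANULE_STATE_RTT        /* Realm stage 2 translation tables */
    };

    struct granule {
        /* Lock and reference count omitted for brevity. */
        unsigned char state;     /* One of enum granule_state */
    };

    /* One entry per granule of delegable physical memory (sizing macro is
     * illustrative). */
    extern struct granule granules[RMM_MAX_GRANULES];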

RMM stage 1 translation regime
------------------------------

|RMM| uses the ``FEAT_VHE`` extension to split the 64-bit VA space into two
address spaces as shown in the figure below:

|full va space|

- The Low VA range: it extends from VA 0x0 up to the maximum VA size
  configured for the region (up to 48 bits, or 52 bits if ``FEAT_LPA2`` is
  supported). This range is used to map the |RMM| Runtime (code, data,
  shared memory with EL3-FW and any other platform mappings).
- The High VA range: it extends from VA 0xFFFF_FFFF_FFFF_FFFF all the way
  down to the address corresponding to the maximum VA size configured for the
  region. This region is used by the `Stage 1 High VA - Slot Buffer mechanism`_
  as well as the `Per-CPU stack mapping`_.

There is a range of invalid addresses between the two ranges that is not
mapped into either of them, as shown in the figure above. The TCR_EL2.TxSZ
fields control the maximum VA size of each region, and |RMM| configures these
fields to fit the mappings used by each region.
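
To illustrate the relationship between the configured VA size and the TxSZ
fields, the sketch below computes the field values from the number of VA bits.
Only the ``TxSZ = 64 - VA_bits`` relationship and the field positions (T0SZ at
bits [5:0], T1SZ at bits [21:16]) are architectural; the macro names and the
example sizes are assumptions.

.. code-block:: c

    #include <stdint.h>

    /*
     * Illustrative macros (not the RMM definitions): TxSZ encodes the region
     * size as 64 minus the number of VA bits, so a 48-bit range uses TxSZ = 16.
     * T0SZ (low range) lives at TCR_EL2 bits [5:0], T1SZ (high range) at [21:16].
     */
    #define TCR_T0SZ(va_bits)   (((64ULL - (va_bits)) & 0x3FULL) << 0)
    #define TCR_T1SZ(va_bits)   (((64ULL - (va_bits)) & 0x3FULL) << 16)

    /* Example only: a 48-bit low VA range and a smaller high VA range. */
    static const uint64_t tcr_txsz_example = TCR_T0SZ(48ULL) | TCR_T1SZ(39ULL);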

The two VA ranges are used for two different purposes in RMM, as described
below.

Stage 1 Low VA range
^^^^^^^^^^^^^^^^^^^^

The Low VA range is used to create static mappings which are shared across all
the CPUs. It encompasses the RMM executable binary memory and the EL3 Shared
memory region.

The RMM Executable binary memory consists of code, RO data and RW data. Note
that the stage 1 translation tables for the Low Region are kept in RO data, so
that once the MMU is enabled, the table mappings are protected from further
modification.

The EL3 shared memory, which is allocated by the EL3 Firmware, is used by the
`RMM-EL3 communications interface`_. A pointer to the beginning of this area
is received by |RMM| during initialization. |RMM| then maps the region in
the RW area.

The Low VA range is set up by the platform layer as part of platform
initialization.

The following mappings belong to the Low VA Range:

- RMM_CODE
- RMM_RO
- RMM_RW
- RMM_SHARED

Per-platform mappings can also be added if needed, such as the UART for the
FVP platform.
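
The sketch below illustrates how such a set of flat-mapped regions could be
described, in the style of the xlat library; the structure, macro and
attribute names as well as the addresses are assumptions for the example, not
the actual platform tables.

.. code-block:: c

    /*
     * Illustrative only: structure, macro, attribute names and addresses are
     * assumptions modelled on the xlat library style, not its actual API.
     */
    struct mmap_region_sketch {
        unsigned long base_pa;  /* Physical base of the region */
        unsigned long base_va;  /* Virtual base (== base_pa when flat-mapped) */
        unsigned long size;     /* Region size in bytes */
        unsigned long attr;     /* Memory type / access permissions */
    };

    #define MT_CODE     0UL     /* Executable, read-only */
    #define MT_RO_DATA  1UL     /* Non-executable, read-only */
    #define MT_RW_DATA  2UL     /* Non-executable, read-write */

    /* Flat mapping: the VA is identical to the PA. */
    #define MAP_REGION_FLAT(adr, sz, at) \
        { .base_pa = (adr), .base_va = (adr), .size = (sz), .attr = (at) }

    /* Example layout for the regions listed above, with made-up addresses. */
    static const struct mmap_region_sketch low_va_regions[] = {
        MAP_REGION_FLAT(0x06000000UL, 0x00040000UL, MT_CODE),    /* RMM_CODE   */
        MAP_REGION_FLAT(0x06040000UL, 0x00010000UL, MT_RO_DATA), /* RMM_RO     */
        MAP_REGION_FLAT(0x06050000UL, 0x00060000UL, MT_RW_DATA), /* RMM_RW     */
        MAP_REGION_FLAT(0x060B0000UL, 0x00001000UL, MT_RW_DATA), /* RMM_SHARED */
    };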

Stage 1 High VA range
^^^^^^^^^^^^^^^^^^^^^

The High VA range is used to create dynamic per-CPU mappings. The tables used
for this are private to each CPU and hence it is possible for every CPU to map
a different PA at a specific VA. This property is used by the `slot-buffer`
mechanism as described later.

In order to allow the mappings for this region to be dynamic, its translation
tables are stored in the RW section of |RMM|, allowing them to be modified as
needed.

For more details, see the ``xlat_high_va.c`` file in the xlat library.

The diagram below shows the memory layout for the High VA region.

|high va region|

Stage 1 High VA - Slot Buffer mechanism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |RMM| provides a dynamic mapping mechanism called `slot-buffer` in the
high VA region. The assigned VA space for `slot-buffer` is divided into `slots`
of GRANULE_SIZE each.

The |RMM| has a fixed number of `slots` per CPU. Each `slot` is used to map
memory of a particular category. The |RMM| validates that the target physical
granule to be mapped is of the expected `granule_state` by looking up the
corresponding entry in the `granules` array.

The `slot-buffer` mechanism has `slots` for mapping memory of the following
types:

- Realm metadata: These correspond to the specific Realm and Realm
  Execution context scheduled on the PE. These mappings are usually only
  valid during the execution of an RMI or RSI handler and are removed
  afterwards. These include Realm Descriptors (RDs), Realm Execution
  Contexts (RECs) and Realm Translation Tables (RTTs).

- NS data: RMM needs to map NS memory as part of RMIs to access parameters
  passed by the Host or to return arguments to the Host. RMM also needs
  to copy Data provided by the Host as part of populating the Realm
  data memory.

- Realm data: RMM sometimes needs to temporarily map Realm data memory
  during Realm creation in order to load the Realm image or access buffers
  specified by the Realm as part of RSI commands.

The `slot-buffer` design avoids the need for generic allocation of VA space.
Rationalizing all the mappings ever needed for managing a Realm into a fixed
set of `slots` is only possible due to the simple nature of the |RMM| design -
in particular, the fact that it is possible to statically determine the types
of objects which need to be mapped into the |RMM|'s address space, and the
maximum number of objects of a given type which need to be mapped at any point
in time.

During Realm entry and Realm exit, the RD is mapped in the "RD" buffer
slot. Once Realm entry or Realm exit is complete, this mapping is
removed. The RD is not mapped during Realm execution.

The REC and the `rmi_rec_run` data structures are both mapped during Realm
execution.

As the `slots` are mapped in the High VA region, each CPU has its own private
translation tables for these mappings, which means that a particular slot has
a fixed VA on every CPU. Since the translation tables are private to a CPU,
the mapping of a slot is also private to that CPU. This allows a REC (vCPU) to
be interrupted and migrated to another CPU while there are live memory
allocations in RMM. An example of this scenario is Realm attestation token
creation: while the token is being created, a pending IRQ can cause RMM to
yield to the NS Host with live memory allocations in the MbedTLS heap. The NS
Host can schedule the REC on another CPU and, since the mappings for those
memory allocations remain at the same VA, the interrupted token creation can
continue.

The `slot-buffer` implementation in RMM also has some performance optimizations
such as caching of TTEs to avoid walking the Stage 1 translation tables for
every map and unmap operation.

As an alternative to creating dynamic mappings as required by each RMI command,
the approach of maintaining static mappings for all of physical memory was
considered, but it was rejected on the grounds that this could permit arbitrary
memory access for an attacker who is able to subvert |RMM| execution.

The `slot-buffer` mechanism uses the xlat library APIs to create dynamic
mappings. These dynamic mappings are stored in the high VA region's
``xlat_ctx`` structure and marked by the xlat library as *TRANSIENT*. This
allows the xlat library to distinguish unmapped dynamic Translation Table
Entries (TTEs) from genuinely invalid ones, which would otherwise be identical.

For further details, see:

- ``enum buffer_slot``
- ``lib/realm/src/buffer.c``
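
The sketch below shows the general shape of how an RMI handler might use the
slot buffers: look up the granule, check that it is in the expected state, map
it into the per-CPU slot, operate on it through the slot VA and then unmap it.
The type, function and slot names are assumptions for illustration; the actual
interfaces live in ``lib/realm/src/buffer.c``.

.. code-block:: c

    /*
     * Illustrative sketch only: the types, function names and slot identifier
     * below are assumptions, not the actual interfaces in lib/realm.
     */
    #include <stddef.h>

    struct granule;                                 /* Granule tracking entry */
    struct rd;                                      /* Realm Descriptor layout */
    enum granule_state { GRANULE_STATE_RD /* other states elided */ };
    enum buffer_slot { SLOT_RD /* other slots elided */ };

    #define RMI_SUCCESS     0UL
    #define RMI_ERROR_INPUT 1UL

    struct granule *find_lock_granule(unsigned long addr, enum granule_state st);
    void *granule_map(struct granule *g, enum buffer_slot slot);
    void buffer_unmap(void *va);
    void granule_unlock(struct granule *g);

    unsigned long rmi_example(unsigned long rd_addr)
    {
        struct granule *g_rd;
        struct rd *rd;

        /* Look up the granule and check it is in the expected state (RD). */
        g_rd = find_lock_granule(rd_addr, GRANULE_STATE_RD);
        if (g_rd == NULL) {
            return RMI_ERROR_INPUT;
        }

        /* Map the physical granule into this CPU's private "RD" slot. */
        rd = granule_map(g_rd, SLOT_RD);

        /* ... operate on the Realm Descriptor through the slot VA ... */

        /* Remove the per-CPU mapping and release the granule. */
        buffer_unmap(rd);
        granule_unlock(g_rd);

        return RMI_SUCCESS;
    }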

Per-CPU stack mapping
~~~~~~~~~~~~~~~~~~~~~

Each CPU maps its stack in the High VA region, which means that the stack has
the same VA on every CPU and is private to that CPU. At boot time, each CPU
calculates the PA of the start of its stack and maps it to the designated
High VA address.

The per-CPU VA mapping also includes a gap at the end of the stack VA to detect
any stack underflows. The gap is one page in size.

|RMM| also uses a separate per-CPU stack to handle exceptions and faults.
This stack is allocated below the general one, and it allows |RMM| to
handle a stack overflow fault. There is another one-page gap of unmapped
memory between the two stacks to harden security.

The rest of the VA space available below the exception stack is unused and
therefore left unmapped. The stage 1 translation library will not allow
anything to be mapped there.
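
A hypothetical sketch of the boot-time stack mapping is shown below: every CPU
maps its own physical stack at the same fixed High VA, and the guard gaps are
simply VA ranges that are never mapped. All names, sizes and helpers in the
sketch are assumptions for illustration.

.. code-block:: c

    /*
     * Hypothetical sketch: all names, sizes and helpers are assumptions. The
     * guard gaps need no explicit action; those VAs are simply never mapped,
     * so any access to them faults.
     */
    #define GRANULE_SIZE        0x1000UL
    #define STACK_SIZE          (4UL * GRANULE_SIZE)
    #define EXC_STACK_SIZE      GRANULE_SIZE

    extern unsigned long stack_pool_pa;      /* PA of the per-CPU stack pool */
    extern unsigned long exc_stack_pool_pa;  /* PA of the exception stack pool */
    extern unsigned long cpu_stack_va;       /* Fixed High VA of the stack */
    extern unsigned long cpu_exc_stack_va;   /* Fixed High VA of the exception stack */

    void map_pages(unsigned long va, unsigned long pa, unsigned long size);

    /* Map this CPU's stacks at the fixed High VAs shared by all CPUs. */
    void map_cpu_stacks(unsigned int cpuid)
    {
        map_pages(cpu_stack_va, stack_pool_pa + (cpuid * STACK_SIZE), STACK_SIZE);
        map_pages(cpu_exc_stack_va, exc_stack_pool_pa + (cpuid * EXC_STACK_SIZE),
                  EXC_STACK_SIZE);
    }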

Stage 1 translation library (xlat library)
------------------------------------------

The |RMM| stage 1 translation management is handled by the xlat library.
This library is able to support up to 52-bit addresses and 5 levels of
translation (when ``FEAT_LPA2`` is enabled).

The xlat library is designed to be stateless and uses the abstraction of a
`translation context`, modelled by ``struct xlat_ctx``. A translation context
stores all the information related to a given VA space, such as the
translation tables, the VA configuration used to initialize the context and
any internal state related to that VA space. Once a context has been
initialized, its VA configuration cannot be modified.

At the moment, although the xlat library supports creation of multiple
contexts, it assumes that the caller will only use a single context per
CPU for a given VA region. The library does not support switching contexts
on a CPU at run time. A context can be shared by several CPUs if they
share the same VA configuration and mappings, as is the case for the low VA
region.

Dynamic mappings can be created by specifying the ``TRANSIENT`` flag. The
high VA region creates its dynamic mappings using this flag.
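
One way to picture a *TRANSIENT* entry is as a TTE that is architecturally
invalid but carries a software-defined marker bit, so an unmapped dynamic
entry can be told apart from a plain invalid one. The bit position and helper
below are assumptions used for illustration, not the library's actual
encoding.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustrative encoding only: the actual bit used by the xlat library may
     * differ. Bit 0 is the architectural valid bit; bits [58:55] are ignored
     * by the MMU and available for software use.
     */
    #define TTE_VALID           (1ULL << 0)
    #define TTE_SW_TRANSIENT    (1ULL << 55)

    /* A TRANSIENT entry that is currently unmapped: marked for software, but
     * invalid as far as the MMU is concerned. */
    static inline bool tte_is_unmapped_transient(uint64_t tte)
    {
        return ((tte & TTE_SW_TRANSIENT) != 0ULL) && ((tte & TTE_VALID) == 0ULL);
    }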

For further details, see ``lib/xlat``.

RMM executable bootstrap
------------------------

The |RMM| is loaded as a .bin file by the EL3 loader. The size of the sections
in the |RMM| binary, as well as the placement of |RMM| code and data into
appropriate sections, is controlled by the linker script in the source tree.

Platform initialization code takes care of importing the linker symbols
that define the boundaries of the different sections and creates static
memory mappings that are then used to initialize an ``xlat_ctx`` structure
for the low VA region. The RMM binary sections are flat-mapped and are shared
across all the CPUs on the system. In addition, as |RMM| is compiled as a
Position Independent Executable (PIE) at address 0x0, the Global Offset
Table (GOT) and other relocations in the binary are fixed up with the correct
offsets during boot. This allows RMM to run as a PIE at any physical address,
regardless of the compile-time address.
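
A simplified sketch of this kind of fixup is shown below, assuming standard
``R_AARCH64_RELATIVE`` dynamic relocation entries; the structure and function
names are illustrative and do not reflect the actual boot code.

.. code-block:: c

    #include <stdint.h>

    /* ELF64 RELA entry layout (illustrative definitions). */
    struct elf64_rela {
        uint64_t r_offset;  /* Link-time address of the location to patch */
        uint64_t r_info;    /* Relocation type and symbol index */
        int64_t  r_addend;  /* Constant addend */
    };

    #define R_AARCH64_RELATIVE  1027U

    /*
     * Patch every RELATIVE relocation with the runtime load offset, i.e. the
     * difference between the address RMM runs at and the link address (0x0).
     */
    static void fixup_relocations(struct elf64_rela *rela,
                                  struct elf64_rela *rela_end,
                                  uint64_t load_offset)
    {
        for (; rela < rela_end; rela++) {
            if ((uint32_t)rela->r_info != R_AARCH64_RELATIVE) {
                continue;
            }
            *(uint64_t *)(rela->r_offset + load_offset) =
                            (uint64_t)rela->r_addend + load_offset;
        }
    }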

For further details, see:

- ``runtime/linker.lds``
- ``plat/common/src/plat_common_init.c``
- ``plat/fvp/src/fvp_setup.c``

_______________________________________________________________________________

.. |full va space| image:: ./diagrams/full_va_space_diagram.drawio.png
   :height: 500
.. |high va region| image:: ./diagrams/high_va_memory_map.drawio.png
   :height: 600
.. _Realm Management Monitor (RMM) Specification: https://developer.arm.com/documentation/den0137/1-0eac5/?lang=en
.. _`RMM-EL3 communications interface`: https://trustedfirmware-a.readthedocs.io/en/latest/components/rmm-el3-comms-spec.html