
 .. SPDX-License-Identifier: BSD-3-Clause
 .. SPDX-FileCopyrightText: Copyright TF-RMM Contributors.

 .. _locking_rmm:

 RMM Locking Guidelines
 =========================

 This document outlines the locking requirements, discusses the implementation
 and provides guidelines for a deadlock-free |RMM| implementation. The document
 is currently based on the |RMM| Alpha-05 specification and is expected to
 change as the implementation proceeds.

 .. _locking_intro:

 Introduction
 -------------
 In order to meet the requirement for the |RMM| to be small, simple to reason
 about, and to co-exist with contemporary hypervisors which are already
 designed to manage system memory, the |RMM| does not include a memory allocator.
 It instead relies on an untrusted caller providing granules of memory used
 to hold both meta data to manage realms as well as code and data for realms.

 To maintain confidentiality and integrity of these granules, the |RMM|
 implements memory access controls by maintaining awareness of the state of each
 granule (aka Granule State, ref :ref:`locking_impl`) and enforcing rules on how
 memory granules can transition from one state to another and how a granule can
 be used depending on its state. For example, all granules that can be accessed
 by software outside the |PAR| of a realm are in a specific state, and a granule
 that holds meta data for a realm is in another specific state that prevents it
 from being used as data in a realm and accidentally corrupted by a realm, which
 could lead to internal failure in the |RMM|.

 Due to the complex nature of the operations supported by the |RMM|, for example
 when managing page tables for realms, the |RMM| must be able to hold locks on
 multiple objects at the same time. Holding multiple locks at the same time can
 easily deadlock the system, as illustrated by the dining philosophers problem
 [EWD310]_. Traditional operating system software avoids such issues by defining
 a partial order on all system objects and always acquiring a lower-ordered
 object before a higher-ordered object. This solution was shown to be correct by
 Dijkstra [EWD625]_. The order is typically assigned arbitrarily, based on some
 attribute of the objects, for example the memory address of the object.
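
 The address-based ordering described above can be sketched as follows. This is
 a generic illustration and not code from the |RMM|: `struct obj`, `obj_lock()`,
 `lock_pair_by_address()` and `transfer()` are hypothetical names, and the lock
 is a plain C11 test-and-set flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object with an embedded test-and-set lock. */
struct obj {
	atomic_flag lock;
	int value;
};

static inline void obj_lock(struct obj *o)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&o->lock,
						 memory_order_acquire)) {
	}
}

static inline void obj_unlock(struct obj *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

/* Lock two distinct objects, always taking the lower address first. */
static inline void lock_pair_by_address(struct obj *x, struct obj *y)
{
	assert(x != y);
	if ((uintptr_t)x > (uintptr_t)y) {
		struct obj *tmp = x;

		x = y;
		y = tmp;
	}
	obj_lock(x);
	obj_lock(y);
}

/* Moves `amount` between objects; safe for any argument order. */
static inline void transfer(struct obj *from, struct obj *to, int amount)
{
	lock_pair_by_address(from, to);
	from->value -= amount;
	to->value += amount;
	obj_unlock(from);
	obj_unlock(to);
}
```

 Because every caller sorts the pair before locking, `transfer(&a, &b, ...)` and
 `transfer(&b, &a, ...)` running concurrently acquire the locks in the same
 order and cannot wait on each other in a cycle.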

 Unfortunately, software such as the |RMM| cannot use these methods directly
 because the |RMM| receives an opaque pointer from the untrusted caller and it
 cannot know before locking the object if it is indeed of the expected state.
 Furthermore, MMU page tables are hierarchical data structures and operations on
 the page tables typically must be able to locate a leaf node in the hierarchy
 based on a single value (a virtual address) and therefore must walk the page
 tables in their hierarchical order. This implies an order of objects in the same
 Granule State which is not known by a process executing in the |RMM| before
 holding at least one lock on an object in the page table hierarchy. An obvious
 solution to these problems would be to use a single global lock for the |RMM|,
 but that would serialize all operations across all shared data structures in the
 system and severely impact performance.


 .. _locking_reqs:

 Requirements
 -------------

 To address the synchronization needs of the |RMM| described above, we must
 employ locking and lock-free mechanisms which satisfy a number of properties.
 These are discussed below:

 Critical Section
 *****************

 A critical section can be defined as a section of code within a process that
 requires access to shared resources and that must not be executed while
 another process is in a corresponding section of code [WS2001]_.

 Further, access to shared resources without appropriate synchronization can lead
 to **race conditions**, which can be defined as a situation in which multiple
 threads or processes read and write a shared item and the final result depends
 on the relative timing of their execution [WS2001]_.

 In terms of |RMM|, an access to a shared resource can be considered as a list
 of operations/instructions in program order that either reads from or writes to
 a shared memory location (e.g. the granule data structure or the memory granule
 described by the granule data structure, ref :ref:`locking_impl`). It is also
 understood that this list of operations does not execute indefinitely, but
 eventually terminates.

 We can now define our desired properties as follows:

 Mutual Exclusion
 *****************

 Mutual exclusion can be defined as the requirement that when one process is in a
 critical section that accesses shared resources, no other process may be in a
 critical section that accesses any of those shared resources [WS2001]_.

 The following example illustrates how an implementation might enforce mutual
 exclusion of critical sections using a lock on a valid granule data structure
 `struct granule *a`:

 .. code-block:: C

 	struct granule *a;
 	bool r;

 	r = try_lock(a);
 	if (!r) {
 		return -ERROR;
 	}
 	critical_section(a);
 	unlock(a);
 	other_work();

 Note that a process either fails to perform the `lock` operation on object `a`
 and returns an error, or successfully acquires the lock, executes
 `critical_section()`, calls `unlock()` and then continues to make forward
 progress in `other_work()`.
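
 The example above leaves `try_lock()` and `unlock()` undefined. One possible
 shape for them, assuming a lock word embedded in the granule structure and C11
 atomics, is sketched below. The field layout and the 0/1 encoding are
 assumptions made for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical granule layout: only the lock word matters here. */
struct granule {
	atomic_uint lock;	/* 0 = free, 1 = held (assumed encoding) */
	unsigned int foo;
};

/* Returns true if the lock was free and is now held by the caller. */
static inline bool try_lock(struct granule *g)
{
	unsigned int expected = 0U;

	return atomic_compare_exchange_strong_explicit(&g->lock, &expected,
						       1U,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Must only be called by the holder of the lock. */
static inline void unlock(struct granule *g)
{
	assert(atomic_load_explicit(&g->lock, memory_order_relaxed) == 1U);
	atomic_store_explicit(&g->lock, 0U, memory_order_release);
}
```

 A failed `try_lock()` never blocks, so the caller can return an error instead
 of waiting, which is the behaviour the example relies on.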

 Deadlock Avoidance
 *******************

 A deadlock can be defined as a situation in which two or more processes are
 unable to proceed because each is waiting for one of the others to do something
 [WS2001]_.

 In other words, one or more processes are trying to enter their critical
 sections but none of them make forward progress.

 We can then define the deadlock avoidance property as the inverse scenario:

 When one or more processes are trying to enter their critical sections, at least
 one of them makes forward progress.

 A deadlock is a fatal event if it occurs in supervisory software such as the
 |RMM|. This must be avoided as it can render the system vulnerable to exploits
 and/or unresponsive which may lead to data loss, interrupted service and
 eventually economic loss.

 Starvation Avoidance
 *********************

 Starvation can be defined as a situation in which a runnable process is
 overlooked indefinitely by the scheduler; although it is able to proceed, it is
 never chosen [WS2001]_.

 Starvation avoidance can then be defined as the property that all processes
 that are trying to enter their critical sections eventually make forward
 progress.

 Starvation must be avoided, because if one or more processes do not make forward
 progress, the PE on which the process runs will not perform useful work and
 will be lost to the user, resulting in issues similar to those of a deadlocked
 system.

 Nested Critical Sections
 *************************

 A critical section for an object may be nested within the critical section for
 another object for the same process.  In other words, a process may enter more
 than one critical section at the same time.

 For example, if the |RMM| needs to copy data from one granule to another
 granule, and must be sure that both granules can only be modified by the process
 itself, it may be implemented in the following way:

 .. code-block:: C

 	struct granule *a;
 	struct granule *b;
 	bool r;

 	r = try_lock(a);
 	if (!r) {
 		return -ERROR;
 	}

 	/* critical section for granule a -- ENTER */

 	r = try_lock(b);
 	if (r) {
 		/* critical section for granule b -- ENTER */
 		b->foo = a->foo;
 		/* critical section for granule b -- EXIT */
 		unlock(b);
 	}

 	/* critical section for granule a -- EXIT */
 	unlock(a);

 .. _locking_impl:

 Implementation
 ---------------

 The |RMM| maintains granule states by defining a data structure for each
 memory granule in the system. Conceptually, the data structure contains the
 following fields:

 * Granule State
 * Lock
 * Reference Count

 The Lock field provides mutual exclusion of processes executing in their
 critical sections which may access the shared granule data structure and the
 shared meta data which may be stored in the memory granule which is in one of
 the |RD|, |REC|, and Table states. Both the data structure describing
 the memory granule and the contents of the memory granule itself can be accessed
 by multiple PEs concurrently and we therefore require some concurrency protocol
 to avoid corruption of shared data structures. An alternative to using a lock
 providing mutual exclusion would be to design all operations that access shared
 data structures as lock-free algorithms, but due to the complexity of the data
 structures and the operation of the |RMM| we consider this too difficult to
 accomplish in practice.

 The Reference Count field is used to keep track of references between granules.
 For example, an |RD| describes a realm, and a |REC| describes an execution
 context within that realm, and therefore an |RD| must always exist when a |REC|
 exists. To prevent the |RMM| from destroying an |RD| while a |REC| still exists,
 the |RMM| holds a reference count on the |RD| for each |REC| associated with the
 same realm. Only when all the RECs in a realm have been destroyed and the
 reference count on the |RD| drops to zero can the |RD| be destroyed and the
 granule be repurposed for other use.

 Based on the above, we now describe the Granule State field and the current
 locking/refcount implementation:

 * **UnDelegated:** These are granules for which |RMM| does not prevent the |PAS|
   of the granule from being changed by another agent to any value.
   In this state, the granule content access is not protected by granule::lock,
   as it is always subject to reads and writes from Non-Realm worlds.

 * **Delegated:** These are granules with memory only accessible by the |RMM|.
   The granule content is protected by granule::lock. No reference counts are
   held on this granule state.

 * **Realm Descriptor (RD):** These are granules containing meta data describing
   a realm, and only accessible by the |RMM|. Granule content access is protected
   by granule::lock. A reference count is also held on this granule for each
   associated |REC| granule.

 * **Realm Execution Context (REC):** These are granules containing meta data
   describing a virtual PE running in a realm, and are only accessible by the
   |RMM|. The execution content access is not protected by granule::lock, because
   we cannot enter a realm while holding the lock. Further, the following rules
   apply with respect to the granule's reference counts:

 	- A reference count is held on this granule when a |REC| is running.

 	- As a |REC| cannot be run on two PEs at the same time, the maximum value
 	  of the reference count is one.

 	- When the |REC| is entered, the reference count is incremented
 	  (set to 1) atomically while granule::lock is held.

 	- When the |REC| exits, the reference counter is released (set to 0)
 	  atomically with store-release semantics without granule::lock being
 	  held.

 	- The |RMM| can access the granule's content on the entry and exit path
 	  from the |REC| while the reference is held.

 * **Translation Table:** These are granules containing meta data describing
   virtual to physical address translation for the realm, accessible by the |RMM|
   and the hardware Memory Management Unit (MMU). Granule content access is
   protected by granule::lock, but hardware translation table walks may read the
   RTT at any point in time. Multiple granules in this state can only be locked
   at the same time if they are part of the same tree, and only in topological
   order from root to leaf. The topological order of concatenated root level RTTs
   is from the lowest address to the highest address. The complete internal
   locking order for RTT granules is: RD -> [RTT] -> ... -> RTT. A reference
   count is held on this granule for each entry in the RTT that refers to a
   granule:

 	- Table s2tte.

 	- Valid s2tte.

 	- Valid_NS s2tte.

 	- Assigned s2tte.

 * **Data:** These are granules containing realm data, accessible by the |RMM|
   and by the realm to which it belongs. Granule content access is not protected
   by granule::lock, as it is always subject to reads and writes from within a
   realm. A granule in this state is always referenced from exactly one entry in
   an RTT granule which must be locked before locking this granule. Only a single
   DATA granule can be locked at a time on a given PE. The complete internal
   locking order for DATA granules is: RD -> RTT -> RTT -> ... -> DATA.
   No reference counts are held on this granule type.
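
 The reference count rules listed above for the |REC| state can be sketched as
 follows. This is an illustration and not the |RMM| implementation:
 `rec_try_enter()` and `rec_exit()` are hypothetical helpers, the structure
 layout is assumed, and the atomics are GCC/Clang `__atomic` builtins.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
	unsigned int val;	/* 0 = free, 1 = held (assumed encoding) */
} spinlock_t;

enum granule_state { GRANULE_STATE_DELEGATED, GRANULE_STATE_REC };

struct granule {
	spinlock_t lock;
	enum granule_state state;
	unsigned long refcount;
};

static inline void spinlock_acquire(spinlock_t *l)
{
	while (__atomic_exchange_n(&l->val, 1U, __ATOMIC_ACQUIRE) != 0U) {
	}
}

static inline void spinlock_release(spinlock_t *l)
{
	__atomic_store_n(&l->val, 0U, __ATOMIC_RELEASE);
}

/* Claim the REC for this PE: refcount goes 0 -> 1 under the lock. */
static inline bool rec_try_enter(struct granule *g_rec)
{
	bool ok = false;

	spinlock_acquire(&g_rec->lock);
	/* The acquire load pairs with the release in rec_exit(). */
	if ((g_rec->state == GRANULE_STATE_REC) &&
	    (__atomic_load_n(&g_rec->refcount, __ATOMIC_ACQUIRE) == 0UL)) {
		__atomic_store_n(&g_rec->refcount, 1UL, __ATOMIC_RELAXED);
		ok = true;
	}
	spinlock_release(&g_rec->lock);
	return ok;
}

/* Drop the reference without the lock, using release semantics. */
static inline void rec_exit(struct granule *g_rec)
{
	assert(__atomic_load_n(&g_rec->refcount, __ATOMIC_RELAXED) == 1UL);
	(void)__atomic_fetch_sub(&g_rec->refcount, 1UL, __ATOMIC_RELEASE);
}
```

 Between `rec_try_enter()` and `rec_exit()` the lock is not held, so the held
 reference is what keeps a second PE from entering the same |REC|.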


 Locking
 ********

 The |RMM| uses spinlocks along with the object state for its locking
 implementation. The lock provides exclusive acquire semantics similar to those
 of trivial spinlock implementations, but also allows verifying whether the
 locked object is in the expected state.

 The data structure for the spinlock can be described in C as follows:

 .. code-block:: C

 	typedef struct {
 		unsigned int val;
 	} spinlock_t;

 This data structure can be embedded in any object that requires synchronization
 of access, such as the `struct granule` described above.

 The following operations are defined on spinlocks:

 .. code-block:: C
 	:caption: **Typical spinlock operations**

 	/*
 	 * Locks a spinlock with acquire memory ordering semantics or goes into
 	 * a tight loop (spins) and repeatedly checks the lock variable
 	 * atomically until it becomes available.
 	 */
 	void spinlock_acquire(spinlock_t *l);

 	/*
 	 * Unlocks a spinlock with release memory ordering semantics. Must only
 	 * be called if the calling PE already holds the lock.
 	 */
 	void spinlock_release(spinlock_t *l);
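
 A minimal sketch of how these two operations might be implemented is shown
 below. It assumes that `val` is 0 when the lock is free and 1 when it is held,
 and it uses the GCC/Clang `__atomic` builtins rather than the architecture
 specific code a real implementation would likely use.

```c
#include <assert.h>

typedef struct {
	unsigned int val;	/* 0 = free, 1 = held (assumed encoding) */
} spinlock_t;

static inline void spinlock_acquire(spinlock_t *l)
{
	/* Swap in 1; a previous value of 1 means another PE holds it. */
	while (__atomic_exchange_n(&l->val, 1U, __ATOMIC_ACQUIRE) != 0U) {
		/* Wait with plain loads until the lock looks free again. */
		while (__atomic_load_n(&l->val, __ATOMIC_RELAXED) != 0U) {
		}
	}
}

static inline void spinlock_release(spinlock_t *l)
{
	/* Only the holder may release the lock. */
	assert(__atomic_load_n(&l->val, __ATOMIC_RELAXED) == 1U);
	__atomic_store_n(&l->val, 0U, __ATOMIC_RELEASE);
}
```

 The acquire ordering on the exchange and the release ordering on the store
 ensure that accesses made inside the critical section are not observed outside
 of it.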


 The above functions should not be used directly for locking/unlocking granules;
 the following should be used instead:

 .. code-block:: C
 	:caption: **Granule locking operations**

 	/*
 	 * Acquires a lock (or spins until the lock is available), then checks
 	 * if the granule is in the `expected_state`. If the `expected_state`
 	 * is matched, then returns `true`. Otherwise, releases the lock and
 	 * returns `false`.
 	 */
 	bool granule_lock_on_state_match(struct granule *g,
 					 enum granule_state expected_state);

 	/*
 	 * Used when we're certain of the state of an object (e.g. because we
 	 * hold a reference to it) or when locking objects whose reference is
 	 * obtained from another object, after that object is locked.
 	 */
 	void granule_lock(struct granule *g,
 			  enum granule_state expected_state);

 	/*
 	 * Obtains a pointer to a locked granule at `addr` if `addr` is a valid
 	 * granule physical address and the state of the granule at `addr` is
 	 * `expected_state`.
 	 */
 	struct granule *find_lock_granule(unsigned long addr,
 					  enum granule_state expected_state);

 	/* Find two granules and lock them in order of their address. */
 	return_code_t find_lock_two_granules(unsigned long addr1,
 					     enum granule_state expected_state1,
 					     struct granule **g1,
 					     unsigned long addr2,
 					     enum granule_state expected_state2,
 					     struct granule **g2);

 	/*
 	 * Obtain a pointer to a locked granule at `addr` which is unused
 	 * (refcount = 0), if `addr` is a valid granule physical address and the
 	 * state of the granule at `addr` is `expected_state`.
 	 */
 	struct granule *find_lock_unused_granule(unsigned long addr,
 						 enum granule_state
 						 expected_state);

 .. code-block:: C
 	:caption: **Granule unlocking operations**

 	/*
 	 * Release a spinlock held on a granule. Must only be called if the
 	 * calling PE already holds the lock.
 	 */
 	void granule_unlock(struct granule *g);

 	/*
 	 * Sets the state and releases a spinlock held on a granule. Must only
 	 * be called if the calling PE already holds the lock.
 	 */
 	void granule_unlock_transition(struct granule *g,
 				       enum granule_state new_state);
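
 To show how the state check and the spinlock fit together, the following
 sketch gives possible bodies for `granule_lock_on_state_match()` and
 `granule_unlock_transition()`. The structure layout, the enumerator set and
 the lock encoding are assumptions made for illustration; this is not the
 |RMM| source code.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
	unsigned int val;	/* 0 = free, 1 = held (assumed encoding) */
} spinlock_t;

/* Reduced, assumed set of states for the example. */
enum granule_state {
	GRANULE_STATE_NS,
	GRANULE_STATE_DELEGATED,
	GRANULE_STATE_RD
};

struct granule {
	spinlock_t lock;
	enum granule_state state;
};

static inline void spinlock_acquire(spinlock_t *l)
{
	while (__atomic_exchange_n(&l->val, 1U, __ATOMIC_ACQUIRE) != 0U) {
	}
}

static inline void spinlock_release(spinlock_t *l)
{
	__atomic_store_n(&l->val, 0U, __ATOMIC_RELEASE);
}

/* Lock first, then check: the state is only stable under the lock. */
static inline bool granule_lock_on_state_match(struct granule *g,
					enum granule_state expected_state)
{
	spinlock_acquire(&g->lock);
	if (g->state != expected_state) {
		spinlock_release(&g->lock);
		return false;
	}
	return true;
}

/* Change the state while still holding the lock, then release it. */
static inline void granule_unlock_transition(struct granule *g,
					     enum granule_state new_state)
{
	g->state = new_state;
	spinlock_release(&g->lock);
}
```

 A caller that receives `false` holds no lock on the granule, so it can return
 an error to the untrusted caller without any further cleanup.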


 Reference Counting
 *******************

 The reference count is implemented using the **refcount** variable within the
 granule structure to keep track of the references between granules. For
 example, the refcount is used to prevent changes to the attributes of a parent
 granule which is referenced by child granules, i.e. a parent with refcount not
 equal to zero.

 Race conditions on the refcount variable are avoided by either locking the
 granule before accessing the variable or by lock-free mechanisms such as
 Single-Copy Atomic operations along with ARM weakly ordered
 ACQUIRE/RELEASE/RELAXED memory semantics to synchronize shared resources.

 The following operations are defined on refcount:

 .. code-block:: C
 	:caption: **Read a refcount value**

 	/*
 	 * Single-copy atomic read of refcount variable with RELAXED memory
 	 * ordering semantics. Use this function if lock-free access to the
 	 * refcount is required with relaxed memory ordering constraints applied
 	 * at that point.
 	 */
 	unsigned long granule_refcount_read_relaxed(struct granule *g);

 	/*
 	 * Single-copy atomic read of refcount variable with ACQUIRE memory
 	 * ordering semantics. Use this function if lock-free access to the
 	 * refcount is required with acquire memory ordering constraints applied
 	 * at that point.
 	 */
 	unsigned long granule_refcount_read_acquire(struct granule *g);

 .. code-block:: C
 	:caption: **Increment a refcount value**

 	/*
 	 * Increments the granule refcount. Must be called with the granule
 	 * lock held.
 	 */
 	void __granule_get(struct granule *g);

 	/*
 	 * Increments the granule refcount by `val`. Must be called with the
 	 * granule lock held.
 	 */
 	void __granule_refcount_inc(struct granule *g, unsigned long val);

 	/* Atomically increments the reference counter of the granule.*/
 	void atomic_granule_get(struct granule *g);


 .. code-block:: C
 	:caption: **Decrement a refcount value**

 	/*
 	 * Decrements the granule refcount. Must be called with the granule
 	 * lock held.
 	 */
 	void __granule_put(struct granule *g);

 	/*
 	 * Decrements the granule refcount by `val`. Asserts if refcount can
 	 * become negative. Must be called with the granule lock held.
 	 */
 	void __granule_refcount_dec(struct granule *g, unsigned long val);

 	/* Atomically decrements the reference counter of the granule. */
 	void atomic_granule_put(struct granule *g);

 	/*
 	 * Atomically decrements the reference counter of the granule. Stores to
 	 * memory with RELEASE semantics.
 	 */
 	void atomic_granule_put_release(struct granule *g);

 .. code-block:: C
 	:caption: **Directly access refcount value**

 	/*
 	 * Directly reads/writes the refcount variable. Must be called with the
 	 * granule lock held.
 	 */
 	granule->refcount;
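
 The difference between the locked and the lock-free variants can be seen in
 the following sketch of a few of the operations above. The bodies are
 assumptions based on the comments in the declarations, written with GCC/Clang
 `__atomic` builtins, and the structure is reduced to the `refcount` field.

```c
#include <assert.h>

/* Reduced granule structure: only the refcount is modelled here. */
struct granule {
	unsigned long refcount;
};

static inline unsigned long granule_refcount_read_relaxed(struct granule *g)
{
	return __atomic_load_n(&g->refcount, __ATOMIC_RELAXED);
}

static inline unsigned long granule_refcount_read_acquire(struct granule *g)
{
	return __atomic_load_n(&g->refcount, __ATOMIC_ACQUIRE);
}

/* Plain increment: the caller is assumed to hold the granule lock. */
static inline void __granule_get(struct granule *g)
{
	g->refcount++;
}

/* Plain decrement: the caller is assumed to hold the granule lock. */
static inline void __granule_put(struct granule *g)
{
	assert(g->refcount > 0UL);
	g->refcount--;
}

/* Lock-free increment. */
static inline void atomic_granule_get(struct granule *g)
{
	(void)__atomic_fetch_add(&g->refcount, 1UL, __ATOMIC_RELAXED);
}

/* Lock-free decrement that publishes earlier writes to later readers. */
static inline void atomic_granule_put_release(struct granule *g)
{
	(void)__atomic_fetch_sub(&g->refcount, 1UL, __ATOMIC_RELEASE);
}
```

 A reader that needs to observe the writes made before
 `atomic_granule_put_release()` would pair it with
 `granule_refcount_read_acquire()`.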

 .. _locking_guidelines:

 Guidelines
 -----------

 In order to meet the :ref:`locking_reqs` discussed above, this section
 stipulates some locking and lock-free algorithm implementation guidelines for
 developers.

 Mutual Exclusion
 *****************

 The spinlock, acquire/release and atomic operations provide trivial mutual
 exclusion implementations for |RMM|. However, the following general guidelines
 should be taken into consideration:

 	- Appropriate deadlock avoidance techniques should be incorporated when
 	  using multiple locks.

 	- Lock-free access to shared resources should be atomic.

 	- Memory ordering constraints should be used prudently to avoid
 	  performance degradation. For example, if memory operations on an
 	  unlocked granule (e.g. REC) precede a refcount update, then the
 	  update should be done with release semantics. However, if no memory
 	  accesses to the granule precede the refcount update, then release
 	  semantics are not required.


 Deadlock Avoidance
 ******************

 Deadlock avoidance is provided by defining a partial order on all objects in the
 system, where the locking operation will eventually fail if the caller tries to
 lock an object whose state differs from the expected state. This means that no
 two processes will be expected to acquire locks in a different order than the
 defined partial order, and we can rely on the same reasoning for deadlock
 avoidance as shown by Dijkstra [EWD625]_.

 To establish this partial order, the objects referenced by |RMM| can be
 classified into two categories:

 #. **External**: A granule state belongs to the `external` class iff *any*
    parameter in *any* RMI command is an address of a granule which is expected
    to be in that state. The following granule states are `external`:

 	- GRANULE_STATE_NS
 	- GRANULE_STATE_DELEGATED
 	- GRANULE_STATE_RD
 	- GRANULE_STATE_REC
 	- DEV_GRANULE_STATE_NS
 	- DEV_GRANULE_STATE_DELEGATED

 #. **Internal**: A granule state belongs to the `internal` class iff it is not
    `external`. These are objects which are referenced from another
    object after that object is locked. Each `internal` object should be
    referenced from exactly one place. The following granule states are
    `internal`:

 	- GRANULE_STATE_RTT
 	- GRANULE_STATE_DATA
 	- DEV_GRANULE_STATE_MAPPED

 We now state the locking guidelines for |RMM| as:

 #. Granules expected to be in an `external` state must be locked before locking
    any granules in an `internal` state.

 #. Granules expected to be in an `external` state must be locked in order of
    their physical address, starting with the lowest address.

 #. Memory granules expected to be in an `external` state must be locked before
    locking any device memory granules in `external` state.

 #. Once a granule expected to be in an `external` state has been locked, its
    state must be checked against the expected state. If these do not match, the
    granule must be unlocked and no further granules may be locked within the
    currently-executing RMM command.

 #. Granules in an `internal` state must be locked in order of state:

 	- `RTT`
 	- `DATA`

 #. Granules in the same `internal` state must be locked in the
    :ref:`locking_impl` defined order for that specific state.

 #. A granule's state can be changed iff the granule is locked and the reference
    count is zero.
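
 The guidelines for `external` granules (address order, then state check, then
 stop on a mismatch) can be sketched as follows. `lock_two_external()` is a
 hypothetical helper written for this example. It assumes that granule
 descriptors are laid out in memory in the same order as the physical addresses
 they describe, so that comparing descriptor pointers gives the required order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	unsigned int val;	/* 0 = free, 1 = held (assumed encoding) */
} spinlock_t;

enum granule_state {
	GRANULE_STATE_NS,
	GRANULE_STATE_DELEGATED,
	GRANULE_STATE_RD,
	GRANULE_STATE_REC
};

struct granule {
	spinlock_t lock;
	enum granule_state state;
};

static inline void spinlock_acquire(spinlock_t *l)
{
	while (__atomic_exchange_n(&l->val, 1U, __ATOMIC_ACQUIRE) != 0U) {
	}
}

static inline void spinlock_release(spinlock_t *l)
{
	__atomic_store_n(&l->val, 0U, __ATOMIC_RELEASE);
}

static inline bool lock_on_state_match(struct granule *g,
				       enum granule_state expected)
{
	spinlock_acquire(&g->lock);
	if (g->state != expected) {
		spinlock_release(&g->lock);
		return false;
	}
	return true;
}

/* Locks both granules or none; lowest descriptor address first. */
static inline bool lock_two_external(struct granule *g1,
				     enum granule_state s1,
				     struct granule *g2,
				     enum granule_state s2)
{
	if (g1 == g2) {
		return false;	/* cannot lock the same granule twice */
	}
	if ((uintptr_t)g1 > (uintptr_t)g2) {
		struct granule *tg = g1;
		enum granule_state ts = s1;

		g1 = g2;
		s1 = s2;
		g2 = tg;
		s2 = ts;
	}
	if (!lock_on_state_match(g1, s1)) {
		return false;
	}
	if (!lock_on_state_match(g2, s2)) {
		spinlock_release(&g1->lock);	/* give up, lock nothing more */
		return false;
	}
	return true;
}
```

 On failure the helper leaves both granules unlocked, which matches the rule
 that no further granules may be locked after a state mismatch.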

 Starvation Avoidance
 ********************

 Currently, the lock-free implementation for RMI.REC.Enter provides starvation
 avoidance in |RMM|. However, the locking implementation does not yet provide
 starvation avoidance. This can be added with a ticket or MCS-style locking
 implementation [MCS]_.
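
 As an illustration of the ticket approach mentioned above, a minimal ticket
 lock is sketched below using C11 atomics. It is not part of the |RMM|, and
 `ticket_lock_t`, `ticket_lock_acquire()` and `ticket_lock_release()` are
 hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ticket lock: waiters are served in arrival order. */
typedef struct {
	atomic_uint next_ticket;	/* next ticket to hand out */
	atomic_uint now_serving;	/* ticket allowed to enter */
} ticket_lock_t;

static inline void ticket_lock_acquire(ticket_lock_t *l)
{
	/* Take a ticket; the atomic add gives each waiter a unique one. */
	unsigned int my = atomic_fetch_add_explicit(&l->next_ticket, 1U,
						    memory_order_relaxed);

	/* Wait until this ticket is being served. */
	while (atomic_load_explicit(&l->now_serving,
				    memory_order_acquire) != my) {
	}
}

static inline void ticket_lock_release(ticket_lock_t *l)
{
	/* Only the holder writes now_serving, so a plain add suffices. */
	unsigned int cur = atomic_load_explicit(&l->now_serving,
						memory_order_relaxed);

	atomic_store_explicit(&l->now_serving, cur + 1U,
			      memory_order_release);
}
```

 Because tickets are served strictly in the order they were taken, every waiter
 eventually enters its critical section as long as each holder releases the
 lock, which is the starvation avoidance property defined earlier.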

 Nested Critical Sections
 ************************

 Spinlocks provide support for nested critical sections. Processes can acquire
 multiple spinlocks at the same time, as long as the locking order is not
 violated.

 References
 ----------

 .. [EWD310] Dijkstra, E.W. Hierarchical ordering of sequential processes.
 	EWD 310.

 .. [EWD625] Dijkstra, E.W. Two starvation free solutions to a general exclusion
 	problem. EWD 625.

 .. [MCS] Mellor-Crummey, John M. and Scott, Michael L. Algorithms for scalable
 	synchronization on shared-memory multiprocessors. ACM TOCS, Volume 9,
 	Issue 1, Feb. 1991.

 .. [WS2001] Stallings, W. (2001). Operating systems: Internals and design
 	principles. Upper Saddle River, N.J: Prentice Hall.