| .. SPDX-License-Identifier: BSD-3-Clause |
| .. SPDX-FileCopyrightText: Copyright TF-RMM Contributors. |
| |
| .. _locking_rmm: |
| |
| RMM Locking Guidelines |
| ========================= |
| |
| This document outlines the locking requirements, discusses the implementation |
| and provides guidelines for a deadlock free |RMM| implementation. Further, the |
| document hitherto is based upon |RMM| Alpha-05 specification and is expected to |
| change as the implementation proceeds. |
| |
| .. _locking_intro: |
| |
| Introduction |
| ------------- |
| In order to meet the requirement for the |RMM| to be small, simple to reason |
| about, and to co-exist with contemporary hypervisors which are already |
| designed to manage system memory, the |RMM| does not include a memory allocator. |
| It instead relies on an untrusted caller providing granules of memory used |
| to hold both meta data to manage realms as well as code and data for realms. |
| |
| To maintain confidentiality and integrity of these granules, the |RMM| |
| implements memory access controls by maintaining awareness of the state of each |
| granule (aka Granule State, ref :ref:`locking_impl`) and enforcing rules on how |
| memory granules can transition from one state to another and how a granule can |
| be used depending on its state. For example, all granules that can be accessed |
| by software outside the |PAR| of a realm are in a specific state, and a granule |
| that holds meta data for a realm is in another specific state that prevents it |
| from being used as data in a realm and accidentally corrupted by a realm, which |
| could lead to internal failure in the |RMM|. |
| |
| Due to this complex nature of the operations supported by the |RMM|, for example |
| when managing page tables for realms, the |RMM| must be able to hold locks on |
| multiple objects at the same time. It is a well known fact that holding multiple |
| locks at the same time can easily lead to deadlocking the system, as for example |
| illustrated by the dining philosophers problem [EWD310]_. In traditional |
| operating systems software such issues are avoided by defining a partial order |
| on all system objects and always acquiring a lower-ordered object before a |
| higher-ordered object. This solution was shown to be correct by Dijkstra |
| [EWD625]_. Solutions are typically obtained by assigning an arbitrary order |
| based upon certain attributes of the objects, for example by using the memory |
| address of the object. |
| |
| Unfortunately, software such as the |RMM| cannot use these methods directly |
| because the |RMM| receives an opaque pointer from the untrusted caller and it |
| cannot know before locking the object if it is indeed of the expected state. |
| Furthermore, MMU page tables are hierarchical data structures and operations on |
| the page tables typically must be able to locate a leaf node in the hierarchy |
| based on single value (a virtual address) and therefore must walk the page |
| tables in their hierarchical order. This implies an order of objects in the same |
| Granule State which is not known by a process executing in the |RMM| before |
| holding at least one lock on object in the page table hierarchy. An obvious |
| solution to these problems would be to use a single global lock for the |RMM|, |
| but that would serialize all operations across all shared data structures in the |
| system and severely impact performance. |
| |
| |
| .. _locking_reqs: |
| |
| Requirements |
| ------------- |
| |
| To address the synchronization needs of the |RMM| described above, we must |
| employ locking and lock-free mechanisms which satisfies a number of properties. |
| These are discussed below: |
| |
| Critical Section |
| ***************** |
| |
| A critical section can be defined as a section of code within a process that |
| requires access to shared resources and that must not be executed while |
| another process is in a corresponding section of code [WS2001]_. |
| |
| Further, access to shared resources without appropriate synchronization can lead |
| to **race conditions**, which can be defined as a situation in which multiple |
| threads or processes read and write a shared item and the final result depends |
| on the relative timing of their execution [WS2001]_. |
| |
| In terms of |RMM|, an access to a shared resource can be considered as a list |
| of operations/instructions in program order that either reads from or writes to |
| a shared memory location (e.g. the granule data structure or the memory granule |
| described by the granule data structure, ref :ref:`locking_impl`). It is also |
| understood that this list of operations does not execute indefinitely, but |
| eventually terminates. |
| |
| We can now define our desired properties as follows: |
| |
| Mutual Exclusion |
| ***************** |
| |
| Mutual exclusion can be defined as the requirement that when one process is in a |
| critical section that accesses shared resources, no other process may be in a |
| critical section that accesses any of those shared resources [WS2001]_. |
| |
| The following example illustrates how an implementation might enforce mutual |
| exclusion of critical sections using a lock on a valid granule data structure |
| `struct granule *a`: |
| |
| .. code-block:: C |
| |
| struct granule *a; |
| bool r; |
| |
| r = try_lock(a); |
| if (!r) { |
| return -ERROR; |
| } |
| critical_section(a); |
| unlock(a); |
| other_work(); |
| |
| We note that a process might fail to perform the `lock` operation on object `a` |
| and return an error or successfully acquire the lock, execute the |
| `critical_section()`, `unlock()` and then continue to make forward progress to |
| `other_work()` function. |
| |
| Deadlock Avoidance |
| ******************* |
| |
| A deadlock can be defined as a situation in which two or more processes are |
| unable to proceed because each is waiting for one of the others to do something |
| [WS2001]_. |
| |
| In other words, one or more processes are trying to enter their critical |
| sections but none of them make forward progress. |
| |
| We can then define the deadlock avoidance property as the inverse scenario: |
| |
| When one or more processes are trying to enter their critical sections, at least |
| one of them makes forward progress. |
| |
| A deadlock is a fatal event if it occurs in supervisory software such as the |
| |RMM|. This must be avoided as it can render the system vulnerable to exploits |
| and/or unresponsive which may lead to data loss, interrupted service and |
| eventually economic loss. |
| |
| Starvation Avoidance |
| ********************* |
| |
| Starvation can be defined as a situation in which a runnable process is |
| overlooked indefinitely by the scheduler; although it is able to proceed, it is |
| never chosen [WS2001]_. |
| |
| Then starvation avoidance can be defined as, all processes that are trying to |
| enter their critical sections eventually make forward progress. |
| |
| Starvation must be avoided, because if one or more processes do not make forward |
| progress, the PE on which the process runs will not perform useful work and |
| will be lost to the user, resulting in similar issues like a deadlocked system. |
| |
| Nested Critical Sections |
| ************************* |
| |
| A critical section for an object may be nested within the critical section for |
| another object for the same process. In other words, a process may enter more |
| than one critical section at the same time. |
| |
| For example, if the |RMM| needs to copy data from one granule to another |
| granule, and must be sure that both granules can only be modified by the process |
| itself, it may be implemented in the following way: |
| |
| .. code-block:: C |
| |
| struct granule *a; |
| struct granule *b; |
| bool r; |
| |
| r = try_lock(a); |
| if (!r) { |
| return -ERROR; |
| } |
| |
| /* critical section for granule a -- ENTER */ |
| |
| r = try_lock(b); |
| if (r) { |
| /* critical section for granule b -- ENTER */ |
| b->foo = a->foo; |
| /* critical section for granule b -- EXIT */ |
| unlock(b); |
| } |
| |
| /* critical section for granule a -- EXIT */ |
| unlock(a); |
| |
| .. _locking_impl: |
| |
| Implementation |
| --------------- |
| |
| The |RMM| maintains granule states by defining a data structure for each |
| memory granule in the system. Conceptually, the data structure contains the |
| following fields: |
| |
| * Granule State |
| * Lock |
| * Reference Count |
| |
| The Lock field provides mutual exclusion of processes executing in their |
| critical sections which may access the shared granule data structure and the |
| shared meta data which may be stored in the memory granule which is in one of |
| the |RD|, |REC|, and Table states. Both the data structure describing |
| the memory granule and the contents of the memory granule itself can be accessed |
| by multiple PEs concurrently and we therefore require some concurrency protocol |
| to avoid corruption of shared data structures. An alternative to using a lock |
| providing mutual exclusion would be to design all operations that access shared |
| data structures as lock-free algorithms, but due to the complexity of the data |
| structures and the operation of the |RMM| we consider this too difficult to |
| accomplish in practice. |
| |
| The Reference Count field is used to keep track of references between granules. |
| For example, an |RD| describes a realm, and a |REC| describes an execution |
| context within that realm, and therefore an |RD| must always exist when a |REC| |
| exists. To prevent the |RMM| from destroying an |RD| while a |REC| still exists, |
| the |RMM| holds a reference count on the |RD| for each |REC| associated with the |
| same realm, and only when the all the RECs in a realm have been destroyed and |
| the reference count on an |RD| drops to zero, can the |RD| be destroyed and the |
| granule be repurposed for other use. |
| |
| Based on the above, we now describe the Granule State field and the current |
| locking/refcount implementation: |
| |
| * **UnDelegated:** These are granules for which |RMM| does not prevent the |PAS| |
| of the granule from being changed by another agent to any value. |
| In this state, the granule content access is not protected by granule::lock, |
| as it is always subject to reads and writes from Non-Realm worlds. |
| |
| * **Delegated:** These are granules with memory only accessible by the |RMM|. |
| The granule content is protected by granule::lock. No reference counts are |
| held on this granule state. |
| |
| * **Realm Descriptor (RD):** These are granules containing meta data describing |
| a realm, and only accessible by the |RMM|. Granule content access is protected |
| by granule::lock. A reference count is also held on this granule for each |
| associated |REC| granule. |
| |
| * **Realm Execution Context (REC):** These are granules containing meta data |
| describing a virtual PE running in a realm, and are only accessible by the |
| |RMM|. The execution content access is not protected by granule::lock, because |
| we cannot enter a realm while holding the lock. Further, the following rules |
| apply with respect to the granule's reference counts: |
| |
| - A reference count is held on this granule when a |REC| is running. |
| |
| - As |REC| cannot be run on two PEs at the same time, the maximum value |
| of the reference count is one. |
| |
| - When the |REC| is entered, the reference count is incremented |
| (set to 1) atomically while granule::lock is held. |
| |
| - When the |REC| exits, the reference counter is released (set to 0) |
| atomically with store-release semantics without granule::lock being |
| held. |
| |
| - The |RMM| can access the granule's content on the entry and exit path |
| from the |REC| while the reference is held. |
| |
| * **Translation Table:** These are granules containing meta data describing |
| virtual to physical address translation for the realm, accessible by the |RMM| |
| and the hardware Memory Management Unit (MMU). Granule content access is |
| protected by granule::lock, but hardware translation table walks may read the |
| RTT at any point in time. Multiple granules in this state can only be locked |
| at the same time if they are part of the same tree, and only in topological |
| order from root to leaf. The topological order of concatenated root level RTTs |
| is from the lowest address to the highest address. The complete internal |
| locking order for RTT granules is: RD -> [RTT] -> ... -> RTT. A reference |
| count is held on this granule for each entry in the RTT that refers to a |
| granule: |
| |
| - Table s2tte. |
| |
| - Valid s2tte. |
| |
| - Valid_NS s2tte. |
| |
| - Assigned s2tte. |
| |
| * **Data:** These are granules containing realm data, accessible by the |RMM| |
| and by the realm to which it belongs. Granule content access is not protected |
| by granule::lock, as it is always subject to reads and writes from within a |
| realm. A granule in this state is always referenced from exactly one entry in |
| an RTT granule which must be locked before locking this granule. Only a single |
| DATA granule can be locked at a time on a given PE. The complete internal |
| locking order for DATA granules is: RD -> RTT -> RTT -> ... -> DATA. |
| No reference counts are held on this granule type. |
| |
| |
| Locking |
| ******** |
| |
| The |RMM| uses spinlocks along with the object state for locking implementation. |
| The lock provides similar exclusive acquire semantics known from trivial |
| spinlock implementations, however also allows verification of whether the locked |
| object is of an expected state. |
| |
| The data structure for the spinlock can be described in C as follows: |
| |
| .. code-block:: C |
| |
| typedef struct { |
| unsigned int val; |
| } spinlock_t; |
| |
| This data structure can be embedded in any object that requires synchronization |
| of access, such as the `struct granule` described above. |
| |
| The following operations are defined on spinlocks: |
| |
| .. code-block:: C |
| :caption: **Typical spinlock operations** |
| |
| /* |
| * Locks a spinlock with acquire memory ordering semantics or goes into |
| * a tight loop (spins) and repeatedly checks the lock variable |
| * atomically until it becomes available. |
| */ |
| void spinlock_acquire(spinlock_t *l); |
| |
| /* |
| * Unlocks a spinlock with release memory ordering semantics. Must only |
| * be called if the calling PE already holds the lock. |
| */ |
| void spinlock_release(spinlock_t *l); |
| |
| |
| The above functions should not be directly used for locking/unlocking granules, |
| instead the following should be used: |
| |
| .. code-block:: C |
| :caption: **Granule locking operations** |
| |
| /* |
| * Acquires a lock (or spins until the lock is available), then checks |
| * if the granule is in the `expected_state`. If the `expected_state` |
| * is matched, then returns `true`. Otherwise, releases the lock and |
| * returns `false`. |
| */ |
| bool granule_lock_on_state_match(struct granule *g, |
| enum granule_state expected_state); |
| |
| /* |
| * Used when we're certain of the state of an object (e.g. because we |
| * hold a reference to it) or when locking objects whose reference is |
| * obtained from another object, after that objects is locked. |
| */ |
| void granule_lock(struct granule *g, |
| enum granule_state expected_state); |
| |
| /* |
| * Obtains a pointer to a locked granule at `addr` if `addr` is a valid |
| * granule physical address and the state of the granule at `addr` is |
| * `expected_state`. |
| */ |
| struct granule *find_lock_granule(unsigned long addr, |
| enum granule_state expected_state); |
| |
| /* Find two granules and lock them in order of their address. */ |
| return_code_t find_lock_two_granules(unsigned long addr1, |
| enum granule_state expected_state1, |
| struct granule **g1, |
| unsigned long addr2, |
| enum granule_state expected_state2, |
| struct granule **g2); |
| |
| /* |
| * Obtain a pointer to a locked granule at `addr` which is unused |
| * (refcount = 0), if `addr` is a valid granule physical address and the |
| * state of the granule at `addr` is `expected_state`. |
| */ |
| struct granule *find_lock_unused_granule(unsigned long addr, |
| enum granule_state |
| expected_state); |
| |
| .. code-block:: C |
| :caption: **Granule unlocking operations** |
| |
| /* |
| * Release a spinlock held on a granule. Must only be called if the |
| * calling PE already holds the lock. |
| */ |
| void granule_unlock(struct granule *g); |
| |
| /* |
| * Sets the state and releases a spinlock held on a granule. Must only |
| * be called if the calling PE already holds the lock. |
| */ |
| void granule_unlock_transition(struct granule *g, |
| enum granule_state new_state); |
| |
| |
| Reference Counting |
| ******************* |
| |
| The reference count is implemented using the **refcount** variable within the |
| granule structure to keep track of the references in between granules. For |
| example, the refcount is used to prevent changes to the attributes of a parent |
| granule which is referenced by child granules, ie. a parent with refcount not |
| equal to zero. |
| |
| Race conditions on the refcount variable are avoided by either locking the |
| granule before accessing the variable or by lock-free mechanisms such as |
| Single-Copy Atomic operations along with ARM weakly ordered |
| ACQUIRE/RELEASE/RELAXED memory semantics to synchronize shared resources. |
| |
| The following operations are defined on refcount: |
| |
| .. code-block:: C |
| :caption: **Read a refcount value** |
| |
| /* |
| * Single-copy atomic read of refcount variable with RELAXED memory |
| * ordering semantics. Use this function if lock-free access to the |
| * refcount is required with relaxed memory ordering constraints applied |
| * at that point. |
| */ |
| unsigned long granule_refcount_read_relaxed(struct granule *g); |
| |
| /* |
| * Single-copy atomic read of refcount variable with ACQUIRE memory |
| * ordering semantics. Use this function if lock-free access to the |
| * refcount is required with acquire memory ordering constraints applied |
| * at that point. |
| */ |
| unsigned long granule_refcount_read_acquire(struct granule *g); |
| |
| .. code-block:: C |
| :caption: **Increment a refcount value** |
| |
| /* |
| * Increments the granule refcount. Must be called with the granule |
| * lock held. |
| */ |
| void __granule_get(struct granule *g); |
| |
| /* |
| * Increments the granule refcount by `val`. Must be called with the |
| * granule lock held. |
| */ |
| void __granule_refcount_inc(struct granule *g, unsigned long val); |
| |
| /* Atomically increments the reference counter of the granule.*/ |
| void atomic_granule_get(struct granule *g); |
| |
| |
| .. code-block:: C |
| :caption: **Decrement a refcount value** |
| |
| /* |
| * Decrements the granule refcount. Must be called with the granule |
| * lock held. |
| */ |
| void __granule_put(struct granule *g); |
| |
| /* |
| * Decrements the granule refcount by `val`. Asserts if refcount can |
| * become negative. Must be called with the granule lock held. |
| */ |
| void __granule_refcount_dec(struct granule *g, unsigned long val); |
| |
| /* Atomically decrements the reference counter of the granule. */ |
| void atomic_granule_put(struct granule *g); |
| |
| /* |
| * Atomically decrements the reference counter of the granule. Stores to |
| * memory with RELEASE semantics. |
| */ |
| void atomic_granule_put_release(struct granule *g); |
| |
| .. code-block:: C |
| :caption: **Directly access refcount value** |
| |
| /* |
| * Directly reads/writes the refcount variable. Must be called with the |
| * granule lock held. |
| */ |
| granule->refcount; |
| |
| .. _locking_guidelines: |
| |
| Guidelines |
| ----------- |
| |
| In order to meet the :ref:`locking_reqs` discussed above, this section |
| stipulates some locking and lock-free algorithm implementation guidelines for |
| developers. |
| |
| Mutual Exclusion |
| ***************** |
| |
| The spinlock, acquire/release and atomic operations provide trivial mutual |
| exclusion implementations for |RMM|. However, the following general guidelines |
| should be taken into consideration: |
| |
| - Appropriate deadlock avoidance techniques should be incorporated when |
| using multiple locks. |
| |
| - Lock-free access to shared resources should be atomic. |
| |
| - Memory ordering constraints should be used prudently to avoid |
| performance degradation. For e.g. on an unlocked granule (e.g. REC), |
| prior to the refcount update, if there are associated memory |
| operations, then the update should be done with release semantics. |
| However, if there are no associated memory accesses to the granule |
| prior to the refcount update then release semantics will not be |
| required. |
| |
| |
| Deadlock Avoidance |
| ****************** |
| |
| Deadlock avoidance is provided by defining a partial order on all objects in the |
| system where the locking operation will eventually fail if the caller tries to |
| acquire a lock of a different state object than expected. This means that no |
| two processes will be expected to acquire locks in a different order than the |
| defined partial order, and we can rely on the same reasoning for deadlock |
| avoidance as shown by Dijkstra [EWD625]_. |
| |
| To establish this partial order, the objects referenced by |RMM| can be |
| classified into two categories: |
| |
| #. **External**: A granule state belongs to the `external` class iff _any_ |
| parameter in _any_ RMI command is an address of a granule which is expected |
| to be in that state. The following granule states are `external`: |
| |
| - GRANULE_STATE_NS |
| - GRANULE_STATE_DELEGATED |
| - GRANULE_STATE_RD |
| - GRANULE_STATE_REC |
| - DEV_GRANULE_STATE_NS |
| - DEV_GRANULE_STATE_DELEGATED |
| |
| #. **Internal**: A granule state belongs to the `internal` class iff it is not |
| an `external`. These are objects which are referenced from another |
| object after that object is locked. Each `internal` object should be |
| referenced from exactly one place. The following granule states are |
| `internal`: |
| |
| - GRANULE_STATE_RTT |
| - GRANULE_STATE_DATA |
| - DEV_GRANULE_STATE_MAPPED |
| |
| We now state the locking guidelines for |RMM| as: |
| |
| #. Granules expected to be in an `external` state must be locked before locking |
| any granules in an `internal` state. |
| |
| #. Granules expected to be in an `external` state must be locked in order of |
| their physical address, starting with the lowest address. |
| |
| #. Memory granules expected to be in an `external` state must be locked before |
| locking any device memory granules in `external` state. |
| |
| #. Once a granule expected to be in an `external` state has been locked, its |
| state must be checked against the expected state. If these do not match, the |
| granule must be unlocked and no further granules may be locked within the |
| currently-executing RMM command. |
| |
| #. Granules in an `internal` state must be locked in order of state: |
| |
| - `RTT` |
| - `DATA` |
| |
| #. Granules in the same `internal` state must be locked in the |
| :ref:`locking_impl` defined order for that specific state. |
| |
| #. A granule's state can be changed iff the granule is locked and the reference |
| count is zero. |
| |
| Starvation Avoidance |
| ******************** |
| |
| Currently, the lock-free implementation for RMI.REC.Enter provides Starvation |
| Avoidance in |RMM|. However, for the locking implementation, Starvation |
| Avoidance is yet to be accomplished. This can be added by a ticket or MCS style |
| locking implementation [MCS]_. |
| |
| Nested Critical Sections |
| ************************ |
| |
| Spinlocks provide support for nested critical sections. Processes can acquire |
| multiple spinlocks at the same time, as long as the locking order is not |
| violated. |
| |
| References |
| ---------- |
| |
| .. [EWD310] Dijkstra, E.W. Hierarchical ordering of sequential processes. |
| EWD 310. |
| |
| .. [EWD625] Dijkstra, E.W. Two starvation free solutions to a general exclusion |
| problem. EWD 625. |
| |
| .. [MCS] Mellor-Crummey, John M. and Scott, Michael L. Algorithms for scalable |
| synchronization on shared-memory multiprocessors. ACM TOCS, Volume 9, |
| Issue 1, Feb. 1991. |
| |
| .. [WS2001] Stallings, W. (2001). Operating systems: Internals and design |
| principles. Upper Saddle River, N.J: Prentice Hall. |