.. SPDX-License-Identifier: BSD-3-Clause
.. SPDX-FileCopyrightText: Copyright TF-RMM Contributors.

.. _locking_rmm:

RMM Locking Guidelines
=========================

This document outlines the locking requirements, discusses the implementation
and provides guidelines for a deadlock-free |RMM| implementation. The document
is based upon the |RMM| Alpha-05 specification and is expected to change as
the implementation proceeds.

.. _locking_intro:

Introduction
-------------
In order to meet the requirement for the |RMM| to be small, simple to reason
about, and to co-exist with contemporary hypervisors which are already
designed to manage system memory, the |RMM| does not include a memory
allocator. It instead relies on an untrusted caller providing granules of
memory used to hold both metadata to manage realms as well as code and data
for realms.

To maintain confidentiality and integrity of these granules, the |RMM|
implements memory access controls by maintaining awareness of the state of
each granule (aka Granule State, see :ref:`locking_impl`) and enforcing rules
on how memory granules can transition from one state to another and how a
granule can be used depending on its state. For example, all granules that can
be accessed by software outside the |PAR| of a realm are in a specific state,
and a granule that holds metadata for a realm is in another specific state
that prevents it from being used as data in a realm and accidentally corrupted
by a realm, which could lead to internal failure in the |RMM|.

Due to the complex nature of the operations supported by the |RMM|, for
example when managing page tables for realms, the |RMM| must be able to hold
locks on multiple objects at the same time. It is well known that holding
multiple locks at the same time can easily lead to deadlocking the system, as
illustrated for example by the dining philosophers problem [EWD310]_. In
traditional operating systems software such issues are avoided by defining a
partial order on all system objects and always acquiring a lower-ordered
object before a higher-ordered object. This solution was shown to be correct
by Dijkstra [EWD625]_. The order is typically obtained by assigning an
arbitrary value based upon certain attributes of the objects, for example by
using the memory address of the object.
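
As a simple illustration of ordering by memory address, the following sketch
(hypothetical POSIX-threads code, not part of the |RMM|) acquires the locks of
two objects in a globally consistent order derived from their addresses, so no
two threads can ever take the same pair of locks in opposite order:

.. code-block:: C

    #include <pthread.h>
    #include <stdint.h>

    struct object {
        pthread_mutex_t lock;
    };

    /*
     * Lock two distinct objects in address order. Any two threads
     * locking the same pair take the locks in the same order, so they
     * cannot deadlock on this pair.
     */
    static void lock_pair(struct object *x, struct object *y)
    {
        struct object *lo = ((uintptr_t)x < (uintptr_t)y) ? x : y;
        struct object *hi = ((uintptr_t)x < (uintptr_t)y) ? y : x;

        pthread_mutex_lock(&lo->lock);
        pthread_mutex_lock(&hi->lock);
    }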

Unfortunately, software such as the |RMM| cannot use these methods directly
because the |RMM| receives an opaque pointer from the untrusted caller and it
cannot know, before locking the object, whether it is indeed in the expected
state. Furthermore, MMU page tables are hierarchical data structures, and
operations on the page tables typically must be able to locate a leaf node in
the hierarchy based on a single value (a virtual address) and therefore must
walk the page tables in their hierarchical order. This implies an order of
objects in the same Granule State which is not known by a process executing in
the |RMM| before holding at least one lock on an object in the page table
hierarchy. An obvious solution to these problems would be to use a single
global lock for the |RMM|, but that would serialize all operations across all
shared data structures in the system and severely impact performance.


.. _locking_reqs:

Requirements
-------------

To address the synchronization needs of the |RMM| described above, we must
employ locking and lock-free mechanisms which satisfy a number of properties.
These are discussed below:

Critical Section
*****************

A critical section can be defined as a section of code within a process that
requires access to shared resources and that must not be executed while
another process is in a corresponding section of code [WS2001]_.

Further, access to shared resources without appropriate synchronization can
lead to **race conditions**, which can be defined as a situation in which
multiple threads or processes read and write a shared item and the final
result depends on the relative timing of their execution [WS2001]_.

In terms of the |RMM|, an access to a shared resource can be considered as a
list of operations/instructions in program order that either read from or
write to a shared memory location (e.g. the granule data structure or the
memory granule described by the granule data structure, see
:ref:`locking_impl`). It is also understood that this list of operations does
not execute indefinitely, but eventually terminates.

We can now define our desired properties as follows:

Mutual Exclusion
*****************

Mutual exclusion can be defined as the requirement that when one process is in
a critical section that accesses shared resources, no other process may be in
a critical section that accesses any of those shared resources [WS2001]_.

The following example illustrates how an implementation might enforce mutual
exclusion of critical sections using a lock on a valid granule data structure
`struct granule *a`:

.. code-block:: C

    struct granule *a;
    bool r;

    r = try_lock(a);
    if (!r) {
        return -ERROR;
    }
    critical_section(a);
    unlock(a);
    other_work();

We note that a process either fails the `try_lock` operation on object `a` and
returns an error, or successfully acquires the lock, executes
`critical_section()` and `unlock()`, and then continues to make forward
progress to the `other_work()` function.

Deadlock Avoidance
*******************

A deadlock can be defined as a situation in which two or more processes are
unable to proceed because each is waiting for one of the others to do
something [WS2001]_.

In other words, one or more processes are trying to enter their critical
sections but none of them makes forward progress.
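
For illustration, the following sketch (hypothetical POSIX-threads code, not
part of the |RMM|) shows the classic deadlock-prone pattern: two threads
acquiring the same two locks in opposite order:

.. code-block:: C

    #include <pthread.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    /* Thread 1 takes `a` then `b`. */
    static void *thread1(void *unused)
    {
        pthread_mutex_lock(&a);
        pthread_mutex_lock(&b);    /* may block forever */
        /* ... critical section ... */
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
        return NULL;
    }

    /* Thread 2 takes `b` then `a`: the opposite order. If each thread
     * has taken its first lock when it requests the second, both wait
     * on each other indefinitely. */
    static void *thread2(void *unused)
    {
        pthread_mutex_lock(&b);
        pthread_mutex_lock(&a);    /* may block forever */
        /* ... critical section ... */
        pthread_mutex_unlock(&a);
        pthread_mutex_unlock(&b);
        return NULL;
    }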

We can then define the deadlock avoidance property as the inverse scenario:
when one or more processes are trying to enter their critical sections, at
least one of them makes forward progress.

A deadlock is a fatal event if it occurs in supervisory software such as the
|RMM|. It must be avoided, as it can render the system vulnerable to exploits
and/or unresponsive, which may lead to data loss, interrupted service and
eventually economic loss.

Starvation Avoidance
*********************

Starvation can be defined as a situation in which a runnable process is
overlooked indefinitely by the scheduler; although it is able to proceed, it
is never chosen [WS2001]_.

Starvation avoidance can then be defined as: all processes that are trying to
enter their critical sections eventually make forward progress.

Starvation must be avoided, because if one or more processes do not make
forward progress, the PE on which the process runs will not perform useful
work and will be lost to the user, resulting in issues similar to those of a
deadlocked system.

Nested Critical Sections
*************************

A critical section for an object may be nested within the critical section
for another object for the same process. In other words, a process may enter
more than one critical section at the same time.

For example, if the |RMM| needs to copy data from one granule to another
granule, and must be sure that both granules can only be modified by the
process itself, it may be implemented in the following way:

.. code-block:: C

    struct granule *a;
    struct granule *b;
    bool r;

    r = try_lock(a);
    if (!r) {
        return -ERROR;
    }

    /* critical section for granule a -- ENTER */

    r = try_lock(b);
    if (r) {
        /* critical section for granule b -- ENTER */
        b->foo = a->foo;
        /* critical section for granule b -- EXIT */
        unlock(b);
    }

    /* critical section for granule a -- EXIT */
    unlock(a);

.. _locking_impl:

Implementation
---------------

The |RMM| maintains granule states by defining a data structure for each
memory granule in the system. Conceptually, the data structure contains the
following fields:

* Granule State
* Lock
* Reference Count
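
A minimal sketch of this data structure is shown below, with `spinlock_t` as
described in the Locking section later in this document; this is conceptual
only, and the actual `struct granule` in the |RMM| sources may declare and
pack these fields differently:

.. code-block:: C

    /* Conceptual layout only; actual field types and packing differ. */
    struct granule {
        /* Lock protecting the granule state and any metadata content */
        spinlock_t lock;
        /* Current Granule State */
        enum granule_state state;
        /* Number of references held on this granule */
        unsigned long refcount;
    };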

The Lock field provides mutual exclusion of processes executing in their
critical sections which may access the shared granule data structure and the
shared metadata which may be stored in the memory granule when the granule is
in one of the |RD|, |REC|, and Table states. Both the data structure
describing the memory granule and the contents of the memory granule itself
can be accessed by multiple PEs concurrently, and we therefore require some
concurrency protocol to avoid corruption of shared data structures. An
alternative to using a lock providing mutual exclusion would be to design all
operations that access shared data structures as lock-free algorithms, but due
to the complexity of the data structures and the operation of the |RMM| we
consider this too difficult to accomplish in practice.

The Reference Count field is used to keep track of references between
granules. For example, an |RD| describes a realm, and a |REC| describes an
execution context within that realm, and therefore an |RD| must always exist
when a |REC| exists. To prevent the |RMM| from destroying an |RD| while a
|REC| still exists, the |RMM| holds a reference count on the |RD| for each
|REC| associated with the same realm, and only when all the RECs in a realm
have been destroyed and the reference count on the |RD| drops to zero can the
|RD| be destroyed and the granule be repurposed for other use.
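
As an illustration, a realm-destroy flow might guard the |RD| as sketched
below, using the `find_lock_unused_granule()` helper described later in this
document. This is a sketch only: it assumes the helper returns NULL on
failure, and `rd_addr` and `teardown_rd()` are hypothetical names:

.. code-block:: C

    /*
     * Sketch: the RD granule can only be torn down and returned to the
     * Delegated state when it is locked and no REC holds a reference
     * on it (refcount == 0).
     */
    struct granule *g_rd;

    g_rd = find_lock_unused_granule(rd_addr, GRANULE_STATE_RD);
    if (g_rd == NULL) {
        return -ERROR; /* not found, wrong state, or still referenced */
    }

    teardown_rd(g_rd); /* hypothetical helper */
    granule_unlock_transition(g_rd, GRANULE_STATE_DELEGATED);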
221
222Based on the above, we now describe the Granule State field and the current
223locking/refcount implementation:
224
225* **UnDelegated:** These are granules for which |RMM| does not prevent the |PAS|
226 of the granule from being changed by another agent to any value.
227 In this state, the granule content access is not protected by granule::lock,
228 as it is always subject to reads and writes from Non-Realm worlds.
229
230* **Delegated:** These are granules with memory only accessible by the |RMM|.
231 The granule content is protected by granule::lock. No reference counts are
232 held on this granule state.
233
234* **Realm Descriptor (RD):** These are granules containing meta data describing
235 a realm, and only accessible by the |RMM|. Granule content access is protected
236 by granule::lock. A reference count is also held on this granule for each
237 associated |REC| granule.
238
239* **Realm Execution Context (REC):** These are granules containing meta data
240 describing a virtual PE running in a realm, and are only accessible by the
241 |RMM|. The execution content access is not protected by granule::lock, because
242 we cannot enter a realm while holding the lock. Further, the following rules
243 apply with respect to the granule's reference counts:
244
245 - A reference count is held on this granule when a |REC| is running.
246
247 - As |REC| cannot be run on two PEs at the same time, the maximum value
248 of the reference count is one.
249
250 - When the |REC| is entered, the reference count is incremented
251 (set to 1) atomically while granule::lock is held.
252
253 - When the |REC| exits, the reference counter is released (set to 0)
254 atomically with store-release semantics without granule::lock being
255 held.
256
257 - The |RMM| can access the granule's content on the entry and exit path
258 from the |REC| while the reference is held.
259
260* **Translation Table:** These are granules containing meta data describing
261 virtual to physical address translation for the realm, accessible by the |RMM|
262 and the hardware Memory Management Unit (MMU). Granule content access is
263 protected by granule::lock, but hardware translation table walks may read the
264 RTT at any point in time. Multiple granules in this state can only be locked
265 at the same time if they are part of the same tree, and only in topological
266 order from root to leaf. The topological order of concatenated root level RTTs
267 is from the lowest address to the highest address. The complete internal
268 locking order for RTT granules is: RD -> [RTT] -> ... -> RTT. A reference
269 count is held on this granule for each entry in the RTT that refers to a
270 granule:
271
272 - Table s2tte.
273
274 - Valid s2tte.
275
276 - Valid_NS s2tte.
277
278 - Assigned s2tte.
279
280* **Data:** These are granules containing realm data, accessible by the |RMM|
281 and by the realm to which it belongs. Granule content access is not protected
282 by granule::lock, as it is always subject to reads and writes from within a
283 realm. A granule in this state is always referenced from exactly one entry in
284 an RTT granule which must be locked before locking this granule. Only a single
285 DATA granule can be locked at a time on a given PE. The complete internal
286 locking order for DATA granules is: RD -> RTT -> RTT -> ... -> DATA.
287 No reference counts are held on this granule type.


Locking
********

The |RMM| uses spinlocks, along with the object state, for its locking
implementation. The lock provides exclusive acquire semantics similar to a
trivial spinlock implementation, but in addition allows verification of
whether the locked object is in an expected state.

The data structure for the spinlock can be described in C as follows:

.. code-block:: C

    typedef struct {
        unsigned int val;
    } spinlock_t;

This data structure can be embedded in any object that requires
synchronization of access, such as the `struct granule` described above.

The following operations are defined on spinlocks:

.. code-block:: C
    :caption: **Typical spinlock operations**

    /*
     * Locks a spinlock with acquire memory ordering semantics, or goes
     * into a tight loop (spins) and repeatedly checks the lock variable
     * atomically until it becomes available.
     */
    void spinlock_acquire(spinlock_t *l);

    /*
     * Unlocks a spinlock with release memory ordering semantics. Must
     * only be called if the calling PE already holds the lock.
     */
    void spinlock_release(spinlock_t *l);
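
A minimal sketch of how these semantics can be expressed with C11 atomics is
shown below. It declares the lock word `_Atomic` for the sake of the sketch;
the actual |RMM| implementation is architecture-specific and may differ:

.. code-block:: C

    #include <stdatomic.h>

    /* Sketch only: the lock word is _Atomic here for portability. */
    typedef struct {
        atomic_uint val;
    } sketch_spinlock_t;

    static void sketch_spinlock_acquire(sketch_spinlock_t *l)
    {
        unsigned int expected = 0U;

        /* Spin until the lock word transitions 0 -> 1 (acquire). */
        while (!atomic_compare_exchange_weak_explicit(&l->val,
                                                      &expected, 1U,
                                                      memory_order_acquire,
                                                      memory_order_relaxed)) {
            expected = 0U;
        }
    }

    static void sketch_spinlock_release(sketch_spinlock_t *l)
    {
        /* Publish the critical section's writes, then free the lock. */
        atomic_store_explicit(&l->val, 0U, memory_order_release);
    }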

The above functions should not be used directly for locking/unlocking
granules; instead, the following should be used:

.. code-block:: C
    :caption: **Granule locking operations**

    /*
     * Acquires a lock (or spins until the lock is available), then
     * checks if the granule is in the `expected_state`. If the
     * `expected_state` is matched, then returns `true`. Otherwise,
     * releases the lock and returns `false`.
     */
    bool granule_lock_on_state_match(struct granule *g,
                                     enum granule_state expected_state);

    /*
     * Used when we are certain of the state of an object (e.g. because
     * we hold a reference to it) or when locking objects whose
     * reference is obtained from another object, after that object is
     * locked.
     */
    void granule_lock(struct granule *g,
                      enum granule_state expected_state);

    /*
     * Obtains a pointer to a locked granule at `addr` if `addr` is a
     * valid granule physical address and the state of the granule at
     * `addr` is `expected_state`.
     */
    struct granule *find_lock_granule(unsigned long addr,
                                      enum granule_state expected_state);

    /* Finds two granules and locks them in order of their address. */
    return_code_t find_lock_two_granules(unsigned long addr1,
                                         enum granule_state expected_state1,
                                         struct granule **g1,
                                         unsigned long addr2,
                                         enum granule_state expected_state2,
                                         struct granule **g2);

    /*
     * Obtains a pointer to a locked granule at `addr` which is unused
     * (refcount == 0), if `addr` is a valid granule physical address
     * and the state of the granule at `addr` is `expected_state`.
     */
    struct granule *find_lock_unused_granule(unsigned long addr,
                                             enum granule_state expected_state);

.. code-block:: C
    :caption: **Granule unlocking operations**

    /*
     * Releases a spinlock held on a granule. Must only be called if the
     * calling PE already holds the lock.
     */
    void granule_unlock(struct granule *g);

    /*
     * Sets the state and releases a spinlock held on a granule. Must
     * only be called if the calling PE already holds the lock.
     */
    void granule_unlock_transition(struct granule *g,
                                   enum granule_state new_state);
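
The following sketch shows a typical usage pattern of these operations,
loosely modelled on granule delegation. It assumes `find_lock_granule()`
returns NULL on failure; `addr` and the scrubbing step are illustrative:

.. code-block:: C

    struct granule *g;

    /* Lock the granule at `addr` iff it is a valid Non-secure granule. */
    g = find_lock_granule(addr, GRANULE_STATE_NS);
    if (g == NULL) {
        return -ERROR;
    }

    /* ... change the PAS of the granule and scrub its contents ... */

    /* Publish the new state and release the lock in one operation. */
    granule_unlock_transition(g, GRANULE_STATE_DELEGATED);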


Reference Counting
*******************

The reference count is implemented using the **refcount** variable within the
granule structure to keep track of the references between granules. For
example, the refcount is used to prevent changes to the attributes of a parent
granule which is referenced by child granules, i.e. a parent with a refcount
not equal to zero.

Race conditions on the refcount variable are avoided by either locking the
granule before accessing the variable or by lock-free mechanisms such as
single-copy atomic operations, along with Arm weakly-ordered
ACQUIRE/RELEASE/RELAXED memory semantics, to synchronize shared resources.

The following operations are defined on refcount:

.. code-block:: C
    :caption: **Read a refcount value**

    /*
     * Single-copy atomic read of the refcount variable with RELAXED
     * memory ordering semantics. Use this function if lock-free access
     * to the refcount is required with relaxed memory ordering
     * constraints applied at that point.
     */
    unsigned long granule_refcount_read_relaxed(struct granule *g);

    /*
     * Single-copy atomic read of the refcount variable with ACQUIRE
     * memory ordering semantics. Use this function if lock-free access
     * to the refcount is required with acquire memory ordering
     * constraints applied at that point.
     */
    unsigned long granule_refcount_read_acquire(struct granule *g);

.. code-block:: C
    :caption: **Increment a refcount value**

    /*
     * Increments the granule refcount. Must be called with the granule
     * lock held.
     */
    void __granule_get(struct granule *g);

    /*
     * Increments the granule refcount by `val`. Must be called with the
     * granule lock held.
     */
    void __granule_refcount_inc(struct granule *g, unsigned long val);

    /* Atomically increments the reference counter of the granule. */
    void atomic_granule_get(struct granule *g);

.. code-block:: C
    :caption: **Decrement a refcount value**

    /*
     * Decrements the granule refcount. Must be called with the granule
     * lock held.
     */
    void __granule_put(struct granule *g);

    /*
     * Decrements the granule refcount by `val`. Asserts if the refcount
     * would become negative. Must be called with the granule lock held.
     */
    void __granule_refcount_dec(struct granule *g, unsigned long val);

    /* Atomically decrements the reference counter of the granule. */
    void atomic_granule_put(struct granule *g);

    /*
     * Atomically decrements the reference counter of the granule.
     * Stores to memory with RELEASE semantics.
     */
    void atomic_granule_put_release(struct granule *g);

.. code-block:: C
    :caption: **Directly access refcount value**

    /*
     * Directly reads/writes the refcount variable. Must be called with
     * the granule lock held.
     */
    granule->refcount;
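
As an example, the |REC| reference-count rules described in
:ref:`locking_impl` could be implemented as sketched below. This is a sketch
only; `g_rec` is assumed to point at the candidate |REC| granule and
`enter_realm()` is a hypothetical stand-in for the real realm entry path:

.. code-block:: C

    /* Take the running-REC reference under the lock (refcount 0 -> 1). */
    granule_lock(g_rec, GRANULE_STATE_REC);
    if (granule_refcount_read_relaxed(g_rec) != 0UL) {
        granule_unlock(g_rec);
        return -ERROR; /* REC is already running on another PE */
    }
    __granule_get(g_rec);
    granule_unlock(g_rec);

    enter_realm(g_rec); /* hypothetical: run the virtual PE */

    /*
     * Release the reference lock-free with RELEASE semantics, so that
     * all prior accesses to the REC are visible before the REC appears
     * not to be running.
     */
    atomic_granule_put_release(g_rec);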

.. _locking_guidelines:

Guidelines
-----------

In order to meet the :ref:`locking_reqs` discussed above, this section
stipulates some locking and lock-free algorithm implementation guidelines for
developers.

Mutual Exclusion
*****************

The spinlock, acquire/release and atomic operations provide trivial mutual
exclusion implementations for the |RMM|. However, the following general
guidelines should be taken into consideration:

  - Appropriate deadlock avoidance techniques should be incorporated when
    using multiple locks.

  - Lock-free access to shared resources should be atomic.

  - Memory ordering constraints should be used prudently to avoid
    performance degradation. For example, on an unlocked granule (e.g. a
    REC), if there are memory operations associated with the granule prior
    to the refcount update, then the update should be done with release
    semantics. However, if there are no associated memory accesses to the
    granule prior to the refcount update, then release semantics are not
    required.


Deadlock Avoidance
******************

Deadlock avoidance is provided by defining a partial order on all objects in
the system, where the locking operation will eventually fail if the caller
tries to acquire a lock on an object in a different state than expected. This
means that no two processes will be expected to acquire locks in a different
order than the defined partial order, and we can rely on the same reasoning
for deadlock avoidance as shown by Dijkstra [EWD625]_.

To establish this partial order, the objects referenced by the |RMM| can be
classified into two categories:

#. **External**: A granule state belongs to the `external` class iff *any*
   parameter in *any* RMI command is an address of a granule which is expected
   to be in that state. The following granule states are `external`:

   - GRANULE_STATE_NS
   - GRANULE_STATE_DELEGATED
   - GRANULE_STATE_RD
   - GRANULE_STATE_REC
   - DEV_GRANULE_STATE_NS
   - DEV_GRANULE_STATE_DELEGATED

#. **Internal**: A granule state belongs to the `internal` class iff it is not
   `external`. These are objects which are referenced from another object
   after that object is locked. Each `internal` object should be referenced
   from exactly one place. The following granule states are `internal`:

   - GRANULE_STATE_RTT
   - GRANULE_STATE_DATA
   - DEV_GRANULE_STATE_MAPPED

We now state the locking guidelines for the |RMM| as:

#. Granules expected to be in an `external` state must be locked before
   locking any granules in an `internal` state (see the sketch after this
   list).

#. Granules expected to be in an `external` state must be locked in order of
   their physical address, starting with the lowest address.

#. Memory granules expected to be in an `external` state must be locked before
   locking any device memory granules in an `external` state.

#. Once a granule expected to be in an `external` state has been locked, its
   state must be checked against the expected state. If these do not match,
   the granule must be unlocked and no further granules may be locked within
   the currently-executing RMM command.

#. Granules in an `internal` state must be locked in order of state:

   - `RTT`
   - `DATA`

#. Granules in the same `internal` state must be locked in the
   :ref:`locking_impl` defined order for that specific state.

#. A granule's state can be changed iff the granule is locked and the
   reference count is zero.
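
The sketch below illustrates guidelines 1, 5 and 6 for an operation that maps
a DATA granule into a realm. It is illustrative only: error handling for the
internal granules is elided, the addresses are assumed to come from RMI
parameters, and `rtt_walk_and_lock()` is a hypothetical helper for the
root-to-leaf RTT walk:

.. code-block:: C

    struct granule *g_rd, *g_rtt, *g_data;

    /* External granule first: the RD address supplied by the caller. */
    g_rd = find_lock_granule(rd_addr, GRANULE_STATE_RD);
    if (g_rd == NULL) {
        return -ERROR;
    }

    /*
     * Internal granules next, in order of state: first the RTT, walked
     * and locked from root to leaf ...
     */
    g_rtt = rtt_walk_and_lock(g_rd, map_addr); /* hypothetical */

    /* ... then the DATA granule referenced by the locked RTT entry. */
    g_data = find_lock_granule(data_addr, GRANULE_STATE_DATA);

    /* ... perform the mapping ... */

    granule_unlock(g_data);
    granule_unlock(g_rtt);
    granule_unlock(g_rd);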

Starvation Avoidance
********************

Currently, the lock-free implementation of RMI.REC.Enter provides starvation
avoidance in the |RMM|. For the locking implementation, however, starvation
avoidance is yet to be accomplished. It can be added by means of a ticket or
MCS style locking implementation [MCS]_.
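
For reference, a ticket lock can be sketched with C11 atomics as shown below.
This is an illustrative example of the technique, not |RMM| code; waiters are
served in FIFO order of the tickets they draw, which bounds waiting time and
therefore avoids starvation:

.. code-block:: C

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next;   /* next ticket to hand out */
        atomic_uint owner;  /* ticket currently being served */
    } ticketlock_t;

    static void ticketlock_acquire(ticketlock_t *l)
    {
        /* Draw a ticket, then spin until that ticket is served. */
        unsigned int t = atomic_fetch_add_explicit(&l->next, 1U,
                                                   memory_order_relaxed);

        while (atomic_load_explicit(&l->owner,
                                    memory_order_acquire) != t) {
            /* spin */
        }
    }

    static void ticketlock_release(ticketlock_t *l)
    {
        /*
         * Only the lock holder updates `owner`, so a relaxed read
         * followed by a release store is sufficient.
         */
        unsigned int cur = atomic_load_explicit(&l->owner,
                                                memory_order_relaxed);
        atomic_store_explicit(&l->owner, cur + 1U,
                              memory_order_release);
    }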

Nested Critical Sections
************************

Spinlocks provide support for nested critical sections. Processes can acquire
multiple spinlocks at the same time, as long as the locking order is not
violated.

References
----------

.. [EWD310] Dijkstra, E.W. Hierarchical ordering of sequential processes.
   EWD 310.

.. [EWD625] Dijkstra, E.W. Two starvation free solutions to a general
   exclusion problem. EWD 625.

.. [MCS] Mellor-Crummey, John M. and Scott, Michael L. Algorithms for scalable
   synchronization on shared-memory multiprocessors. ACM TOCS, Volume 9,
   Issue 1, Feb. 1991.

.. [WS2001] Stallings, W. (2001). Operating systems: Internals and design
   principles. Upper Saddle River, N.J: Prentice Hall.