<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
				3	<html>
				4	<head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
				5	<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
				6
				7	<p>December 18, 2016</p>
				8	<p>This article was contributed by Paul E. McKenney</p>
				9
				10	<h3>Introduction</h3>
				11
				12	This document describes RCU's major data structures and their relationship
				13	to each other.
				14
				15	<ol>
				16	<li> <a href="#Data-Structure Relationships">
				17	Data-Structure Relationships</a>
				18	<li> <a href="#The rcu_state Structure">
				19	The <tt>rcu_state</tt> Structure</a>
				20	<li> <a href="#The rcu_node Structure">
				21	The <tt>rcu_node</tt> Structure</a>
				22	<li> <a href="#The rcu_segcblist Structure">
				23	The <tt>rcu_segcblist</tt> Structure</a>
				24	<li> <a href="#The rcu_data Structure">
				25	The <tt>rcu_data</tt> Structure</a>
				26	<li> <a href="#The rcu_dynticks Structure">
				27	The <tt>rcu_dynticks</tt> Structure</a>
				28	<li> <a href="#The rcu_head Structure">
				29	The <tt>rcu_head</tt> Structure</a>
				30	<li> <a href="#RCU-Specific Fields in the task_struct Structure">
				31	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
				32	<li> <a href="#Accessor Functions">
				33	Accessor Functions</a>
				34	</ol>
				35
				36	<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
				37
				38	<p>RCU is for all intents and purposes a large state machine, and its
				39	data structures maintain the state in such a way as to allow RCU readers
				40	to execute extremely quickly, while also processing the RCU grace periods
				41	requested by updaters in an efficient and extremely scalable fashion.
				42	The efficiency and scalability of RCU updaters is provided primarily
				43	by a combining tree, as shown below:
				44
				45	</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
				46
				47	</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
				48	containing a tree of <tt>rcu_node</tt> structures.
				49	Each leaf node of the <tt>rcu_node</tt> tree has up to 16
				50	<tt>rcu_data</tt> structures associated with it, so that there
are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures,
one for each possible CPU.
This structure is adjusted at boot time, if needed, to handle the
common case where <tt>nr_cpu_ids</tt> is much less than
<tt>NR_CPUS</tt>.
For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
				57	which results in a three-level <tt>rcu_node</tt> tree.
				58	If the actual hardware has only 16 CPUs, RCU will adjust itself
				59	at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
				60
				61	</p><p>The purpose of this combining tree is to allow per-CPU events
				62	such as quiescent states, dyntick-idle transitions,
				63	and CPU hotplug operations to be processed efficiently
				64	and scalably.
				65	Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
				66	and other events are recorded by the leaf-level <tt>rcu_node</tt>
				67	structures.
				68	All of these events are combined at each level of the tree until finally
				69	grace periods are completed at the tree's root <tt>rcu_node</tt>
				70	structure.
				71	A grace period can be completed at the root once every CPU
				72	(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
				73	has passed through a quiescent state.
Once a grace period has completed, a record of that fact is propagated
back down the tree.
				76
				77	</p><p>As can be seen from the diagram, on a 64-bit system
				78	a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
				79	of 64 at the root and a fanout of 16 at the leaves.
				80
				81	<table>
				82	<tr><th> </th></tr>
				83	<tr><th align="left">Quick Quiz:</th></tr>
				84	<tr><td>
				85	Why isn't the fanout at the leaves also 64?
				86	</td></tr>
				87	<tr><th align="left">Answer:</th></tr>
				88	<tr><td bgcolor="#ffffff"><font color="ffffff">
				89	Because there are more types of events that affect the leaf-level
				90	<tt>rcu_node</tt> structures than further up the tree.
Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
64, the contention on these structures' <tt>->lock</tt> fields
becomes excessive.
				94	Experimentation on a wide variety of systems has shown that a fanout
				95	of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
				96	</font>
				97
				98	<p><font color="ffffff">Of course, further experience with
				99	systems having hundreds or thousands of CPUs may demonstrate
				100	that the fanout for the non-leaf <tt>rcu_node</tt> structures
				101	must also be reduced.
				102	Such reduction can be easily carried out when and if it proves
				103	necessary.
				104	In the meantime, if you are using such a system and running into
				105	contention problems on the non-leaf <tt>rcu_node</tt> structures,
				106	you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
				107	parameter to reduce the non-leaf fanout as needed.
				108	</font>
				109
				110	<p><font color="ffffff">Kernels built for systems with
				111	strong NUMA characteristics might also need to adjust
				112	<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
				113	<tt>rcu_node</tt> structures align with hardware boundaries.
				114	However, there has thus far been no need for this.
				115	</font></td></tr>
				116	<tr><td> </td></tr>
				117	</table>
				118
				119	<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
				120	a 32-bit system), then RCU will automatically add more levels to the
				121	tree.
				122	For example, if you are crazy enough to build a 64-bit system with 65,536
				123	CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
				124
				125	</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
				126
				127	</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
				128	accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
				129	32-bit systems.
				130	On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
				131	as small as 2 if you wish, which would permit only 16 CPUs, which
				132	is useful for testing.
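</p><p>The capacities quoted above follow from multiplying the leaf fanout
by the interior fanout once for each additional level.
The following is a minimal sketch of that arithmetic, not code from the kernel:
the names <tt>tree_capacity()</tt> and <tt>levels_needed()</tt> are hypothetical,
and the fanout-of-2 case assumes that the leaf fanout is reduced to 2 as well.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of CPUs covered by a combining tree with
 * the given number of levels, leaf fanout, and interior fanout.
 */
unsigned long tree_capacity(int levels, unsigned long leaf_fanout,
                            unsigned long inner_fanout)
{
    unsigned long cap = leaf_fanout;
    int i;

    assert(levels >= 1);
    for (i = 1; i < levels; i++)
        cap *= inner_fanout;    /* Each extra level multiplies capacity. */
    return cap;
}

/*
 * Hypothetical helper: smallest number of levels that covers ncpus,
 * or 0 if more than max_levels would be required.
 */
int levels_needed(unsigned long ncpus, unsigned long leaf_fanout,
                  unsigned long inner_fanout, int max_levels)
{
    int levels;

    for (levels = 1; levels <= max_levels; levels++)
        if (tree_capacity(levels, leaf_fanout, inner_fanout) >= ncpus)
            return levels;
    return 0;
}
```

</p><p>With a leaf fanout of 16 and an interior fanout of 64,
<tt>levels_needed(4096, 16, 64, 4)</tt> evaluates to 3 and
<tt>levels_needed(16, 16, 64, 4)</tt> evaluates to 1, matching the
<tt>NR_CPUS=4096</tt> and 16-CPU cases described earlier.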
				133
				134	</p><p>This multi-level combining tree allows us to get most of the
				135	performance and scalability
				136	benefits of partitioning, even though RCU grace-period detection is
				137	inherently a global operation.
				138	The trick here is that only the last CPU to report a quiescent state
				139	into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
				140	structure at the next level up the tree.
				141	This means that at the leaf-level <tt>rcu_node</tt> structure, only
				142	one access out of sixteen will progress up the tree.
				143	For the internal <tt>rcu_node</tt> structures, the situation is even
				144	more extreme: Only one access out of sixty-four will progress up
				145	the tree.
				146	Because the vast majority of the CPUs do not progress up the tree,
				147	the lock contention remains roughly constant up the tree.
				148	No matter how many CPUs there are in the system, at most 64 quiescent-state
				149	reports per grace period will progress all the way to the root
				150	<tt>rcu_node</tt> structure, thus ensuring that the lock contention
				151	on that root <tt>rcu_node</tt> structure remains acceptably low.
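</p><p>The "last reporter advances" rule can be illustrated with a toy model.
The sketch below is an assumption-laden simplification rather than the kernel's
implementation: it models a two-level tree with per-node counters instead of
bitmasks, omits all locking, and assumes that each CPU reports exactly once per
grace period.
All names (<tt>toy_tree</tt>, <tt>toy_tree_report_qs()</tt>, and so on) are hypothetical.

```c
#include <assert.h>

#define LEAF_FANOUT 16
#define ROOT_FANOUT 64
#define NLEAVES     ROOT_FANOUT
#define NCPUS       (LEAF_FANOUT * NLEAVES)

/* Toy two-level combining tree: one root, NLEAVES leaves. */
struct toy_tree {
    int leaf_pending[NLEAVES];  /* CPUs per leaf that have yet to report. */
    int root_pending;           /* Leaves that have yet to report. */
    int root_accesses;          /* Reports that reached the root. */
    int gp_done;                /* Nonzero once the grace period may end. */
};

/* Start a grace period: every CPU and every leaf must report. */
void toy_tree_start_gp(struct toy_tree *t)
{
    int i;

    for (i = 0; i < NLEAVES; i++)
        t->leaf_pending[i] = LEAF_FANOUT;
    t->root_pending = NLEAVES;
    t->root_accesses = 0;
    t->gp_done = 0;
}

/* Report a quiescent state for the given CPU (once per grace period). */
void toy_tree_report_qs(struct toy_tree *t, int cpu)
{
    int leaf;

    assert(cpu >= 0 && cpu < NCPUS);
    leaf = cpu / LEAF_FANOUT;
    if (--t->leaf_pending[leaf] > 0)
        return;                 /* Not the last reporter: stop at the leaf. */
    t->root_accesses++;         /* Last reporter for this leaf moves up. */
    if (--t->root_pending == 0)
        t->gp_done = 1;
}
```

</p><p>In this model, reporting for all 1,024 CPUs touches the root exactly
64 times, once per leaf, regardless of the order in which the CPUs report.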
				152
				153	</p><p>In effect, the combining tree acts like a big shock absorber,
				154	keeping lock contention under control at all tree levels regardless
				155	of the level of loading on the system.
				156
				157	</p><p>The Linux kernel actually supports multiple flavors of RCU
				158	running concurrently, so RCU builds separate data structures for each
				159	flavor.
				160	For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
				161	rcu_sched and rcu_bh, as shown below:
				162
				163	</p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
				164
				165	</p><p>Energy efficiency is increasingly important, and for that
				166	reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
				167	turns off the scheduling-clock interrupts on idle CPUs, which in
				168	turn allows those CPUs to attain deeper sleep states and to consume
				169	less energy.
				170	CPUs whose scheduling-clock interrupts have been turned off are
				171	said to be in <i>dyntick-idle mode</i>.
				172	RCU must handle dyntick-idle CPUs specially
				173	because RCU would otherwise wake up each CPU on every grace period,
				174	which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
				175	RCU uses the <tt>rcu_dynticks</tt> structure to track
				176	which CPUs are in dyntick idle mode, as shown below:
				177
				178	</p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
				179
				180	</p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
				181	for all flavors of RCU.
				182	Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
				183	CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
				184	that <tt>rcu_dynticks</tt>, as shown in the figure.
				185
				186	</p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
				187	rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
				188
				189	</p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
				190
				191	</p><p>RCU updaters wait for normal grace periods by registering
				192	RCU callbacks, either directly via <tt>call_rcu()</tt> and
friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>,
there being a separate interface per flavor of RCU)
				195	or indirectly via <tt>synchronize_rcu()</tt> and friends.
				196	RCU callbacks are represented by <tt>rcu_head</tt> structures,
				197	which are queued on <tt>rcu_data</tt> structures while they are
				198	waiting for a grace period to elapse, as shown in the following figure:
				199
				200	</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
				201
				202	</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
				203	<tt>PREEMPT_RCU</tt>'s major data structures are related.
				204	Lesser data structures will be introduced with the algorithms that
				205	make use of them.
				206
				207	</p><p>Note that each of the data structures in the above figure has
				208	its own synchronization:
				209
				210	<p><ol>
<li> Each <tt>rcu_state</tt> structure has a lock and a mutex,
				212	and some fields are protected by the corresponding root
				213	<tt>rcu_node</tt> structure's lock.
				214	<li> Each <tt>rcu_node</tt> structure has a spinlock.
				215	<li> The fields in <tt>rcu_data</tt> are private to the corresponding
				216	CPU, although a few can be read and written by other CPUs.
				217	<li> Similarly, the fields in <tt>rcu_dynticks</tt> are private
				218	to the corresponding CPU, although a few can be read by
				219	other CPUs.
				220	</ol>
				221
				222	<p>It is important to note that different data structures can have
				223	very different ideas about the state of RCU at any given time.
				224	For but one example, awareness of the start or end of a given RCU
				225	grace period propagates slowly through the data structures.
				226	This slow propagation is absolutely necessary for RCU to have good
				227	read-side performance.
				228	If this balkanized implementation seems foreign to you, one useful
				229	trick is to consider each instance of these data structures to be
				230	a different person, each having the usual slightly different
				231	view of reality.
				232
				233	</p><p>The general role of each of these data structures is as
				234	follows:
				235
				236	</p><ol>
				237	<li> <tt>rcu_state</tt>:
				238	This structure forms the interconnection between the
				239	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
				240	tracks grace periods, serves as short-term repository
				241	for callbacks orphaned by CPU-hotplug events,
				242	maintains <tt>rcu_barrier()</tt> state,
				243	tracks expedited grace-period state,
and maintains state used to force quiescent states when
grace periods extend too long.
				246	<li> <tt>rcu_node</tt>: This structure forms the combining
				247	tree that propagates quiescent-state
				248	information from the leaves to the root, and also propagates
				249	grace-period information from the root to the leaves.
				250	It provides local copies of the grace-period state in order
				251	to allow this information to be accessed in a synchronized
				252	manner without suffering the scalability limitations that
				253	would otherwise be imposed by global locking.
				254	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
				255	of tasks that have blocked while in their current
				256	RCU read-side critical section.
				257	In <tt>CONFIG_PREEMPT_RCU</tt> with
				258	<tt>CONFIG_RCU_BOOST</tt>, it manages the
				259	per-<tt>rcu_node</tt> priority-boosting
				260	kernel threads (kthreads) and state.
				261	Finally, it records CPU-hotplug state in order to determine
				262	which CPUs should be ignored during a given grace period.
				263	<li> <tt>rcu_data</tt>: This per-CPU structure is the
				264	focus of quiescent-state detection and RCU callback queuing.
				265	It also tracks its relationship to the corresponding leaf
				266	<tt>rcu_node</tt> structure to allow more-efficient
				267	propagation of quiescent states up the <tt>rcu_node</tt>
				268	combining tree.
				269	Like the <tt>rcu_node</tt> structure, it provides a local
				270	copy of the grace-period information to allow for-free
				271	synchronized
				272	access to this information from the corresponding CPU.
				273	Finally, this structure records past dyntick-idle state
				274	for the corresponding CPU and also tracks statistics.
				275	<li> <tt>rcu_dynticks</tt>:
				276	This per-CPU structure tracks the current dyntick-idle
				277	state for the corresponding CPU.
				278	Unlike the other three structures, the <tt>rcu_dynticks</tt>
				279	structure is not replicated per RCU flavor.
				280	<li> <tt>rcu_head</tt>:
				281	This structure represents RCU callbacks, and is the
				282	only structure allocated and managed by RCU users.
				283	The <tt>rcu_head</tt> structure is normally embedded
				284	within the RCU-protected data structure.
				285	</ol>
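<p>To make the last item concrete, the following userspace sketch shows a
callback structure embedded in a user-defined structure, queued on a list, and
later invoked.
It is an illustration only: <tt>toy_rcu_head</tt>, <tt>toy_cblist</tt>,
<tt>toy_call_rcu()</tt>, and <tt>toy_invoke_callbacks()</tt> are hypothetical
stand-ins, there is no grace-period detection at all, and the pointer arithmetic
in <tt>foo_reclaim()</tt> plays the role that <tt>container_of()</tt> plays in
kernel code.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for a callback header, for illustration only. */
struct toy_rcu_head {
    struct toy_rcu_head *next;
    void (*func)(struct toy_rcu_head *head);
};

/* Hypothetical callback queue (a simple singly linked list). */
struct toy_cblist {
    struct toy_rcu_head *head;
    struct toy_rcu_head **tail;
};

/* A user's data structure with the callback header embedded in it. */
struct foo {
    int key;
    struct toy_rcu_head rh;
};

int last_reclaimed_key = -1;    /* Records the most recent reclamation. */

void toy_cblist_init(struct toy_cblist *l)
{
    l->head = NULL;
    l->tail = &l->head;
}

/* Queue a callback; it runs only when toy_invoke_callbacks() is called. */
void toy_call_rcu(struct toy_cblist *l, struct toy_rcu_head *rhp,
                  void (*func)(struct toy_rcu_head *head))
{
    assert(func != NULL);
    rhp->func = func;
    rhp->next = NULL;
    *l->tail = rhp;
    l->tail = &rhp->next;
}

/* Pretend a grace period has elapsed: invoke and dequeue all callbacks. */
int toy_invoke_callbacks(struct toy_cblist *l)
{
    struct toy_rcu_head *rhp = l->head;
    int n = 0;

    toy_cblist_init(l);
    while (rhp) {
        struct toy_rcu_head *next = rhp->next;  /* func may free rhp. */

        rhp->func(rhp);
        rhp = next;
        n++;
    }
    return n;
}

/* Callback: recover the enclosing struct foo from the embedded header. */
void foo_reclaim(struct toy_rcu_head *rhp)
{
    struct foo *fp = (struct foo *)((char *)rhp - offsetof(struct foo, rh));

    last_reclaimed_key = fp->key;
    free(fp);
}
```

<p>Because the header is embedded, queuing a callback needs no separate
allocation, and the callback can find the enclosing structure from the
header's address alone.</p>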
				286
				287	<p>If all you wanted from this article was a general notion of how
				288	RCU's data structures are related, you are done.
Otherwise, each of the following sections gives more details on
				290	the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
				291	and <tt>rcu_dynticks</tt> data structures.
				292
				293	<h3><a name="The rcu_state Structure">
				294	The <tt>rcu_state</tt> Structure</a></h3>
				295
				296	<p>The <tt>rcu_state</tt> structure is the base structure that
				297	represents a flavor of RCU.
				298	This structure forms the interconnection between the
				299	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
				300	tracks grace periods, contains the lock used to
				301	synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when
grace periods extend too long.
				304
				305	</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
				306	singly and in groups, in the following sections.
				307	The more specialized fields are covered in the discussion of their
				308	use.
				309
				310	<h5>Relationship to rcu_node and rcu_data Structures</h5>
				311
				312	This portion of the <tt>rcu_state</tt> structure is declared
				313	as follows:
				314
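<pre>
struct rcu_node node[NUM_RCU_NODES];        /* The rcu_node tree, stored as a flat array. */
struct rcu_node *level[NUM_RCU_LVLS + 1];   /* First rcu_node on each level of the tree. */
struct rcu_data __percpu *rda;              /* Per-CPU rcu_data structures. */
</pre>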
				320
				321	<table>
				322	<tr><th> </th></tr>
				323	<tr><th align="left">Quick Quiz:</th></tr>
				324	<tr><td>
				325	Wait a minute!
				326	You said that the <tt>rcu_node</tt> structures formed a tree,
				327	but they are declared as a flat array!
				328	What gives?
				329	</td></tr>
				330	<tr><th align="left">Answer:</th></tr>
				331	<tr><td bgcolor="#ffffff"><font color="ffffff">
				332	The tree is laid out in the array.
The first node in the array is the head, the next set of nodes in the
				334	array are children of the head node, and so on until the last set of
				335	nodes in the array are the leaves.
				336	</font>
				337
				338	<p><font color="ffffff">See the following diagrams to see how
				339	this works.
				340	</font></td></tr>
				341	<tr><td> </td></tr>
				342	</table>
				343
				344	<p>The <tt>rcu_node</tt> tree is embedded into the
				345	<tt>->node[]</tt> array as shown in the following figure:
				346
				347	</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
				348
				349	</p><p>One interesting consequence of this mapping is that a
				350	breadth-first traversal of the tree is implemented as a simple
				351	linear scan of the array, which is in fact what the
				352	<tt>rcu_for_each_node_breadth_first()</tt> macro does.
This macro is used at the beginnings and ends of grace periods.
				354
				355	</p><p>Each entry of the <tt>->level</tt> array references
				356	the first <tt>rcu_node</tt> structure on the corresponding level
				357	of the tree, for example, as shown below:
				358
				359	</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
				360
				361	</p><p>The zero<sup>th</sup> element of the array references the root
				362	<tt>rcu_node</tt> structure, the first element references the
				363	first child of the root <tt>rcu_node</tt>, and finally the second
				364	element references the first leaf <tt>rcu_node</tt> structure.
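</p><p>The array embedding and the per-level pointers can be sketched as follows.
This is a simplified illustration under stated assumptions, not the kernel's
initialization code: the <tt>toy_</tt> names and size limits are hypothetical,
and each level is assumed to divide evenly among the nodes of the level above it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 64
#define TOY_MAX_LVLS  4

struct toy_node {
    struct toy_node *parent;    /* NULL for the root. */
    int level;                  /* Root is at level zero. */
};

struct toy_state {
    struct toy_node node[TOY_MAX_NODES];        /* Tree in level order. */
    struct toy_node *level[TOY_MAX_LVLS + 1];   /* First node per level. */
    int num_nodes;
    int num_lvls;
};

/*
 * Lay out a tree in the flat array.  levelcnt[i] is the number of nodes
 * on level i, with levelcnt[0] == 1 for the root.
 */
void toy_init_tree(struct toy_state *sp, const int *levelcnt, int num_lvls)
{
    int i, j, idx = 0;

    assert(num_lvls >= 1 && num_lvls <= TOY_MAX_LVLS);
    assert(levelcnt[0] == 1);
    for (i = 0; i <= TOY_MAX_LVLS; i++)
        sp->level[i] = NULL;
    for (i = 0; i < num_lvls; i++) {
        int fanout = 1;

        if (i > 0) {
            assert(levelcnt[i] % levelcnt[i - 1] == 0);
            fanout = levelcnt[i] / levelcnt[i - 1];
        }
        sp->level[i] = &sp->node[idx];
        for (j = 0; j < levelcnt[i]; j++, idx++) {
            assert(idx < TOY_MAX_NODES);
            sp->node[idx].level = i;
            sp->node[idx].parent =
                i == 0 ? NULL : sp->level[i - 1] + j / fanout;
        }
    }
    sp->num_nodes = idx;
    sp->num_lvls = num_lvls;
}
```

</p><p>Because the nodes are stored level by level, walking
<tt>node[0]</tt> through <tt>node[num_nodes - 1]</tt> visits the tree in
breadth-first order without following any pointers.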
				365
				366	</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
				367	rather than array-shaped, it is easy to draw a planar representation:
				368
				369	</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
				370
				371	</p><p>Finally, the <tt>->rda</tt> field references a per-CPU
				372	pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
				373
				374	</p><p>All of these fields are constant once initialization is complete,
				375	and therefore need no protection.
				376
				377	<h5>Grace-Period Tracking</h5>
				378
				379	<p>This portion of the <tt>rcu_state</tt> structure is declared
				380	as follows:
				381
<pre>
unsigned long gp_seq;
</pre>
				385
				386	<p>RCU grace periods are numbered, and
				387	the <tt>->gp_seq</tt> field contains the current grace-period
				388	sequence number.
				389	The bottom two bits are the state of the current grace period,
				390	which can be zero for not yet started or one for in progress.
				391	In other words, if the bottom two bits of <tt>->gp_seq</tt> are
				392	zero, the corresponding flavor of RCU is idle.
				393	Any other value in the bottom two bits indicates that something is broken.
				394	This field is protected by the root <tt>rcu_node</tt> structure's
				395	<tt>->lock</tt> field.
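</p><p>The split between state bits and sequence number can be sketched with a
few helpers.
This is a minimal illustration, not the kernel's code: the <tt>toy_seq_</tt>
names are hypothetical, locking is omitted, and the sketch assumes that the
sequence number in the upper bits advances when a grace period ends.

```c
#include <assert.h>

#define TOY_SEQ_CTR_SHIFT   2
#define TOY_SEQ_STATE_MASK  ((1UL << TOY_SEQ_CTR_SHIFT) - 1)

/* Sequence number held in the bits above the two state bits. */
unsigned long toy_seq_ctr(unsigned long s)
{
    return s >> TOY_SEQ_CTR_SHIFT;
}

/* State held in the bottom two bits: 0 is idle, 1 is in progress. */
unsigned long toy_seq_state(unsigned long s)
{
    return s & TOY_SEQ_STATE_MASK;
}

/* Mark a grace period as started: state goes from 0 to 1. */
void toy_seq_start(unsigned long *sp)
{
    assert(toy_seq_state(*sp) == 0);
    *sp += 1;
}

/* Mark a grace period as ended: clear the state, advance the number. */
void toy_seq_end(unsigned long *sp)
{
    assert(toy_seq_state(*sp) == 1);
    *sp = (*sp | TOY_SEQ_STATE_MASK) + 1;
}
```

</p><p>Starting from zero, a start followed by an end takes the value
through 1 and then to 4, that is, sequence number 1 with idle state.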
				396
				397	</p><p>There are <tt>->gp_seq</tt> fields
				398	in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
				399	as well.
The field in the <tt>rcu_state</tt> structure represents the
most current value, and those of the other structures are compared
against it in order to detect the beginnings and ends of grace periods
in a distributed fashion.
				404	The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
				405	(down the tree from the root to the leaves) to <tt>rcu_data</tt>.
				406
				407	<h5>Miscellaneous</h5>
				408
				409	<p>This portion of the <tt>rcu_state</tt> structure is declared
				410	as follows:
				411
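<pre>
unsigned long gp_max;   /* Longest grace-period duration, in jiffies. */
char abbr;              /* One-character abbreviation of the flavor name. */
char *name;             /* Name of the RCU flavor. */
</pre>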
				417
				418	<p>The <tt>->gp_max</tt> field tracks the duration of the longest
				419	grace period in jiffies.
				420	It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.
				421
				422	<p>The <tt>->name</tt> field points to the name of the RCU flavor
				423	(for example, “rcu_sched”), and is constant.
				424	The <tt>->abbr</tt> field contains a one-character abbreviation,
				425	for example, “s” for RCU-sched.
				426
				427	<h3><a name="The rcu_node Structure">
				428	The <tt>rcu_node</tt> Structure</a></h3>
				429
				430	<p>The <tt>rcu_node</tt> structures form the combining
				431	tree that propagates quiescent-state
information from the leaves to the root and that also propagates
grace-period information from the root down to the leaves.
They provide local copies of the grace-period state in order
				435	to allow this information to be accessed in a synchronized
				436	manner without suffering the scalability limitations that
				437	would otherwise be imposed by global locking.
				438	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
				439	of tasks that have blocked while in their current
				440	RCU read-side critical section.
				441	In <tt>CONFIG_PREEMPT_RCU</tt> with
				442	<tt>CONFIG_RCU_BOOST</tt>, they manage the
				443	per-<tt>rcu_node</tt> priority-boosting
				444	kernel threads (kthreads) and state.
				445	Finally, they record CPU-hotplug state in order to determine
				446	which CPUs should be ignored during a given grace period.
				447
				448	</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
				449	singly and in groups, in the following sections.
				450
				451	<h5>Connection to Combining Tree</h5>
				452
				453	<p>This portion of the <tt>rcu_node</tt> structure is declared
				454	as follows:
				455
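<pre>
struct rcu_node *parent;   /* NULL for the root. */
u8 level;                  /* Root is at level zero. */
u8 grpnum;                 /* Position among the parent's children. */
unsigned long grpmask;     /* Bitmask counterpart of grpnum: exactly one bit set. */
int grplo;                 /* Lowest-numbered CPU served by this node. */
int grphi;                 /* Highest-numbered CPU served by this node. */
</pre>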
				464
				465	<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
				466	one level up in the tree, and is <tt>NULL</tt> for the root
				467	<tt>rcu_node</tt>.
				468	The RCU implementation makes heavy use of this field to push quiescent
				469	states up the tree.
				470	The <tt>->level</tt> field gives the level in the tree, with
				471	the root being at level zero, its children at level one, and so on.
				472	The <tt>->grpnum</tt> field gives this node's position within
				473	the children of its parent, so this number can range between 0 and 31
				474	on 32-bit systems and between 0 and 63 on 64-bit systems.
				475	The <tt>->level</tt> and <tt>->grpnum</tt> fields are
				476	used only during initialization and for tracing.
				477	The <tt>->grpmask</tt> field is the bitmask counterpart of
				478	<tt>->grpnum</tt>, and therefore always has exactly one bit set.
				479	This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
				480	structure in its parent's bitmasks, which are described later.
				481	Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
				482	contain the lowest and highest numbered CPU served by this
				483	<tt>rcu_node</tt> structure, respectively.
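</p><p>The relationship among these fields can be sketched as follows.
This is an illustration rather than the kernel's initialization code:
the <tt>toy_</tt> names are hypothetical, a single <tt>qsmask</tt> field stands
in for the parent's bitmasks, and the CPU-range helper assumes that CPUs are
assigned to leaves contiguously and in order.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *parent;    /* NULL for the root. */
    unsigned char grpnum;       /* Position among the parent's children. */
    unsigned long grpmask;      /* Exactly one bit set: 1UL << grpnum. */
    unsigned long qsmask;       /* Stand-in for the parent-side bitmasks. */
    int grplo;                  /* Lowest-numbered CPU served. */
    int grphi;                  /* Highest-numbered CPU served. */
};

/* Record a node's position under its parent and derive its bitmask. */
void toy_node_set_position(struct toy_node *np, struct toy_node *parent,
                           unsigned int grpnum)
{
    assert(grpnum < sizeof(unsigned long) * CHAR_BIT);
    np->parent = parent;
    np->grpnum = (unsigned char)grpnum;
    np->grpmask = 1UL << grpnum;
}

/* Assumed contiguous CPU numbering: leaf i serves one block of CPUs. */
void toy_leaf_set_cpu_range(struct toy_node *leaf, int leaf_index,
                            int leaf_fanout, int ncpus)
{
    assert(leaf_index >= 0 && leaf_fanout > 0);
    assert(leaf_index * leaf_fanout < ncpus);
    leaf->grplo = leaf_index * leaf_fanout;
    leaf->grphi = leaf->grplo + leaf_fanout - 1;
    if (leaf->grphi >= ncpus)
        leaf->grphi = ncpus - 1;    /* Last leaf may be partially filled. */
}

/* Clear this node's bit in its parent's mask, if it has a parent. */
void toy_node_clear_in_parent(struct toy_node *np)
{
    if (np->parent != NULL)
        np->parent->qsmask &= ~np->grpmask;
}
```

</p><p>Precomputing <tt>->grpmask</tt> means that clearing a child's bit in
its parent is a single AND operation, with no shift needed at run time.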
				484
				485	</p><p>All of these fields are constant, and thus do not require any
				486	synchronization.
				487
				488	<h5>Synchronization</h5>
				489
				490	<p>This field of the <tt>rcu_node</tt> structure is declared
				491	as follows:
				492
<pre>
raw_spinlock_t lock;
</pre>
				496
				497	<p>This field is used to protect the remaining fields in this structure,
				498	unless otherwise stated.
				499	That said, all of the fields in this structure can be accessed without
				500	locking for tracing purposes.
				501	Yes, this can result in confusing traces, but better some tracing confusion
				502	than to be heisenbugged out of existence.
				503
				504	<h5>Grace-Period Tracking</h5>
				505
				506	<p>This portion of the <tt>rcu_node</tt> structure is declared
				507	as follows:
				508
<pre>
unsigned long gp_seq;          /* This node's view of the grace-period sequence number. */
unsigned long gp_seq_needed;   /* Furthest-in-the-future grace period requested. */
</pre>
				513
				514	<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are
				515	the counterparts of the field of the same name in the <tt>rcu_state</tt>
				516	structure.
				517	They each may lag up to one step behind their <tt>rcu_state</tt>
				518	counterpart.
If the bottom two bits of a given <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt>
structure believes that RCU is idle.
</p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt>
				523	structure is updated at the beginning and the end
				524	of each grace period.
				525
				526	<p>The <tt>->gp_seq_needed</tt> fields record the
				527	furthest-in-the-future grace period request seen by the corresponding
				528	<tt>rcu_node</tt> structure. The request is considered fulfilled when
				529	the value of the <tt>->gp_seq</tt> field equals or exceeds that of
				530	the <tt>->gp_seq_needed</tt> field.
				531
				532	<table>
				533	<tr><th> </th></tr>
				534	<tr><th align="left">Quick Quiz:</th></tr>
				535	<tr><td>
				536	Suppose that this <tt>rcu_node</tt> structure doesn't see
				537	a request for a very long time.
				538	Won't wrapping of the <tt>->gp_seq</tt> field cause
				539	problems?
				540	</td></tr>
				541	<tr><th align="left">Answer:</th></tr>
				542	<tr><td bgcolor="#ffffff"><font color="ffffff">
				543	No, because if the <tt>->gp_seq_needed</tt> field lags behind the
				544	<tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field
				545	will be updated at the end of the grace period.
				546	Modulo-arithmetic comparisons therefore will always get the
				547	correct answer, even with wrapping.
				548	</font></td></tr>
				549	<tr><td> </td></tr>
				550	</table>
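<p>The "equals or exceeds" test must tolerate counter wrap, which is what
a modulo-arithmetic comparison provides.
The following sketch shows one common way to write such a comparison on
<tt>unsigned long</tt> values.
The function names are hypothetical, and the sketch assumes that the two values
being compared are never more than half the counter range apart.</p>

```c
#include <assert.h>
#include <limits.h>

/*
 * True if a is at or after b in wrapping (modular) order.  Unsigned
 * subtraction wraps, so a small difference means "a is not behind b".
 */
int toy_seq_ge(unsigned long a, unsigned long b)
{
    return ULONG_MAX / 2 >= a - b;
}

/* A request is fulfilled once gp_seq has reached gp_seq_needed. */
int toy_gp_request_fulfilled(unsigned long gp_seq,
                             unsigned long gp_seq_needed)
{
    return toy_seq_ge(gp_seq, gp_seq_needed);
}
```

<p>For example, a <tt>gp_seq</tt> of 4 that has just wrapped past zero still
compares as later than a <tt>gp_seq_needed</tt> of <tt>ULONG_MAX - 3</tt>,
whereas a plain <tt>>=</tt> comparison would give the opposite answer.</p>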
				551
				552	<h5>Quiescent-State Tracking</h5>
				553
				554	<p>These fields manage the propagation of quiescent states up the
				555	combining tree.
				556
				557	</p><p>This portion of the <tt>rcu_node</tt> structure has fields
				558	as follows:
				559
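<pre>
unsigned long qsmask;        /* Children yet to report for the normal grace period. */
unsigned long expmask;       /* Children yet to report for the expedited grace period. */
unsigned long qsmaskinit;    /* Initial value for qsmask. */
unsigned long expmaskinit;   /* Initial value for expmask. */
</pre>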
				566
				567	<p>The <tt>->qsmask</tt> field tracks which of this
				568	<tt>rcu_node</tt> structure's children still need to report
				569	quiescent states for the current normal grace period.
				570	Such children will have a value of 1 in their corresponding bit.
				571	Note that the leaf <tt>rcu_node</tt> structures should be
				572	thought of as having <tt>rcu_data</tt> structures as their
				573	children.
				574	Similarly, the <tt>->expmask</tt> field tracks which
				575	of this <tt>rcu_node</tt> structure's children still need to report
				576	quiescent states for the current expedited grace period.
				577	An expedited grace period has
				578	the same conceptual properties as a normal grace period, but the
				579	expedited implementation accepts extreme CPU overhead to obtain
				580	much lower grace-period latency, for example, consuming a few
				581	tens of microseconds worth of CPU time to reduce grace-period
				582	duration from milliseconds to tens of microseconds.
				583	The <tt>->qsmaskinit</tt> field tracks which of this
				584	<tt>rcu_node</tt> structure's children cover for at least
				585	one online CPU.
This mask is used to initialize <tt>->qsmask</tt>,
and <tt>->expmaskinit</tt> is used to initialize
<tt>->expmask</tt>, at the beginning of the
normal and expedited grace periods, respectively.
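</p><p>The way <tt>->qsmask</tt> is initialized and then cleared bit by bit
can be sketched as follows.
This is a toy model under stated assumptions, not the kernel's implementation:
the <tt>toy_</tt> names are hypothetical, only normal grace periods are modeled,
and the locking that protects these masks is omitted entirely.</p>

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *parent;    /* NULL for the root. */
    unsigned long grpmask;      /* This node's bit in its parent's masks. */
    unsigned long qsmask;       /* Children yet to report this period. */
    unsigned long qsmaskinit;   /* Children covering an online CPU. */
};

/* Start of a grace period: every participating child must report. */
void toy_gp_init(struct toy_node *np)
{
    np->qsmask = np->qsmaskinit;
}

/*
 * Report a quiescent state for the child identified by mask.  The report
 * moves up one level only when the last pending bit at a node is cleared.
 * Returns 1 if the root's mask became zero, so that the grace period can
 * end, and 0 otherwise.
 */
int toy_report_qs(struct toy_node *np, unsigned long mask)
{
    assert(np != NULL);
    for (;;) {
        if (!(np->qsmask & mask))
            return 0;           /* Already reported, or not needed. */
        np->qsmask &= ~mask;
        if (np->qsmask != 0)
            return 0;           /* Other children are still pending. */
        if (np->parent == NULL)
            return 1;           /* Root is clear. */
        mask = np->grpmask;     /* Become a report from this node. */
        np = np->parent;
    }
}
```

<p>At a leaf, <tt>mask</tt> identifies a CPU, reflecting the convention that
the <tt>rcu_data</tt> structures are the children of the leaf
<tt>rcu_node</tt> structures.</p>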
				590
				591	<table>
				592	<tr><th> </th></tr>
				593	<tr><th align="left">Quick Quiz:</th></tr>
				594	<tr><td>
				595	Why are these bitmasks protected by locking?
				596	Come on, haven't you heard of atomic instructions???
				597	</td></tr>
				598	<tr><th align="left">Answer:</th></tr>
				599	<tr><td bgcolor="#ffffff"><font color="ffffff">
				600	Lockless grace-period computation! Such a tantalizing possibility!
				601	</font>
				602
				603	<p><font color="ffffff">But consider the following sequence of events:
				604	</font>
				605
				606	<ol>
				607	<li> <font color="ffffff">CPU 0 has been in dyntick-idle
				608	mode for quite some time.
				609	When it wakes up, it notices that the current RCU
				610	grace period needs it to report in, so it sets a
				611	flag where the scheduling clock interrupt will find it.
				612	</font><p>
				613	<li> <font color="ffffff">Meanwhile, CPU 1 is running
				614	<tt>force_quiescent_state()</tt>,
				615	and notices that CPU 0 has been in dyntick idle mode,
				616	which qualifies as an extended quiescent state.
				617	</font><p>
				618	<li> <font color="ffffff">CPU 0's scheduling clock
				619	interrupt fires in the
				620	middle of an RCU read-side critical section, and notices
				621	that the RCU core needs something, so commences RCU softirq
				622	processing.
				623	</font>
				624	<p>
				625	<li> <font color="ffffff">CPU 0's softirq handler
				626	executes and is just about ready
				627	to report its quiescent state up the <tt>rcu_node</tt>
				628	tree.
				629	</font><p>
				630	<li> <font color="ffffff">But CPU 1 beats it to the punch,
				631	completing the current
				632	grace period and starting a new one.
				633	</font><p>
				634	<li> <font color="ffffff">CPU 0 now reports its quiescent
				635	state for the wrong
				636	grace period.
				637	That grace period might now end before the RCU read-side
				638	critical section.
				639	If that happens, disaster will ensue.
				640	</font>
				641	</ol>
				642
				643	<p><font color="ffffff">So the locking is absolutely required in
				644	order to coordinate clearing of the bits with updating of the
				645	grace-period sequence number in <tt>->gp_seq</tt>.
				646	</font></td></tr>
				647	<tr><td> </td></tr>
				648	</table>
				649
				650	<h5>Blocked-Task Management</h5>
				651
				652	<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
				653	midst of their RCU read-side critical sections, and these tasks
				654	must be tracked explicitly.
				655	The details of exactly why and how they are tracked will be covered
				656	in a separate article on RCU read-side processing.
				657	For now, it is enough to know that the <tt>rcu_node</tt>
				658	structure tracks them.
				659
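<pre>
struct list_head blkd_tasks;   /* Tasks blocked or preempted within a read-side critical section. */
struct list_head *gp_tasks;    /* First task blocking the normal grace period, or NULL. */
struct list_head *exp_tasks;   /* First task blocking the expedited grace period, or NULL. */
bool wait_blkd_tasks;
</pre>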
				666
				667	<p>The <tt>->blkd_tasks</tt> field is a list header for
				668	the list of blocked and preempted tasks.
				669	As tasks undergo context switches within RCU read-side critical
				670	sections, their <tt>task_struct</tt> structures are enqueued
				671	(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
				672	field) onto the head of the <tt>->blkd_tasks</tt> list for the
				673	leaf <tt>rcu_node</tt> structure corresponding to the CPU
				674	on which the outgoing context switch executed.
				675	As these tasks later exit their RCU read-side critical sections,
				676	they remove themselves from the list.
				677	This list is therefore in reverse time order, so that if one of the tasks
				678	is blocking the current grace period, all subsequent tasks must
				679	also be blocking that same grace period.
				680	Therefore, a single pointer into this list suffices to track
				681	all tasks blocking a given grace period.
				682	That pointer is stored in <tt>->gp_tasks</tt> for normal
				683	grace periods and in <tt>->exp_tasks</tt> for expedited
				684	grace periods.
				685	These last two fields are <tt>NULL</tt> if either there is
				686	no grace period in flight or if there are no blocked tasks
				687	preventing that grace period from completing.
				688	If either of these two pointers is referencing a task that
				689	removes itself from the <tt>->blkd_tasks</tt> list,
				690	then that task must advance the pointer to the next task on
				691	the list, or set the pointer to <tt>NULL</tt> if there
				692	are no subsequent tasks on the list.
				693
				694	</p><p>For example, suppose that tasks T1, T2, and T3 are
				695	all hard-affinitied to the largest-numbered CPU in the system.
				696	Then if task T1 blocked in an RCU read-side
				697	critical section, then an expedited grace period started,
				698	then task T2 blocked in an RCU read-side critical section,
				699	then a normal grace period started, and finally task T3 blocked
				700	in an RCU read-side critical section, then the state of the
				701	last leaf <tt>rcu_node</tt> structure's blocked-task list
				702	would be as shown below:
				703
				704	</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
				705
				706	</p><p>Task T1 is blocking both grace periods, task T2 is
				707	blocking only the normal grace period, and task T3 is blocking
				708	neither grace period.
				709	Note that these tasks will not remove themselves from this list
				710	immediately upon resuming execution.
				711	They will instead remain on the list until they execute the outermost
				712	<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
				713	section.
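<p>The head-insertion rule and the pointer-advance rule can be sketched in ordinary userspace C.
This is only an illustrative model: the <tt>blkd_task</tt> and <tt>blkd_list</tt> types and the <tt>blkd_*()</tt> helpers are hypothetical stand-ins for the <tt>task_struct</tt> and <tt>rcu_node</tt> fields, locking is omitted, and starting a grace period is assumed to do nothing more than point at the current head of the list.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct's ->rcu_node_entry linkage. */
struct blkd_task {
	const char *name;
	struct blkd_task *prev;	/* toward the head (blocked more recently) */
	struct blkd_task *next;	/* toward the tail (blocked earlier) */
};

/* Hypothetical stand-in for the rcu_node blocked-task fields. */
struct blkd_list {
	struct blkd_task *head;		/* models ->blkd_tasks */
	struct blkd_task *gp_tasks;	/* first task blocking the normal GP */
	struct blkd_task *exp_tasks;	/* first task blocking the expedited GP */
};

/* A task that blocks within a read-side critical section goes on the head. */
static void blkd_enqueue(struct blkd_list *l, struct blkd_task *t)
{
	assert(t->prev == NULL && t->next == NULL);
	t->next = l->head;
	if (l->head)
		l->head->prev = t;
	l->head = t;
}

/* Assumption: a new grace period is blocked by every task now on the list. */
static void blkd_start_gp(struct blkd_list *l)
{
	l->gp_tasks = l->head;
}

static void blkd_start_exp(struct blkd_list *l)
{
	l->exp_tasks = l->head;
}

/* Outermost rcu_read_unlock(): advance any pointer that references us. */
static void blkd_dequeue(struct blkd_list *l, struct blkd_task *t)
{
	if (l->gp_tasks == t)
		l->gp_tasks = t->next;	/* NULL if there is no subsequent task */
	if (l->exp_tasks == t)
		l->exp_tasks = t->next;
	if (t->prev)
		t->prev->next = t->next;
	else
		l->head = t->next;
	if (t->next)
		t->next->prev = t->prev;
	t->prev = t->next = NULL;
}
```

<p>Replaying the example (enqueue T1, start the expedited grace period, enqueue T2, start the normal grace period, enqueue T3) leaves the model's <tt>gp_tasks</tt> referencing T2 and its <tt>exp_tasks</tt> referencing T1, with T3 at the head of the list and referenced by neither pointer.</p>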
				714
				715	<p>
				716	The <tt>->wait_blkd_tasks</tt> field indicates whether or not
				717	the current grace period is waiting on a blocked task.
				718
				719	<h5>Sizing the <tt>rcu_node</tt> Array</h5>
				720
				721	<p>The <tt>rcu_node</tt> array is sized via a series of
				722	C-preprocessor expressions as follows:
				723
				724	<pre>
				725	1 #ifdef CONFIG_RCU_FANOUT
				726	2 #define RCU_FANOUT CONFIG_RCU_FANOUT
				727	3 #else
				728	4 # ifdef CONFIG_64BIT
				729	5 # define RCU_FANOUT 64
				730	6 # else
				731	7 # define RCU_FANOUT 32
				732	8 # endif
				733	9 #endif
				734	10
				735	11 #ifdef CONFIG_RCU_FANOUT_LEAF
				736	12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
				737	13 #else
				738	14 # ifdef CONFIG_64BIT
				739	15 # define RCU_FANOUT_LEAF 64
				740	16 # else
				741	17 # define RCU_FANOUT_LEAF 32
				742	18 # endif
				743	19 #endif
				744	20
				745	21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
				746	22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
				747	23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
				748	24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
				749	25
				750	26 #if NR_CPUS <= RCU_FANOUT_1
				751	27 # define RCU_NUM_LVLS 1
				752	28 # define NUM_RCU_LVL_0 1
				753	29 # define NUM_RCU_NODES NUM_RCU_LVL_0
				754	30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
				755	31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
				756	32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
				757	33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
				758	34 #elif NR_CPUS <= RCU_FANOUT_2
				759	35 # define RCU_NUM_LVLS 2
				760	36 # define NUM_RCU_LVL_0 1
				761	37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
				762	38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
				763	39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
				764	40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
				765	41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
				766	42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
				767	43 #elif NR_CPUS <= RCU_FANOUT_3
				768	44 # define RCU_NUM_LVLS 3
				769	45 # define NUM_RCU_LVL_0 1
				770	46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
				771	47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
				772	48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
				773	49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
				774	50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
				775	51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
				776	52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
				777	53 #elif NR_CPUS <= RCU_FANOUT_4
				778	54 # define RCU_NUM_LVLS 4
				779	55 # define NUM_RCU_LVL_0 1
				780	56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
				781	57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
				782	58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
				783	59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
				784	60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
				785	61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
				786	62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
				787	63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
				788	64 #else
				789	65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
				790	66 #endif
				791	</pre>
				792
				793	<p>The maximum number of levels in the <tt>rcu_node</tt> structure
				794	is currently limited to four, as specified by lines 21-24
				795	and the structure of the subsequent “if” statement.
				796	For 32-bit systems, this allows 16 * 32 * 32 * 32 = 524,288 CPUs, which
				797	should be sufficient for the next few years at least.
				798	For 64-bit systems, 16 * 64 * 64 * 64 = 4,194,304 CPUs are allowed, which
				799	should see us through the next decade or so.
				800	This four-level tree also allows kernels built with
				801	<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
				802	which might be useful in very large systems having eight CPUs per
				803	socket (but please note that no one has yet shown any measurable
				804	performance degradation due to misaligned socket and <tt>rcu_node</tt>
				805	boundaries).
				806	In addition, building kernels with a full four levels of <tt>rcu_node</tt>
				807	tree permits better testing of RCU's combining-tree code.
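<p>The capacity arithmetic can be checked with a few lines of C.
The helper below is a hypothetical illustration rather than kernel code.
It multiplies a leaf fanout by the interior fanout once per additional level, mirroring <tt>RCU_FANOUT_1</tt> through <tt>RCU_FANOUT_4</tt>.
The 4096-CPU figure is assumed here to correspond to a leaf fanout of eight as well as <tt>CONFIG_RCU_FANOUT=8</tt>.</p>

```c
#include <assert.h>

/*
 * Maximum number of CPUs covered by an rcu_node tree having the given
 * number of levels (1-4), leaf fanout, and interior fanout.
 */
static unsigned long rcu_tree_capacity(unsigned long fanout_leaf,
				       unsigned long fanout, int levels)
{
	unsigned long cpus = fanout_leaf;	/* RCU_FANOUT_1 */
	int i;

	assert(levels >= 1 && levels <= 4);
	for (i = 1; i < levels; i++)
		cpus *= fanout;			/* RCU_FANOUT_2..RCU_FANOUT_4 */
	return cpus;
}
```

<p>With a leaf fanout of 16, <tt>rcu_tree_capacity(16, 32, 4)</tt> gives 524,288 and <tt>rcu_tree_capacity(16, 64, 4)</tt> gives 4,194,304, while <tt>rcu_tree_capacity(8, 8, 4)</tt> gives 4096.</p>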
				808
				809	</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
				810	are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
				811	If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
				812	it is set based on the word size of the system, which is also
				813	the Kconfig default.
				814
				815	</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
				816	handled by each leaf <tt>rcu_node</tt> structure.
				817	Experience has shown that allowing a given leaf <tt>rcu_node</tt>
				818	structure to handle 64 CPUs, as permitted by the number of bits in
				819	the <tt>->qsmask</tt> field on a 64-bit system, results in
				820	excessive contention for the leaf <tt>rcu_node</tt> structures'
				821	<tt>->lock</tt> fields.
				822	The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
				823	limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
				824	If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
				825	selected is based on the word size of the system, just as for
				826	<tt>CONFIG_RCU_FANOUT</tt>.
				827	Lines 11-19 perform this computation.
				828
				829	</p><p>Lines 21-24 compute the maximum number of CPUs supported by
				830	a single-level (which contains a single <tt>rcu_node</tt> structure),
				831	two-level, three-level, and four-level <tt>rcu_node</tt> tree,
				832	respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
				833	and <tt>RCU_FANOUT_LEAF</tt>.
				834	These numbers of CPUs are retained in the
				835	<tt>RCU_FANOUT_1</tt>,
				836	<tt>RCU_FANOUT_2</tt>,
				837	<tt>RCU_FANOUT_3</tt>, and
				838	<tt>RCU_FANOUT_4</tt>
				839	C-preprocessor variables, respectively.
				840
				841	</p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
				842	statement spanning lines 26-66 that computes the number of
				843	<tt>rcu_node</tt> structures required for each level of the tree,
				844	as well as the number of levels required.
				845	The number of levels is placed in the <tt>RCU_NUM_LVLS</tt>
				846	C-preprocessor variable by lines 27, 35, 44, and 54.
				847	The number of <tt>rcu_node</tt> structures for the topmost level
				848	of the tree is always exactly one, and this value is unconditionally
				849	placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55.
				850	The rest of the levels (if any) of the <tt>rcu_node</tt> tree
				851	are computed by dividing the maximum number of CPUs by the
				852	fanout supported by the number of levels from the current level down,
				853	rounding up. This computation is performed by lines 37,
				854	46-47, and 56-58.
				855	Lines 31-33, 40-42, 50-52, and 61-63 create initializers
				856	for lockdep lock-class names.
				857	Finally, lines 64-66 produce an error if the maximum number of
				858	CPUs is too large for the specified fanout.
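<p>For experimentation, the <tt>#if</tt> chain on lines 26-66 can be restated as a run-time function.
This is a sketch only: <tt>rcu_tree_shape()</tt> and <tt>MAX_RCU_LVLS</tt> are hypothetical names, and the kernel does this computation in the C preprocessor rather than at run time.</p>

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define MAX_RCU_LVLS		4

/*
 * Fill lvl_nodes[] with the number of rcu_node structures at each level
 * (root first) and return the number of levels, or -1 if nr_cpus does
 * not fit into a four-level tree (the "#error" case).
 */
static int rcu_tree_shape(unsigned long nr_cpus, unsigned long fanout,
			  unsigned long fanout_leaf,
			  unsigned long lvl_nodes[MAX_RCU_LVLS])
{
	unsigned long cap[MAX_RCU_LVLS];	/* RCU_FANOUT_1..RCU_FANOUT_4 */
	int levels;
	int i;

	cap[0] = fanout_leaf;
	for (i = 1; i < MAX_RCU_LVLS; i++)
		cap[i] = cap[i - 1] * fanout;

	/* Pick the shallowest tree that can hold nr_cpus. */
	for (levels = 1; levels <= MAX_RCU_LVLS; levels++)
		if (nr_cpus <= cap[levels - 1])
			break;
	if (levels > MAX_RCU_LVLS)
		return -1;

	lvl_nodes[0] = 1;			/* exactly one root */
	for (i = 1; i < levels; i++)
		lvl_nodes[i] = DIV_ROUND_UP(nr_cpus, cap[levels - 1 - i]);
	return levels;
}
```

<p>For example, 4096 CPUs with an interior fanout of 64 and a leaf fanout of 16 need three levels containing 1, 4, and 256 <tt>rcu_node</tt> structures.</p>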
				859
				860	<h3><a name="The rcu_segcblist Structure">
				861	The <tt>rcu_segcblist</tt> Structure</a></h3>
				862
				863	The <tt>rcu_segcblist</tt> structure maintains a segmented list of
				864	callbacks as follows:
				865
				866	<pre>
				867	#define RCU_DONE_TAIL 0
				868	#define RCU_WAIT_TAIL 1
				869	#define RCU_NEXT_READY_TAIL 2
				870	#define RCU_NEXT_TAIL 3
				871	#define RCU_CBLIST_NSEGS 4
				872	
				873	struct rcu_segcblist {
				874	  struct rcu_head *head;
				875	  struct rcu_head **tails[RCU_CBLIST_NSEGS];
				876	  unsigned long gp_seq[RCU_CBLIST_NSEGS];
				877	  long len;
				878	  long len_lazy;
				879	};
				880	</pre>
				881
				882	<p>
				883	The segments are as follows:
				884
				885	<ol>
				886	<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
				887	These callbacks are ready to be invoked.
				888	<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
				889	current grace period.
				890	Note that different CPUs can have different ideas about which
				891	grace period is current, hence the <tt>->gp_seq</tt> field.
				892	<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
				893	grace period to start.
				894	<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
				895	associated with a grace period.
				896	</ol>
				897
				898	<p>
				899	The <tt>->head</tt> pointer references the first callback or
				900	is <tt>NULL</tt> if the list contains no callbacks (which is
				901	<i>not</i> the same as being empty).
				902	Each element of the <tt>->tails[]</tt> array references the
				903	<tt>->next</tt> pointer of the last callback in the corresponding
				904	segment of the list, or the list's <tt>->head</tt> pointer if
				905	that segment and all previous segments are empty.
				906	If the corresponding segment is empty but some previous segment is
				907	not empty, then the array element is identical to its predecessor.
				908	Older callbacks are closer to the head of the list, and new callbacks
				909	are added at the tail.
				910	This relationship between the <tt>->head</tt> pointer, the
				911	<tt>->tails[]</tt> array, and the callbacks is shown in this
				912	diagram:
				913
				914	</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
				915
				916	</p><p>In this figure, the <tt>->head</tt> pointer references the
				917	first
				918	RCU callback in the list.
				919	The <tt>->tails[RCU_DONE_TAIL]</tt> array element references
				920	the <tt>->head</tt> pointer itself, indicating that none
				921	of the callbacks is ready to invoke.
				922	The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback
				923	CB 2's <tt>->next</tt> pointer, which indicates that
				924	CB 1 and CB 2 are both waiting on the current grace period,
				925	give or take possible disagreements about exactly which grace period
				926	is the current one.
				927	The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element
				928	references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt>
				929	does, which indicates that there are no callbacks waiting on the next
				930	RCU grace period.
				931	The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references
				932	CB 4's <tt>->next</tt> pointer, indicating that all the
				933	remaining RCU callbacks have not yet been assigned to an RCU grace
				934	period.
				935	Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element
				936	always references the last RCU callback's <tt>->next</tt> pointer
				937	unless the callback list is empty, in which case it references
				938	the <tt>->head</tt> pointer.
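<p>The tail-pointer conventions are easier to see in code.
The following userspace sketch reuses the structure layout shown earlier, but the <tt>segcblist_*()</tt> helpers are hypothetical simplifications and are not the kernel's own functions.</p>

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL		0
#define RCU_WAIT_TAIL		1
#define RCU_NEXT_READY_TAIL	2
#define RCU_NEXT_TAIL		3
#define RCU_CBLIST_NSEGS	4

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct rcu_segcblist {
	struct rcu_head *head;
	struct rcu_head **tails[RCU_CBLIST_NSEGS];
	unsigned long gp_seq[RCU_CBLIST_NSEGS];
	long len;
	long len_lazy;
};

/* Empty list: every segment's tail pointer references ->head. */
static void segcblist_init(struct rcu_segcblist *l)
{
	int i;

	l->head = NULL;
	for (i = 0; i < RCU_CBLIST_NSEGS; i++) {
		l->tails[i] = &l->head;
		l->gp_seq[i] = 0;
	}
	l->len = 0;
	l->len_lazy = 0;
}

/* New callbacks are added at the tail, in the RCU_NEXT_TAIL segment. */
static void segcblist_enqueue(struct rcu_segcblist *l, struct rcu_head *rhp)
{
	assert(l->tails[RCU_NEXT_TAIL] != NULL);	/* list not disabled */
	rhp->next = NULL;
	*l->tails[RCU_NEXT_TAIL] = rhp;
	l->tails[RCU_NEXT_TAIL] = &rhp->next;
	l->len++;
}

/* A segment is empty if its tail pointer equals its predecessor's. */
static int segcblist_seg_empty(struct rcu_segcblist *l, int seg)
{
	if (seg == RCU_DONE_TAIL)
		return l->tails[RCU_DONE_TAIL] == &l->head;
	return l->tails[seg] == l->tails[seg - 1];
}
```

<p>Enqueuing CB 1 and CB 2, pointing the <tt>RCU_WAIT_TAIL</tt> and <tt>RCU_NEXT_READY_TAIL</tt> elements at CB 2's <tt>->next</tt> pointer, and then enqueuing CB 3 and CB 4 reproduces the state in the figure.</p>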
				939
				940	<p>
				941	There is one additional important special case for the
				942	<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
				943	when this list is <i>disabled</i>.
				944	Lists are disabled when the corresponding CPU is offline or when
				945	the corresponding CPU's callbacks are offloaded to a kthread,
				946	both of which are described elsewhere.
				947
				948	</p><p>CPUs advance their callbacks from the
				949	<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
				950	<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
				951	as grace periods advance.
				952
				953	</p><p>The <tt>->gp_seq[]</tt> array records grace-period
				954	numbers corresponding to the list segments.
				955	This is what allows different CPUs to have different ideas as to
				956	which is the current grace period while still avoiding premature
				957	invocation of their callbacks.
				958	In particular, this allows CPUs that go idle for extended periods
				959	to determine which of their callbacks are ready to be invoked after
				960	reawakening.
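<p>One way to picture the role of <tt>->gp_seq[]</tt> is as a function that a reawakening CPU might apply to decide which segments have become ready.
The sketch below is an assumption-laden simplification: <tt>last_ready_segment()</tt> is a hypothetical helper, it looks only at the recorded sequence numbers (not at whether a segment actually holds callbacks), and it uses a wrap-tolerant unsigned comparison.</p>

```c
#include <assert.h>
#include <limits.h>

#define RCU_DONE_TAIL		0
#define RCU_WAIT_TAIL		1
#define RCU_NEXT_READY_TAIL	2
#define RCU_NEXT_TAIL		3
#define RCU_CBLIST_NSEGS	4

/* Wrap-tolerant "a < b" for unsigned long sequence numbers. */
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

/*
 * Return the index of the last segment whose callbacks may be invoked,
 * given that all grace periods up to and including "completed" have
 * ended.  RCU_DONE_TAIL means that no additional segment became ready.
 */
static int last_ready_segment(const unsigned long gp_seq[RCU_CBLIST_NSEGS],
			      unsigned long completed)
{
	int last = RCU_DONE_TAIL;
	int i;

	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
		if (ULONG_CMP_LT(completed, gp_seq[i]))
			break;		/* this segment must keep waiting */
		last = i;
	}
	return last;
}
```

<p>Because the decision depends only on the CPU's own recorded numbers, the CPU need not have tracked every intervening grace period while it was idle.</p>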
				961
				962	</p><p>The <tt>->len</tt> counter contains the number of
				963	callbacks in <tt>->head</tt>, and the
				964	<tt>->len_lazy</tt> counter contains the number of those callbacks that
				965	are known to only free memory, and whose invocation can therefore
				966	be safely deferred.
				967
				968	<p><b>Important note</b>: It is the <tt>->len</tt> field that
				969	determines whether or not there are callbacks associated with
				970	this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt>
				971	pointer.
				972	The reason for this is that all the ready-to-invoke callbacks
				973	(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
				974	all at once at callback-invocation time.
				975	If callback invocation must be postponed, for example, because a
				976	high-priority process just woke up on this CPU, then the remaining
				977	callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment.
				978	Either way, the <tt>->len</tt> and <tt>->len_lazy</tt> counts
				979	are adjusted after the corresponding callbacks have been invoked, and so
				980	again it is the <tt>->len</tt> count that accurately reflects whether
				981	or not there are callbacks associated with this <tt>rcu_segcblist</tt>
				982	structure.
				983	Of course, off-CPU sampling of the <tt>->len</tt> count requires
				984	the use of appropriate synchronization, for example, memory barriers.
				985	This synchronization can be a bit subtle, particularly in the case
				986	of <tt>rcu_barrier()</tt>.
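<p>The following sketch shows why <tt>->head</tt> can be <tt>NULL</tt> while <tt>->len</tt> is still nonzero.
It is a simplified userspace model with hypothetical helper names, and it ignores <tt>->len_lazy</tt>, postponed invocation, and all synchronization.</p>

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL		0
#define RCU_NEXT_TAIL		3
#define RCU_CBLIST_NSEGS	4

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct rcu_segcblist {
	struct rcu_head *head;
	struct rcu_head **tails[RCU_CBLIST_NSEGS];
	unsigned long gp_seq[RCU_CBLIST_NSEGS];
	long len;
	long len_lazy;
};

static int nr_invoked;	/* illustration only: counts callback invocations */

static void count_cb(struct rcu_head *head)
{
	(void)head;
	nr_invoked++;
}

/* Detach the whole RCU_DONE_TAIL segment, leaving ->len untouched. */
static struct rcu_head *segcblist_extract_done(struct rcu_segcblist *l)
{
	struct rcu_head *done;
	int i;

	if (l->tails[RCU_DONE_TAIL] == &l->head)
		return NULL;			/* nothing ready to invoke */
	done = l->head;
	l->head = *l->tails[RCU_DONE_TAIL];
	*l->tails[RCU_DONE_TAIL] = NULL;	/* terminate the detached list */
	/* Walk downward so that tails[RCU_DONE_TAIL] is rewritten last. */
	for (i = RCU_CBLIST_NSEGS - 1; i >= RCU_DONE_TAIL; i--)
		if (l->tails[i] == l->tails[RCU_DONE_TAIL])
			l->tails[i] = &l->head;
	return done;
}

/* Invoke the detached callbacks, and only then adjust ->len. */
static long segcblist_invoke(struct rcu_segcblist *l, struct rcu_head *done)
{
	long n = 0;

	while (done) {
		struct rcu_head *next = done->next;

		done->func(done);
		done = next;
		n++;
	}
	l->len -= n;
	return n;
}
```

<p>Between the two calls, a list whose callbacks were all ready has a <tt>NULL</tt> <tt>->head</tt> pointer, yet <tt>->len</tt> still counts the callbacks awaiting invocation.</p>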
				987
				988	<h3><a name="The rcu_data Structure">
				989	The <tt>rcu_data</tt> Structure</a></h3>
				990
				991	<p>The <tt>rcu_data</tt> maintains the per-CPU state for the
				992	corresponding flavor of RCU.
				993	The fields in this structure may be accessed only from the corresponding
				994	CPU (and from tracing) unless otherwise stated.
				995	This structure is the
				996	focus of quiescent-state detection and RCU callback queuing.
				997	It also tracks its relationship to the corresponding leaf
				998	<tt>rcu_node</tt> structure to allow more-efficient
				999	propagation of quiescent states up the <tt>rcu_node</tt>
				1000	combining tree.
				1001	Like the <tt>rcu_node</tt> structure, it provides a local
				1002	copy of the grace-period information so that the
				1003	corresponding CPU gets synchronized access to this
				1004	information for free.
				1005	Finally, this structure records past dyntick-idle state
				1006	for the corresponding CPU and also tracks statistics.
				1007
				1008	</p><p>The <tt>rcu_data</tt> structure's fields are discussed,
				1009	singly and in groups, in the following sections.
				1010
				1011	<h5>Connection to Other Data Structures</h5>
				1012
				1013	<p>This portion of the <tt>rcu_data</tt> structure is declared
				1014	as follows:
				1015
				1016	<pre>
				1017	int cpu;
				1018	struct rcu_state *rsp;
				1019	struct rcu_node *mynode;
				1020	struct rcu_dynticks *dynticks;
				1021	unsigned long grpmask;
				1022	bool beenonline;
				1023	</pre>
				1024
				1025	<p>The <tt>->cpu</tt> field contains the number of the
				1026	corresponding CPU, the <tt>->rsp</tt> pointer references
				1027	the corresponding <tt>rcu_state</tt> structure (and is most frequently
				1028	used to locate the name of the corresponding flavor of RCU for tracing),
				1029	and the <tt>->mynode</tt> field references the corresponding
				1030	<tt>rcu_node</tt> structure.
				1031	The <tt>->mynode</tt> is used to propagate quiescent states
				1032	up the combining tree.
				1033	<p>The <tt>->dynticks</tt> pointer references the
				1034	<tt>rcu_dynticks</tt> structure corresponding to this
				1035	CPU.
				1036	Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
				1037	structure is shared among all flavors of RCU.
				1038	These first four fields are constant and therefore require no
				1039	synchronization.
				1040
				1041	</p><p>The <tt>->grpmask</tt> field indicates the bit in
				1042	the <tt>->mynode->qsmask</tt> corresponding to this
				1043	<tt>rcu_data</tt> structure, and is also used when propagating
				1044	quiescent states.
				1045	The <tt>->beenonline</tt> flag is set whenever the corresponding
				1046	CPU comes online, which means that the debugfs tracing need not dump
				1047	out any <tt>rcu_data</tt> structure for which this flag is not set.
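<p>The relationship between <tt>->grpmask</tt> and <tt>->mynode->qsmask</tt> can be sketched as follows.
The types and helpers are hypothetical and trimmed down, and the mapping from CPU number to bit (the CPU's offset from the lowest-numbered CPU covered by the leaf) is an assumption made for this example.
Real code would also need to hold the <tt>rcu_node</tt> structure's <tt>->lock</tt>.</p>

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-ins for rcu_node and rcu_data. */
struct node_sketch {
	unsigned long qsmask;	/* CPUs that still owe a quiescent state */
};

struct data_sketch {
	int cpu;
	struct node_sketch *mynode;
	unsigned long grpmask;	/* this CPU's bit in mynode->qsmask */
};

/* Bind a CPU to its leaf; first_cpu is the lowest CPU the leaf covers. */
static void data_sketch_init(struct data_sketch *d, int cpu,
			     struct node_sketch *leaf, int first_cpu)
{
	assert(cpu >= first_cpu);
	d->cpu = cpu;
	d->mynode = leaf;
	d->grpmask = 1UL << (cpu - first_cpu);
}

/*
 * Report a quiescent state for this CPU.  Returns nonzero if the leaf
 * has no remaining CPUs to wait for, in which case the report would
 * next be propagated up the combining tree.
 */
static int data_sketch_report_qs(struct data_sketch *d)
{
	d->mynode->qsmask &= ~d->grpmask;
	return d->mynode->qsmask == 0;
}
```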
				1048
				1049	<h5>Quiescent-State and Grace-Period Tracking</h5>
				1050
				1051	<p>This portion of the <tt>rcu_data</tt> structure is declared
				1052	as follows:
				1053
				1054	<pre>
				1055	unsigned long gp_seq;
				1056	unsigned long gp_seq_needed;
				1057	bool cpu_no_qs;
				1058	bool core_needs_qs;
				1059	bool gpwrap;
				1060	unsigned long rcu_qs_ctr_snap;
				1061	</pre>
				1062
				1063	<p>The <tt>->gp_seq</tt> and <tt>->gp_seq_needed</tt>
				1064	fields are the counterparts of the fields of the same name
				1065	in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
				1066	They may each lag up to one behind their <tt>rcu_node</tt>
				1067	counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
				1068	<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
				1069	arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
				1070	will catch up upon exit from dyntick-idle mode).
				1071	If the lower two bits of a given <tt>rcu_data</tt> structure's
				1072	<tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt>
				1073	structure believes that RCU is idle.
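<p>The "lower two bits" test can be written down directly.
In this sketch the helper names are hypothetical, and treating the remaining upper bits as a grace-period counter is an assumption consistent with the text above rather than something this section spells out.</p>

```c
#include <assert.h>

#define SEQ_CTR_SHIFT	2
#define SEQ_STATE_MASK	((1UL << SEQ_CTR_SHIFT) - 1)

/* Upper bits: assumed to count grace periods. */
static unsigned long seq_ctr(unsigned long s)
{
	return s >> SEQ_CTR_SHIFT;
}

/* Lower two bits: grace-period state, zero meaning idle. */
static unsigned long seq_state(unsigned long s)
{
	return s & SEQ_STATE_MASK;
}

/* Does a structure holding this ->gp_seq value believe RCU is idle? */
static int seq_idle(unsigned long s)
{
	return seq_state(s) == 0;
}
```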
				1074
				1075	<table>
				1076	<tr><th> </th></tr>
				1077	<tr><th align="left">Quick Quiz:</th></tr>
				1078	<tr><td>
				1079	All this replication of the grace period numbers can only cause
				1080	massive confusion.
				1081	Why not just keep a global sequence number and be done with it???
				1082	</td></tr>
				1083	<tr><th align="left">Answer:</th></tr>
				1084	<tr><td bgcolor="#ffffff"><font color="ffffff">
				1085	Because if there were only a single global sequence
				1086	number, there would need to be a single global lock to allow
				1087	safely accessing and updating it.
				1088	And if we are not going to have a single global lock, we need
				1089	to carefully manage the numbers on a per-node basis.
				1090	Recall from the answer to a previous Quick Quiz that the consequences
				1091	of applying a previously sampled quiescent state to the wrong
				1092	grace period are quite severe.
				1093	</font></td></tr>
				1094	<tr><td> </td></tr>
				1095	</table>
				1096
				1097	<p>The <tt>->cpu_no_qs</tt> flag indicates that the
				1098	CPU has not yet passed through a quiescent state,
				1099	while the <tt>->core_needs_qs</tt> flag indicates that the
				1100	RCU core needs a quiescent state from the corresponding CPU.
				1101	The <tt>->gpwrap</tt> field indicates that the corresponding
				1102	CPU has remained idle for so long that the
				1103	<tt>->gp_seq</tt> counter is in danger of overflow, which
				1104	will cause the CPU to disregard the values of its counters on
				1105	its next exit from idle.
				1106	Finally, the <tt>->rcu_qs_ctr_snap</tt> field is used to detect
				1107	cases where a given operation has resulted in a quiescent state
				1108	for all flavors of RCU, for example, <tt>cond_resched()</tt>
				1109	when RCU has indicated a need for quiescent states.
				1110
				1111	<h5>RCU Callback Handling</h5>
				1112
				1113	<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
				1114	the same CPU that registered them.
				1115	This is strictly a cache-locality optimization: callbacks can and
				1116	do get invoked on CPUs other than the one that registered them.
				1117	After all, if the CPU that registered a given callback has gone
				1118	offline before the callback can be invoked, there really is no other
				1119	choice.
				1120
				1121	</p><p>This portion of the <tt>rcu_data</tt> structure is declared
				1122	as follows:
				1123
				1124	<pre>
				1125	struct rcu_segcblist cblist;
				1126	long qlen_last_fqs_check;
				1127	unsigned long n_cbs_invoked;
				1128	unsigned long n_nocbs_invoked;
				1129	unsigned long n_cbs_orphaned;
				1130	unsigned long n_cbs_adopted;
				1131	unsigned long n_force_qs_snap;
				1132	long blimit;
				1133	</pre>
				1134
				1135	<p>The <tt>->cblist</tt> structure is the segmented callback list
				1136	described earlier.
				1137	The CPU advances the callbacks in its <tt>rcu_data</tt> structure
				1138	whenever it notices that another RCU grace period has completed.
				1139	The CPU detects the completion of an RCU grace period by noticing
				1140	that the value of its <tt>rcu_data</tt> structure's
				1141	<tt>->gp_seq</tt> field differs from that of its leaf
				1142	<tt>rcu_node</tt> structure.
				1143	Recall that each <tt>rcu_node</tt> structure's
				1144	<tt>->gp_seq</tt> field is updated at the beginnings and ends of each
				1145	grace period.
				1146
				1147	<p>
				1148	The <tt>->qlen_last_fqs_check</tt> and
				1149	<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
				1150	states from <tt>call_rcu()</tt> and friends when callback
				1151	lists grow excessively long.
				1152
				1153	</p><p>The <tt>->n_cbs_invoked</tt>,
				1154	<tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt>
				1155	fields count the number of callbacks invoked,
				1156	sent to other CPUs when this CPU goes offline,
				1157	and received from other CPUs when those other CPUs go offline.
				1158	The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks
				1159	are offloaded to a kthread.
				1160
				1161	<p>
				1162	Finally, the <tt>->blimit</tt> counter is the maximum number of
				1163	RCU callbacks that may be invoked at a given time.
				1164
				1165	<h5>Dyntick-Idle Handling</h5>
				1166
				1167	<p>This portion of the <tt>rcu_data</tt> structure is declared
				1168	as follows:
				1169
				1170	<pre>
				1171	int dynticks_snap;
				1172	unsigned long dynticks_fqs;
				1173	</pre>
				1174
				1175	The <tt>->dynticks_snap</tt> field is used to take a snapshot
				1176	of the corresponding CPU's dyntick-idle state when forcing
				1177	quiescent states, and is therefore accessed from other CPUs.
				1178	Finally, the <tt>->dynticks_fqs</tt> field is used to
				1179	count the number of times this CPU is determined to be in
				1180	dyntick-idle state, and is used for tracing and debugging purposes.
				1181
				1182	<h3><a name="The rcu_dynticks Structure">
				1183	The <tt>rcu_dynticks</tt> Structure</a></h3>
				1184
				1185	<p>The <tt>rcu_dynticks</tt> maintains the per-CPU dyntick-idle state
				1186	for the corresponding CPU.
				1187	Unlike the other structures, <tt>rcu_dynticks</tt> is not
				1188	replicated over the different flavors of RCU.
				1189	The fields in this structure may be accessed only from the corresponding
				1190	CPU (and from tracing) unless otherwise stated.
				1191	Its fields are as follows:
				1192
				1193	<pre>
				1194	long dynticks_nesting;
				1195	long dynticks_nmi_nesting;
				1196	atomic_t dynticks;
				1197	bool rcu_need_heavy_qs;
				1198	unsigned long rcu_qs_ctr;
				1199	bool rcu_urgent_qs;
				1200	</pre>
				1201
				1202	<p>The <tt>->dynticks_nesting</tt> field counts the
				1203	nesting depth of process execution, so that in normal circumstances
				1204	this counter has value zero or one.
				1205	NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt>
				1206	field.
				1207	Because NMIs cannot be masked, changes to this variable have to be
				1208	undertaken carefully using an algorithm provided by Andy Lutomirski.
				1209	The initial transition from idle adds one, and nested transitions
				1210	add two, so that a nesting level of five is represented by a
				1211	<tt>->dynticks_nmi_nesting</tt> value of nine.
				1212	This counter can therefore be thought of as counting the number
				1213	of reasons why this CPU cannot be permitted to enter dyntick-idle
				1214	mode, aside from process-level transitions.
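<p>The add-one-then-add-two encoding can be modeled in a few lines.
This sketch uses hypothetical helper names and assumes that the CPU is idle whenever the counter is zero.
It leaves out the NMI-safe update algorithm itself.</p>

```c
#include <assert.h>

/* The initial transition from idle adds one; nested transitions add two. */
static void nmi_nesting_enter(long *nesting)
{
	*nesting += (*nesting == 0) ? 1 : 2;
}

/* Undo one level: the outermost exit returns the counter to zero. */
static void nmi_nesting_exit(long *nesting)
{
	assert(*nesting > 0);
	if (*nesting == 1)
		*nesting = 0;
	else
		*nesting -= 2;
}

/* Nesting level encoded by a given counter value (0, 1, 3, 5, ...). */
static long nmi_nesting_depth(long nesting)
{
	return (nesting + 1) / 2;
}
```

<p>Five nested entries from idle take the counter through 1, 3, 5, 7, and 9, which matches the nesting level of five described above.</p>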
				1215
				1216	<p>However, it turns out that when running in non-idle kernel context,
				1217	the Linux kernel is fully capable of entering interrupt handlers that
				1218	never exit and perhaps also vice versa.
				1219	Therefore, whenever the <tt>->dynticks_nesting</tt> field is
				1220	incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field
				1221	is set to a large positive number, and whenever the
				1222	<tt>->dynticks_nesting</tt> field is decremented down to zero,
				1223	the <tt>->dynticks_nmi_nesting</tt> field is set to zero.
				1224	Assuming that the number of misnested interrupts is not sufficient
				1225	to overflow the counter, this approach corrects the
				1226	<tt>->dynticks_nmi_nesting</tt> field every time the corresponding
				1227	CPU enters the idle loop from process context.
				1228
				1229	</p><p>The <tt>->dynticks</tt> field counts the corresponding
				1230	CPU's transitions to and from dyntick-idle mode, so that this counter
				1231	has an even value when the CPU is in dyntick-idle mode and an odd
				1232	value otherwise.
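<p>The even/odd convention is what makes remote snapshots useful: another CPU can record the counter and later tell whether the target CPU was idle or has passed through idle since.
The sketch below uses C11 atomics and hypothetical names, and assumes one increment per transition exactly as described above.</p>

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the ->dynticks field. */
struct dynticks_sketch {
	atomic_int dynticks;	/* even: dyntick-idle, odd: not idle */
};

/* Entering dyntick-idle mode: the counter goes from odd to even. */
static void dynticks_eqs_enter(struct dynticks_sketch *d)
{
	atomic_fetch_add(&d->dynticks, 1);
}

/* Leaving dyntick-idle mode: the counter goes from even to odd. */
static void dynticks_eqs_exit(struct dynticks_sketch *d)
{
	atomic_fetch_add(&d->dynticks, 1);
}

/* Snapshot taken from another CPU, in the role of ->dynticks_snap. */
static int dynticks_snapshot(struct dynticks_sketch *d)
{
	return atomic_load(&d->dynticks);
}

static int dynticks_in_eqs(int snap)
{
	return !(snap & 1);
}

/*
 * The CPU has been quiescent since the snapshot if it was idle when
 * the snapshot was taken or if the counter has changed since then.
 */
static int dynticks_passed_qs(struct dynticks_sketch *d, int snap)
{
	return dynticks_in_eqs(snap) || atomic_load(&d->dynticks) != snap;
}
```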
				1233
				1234	</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used
				1235	to record the fact that the RCU core code would really like to
				1236	see a quiescent state from the corresponding CPU, so much so that
				1237	it is willing to call for heavy-weight dyntick-counter operations.
				1238	This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
				1239	code, which provide a momentary idle sojourn in response.
				1240
				1241	</p><p>The <tt>->rcu_qs_ctr</tt> field is used to record
				1242	quiescent states from <tt>cond_resched()</tt>.
				1243	Because <tt>cond_resched()</tt> can execute quite frequently, this
				1244	must be quite lightweight, as in a non-atomic increment of this
				1245	per-CPU field.
				1246
				1247	</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record
				1248	the fact that the RCU core code would really like to see a quiescent
				1249	state from the corresponding CPU, with the various other fields indicating
				1250	just how badly RCU wants this quiescent state.
				1251	This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
				1252	code, which, if nothing else, non-atomically increment <tt>->rcu_qs_ctr</tt>
				1253	in response.
				1254
				1255	<table>
				1256	<tr><th> </th></tr>
				1257	<tr><th align="left">Quick Quiz:</th></tr>
				1258	<tr><td>
				1259	Why not simply combine the <tt>->dynticks_nesting</tt>
				1260	and <tt>->dynticks_nmi_nesting</tt> counters into a
				1261	single counter that just counts the number of reasons that
				1262	the corresponding CPU is non-idle?
				1263	</td></tr>
				1264	<tr><th align="left">Answer:</th></tr>
				1265	<tr><td bgcolor="#ffffff"><font color="ffffff">
				1266	Because this would fail in the presence of interrupts whose
				1267	handlers never return and of handlers that manage to return
				1268	from a made-up interrupt.
				1269	</font></td></tr>
				1270	<tr><td> </td></tr>
				1271	</table>
				1272
				1273	<p>Additional fields are present for some special-purpose
				1274	builds, and are discussed separately.
				1275
				1276	<h3><a name="The rcu_head Structure">
				1277	The <tt>rcu_head</tt> Structure</a></h3>
				1278
				1279	<p>Each <tt>rcu_head</tt> structure represents an RCU callback.
				1280	These structures are normally embedded within RCU-protected data
				1281	structures whose algorithms use asynchronous grace periods.
				1282	In contrast, when using algorithms that block waiting for RCU grace periods,
				1283	RCU users need not provide <tt>rcu_head</tt> structures.
				1284
				1285	</p><p>The <tt>rcu_head</tt> structure has fields as follows:
				1286
				1287	<pre>
				1288	struct rcu_head *next;
				1289	void (*func)(struct rcu_head *head);
				1290	</pre>
				1291
				1292	<p>The <tt>->next</tt> field is used
				1293	to link the <tt>rcu_head</tt> structures together in the
				1294	lists within the <tt>rcu_data</tt> structures.
				1295	The <tt>->func</tt> field is a pointer to the function
				1296	to be called when the callback is ready to be invoked, and
				1297	this function is passed a pointer to the <tt>rcu_head</tt>
				1298	structure.
				1299	However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
				1300	field to record the offset of the <tt>rcu_head</tt>
				1301	structure within the enclosing RCU-protected data structure.
				1302
				1303	</p><p>Both of these fields are used internally by RCU.
				1304	From the viewpoint of RCU users, this structure is an
				1305	opaque “cookie”.
				1306
				1307	<table>
				1308	<tr><th> </th></tr>
				1309	<tr><th align="left">Quick Quiz:</th></tr>
				1310	<tr><td>
				1311	Given that the callback function <tt>->func</tt>
				1312	is passed a pointer to the <tt>rcu_head</tt> structure,
				1313	how is that function supposed to find the beginning of the
				1314	enclosing RCU-protected data structure?
				1315	</td></tr>
				1316	<tr><th align="left">Answer:</th></tr>
				1317	<tr><td bgcolor="#ffffff"><font color="ffffff">
				1318	In actual practice, there is a separate callback function per
				1319	type of RCU-protected data structure.
				1320	The callback function can therefore use the <tt>container_of()</tt>
				1321	macro in the Linux kernel (or other pointer-manipulation facilities
				1322	in other software environments) to find the beginning of the
				1323	enclosing structure.
				1324	</font></td></tr>
				1325	<tr><td> </td></tr>
				1326	</table>
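<p>The <tt>container_of()</tt> technique from the Quick Quiz can be
sketched in userspace as follows.
This is a minimal illustration rather than kernel code: the
<tt>struct foo</tt> type, the <tt>foo_reclaim()</tt> callback, the
<tt>invoke_callbacks()</tt> helper, and the <tt>demo_container_of()</tt>
driver are all hypothetical names, and the <tt>container_of()</tt> macro
shown here is a simplified stand-in that omits the kernel version's
type checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Simplified container_of(); the kernel version adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical RCU-protected structure with an embedded rcu_head. */
struct foo {
	int key;
	struct rcu_head rh;
};

static int last_freed_key = -1;

/* Per-type callback: recover the enclosing struct foo, then free it. */
static void foo_reclaim(struct rcu_head *head)
{
	struct foo *fp = container_of(head, struct foo, rh);

	last_freed_key = fp->key;
	free(fp);
}

/* Models callback invocation after a grace period: walk ->next, call ->func. */
static int invoke_callbacks(struct rcu_head *list)
{
	int n = 0;

	while (list) {
		struct rcu_head *next = list->next; /* Fetch before ->func frees it. */

		list->func(list);
		list = next;
		n++;
	}
	return n;
}

/* Queue one hypothetical element, run its callback, report the key seen. */
static int demo_container_of(int key)
{
	struct foo *fp = malloc(sizeof(*fp));

	assert(fp);
	fp->key = key;
	fp->rh.next = NULL;
	fp->rh.func = foo_reclaim;
	invoke_callbacks(&fp->rh);
	return last_freed_key;
}
```

<p>Because each type of RCU-protected structure supplies its own callback,
the callback knows both the enclosing type and the name of the embedded
<tt>rcu_head</tt> member, which is all that <tt>container_of()</tt> needs.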
				1327
				1328	<h3><a name="RCU-Specific Fields in the task_struct Structure">
				1329	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
				1330
				1331	<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
				1332	additional fields in the <tt>task_struct</tt> structure:
				1333
				1334	<pre>
				1335	#ifdef CONFIG_PREEMPT_RCU
				1336	  int rcu_read_lock_nesting;
				1337	  union rcu_special rcu_read_unlock_special;
				1338	  struct list_head rcu_node_entry;
				1339	  struct rcu_node *rcu_blocked_node;
				1340	#endif /* #ifdef CONFIG_PREEMPT_RCU */
				1341	#ifdef CONFIG_TASKS_RCU
				1342	  unsigned long rcu_tasks_nvcsw;
				1343	  bool rcu_tasks_holdout;
				1344	  struct list_head rcu_tasks_holdout_list;
				1345	  int rcu_tasks_idle_cpu;
				1346	#endif /* #ifdef CONFIG_TASKS_RCU */
				1347	</pre>
				1348
				1349	<p>The <tt>->rcu_read_lock_nesting</tt> field records the
				1350	nesting level for RCU read-side critical sections, and
				1351	the <tt>->rcu_read_unlock_special</tt> field is a bitmask
				1352	that records special conditions that require <tt>rcu_read_unlock()</tt>
				1353	to do additional work.
				1354	The <tt>->rcu_node_entry</tt> field is used to form lists of
				1355	tasks that have blocked within preemptible-RCU read-side critical
				1356	sections, and the <tt>->rcu_blocked_node</tt> field references
				1357	the <tt>rcu_node</tt> structure whose list this task is a member of,
				1358	or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
				1359	read-side critical section.
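
<p>The interplay between the nesting count and the special-condition
bitmask can be sketched with a small userspace model.
This is only an illustration under stated assumptions, not the kernel's
implementation: <tt>struct task_model</tt>, the <tt>model_*()</tt>
functions, and the <tt>MODEL_SPECIAL_BLOCKED</tt> bit are hypothetical,
the bitmask is modeled as a plain <tt>unsigned int</tt>, and list
membership is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit recording "blocked within a read-side critical section". */
#define MODEL_SPECIAL_BLOCKED 0x1u

/* Simplified stand-in for the CONFIG_PREEMPT_RCU fields in task_struct. */
struct task_model {
	int rcu_read_lock_nesting;            /* read-side nesting depth */
	unsigned int rcu_read_unlock_special; /* special-condition bitmask */
	bool on_blocked_list;                 /* models ->rcu_node_entry membership */
	int special_calls;                    /* times the slow path ran */
};

static void model_rcu_read_lock(struct task_model *t)
{
	t->rcu_read_lock_nesting++;
}

/* Models preemption: only tasks inside a critical section get queued. */
static void model_preempt(struct task_model *t)
{
	if (t->rcu_read_lock_nesting > 0) {
		t->on_blocked_list = true;
		t->rcu_read_unlock_special |= MODEL_SPECIAL_BLOCKED;
	}
}

static void model_rcu_read_unlock(struct task_model *t)
{
	assert(t->rcu_read_lock_nesting > 0);
	if (--t->rcu_read_lock_nesting == 0 && t->rcu_read_unlock_special) {
		/* Outermost unlock with a special condition: do the extra work. */
		t->on_blocked_list = false;
		t->rcu_read_unlock_special = 0;
		t->special_calls++;
	}
}

/* Nest two critical sections, get preempted, then unwind both. */
static int demo_nesting(void)
{
	struct task_model t = { 0 };

	model_rcu_read_lock(&t);
	model_rcu_read_lock(&t);
	model_preempt(&t);
	model_rcu_read_unlock(&t); /* Inner unlock: still queued. */
	assert(t.on_blocked_list && t.special_calls == 0);
	model_rcu_read_unlock(&t); /* Outermost unlock: slow path runs. */
	assert(!t.on_blocked_list);
	return t.special_calls;
}
```

<p>In this model, only the outermost unlock pays for the extra work, and
only when a special condition was recorded, so the common case remains
an increment on entry and a decrement plus a test on exit.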
				1360
				1361	<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
				1362	voluntary context switches that this task had undergone at the
				1363	beginning of the current tasks-RCU grace period,
				1364	<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
				1365	grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
				1366	is a list element enqueuing this task on the holdout list,
				1367	and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
				1368	idle task is running on, but only if the task is currently running,
				1369	that is, if the CPU is currently idle.
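
<p>The way a context-switch snapshot can release a holdout task can be
sketched as follows.
This is a deliberately reduced userspace model: the
<tt>struct tasks_model</tt> type, its <tt>nvcsw</tt> counter, and the
<tt>model_*()</tt> and <tt>demo_holdout()</tt> functions are hypothetical,
and the sketch checks only the voluntary-context-switch condition,
whereas a real implementation would need to consider further conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the CONFIG_TASKS_RCU fields in task_struct. */
struct tasks_model {
	unsigned long nvcsw;           /* running count of voluntary context switches */
	unsigned long rcu_tasks_nvcsw; /* snapshot taken at grace-period start */
	bool rcu_tasks_holdout;        /* is the grace period waiting on this task? */
};

/* Grace-period start: snapshot the counter and mark the task a holdout. */
static void model_gp_start(struct tasks_model *t)
{
	t->rcu_tasks_nvcsw = t->nvcsw;
	t->rcu_tasks_holdout = true;
}

/* A change in the counter since the snapshot releases the task. */
static bool model_check_holdout(struct tasks_model *t)
{
	if (t->rcu_tasks_holdout && t->rcu_tasks_nvcsw != t->nvcsw)
		t->rcu_tasks_holdout = false;
	return t->rcu_tasks_holdout;
}

/* Returns 1 if the task is still a holdout after the given number of switches. */
static int demo_holdout(unsigned long switches)
{
	struct tasks_model t = { .nvcsw = 7 };

	model_gp_start(&t);
	t.nvcsw += switches;
	return model_check_holdout(&t) ? 1 : 0;
}
```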
				1370
				1371	<h3><a name="Accessor Functions">
				1372	Accessor Functions</a></h3>
				1373
				1374	<p>The following listing shows the
				1375	<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first()</tt>,
				1376	<tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
				1377	<tt>rcu_for_each_leaf_node()</tt> function and macros:
				1378
				1379	<pre>
				1380	static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
				1381	{
				1382	  return &rsp->node[0];
				1383	}
				1384	
				1385	#define rcu_for_each_node_breadth_first(rsp, rnp) \
				1386	  for ((rnp) = &(rsp)->node[0]; \
				1387	       (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
				1388	
				1389	#define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
				1390	  for ((rnp) = &(rsp)->node[0]; \
				1391	       (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
				1392	
				1393	#define rcu_for_each_leaf_node(rsp, rnp) \
				1394	  for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
				1395	       (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
				1396	</pre>
				1397
				1398	<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
				1399	first element of the specified <tt>rcu_state</tt> structure's
				1400	<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
				1401	structure.
				1402
				1403	</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
				1404	macro takes advantage of the layout of the <tt>rcu_node</tt>
				1405	structures in the <tt>rcu_state</tt> structure's
				1406	<tt>->node[]</tt> array, performing a breadth-first traversal by
				1407	simply traversing the array in order.
				1408	The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
				1409	similarly, but traverses only the first part of the array, thus excluding
				1410	the leaf <tt>rcu_node</tt> structures.
				1411	Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
				1412	the last part of the array, thus traversing only the leaf
				1413	<tt>rcu_node</tt> structures.
				1414
				1415	<table>
				1416	<tr><th> </th></tr>
				1417	<tr><th align="left">Quick Quiz:</th></tr>
				1418	<tr><td>
				1419	What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
				1420	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
				1421	contains only a single node?
				1422	</td></tr>
				1423	<tr><th align="left">Answer:</th></tr>
				1424	<tr><td bgcolor="#ffffff"><font color="ffffff">
				1425	In the single-node case,
				1426	<tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
				1427	and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
				1428	</font></td></tr>
				1429	<tr><td> </td></tr>
				1430	</table>
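
<p>The array-order traversals, including the single-node case from the
Quick Quiz, can be exercised with a small userspace model.
The sketch below rests on several assumptions: the <tt>*_model</tt> types,
the <tt>model_*()</tt> macros and functions, and the <tt>MAX_NODES</tt>
and <tt>MAX_LVLS</tt> limits are hypothetical, only one- and two-level
trees are modeled, and the node and level counts are stored in the
structure (instead of using the compile-time <tt>NUM_RCU_NODES</tt> and
<tt>NUM_RCU_LVLS</tt> constants) so that both layouts can be tried in
one program.

```c
#include <assert.h>

#define MAX_NODES 8
#define MAX_LVLS 2

struct rcu_node_model {
	int visited;
};

struct rcu_state_model {
	struct rcu_node_model node[MAX_NODES];  /* Root first, then leaves. */
	struct rcu_node_model *level[MAX_LVLS]; /* First node of each level. */
	int num_nodes;                          /* Stands in for NUM_RCU_NODES. */
	int num_lvls;                           /* Stands in for NUM_RCU_LVLS. */
};

#define model_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[(rsp)->num_nodes]; (rnp)++)

#define model_for_each_nonleaf_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < (rsp)->level[(rsp)->num_lvls - 1]; (rnp)++)

#define model_for_each_leaf_node(rsp, rnp) \
	for ((rnp) = (rsp)->level[(rsp)->num_lvls - 1]; \
	     (rnp) < &(rsp)->node[(rsp)->num_nodes]; (rnp)++)

/* num_leaves == 0 builds a single-node tree whose root is also its leaf. */
static void model_init(struct rcu_state_model *rsp, int num_leaves)
{
	assert(num_leaves >= 0 && num_leaves < MAX_NODES);
	rsp->level[0] = &rsp->node[0];
	if (num_leaves == 0) {
		rsp->num_lvls = 1;
		rsp->num_nodes = 1;
	} else {
		rsp->num_lvls = 2;
		rsp->num_nodes = 1 + num_leaves;
		rsp->level[1] = &rsp->node[1];
	}
}

/* Each helper counts how many nodes the corresponding traversal visits. */
static int demo_count_all(int num_leaves)
{
	struct rcu_state_model s;
	struct rcu_node_model *rnp;
	int n = 0;

	model_init(&s, num_leaves);
	model_for_each_node_breadth_first(&s, rnp)
		n++;
	return n;
}

static int demo_count_nonleaf(int num_leaves)
{
	struct rcu_state_model s;
	struct rcu_node_model *rnp;
	int n = 0;

	model_init(&s, num_leaves);
	model_for_each_nonleaf_node_breadth_first(&s, rnp)
		n++;
	return n;
}

static int demo_count_leaf(int num_leaves)
{
	struct rcu_state_model s;
	struct rcu_node_model *rnp;
	int n = 0;

	model_init(&s, num_leaves);
	model_for_each_leaf_node(&s, rnp)
		n++;
	return n;
}
```

<p>With one root and four leaves, the three traversals visit five, one,
and four nodes, respectively.
In the single-node case, the last level starts at <tt>->node[0]</tt>,
so the non-leaf loop's bound equals its starting point and the loop body
never runs, while the leaf loop visits the lone node.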
				1431
				1432	<h3><a name="Summary">
				1433	Summary</a></h3>
				1434
				1435	So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
				1436	which contains a combining tree of <tt>rcu_node</tt> and
				1437	<tt>rcu_data</tt> structures.
				1438	Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
				1439	state is tracked by an <tt>rcu_dynticks</tt> structure.
				1440
				1441	If you made it this far, you are well prepared to read the code
				1442	walkthroughs in the other articles in this series.
				1443
				1444	<h3><a name="Acknowledgments">
				1445	Acknowledgments</a></h3>
				1446
				1447	I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
				1448	Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
				1449	for helping me get this document into a more human-readable state.
				1450
				1451	<h3><a name="Legal Statement">
				1452	Legal Statement</a></h3>
				1453
				1454	<p>This work represents the view of the author and does not necessarily
				1455	represent the view of IBM.
				1456
				1457	</p><p>Linux is a registered trademark of Linus Torvalds.
				1458
				1459	</p><p>Other company, product, and service names may be trademarks or
				1460	service marks of others.
				1461
				1462	</body></html>