<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<p>December 18, 2016</p>
<p>This article was contributed by Paul E. McKenney</p>

<h3>Introduction</h3>

This document describes RCU's major data structures and their relationship
to each other.

<ol>
<li>	<a href="#Data-Structure Relationships">
	Data-Structure Relationships</a>
<li>	<a href="#The rcu_state Structure">
	The <tt>rcu_state</tt> Structure</a>
<li>	<a href="#The rcu_node Structure">
	The <tt>rcu_node</tt> Structure</a>
<li>	<a href="#The rcu_segcblist Structure">
	The <tt>rcu_segcblist</tt> Structure</a>
<li>	<a href="#The rcu_data Structure">
	The <tt>rcu_data</tt> Structure</a>
<li>	<a href="#The rcu_head Structure">
	The <tt>rcu_head</tt> Structure</a>
<li>	<a href="#RCU-Specific Fields in the task_struct Structure">
	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
<li>	<a href="#Accessor Functions">
	Accessor Functions</a>
</ol>

<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>

<p>RCU is for all intents and purposes a large state machine, and its
data structures maintain the state in such a way as to allow RCU readers
to execute extremely quickly, while also processing the RCU grace periods
requested by updaters in an efficient and extremely scalable fashion.
The efficiency and scalability of RCU updaters is provided primarily
by a combining tree, as shown below:

</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">

</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
containing a tree of <tt>rcu_node</tt> structures.
Each leaf node of the <tt>rcu_node</tt> tree has up to 16
<tt>rcu_data</tt> structures associated with it, so that there
are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures,
one for each possible CPU.
This structure is adjusted at boot time, if needed, to handle the
common case where <tt>nr_cpu_ids</tt> is much less than
<tt>NR_CPUS</tt>.
For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
which results in a three-level <tt>rcu_node</tt> tree.
If the actual hardware has only 16 CPUs, RCU will adjust itself
at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.

</p><p>The purpose of this combining tree is to allow per-CPU events
such as quiescent states, dyntick-idle transitions,
and CPU hotplug operations to be processed efficiently
and scalably.
Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
and other events are recorded by the leaf-level <tt>rcu_node</tt>
structures.
All of these events are combined at each level of the tree until finally
grace periods are completed at the tree's root <tt>rcu_node</tt>
structure.
A grace period can be completed at the root once every CPU
(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
has passed through a quiescent state.
Once a grace period has completed, record of that fact is propagated
back down the tree.

</p><p>As can be seen from the diagram, on a 64-bit system
a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
of 64 at the root and a fanout of 16 at the leaves.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why isn't the fanout at the leaves also 64?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Because there are more types of events that affect the leaf-level
	<tt>rcu_node</tt> structures than further up the tree.
	Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
	64, the contention on these structures' <tt>->lock</tt> fields
	becomes excessive.
	Experimentation on a wide variety of systems has shown that a fanout
	of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
	</font>

<p><font color="ffffff">Of course, further experience with
	systems having hundreds or thousands of CPUs may demonstrate
	that the fanout for the non-leaf <tt>rcu_node</tt> structures
	must also be reduced.
	Such reduction can be easily carried out when and if it proves
	necessary.
	In the meantime, if you are using such a system and running into
	contention problems on the non-leaf <tt>rcu_node</tt> structures,
	you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
	parameter to reduce the non-leaf fanout as needed.
	</font>

<p><font color="ffffff">Kernels built for systems with
	strong NUMA characteristics might also need to adjust
	<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
	<tt>rcu_node</tt> structures align with hardware boundaries.
	However, there has thus far been no need for this.
	</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
a 32-bit system), then RCU will automatically add more levels to the
tree.
For example, if you are crazy enough to build a 64-bit system with 65,536
CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:

</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">

</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
32-bit systems.
On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and
<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result
in a 16-CPU test using a 4-level tree.
This can be useful for testing large-system capabilities on small test
machines.

</p><p>This multi-level combining tree allows us to get most of the
performance and scalability
benefits of partitioning, even though RCU grace-period detection is
inherently a global operation.
The trick here is that only the last CPU to report a quiescent state
into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
structure at the next level up the tree.
This means that at the leaf-level <tt>rcu_node</tt> structure, only
one access out of sixteen will progress up the tree.
For the internal <tt>rcu_node</tt> structures, the situation is even
more extreme: Only one access out of sixty-four will progress up
the tree.
Because the vast majority of the CPUs do not progress up the tree,
the lock contention remains roughly constant up the tree.
No matter how many CPUs there are in the system, at most 64 quiescent-state
reports per grace period will progress all the way to the root
<tt>rcu_node</tt> structure, thus ensuring that the lock contention
on that root <tt>rcu_node</tt> structure remains acceptably low.

</p><p>In effect, the combining tree acts like a big shock absorber,
keeping lock contention under control at all tree levels regardless
of the level of loading on the system.

</p><p>RCU updaters wait for normal grace periods by registering
RCU callbacks, either directly via <tt>call_rcu()</tt>
or indirectly via <tt>synchronize_rcu()</tt> and friends.
RCU callbacks are represented by <tt>rcu_head</tt> structures,
which are queued on <tt>rcu_data</tt> structures while they are
waiting for a grace period to elapse, as shown in the following figure:

</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">

</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
<tt>PREEMPT_RCU</tt>'s major data structures are related.
Lesser data structures will be introduced with the algorithms that
make use of them.

</p><p>Note that each of the data structures in the above figure has
its own synchronization:

<p><ol>
<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
	and some fields are protected by the corresponding root
	<tt>rcu_node</tt> structure's lock.
<li>	Each <tt>rcu_node</tt> structure has a spinlock.
<li>	The fields in <tt>rcu_data</tt> are private to the corresponding
	CPU, although a few can be read and written by other CPUs.
</ol>

<p>It is important to note that different data structures can have
very different ideas about the state of RCU at any given time.
For but one example, awareness of the start or end of a given RCU
grace period propagates slowly through the data structures.
This slow propagation is absolutely necessary for RCU to have good
read-side performance.
If this balkanized implementation seems foreign to you, one useful
trick is to consider each instance of these data structures to be
a different person, each having the usual slightly different
view of reality.

</p><p>The general role of each of these data structures is as
follows:

</p><ol>
<li>	<tt>rcu_state</tt>:
	This structure forms the interconnection between the
	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
	tracks grace periods, serves as short-term repository
	for callbacks orphaned by CPU-hotplug events,
	maintains <tt>rcu_barrier()</tt> state,
	tracks expedited grace-period state,
	and maintains state used to force quiescent states when
	grace periods extend too long.
<li>	<tt>rcu_node</tt>: This structure forms the combining
	tree that propagates quiescent-state
	information from the leaves to the root, and also propagates
	grace-period information from the root to the leaves.
	It provides local copies of the grace-period state in order
	to allow this information to be accessed in a synchronized
	manner without suffering the scalability limitations that
	would otherwise be imposed by global locking.
	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
	of tasks that have blocked while in their current
	RCU read-side critical section.
	In <tt>CONFIG_PREEMPT_RCU</tt> with
	<tt>CONFIG_RCU_BOOST</tt>, it manages the
	per-<tt>rcu_node</tt> priority-boosting
	kernel threads (kthreads) and state.
	Finally, it records CPU-hotplug state in order to determine
	which CPUs should be ignored during a given grace period.
<li>	<tt>rcu_data</tt>: This per-CPU structure is the
	focus of quiescent-state detection and RCU callback queuing.
	It also tracks its relationship to the corresponding leaf
	<tt>rcu_node</tt> structure to allow more-efficient
	propagation of quiescent states up the <tt>rcu_node</tt>
	combining tree.
	Like the <tt>rcu_node</tt> structure, it provides a local
	copy of the grace-period information to allow lock-free
	synchronized
	access to this information from the corresponding CPU.
	Finally, this structure records past dyntick-idle state
	for the corresponding CPU and also tracks statistics.
<li>	<tt>rcu_head</tt>:
	This structure represents RCU callbacks, and is the
	only structure allocated and managed by RCU users.
	The <tt>rcu_head</tt> structure is normally embedded
	within the RCU-protected data structure.
</ol>

<p>If all you wanted from this article was a general notion of how
RCU's data structures are related, you are done.
Otherwise, each of the following sections gives more details on
the <tt>rcu_state</tt>, <tt>rcu_node</tt>, and <tt>rcu_data</tt> data
structures.

<h3><a name="The rcu_state Structure">
The <tt>rcu_state</tt> Structure</a></h3>

<p>The <tt>rcu_state</tt> structure is the base structure that
represents the state of RCU in the system.
This structure forms the interconnection between the
<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
tracks grace periods, contains the lock used to
synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when
grace periods extend too long.

</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
singly and in groups, in the following sections.
The more specialized fields are covered in the discussion of their
use.

<h5>Relationship to rcu_node and rcu_data Structures</h5>

This portion of the <tt>rcu_state</tt> structure is declared
as follows:

<pre>
  1   struct rcu_node node[NUM_RCU_NODES];
  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
  3   struct rcu_data __percpu *rda;
</pre>

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Wait a minute!
	You said that the <tt>rcu_node</tt> structures formed a tree,
	but they are declared as a flat array!
	What gives?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The tree is laid out in the array.
	The first node in the array is the head, the next set of nodes in the
	array are children of the head node, and so on until the last set of
	nodes in the array are the leaves.
	</font>

<p><font color="ffffff">See the following diagrams to see how
	this works.
	</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>The <tt>rcu_node</tt> tree is embedded into the
<tt>->node[]</tt> array as shown in the following figure:

</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">

</p><p>One interesting consequence of this mapping is that a
breadth-first traversal of the tree is implemented as a simple
linear scan of the array, which is in fact what the
<tt>rcu_for_each_node_breadth_first()</tt> macro does.
This macro is used at the beginning and ends of grace periods.

</p><p>Each entry of the <tt>->level</tt> array references
the first <tt>rcu_node</tt> structure on the corresponding level
of the tree, for example, as shown below:

</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">

</p><p>The zero<sup>th</sup> element of the array references the root
<tt>rcu_node</tt> structure, the first element references the
first child of the root <tt>rcu_node</tt>, and finally the second
element references the first leaf <tt>rcu_node</tt> structure.

</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
rather than array-shaped, it is easy to draw a planar representation:

</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">

</p><p>Finally, the <tt>->rda</tt> field references a per-CPU
pointer to the corresponding CPU's <tt>rcu_data</tt> structure.

</p><p>All of these fields are constant once initialization is complete,
and therefore need no protection.

<h5>Grace-Period Tracking</h5>

<p>This portion of the <tt>rcu_state</tt> structure is declared
as follows:

<pre>
  1   unsigned long gp_seq;
</pre>

<p>RCU grace periods are numbered, and
the <tt>->gp_seq</tt> field contains the current grace-period
sequence number.
The bottom two bits are the state of the current grace period,
which can be zero for not yet started or one for in progress.
In other words, if the bottom two bits of <tt>->gp_seq</tt> are
zero, then RCU is idle.
Any other value in the bottom two bits indicates that something is broken.
This field is protected by the root <tt>rcu_node</tt> structure's
<tt>->lock</tt> field.

</p><p>There are <tt>->gp_seq</tt> fields
in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
as well.
The fields in the <tt>rcu_state</tt> structure represent the
most current value, and those of the other structures are compared
in order to detect the beginnings and ends of grace periods in a distributed
fashion.
The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
(down the tree from the root to the leaves) to <tt>rcu_data</tt>.

<h5>Miscellaneous</h5>

<p>This portion of the <tt>rcu_state</tt> structure is declared
as follows:

<pre>
  1   unsigned long gp_max;
  2   char abbr;
  3   char *name;
</pre>

<p>The <tt>->gp_max</tt> field tracks the duration of the longest
grace period in jiffies.
It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.

<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish
between preemptible RCU (&ldquo;rcu_preempt&rdquo; and &ldquo;p&rdquo;)
and non-preemptible RCU (&ldquo;rcu_sched&rdquo; and &ldquo;s&rdquo;).
These fields are used for diagnostic and tracing purposes.

<h3><a name="The rcu_node Structure">
The <tt>rcu_node</tt> Structure</a></h3>

<p>The <tt>rcu_node</tt> structures form the combining
tree that propagates quiescent-state
information from the leaves to the root and also propagates
grace-period information from the root down to the leaves.
They provide local copies of the grace-period state in order
to allow this information to be accessed in a synchronized
manner without suffering the scalability limitations that
would otherwise be imposed by global locking.
In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
of tasks that have blocked while in their current
RCU read-side critical section.
In <tt>CONFIG_PREEMPT_RCU</tt> with
<tt>CONFIG_RCU_BOOST</tt>, they manage the
per-<tt>rcu_node</tt> priority-boosting
kernel threads (kthreads) and state.
Finally, they record CPU-hotplug state in order to determine
which CPUs should be ignored during a given grace period.

</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
singly and in groups, in the following sections.

<h5>Connection to Combining Tree</h5>

<p>This portion of the <tt>rcu_node</tt> structure is declared
as follows:

<pre>
  1   struct rcu_node *parent;
  2   u8 level;
  3   u8 grpnum;
  4   unsigned long grpmask;
  5   int grplo;
  6   int grphi;
</pre>

<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
one level up in the tree, and is <tt>NULL</tt> for the root
<tt>rcu_node</tt>.
The RCU implementation makes heavy use of this field to push quiescent
states up the tree.
The <tt>->level</tt> field gives the level in the tree, with
the root being at level zero, its children at level one, and so on.
The <tt>->grpnum</tt> field gives this node's position within
the children of its parent, so this number can range between 0 and 31
on 32-bit systems and between 0 and 63 on 64-bit systems.
The <tt>->level</tt> and <tt>->grpnum</tt> fields are
used only during initialization and for tracing.
The <tt>->grpmask</tt> field is the bitmask counterpart of
<tt>->grpnum</tt>, and therefore always has exactly one bit set.
This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
structure in its parent's bitmasks, which are described later.
Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
contain the lowest and highest numbered CPU served by this
<tt>rcu_node</tt> structure, respectively.

</p><p>All of these fields are constant, and thus do not require any
synchronization.

<h5>Synchronization</h5>

<p>This field of the <tt>rcu_node</tt> structure is declared
as follows:

<pre>
  1   raw_spinlock_t lock;
</pre>

<p>This field is used to protect the remaining fields in this structure,
unless otherwise stated.
That said, all of the fields in this structure can be accessed without
locking for tracing purposes.
Yes, this can result in confusing traces, but better some tracing confusion
than to be heisenbugged out of existence.

<h5>Grace-Period Tracking</h5>

<p>This portion of the <tt>rcu_node</tt> structure is declared
as follows:

<pre>
  1   unsigned long gp_seq;
  2   unsigned long gp_seq_needed;
</pre>

<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are
the counterparts of the field of the same name in the <tt>rcu_state</tt>
structure.
They each may lag up to one step behind their <tt>rcu_state</tt>
counterpart.
If the bottom two bits of a given <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt>
structure believes that RCU is idle.
</p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt>
structure is updated at the beginning and the end
of each grace period.

<p>The <tt>->gp_seq_needed</tt> fields record the
furthest-in-the-future grace period request seen by the corresponding
<tt>rcu_node</tt> structure.  The request is considered fulfilled when
the value of the <tt>->gp_seq</tt> field equals or exceeds that of
the <tt>->gp_seq_needed</tt> field.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Suppose that this <tt>rcu_node</tt> structure doesn't see
	a request for a very long time.
	Won't wrapping of the <tt>->gp_seq</tt> field cause
	problems?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	No, because if the <tt>->gp_seq_needed</tt> field lags behind the
	<tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field
	will be updated at the end of the grace period.
	Modulo-arithmetic comparisons therefore will always get the
	correct answer, even with wrapping.
	</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h5>Quiescent-State Tracking</h5>

<p>These fields manage the propagation of quiescent states up the
combining tree.

</p><p>This portion of the <tt>rcu_node</tt> structure has fields
as follows:

<pre>
  1   unsigned long qsmask;
  2   unsigned long expmask;
  3   unsigned long qsmaskinit;
  4   unsigned long expmaskinit;
</pre>

<p>The <tt>->qsmask</tt> field tracks which of this
<tt>rcu_node</tt> structure's children still need to report
quiescent states for the current normal grace period.
Such children will have a value of 1 in their corresponding bit.
Note that the leaf <tt>rcu_node</tt> structures should be
thought of as having <tt>rcu_data</tt> structures as their
children.
Similarly, the <tt>->expmask</tt> field tracks which
of this <tt>rcu_node</tt> structure's children still need to report
quiescent states for the current expedited grace period.
An expedited grace period has
the same conceptual properties as a normal grace period, but the
expedited implementation accepts extreme CPU overhead to obtain
much lower grace-period latency, for example, consuming a few
tens of microseconds worth of CPU time to reduce grace-period
duration from milliseconds to tens of microseconds.
The <tt>->qsmaskinit</tt> field tracks which of this
<tt>rcu_node</tt> structure's children cover for at least
one online CPU.
This mask is used to initialize <tt>->qsmask</tt>,
and <tt>->expmaskinit</tt> is used to initialize
<tt>->expmask</tt>, at the beginning of the
normal and expedited grace periods, respectively.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why are these bitmasks protected by locking?
	Come on, haven't you heard of atomic instructions???
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Lockless grace-period computation!  Such a tantalizing possibility!
	</font>

<p><font color="ffffff">But consider the following sequence of events:
	</font>

<ol>
<li>	<font color="ffffff">CPU 0 has been in dyntick-idle
	mode for quite some time.
	When it wakes up, it notices that the current RCU
	grace period needs it to report in, so it sets a
	flag where the scheduling clock interrupt will find it.
	</font><p>
<li>	<font color="ffffff">Meanwhile, CPU 1 is running
	<tt>force_quiescent_state()</tt>,
	and notices that CPU 0 has been in dyntick-idle mode,
	which qualifies as an extended quiescent state.
	</font><p>
<li>	<font color="ffffff">CPU 0's scheduling clock
	interrupt fires in the
	middle of an RCU read-side critical section, and notices
	that the RCU core needs something, so commences RCU softirq
	processing.
	</font>
	<p>
<li>	<font color="ffffff">CPU 0's softirq handler
	executes and is just about ready
	to report its quiescent state up the <tt>rcu_node</tt>
	tree.
	</font><p>
<li>	<font color="ffffff">But CPU 1 beats it to the punch,
	completing the current
	grace period and starting a new one.
	</font><p>
<li>	<font color="ffffff">CPU 0 now reports its quiescent
	state for the wrong
	grace period.
	That grace period might now end before the RCU read-side
	critical section.
	If that happens, disaster will ensue.
	</font>
</ol>

<p><font color="ffffff">So the locking is absolutely required in
	order to coordinate clearing of the bits with updating of the
	grace-period sequence number in <tt>->gp_seq</tt>.
	</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h5>Blocked-Task Management</h5>

<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
midst of their RCU read-side critical sections, and these tasks
must be tracked explicitly.
The details of exactly why and how they are tracked will be covered
in a separate article on RCU read-side processing.
For now, it is enough to know that the <tt>rcu_node</tt>
structure tracks them.

<pre>
  1   struct list_head blkd_tasks;
  2   struct list_head *gp_tasks;
  3   struct list_head *exp_tasks;
  4   bool wait_blkd_tasks;
</pre>

<p>The <tt>->blkd_tasks</tt> field is a list header for
the list of blocked and preempted tasks.
As tasks undergo context switches within RCU read-side critical
sections, their <tt>task_struct</tt> structures are enqueued
(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
field) onto the head of the <tt>->blkd_tasks</tt> list for the
leaf <tt>rcu_node</tt> structure corresponding to the CPU
on which the outgoing context switch executed.
As these tasks later exit their RCU read-side critical sections,
they remove themselves from the list.
This list is therefore in reverse time order, so that if one of the tasks
is blocking the current grace period, all subsequent tasks must
also be blocking that same grace period.
Therefore, a single pointer into this list suffices to track
all tasks blocking a given grace period.
That pointer is stored in <tt>->gp_tasks</tt> for normal
grace periods and in <tt>->exp_tasks</tt> for expedited
grace periods.
These last two fields are <tt>NULL</tt> if either there is
no grace period in flight or if there are no blocked tasks
preventing that grace period from completing.
If either of these two pointers is referencing a task that
removes itself from the <tt>->blkd_tasks</tt> list,
then that task must advance the pointer to the next task on
the list, or set the pointer to <tt>NULL</tt> if there
are no subsequent tasks on the list.
| 649 | |
| 650 | </p><p>For example, suppose that tasks T1, T2, and T3 are |
| 651 | all hard-affinitied to the largest-numbered CPU in the system. |
| 652 | Then if task T1 blocked in an RCU read-side |
| 653 | critical section, then an expedited grace period started, |
| 654 | then task T2 blocked in an RCU read-side critical section, |
| 655 | then a normal grace period started, and finally task T3 blocked |
| 656 | in an RCU read-side critical section, then the state of the |
| 657 | last leaf <tt>rcu_node</tt> structure's blocked-task list |
| 658 | would be as shown below: |
| 659 | |
| 660 | </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> |
| 661 | |
| 662 | </p><p>Task T1 is blocking both grace periods, task T2 is |
| 663 | blocking only the normal grace period, and task T3 is blocking |
| 664 | neither grace period. |
| 665 | Note that these tasks will not remove themselves from this list |
| 666 | immediately upon resuming execution. |
| 667 | They will instead remain on the list until they execute the outermost |
| 668 | <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical |
| 669 | section. |
| 670 | |
| 671 | <p> |
| 672 | The <tt>->wait_blkd_tasks</tt> field indicates whether or not |
| 673 | the current grace period is waiting on a blocked task. |
| 674 | |
| 675 | <h5>Sizing the <tt>rcu_node</tt> Array</h5> |
| 676 | |
| 677 | <p>The <tt>rcu_node</tt> array is sized via a series of |
| 678 | C-preprocessor expressions as follows: |
| 679 | |
| 680 | <pre> |
| 681 | 1 #ifdef CONFIG_RCU_FANOUT |
| 682 | 2 #define RCU_FANOUT CONFIG_RCU_FANOUT |
| 683 | 3 #else |
| 684 | 4 # ifdef CONFIG_64BIT |
| 685 | 5 # define RCU_FANOUT 64 |
| 686 | 6 # else |
| 687 | 7 # define RCU_FANOUT 32 |
| 688 | 8 # endif |
| 689 | 9 #endif |
| 690 | 10 |
| 691 | 11 #ifdef CONFIG_RCU_FANOUT_LEAF |
| 692 | 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF |
| 693 | 13 #else |
| 694 | 14 # ifdef CONFIG_64BIT |
| 695 | 15 # define RCU_FANOUT_LEAF 64 |
| 696 | 16 # else |
| 697 | 17 # define RCU_FANOUT_LEAF 32 |
| 698 | 18 # endif |
| 699 | 19 #endif |
| 700 | 20 |
| 701 | 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) |
| 702 | 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) |
| 703 | 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) |
| 704 | 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) |
| 705 | 25 |
| 706 | 26 #if NR_CPUS <= RCU_FANOUT_1 |
| 707 | 27 # define RCU_NUM_LVLS 1 |
| 708 | 28 # define NUM_RCU_LVL_0 1 |
| 709 | 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 |
| 710 | 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } |
| 711 | 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } |
| 712 | 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } |
| 713 | 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } |
| 714 | 34 #elif NR_CPUS <= RCU_FANOUT_2 |
| 715 | 35 # define RCU_NUM_LVLS 2 |
| 716 | 36 # define NUM_RCU_LVL_0 1 |
| 717 | 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 718 | 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) |
| 719 | 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } |
| 720 | 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } |
| 721 | 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } |
| 722 | 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } |
| 723 | 43 #elif NR_CPUS <= RCU_FANOUT_3 |
| 724 | 44 # define RCU_NUM_LVLS 3 |
| 725 | 45 # define NUM_RCU_LVL_0 1 |
| 726 | 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) |
| 727 | 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 728 | 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) |
| 729 | 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } |
| 730 | 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } |
| 731 | 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } |
| 732 | 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } |
| 733 | 53 #elif NR_CPUS <= RCU_FANOUT_4 |
| 734 | 54 # define RCU_NUM_LVLS 4 |
| 735 | 55 # define NUM_RCU_LVL_0 1 |
| 736 | 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) |
| 737 | 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) |
| 738 | 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 739 | 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) |
| 740 | 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } |
| 741 | 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } |
| 742 | 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } |
| 743 | 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } |
| 744 | 64 #else |
| 745 | 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" |
| 746 | 66 #endif |
| 747 | </pre> |
| 748 | |
| 749 | <p>The maximum number of levels in the <tt>rcu_node</tt> structure |
| 750 | is currently limited to four, as specified by lines 21-24 |
| 751 | and the structure of the subsequent “if” statement. |
| 752 | For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which |
| 753 | should be sufficient for the next few years at least. |
| 754 | For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which |
| 755 | should see us through the next decade or so. |
| 756 | This four-level tree also allows kernels built with |
| 757 | <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, |
| 758 | which might be useful in very large systems having eight CPUs per |
| 759 | socket (but please note that no one has yet shown any measurable |
| 760 | performance degradation due to misaligned socket and <tt>rcu_node</tt> |
| 761 | boundaries). |
| 762 | In addition, building kernels with a full four levels of <tt>rcu_node</tt> |
| 763 | tree permits better testing of RCU's combining-tree code. |
| 764 | |
| 765 | </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children |
| 766 | are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. |
| 767 | If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, |
| 768 | it is set based on the word size of the system, which is also |
| 769 | the Kconfig default. |
| 770 | |
| 771 | </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are |
| 772 | handled by each leaf <tt>rcu_node</tt> structure. |
| 773 | Experience has shown that allowing a given leaf <tt>rcu_node</tt> |
| 774 | structure to handle 64 CPUs, as permitted by the number of bits in |
| 775 | the <tt>->qsmask</tt> field on a 64-bit system, results in |
| 776 | excessive contention for the leaf <tt>rcu_node</tt> structures' |
| 777 | <tt>->lock</tt> fields. |
| 778 | The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore |
| 779 | limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. |
| 780 | If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value |
| 781 | selected is based on the word size of the system, just as for |
| 782 | <tt>CONFIG_RCU_FANOUT</tt>. |
| 783 | Lines 11-19 perform this computation. |
| 784 | |
| 785 | </p><p>Lines 21-24 compute the maximum number of CPUs supported by |
| 786 | a single-level (which contains a single <tt>rcu_node</tt> structure), |
| 787 | two-level, three-level, and four-level <tt>rcu_node</tt> tree, |
| 788 | respectively, given the fanout specified by <tt>RCU_FANOUT</tt> |
| 789 | and <tt>RCU_FANOUT_LEAF</tt>. |
| 790 | These numbers of CPUs are retained in the |
| 791 | <tt>RCU_FANOUT_1</tt>, |
| 792 | <tt>RCU_FANOUT_2</tt>, |
| 793 | <tt>RCU_FANOUT_3</tt>, and |
| 794 | <tt>RCU_FANOUT_4</tt> |
| 795 | C-preprocessor variables, respectively. |
| 796 | |
| 797 | </p><p>These variables are used to control the C-preprocessor <tt>#if</tt> |
| 798 | statement spanning lines 26-66 that computes the number of |
| 799 | <tt>rcu_node</tt> structures required for each level of the tree, |
| 800 | as well as the number of levels required. |
| 801 | The number of levels is placed in the <tt>NUM_RCU_LVLS</tt> |
| 802 | C-preprocessor variable by lines 27, 35, 44, and 54. |
| 803 | The number of <tt>rcu_node</tt> structures for the topmost level |
| 804 | of the tree is always exactly one, and this value is unconditionally |
| 805 | placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. |
| 806 | The rest of the levels (if any) of the <tt>rcu_node</tt> tree |
| 807 | are computed by dividing the maximum number of CPUs by the |
| 808 | fanout supported by the number of levels from the current level down, |
| 809 | rounding up. This computation is performed by lines 37, |
| 810 | 46-47, and 56-58. |
| 811 | Lines 31-33, 40-42, 50-52, and 61-63 create initializers |
| 812 | for lockdep lock-class names. |
| 813 | Finally, lines 64-66 produce an error if the maximum number of |
| 814 | CPUs is too large for the specified fanout. |
| 815 | |
| 816 | <h3><a name="The rcu_segcblist Structure"> |
| 817 | The <tt>rcu_segcblist</tt> Structure</a></h3> |
| 818 | |
| 819 | The <tt>rcu_segcblist</tt> structure maintains a segmented list of |
| 820 | callbacks as follows: |
| 821 | |
| 822 | <pre> |
| 823 | 1 #define RCU_DONE_TAIL 0 |
| 824 | 2 #define RCU_WAIT_TAIL 1 |
| 825 | 3 #define RCU_NEXT_READY_TAIL 2 |
| 826 | 4 #define RCU_NEXT_TAIL 3 |
| 827 | 5 #define RCU_CBLIST_NSEGS 4 |
| 828 | 6 |
| 829 | 7 struct rcu_segcblist { |
| 830 | 8 struct rcu_head *head; |
| 831 | 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; |
| 832 | 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; |
| 833 | 11 long len; |
| 834 | 12 long len_lazy; |
| 835 | 13 }; |
| 836 | </pre> |
| 837 | |
| 838 | <p> |
| 839 | The segments are as follows: |
| 840 | |
| 841 | <ol> |
| 842 | <li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. |
| 843 | These callbacks are ready to be invoked. |
| 844 | <li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the |
| 845 | current grace period. |
| 846 | Note that different CPUs can have different ideas about which |
| 847 | grace period is current, hence the <tt>->gp_seq</tt> field. |
| 848 | <li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next |
| 849 | grace period to start. |
| 850 | <li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been |
| 851 | associated with a grace period. |
| 852 | </ol> |
| 853 | |
| 854 | <p> |
| 855 | The <tt>->head</tt> pointer references the first callback or |
| 856 | is <tt>NULL</tt> if the list contains no callbacks (which is |
| 857 | <i>not</i> the same as being empty). |
| 858 | Each element of the <tt>->tails[]</tt> array references the |
| 859 | <tt>->next</tt> pointer of the last callback in the corresponding |
| 860 | segment of the list, or the list's <tt>->head</tt> pointer if |
| 861 | that segment and all previous segments are empty. |
| 862 | If the corresponding segment is empty but some previous segment is |
| 863 | not empty, then the array element is identical to its predecessor. |
| 864 | Older callbacks are closer to the head of the list, and new callbacks |
| 865 | are added at the tail. |
| 866 | This relationship between the <tt>->head</tt> pointer, the |
| 867 | <tt>->tails[]</tt> array, and the callbacks is shown in this |
| 868 | diagram: |
| 869 | |
| 870 | </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> |
| 871 | |
| 872 | </p><p>In this figure, the <tt>->head</tt> pointer references the |
| 873 | first |
| 874 | RCU callback in the list. |
| 875 | The <tt>->tails[RCU_DONE_TAIL]</tt> array element references |
| 876 | the <tt>->head</tt> pointer itself, indicating that none |
| 877 | of the callbacks is ready to invoke. |
| 878 | The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback |
| 879 | CB 2's <tt>->next</tt> pointer, which indicates that |
| 880 | CB 1 and CB 2 are both waiting on the current grace period, |
| 881 | give or take possible disagreements about exactly which grace period |
| 882 | is the current one. |
| 883 | The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element |
| 884 | references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> |
| 885 | does, which indicates that there are no callbacks waiting on the next |
| 886 | RCU grace period. |
| 887 | The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references |
| 888 | CB 4's <tt>->next</tt> pointer, indicating that all the |
| 889 | remaining RCU callbacks have not yet been assigned to an RCU grace |
| 890 | period. |
| 891 | Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element |
| 892 | always references the last RCU callback's <tt>->next</tt> pointer |
| 893 | unless the callback list is empty, in which case it references |
| 894 | the <tt>->head</tt> pointer. |
| 895 | |
| 896 | <p> |
| 897 | There is one additional important special case for the |
| 898 | <tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> |
| 899 | when this list is <i>disabled</i>. |
| 900 | Lists are disabled when the corresponding CPU is offline or when |
| 901 | the corresponding CPU's callbacks are offloaded to a kthread, |
| 902 | both of which are described elsewhere. |
| 903 | |
| 904 | </p><p>CPUs advance their callbacks from the |
| 905 | <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the |
| 906 | <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments |
| 907 | as grace periods advance. |
| 908 | |
| 909 | </p><p>The <tt>->gp_seq[]</tt> array records grace-period |
| 910 | numbers corresponding to the list segments. |
| 911 | This is what allows different CPUs to have different ideas as to |
| 912 | which is the current grace period while still avoiding premature |
| 913 | invocation of their callbacks. |
| 914 | In particular, this allows CPUs that go idle for extended periods |
| 915 | to determine which of their callbacks are ready to be invoked after |
| 916 | reawakening. |
| 917 | |
| 918 | </p><p>The <tt>->len</tt> counter contains the number of |
| 919 | callbacks in <tt>->head</tt>, and the |
| 920 | <tt>->len_lazy</tt> field contains the number of those callbacks that |
| 921 | are known to only free memory, and whose invocation can therefore |
| 922 | be safely deferred. |
| 923 | |
| 924 | <p><b>Important note</b>: It is the <tt>->len</tt> field that |
| 925 | determines whether or not there are callbacks associated with |
| 926 | this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> |
| 927 | pointer. |
| 928 | The reason for this is that all the ready-to-invoke callbacks |
| 929 | (that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted |
| 930 | all at once at callback-invocation time (<tt>rcu_do_batch</tt>), so |
| 931 | <tt>->head</tt> may be set to <tt>NULL</tt> if there are no not-done |
| 932 | callbacks remaining in the <tt>rcu_segcblist</tt>. |
| 933 | If callback invocation must be postponed, for example, because a |
| 934 | high-priority process just woke up on this CPU, then the remaining |
| 935 | callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and |
| 936 | <tt>->head</tt> once again points to the start of the segment. |
| 937 | In short, the head field can briefly be <tt>NULL</tt> even though the |
| 938 | CPU has callbacks present the entire time. |
| 939 | Therefore, it is not appropriate to test the <tt>->head</tt> pointer |
| 940 | for <tt>NULL</tt>. |
| 941 | |
| 942 | <p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts |
| 943 | are adjusted only after the corresponding callbacks have been invoked. |
| 944 | This means that the <tt>->len</tt> count is zero only if |
| 945 | the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. |
| 946 | Of course, off-CPU sampling of the <tt>->len</tt> count requires |
| 947 | careful use of appropriate synchronization, for example, memory barriers. |
| 948 | This synchronization can be a bit subtle, particularly in the case |
| 949 | of <tt>rcu_barrier()</tt>. |
| 950 | |
| 951 | <h3><a name="The rcu_data Structure"> |
| 952 | The <tt>rcu_data</tt> Structure</a></h3> |
| 953 | |
| 954 | <p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the RCU subsystem. |
Andrew Scull | b4b6d4a | 2019-01-02 15:54:55 +0000 | [diff] [blame] | 955 | The fields in this structure may be accessed only from the corresponding |
| 956 | CPU (and from tracing) unless otherwise stated. |
| 957 | This structure is the |
| 958 | focus of quiescent-state detection and RCU callback queuing. |
| 959 | It also tracks its relationship to the corresponding leaf |
| 960 | <tt>rcu_node</tt> structure to allow more-efficient |
| 961 | propagation of quiescent states up the <tt>rcu_node</tt> |
| 962 | combining tree. |
| 963 | Like the <tt>rcu_node</tt> structure, it provides a local |
| 964 | copy of the grace-period information to allow for-free |
| 965 | synchronized access to this information from the |
| 966 | corresponding CPU. |
| 967 | Finally, this structure records past dyntick-idle state |
| 968 | for the corresponding CPU and also tracks statistics. |
| 969 | |
| 970 | </p><p>The <tt>rcu_data</tt> structure's fields are discussed, |
| 971 | singly and in groups, in the following sections. |
| 972 | |
| 973 | <h5>Connection to Other Data Structures</h5> |
| 974 | |
| 975 | <p>This portion of the <tt>rcu_data</tt> structure is declared |
| 976 | as follows: |
| 977 | |
| 978 | <pre> |
| 979 | 1 int cpu; |
| 980 | 2 struct rcu_node *mynode; |
| 981 | 3 unsigned long grpmask; |
| 982 | 4 bool beenonline; |
| 983 | </pre> |
| 984 | |
| 985 | <p>The <tt>->cpu</tt> field contains the number of the |
| 986 | corresponding CPU and the <tt>->mynode</tt> field references the |
| 987 | corresponding <tt>rcu_node</tt> structure. |
| 988 | The <tt>->mynode</tt> field is used to propagate quiescent states |
| 989 | up the combining tree. |
| 990 | These two fields are constant and therefore do not require synchronization. |
| 991 | |
| 992 | <p>The <tt>->grpmask</tt> field indicates the bit in |
Andrew Scull | b4b6d4a | 2019-01-02 15:54:55 +0000 | [diff] [blame] | 993 | the <tt>->mynode->qsmask</tt> corresponding to this |
| 994 | <tt>rcu_data</tt> structure, and is also used when propagating |
| 995 | quiescent states. |
| 996 | The <tt>->beenonline</tt> flag is set whenever the corresponding |
| 997 | CPU comes online, which means that the debugfs tracing need not dump |
| 998 | out any <tt>rcu_data</tt> structure for which this flag is not set. |
| 999 | |
| 1000 | <h5>Quiescent-State and Grace-Period Tracking</h5> |
| 1001 | |
| 1002 | <p>This portion of the <tt>rcu_data</tt> structure is declared |
| 1003 | as follows: |
| 1004 | |
| 1005 | <pre> |
| 1006 | 1 unsigned long gp_seq; |
| 1007 | 2 unsigned long gp_seq_needed; |
| 1008 | 3 bool cpu_no_qs; |
| 1009 | 4 bool core_needs_qs; |
| 1010 | 5 bool gpwrap; |
| 1011 | </pre> |
| 1012 | |
| 1013 | <p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same |
| 1014 | name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. The |
| 1015 | <tt>->gp_seq_needed</tt> field is the counterpart of the field of the same |
| 1016 | name in the <tt>rcu_node</tt> structure. |
| 1017 | They may each lag up to one behind their <tt>rcu_node</tt> |
| 1018 | counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and |
| 1019 | <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag |
| 1020 | arbitrarily far behind for CPUs in dyntick-idle mode (but these counters |
| 1021 | will catch up upon exit from dyntick-idle mode). |
| 1022 | If the lower two bits of a given <tt>rcu_data</tt> structure's |
| 1023 | <tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt> |
| 1024 | structure believes that RCU is idle. |
| 1025 | |
| 1026 | <table> |
| 1027 | <tr><th> </th></tr> |
| 1028 | <tr><th align="left">Quick Quiz:</th></tr> |
| 1029 | <tr><td> |
| 1030 | All this replication of the grace period numbers can only cause |
| 1031 | massive confusion. |
| 1032 | Why not just keep a global sequence number and be done with it??? |
| 1033 | </td></tr> |
| 1034 | <tr><th align="left">Answer:</th></tr> |
| 1035 | <tr><td bgcolor="#ffffff"><font color="ffffff"> |
| 1036 | Because if there were only a single global sequence |
| 1037 | number, there would need to be a single global lock to allow |
| 1038 | safely accessing and updating it. |
| 1039 | And if we are not going to have a single global lock, we need |
| 1040 | to carefully manage the numbers on a per-node basis. |
| 1041 | Recall from the answer to a previous Quick Quiz that the consequences |
| 1042 | of applying a previously sampled quiescent state to the wrong |
| 1043 | grace period are quite severe. |
| 1044 | </font></td></tr> |
| 1045 | <tr><td> </td></tr> |
| 1046 | </table> |
| 1047 | |
| 1048 | <p>The <tt>->cpu_no_qs</tt> flag indicates that the |
| 1049 | CPU has not yet passed through a quiescent state, |
| 1050 | while the <tt>->core_needs_qs</tt> flag indicates that the |
| 1051 | RCU core needs a quiescent state from the corresponding CPU. |
| 1052 | The <tt>->gpwrap</tt> field indicates that the corresponding |
| 1053 | CPU has remained idle for so long that the |
| 1054 | <tt>gp_seq</tt> counter is in danger of overflow, which |
| 1055 | will cause the CPU to disregard the values of its counters on |
| 1056 | its next exit from idle. |
| 1057 | |
| 1058 | <h5>RCU Callback Handling</h5> |
| 1059 | |
| 1060 | <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by |
| 1061 | the same CPU that registered them. |
| 1062 | This is strictly a cache-locality optimization: callbacks can and |
| 1063 | do get invoked on CPUs other than the one that registered them. |
| 1064 | After all, if the CPU that registered a given callback has gone |
| 1065 | offline before the callback can be invoked, there really is no other |
| 1066 | choice. |
| 1067 | |
| 1068 | </p><p>This portion of the <tt>rcu_data</tt> structure is declared |
| 1069 | as follows: |
| 1070 | |
| 1071 | <pre> |
| 1072 | 1 struct rcu_segcblist cblist; |
| 1073 | 2 long qlen_last_fqs_check; |
| 1074 | 3 unsigned long n_cbs_invoked; |
| 1075 | 4 unsigned long n_nocbs_invoked; |
| 1076 | 5 unsigned long n_cbs_orphaned; |
| 1077 | 6 unsigned long n_cbs_adopted; |
| 1078 | 7 unsigned long n_force_qs_snap; |
| 1079 | 8 long blimit; |
| 1080 | </pre> |
| 1081 | |
| 1082 | <p>The <tt>->cblist</tt> structure is the segmented callback list |
| 1083 | described earlier. |
| 1084 | The CPU advances the callbacks in its <tt>rcu_data</tt> structure |
| 1085 | whenever it notices that another RCU grace period has completed. |
| 1086 | The CPU detects the completion of an RCU grace period by noticing |
| 1087 | that the value of its <tt>rcu_data</tt> structure's |
| 1088 | <tt>->gp_seq</tt> field differs from that of its leaf |
| 1089 | <tt>rcu_node</tt> structure. |
| 1090 | Recall that each <tt>rcu_node</tt> structure's |
| 1091 | <tt>->gp_seq</tt> field is updated at the beginnings and ends of each |
| 1092 | grace period. |
| 1093 | |
| 1094 | <p> |
| 1095 | The <tt>->qlen_last_fqs_check</tt> and |
| 1096 | <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent |
| 1097 | states from <tt>call_rcu()</tt> and friends when callback |
| 1098 | lists grow excessively long. |
| 1099 | |
| 1100 | </p><p>The <tt>->n_cbs_invoked</tt>, |
| 1101 | <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> |
| 1102 | fields count the number of callbacks invoked, |
| 1103 | sent to other CPUs when this CPU goes offline, |
| 1104 | and received from other CPUs when those other CPUs go offline. |
| 1105 | The <tt>->n_nocbs_invoked</tt> field is used when the CPU's callbacks |
| 1106 | are offloaded to a kthread. |
| 1107 | |
| 1108 | <p> |
| 1109 | Finally, the <tt>->blimit</tt> counter is the maximum number of |
| 1110 | RCU callbacks that may be invoked at a given time. |
| 1111 | |
| 1112 | <h5>Dyntick-Idle Handling</h5> |
| 1113 | |
| 1114 | <p>This portion of the <tt>rcu_data</tt> structure is declared |
| 1115 | as follows: |
| 1116 | |
| 1117 | <pre> |
| 1118 | 1 int dynticks_snap; |
| 1119 | 2 unsigned long dynticks_fqs; |
| 1120 | </pre> |
| 1121 | |
| 1122 | The <tt>->dynticks_snap</tt> field is used to take a snapshot |
| 1123 | of the corresponding CPU's dyntick-idle state when forcing |
| 1124 | quiescent states, and is therefore accessed from other CPUs. |
| 1125 | Finally, the <tt>->dynticks_fqs</tt> field is used to |
| 1126 | count the number of times this CPU is determined to be in |
| 1127 | dyntick-idle state, and is used for tracing and debugging purposes. |
| 1128 | |
| 1129 | <p> |
| 1130 | This portion of the <tt>rcu_data</tt> structure is declared as follows: |
| 1131 | |
| 1132 | <pre> |
| 1133 | 1 long dynticks_nesting; |
| 1134 | 2 long dynticks_nmi_nesting; |
| 1135 | 3 atomic_t dynticks; |
| 1136 | 4 bool rcu_need_heavy_qs; |
| 1137 | 5 bool rcu_urgent_qs; |
| 1138 | </pre> |
| 1139 | |
| 1140 | <p>These fields in the <tt>rcu_data</tt> structure maintain the per-CPU dyntick-idle |
| 1141 | state for the corresponding CPU. |
| 1142 | The fields may be accessed only from the corresponding CPU (and from tracing) |
| 1143 | unless otherwise stated. |
| 1144 | |
| 1145 | <p>The <tt>->dynticks_nesting</tt> field counts the |
| 1146 | nesting depth of process execution, so that in normal circumstances |
| 1147 | this counter has value zero or one. |
| 1148 | NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt> |
| 1149 | field. |
| 1150 | Because NMIs cannot be masked, changes to this variable have to be |
| 1151 | undertaken carefully using an algorithm provided by Andy Lutomirski. |
| 1152 | The initial transition from idle adds one, and nested transitions |
| 1153 | add two, so that a nesting level of five is represented by a |
| 1154 | <tt>->dynticks_nmi_nesting</tt> value of nine. |
| 1155 | This counter can therefore be thought of as counting the number |
| 1156 | of reasons why this CPU cannot be permitted to enter dyntick-idle |
| 1157 | mode, aside from process-level transitions. |
| 1158 | |
| 1159 | <p>However, it turns out that when running in non-idle kernel context, |
| 1160 | the Linux kernel is fully capable of entering interrupt handlers that |
| 1161 | never exit and perhaps also vice versa. |
| 1162 | Therefore, whenever the <tt>->dynticks_nesting</tt> field is |
| 1163 | incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field |
| 1164 | is set to a large positive number, and whenever the |
| 1165 | <tt>->dynticks_nesting</tt> field is decremented down to zero, |
| 1166 | the <tt>->dynticks_nmi_nesting</tt> field is set to zero. |
| 1167 | Assuming that the number of misnested interrupts is not sufficient |
| 1168 | to overflow the counter, this approach corrects the |
| 1169 | <tt>->dynticks_nmi_nesting</tt> field every time the corresponding |
| 1170 | CPU enters the idle loop from process context. |
| 1171 | |
| 1172 | </p><p>The <tt>->dynticks</tt> field counts the corresponding |
| 1173 | CPU's transitions to and from either dyntick-idle or user mode, so |
| 1174 | that this counter has an even value when the CPU is in dyntick-idle |
| 1175 | mode or user mode and an odd value otherwise. The transitions to/from |
| 1176 | user mode need to be counted for user mode adaptive-ticks support |
| 1177 | (see timers/NO_HZ.txt). |
| 1178 | |
| 1179 | </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used |
| 1180 | to record the fact that the RCU core code would really like to |
| 1181 | see a quiescent state from the corresponding CPU, so much so that |
| 1182 | it is willing to call for heavy-weight dyntick-counter operations. |
| 1183 | This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> |
| 1184 | code, which provide a momentary idle sojourn in response. |
| 1185 | |
| 1186 | </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record |
| 1187 | the fact that the RCU core code would really like to see a quiescent state from |
| 1188 | the corresponding CPU, with the various other fields indicating just how badly |
| 1189 | RCU wants this quiescent state. |
| 1190 | This flag is checked by RCU's context-switch path |
| 1191 | (<tt>rcu_note_context_switch</tt>) and the <tt>cond_resched</tt> code. |
| 1192 | |
<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Why not simply combine the <tt>->dynticks_nesting</tt>
	and <tt>->dynticks_nmi_nesting</tt> counters into a
	single counter that just counts the number of reasons that
	the corresponding CPU is non-idle?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Because this would fail in the presence of interrupts whose
	handlers never return and of handlers that manage to return
	from a made-up interrupt.
</font></td></tr>
<tr><td> </td></tr>
</table>

<p>Additional fields are present for some special-purpose
builds, and are discussed separately.

<h3><a name="The rcu_head Structure">
The <tt>rcu_head</tt> Structure</a></h3>

<p>Each <tt>rcu_head</tt> structure represents an RCU callback.
These structures are normally embedded within RCU-protected data
structures whose algorithms use asynchronous grace periods.
In contrast, when using algorithms that block waiting for RCU grace periods,
RCU users need not provide <tt>rcu_head</tt> structures.

</p><p>The <tt>rcu_head</tt> structure has fields as follows:

<pre>
  struct rcu_head *next;
  void (*func)(struct rcu_head *head);
</pre>

<p>The <tt>->next</tt> field is used
to link the <tt>rcu_head</tt> structures together in the
lists within the <tt>rcu_data</tt> structures.
The <tt>->func</tt> field is a pointer to the function
to be called when the callback is ready to be invoked, and
this function is passed a pointer to the <tt>rcu_head</tt>
structure.
However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
field to record the offset of the <tt>rcu_head</tt>
structure within the enclosing RCU-protected data structure.

</p><p>Both of these fields are used internally by RCU.
From the viewpoint of RCU users, this structure is an
opaque “cookie”.

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Given that the callback function <tt>->func</tt>
	is passed a pointer to the <tt>rcu_head</tt> structure,
	how is that function supposed to find the beginning of the
	enclosing RCU-protected data structure?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In actual practice, there is a separate callback function per
	type of RCU-protected data structure.
	The callback function can therefore use the <tt>container_of()</tt>
	macro in the Linux kernel (or other pointer-manipulation facilities
	in other software environments) to find the beginning of the
	enclosing structure.
</font></td></tr>
<tr><td> </td></tr>
</table>

<h3><a name="RCU-Specific Fields in the task_struct Structure">
RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>

<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
additional fields in the <tt>task_struct</tt> structure:

<pre>
#ifdef CONFIG_PREEMPT_RCU
  int rcu_read_lock_nesting;
  union rcu_special rcu_read_unlock_special;
  struct list_head rcu_node_entry;
  struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_TASKS_RCU
  unsigned long rcu_tasks_nvcsw;
  bool rcu_tasks_holdout;
  struct list_head rcu_tasks_holdout_list;
  int rcu_tasks_idle_cpu;
#endif /* #ifdef CONFIG_TASKS_RCU */
</pre>

<p>The <tt>->rcu_read_lock_nesting</tt> field records the
nesting level for RCU read-side critical sections, and
the <tt>->rcu_read_unlock_special</tt> field is a bitmask
that records special conditions that require <tt>rcu_read_unlock()</tt>
to do additional work.
The <tt>->rcu_node_entry</tt> field is used to form lists of
tasks that have blocked within preemptible-RCU read-side critical
sections, and the <tt>->rcu_blocked_node</tt> field references
the <tt>rcu_node</tt> structure whose list this task is a member of,
or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
read-side critical section.

<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
voluntary context switches that this task had undergone at the
beginning of the current tasks-RCU grace period,
<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
is a list element enqueuing this task on the holdout list,
and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
idle task is running on, but only if the task is currently running,
that is, if the CPU is currently idle.

<h3><a name="Accessor Functions">
Accessor Functions</a></h3>

<p>The following listing shows the <tt>rcu_get_root()</tt> function
and the <tt>rcu_for_each_node_breadth_first()</tt> and
<tt>rcu_for_each_leaf_node()</tt> macros:

<pre>
static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
{
	return &rsp->node[0];
}

#define rcu_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

#define rcu_for_each_leaf_node(rsp, rnp) \
	for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
	     (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
</pre>

<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
first element of the specified <tt>rcu_state</tt> structure's
<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
structure.

</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
macro takes advantage of the layout of the <tt>rcu_node</tt>
structures in the <tt>rcu_state</tt> structure's
<tt>->node[]</tt> array, performing a breadth-first traversal by
simply traversing the array in order.
Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
the last part of the array, thus traversing only the leaf
<tt>rcu_node</tt> structures.

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What does
	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
	contains only a single node?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In the single-node case,
	<tt>rcu_for_each_leaf_node()</tt> traverses the single node.
</font></td></tr>
<tr><td> </td></tr>
</table>

<h3><a name="Summary">
Summary</a></h3>

So the state of RCU is represented by an <tt>rcu_state</tt> structure,
which contains a combining tree of <tt>rcu_node</tt> and
<tt>rcu_data</tt> structures.
Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure.

If you made it this far, you are well prepared to read the code
walkthroughs in the other articles in this series.

<h3><a name="Acknowledgments">
Acknowledgments</a></h3>

I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
for helping me get this document into a more human-readable state.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>