Olivier Deprez | 157378f | 2022-04-04 15:47:50 +0200 | [diff] [blame^] | 1 | =================================================== |
| 2 | A Tour Through TREE_RCU's Data Structures [LWN.net] |
| 3 | =================================================== |
| 4 | |
| 5 | December 18, 2016 |
| 6 | |
| 7 | This article was contributed by Paul E. McKenney |
| 8 | |
| 9 | Introduction |
| 10 | ============ |
| 11 | |
| 12 | This document describes RCU's major data structures and their relationship |
| 13 | to each other. |
| 14 | |
| 15 | Data-Structure Relationships |
| 16 | ============================ |
| 17 | |
| 18 | RCU is for all intents and purposes a large state machine, and its |
| 19 | data structures maintain the state in such a way as to allow RCU readers |
| 20 | to execute extremely quickly, while also processing the RCU grace periods |
| 21 | requested by updaters in an efficient and extremely scalable fashion. |
| 22 | The efficiency and scalability of RCU updaters is provided primarily |
| 23 | by a combining tree, as shown below: |
| 24 | |
| 25 | .. kernel-figure:: BigTreeClassicRCU.svg |
| 26 | |
| 27 | This diagram shows an enclosing ``rcu_state`` structure containing a tree |
| 28 | of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up |
| 29 | to 16 ``rcu_data`` structures associated with it, so that there are |
| 30 | ``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. |
| 31 | This structure is adjusted at boot time, if needed, to handle the common |
| 32 | case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. |
| 33 | For example, a number of Linux distributions set ``NR_CPUs=4096``, |
| 34 | which results in a three-level ``rcu_node`` tree. |
| 35 | If the actual hardware has only 16 CPUs, RCU will adjust itself |
| 36 | at boot time, resulting in an ``rcu_node`` tree with only a single node. |
| 37 | |
| 38 | The purpose of this combining tree is to allow per-CPU events |
| 39 | such as quiescent states, dyntick-idle transitions, |
| 40 | and CPU hotplug operations to be processed efficiently |
| 41 | and scalably. |
| 42 | Quiescent states are recorded by the per-CPU ``rcu_data`` structures, |
| 43 | and other events are recorded by the leaf-level ``rcu_node`` |
| 44 | structures. |
| 45 | All of these events are combined at each level of the tree until finally |
| 46 | grace periods are completed at the tree's root ``rcu_node`` |
| 47 | structure. |
| 48 | A grace period can be completed at the root once every CPU |
| 49 | (or, in the case of ``CONFIG_PREEMPT_RCU``, task) |
| 50 | has passed through a quiescent state. |
| 51 | Once a grace period has completed, record of that fact is propagated |
| 52 | back down the tree. |
| 53 | |
| 54 | As can be seen from the diagram, on a 64-bit system |
| 55 | a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout |
| 56 | of 64 at the root and a fanout of 16 at the leaves. |
| 57 | |
| 58 | +-----------------------------------------------------------------------+ |
| 59 | | **Quick Quiz**: | |
| 60 | +-----------------------------------------------------------------------+ |
| 61 | | Why isn't the fanout at the leaves also 64? | |
| 62 | +-----------------------------------------------------------------------+ |
| 63 | | **Answer**: | |
| 64 | +-----------------------------------------------------------------------+ |
| 65 | | Because there are more types of events that affect the leaf-level | |
| 66 | | ``rcu_node`` structures than further up the tree. Therefore, if the | |
| 67 | | leaf ``rcu_node`` structures have fanout of 64, the contention on | |
| 68 | | these structures' ``->structures`` becomes excessive. Experimentation | |
| 69 | | on a wide variety of systems has shown that a fanout of 16 works well | |
| 70 | | for the leaves of the ``rcu_node`` tree. | |
| 71 | | | |
| 72 | | Of course, further experience with systems having hundreds or | |
| 73 | | thousands of CPUs may demonstrate that the fanout for the non-leaf | |
| 74 | | ``rcu_node`` structures must also be reduced. Such reduction can be | |
| 75 | | easily carried out when and if it proves necessary. In the meantime, | |
| 76 | | if you are using such a system and running into contention problems | |
| 77 | | on the non-leaf ``rcu_node`` structures, you may use the | |
| 78 | | ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | |
| 79 | | non-leaf fanout as needed. | |
| 80 | | | |
| 81 | | Kernels built for systems with strong NUMA characteristics might | |
| 82 | | also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | |
| 83 | | the ``rcu_node`` structures align with hardware boundaries. | |
| 84 | | However, there has thus far been no need for this. | |
| 85 | +-----------------------------------------------------------------------+ |
| 86 | |
| 87 | If your system has more than 1,024 CPUs (or more than 512 CPUs on a |
| 88 | 32-bit system), then RCU will automatically add more levels to the tree. |
| 89 | For example, if you are crazy enough to build a 64-bit system with |
| 90 | 65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: |
| 91 | |
| 92 | .. kernel-figure:: HugeTreeClassicRCU.svg |
| 93 | |
| 94 | RCU currently permits up to a four-level tree, which on a 64-bit system |
| 95 | accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for |
| 96 | 32-bit systems. On the other hand, you can set both |
| 97 | ``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as |
| 98 | 2, which would result in a 16-CPU test using a 4-level tree. This can be |
| 99 | useful for testing large-system capabilities on small test machines. |
| 100 | |
| 101 | This multi-level combining tree allows us to get most of the performance |
| 102 | and scalability benefits of partitioning, even though RCU grace-period |
| 103 | detection is inherently a global operation. The trick here is that only |
| 104 | the last CPU to report a quiescent state into a given ``rcu_node`` |
| 105 | structure need advance to the ``rcu_node`` structure at the next level |
| 106 | up the tree. This means that at the leaf-level ``rcu_node`` structure, |
| 107 | only one access out of sixteen will progress up the tree. For the |
| 108 | internal ``rcu_node`` structures, the situation is even more extreme: |
| 109 | Only one access out of sixty-four will progress up the tree. Because the |
| 110 | vast majority of the CPUs do not progress up the tree, the lock |
| 111 | contention remains roughly constant up the tree. No matter how many CPUs |
| 112 | there are in the system, at most 64 quiescent-state reports per grace |
| 113 | period will progress all the way to the root ``rcu_node`` structure, |
| 114 | thus ensuring that the lock contention on that root ``rcu_node`` |
| 115 | structure remains acceptably low. |
| 116 | |
| 117 | In effect, the combining tree acts like a big shock absorber, keeping |
| 118 | lock contention under control at all tree levels regardless of the level |
| 119 | of loading on the system. |
| 120 | |
| 121 | RCU updaters wait for normal grace periods by registering RCU callbacks, |
| 122 | either directly via ``call_rcu()`` or indirectly via |
| 123 | ``synchronize_rcu()`` and friends. RCU callbacks are represented by |
| 124 | ``rcu_head`` structures, which are queued on ``rcu_data`` structures |
| 125 | while they are waiting for a grace period to elapse, as shown in the |
| 126 | following figure: |
| 127 | |
| 128 | .. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg |
| 129 | |
| 130 | This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data |
| 131 | structures are related. Lesser data structures will be introduced with |
| 132 | the algorithms that make use of them. |
| 133 | |
| 134 | Note that each of the data structures in the above figure has its own |
| 135 | synchronization: |
| 136 | |
| 137 | #. Each ``rcu_state`` structures has a lock and a mutex, and some fields |
| 138 | are protected by the corresponding root ``rcu_node`` structure's lock. |
| 139 | #. Each ``rcu_node`` structure has a spinlock. |
| 140 | #. The fields in ``rcu_data`` are private to the corresponding CPU, |
| 141 | although a few can be read and written by other CPUs. |
| 142 | |
| 143 | It is important to note that different data structures can have very |
| 144 | different ideas about the state of RCU at any given time. For but one |
| 145 | example, awareness of the start or end of a given RCU grace period |
| 146 | propagates slowly through the data structures. This slow propagation is |
| 147 | absolutely necessary for RCU to have good read-side performance. If this |
| 148 | balkanized implementation seems foreign to you, one useful trick is to |
| 149 | consider each instance of these data structures to be a different |
| 150 | person, each having the usual slightly different view of reality. |
| 151 | |
| 152 | The general role of each of these data structures is as follows: |
| 153 | |
| 154 | #. ``rcu_state``: This structure forms the interconnection between the |
| 155 | ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, |
| 156 | serves as short-term repository for callbacks orphaned by CPU-hotplug |
| 157 | events, maintains ``rcu_barrier()`` state, tracks expedited |
| 158 | grace-period state, and maintains state used to force quiescent |
| 159 | states when grace periods extend too long, |
| 160 | #. ``rcu_node``: This structure forms the combining tree that propagates |
| 161 | quiescent-state information from the leaves to the root, and also |
| 162 | propagates grace-period information from the root to the leaves. It |
| 163 | provides local copies of the grace-period state in order to allow |
| 164 | this information to be accessed in a synchronized manner without |
| 165 | suffering the scalability limitations that would otherwise be imposed |
| 166 | by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the |
| 167 | lists of tasks that have blocked while in their current RCU read-side |
| 168 | critical section. In ``CONFIG_PREEMPT_RCU`` with |
| 169 | ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` |
| 170 | priority-boosting kernel threads (kthreads) and state. Finally, it |
| 171 | records CPU-hotplug state in order to determine which CPUs should be |
| 172 | ignored during a given grace period. |
| 173 | #. ``rcu_data``: This per-CPU structure is the focus of quiescent-state |
| 174 | detection and RCU callback queuing. It also tracks its relationship |
| 175 | to the corresponding leaf ``rcu_node`` structure to allow |
| 176 | more-efficient propagation of quiescent states up the ``rcu_node`` |
| 177 | combining tree. Like the ``rcu_node`` structure, it provides a local |
| 178 | copy of the grace-period information to allow for-free synchronized |
| 179 | access to this information from the corresponding CPU. Finally, this |
| 180 | structure records past dyntick-idle state for the corresponding CPU |
| 181 | and also tracks statistics. |
| 182 | #. ``rcu_head``: This structure represents RCU callbacks, and is the |
| 183 | only structure allocated and managed by RCU users. The ``rcu_head`` |
| 184 | structure is normally embedded within the RCU-protected data |
| 185 | structure. |
| 186 | |
| 187 | If all you wanted from this article was a general notion of how RCU's |
| 188 | data structures are related, you are done. Otherwise, each of the |
| 189 | following sections give more details on the ``rcu_state``, ``rcu_node`` |
| 190 | and ``rcu_data`` data structures. |
| 191 | |
| 192 | The ``rcu_state`` Structure |
| 193 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 194 | |
| 195 | The ``rcu_state`` structure is the base structure that represents the |
| 196 | state of RCU in the system. This structure forms the interconnection |
| 197 | between the ``rcu_node`` and ``rcu_data`` structures, tracks grace |
| 198 | periods, contains the lock used to synchronize with CPU-hotplug events, |
| 199 | and maintains state used to force quiescent states when grace periods |
| 200 | extend too long, |
| 201 | |
| 202 | A few of the ``rcu_state`` structure's fields are discussed, singly and |
| 203 | in groups, in the following sections. The more specialized fields are |
| 204 | covered in the discussion of their use. |
| 205 | |
| 206 | Relationship to rcu_node and rcu_data Structures |
| 207 | '''''''''''''''''''''''''''''''''''''''''''''''' |
| 208 | |
| 209 | This portion of the ``rcu_state`` structure is declared as follows: |
| 210 | |
| 211 | :: |
| 212 | |
| 213 | 1 struct rcu_node node[NUM_RCU_NODES]; |
| 214 | 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; |
| 215 | 3 struct rcu_data __percpu *rda; |
| 216 | |
| 217 | +-----------------------------------------------------------------------+ |
| 218 | | **Quick Quiz**: | |
| 219 | +-----------------------------------------------------------------------+ |
| 220 | | Wait a minute! You said that the ``rcu_node`` structures formed a | |
| 221 | | tree, but they are declared as a flat array! What gives? | |
| 222 | +-----------------------------------------------------------------------+ |
| 223 | | **Answer**: | |
| 224 | +-----------------------------------------------------------------------+ |
| 225 | | The tree is laid out in the array. The first node In the array is the | |
| 226 | | head, the next set of nodes in the array are children of the head | |
| 227 | | node, and so on until the last set of nodes in the array are the | |
| 228 | | leaves. | |
| 229 | | See the following diagrams to see how this works. | |
| 230 | +-----------------------------------------------------------------------+ |
| 231 | |
| 232 | The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown |
| 233 | in the following figure: |
| 234 | |
| 235 | .. kernel-figure:: TreeMapping.svg |
| 236 | |
| 237 | One interesting consequence of this mapping is that a breadth-first |
| 238 | traversal of the tree is implemented as a simple linear scan of the |
| 239 | array, which is in fact what the ``rcu_for_each_node_breadth_first()`` |
| 240 | macro does. This macro is used at the beginning and ends of grace |
| 241 | periods. |
| 242 | |
| 243 | Each entry of the ``->level`` array references the first ``rcu_node`` |
| 244 | structure on the corresponding level of the tree, for example, as shown |
| 245 | below: |
| 246 | |
| 247 | .. kernel-figure:: TreeMappingLevel.svg |
| 248 | |
| 249 | The zero\ :sup:`th` element of the array references the root |
| 250 | ``rcu_node`` structure, the first element references the first child of |
| 251 | the root ``rcu_node``, and finally the second element references the |
| 252 | first leaf ``rcu_node`` structure. |
| 253 | |
| 254 | For whatever it is worth, if you draw the tree to be tree-shaped rather |
| 255 | than array-shaped, it is easy to draw a planar representation: |
| 256 | |
| 257 | .. kernel-figure:: TreeLevel.svg |
| 258 | |
| 259 | Finally, the ``->rda`` field references a per-CPU pointer to the |
| 260 | corresponding CPU's ``rcu_data`` structure. |
| 261 | |
| 262 | All of these fields are constant once initialization is complete, and |
| 263 | therefore need no protection. |
| 264 | |
| 265 | Grace-Period Tracking |
| 266 | ''''''''''''''''''''' |
| 267 | |
| 268 | This portion of the ``rcu_state`` structure is declared as follows: |
| 269 | |
| 270 | :: |
| 271 | |
| 272 | 1 unsigned long gp_seq; |
| 273 | |
| 274 | RCU grace periods are numbered, and the ``->gp_seq`` field contains the |
| 275 | current grace-period sequence number. The bottom two bits are the state |
| 276 | of the current grace period, which can be zero for not yet started or |
| 277 | one for in progress. In other words, if the bottom two bits of |
| 278 | ``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom |
| 279 | two bits indicates that something is broken. This field is protected by |
| 280 | the root ``rcu_node`` structure's ``->lock`` field. |
| 281 | |
| 282 | There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` |
| 283 | structures as well. The fields in the ``rcu_state`` structure represent |
| 284 | the most current value, and those of the other structures are compared |
| 285 | in order to detect the beginnings and ends of grace periods in a |
| 286 | distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` |
| 287 | (down the tree from the root to the leaves) to ``rcu_data``. |
| 288 | |
| 289 | Miscellaneous |
| 290 | ''''''''''''' |
| 291 | |
| 292 | This portion of the ``rcu_state`` structure is declared as follows: |
| 293 | |
| 294 | :: |
| 295 | |
| 296 | 1 unsigned long gp_max; |
| 297 | 2 char abbr; |
| 298 | 3 char *name; |
| 299 | |
| 300 | The ``->gp_max`` field tracks the duration of the longest grace period |
| 301 | in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. |
| 302 | |
| 303 | The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU |
| 304 | (“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). |
| 305 | These fields are used for diagnostic and tracing purposes. |
| 306 | |
| 307 | The ``rcu_node`` Structure |
| 308 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 309 | |
| 310 | The ``rcu_node`` structures form the combining tree that propagates |
| 311 | quiescent-state information from the leaves to the root and also that |
| 312 | propagates grace-period information from the root down to the leaves. |
| 313 | They provides local copies of the grace-period state in order to allow |
| 314 | this information to be accessed in a synchronized manner without |
| 315 | suffering the scalability limitations that would otherwise be imposed by |
| 316 | global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists |
| 317 | of tasks that have blocked while in their current RCU read-side critical |
| 318 | section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they |
| 319 | manage the per-\ ``rcu_node`` priority-boosting kernel threads |
| 320 | (kthreads) and state. Finally, they record CPU-hotplug state in order to |
| 321 | determine which CPUs should be ignored during a given grace period. |
| 322 | |
| 323 | The ``rcu_node`` structure's fields are discussed, singly and in groups, |
| 324 | in the following sections. |
| 325 | |
| 326 | Connection to Combining Tree |
| 327 | '''''''''''''''''''''''''''' |
| 328 | |
| 329 | This portion of the ``rcu_node`` structure is declared as follows: |
| 330 | |
| 331 | :: |
| 332 | |
| 333 | 1 struct rcu_node *parent; |
| 334 | 2 u8 level; |
| 335 | 3 u8 grpnum; |
| 336 | 4 unsigned long grpmask; |
| 337 | 5 int grplo; |
| 338 | 6 int grphi; |
| 339 | |
| 340 | The ``->parent`` pointer references the ``rcu_node`` one level up in the |
| 341 | tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation |
| 342 | makes heavy use of this field to push quiescent states up the tree. The |
| 343 | ``->level`` field gives the level in the tree, with the root being at |
| 344 | level zero, its children at level one, and so on. The ``->grpnum`` field |
| 345 | gives this node's position within the children of its parent, so this |
| 346 | number can range between 0 and 31 on 32-bit systems and between 0 and 63 |
| 347 | on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only |
| 348 | during initialization and for tracing. The ``->grpmask`` field is the |
| 349 | bitmask counterpart of ``->grpnum``, and therefore always has exactly |
| 350 | one bit set. This mask is used to clear the bit corresponding to this |
| 351 | ``rcu_node`` structure in its parent's bitmasks, which are described |
| 352 | later. Finally, the ``->grplo`` and ``->grphi`` fields contain the |
| 353 | lowest and highest numbered CPU served by this ``rcu_node`` structure, |
| 354 | respectively. |
| 355 | |
| 356 | All of these fields are constant, and thus do not require any |
| 357 | synchronization. |
| 358 | |
| 359 | Synchronization |
| 360 | ''''''''''''''' |
| 361 | |
| 362 | This field of the ``rcu_node`` structure is declared as follows: |
| 363 | |
| 364 | :: |
| 365 | |
| 366 | 1 raw_spinlock_t lock; |
| 367 | |
| 368 | This field is used to protect the remaining fields in this structure, |
| 369 | unless otherwise stated. That said, all of the fields in this structure |
| 370 | can be accessed without locking for tracing purposes. Yes, this can |
| 371 | result in confusing traces, but better some tracing confusion than to be |
| 372 | heisenbugged out of existence. |
| 373 | |
| 374 | .. _grace-period-tracking-1: |
| 375 | |
| 376 | Grace-Period Tracking |
| 377 | ''''''''''''''''''''' |
| 378 | |
| 379 | This portion of the ``rcu_node`` structure is declared as follows: |
| 380 | |
| 381 | :: |
| 382 | |
| 383 | 1 unsigned long gp_seq; |
| 384 | 2 unsigned long gp_seq_needed; |
| 385 | |
| 386 | The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of |
| 387 | the field of the same name in the ``rcu_state`` structure. They each may |
| 388 | lag up to one step behind their ``rcu_state`` counterpart. If the bottom |
| 389 | two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero, |
| 390 | then this ``rcu_node`` structure believes that RCU is idle. |
| 391 | |
| 392 | The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the |
| 393 | beginning and the end of each grace period. |
| 394 | |
| 395 | The ``->gp_seq_needed`` fields record the furthest-in-the-future grace |
| 396 | period request seen by the corresponding ``rcu_node`` structure. The |
| 397 | request is considered fulfilled when the value of the ``->gp_seq`` field |
| 398 | equals or exceeds that of the ``->gp_seq_needed`` field. |
| 399 | |
| 400 | +-----------------------------------------------------------------------+ |
| 401 | | **Quick Quiz**: | |
| 402 | +-----------------------------------------------------------------------+ |
| 403 | | Suppose that this ``rcu_node`` structure doesn't see a request for a | |
| 404 | | very long time. Won't wrapping of the ``->gp_seq`` field cause | |
| 405 | | problems? | |
| 406 | +-----------------------------------------------------------------------+ |
| 407 | | **Answer**: | |
| 408 | +-----------------------------------------------------------------------+ |
| 409 | | No, because if the ``->gp_seq_needed`` field lags behind the | |
| 410 | | ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | |
| 411 | | the end of the grace period. Modulo-arithmetic comparisons therefore | |
| 412 | | will always get the correct answer, even with wrapping. | |
| 413 | +-----------------------------------------------------------------------+ |
| 414 | |
| 415 | Quiescent-State Tracking |
| 416 | '''''''''''''''''''''''' |
| 417 | |
| 418 | These fields manage the propagation of quiescent states up the combining |
| 419 | tree. |
| 420 | |
| 421 | This portion of the ``rcu_node`` structure has fields as follows: |
| 422 | |
| 423 | :: |
| 424 | |
| 425 | 1 unsigned long qsmask; |
| 426 | 2 unsigned long expmask; |
| 427 | 3 unsigned long qsmaskinit; |
| 428 | 4 unsigned long expmaskinit; |
| 429 | |
| 430 | The ``->qsmask`` field tracks which of this ``rcu_node`` structure's |
| 431 | children still need to report quiescent states for the current normal |
| 432 | grace period. Such children will have a value of 1 in their |
| 433 | corresponding bit. Note that the leaf ``rcu_node`` structures should be |
| 434 | thought of as having ``rcu_data`` structures as their children. |
| 435 | Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` |
| 436 | structure's children still need to report quiescent states for the |
| 437 | current expedited grace period. An expedited grace period has the same |
| 438 | conceptual properties as a normal grace period, but the expedited |
| 439 | implementation accepts extreme CPU overhead to obtain much lower |
| 440 | grace-period latency, for example, consuming a few tens of microseconds |
| 441 | worth of CPU time to reduce grace-period duration from milliseconds to |
| 442 | tens of microseconds. The ``->qsmaskinit`` field tracks which of this |
| 443 | ``rcu_node`` structure's children cover for at least one online CPU. |
| 444 | This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is |
| 445 | used to initialize ``->expmask`` and the beginning of the normal and |
| 446 | expedited grace periods, respectively. |
| 447 | |
| 448 | +-----------------------------------------------------------------------+ |
| 449 | | **Quick Quiz**: | |
| 450 | +-----------------------------------------------------------------------+ |
| 451 | | Why are these bitmasks protected by locking? Come on, haven't you | |
| 452 | | heard of atomic instructions??? | |
| 453 | +-----------------------------------------------------------------------+ |
| 454 | | **Answer**: | |
| 455 | +-----------------------------------------------------------------------+ |
| 456 | | Lockless grace-period computation! Such a tantalizing possibility! | |
| 457 | | But consider the following sequence of events: | |
| 458 | | | |
| 459 | | #. CPU 0 has been in dyntick-idle mode for quite some time. When it | |
| 460 | | wakes up, it notices that the current RCU grace period needs it to | |
| 461 | | report in, so it sets a flag where the scheduling clock interrupt | |
| 462 | | will find it. | |
| 463 | | #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | |
| 464 | | notices that CPU 0 has been in dyntick idle mode, which qualifies | |
| 465 | | as an extended quiescent state. | |
| 466 | | #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | |
| 467 | | read-side critical section, and notices that the RCU core needs | |
| 468 | | something, so commences RCU softirq processing. | |
| 469 | | #. CPU 0's softirq handler executes and is just about ready to report | |
| 470 | | its quiescent state up the ``rcu_node`` tree. | |
| 471 | | #. But CPU 1 beats it to the punch, completing the current grace | |
| 472 | | period and starting a new one. | |
| 473 | | #. CPU 0 now reports its quiescent state for the wrong grace period. | |
| 474 | | That grace period might now end before the RCU read-side critical | |
| 475 | | section. If that happens, disaster will ensue. | |
| 476 | | | |
| 477 | | So the locking is absolutely required in order to coordinate clearing | |
| 478 | | of the bits with updating of the grace-period sequence number in | |
| 479 | | ``->gp_seq``. | |
| 480 | +-----------------------------------------------------------------------+ |
| 481 | |
| 482 | Blocked-Task Management |
| 483 | ''''''''''''''''''''''' |
| 484 | |
| 485 | ``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU |
| 486 | read-side critical sections, and these tasks must be tracked explicitly. |
| 487 | The details of exactly why and how they are tracked will be covered in a |
| 488 | separate article on RCU read-side processing. For now, it is enough to |
| 489 | know that the ``rcu_node`` structure tracks them. |
| 490 | |
| 491 | :: |
| 492 | |
| 493 | 1 struct list_head blkd_tasks; |
| 494 | 2 struct list_head *gp_tasks; |
| 495 | 3 struct list_head *exp_tasks; |
| 496 | 4 bool wait_blkd_tasks; |
| 497 | |
| 498 | The ``->blkd_tasks`` field is a list header for the list of blocked and |
| 499 | preempted tasks. As tasks undergo context switches within RCU read-side |
| 500 | critical sections, their ``task_struct`` structures are enqueued (via |
| 501 | the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the |
| 502 | ``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding |
| 503 | to the CPU on which the outgoing context switch executed. As these tasks |
| 504 | later exit their RCU read-side critical sections, they remove themselves |
| 505 | from the list. This list is therefore in reverse time order, so that if |
| 506 | one of the tasks is blocking the current grace period, all subsequent |
| 507 | tasks must also be blocking that same grace period. Therefore, a single |
| 508 | pointer into this list suffices to track all tasks blocking a given |
| 509 | grace period. That pointer is stored in ``->gp_tasks`` for normal grace |
| 510 | periods and in ``->exp_tasks`` for expedited grace periods. These last |
| 511 | two fields are ``NULL`` if either there is no grace period in flight or |
| 512 | if there are no blocked tasks preventing that grace period from |
| 513 | completing. If either of these two pointers is referencing a task that |
| 514 | removes itself from the ``->blkd_tasks`` list, then that task must |
| 515 | advance the pointer to the next task on the list, or set the pointer to |
| 516 | ``NULL`` if there are no subsequent tasks on the list. |
| 517 | |
| 518 | For example, suppose that tasks T1, T2, and T3 are all hard-affinitied |
| 519 | to the largest-numbered CPU in the system. Then if task T1 blocked in an |
| 520 | RCU read-side critical section, then an expedited grace period started, |
| 521 | then task T2 blocked in an RCU read-side critical section, then a normal |
| 522 | grace period started, and finally task 3 blocked in an RCU read-side |
| 523 | critical section, then the state of the last leaf ``rcu_node`` |
| 524 | structure's blocked-task list would be as shown below: |
| 525 | |
| 526 | .. kernel-figure:: blkd_task.svg |
| 527 | |
| 528 | Task T1 is blocking both grace periods, task T2 is blocking only the |
| 529 | normal grace period, and task T3 is blocking neither grace period. Note |
| 530 | that these tasks will not remove themselves from this list immediately |
| 531 | upon resuming execution. They will instead remain on the list until they |
| 532 | execute the outermost ``rcu_read_unlock()`` that ends their RCU |
| 533 | read-side critical section. |
| 534 | |
| 535 | The ``->wait_blkd_tasks`` field indicates whether or not the current |
| 536 | grace period is waiting on a blocked task. |
| 537 | |
| 538 | Sizing the ``rcu_node`` Array |
| 539 | ''''''''''''''''''''''''''''' |
| 540 | |
| 541 | The ``rcu_node`` array is sized via a series of C-preprocessor |
| 542 | expressions as follows: |
| 543 | |
| 544 | :: |
| 545 | |
| 546 | 1 #ifdef CONFIG_RCU_FANOUT |
| 547 | 2 #define RCU_FANOUT CONFIG_RCU_FANOUT |
| 548 | 3 #else |
| 549 | 4 # ifdef CONFIG_64BIT |
| 550 | 5 # define RCU_FANOUT 64 |
| 551 | 6 # else |
| 552 | 7 # define RCU_FANOUT 32 |
| 553 | 8 # endif |
| 554 | 9 #endif |
| 555 | 10 |
| 556 | 11 #ifdef CONFIG_RCU_FANOUT_LEAF |
| 557 | 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF |
| 558 | 13 #else |
| 559 | 14 # ifdef CONFIG_64BIT |
| 560 | 15 # define RCU_FANOUT_LEAF 64 |
| 561 | 16 # else |
| 562 | 17 # define RCU_FANOUT_LEAF 32 |
| 563 | 18 # endif |
| 564 | 19 #endif |
| 565 | 20 |
| 566 | 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) |
| 567 | 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) |
| 568 | 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) |
| 569 | 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) |
| 570 | 25 |
| 571 | 26 #if NR_CPUS <= RCU_FANOUT_1 |
| 572 | 27 # define RCU_NUM_LVLS 1 |
| 573 | 28 # define NUM_RCU_LVL_0 1 |
| 574 | 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 |
| 575 | 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } |
| 576 | 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } |
| 577 | 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } |
| 578 | 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } |
| 579 | 34 #elif NR_CPUS <= RCU_FANOUT_2 |
| 580 | 35 # define RCU_NUM_LVLS 2 |
| 581 | 36 # define NUM_RCU_LVL_0 1 |
| 582 | 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 583 | 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) |
| 584 | 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } |
| 585 | 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } |
| 586 | 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } |
| 587 | 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } |
| 588 | 43 #elif NR_CPUS <= RCU_FANOUT_3 |
| 589 | 44 # define RCU_NUM_LVLS 3 |
| 590 | 45 # define NUM_RCU_LVL_0 1 |
| 591 | 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) |
| 592 | 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 593 | 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) |
| 594 | 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } |
| 595 | 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } |
| 596 | 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } |
| 597 | 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } |
| 598 | 53 #elif NR_CPUS <= RCU_FANOUT_4 |
| 599 | 54 # define RCU_NUM_LVLS 4 |
| 600 | 55 # define NUM_RCU_LVL_0 1 |
| 601 | 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) |
| 602 | 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) |
| 603 | 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) |
| 604 | 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) |
| 605 | 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } |
| 606 | 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } |
| 607 | 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } |
| 608 | 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } |
| 609 | 64 #else |
| 610 | 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" |
| 611 | 66 #endif |
| 612 | |
| 613 | The maximum number of levels in the ``rcu_node`` structure is currently |
| 614 | limited to four, as specified by lines 21-24 and the structure of the |
| 615 | subsequent “if” statement. For 32-bit systems, this allows |
| 616 | 16*32*32*32=524,288 CPUs, which should be sufficient for the next few |
| 617 | years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is |
| 618 | allowed, which should see us through the next decade or so. This |
| 619 | four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` |
| 620 | to support up to 4096 CPUs, which might be useful in very large systems |
| 621 | having eight CPUs per socket (but please note that no one has yet shown |
| 622 | any measurable performance degradation due to misaligned socket and |
| 623 | ``rcu_node`` boundaries). In addition, building kernels with a full four |
| 624 | levels of ``rcu_node`` tree permits better testing of RCU's |
| 625 | combining-tree code. |
| 626 | |
| 627 | The ``RCU_FANOUT`` symbol controls how many children are permitted at |
| 628 | each non-leaf level of the ``rcu_node`` tree. If the |
| 629 | ``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based |
| 630 | on the word size of the system, which is also the Kconfig default. |
| 631 | |
| 632 | The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by |
| 633 | each leaf ``rcu_node`` structure. Experience has shown that allowing a |
| 634 | given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the |
| 635 | number of bits in the ``->qsmask`` field on a 64-bit system, results in |
| 636 | excessive contention for the leaf ``rcu_node`` structures' ``->lock`` |
| 637 | fields. The number of CPUs per leaf ``rcu_node`` structure is therefore |
| 638 | limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If |
| 639 | ``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based |
| 640 | on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. |
| 641 | Lines 11-19 perform this computation. |
| 642 | |
| 643 | Lines 21-24 compute the maximum number of CPUs supported by a |
| 644 | single-level (which contains a single ``rcu_node`` structure), |
| 645 | two-level, three-level, and four-level ``rcu_node`` tree, respectively, |
| 646 | given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. |
| 647 | These numbers of CPUs are retained in the ``RCU_FANOUT_1``, |
| 648 | ``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor |
| 649 | variables, respectively. |
| 650 | |
| 651 | These variables are used to control the C-preprocessor ``#if`` statement |
| 652 | spanning lines 26-66 that computes the number of ``rcu_node`` structures |
| 653 | required for each level of the tree, as well as the number of levels |
| 654 | required. The number of levels is placed in the ``NUM_RCU_LVLS`` |
| 655 | C-preprocessor variable by lines 27, 35, 44, and 54. The number of |
| 656 | ``rcu_node`` structures for the topmost level of the tree is always |
| 657 | exactly one, and this value is unconditionally placed into |
| 658 | ``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels |
| 659 | (if any) of the ``rcu_node`` tree are computed by dividing the maximum |
| 660 | number of CPUs by the fanout supported by the number of levels from the |
| 661 | current level down, rounding up. This computation is performed by |
| 662 | lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create |
| 663 | initializers for lockdep lock-class names. Finally, lines 64-66 produce |
| 664 | an error if the maximum number of CPUs is too large for the specified |
| 665 | fanout. |
| 666 | |
| 667 | The ``rcu_segcblist`` Structure |
| 668 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 669 | |
| 670 | The ``rcu_segcblist`` structure maintains a segmented list of callbacks |
| 671 | as follows: |
| 672 | |
| 673 | :: |
| 674 | |
| 675 | 1 #define RCU_DONE_TAIL 0 |
| 676 | 2 #define RCU_WAIT_TAIL 1 |
| 677 | 3 #define RCU_NEXT_READY_TAIL 2 |
| 678 | 4 #define RCU_NEXT_TAIL 3 |
| 679 | 5 #define RCU_CBLIST_NSEGS 4 |
| 680 | 6 |
| 681 | 7 struct rcu_segcblist { |
| 682 | 8 struct rcu_head *head; |
| 683 | 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; |
| 684 | 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; |
| 685 | 11 long len; |
| 686 | 12 long len_lazy; |
| 687 | 13 }; |
| 688 | |
| 689 | The segments are as follows: |
| 690 | |
| 691 | #. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These |
| 692 | callbacks are ready to be invoked. |
| 693 | #. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace |
| 694 | period. Note that different CPUs can have different ideas about which |
| 695 | grace period is current, hence the ``->gp_seq`` field. |
| 696 | #. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period |
| 697 | to start. |
| 698 | #. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a |
| 699 | grace period. |
| 700 | |
| 701 | The ``->head`` pointer references the first callback or is ``NULL`` if |
| 702 | the list contains no callbacks (which is *not* the same as being empty). |
| 703 | Each element of the ``->tails[]`` array references the ``->next`` |
| 704 | pointer of the last callback in the corresponding segment of the list, |
| 705 | or the list's ``->head`` pointer if that segment and all previous |
| 706 | segments are empty. If the corresponding segment is empty but some |
| 707 | previous segment is not empty, then the array element is identical to |
| 708 | its predecessor. Older callbacks are closer to the head of the list, and |
| 709 | new callbacks are added at the tail. This relationship between the |
| 710 | ``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown |
| 711 | in this diagram: |
| 712 | |
| 713 | .. kernel-figure:: nxtlist.svg |
| 714 | |
| 715 | In this figure, the ``->head`` pointer references the first RCU callback |
| 716 | in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the |
| 717 | ``->head`` pointer itself, indicating that none of the callbacks is |
| 718 | ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references |
| 719 | callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 |
| 720 | are both waiting on the current grace period, give or take possible |
| 721 | disagreements about exactly which grace period is the current one. The |
| 722 | ``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU |
| 723 | callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that |
| 724 | there are no callbacks waiting on the next RCU grace period. The |
| 725 | ``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` |
| 726 | pointer, indicating that all the remaining RCU callbacks have not yet |
| 727 | been assigned to an RCU grace period. Note that the |
| 728 | ``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU |
| 729 | callback's ``->next`` pointer unless the callback list is empty, in |
| 730 | which case it references the ``->head`` pointer. |
| 731 | |
| 732 | There is one additional important special case for the |
| 733 | ``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this |
| 734 | list is *disabled*. Lists are disabled when the corresponding CPU is |
| 735 | offline or when the corresponding CPU's callbacks are offloaded to a |
| 736 | kthread, both of which are described elsewhere. |
| 737 | |
| 738 | CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the |
| 739 | ``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the |
| 740 | ``RCU_DONE_TAIL`` list segments as grace periods advance. |
| 741 | |
| 742 | The ``->gp_seq[]`` array records grace-period numbers corresponding to |
| 743 | the list segments. This is what allows different CPUs to have different |
| 744 | ideas as to which is the current grace period while still avoiding |
| 745 | premature invocation of their callbacks. In particular, this allows CPUs |
| 746 | that go idle for extended periods to determine which of their callbacks |
| 747 | are ready to be invoked after reawakening. |
| 748 | |
| 749 | The ``->len`` counter contains the number of callbacks in ``->head``, |
| 750 | and the ``->len_lazy`` contains the number of those callbacks that are |
| 751 | known to only free memory, and whose invocation can therefore be safely |
| 752 | deferred. |
| 753 | |
| 754 | .. important:: |
| 755 | |
| 756 | It is the ``->len`` field that determines whether or |
| 757 | not there are callbacks associated with this ``rcu_segcblist`` |
| 758 | structure, *not* the ``->head`` pointer. The reason for this is that all |
| 759 | the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` |
| 760 | segment) are extracted all at once at callback-invocation time |
| 761 | (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there |
| 762 | are no not-done callbacks remaining in the ``rcu_segcblist``. If |
| 763 | callback invocation must be postponed, for example, because a |
| 764 | high-priority process just woke up on this CPU, then the remaining |
| 765 | callbacks are placed back on the ``RCU_DONE_TAIL`` segment and |
| 766 | ``->head`` once again points to the start of the segment. In short, the |
| 767 | head field can briefly be ``NULL`` even though the CPU has callbacks |
| 768 | present the entire time. Therefore, it is not appropriate to test the |
| 769 | ``->head`` pointer for ``NULL``. |
| 770 | |
| 771 | In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only |
| 772 | after the corresponding callbacks have been invoked. This means that the |
| 773 | ``->len`` count is zero only if the ``rcu_segcblist`` structure really |
| 774 | is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` |
| 775 | count requires careful use of appropriate synchronization, for example, |
| 776 | memory barriers. This synchronization can be a bit subtle, particularly |
| 777 | in the case of ``rcu_barrier()``. |
| 778 | |
| 779 | The ``rcu_data`` Structure |
| 780 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 781 | |
| 782 | The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The |
| 783 | fields in this structure may be accessed only from the corresponding CPU |
| 784 | (and from tracing) unless otherwise stated. This structure is the focus |
| 785 | of quiescent-state detection and RCU callback queuing. It also tracks |
| 786 | its relationship to the corresponding leaf ``rcu_node`` structure to |
| 787 | allow more-efficient propagation of quiescent states up the ``rcu_node`` |
| 788 | combining tree. Like the ``rcu_node`` structure, it provides a local |
| 789 | copy of the grace-period information to allow for-free synchronized |
| 790 | access to this information from the corresponding CPU. Finally, this |
| 791 | structure records past dyntick-idle state for the corresponding CPU and |
| 792 | also tracks statistics. |
| 793 | |
| 794 | The ``rcu_data`` structure's fields are discussed, singly and in groups, |
| 795 | in the following sections. |
| 796 | |
| 797 | Connection to Other Data Structures |
| 798 | ''''''''''''''''''''''''''''''''''' |
| 799 | |
| 800 | This portion of the ``rcu_data`` structure is declared as follows: |
| 801 | |
| 802 | :: |
| 803 | |
| 804 | 1 int cpu; |
| 805 | 2 struct rcu_node *mynode; |
| 806 | 3 unsigned long grpmask; |
| 807 | 4 bool beenonline; |
| 808 | |
| 809 | The ``->cpu`` field contains the number of the corresponding CPU and the |
| 810 | ``->mynode`` field references the corresponding ``rcu_node`` structure. |
| 811 | The ``->mynode`` is used to propagate quiescent states up the combining |
| 812 | tree. These two fields are constant and therefore do not require |
| 813 | synchronization. |
| 814 | |
| 815 | The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` |
| 816 | corresponding to this ``rcu_data`` structure, and is also used when |
| 817 | propagating quiescent states. The ``->beenonline`` flag is set whenever |
| 818 | the corresponding CPU comes online, which means that the debugfs tracing |
| 819 | need not dump out any ``rcu_data`` structure for which this flag is not |
| 820 | set. |
| 821 | |
| 822 | Quiescent-State and Grace-Period Tracking |
| 823 | ''''''''''''''''''''''''''''''''''''''''' |
| 824 | |
| 825 | This portion of the ``rcu_data`` structure is declared as follows: |
| 826 | |
| 827 | :: |
| 828 | |
| 829 | 1 unsigned long gp_seq; |
| 830 | 2 unsigned long gp_seq_needed; |
| 831 | 3 bool cpu_no_qs; |
| 832 | 4 bool core_needs_qs; |
| 833 | 5 bool gpwrap; |
| 834 | |
| 835 | The ``->gp_seq`` field is the counterpart of the field of the same name |
| 836 | in the ``rcu_state`` and ``rcu_node`` structures. The |
| 837 | ``->gp_seq_needed`` field is the counterpart of the field of the same |
| 838 | name in the rcu_node structure. They may each lag up to one behind their |
| 839 | ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and |
| 840 | ``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in |
| 841 | dyntick-idle mode (but these counters will catch up upon exit from |
| 842 | dyntick-idle mode). If the lower two bits of a given ``rcu_data`` |
| 843 | structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure |
| 844 | believes that RCU is idle. |
| 845 | |
| 846 | +-----------------------------------------------------------------------+ |
| 847 | | **Quick Quiz**: | |
| 848 | +-----------------------------------------------------------------------+ |
| 849 | | All this replication of the grace period numbers can only cause | |
| 850 | | massive confusion. Why not just keep a global sequence number and be | |
| 851 | | done with it??? | |
| 852 | +-----------------------------------------------------------------------+ |
| 853 | | **Answer**: | |
| 854 | +-----------------------------------------------------------------------+ |
| 855 | | Because if there was only a single global sequence numbers, there | |
| 856 | | would need to be a single global lock to allow safely accessing and | |
| 857 | | updating it. And if we are not going to have a single global lock, we | |
| 858 | | need to carefully manage the numbers on a per-node basis. Recall from | |
| 859 | | the answer to a previous Quick Quiz that the consequences of applying | |
| 860 | | a previously sampled quiescent state to the wrong grace period are | |
| 861 | | quite severe. | |
| 862 | +-----------------------------------------------------------------------+ |
| 863 | |
| 864 | The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed |
| 865 | through a quiescent state, while the ``->core_needs_qs`` flag indicates |
| 866 | that the RCU core needs a quiescent state from the corresponding CPU. |
| 867 | The ``->gpwrap`` field indicates that the corresponding CPU has remained |
| 868 | idle for so long that the ``gp_seq`` counter is in danger of overflow, |
| 869 | which will cause the CPU to disregard the values of its counters on its |
| 870 | next exit from idle. |
| 871 | |
| 872 | RCU Callback Handling |
| 873 | ''''''''''''''''''''' |
| 874 | |
| 875 | In the absence of CPU-hotplug events, RCU callbacks are invoked by the |
| 876 | same CPU that registered them. This is strictly a cache-locality |
| 877 | optimization: callbacks can and do get invoked on CPUs other than the |
| 878 | one that registered them. After all, if the CPU that registered a given |
| 879 | callback has gone offline before the callback can be invoked, there |
| 880 | really is no other choice. |
| 881 | |
| 882 | This portion of the ``rcu_data`` structure is declared as follows: |
| 883 | |
| 884 | :: |
| 885 | |
| 886 | 1 struct rcu_segcblist cblist; |
| 887 | 2 long qlen_last_fqs_check; |
| 888 | 3 unsigned long n_cbs_invoked; |
| 889 | 4 unsigned long n_nocbs_invoked; |
| 890 | 5 unsigned long n_cbs_orphaned; |
| 891 | 6 unsigned long n_cbs_adopted; |
| 892 | 7 unsigned long n_force_qs_snap; |
| 893 | 8 long blimit; |
| 894 | |
| 895 | The ``->cblist`` structure is the segmented callback list described |
| 896 | earlier. The CPU advances the callbacks in its ``rcu_data`` structure |
| 897 | whenever it notices that another RCU grace period has completed. The CPU |
| 898 | detects the completion of an RCU grace period by noticing that the value |
| 899 | of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of |
| 900 | its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` |
| 901 | structure's ``->gp_seq`` field is updated at the beginnings and ends of |
| 902 | each grace period. |
| 903 | |
| 904 | The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the |
| 905 | forcing of quiescent states from ``call_rcu()`` and friends when |
| 906 | callback lists grow excessively long. |
| 907 | |
| 908 | The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` |
| 909 | fields count the number of callbacks invoked, sent to other CPUs when |
| 910 | this CPU goes offline, and received from other CPUs when those other |
| 911 | CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's |
| 912 | callbacks are offloaded to a kthread. |
| 913 | |
| 914 | Finally, the ``->blimit`` counter is the maximum number of RCU callbacks |
| 915 | that may be invoked at a given time. |
| 916 | |
| 917 | Dyntick-Idle Handling |
| 918 | ''''''''''''''''''''' |
| 919 | |
| 920 | This portion of the ``rcu_data`` structure is declared as follows: |
| 921 | |
| 922 | :: |
| 923 | |
| 924 | 1 int dynticks_snap; |
| 925 | 2 unsigned long dynticks_fqs; |
| 926 | |
| 927 | The ``->dynticks_snap`` field is used to take a snapshot of the |
| 928 | corresponding CPU's dyntick-idle state when forcing quiescent states, |
| 929 | and is therefore accessed from other CPUs. Finally, the |
| 930 | ``->dynticks_fqs`` field is used to count the number of times this CPU |
| 931 | is determined to be in dyntick-idle state, and is used for tracing and |
| 932 | debugging purposes. |
| 933 | |
| 934 | This portion of the rcu_data structure is declared as follows: |
| 935 | |
| 936 | :: |
| 937 | |
| 938 | 1 long dynticks_nesting; |
| 939 | 2 long dynticks_nmi_nesting; |
| 940 | 3 atomic_t dynticks; |
| 941 | 4 bool rcu_need_heavy_qs; |
| 942 | 5 bool rcu_urgent_qs; |
| 943 | |
| 944 | These fields in the rcu_data structure maintain the per-CPU dyntick-idle |
| 945 | state for the corresponding CPU. The fields may be accessed only from |
| 946 | the corresponding CPU (and from tracing) unless otherwise stated. |
| 947 | |
| 948 | The ``->dynticks_nesting`` field counts the nesting depth of process |
| 949 | execution, so that in normal circumstances this counter has value zero |
| 950 | or one. NMIs, irqs, and tracers are counted by the |
| 951 | ``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes |
| 952 | to this variable have to be undertaken carefully using an algorithm |
| 953 | provided by Andy Lutomirski. The initial transition from idle adds one, |
| 954 | and nested transitions add two, so that a nesting level of five is |
| 955 | represented by a ``->dynticks_nmi_nesting`` value of nine. This counter |
| 956 | can therefore be thought of as counting the number of reasons why this |
| 957 | CPU cannot be permitted to enter dyntick-idle mode, aside from |
| 958 | process-level transitions. |
| 959 | |
| 960 | However, it turns out that when running in non-idle kernel context, the |
| 961 | Linux kernel is fully capable of entering interrupt handlers that never |
| 962 | exit and perhaps also vice versa. Therefore, whenever the |
| 963 | ``->dynticks_nesting`` field is incremented up from zero, the |
| 964 | ``->dynticks_nmi_nesting`` field is set to a large positive number, and |
| 965 | whenever the ``->dynticks_nesting`` field is decremented down to zero, |
| 966 | the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that |
| 967 | the number of misnested interrupts is not sufficient to overflow the |
| 968 | counter, this approach corrects the ``->dynticks_nmi_nesting`` field |
| 969 | every time the corresponding CPU enters the idle loop from process |
| 970 | context. |
| 971 | |
| 972 | The ``->dynticks`` field counts the corresponding CPU's transitions to |
| 973 | and from either dyntick-idle or user mode, so that this counter has an |
| 974 | even value when the CPU is in dyntick-idle mode or user mode and an odd |
| 975 | value otherwise. The transitions to/from user mode need to be counted |
| 976 | for user mode adaptive-ticks support (see timers/NO_HZ.txt). |
| 977 | |
| 978 | The ``->rcu_need_heavy_qs`` field is used to record the fact that the |
| 979 | RCU core code would really like to see a quiescent state from the |
| 980 | corresponding CPU, so much so that it is willing to call for |
| 981 | heavy-weight dyntick-counter operations. This flag is checked by RCU's |
| 982 | context-switch and ``cond_resched()`` code, which provide a momentary |
| 983 | idle sojourn in response. |
| 984 | |
| 985 | Finally, the ``->rcu_urgent_qs`` field is used to record the fact that |
| 986 | the RCU core code would really like to see a quiescent state from the |
| 987 | corresponding CPU, with the various other fields indicating just how |
| 988 | badly RCU wants this quiescent state. This flag is checked by RCU's |
| 989 | context-switch path (``rcu_note_context_switch``) and the cond_resched |
| 990 | code. |
| 991 | |
| 992 | +-----------------------------------------------------------------------+ |
| 993 | | **Quick Quiz**: | |
| 994 | +-----------------------------------------------------------------------+ |
| 995 | | Why not simply combine the ``->dynticks_nesting`` and | |
| 996 | | ``->dynticks_nmi_nesting`` counters into a single counter that just | |
| 997 | | counts the number of reasons that the corresponding CPU is non-idle? | |
| 998 | +-----------------------------------------------------------------------+ |
| 999 | | **Answer**: | |
| 1000 | +-----------------------------------------------------------------------+ |
| 1001 | | Because this would fail in the presence of interrupts whose handlers | |
| 1002 | | never return and of handlers that manage to return from a made-up | |
| 1003 | | interrupt. | |
| 1004 | +-----------------------------------------------------------------------+ |
| 1005 | |
| 1006 | Additional fields are present for some special-purpose builds, and are |
| 1007 | discussed separately. |
| 1008 | |
| 1009 | The ``rcu_head`` Structure |
| 1010 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1011 | |
| 1012 | Each ``rcu_head`` structure represents an RCU callback. These structures |
| 1013 | are normally embedded within RCU-protected data structures whose |
| 1014 | algorithms use asynchronous grace periods. In contrast, when using |
| 1015 | algorithms that block waiting for RCU grace periods, RCU users need not |
| 1016 | provide ``rcu_head`` structures. |
| 1017 | |
| 1018 | The ``rcu_head`` structure has fields as follows: |
| 1019 | |
| 1020 | :: |
| 1021 | |
| 1022 | 1 struct rcu_head *next; |
| 1023 | 2 void (*func)(struct rcu_head *head); |
| 1024 | |
| 1025 | The ``->next`` field is used to link the ``rcu_head`` structures |
| 1026 | together in the lists within the ``rcu_data`` structures. The ``->func`` |
| 1027 | field is a pointer to the function to be called when the callback is |
| 1028 | ready to be invoked, and this function is passed a pointer to the |
| 1029 | ``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` |
| 1030 | field to record the offset of the ``rcu_head`` structure within the |
| 1031 | enclosing RCU-protected data structure. |
| 1032 | |
| 1033 | Both of these fields are used internally by RCU. From the viewpoint of |
| 1034 | RCU users, this structure is an opaque “cookie”. |
| 1035 | |
| 1036 | +-----------------------------------------------------------------------+ |
| 1037 | | **Quick Quiz**: | |
| 1038 | +-----------------------------------------------------------------------+ |
| 1039 | | Given that the callback function ``->func`` is passed a pointer to | |
| 1040 | | the ``rcu_head`` structure, how is that function supposed to find the | |
| 1041 | | beginning of the enclosing RCU-protected data structure? | |
| 1042 | +-----------------------------------------------------------------------+ |
| 1043 | | **Answer**: | |
| 1044 | +-----------------------------------------------------------------------+ |
| 1045 | | In actual practice, there is a separate callback function per type of | |
| 1046 | | RCU-protected data structure. The callback function can therefore use | |
| 1047 | | the ``container_of()`` macro in the Linux kernel (or other | |
| 1048 | | pointer-manipulation facilities in other software environments) to | |
| 1049 | | find the beginning of the enclosing structure. | |
| 1050 | +-----------------------------------------------------------------------+ |
| 1051 | |
| 1052 | RCU-Specific Fields in the ``task_struct`` Structure |
| 1053 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1054 | |
| 1055 | The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in |
| 1056 | the ``task_struct`` structure: |
| 1057 | |
| 1058 | :: |
| 1059 | |
| 1060 | 1 #ifdef CONFIG_PREEMPT_RCU |
| 1061 | 2 int rcu_read_lock_nesting; |
| 1062 | 3 union rcu_special rcu_read_unlock_special; |
| 1063 | 4 struct list_head rcu_node_entry; |
| 1064 | 5 struct rcu_node *rcu_blocked_node; |
| 1065 | 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ |
| 1066 | 7 #ifdef CONFIG_TASKS_RCU |
| 1067 | 8 unsigned long rcu_tasks_nvcsw; |
| 1068 | 9 bool rcu_tasks_holdout; |
| 1069 | 10 struct list_head rcu_tasks_holdout_list; |
| 1070 | 11 int rcu_tasks_idle_cpu; |
| 1071 | 12 #endif /* #ifdef CONFIG_TASKS_RCU */ |
| 1072 | |
| 1073 | The ``->rcu_read_lock_nesting`` field records the nesting level for RCU |
| 1074 | read-side critical sections, and the ``->rcu_read_unlock_special`` field |
| 1075 | is a bitmask that records special conditions that require |
| 1076 | ``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` |
| 1077 | field is used to form lists of tasks that have blocked within |
| 1078 | preemptible-RCU read-side critical sections and the |
| 1079 | ``->rcu_blocked_node`` field references the ``rcu_node`` structure whose |
| 1080 | list this task is a member of, or ``NULL`` if it is not blocked within a |
| 1081 | preemptible-RCU read-side critical section. |
| 1082 | |
| 1083 | The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context |
| 1084 | switches that this task had undergone at the beginning of the current |
| 1085 | tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current |
| 1086 | tasks-RCU grace period is waiting on this task, |
| 1087 | ``->rcu_tasks_holdout_list`` is a list element enqueuing this task on |
| 1088 | the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this |
| 1089 | idle task is running, but only if the task is currently running, that |
| 1090 | is, if the CPU is currently idle. |
| 1091 | |
| 1092 | Accessor Functions |
| 1093 | ~~~~~~~~~~~~~~~~~~ |
| 1094 | |
| 1095 | The following listing shows the ``rcu_get_root()``, |
| 1096 | ``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` |
| 1097 | function and macros: |
| 1098 | |
| 1099 | :: |
| 1100 | |
| 1101 | 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) |
| 1102 | 2 { |
| 1103 | 3 return &rsp->node[0]; |
| 1104 | 4 } |
| 1105 | 5 |
| 1106 | 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ |
| 1107 | 7 for ((rnp) = &(rsp)->node[0]; \ |
| 1108 | 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) |
| 1109 | 9 |
| 1110 | 10 #define rcu_for_each_leaf_node(rsp, rnp) \ |
| 1111 | 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ |
| 1112 | 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) |
| 1113 | |
| 1114 | The ``rcu_get_root()`` simply returns a pointer to the first element of |
| 1115 | the specified ``rcu_state`` structure's ``->node[]`` array, which is the |
| 1116 | root ``rcu_node`` structure. |
| 1117 | |
| 1118 | As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes |
| 1119 | advantage of the layout of the ``rcu_node`` structures in the |
| 1120 | ``rcu_state`` structure's ``->node[]`` array, performing a breadth-first |
| 1121 | traversal by simply traversing the array in order. Similarly, the |
| 1122 | ``rcu_for_each_leaf_node()`` macro traverses only the last part of the |
| 1123 | array, thus traversing only the leaf ``rcu_node`` structures. |
| 1124 | |
| 1125 | +-----------------------------------------------------------------------+ |
| 1126 | | **Quick Quiz**: | |
| 1127 | +-----------------------------------------------------------------------+ |
| 1128 | | What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | |
| 1129 | | contains only a single node? | |
| 1130 | +-----------------------------------------------------------------------+ |
| 1131 | | **Answer**: | |
| 1132 | +-----------------------------------------------------------------------+ |
| 1133 | | In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | |
| 1134 | | single node. | |
| 1135 | +-----------------------------------------------------------------------+ |
| 1136 | |
| 1137 | Summary |
| 1138 | ~~~~~~~ |
| 1139 | |
| 1140 | So the state of RCU is represented by an ``rcu_state`` structure, which |
| 1141 | contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. |
| 1142 | Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state |
| 1143 | is tracked by dynticks-related fields in the ``rcu_data`` structure. If |
| 1144 | you made it this far, you are well prepared to read the code |
| 1145 | walkthroughs in the other articles in this series. |
| 1146 | |
| 1147 | Acknowledgments |
| 1148 | ~~~~~~~~~~~~~~~ |
| 1149 | |
| 1150 | I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul |
| 1151 | Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for |
| 1152 | helping me get this document into a more human-readable state. |
| 1153 | |
| 1154 | Legal Statement |
| 1155 | ~~~~~~~~~~~~~~~ |
| 1156 | |
| 1157 | This work represents the view of the author and does not necessarily |
| 1158 | represent the view of IBM. |
| 1159 | |
| 1160 | Linux is a registered trademark of Linus Torvalds. |
| 1161 | |
| 1162 | Other company, product, and service names may be trademarks or service |
| 1163 | marks of others. |