Update Linux to v5.4.2
Change-Id: Idf6911045d9d382da2cfe01b1edff026404ac8fd
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
deleted file mode 100644
index eccf7ad..0000000
--- a/Documentation/scheduler/00-INDEX
+++ /dev/null
@@ -1,18 +0,0 @@
-00-INDEX
- - this file.
-sched-arch.txt
- - CPU Scheduler implementation hints for architecture specific code.
-sched-bwc.txt
- - CFS bandwidth control overview.
-sched-design-CFS.txt
- - goals, design and implementation of the Completely Fair Scheduler.
-sched-domains.txt
- - information on scheduling domains.
-sched-nice-design.txt
- - How and why the scheduler's nice levels are implemented.
-sched-rt-group.txt
- - real-time group scheduling.
-sched-deadline.txt
- - deadline scheduling.
-sched-stats.txt
- - information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/completion.rst b/Documentation/scheduler/completion.rst
new file mode 100644
index 0000000..9f039b4
--- /dev/null
+++ b/Documentation/scheduler/completion.rst
@@ -0,0 +1,293 @@
+================================================
+Completions - "wait for completion" barrier APIs
+================================================
+
+Introduction:
+-------------
+
+If you have one or more threads that must wait for some kernel activity
+to have reached a point or a specific state, completions can provide a
+race-free solution to this problem. Semantically they are somewhat like a
+pthread_barrier() and have similar use-cases.
+
+Completions are a code synchronization mechanism which is preferable to any
+misuse of locks/semaphores and busy-loops. Any time you think of using
+yield() or some quirky msleep(1) loop to allow something else to proceed,
+you probably want to look into using one of the wait_for_completion*()
+calls and complete() instead.
+
+The advantage of using completions is that they have a well defined, focused
+purpose which makes it very easy to see the intent of the code, but they
+also result in more efficient code as all threads can continue execution
+until the result is actually needed, and both the waiting and the signalling
+is highly efficient using low level scheduler sleep/wakeup facilities.
+
+Completions are built on top of the waitqueue and wakeup infrastructure of
+the Linux scheduler. The event the threads on the waitqueue are waiting for
+is reduced to a simple flag in 'struct completion', appropriately called "done".
+
+As completions are scheduling related, the code can be found in
+kernel/sched/completion.c.
+
+
+Usage:
+------
+
+There are three main parts to using completions:
+
+ - the initialization of the 'struct completion' synchronization object
+ - the waiting part through a call to one of the variants of wait_for_completion(),
+ - the signaling side through a call to complete() or complete_all().
+
+There are also some helper functions for checking the state of completions.
+Note that while initialization must happen first, the waiting and signaling
+part can happen in any order. I.e. it's entirely normal for a thread
+to have marked a completion as 'done' before another thread checks whether
+it has to wait for it.
+
+To use completions you need to #include <linux/completion.h> and
+create a static or dynamic variable of type 'struct completion',
+which has only two fields::
+
+ struct completion {
+ unsigned int done;
+ wait_queue_head_t wait;
+ };
+
+This provides the ->wait waitqueue to place tasks on for waiting (if any), and
+the ->done completion flag for indicating whether it's completed or not.
+
+Completions should be named to refer to the event that is being synchronized on.
+A good example is::
+
+ wait_for_completion(&early_console_added);
+
+ complete(&early_console_added);
+
+Good, intuitive naming (as always) helps code readability. Naming a completion
+'complete' is not helpful unless the purpose is super obvious...
+
+
+Initializing completions:
+-------------------------
+
+Dynamically allocated completion objects should preferably be embedded in data
+structures that are assured to be alive for the life-time of the function/driver,
+to prevent races with asynchronous complete() calls from occurring.
+
+Particular care should be taken when using the _timeout() or _killable()/_interruptible()
+variants of wait_for_completion(), as it must be assured that memory de-allocation
+does not happen until all related activities (complete() or reinit_completion())
+have taken place, even if these wait functions return prematurely due to a timeout
+or a signal triggering.
+
+Initializing of dynamically allocated completion objects is done via a call to
+init_completion()::
+
+ init_completion(&dynamic_object->done);
+
+In this call we initialize the waitqueue and set ->done to 0, i.e. "not completed"
+or "not done".
+
+The re-initialization function, reinit_completion(), simply resets the
+->done field to 0 ("not done"), without touching the waitqueue.
+Callers of this function must make sure that there are no racy
+wait_for_completion() calls going on in parallel.
+
+Calling init_completion() on the same completion object twice is
+most likely a bug as it re-initializes the queue to an empty queue and
+enqueued tasks could get "lost" - use reinit_completion() in that case,
+but be aware of other races.
+
+For static declaration and initialization, macros are available.
+
+For static (or global) declarations in file scope you can use
+DECLARE_COMPLETION()::
+
+ static DECLARE_COMPLETION(setup_done);
+ DECLARE_COMPLETION(setup_done);
+
+Note that in this case the completion is boot time (or module load time)
+initialized to 'not done' and doesn't require an init_completion() call.
+
+When a completion is declared as a local variable within a function,
+then the initialization should always use DECLARE_COMPLETION_ONSTACK()
+explicitly, not just to make lockdep happy, but also to make it clear
+that limited scope had been considered and is intentional::
+
+ DECLARE_COMPLETION_ONSTACK(setup_done)
+
+Note that when using completion objects as local variables you must be
+acutely aware of the short life time of the function stack: the function
+must not return to a calling context until all activities (such as waiting
+threads) have ceased and the completion object is completely unused.
+
+To emphasise this again: in particular when using some of the waiting API variants
+with more complex outcomes, such as the timeout or signalling (_timeout(),
+_killable() and _interruptible()) variants, the wait might complete
+prematurely while the object might still be in use by another thread - and a return
+from the wait_on_completion*() caller function will deallocate the function
+stack and cause subtle data corruption if a complete() is done in some
+other thread. Simple testing might not trigger these kinds of races.
+
+If unsure, use dynamically allocated completion objects, preferably embedded
+in some other long lived object that has a boringly long life time which
+exceeds the life time of any helper threads using the completion object,
+or has a lock or other synchronization mechanism to make sure complete()
+is not called on a freed object.
+
+A naive DECLARE_COMPLETION() on the stack triggers a lockdep warning.
+
+Waiting for completions:
+------------------------
+
+For a thread to wait for some concurrent activity to finish, it
+calls wait_for_completion() on the initialized completion structure::
+
+ void wait_for_completion(struct completion *done)
+
+A typical usage scenario is::
+
+ CPU#1 CPU#2
+
+ struct completion setup_done;
+
+ init_completion(&setup_done);
+ initialize_work(...,&setup_done,...);
+
+ /* run non-dependent code */ /* do setup */
+
+ wait_for_completion(&setup_done); complete(setup_done);
+
+This is not implying any particular order between wait_for_completion() and
+the call to complete() - if the call to complete() happened before the call
+to wait_for_completion() then the waiting side simply will continue
+immediately as all dependencies are satisfied; if not, it will block until
+completion is signaled by complete().
+
+Note that wait_for_completion() is calling spin_lock_irq()/spin_unlock_irq(),
+so it can only be called safely when you know that interrupts are enabled.
+Calling it from IRQs-off atomic contexts will result in hard-to-detect
+spurious enabling of interrupts.
+
+The default behavior is to wait without a timeout and to mark the task as
+uninterruptible. wait_for_completion() and its variants are only safe
+in process context (as they can sleep) but not in atomic context,
+interrupt context, with disabled IRQs, or preemption is disabled - see also
+try_wait_for_completion() below for handling completion in atomic/interrupt
+context.
+
+As all variants of wait_for_completion() can (obviously) block for a long
+time depending on the nature of the activity they are waiting for, so in
+most cases you probably don't want to call this with held mutexes.
+
+
+wait_for_completion*() variants available:
+------------------------------------------
+
+The below variants all return status and this status should be checked in
+most(/all) cases - in cases where the status is deliberately not checked you
+probably want to make a note explaining this (e.g. see
+arch/arm/kernel/smp.c:__cpu_up()).
+
+A common problem that occurs is to have unclean assignment of return types,
+so take care to assign return-values to variables of the proper type.
+
+Checking for the specific meaning of return values also has been found
+to be quite inaccurate, e.g. constructs like::
+
+ if (!wait_for_completion_interruptible_timeout(...))
+
+... would execute the same code path for successful completion and for the
+interrupted case - which is probably not what you want::
+
+ int wait_for_completion_interruptible(struct completion *done)
+
+This function marks the task TASK_INTERRUPTIBLE while it is waiting.
+If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise::
+
+ unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
+
+The task is marked as TASK_UNINTERRUPTIBLE and will wait at most 'timeout'
+jiffies. If a timeout occurs it returns 0, else the remaining time in
+jiffies (but at least 1).
+
+Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies(),
+to make the code largely HZ-invariant.
+
+If the returned timeout value is deliberately ignored a comment should probably explain
+why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc())::
+
+ long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
+
+This function passes a timeout in jiffies and marks the task as
+TASK_INTERRUPTIBLE. If a signal was received it will return -ERESTARTSYS;
+otherwise it returns 0 if the completion timed out, or the remaining time in
+jiffies if completion occurred.
+
+Further variants include _killable which uses TASK_KILLABLE as the
+designated tasks state and will return -ERESTARTSYS if it is interrupted,
+or 0 if completion was achieved. There is a _timeout variant as well::
+
+ long wait_for_completion_killable(struct completion *done)
+ long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
+
+The _io variants wait_for_completion_io() behave the same as the non-_io
+variants, except for accounting waiting time as 'waiting on IO', which has
+an impact on how the task is accounted in scheduling/IO stats::
+
+ void wait_for_completion_io(struct completion *done)
+ unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
+
+
+Signaling completions:
+----------------------
+
+A thread that wants to signal that the conditions for continuation have been
+achieved calls complete() to signal exactly one of the waiters that it can
+continue::
+
+ void complete(struct completion *done)
+
+... or calls complete_all() to signal all current and future waiters::
+
+ void complete_all(struct completion *done)
+
+The signaling will work as expected even if completions are signaled before
+a thread starts waiting. This is achieved by the waiter "consuming"
+(decrementing) the done field of 'struct completion'. Waiting threads
+wakeup order is the same in which they were enqueued (FIFO order).
+
+If complete() is called multiple times then this will allow for that number
+of waiters to continue - each call to complete() will simply increment the
+done field. Calling complete_all() multiple times is a bug though. Both
+complete() and complete_all() can be called in IRQ/atomic context safely.
+
+There can only be one thread calling complete() or complete_all() on a
+particular 'struct completion' at any time - serialized through the wait
+queue spinlock. Any such concurrent calls to complete() or complete_all()
+probably are a design bug.
+
+Signaling completion from IRQ context is fine as it will appropriately
+lock with spin_lock_irqsave()/spin_unlock_irqrestore() and it will never
+sleep.
+
+
+try_wait_for_completion()/completion_done():
+--------------------------------------------
+
+The try_wait_for_completion() function will not put the thread on the wait
+queue but rather returns false if it would need to enqueue (block) the thread,
+else it consumes one posted completion and returns true::
+
+ bool try_wait_for_completion(struct completion *done)
+
+Finally, to check the state of a completion without changing it in any way,
+call completion_done(), which returns false if there are no posted
+completions that were not yet consumed by waiters (implying that there are
+waiters) and true otherwise::
+
+ bool completion_done(struct completion *done)
+
+Both try_wait_for_completion() and completion_done() are safe to be called in
+IRQ or atomic context.
diff --git a/Documentation/scheduler/completion.txt b/Documentation/scheduler/completion.txt
deleted file mode 100644
index 656cf80..0000000
--- a/Documentation/scheduler/completion.txt
+++ /dev/null
@@ -1,247 +0,0 @@
-completions - wait for completion handling
-==========================================
-
-This document was originally written based on 3.18.0 (linux-next)
-
-Introduction:
--------------
-
-If you have one or more threads of execution that must wait for some process
-to have reached a point or a specific state, completions can provide a
-race-free solution to this problem. Semantically they are somewhat like a
-pthread_barrier and have similar use-cases.
-
-Completions are a code synchronization mechanism which is preferable to any
-misuse of locks. Any time you think of using yield() or some quirky
-msleep(1) loop to allow something else to proceed, you probably want to
-look into using one of the wait_for_completion*() calls instead. The
-advantage of using completions is clear intent of the code, but also more
-efficient code as both threads can continue until the result is actually
-needed.
-
-Completions are built on top of the generic event infrastructure in Linux,
-with the event reduced to a simple flag (appropriately called "done") in
-struct completion that tells the waiting threads of execution if they
-can continue safely.
-
-As completions are scheduling related, the code is found in
-kernel/sched/completion.c.
-
-
-Usage:
-------
-
-There are three parts to using completions, the initialization of the
-struct completion, the waiting part through a call to one of the variants of
-wait_for_completion() and the signaling side through a call to complete()
-or complete_all(). Further there are some helper functions for checking the
-state of completions.
-
-To use completions one needs to include <linux/completion.h> and
-create a variable of type struct completion. The structure used for
-handling of completions is:
-
- struct completion {
- unsigned int done;
- wait_queue_head_t wait;
- };
-
-providing the wait queue to place tasks on for waiting and the flag for
-indicating the state of affairs.
-
-Completions should be named to convey the intent of the waiter. A good
-example is:
-
- wait_for_completion(&early_console_added);
-
- complete(&early_console_added);
-
-Good naming (as always) helps code readability.
-
-
-Initializing completions:
--------------------------
-
-Initialization of dynamically allocated completions, often embedded in
-other structures, is done with:
-
- void init_completion(&done);
-
-Initialization is accomplished by initializing the wait queue and setting
-the default state to "not available", that is, "done" is set to 0.
-
-The re-initialization function, reinit_completion(), simply resets the
-done element to "not available", thus again to 0, without touching the
-wait queue. Calling init_completion() twice on the same completion object is
-most likely a bug as it re-initializes the queue to an empty queue and
-enqueued tasks could get "lost" - use reinit_completion() in that case.
-
-For static declaration and initialization, macros are available. These are:
-
- static DECLARE_COMPLETION(setup_done)
-
-used for static declarations in file scope. Within functions the static
-initialization should always use:
-
- DECLARE_COMPLETION_ONSTACK(setup_done)
-
-suitable for automatic/local variables on the stack and will make lockdep
-happy. Note also that one needs to make *sure* the completion passed to
-work threads remains in-scope, and no references remain to on-stack data
-when the initiating function returns.
-
-Using on-stack completions for code that calls any of the _timeout or
-_interruptible/_killable variants is not advisable as they will require
-additional synchronization to prevent the on-stack completion object in
-the timeout/signal cases from going out of scope. Consider using dynamically
-allocated completions when intending to use the _interruptible/_killable
-or _timeout variants of wait_for_completion().
-
-
-Waiting for completions:
-------------------------
-
-For a thread of execution to wait for some concurrent work to finish, it
-calls wait_for_completion() on the initialized completion structure.
-A typical usage scenario is:
-
- struct completion setup_done;
- init_completion(&setup_done);
- initialize_work(...,&setup_done,...)
-
- /* run non-dependent code */ /* do setup */
-
- wait_for_completion(&setup_done); complete(setup_done)
-
-This is not implying any temporal order on wait_for_completion() and the
-call to complete() - if the call to complete() happened before the call
-to wait_for_completion() then the waiting side simply will continue
-immediately as all dependencies are satisfied if not it will block until
-completion is signaled by complete().
-
-Note that wait_for_completion() is calling spin_lock_irq()/spin_unlock_irq(),
-so it can only be called safely when you know that interrupts are enabled.
-Calling it from hard-irq or irqs-off atomic contexts will result in
-hard-to-detect spurious enabling of interrupts.
-
-wait_for_completion():
-
- void wait_for_completion(struct completion *done):
-
-The default behavior is to wait without a timeout and to mark the task as
-uninterruptible. wait_for_completion() and its variants are only safe
-in process context (as they can sleep) but not in atomic context,
-interrupt context, with disabled irqs. or preemption is disabled - see also
-try_wait_for_completion() below for handling completion in atomic/interrupt
-context.
-
-As all variants of wait_for_completion() can (obviously) block for a long
-time, you probably don't want to call this with held mutexes.
-
-
-Variants available:
--------------------
-
-The below variants all return status and this status should be checked in
-most(/all) cases - in cases where the status is deliberately not checked you
-probably want to make a note explaining this (e.g. see
-arch/arm/kernel/smp.c:__cpu_up()).
-
-A common problem that occurs is to have unclean assignment of return types,
-so care should be taken with assigning return-values to variables of proper
-type. Checking for the specific meaning of return values also has been found
-to be quite inaccurate e.g. constructs like
-if (!wait_for_completion_interruptible_timeout(...)) would execute the same
-code path for successful completion and for the interrupted case - which is
-probably not what you want.
-
- int wait_for_completion_interruptible(struct completion *done)
-
-This function marks the task TASK_INTERRUPTIBLE. If a signal was received
-while waiting it will return -ERESTARTSYS; 0 otherwise.
-
- unsigned long wait_for_completion_timeout(struct completion *done,
- unsigned long timeout)
-
-The task is marked as TASK_UNINTERRUPTIBLE and will wait at most 'timeout'
-(in jiffies). If timeout occurs it returns 0 else the remaining time in
-jiffies (but at least 1). Timeouts are preferably calculated with
-msecs_to_jiffies() or usecs_to_jiffies(). If the returned timeout value is
-deliberately ignored a comment should probably explain why (e.g. see
-drivers/mfd/wm8350-core.c wm8350_read_auxadc())
-
- long wait_for_completion_interruptible_timeout(
- struct completion *done, unsigned long timeout)
-
-This function passes a timeout in jiffies and marks the task as
-TASK_INTERRUPTIBLE. If a signal was received it will return -ERESTARTSYS;
-otherwise it returns 0 if the completion timed out or the remaining time in
-jiffies if completion occurred.
-
-Further variants include _killable which uses TASK_KILLABLE as the
-designated tasks state and will return -ERESTARTSYS if it is interrupted or
-else 0 if completion was achieved. There is a _timeout variant as well:
-
- long wait_for_completion_killable(struct completion *done)
- long wait_for_completion_killable_timeout(struct completion *done,
- unsigned long timeout)
-
-The _io variants wait_for_completion_io() behave the same as the non-_io
-variants, except for accounting waiting time as waiting on IO, which has
-an impact on how the task is accounted in scheduling stats.
-
- void wait_for_completion_io(struct completion *done)
- unsigned long wait_for_completion_io_timeout(struct completion *done
- unsigned long timeout)
-
-
-Signaling completions:
-----------------------
-
-A thread that wants to signal that the conditions for continuation have been
-achieved calls complete() to signal exactly one of the waiters that it can
-continue.
-
- void complete(struct completion *done)
-
-or calls complete_all() to signal all current and future waiters.
-
- void complete_all(struct completion *done)
-
-The signaling will work as expected even if completions are signaled before
-a thread starts waiting. This is achieved by the waiter "consuming"
-(decrementing) the done element of struct completion. Waiting threads
-wakeup order is the same in which they were enqueued (FIFO order).
-
-If complete() is called multiple times then this will allow for that number
-of waiters to continue - each call to complete() will simply increment the
-done element. Calling complete_all() multiple times is a bug though. Both
-complete() and complete_all() can be called in hard-irq/atomic context safely.
-
-There only can be one thread calling complete() or complete_all() on a
-particular struct completion at any time - serialized through the wait
-queue spinlock. Any such concurrent calls to complete() or complete_all()
-probably are a design bug.
-
-Signaling completion from hard-irq context is fine as it will appropriately
-lock with spin_lock_irqsave/spin_unlock_irqrestore and it will never sleep.
-
-
-try_wait_for_completion()/completion_done():
---------------------------------------------
-
-The try_wait_for_completion() function will not put the thread on the wait
-queue but rather returns false if it would need to enqueue (block) the thread,
-else it consumes one posted completion and returns true.
-
- bool try_wait_for_completion(struct completion *done)
-
-Finally, to check the state of a completion without changing it in any way,
-call completion_done(), which returns false if there are no posted
-completions that were not yet consumed by waiters (implying that there are
-waiters) and true otherwise;
-
- bool completion_done(struct completion *done)
-
-Both try_wait_for_completion() and completion_done() are safe to be called in
-hard-irq or atomic context.
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
new file mode 100644
index 0000000..69074e5
--- /dev/null
+++ b/Documentation/scheduler/index.rst
@@ -0,0 +1,27 @@
+===============
+Linux Scheduler
+===============
+
+.. toctree::
+ :maxdepth: 1
+
+
+ completion
+ sched-arch
+ sched-bwc
+ sched-deadline
+ sched-design-CFS
+ sched-domains
+ sched-energy
+ sched-nice-design
+ sched-rt-group
+ sched-stats
+
+ text_files
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/scheduler/sched-arch.txt b/Documentation/scheduler/sched-arch.rst
similarity index 81%
rename from Documentation/scheduler/sched-arch.txt
rename to Documentation/scheduler/sched-arch.rst
index a2f27bb..0eaec66 100644
--- a/Documentation/scheduler/sched-arch.txt
+++ b/Documentation/scheduler/sched-arch.rst
@@ -1,4 +1,6 @@
- CPU Scheduler implementation hints for architecture specific code
+=================================================================
+CPU Scheduler implementation hints for architecture specific code
+=================================================================
Nick Piggin, 2005
@@ -35,9 +37,10 @@
4. The only time interrupts need to be disabled when checking
need_resched is if we are about to sleep the processor until
the next interrupt (this doesn't provide any protection of
- need_resched, it prevents losing an interrupt).
+ need_resched, it prevents losing an interrupt):
- 4a. Common problem with this type of sleep appears to be:
+ 4a. Common problem with this type of sleep appears to be::
+
local_irq_disable();
if (!need_resched()) {
local_irq_enable();
@@ -51,10 +54,10 @@
although it may be reasonable to do some background work or enter
a low CPU priority.
- 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
- an interrupt sleep, it needs to be cleared then a memory
- barrier issued (followed by a test of need_resched with
- interrupts disabled, as explained in 3).
+ - 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
+ an interrupt sleep, it needs to be cleared then a memory
+ barrier issued (followed by a test of need_resched with
+ interrupts disabled, as explained in 3).
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
@@ -71,4 +74,3 @@
sparc - IRQs on at this point(?), change local_irq_save to _disable.
- TODO: needs secondary CPUs to disable preempt (See #1)
-
diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
new file mode 100644
index 0000000..9801d6b
--- /dev/null
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -0,0 +1,174 @@
+=====================
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+ The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
+within each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are::
+
+ cpu.cfs_period_us=100ms
+ cpu.cfs_quota=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place, such a group is described as an unconstrained
+bandwidth group. This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion, these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
+
+System wide settings
+--------------------
+For efficiency run-time is transferred between the global pool and CPU local
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs::
+
+ /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+ of the group have been throttled.
+
+This interface is read-only.
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy:
+
+ e.g. \Sum (c_i) may exceed C
+
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+
+There are two ways in which a group may become throttled:
+
+ a. it fully consumes its own quota within a period
+ b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining it will not
+be allowed to until the parent's runtime is refreshed.
+
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire. However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period. This
+also limits the burst ability to no more than 1ms per cpu. This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime::
+
+ If period is 250ms and quota is also 250ms, the group will get
+ 1 CPU worth of runtime every 250ms.
+
+ # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+ # echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine
+
+ With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+ runtime every 500ms::
+
+ # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+ # echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+ The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+ With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::
+
+ # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+ # echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+ By using a small period here we are ensuring a consistent latency
+ response at the expense of burst capacity.
diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
deleted file mode 100644
index f6b1873..0000000
--- a/Documentation/scheduler/sched-bwc.txt
+++ /dev/null
@@ -1,122 +0,0 @@
-CFS Bandwidth Control
-=====================
-
-[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
- The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
-
-CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
-specification of the maximum CPU bandwidth available to a group or hierarchy.
-
-The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
-
-Management
-----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
-
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
-cpu.cfs_period_us: the length of a period (in microseconds)
-cpu.stat: exports throttling statistics [explained further below]
-
-The default values are:
- cpu.cfs_period_us=100ms
- cpu.cfs_quota=-1
-
-A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
-bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
-CFS.
-
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
-
-Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
-and return the group to an unconstrained state once more.
-
-Any updates to a group's bandwidth specification will result in it becoming
-unthrottled if it is in a constrained state.
-
-System wide settings
---------------------
-For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
-is described as the "slice".
-
-This is tunable via procfs:
- /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
-
-Larger slice values will reduce transfer overheads, while smaller values allow
-for more fine-grained consumption.
-
-Statistics
-----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
-
-cpu.stat:
-- nr_periods: Number of enforcement intervals that have elapsed.
-- nr_throttled: Number of times the group has been throttled/limited.
-- throttled_time: The total time duration (in nanoseconds) for which entities
- of the group have been throttled.
-
-This interface is read-only.
-
-Hierarchical considerations
----------------------------
-The interface enforces that an individual entity's bandwidth is always
-attainable, that is: max(c_i) <= C. However, over-subscription in the
-aggregate case is explicitly allowed to enable work-conserving semantics
-within a hierarchy.
- e.g. \Sum (c_i) may exceed C
-[ Where C is the parent's bandwidth, and c_i its children ]
-
-
-There are two ways in which a group may become throttled:
- a. it fully consumes its own quota within a period
- b. a parent's quota is fully consumed within its period
-
-In case b) above, even though the child may have runtime remaining it will not
-be allowed to until the parent's runtime is refreshed.
-
-Examples
---------
-1. Limit a group to 1 CPU worth of runtime.
-
- If period is 250ms and quota is also 250ms, the group will get
- 1 CPU worth of runtime every 250ms.
-
- # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
- # echo 250000 > cpu.cfs_period_us /* period = 250ms */
-
-2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
-
- With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
- runtime every 500ms.
-
- # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
- # echo 500000 > cpu.cfs_period_us /* period = 500ms */
-
- The larger period here allows for increased burst capacity.
-
-3. Limit a group to 20% of 1 CPU.
-
- With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
-
- # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
- # echo 50000 > cpu.cfs_period_us /* period = 50ms */
-
- By using a small period here we are ensuring a consistent latency
- response at the expense of burst capacity.
-
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.rst
similarity index 87%
rename from Documentation/scheduler/sched-deadline.txt
rename to Documentation/scheduler/sched-deadline.rst
index b14e03f..14a2f7b 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -1,29 +1,29 @@
- Deadline Task Scheduling
- ------------------------
+========================
+Deadline Task Scheduling
+========================
-CONTENTS
-========
+.. CONTENTS
- 0. WARNING
- 1. Overview
- 2. Scheduling algorithm
- 2.1 Main algorithm
- 2.2 Bandwidth reclaiming
- 3. Scheduling Real-Time Tasks
- 3.1 Definitions
- 3.2 Schedulability Analysis for Uniprocessor Systems
- 3.3 Schedulability Analysis for Multiprocessor Systems
- 3.4 Relationship with SCHED_DEADLINE Parameters
- 4. Bandwidth management
- 4.1 System-wide settings
- 4.2 Task interface
- 4.3 Default behavior
- 4.4 Behavior of sched_yield()
- 5. Tasks CPU affinity
- 5.1 SCHED_DEADLINE and cpusets HOWTO
- 6. Future plans
- A. Test suite
- B. Minimal main()
+ 0. WARNING
+ 1. Overview
+ 2. Scheduling algorithm
+ 2.1 Main algorithm
+ 2.2 Bandwidth reclaiming
+ 3. Scheduling Real-Time Tasks
+ 3.1 Definitions
+ 3.2 Schedulability Analysis for Uniprocessor Systems
+ 3.3 Schedulability Analysis for Multiprocessor Systems
+ 3.4 Relationship with SCHED_DEADLINE Parameters
+ 4. Bandwidth management
+ 4.1 System-wide settings
+ 4.2 Task interface
+ 4.3 Default behavior
+ 4.4 Behavior of sched_yield()
+ 5. Tasks CPU affinity
+ 5.1 SCHED_DEADLINE and cpusets HOWTO
+ 6. Future plans
+ A. Test suite
+ B. Minimal main()
0. WARNING
@@ -44,7 +44,7 @@
2. Scheduling algorithm
-==================
+=======================
2.1 Main algorithm
------------------
@@ -80,7 +80,7 @@
a "remaining runtime". These two parameters are initially set to 0;
- When a SCHED_DEADLINE task wakes up (becomes ready for execution),
- the scheduler checks if
+ the scheduler checks if::
remaining runtime runtime
---------------------------------- > ---------
@@ -97,7 +97,7 @@
left unchanged;
- When a SCHED_DEADLINE task executes for an amount of time t, its
- remaining runtime is decreased as
+ remaining runtime is decreased as::
remaining runtime = remaining runtime - t
@@ -112,7 +112,7 @@
- When the current time is equal to the replenishment time of a
throttled task, the scheduling deadline and the remaining runtime are
- updated as
+ updated as::
scheduling deadline = scheduling deadline + period
remaining runtime = remaining runtime + runtime
@@ -129,7 +129,7 @@
Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
when flag SCHED_FLAG_RECLAIM is set.
- The following diagram illustrates the state names for tasks handled by GRUB:
+ The following diagram illustrates the state names for tasks handled by GRUB::
------------
(d) | Active |
@@ -168,7 +168,7 @@
breaking the real-time guarantees.
The 0-lag time for a task entering the ActiveNonContending state is
- computed as
+ computed as::
(runtime * dl_period)
deadline - ---------------------
@@ -183,7 +183,7 @@
the task's utilization must be removed from the previous runqueue's active
utilization and must be added to the new runqueue's active utilization.
In order to avoid races between a task waking up on a runqueue while the
- "inactive timer" is running on a different CPU, the "dl_non_contending"
+ "inactive timer" is running on a different CPU, the "dl_non_contending"
flag is used to indicate that a task is not on a runqueue but is active
(so, the flag is set when the task blocks and is cleared when the
"inactive timer" fires or when the task wakes up).
@@ -222,36 +222,36 @@
Let's now see a trivial example of two deadline tasks with runtime equal
- to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
- A Task T1
- |
- | |
- | |
- |-------- |----
- | | V
- |---|---|---|---|---|---|---|---|--------->t
- 0 1 2 3 4 5 6 7 8
+ A Task T1
+ |
+ | |
+ | |
+ |-------- |----
+ | | V
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
- A Task T2
- |
- | |
- | |
- | ------------------------|
- | | V
- |---|---|---|---|---|---|---|---|--------->t
- 0 1 2 3 4 5 6 7 8
+ A Task T2
+ |
+ | |
+ | |
+ | ------------------------|
+ | | V
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
- A running_bw
- |
- 1 ----------------- ------
- | | |
- 0.5- -----------------
- | |
- |---|---|---|---|---|---|---|---|--------->t
- 0 1 2 3 4 5 6 7 8
+ A running_bw
+ |
+ 1 ----------------- ------
+ | | |
+ 0.5- -----------------
+ | |
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
- Time t = 0:
@@ -284,7 +284,7 @@
2.3 Energy-aware scheduling
-------------------------
+---------------------------
When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
@@ -299,15 +299,20 @@
3. Scheduling Real-Time Tasks
=============================
- * BIG FAT WARNING ******************************************************
- *
- * This section contains a (not-thorough) summary on classical deadline
- * scheduling theory, and how it applies to SCHED_DEADLINE.
- * The reader can "safely" skip to Section 4 if only interested in seeing
- * how the scheduling policy can be used. Anyway, we strongly recommend
- * to come back here and continue reading (once the urge for testing is
- * satisfied :P) to be sure of fully understanding all technical details.
- ************************************************************************
+
+
+ .. BIG FAT WARNING ******************************************************
+
+ .. warning::
+
+ This section contains a (not-thorough) summary on classical deadline
+ scheduling theory, and how it applies to SCHED_DEADLINE.
+ The reader can "safely" skip to Section 4 if only interested in seeing
+ how the scheduling policy can be used. Anyway, we strongly recommend
+ to come back here and continue reading (once the urge for testing is
+ satisfied :P) to be sure of fully understanding all technical details.
+
+ .. ************************************************************************
There are no limitations on what kind of task can exploit this new
scheduling discipline, even if it must be said that it is particularly
@@ -329,6 +334,7 @@
sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
d_j = r_j + D, where D is the task's relative deadline.
Summing up, a real-time task can be described as
+
Task = (WCET, D, P)
The utilization of a real-time task is defined as the ratio between its
@@ -352,13 +358,15 @@
between the finishing time of a job and its absolute deadline).
More precisely, it can be proven that using a global EDF scheduler the
maximum tardiness of each task is smaller or equal than
+
((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
+
where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
utilization[12].
3.2 Schedulability Analysis for Uniprocessor Systems
-------------------------
+----------------------------------------------------
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
real-time task is statically assigned to one and only one CPU), it is
@@ -370,7 +378,9 @@
a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
of all the tasks running on a CPU if the sum of the densities of the tasks
running on such a CPU is smaller or equal than 1:
+
sum(WCET_i / min{D_i, P_i}) <= 1
+
It is important to notice that this condition is only sufficient, and not
necessary: there are task sets that are schedulable, but do not respect the
condition. For example, consider the task set {Task_1,Task_2} composed by
@@ -379,7 +389,9 @@
(Task_1 is scheduled as soon as it is released, and finishes just in time
to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
its response time cannot be larger than 50ms + 10ms = 60ms) even if
+
50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+
Of course it is possible to test the exact schedulability of tasks with
D_i != P_i (checking a condition that is both sufficient and necessary),
but this cannot be done by comparing the total utilization or density with
@@ -399,7 +411,7 @@
4 Linux uses an admission test based on the tasks' utilizations.
3.3 Schedulability Analysis for Multiprocessor Systems
-------------------------
+------------------------------------------------------
On multiprocessor systems with global EDF scheduling (non partitioned
systems), a sufficient test for schedulability can not be based on the
@@ -428,7 +440,9 @@
between total utilization (or density) and a fixed constant. If all tasks
have D_i = P_i, a sufficient schedulability condition can be expressed in
a simple way:
+
sum(WCET_i / P_i) <= M - (M - 1) · U_max
+
where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
just confirms the Dhall's effect. A more complete survey of the literature
@@ -447,7 +461,7 @@
the tasks are limited.
3.4 Relationship with SCHED_DEADLINE Parameters
-------------------------
+-----------------------------------------------
Finally, it is important to understand the relationship between the
SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
@@ -473,6 +487,7 @@
this task, as it is not possible to respect its temporal constraints.
References:
+
1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
ming in a hard-real-time environment. Journal of the Association for
Computing Machinery, 20(1), 1973.
@@ -550,7 +565,7 @@
The interface used to control the CPU bandwidth that can be allocated
to -deadline tasks is similar to the one already used for -rt
tasks with real-time group scheduling (a.k.a. RT-throttling - see
- Documentation/scheduler/sched-rt-group.txt), and is based on readable/
+ Documentation/scheduler/sched-rt-group.rst), and is based on readable/
writable control files located in procfs (for system wide settings).
Notice that per-group settings (controlled through cgroupfs) are still not
defined for -deadline tasks, because more discussion is needed in order to
@@ -596,11 +611,13 @@
Specifying a periodic/sporadic task that executes for a given amount of
runtime at each instance, and that is scheduled according to the urgency of
its own timing constraints needs, in general, a way of declaring:
+
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Therefore:
+
* a new struct sched_attr, containing all the necessary fields is
provided;
* the new scheduling related syscalls that manipulate it, i.e.,
@@ -652,27 +669,27 @@
-deadline tasks cannot have an affinity mask smaller that the entire
root_domain they are created on. However, affinities can be specified
- through the cpuset facility (Documentation/cgroup-v1/cpusets.txt).
+ through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).
5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------
An example of a simple configuration (pin a -deadline task to CPU0)
- follows (rt-app is used to create a -deadline task).
+ follows (rt-app is used to create a -deadline task)::
- mkdir /dev/cpuset
- mount -t cgroup -o cpuset cpuset /dev/cpuset
- cd /dev/cpuset
- mkdir cpu0
- echo 0 > cpu0/cpuset.cpus
- echo 0 > cpu0/cpuset.mems
- echo 1 > cpuset.cpu_exclusive
- echo 0 > cpuset.sched_load_balance
- echo 1 > cpu0/cpuset.cpu_exclusive
- echo 1 > cpu0/cpuset.mem_exclusive
- echo $$ > cpu0/tasks
- rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
- task affinity)
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
+ # task affinity
6. Future plans
===============
@@ -711,7 +728,7 @@
rt-app is available at: https://github.com/scheduler-tools/rt-app.
Thread parameters can be specified from the command line, with something like
- this:
+ this::
# rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
@@ -721,27 +738,27 @@
of 5 seconds.
More interestingly, configurations can be described with a json file that
- can be passed as input to rt-app with something like this:
+ can be passed as input to rt-app with something like this::
# rt-app my_config.json
The parameters that can be specified with the second method are a superset
of the command line options. Please refer to rt-app documentation for more
- details (<rt-app-sources>/doc/*.json).
+ details (`<rt-app-sources>/doc/*.json`).
The second testing application is a modification of schedtool, called
schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a
certain pid/application. schedtool-dl is available at:
https://github.com/scheduler-tools/schedtool-dl.git.
- The usage is straightforward:
+ The usage is straightforward::
# schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
of 10ms every 100ms (note that parameters are expressed in microseconds).
You can also use schedtool to create a reservation for an already running
- application, given that you know its pid:
+ application, given that you know its pid::
# schedtool -E -t 10000000:100000000 my_app_pid
@@ -750,43 +767,43 @@
We provide in what follows a simple (ugly) self-contained code snippet
showing how SCHED_DEADLINE reservations can be created by a real-time
- application developer.
+ application developer::
- #define _GNU_SOURCE
- #include <unistd.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <string.h>
- #include <time.h>
- #include <linux/unistd.h>
- #include <linux/kernel.h>
- #include <linux/types.h>
- #include <sys/syscall.h>
- #include <pthread.h>
+ #define _GNU_SOURCE
+ #include <unistd.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+ #include <time.h>
+ #include <linux/unistd.h>
+ #include <linux/kernel.h>
+ #include <linux/types.h>
+ #include <sys/syscall.h>
+ #include <pthread.h>
- #define gettid() syscall(__NR_gettid)
+ #define gettid() syscall(__NR_gettid)
- #define SCHED_DEADLINE 6
+ #define SCHED_DEADLINE 6
- /* XXX use the proper syscall numbers */
- #ifdef __x86_64__
- #define __NR_sched_setattr 314
- #define __NR_sched_getattr 315
- #endif
+ /* XXX use the proper syscall numbers */
+ #ifdef __x86_64__
+ #define __NR_sched_setattr 314
+ #define __NR_sched_getattr 315
+ #endif
- #ifdef __i386__
- #define __NR_sched_setattr 351
- #define __NR_sched_getattr 352
- #endif
+ #ifdef __i386__
+ #define __NR_sched_setattr 351
+ #define __NR_sched_getattr 352
+ #endif
- #ifdef __arm__
- #define __NR_sched_setattr 380
- #define __NR_sched_getattr 381
- #endif
+ #ifdef __arm__
+ #define __NR_sched_setattr 380
+ #define __NR_sched_getattr 381
+ #endif
- static volatile int done;
+ static volatile int done;
- struct sched_attr {
+ struct sched_attr {
__u32 size;
__u32 sched_policy;
@@ -802,25 +819,25 @@
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
- };
+ };
- int sched_setattr(pid_t pid,
+ int sched_setattr(pid_t pid,
const struct sched_attr *attr,
unsigned int flags)
- {
+ {
return syscall(__NR_sched_setattr, pid, attr, flags);
- }
+ }
- int sched_getattr(pid_t pid,
+ int sched_getattr(pid_t pid,
struct sched_attr *attr,
unsigned int size,
unsigned int flags)
- {
+ {
return syscall(__NR_sched_getattr, pid, attr, size, flags);
- }
+ }
- void *run_deadline(void *data)
- {
+ void *run_deadline(void *data)
+ {
struct sched_attr attr;
int x = 0;
int ret;
@@ -851,10 +868,10 @@
printf("deadline thread dies [%ld]\n", gettid());
return NULL;
- }
+ }
- int main (int argc, char **argv)
- {
+ int main (int argc, char **argv)
+ {
pthread_t thread;
printf("main thread [%ld]\n", gettid());
@@ -868,4 +885,4 @@
printf("main dies [%ld]\n", gettid());
return 0;
- }
+ }
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.rst
similarity index 96%
rename from Documentation/scheduler/sched-design-CFS.txt
rename to Documentation/scheduler/sched-design-CFS.rst
index edd861c..a96c726 100644
--- a/Documentation/scheduler/sched-design-CFS.txt
+++ b/Documentation/scheduler/sched-design-CFS.rst
@@ -1,9 +1,10 @@
- =============
- CFS Scheduler
- =============
+=============
+CFS Scheduler
+=============
1. OVERVIEW
+============
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
@@ -27,6 +28,7 @@
2. FEW IMPLEMENTATION DETAILS
+==============================
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
@@ -49,6 +51,7 @@
3. THE RBTREE
+==============
CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
@@ -84,6 +87,7 @@
4. SOME FEATURES OF CFS
+========================
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
@@ -113,6 +117,7 @@
5. Scheduling policies
+======================
CFS implements three scheduling policies:
@@ -137,6 +142,7 @@
6. SCHEDULING CLASSES
+======================
The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
@@ -197,6 +203,7 @@
7. GROUP SCHEDULER EXTENSIONS TO CFS
+=====================================
Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
@@ -215,11 +222,11 @@
These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
- Documentation/cgroup-v1/cgroups.txt for more information about this filesystem.
+ Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
-task groups and modify their CPU share using the "cgroups" pseudo filesystem.
+task groups and modify their CPU share using the "cgroups" pseudo filesystem::
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.rst
similarity index 97%
rename from Documentation/scheduler/sched-domains.txt
rename to Documentation/scheduler/sched-domains.rst
index 4af80b1..f750422 100644
--- a/Documentation/scheduler/sched-domains.txt
+++ b/Documentation/scheduler/sched-domains.rst
@@ -1,3 +1,7 @@
+=================
+Scheduler Domains
+=================
+
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
@@ -46,7 +50,9 @@
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.
-*** Implementing sched domains ***
+Implementing sched domains
+==========================
+
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.
diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
new file mode 100644
index 0000000..9580c57
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.rst
@@ -0,0 +1,430 @@
+=======================
+Energy Aware Scheduling
+=======================
+
+1. Introduction
+---------------
+
+Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
+the impact of its decisions on the energy consumed by CPUs. EAS relies on an
+Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
+with a minimal impact on throughput. This document aims at providing an
+introduction on how EAS works, what are the main design decisions behind it, and
+details what is needed to get it to run.
+
+Before going any further, please note that at the time of writing::
+
+ /!\ EAS does not support platforms with symmetric CPU topologies /!\
+
+EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
+because this is where the potential for saving energy through scheduling is
+the highest.
+
+The actual EM used by EAS is _not_ maintained by the scheduler, but by a
+dedicated framework. For details about this framework and what it provides,
+please refer to its documentation (see Documentation/power/energy-model.rst).
+
+
+2. Background and Terminology
+-----------------------------
+
+To make it clear from the start:
+ - energy = [joule] (resource like a battery on powered devices)
+ - power = energy/time = [joule/second] = [watt]
+
+The goal of EAS is to minimize energy, while still getting the job done. That
+is, we want to maximize::
+
+ performance [inst/s]
+ --------------------
+ power [W]
+
+which is equivalent to minimizing::
+
+ energy [J]
+ -----------
+ instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance.
+
+The idea behind introducing an EM is to allow the scheduler to evaluate the
+implications of its decisions rather than blindly applying energy-saving
+techniques that may have positive effects only on some platforms. At the same
+time, the EM must be as simple as possible to minimize the scheduler latency
+impact.
+
+In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
+for the scheduler to decide where a task should run (during wake-up), the EM
+is used to break the tie between several good CPU candidates and pick the one
+that is predicted to yield the best energy consumption without harming the
+system's throughput. The predictions made by EAS rely on specific elements of
+knowledge about the platform's topology, which include the 'capacity' of CPUs,
+and their respective energy costs.
+
+
+3. Topology information
+-----------------------
+
+EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
+differentiate CPUs with different computing throughput. The 'capacity' of a CPU
+represents the amount of work it can absorb when running at its highest
+frequency compared to the most capable CPU of the system. Capacity values are
+normalized in a 1024 range, and are comparable with the utilization signals of
+tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
+to capacity and utilization values, EAS is able to estimate how big/busy a
+task/CPU is, and to take this into consideration when evaluating performance vs
+energy trade-offs. The capacity of CPUs is provided via arch-specific code
+through the arch_scale_cpu_capacity() callback.
+
+The rest of platform knowledge used by EAS is directly read from the Energy
+Model (EM) framework. The EM of a platform is composed of a power cost table
+per 'performance domain' in the system (see Documentation/power/energy-model.rst
+for futher details about performance domains).
+
+The scheduler manages references to the EM objects in the topology code when the
+scheduling domains are built, or re-built. For each root domain (rd), the
+scheduler maintains a singly linked list of all performance domains intersecting
+the current rd->span. Each node in the list contains a pointer to a struct
+em_perf_domain as provided by the EM framework.
+
+The lists are attached to the root domains in order to cope with exclusive
+cpuset configurations. Since the boundaries of exclusive cpusets do not
+necessarily match those of performance domains, the lists of different root
+domains can contain duplicate elements.
+
+Example 1.
+ Let us consider a platform with 12 CPUs, split in 3 performance domains
+ (pd0, pd4 and pd8), organized as follows::
+
+ CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
+ PDs: |--pd0--|--pd4--|---pd8---|
+ RDs: |----rd1----|-----rd2-----|
+
+ Now, consider that userspace decided to split the system with two
+ exclusive cpusets, hence creating two independent root domains, each
+ containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
+ above figure. Since pd4 intersects with both rd1 and rd2, it will be
+ present in the linked list '->pd' attached to each of them:
+
+ * rd1->pd: pd0 -> pd4
+ * rd2->pd: pd4 -> pd8
+
+ Please note that the scheduler will create two duplicate list nodes for
+ pd4 (one for each list). However, both just hold a pointer to the same
+ shared data structure of the EM framework.
+
+Since the access to these lists can happen concurrently with hotplug and other
+things, they are protected by RCU, like the rest of topology structures
+manipulated by the scheduler.
+
+EAS also maintains a static key (sched_energy_present) which is enabled when at
+least one root domain meets all conditions for EAS to start. Those conditions
+are summarized in Section 6.
+
+
+4. Energy-Aware task placement
+------------------------------
+
+EAS overrides the CFS task wake-up balancing code. It uses the EM of the
+platform and the PELT signals to choose an energy-efficient target CPU during
+wake-up balance. When EAS is enabled, select_task_rq_fair() calls
+find_energy_efficient_cpu() to do the placement decision. This function looks
+for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
+each performance domain since it is the one which will allow us to keep the
+frequency the lowest. Then, the function checks if placing the task there could
+save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
+in its previous activation.
+
+find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
+energy consumed by the system if the waking task was migrated. compute_energy()
+looks at the current utilization landscape of the CPUs and adjusts it to
+'simulate' the task migration. The EM framework provides the em_pd_energy() API
+which computes the expected energy consumption of each performance domain for
+the given utilization landscape.
+
+An example of energy-optimized task placement decision is detailed below.
+
+Example 2.
+ Let us consider a (fake) platform with 2 independent performance domains
+ composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
+ are big.
+
+ The scheduler must decide where to place a task P whose util_avg = 200
+ and prev_cpu = 0.
+
+ The current utilization landscape of the CPUs is depicted on the graph
+ below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively
+ Each performance domain has three Operating Performance Points (OPPs).
+ The CPU capacity and power cost associated with each OPP is listed in
+ the Energy Model table. The util_avg of P is shown on the figures
+ below as 'PP'::
+
+ CPU util.
+ 1024 - - - - - - - Energy Model
+ +-----------+-------------+
+ | Little | Big |
+ 768 ============= +-----+-----+------+------+
+ | Cap | Pwr | Cap | Pwr |
+ +-----+-----+------+------+
+ 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 |
+ ## ## | 341 | 150 | 768 | 800 |
+ 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 |
+ PP ## ## +-----+-----+------+------+
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+ Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
+
+
+ find_energy_efficient_cpu() will first look for the CPUs with the
+ maximum spare capacity in the two performance domains. In this example,
+ CPU1 and CPU3. Then it will estimate the energy of the system if P was
+ placed on either of them, and check if that would save some energy
+ compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
+ (which is coherent with the behaviour of the schedutil CPUFreq
+ governor, see Section 6. for more details on this topic).
+
+ **Case 1. P is migrated to CPU1**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 300 / 341 * 150 = 131
+ * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1364
+ 341 =========== ## ##
+ PP ## ##
+ 170 -## - - PP- ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ **Case 2. P is migrated to CPU3**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 200 / 341 * 150 = 88
+ * CPU1: 100 / 341 * 150 = 43
+ PP * CPU2: 600 / 768 * 800 = 625
+ 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
+ ## ## => total_energy = 1485
+ 341 =========== ## ##
+ ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ **Case 3. P stays on prev_cpu / CPU 0**::
+
+ 1024 - - - - - - -
+
+ Energy calculation:
+ 768 ============= * CPU0: 400 / 512 * 300 = 234
+ * CPU1: 100 / 512 * 300 = 58
+ * CPU2: 600 / 768 * 800 = 625
+ 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
+ ## ## => total_energy = 1437
+ 341 -PP - - - - ## ##
+ PP ## ##
+ 170 -## - - - - ## ##
+ ## ## ## ##
+ ------------ -------------
+ CPU0 CPU1 CPU2 CPU3
+
+
+ From these calculations, the Case 1 has the lowest total energy. So CPU 1
+ is be the best candidate from an energy-efficiency standpoint.
+
+Big CPUs are generally more power hungry than the little ones and are thus used
+mainly when a task doesn't fit the littles. However, little CPUs aren't always
+necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
+of the little CPUs can be less energy-efficient than the lowest OPPs of the
+bigs, for example. So, if the little CPUs happen to have enough utilization at
+a specific point in time, a small task waking up at that moment could be better
+of executing on the big side in order to save energy, even though it would fit
+on the little side.
+
+And even in the case where all OPPs of the big CPUs are less energy-efficient
+than those of the little, using the big CPUs for a small task might still, under
+specific conditions, save energy. Indeed, placing a task on a little CPU can
+result in raising the OPP of the entire performance domain, and that will
+increase the cost of the tasks already running there. If the waking task is
+placed on a big CPU, its own execution cost might be higher than if it was
+running on a little, but it won't impact the other tasks of the little CPUs
+which will keep running at a lower OPP. So, when considering the total energy
+consumed by CPUs, the extra cost of running that one task on a big core can be
+smaller than the cost of raising the OPP on the little CPUs for all the other
+tasks.
+
+The examples above would be nearly impossible to get right in a generic way, and
+for all platforms, without knowing the cost of running at different OPPs on all
+CPUs of the system. Thanks to its EM-based design, EAS should cope with them
+correctly without too many troubles. However, in order to ensure a minimal
+impact on throughput for high-utilization scenarios, EAS also implements another
+mechanism called 'over-utilization'.
+
+
+5. Over-utilization
+-------------------
+
+From a general standpoint, the use-cases where EAS can help the most are those
+involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
+being run, they will require all of the available CPU capacity, and there isn't
+much that can be done by the scheduler to save energy without severly harming
+throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
+'over-utilized' as soon as they are used at more than 80% of their compute
+capacity. As long as no CPUs are over-utilized in a root domain, load balancing
+is disabled and EAS overridess the wake-up balancing code. EAS is likely to load
+the most energy efficient CPUs of the system more than the others if that can be
+done without harming throughput. So, the load-balancer is disabled to prevent
+it from breaking the energy-efficient task placement found by EAS. It is safe to
+do so when the system isn't overutilized since being below the 80% tipping point
+implies that:
+
+ a. there is some idle time on all CPUs, so the utilization signals used by
+ EAS are likely to accurately represent the 'size' of the various tasks
+ in the system;
+ b. all tasks should already be provided with enough CPU capacity,
+ regardless of their nice values;
+ c. since there is spare capacity all tasks must be blocking/sleeping
+ regularly and balancing at wake-up is sufficient.
+
+As soon as one CPU goes above the 80% tipping point, at least one of the three
+assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
+is raised for the entire root domain, EAS is disabled, and the load-balancer is
+re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
+wake-up and load balance under CPU-bound conditions. This provides a better
+respect of the nice values of tasks.
+
+Since the notion of overutilization largely relies on detecting whether or not
+there is some idle time in the system, the CPU capacity 'stolen' by higher
+(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
+such, the detection of overutilization accounts for the capacity used not only
+by CFS tasks, but also by the other scheduling classes and IRQ.
+
+
+6. Dependencies and requirements for EAS
+----------------------------------------
+
+Energy Aware Scheduling depends on the CPUs of the system having specific
+hardware properties and on other features of the kernel being enabled. This
+section lists these dependencies and provides hints as to how they can be met.
+
+
+6.1 - Asymmetric CPU topology
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+As mentioned in the introduction, EAS is only supported on platforms with
+asymmetric CPU topologies for now. This requirement is checked at run-time by
+looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
+domains are built.
+
+The flag is set/cleared automatically by the scheduler topology code whenever
+there are CPUs with different capacities in a root domain. The capacities of
+CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
+callback. As an example, arm and arm64 share an implementation of this callback
+which uses a combination of CPUFreq data and device-tree bindings to compute the
+capacity of CPUs (see drivers/base/arch_topology.c for more details).
+
+So, in order to use EAS on your platform your architecture must implement the
+arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower
+capacity than others.
+
+Please note that EAS is not fundamentally incompatible with SMP, but no
+significant savings on SMP platforms have been observed yet. This restriction
+could be amended in the future if proven otherwise.
+
+
+6.2 - Energy Model presence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS uses the EM of a platform to estimate the impact of scheduling decisions on
+energy. So, your platform must provide power cost tables to the EM framework in
+order to make EAS start. To do so, please refer to documentation of the
+independent EM framework in Documentation/power/energy-model.rst.
+
+Please also note that the scheduling domains need to be re-built after the
+EM has been registered in order to start EAS.
+
+
+6.3 - Energy Model complexity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The task wake-up path is very latency-sensitive. When the EM of a platform is
+too complex (too many CPUs, too many performance domains, too many performance
+states, ...), the cost of using it in the wake-up path can become prohibitive.
+The energy-aware wake-up algorithm has a complexity of:
+
+ C = Nd * (Nc + Ns)
+
+with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
+total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
+
+A complexity check is performed at the root domain level, when scheduling
+domains are built. EAS will not start on a root domain if its C happens to be
+higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
+time of writing).
+
+If you really want to use EAS but the complexity of your platform's Energy
+Model is too high to be used with a single root domain, you're left with only
+two possible options:
+
+ 1. split your system into separate, smaller, root domains using exclusive
+ cpusets and enable EAS locally on each of them. This option has the
+ benefit to work out of the box but the drawback of preventing load
+ balance between root domains, which can result in an unbalanced system
+ overall;
+ 2. submit patches to reduce the complexity of the EAS wake-up algorithm,
+ hence enabling it to cope with larger EMs in reasonable time.
+
+
+6.4 - Schedutil governor
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS tries to predict at which OPP will the CPUs be running in the close future
+in order to estimate their energy consumption. To do so, it is assumed that OPPs
+of CPUs follow their utilization.
+
+Although it is very difficult to provide hard guarantees regarding the accuracy
+of this assumption in practice (because the hardware might not do what it is
+told to do, for example), schedutil as opposed to other CPUFreq governors at
+least _requests_ frequencies calculated using the utilization signals.
+Consequently, the only sane governor to use together with EAS is schedutil,
+because it is the only one providing some degree of consistency between
+frequency requests and energy predictions.
+
+Using EAS with any other governor than schedutil is not supported.
+
+
+6.5 Scale-invariant utilization signals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to make accurate prediction across CPUs and for all performance
+states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
+be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
+callbacks.
+
+Using EAS on a platform that doesn't implement these two callbacks is not
+supported.
+
+
+6.6 Multithreading (SMT)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS in its current form is SMT unaware and is not able to leverage
+multithreaded hardware to save energy. EAS considers threads as independent
+CPUs, which can actually be counter-productive for both performance and energy.
+
+EAS on SMT is not supported.
diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.rst
similarity index 98%
rename from Documentation/scheduler/sched-nice-design.txt
rename to Documentation/scheduler/sched-nice-design.rst
index 3ac1e46..0571f1b 100644
--- a/Documentation/scheduler/sched-nice-design.txt
+++ b/Documentation/scheduler/sched-nice-design.rst
@@ -1,3 +1,7 @@
+=====================
+Scheduler Nice Design
+=====================
+
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.
@@ -14,7 +18,7 @@
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
-alert!):
+alert!)::
A
diff --git a/Documentation/scheduler/sched-pelt.c b/Documentation/scheduler/sched-pelt.c
index e421913..7238b35 100644
--- a/Documentation/scheduler/sched-pelt.c
+++ b/Documentation/scheduler/sched-pelt.c
@@ -20,7 +20,8 @@
int i;
unsigned int x;
- printf("static const u32 runnable_avg_yN_inv[] = {");
+ /* To silence -Wunused-but-set-variable warnings. */
+ printf("static const u32 runnable_avg_yN_inv[] __maybe_unused = {");
for (i = 0; i < HALFLIFE; i++) {
x = ((1UL<<32)-1)*pow(y, i);
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.rst
similarity index 94%
rename from Documentation/scheduler/sched-rt-group.txt
rename to Documentation/scheduler/sched-rt-group.rst
index d8fce3e..655a096 100644
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.rst
@@ -1,18 +1,18 @@
- Real-Time group scheduling
- --------------------------
+==========================
+Real-Time group scheduling
+==========================
-CONTENTS
-========
+.. CONTENTS
-0. WARNING
-1. Overview
- 1.1 The problem
- 1.2 The solution
-2. The interface
- 2.1 System-wide settings
- 2.2 Default behaviour
- 2.3 Basis for grouping tasks
-3. Future plans
+ 0. WARNING
+ 1. Overview
+ 1.1 The problem
+ 1.2 The solution
+ 2. The interface
+ 2.1 System-wide settings
+ 2.2 Default behaviour
+ 2.3 Basis for grouping tasks
+ 3. Future plans
0. WARNING
@@ -133,7 +133,7 @@
to control the CPU time reserved for each control group.
For more information on working with control groups, you should read
-Documentation/cgroup-v1/cgroups.txt as well.
+Documentation/admin-guide/cgroup-v1/cgroups.rst as well.
Group settings are checked against the following limits in order to keep the
configuration schedulable:
@@ -159,9 +159,11 @@
period is twice the length of B's.
* group A: period=100000us, runtime=50000us
+
- this runs for 0.05s once every 0.1s
* group B: period= 50000us, runtime=25000us
+
- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
This means that currently a while (1) loop in A will run for the full period of
diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.rst
similarity index 90%
rename from Documentation/scheduler/sched-stats.txt
rename to Documentation/scheduler/sched-stats.rst
index 8259b34..0cb0aa7 100644
--- a/Documentation/scheduler/sched-stats.txt
+++ b/Documentation/scheduler/sched-stats.rst
@@ -1,3 +1,7 @@
+====================
+Scheduler Statistics
+====================
+
Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
@@ -35,19 +39,23 @@
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
+
1) # of times sched_yield() was called
Next three are schedule() statistics:
+
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
+
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
+
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
@@ -67,24 +75,23 @@
The next 24 are a variety of load_balance() statistics in grouped into types
of idleness (idle, busy, and newly idle):
- 1) # of times in this domain load_balance() was called when the
+ 1) # of times in this domain load_balance() was called when the
cpu was idle
- 2) # of times in this domain load_balance() checked but found
+ 2) # of times in this domain load_balance() checked but found
the load did not require balancing when the cpu was idle
- 3) # of times in this domain load_balance() tried to move one or
+ 3) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was idle
- 4) sum of imbalances discovered (if any) with each call to
+ 4) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
- 5) # of times in this domain pull_task() was called when the cpu
+ 5) # of times in this domain pull_task() was called when the cpu
was idle
- 6) # of times in this domain pull_task() was called even though
+ 6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
- 7) # of times in this domain load_balance() was called but did
+ 7) # of times in this domain load_balance() was called but did
not find a busier queue while the cpu was idle
- 8) # of times in this domain a busier queue was found while the
+ 8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
-
- 9) # of times in this domain load_balance() was called when the
+ 9) # of times in this domain load_balance() was called when the
cpu was busy
10) # of times in this domain load_balance() checked but found the
load did not require balancing when busy
@@ -117,21 +124,25 @@
was just becoming idle but no busier group was found
Next three are active_load_balance() statistics:
+
25) # of times active_load_balance() was called
26) # of times active_load_balance() tried to move a task and failed
27) # of times active_load_balance() successfully moved a task
Next three are sched_balance_exec() statistics:
+
28) sbe_cnt is not used
29) sbe_balanced is not used
30) sbe_pushed is not used
Next three are sched_balance_fork() statistics:
+
31) sbf_cnt is not used
32) sbf_balanced is not used
33) sbf_pushed is not used
Next three are try_to_wake_up() statistics:
+
34) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
35) # of times in this domain try_to_wake_up() moved a task to the
@@ -139,10 +150,11 @@
36) # of times in this domain try_to_wake_up() started passive balancing
/proc/<pid>/schedstat
-----------------
+---------------------
schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file correlating for that process to:
+
1) time spent on the cpu
2) time spent waiting on a runqueue
3) # of timeslices run on this cpu
@@ -151,4 +163,5 @@
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at
+
http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
diff --git a/Documentation/scheduler/text_files.rst b/Documentation/scheduler/text_files.rst
new file mode 100644
index 0000000..0bc5030
--- /dev/null
+++ b/Documentation/scheduler/text_files.rst
@@ -0,0 +1,5 @@
+Scheduler pelt c program
+------------------------
+
+.. literalinclude:: sched-pelt.c
+ :language: c