Update Linux to v5.4.2 Change-Id: Idf6911045d9d382da2cfe01b1edff026404ac8fd

commit: 0f672f6c0b52b7b0700b0915c72b540721af4465 [log] [tgz]
author: David Brazdil <dbrazdil@google.com> Tue Dec 10 10:32:29 2019 +0000
committer: David Brazdil <dbrazdil@google.com> Tue Dec 10 19:03:18 2019 +0000
tree: 85c8cba019caa205e4f8920d72d93f6d6deaf29c
parent: 3a0ad55d848b50499b68d7141d4eca997fce28ef [diff]
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
deleted file mode 100644
index eccf7ad..0000000
--- a/Documentation/scheduler/00-INDEX
+++ /dev/null

@@ -1,18 +0,0 @@
-00-INDEX
-	- this file.
-sched-arch.txt
-	- CPU Scheduler implementation hints for architecture specific code.
-sched-bwc.txt
-	- CFS bandwidth control overview.
-sched-design-CFS.txt
-	- goals, design and implementation of the Completely Fair Scheduler.
-sched-domains.txt
-	- information on scheduling domains.
-sched-nice-design.txt
-	- How and why the scheduler's nice levels are implemented.
-sched-rt-group.txt
-	- real-time group scheduling.
-sched-deadline.txt
-	- deadline scheduling.
-sched-stats.txt
-	- information on schedstats (Linux Scheduler Statistics).

diff --git a/Documentation/scheduler/completion.rst b/Documentation/scheduler/completion.rst
new file mode 100644
index 0000000..9f039b4
--- /dev/null
+++ b/Documentation/scheduler/completion.rst

@@ -0,0 +1,293 @@
+================================================
+Completions - "wait for completion" barrier APIs
+================================================
+
+Introduction:
+-------------
+
+If you have one or more threads that must wait for some kernel activity
+to have reached a point or a specific state, completions can provide a
+race-free solution to this problem. Semantically they are somewhat like a
+pthread_barrier() and have similar use-cases.
+
+Completions are a code synchronization mechanism which is preferable to any
+misuse of locks/semaphores and busy-loops. Any time you think of using
+yield() or some quirky msleep(1) loop to allow something else to proceed,
+you probably want to look into using one of the wait_for_completion*()
+calls and complete() instead.
+
+The advantage of using completions is that they have a well defined, focused
+purpose which makes it very easy to see the intent of the code, but they
+also result in more efficient code as all threads can continue execution
+until the result is actually needed, and both the waiting and the signalling
+is highly efficient using low level scheduler sleep/wakeup facilities.
+
+Completions are built on top of the waitqueue and wakeup infrastructure of
+the Linux scheduler. The event the threads on the waitqueue are waiting for
+is reduced to a simple flag in 'struct completion', appropriately called "done".
+
+As completions are scheduling related, the code can be found in
+kernel/sched/completion.c.
+
+
+Usage:
+------
+
+There are three main parts to using completions:
+
+ - the initialization of the 'struct completion' synchronization object
+ - the waiting part through a call to one of the variants of wait_for_completion(),
+ - the signaling side through a call to complete() or complete_all().
+
+There are also some helper functions for checking the state of completions.
+Note that while initialization must happen first, the waiting and signaling
+part can happen in any order. I.e. it's entirely normal for a thread
+to have marked a completion as 'done' before another thread checks whether
+it has to wait for it.
+
+To use completions you need to #include <linux/completion.h> and
+create a static or dynamic variable of type 'struct completion',
+which has only two fields::
+
+	struct completion {
+		unsigned int done;
+		wait_queue_head_t wait;
+	};
+
+This provides the ->wait waitqueue to place tasks on for waiting (if any), and
+the ->done completion flag for indicating whether it's completed or not.
+
+Completions should be named to refer to the event that is being synchronized on.
+A good example is::
+
+	wait_for_completion(&early_console_added);
+
+	complete(&early_console_added);
+
+Good, intuitive naming (as always) helps code readability. Naming a completion
+'complete' is not helpful unless the purpose is super obvious...
+
+
+Initializing completions:
+-------------------------
+
+Dynamically allocated completion objects should preferably be embedded in data
+structures that are assured to be alive for the life-time of the function/driver,
+to prevent races with asynchronous complete() calls from occurring.
+
+Particular care should be taken when using the _timeout() or _killable()/_interruptible()
+variants of wait_for_completion(), as it must be assured that memory de-allocation
+does not happen until all related activities (complete() or reinit_completion())
+have taken place, even if these wait functions return prematurely due to a timeout
+or a signal triggering.
+
+Initializing of dynamically allocated completion objects is done via a call to
+init_completion()::
+
+	init_completion(&dynamic_object->done);
+
+In this call we initialize the waitqueue and set ->done to 0, i.e. "not completed"
+or "not done".
+
+The re-initialization function, reinit_completion(), simply resets the
+->done field to 0 ("not done"), without touching the waitqueue.
+Callers of this function must make sure that there are no racy
+wait_for_completion() calls going on in parallel.
+
+Calling init_completion() on the same completion object twice is
+most likely a bug as it re-initializes the queue to an empty queue and
+enqueued tasks could get "lost" - use reinit_completion() in that case,
+but be aware of other races.
+
+For static declaration and initialization, macros are available.
+
+For static (or global) declarations in file scope you can use
+DECLARE_COMPLETION()::
+
+	static DECLARE_COMPLETION(setup_done);
+	DECLARE_COMPLETION(setup_done);
+
+Note that in this case the completion is boot time (or module load time)
+initialized to 'not done' and doesn't require an init_completion() call.
+
+When a completion is declared as a local variable within a function,
+then the initialization should always use DECLARE_COMPLETION_ONSTACK()
+explicitly, not just to make lockdep happy, but also to make it clear
+that limited scope had been considered and is intentional::
+
+	DECLARE_COMPLETION_ONSTACK(setup_done)
+
+Note that when using completion objects as local variables you must be
+acutely aware of the short life time of the function stack: the function
+must not return to a calling context until all activities (such as waiting
+threads) have ceased and the completion object is completely unused.
+
+To emphasise this again: in particular when using some of the waiting API variants
+with more complex outcomes, such as the timeout or signalling (_timeout(),
+_killable() and _interruptible()) variants, the wait might complete
+prematurely while the object might still be in use by another thread - and a return
+from the wait_on_completion*() caller function will deallocate the function
+stack and cause subtle data corruption if a complete() is done in some
+other thread. Simple testing might not trigger these kinds of races.
+
+If unsure, use dynamically allocated completion objects, preferably embedded
+in some other long lived object that has a boringly long life time which
+exceeds the life time of any helper threads using the completion object,
+or has a lock or other synchronization mechanism to make sure complete()
+is not called on a freed object.
+
+A naive DECLARE_COMPLETION() on the stack triggers a lockdep warning.
+
+Waiting for completions:
+------------------------
+
+For a thread to wait for some concurrent activity to finish, it
+calls wait_for_completion() on the initialized completion structure::
+
+	void wait_for_completion(struct completion *done)
+
+A typical usage scenario is::
+
+	CPU#1					CPU#2
+
+	struct completion setup_done;
+
+	init_completion(&setup_done);
+	initialize_work(...,&setup_done,...);
+
+	/* run non-dependent code */		/* do setup */
+
+	wait_for_completion(&setup_done);	complete(setup_done);
+
+This is not implying any particular order between wait_for_completion() and
+the call to complete() - if the call to complete() happened before the call
+to wait_for_completion() then the waiting side simply will continue
+immediately as all dependencies are satisfied; if not, it will block until
+completion is signaled by complete().
+
+Note that wait_for_completion() is calling spin_lock_irq()/spin_unlock_irq(),
+so it can only be called safely when you know that interrupts are enabled.
+Calling it from IRQs-off atomic contexts will result in hard-to-detect
+spurious enabling of interrupts.
+
+The default behavior is to wait without a timeout and to mark the task as
+uninterruptible. wait_for_completion() and its variants are only safe
+in process context (as they can sleep) but not in atomic context,
+interrupt context, with disabled IRQs, or preemption is disabled - see also
+try_wait_for_completion() below for handling completion in atomic/interrupt
+context.
+
+As all variants of wait_for_completion() can (obviously) block for a long
+time depending on the nature of the activity they are waiting for, so in
+most cases you probably don't want to call this with held mutexes.
+
+
+wait_for_completion*() variants available:
+------------------------------------------
+
+The below variants all return status and this status should be checked in
+most(/all) cases - in cases where the status is deliberately not checked you
+probably want to make a note explaining this (e.g. see
+arch/arm/kernel/smp.c:__cpu_up()).
+
+A common problem that occurs is to have unclean assignment of return types,
+so take care to assign return-values to variables of the proper type.
+
+Checking for the specific meaning of return values also has been found
+to be quite inaccurate, e.g. constructs like::
+
+	if (!wait_for_completion_interruptible_timeout(...))
+
+... would execute the same code path for successful completion and for the
+interrupted case - which is probably not what you want::
+
+	int wait_for_completion_interruptible(struct completion *done)
+
+This function marks the task TASK_INTERRUPTIBLE while it is waiting.
+If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise::
+
+	unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
+
+The task is marked as TASK_UNINTERRUPTIBLE and will wait at most 'timeout'
+jiffies. If a timeout occurs it returns 0, else the remaining time in
+jiffies (but at least 1).
+
+Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies(),
+to make the code largely HZ-invariant.
+
+If the returned timeout value is deliberately ignored a comment should probably explain
+why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc())::
+
+	long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
+
+This function passes a timeout in jiffies and marks the task as
+TASK_INTERRUPTIBLE. If a signal was received it will return -ERESTARTSYS;
+otherwise it returns 0 if the completion timed out, or the remaining time in
+jiffies if completion occurred.
+
+Further variants include _killable which uses TASK_KILLABLE as the
+designated tasks state and will return -ERESTARTSYS if it is interrupted,
+or 0 if completion was achieved.  There is a _timeout variant as well::
+
+	long wait_for_completion_killable(struct completion *done)
+	long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
+
+The _io variants wait_for_completion_io() behave the same as the non-_io
+variants, except for accounting waiting time as 'waiting on IO', which has
+an impact on how the task is accounted in scheduling/IO stats::
+
+	void wait_for_completion_io(struct completion *done)
+	unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
+
+
+Signaling completions:
+----------------------
+
+A thread that wants to signal that the conditions for continuation have been
+achieved calls complete() to signal exactly one of the waiters that it can
+continue::
+
+	void complete(struct completion *done)
+
+... or calls complete_all() to signal all current and future waiters::
+
+	void complete_all(struct completion *done)
+
+The signaling will work as expected even if completions are signaled before
+a thread starts waiting. This is achieved by the waiter "consuming"
+(decrementing) the done field of 'struct completion'. Waiting threads
+wakeup order is the same in which they were enqueued (FIFO order).
+
+If complete() is called multiple times then this will allow for that number
+of waiters to continue - each call to complete() will simply increment the
+done field. Calling complete_all() multiple times is a bug though. Both
+complete() and complete_all() can be called in IRQ/atomic context safely.
+
+There can only be one thread calling complete() or complete_all() on a
+particular 'struct completion' at any time - serialized through the wait
+queue spinlock. Any such concurrent calls to complete() or complete_all()
+probably are a design bug.
+
+Signaling completion from IRQ context is fine as it will appropriately
+lock with spin_lock_irqsave()/spin_unlock_irqrestore() and it will never
+sleep.
+
+
+try_wait_for_completion()/completion_done():
+--------------------------------------------
+
+The try_wait_for_completion() function will not put the thread on the wait
+queue but rather returns false if it would need to enqueue (block) the thread,
+else it consumes one posted completion and returns true::
+
+	bool try_wait_for_completion(struct completion *done)
+
+Finally, to check the state of a completion without changing it in any way,
+call completion_done(), which returns false if there are no posted
+completions that were not yet consumed by waiters (implying that there are
+waiters) and true otherwise::
+
+	bool completion_done(struct completion *done)
+
+Both try_wait_for_completion() and completion_done() are safe to be called in
+IRQ or atomic context.

diff --git a/Documentation/scheduler/completion.txt b/Documentation/scheduler/completion.txt
deleted file mode 100644
index 656cf80..0000000
--- a/Documentation/scheduler/completion.txt
+++ /dev/null

@@ -1,247 +0,0 @@
-completions - wait for completion handling
-==========================================
-
-This document was originally written based on 3.18.0 (linux-next)
-
-Introduction:
--------------
-
-If you have one or more threads of execution that must wait for some process
-to have reached a point or a specific state, completions can provide a
-race-free solution to this problem. Semantically they are somewhat like a
-pthread_barrier and have similar use-cases.
-
-Completions are a code synchronization mechanism which is preferable to any
-misuse of locks. Any time you think of using yield() or some quirky
-msleep(1) loop to allow something else to proceed, you probably want to
-look into using one of the wait_for_completion*() calls instead. The
-advantage of using completions is clear intent of the code, but also more
-efficient code as both threads can continue until the result is actually
-needed.
-
-Completions are built on top of the generic event infrastructure in Linux,
-with the event reduced to a simple flag (appropriately called "done") in
-struct completion that tells the waiting threads of execution if they
-can continue safely.
-
-As completions are scheduling related, the code is found in
-kernel/sched/completion.c.
-
-
-Usage:
-------
-
-There are three parts to using completions, the initialization of the
-struct completion, the waiting part through a call to one of the variants of
-wait_for_completion() and the signaling side through a call to complete()
-or complete_all(). Further there are some helper functions for checking the
-state of completions.
-
-To use completions one needs to include <linux/completion.h> and
-create a variable of type struct completion. The structure used for
-handling of completions is:
-
-	struct completion {
-		unsigned int done;
-		wait_queue_head_t wait;
-	};
-
-providing the wait queue to place tasks on for waiting and the flag for
-indicating the state of affairs.
-
-Completions should be named to convey the intent of the waiter. A good
-example is:
-
-	wait_for_completion(&early_console_added);
-
-	complete(&early_console_added);
-
-Good naming (as always) helps code readability.
-
-
-Initializing completions:
--------------------------
-
-Initialization of dynamically allocated completions, often embedded in
-other structures, is done with:
-
-	void init_completion(&done);
-
-Initialization is accomplished by initializing the wait queue and setting
-the default state to "not available", that is, "done" is set to 0.
-
-The re-initialization function, reinit_completion(), simply resets the
-done element to "not available", thus again to 0, without touching the
-wait queue. Calling init_completion() twice on the same completion object is
-most likely a bug as it re-initializes the queue to an empty queue and
-enqueued tasks could get "lost" - use reinit_completion() in that case.
-
-For static declaration and initialization, macros are available. These are:
-
-	static DECLARE_COMPLETION(setup_done)
-
-used for static declarations in file scope. Within functions the static
-initialization should always use:
-
-	DECLARE_COMPLETION_ONSTACK(setup_done)
-
-suitable for automatic/local variables on the stack and will make lockdep
-happy. Note also that one needs to make *sure* the completion passed to
-work threads remains in-scope, and no references remain to on-stack data
-when the initiating function returns.
-
-Using on-stack completions for code that calls any of the _timeout or
-_interruptible/_killable variants is not advisable as they will require
-additional synchronization to prevent the on-stack completion object in
-the timeout/signal cases from going out of scope. Consider using dynamically
-allocated completions when intending to use the _interruptible/_killable
-or _timeout variants of wait_for_completion().
-
-
-Waiting for completions:
-------------------------
-
-For a thread of execution to wait for some concurrent work to finish, it
-calls wait_for_completion() on the initialized completion structure.
-A typical usage scenario is:
-
-	struct completion setup_done;
-	init_completion(&setup_done);
-	initialize_work(...,&setup_done,...)
-
-	/* run non-dependent code */              /* do setup */
-
-	wait_for_completion(&setup_done);         complete(setup_done)
-
-This is not implying any temporal order on wait_for_completion() and the
-call to complete() - if the call to complete() happened before the call
-to wait_for_completion() then the waiting side simply will continue
-immediately as all dependencies are satisfied if not it will block until
-completion is signaled by complete().
-
-Note that wait_for_completion() is calling spin_lock_irq()/spin_unlock_irq(),
-so it can only be called safely when you know that interrupts are enabled.
-Calling it from hard-irq or irqs-off atomic contexts will result in
-hard-to-detect spurious enabling of interrupts.
-
-wait_for_completion():
-
-	void wait_for_completion(struct completion *done):
-
-The default behavior is to wait without a timeout and to mark the task as
-uninterruptible. wait_for_completion() and its variants are only safe
-in process context (as they can sleep) but not in atomic context,
-interrupt context, with disabled irqs. or preemption is disabled - see also
-try_wait_for_completion() below for handling completion in atomic/interrupt
-context.
-
-As all variants of wait_for_completion() can (obviously) block for a long
-time, you probably don't want to call this with held mutexes.
-
-
-Variants available:
--------------------
-
-The below variants all return status and this status should be checked in
-most(/all) cases - in cases where the status is deliberately not checked you
-probably want to make a note explaining this (e.g. see
-arch/arm/kernel/smp.c:__cpu_up()).
-
-A common problem that occurs is to have unclean assignment of return types,
-so care should be taken with assigning return-values to variables of proper
-type. Checking for the specific meaning of return values also has been found
-to be quite inaccurate e.g. constructs like
-if (!wait_for_completion_interruptible_timeout(...)) would execute the same
-code path for successful completion and for the interrupted case - which is
-probably not what you want.
-
-	int wait_for_completion_interruptible(struct completion *done)
-
-This function marks the task TASK_INTERRUPTIBLE. If a signal was received
-while waiting it will return -ERESTARTSYS; 0 otherwise.
-
-	unsigned long wait_for_completion_timeout(struct completion *done,
-		unsigned long timeout)
-
-The task is marked as TASK_UNINTERRUPTIBLE and will wait at most 'timeout'
-(in jiffies). If timeout occurs it returns 0 else the remaining time in
-jiffies (but at least 1). Timeouts are preferably calculated with
-msecs_to_jiffies() or usecs_to_jiffies(). If the returned timeout value is
-deliberately ignored a comment should probably explain why (e.g. see
-drivers/mfd/wm8350-core.c wm8350_read_auxadc())
-
-	long wait_for_completion_interruptible_timeout(
-		struct completion *done, unsigned long timeout)
-
-This function passes a timeout in jiffies and marks the task as
-TASK_INTERRUPTIBLE. If a signal was received it will return -ERESTARTSYS;
-otherwise it returns 0 if the completion timed out or the remaining time in
-jiffies if completion occurred.
-
-Further variants include _killable which uses TASK_KILLABLE as the
-designated tasks state and will return -ERESTARTSYS if it is interrupted or
-else 0 if completion was achieved.  There is a _timeout variant as well:
-
-	long wait_for_completion_killable(struct completion *done)
-	long wait_for_completion_killable_timeout(struct completion *done,
-		unsigned long timeout)
-
-The _io variants wait_for_completion_io() behave the same as the non-_io
-variants, except for accounting waiting time as waiting on IO, which has
-an impact on how the task is accounted in scheduling stats.
-
-	void wait_for_completion_io(struct completion *done)
-	unsigned long wait_for_completion_io_timeout(struct completion *done
-		unsigned long timeout)
-
-
-Signaling completions:
-----------------------
-
-A thread that wants to signal that the conditions for continuation have been
-achieved calls complete() to signal exactly one of the waiters that it can
-continue.
-
-	void complete(struct completion *done)
-
-or calls complete_all() to signal all current and future waiters.
-
-	void complete_all(struct completion *done)
-
-The signaling will work as expected even if completions are signaled before
-a thread starts waiting. This is achieved by the waiter "consuming"
-(decrementing) the done element of struct completion. Waiting threads
-wakeup order is the same in which they were enqueued (FIFO order).
-
-If complete() is called multiple times then this will allow for that number
-of waiters to continue - each call to complete() will simply increment the
-done element. Calling complete_all() multiple times is a bug though. Both
-complete() and complete_all() can be called in hard-irq/atomic context safely.
-
-There only can be one thread calling complete() or complete_all() on a
-particular struct completion at any time - serialized through the wait
-queue spinlock. Any such concurrent calls to complete() or complete_all()
-probably are a design bug.
-
-Signaling completion from hard-irq context is fine as it will appropriately
-lock with spin_lock_irqsave/spin_unlock_irqrestore and it will never sleep.
-
-
-try_wait_for_completion()/completion_done():
---------------------------------------------
-
-The try_wait_for_completion() function will not put the thread on the wait
-queue but rather returns false if it would need to enqueue (block) the thread,
-else it consumes one posted completion and returns true.
-
-	bool try_wait_for_completion(struct completion *done)
-
-Finally, to check the state of a completion without changing it in any way, 
-call completion_done(), which returns false if there are no posted
-completions that were not yet consumed by waiters (implying that there are
-waiters) and true otherwise;
-
-	bool completion_done(struct completion *done)
-
-Both try_wait_for_completion() and completion_done() are safe to be called in
-hard-irq or atomic context.

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
new file mode 100644
index 0000000..69074e5
--- /dev/null
+++ b/Documentation/scheduler/index.rst

@@ -0,0 +1,27 @@
+===============
+Linux Scheduler
+===============
+
+.. toctree::
+    :maxdepth: 1
+
+
+    completion
+    sched-arch
+    sched-bwc
+    sched-deadline
+    sched-design-CFS
+    sched-domains
+    sched-energy
+    sched-nice-design
+    sched-rt-group
+    sched-stats
+
+    text_files
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`

diff --git a/Documentation/scheduler/sched-arch.txt b/Documentation/scheduler/sched-arch.rst
similarity index 81%
rename from Documentation/scheduler/sched-arch.txt
rename to Documentation/scheduler/sched-arch.rst
index a2f27bb..0eaec66 100644
--- a/Documentation/scheduler/sched-arch.txt
+++ b/Documentation/scheduler/sched-arch.rst

@@ -1,4 +1,6 @@
-	CPU Scheduler implementation hints for architecture specific code
+=================================================================
+CPU Scheduler implementation hints for architecture specific code
+=================================================================
 
 	Nick Piggin, 2005
 
@@ -35,9 +37,10 @@
 4. The only time interrupts need to be disabled when checking
    need_resched is if we are about to sleep the processor until
    the next interrupt (this doesn't provide any protection of
-   need_resched, it prevents losing an interrupt).
+   need_resched, it prevents losing an interrupt):
 
-	4a. Common problem with this type of sleep appears to be:
+	4a. Common problem with this type of sleep appears to be::
+
 	        local_irq_disable();
 	        if (!need_resched()) {
 	                local_irq_enable();
@@ -51,10 +54,10 @@
    although it may be reasonable to do some background work or enter
    a low CPU priority.
 
-   	5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
-	    an interrupt sleep, it needs to be cleared then a memory
-	    barrier issued (followed by a test of need_resched with
-	    interrupts disabled, as explained in 3).
+      - 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
+	an interrupt sleep, it needs to be cleared then a memory
+	barrier issued (followed by a test of need_resched with
+	interrupts disabled, as explained in 3).
 
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
@@ -71,4 +74,3 @@
 
 sparc - IRQs on at this point(?), change local_irq_save to _disable.
       - TODO: needs secondary CPUs to disable preempt (See #1)
-

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
new file mode 100644
index 0000000..9801d6b
--- /dev/null
+++ b/Documentation/scheduler/sched-bwc.rst

@@ -0,0 +1,174 @@
+=====================
+CFS Bandwidth Control
+=====================
+
+[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
+  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]
+
+CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
+specification of the maximum CPU bandwidth available to a group or hierarchy.
+
+The bandwidth allowed for a group is specified using a quota and period. Within
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
+within each of these updates is tunable and described as the "slice".
+
+Management
+----------
+Quota and period are managed within the cpu subsystem via cgroupfs.
+
+cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.stat: exports throttling statistics [explained further below]
+
+The default values are::
+
+	cpu.cfs_period_us=100ms
+	cpu.cfs_quota=-1
+
+A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
+bandwidth restriction in place, such a group is described as an unconstrained
+bandwidth group. This represents the traditional work-conserving behavior for
+CFS.
+
+Writing any (valid) positive value(s) will enact the specified bandwidth limit.
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
+bandwidth limits are used in a hierarchical fashion, these are explained in
+more detail below.
+
+Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
+and return the group to an unconstrained state once more.
+
+Any updates to a group's bandwidth specification will result in it becoming
+unthrottled if it is in a constrained state.
+
+System wide settings
+--------------------
+For efficiency run-time is transferred between the global pool and CPU local
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
+is described as the "slice".
+
+This is tunable via procfs::
+
+	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
+
+Larger slice values will reduce transfer overheads, while smaller values allow
+for more fine-grained consumption.
+
+Statistics
+----------
+A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+
+cpu.stat:
+
+- nr_periods: Number of enforcement intervals that have elapsed.
+- nr_throttled: Number of times the group has been throttled/limited.
+- throttled_time: The total time duration (in nanoseconds) for which entities
+  of the group have been throttled.
+
+This interface is read-only.
+
+Hierarchical considerations
+---------------------------
+The interface enforces that an individual entity's bandwidth is always
+attainable, that is: max(c_i) <= C. However, over-subscription in the
+aggregate case is explicitly allowed to enable work-conserving semantics
+within a hierarchy:
+
+  e.g. \Sum (c_i) may exceed C
+
+[ Where C is the parent's bandwidth, and c_i its children ]
+
+
+There are two ways in which a group may become throttled:
+
+	a. it fully consumes its own quota within a period
+	b. a parent's quota is fully consumed within its period
+
+In case b) above, even though the child may have runtime remaining it will not
+be allowed to until the parent's runtime is refreshed.
+
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultanously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime::
+
+	If period is 250ms and quota is also 250ms, the group will get
+	1 CPU worth of runtime every 250ms.
+
+	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
+	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine
+
+   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+   runtime every 500ms::
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+	The larger period here allows for increased burst capacity.
+
+3. Limit a group to 20% of 1 CPU.
+
+   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+
+   By using a small period here we are ensuring a consistent latency
+   response at the expense of burst capacity.

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
deleted file mode 100644
index f6b1873..0000000
--- a/Documentation/scheduler/sched-bwc.txt
+++ /dev/null

@@ -1,122 +0,0 @@
-CFS Bandwidth Control
-=====================
-
-[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
-  The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
-
-CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
-specification of the maximum CPU bandwidth available to a group or hierarchy.
-
-The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
-within each of these updates is tunable and described as the "slice".
-
-Management
-----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
-
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
-cpu.cfs_period_us: the length of a period (in microseconds)
-cpu.stat: exports throttling statistics [explained further below]
-
-The default values are:
-	cpu.cfs_period_us=100ms
-	cpu.cfs_quota=-1
-
-A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
-bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
-CFS.
-
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
-
-Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
-and return the group to an unconstrained state once more.
-
-Any updates to a group's bandwidth specification will result in it becoming
-unthrottled if it is in a constrained state.
-
-System wide settings
---------------------
-For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
-is described as the "slice".
-
-This is tunable via procfs:
-	/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
-
-Larger slice values will reduce transfer overheads, while smaller values allow
-for more fine-grained consumption.
-
-Statistics
-----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
-
-cpu.stat:
-- nr_periods: Number of enforcement intervals that have elapsed.
-- nr_throttled: Number of times the group has been throttled/limited.
-- throttled_time: The total time duration (in nanoseconds) for which entities
-  of the group have been throttled.
-
-This interface is read-only.
-
-Hierarchical considerations
----------------------------
-The interface enforces that an individual entity's bandwidth is always
-attainable, that is: max(c_i) <= C. However, over-subscription in the
-aggregate case is explicitly allowed to enable work-conserving semantics
-within a hierarchy.
-  e.g. \Sum (c_i) may exceed C
-[ Where C is the parent's bandwidth, and c_i its children ]
-
-
-There are two ways in which a group may become throttled:
-	a. it fully consumes its own quota within a period
-	b. a parent's quota is fully consumed within its period
-
-In case b) above, even though the child may have runtime remaining it will not
-be allowed to until the parent's runtime is refreshed.
-
-Examples
---------
-1. Limit a group to 1 CPU worth of runtime.
-
-	If period is 250ms and quota is also 250ms, the group will get
-	1 CPU worth of runtime every 250ms.
-
-	# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
-	# echo 250000 > cpu.cfs_period_us /* period = 250ms */
-
-2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
-
-	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
-	runtime every 500ms.
-
-	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
-	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
-
-	The larger period here allows for increased burst capacity.
-
-3. Limit a group to 20% of 1 CPU.
-
-	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
-
-	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
-	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
-
-	By using a small period here we are ensuring a consistent latency
-	response at the expense of burst capacity.
-

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.rst
similarity index 87%
rename from Documentation/scheduler/sched-deadline.txt
rename to Documentation/scheduler/sched-deadline.rst
index b14e03f..14a2f7b 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.rst

@@ -1,29 +1,29 @@
-			  Deadline Task Scheduling
-			  ------------------------
+========================
+Deadline Task Scheduling
+========================
 
-CONTENTS
-========
+.. CONTENTS
 
- 0. WARNING
- 1. Overview
- 2. Scheduling algorithm
-   2.1 Main algorithm
-   2.2 Bandwidth reclaiming
- 3. Scheduling Real-Time Tasks
-   3.1 Definitions
-   3.2 Schedulability Analysis for Uniprocessor Systems
-   3.3 Schedulability Analysis for Multiprocessor Systems
-   3.4 Relationship with SCHED_DEADLINE Parameters
- 4. Bandwidth management
-   4.1 System-wide settings
-   4.2 Task interface
-   4.3 Default behavior
-   4.4 Behavior of sched_yield()
- 5. Tasks CPU affinity
-   5.1 SCHED_DEADLINE and cpusets HOWTO
- 6. Future plans
- A. Test suite
- B. Minimal main()
+    0. WARNING
+    1. Overview
+    2. Scheduling algorithm
+      2.1 Main algorithm
+      2.2 Bandwidth reclaiming
+    3. Scheduling Real-Time Tasks
+      3.1 Definitions
+      3.2 Schedulability Analysis for Uniprocessor Systems
+      3.3 Schedulability Analysis for Multiprocessor Systems
+      3.4 Relationship with SCHED_DEADLINE Parameters
+    4. Bandwidth management
+      4.1 System-wide settings
+      4.2 Task interface
+      4.3 Default behavior
+      4.4 Behavior of sched_yield()
+    5. Tasks CPU affinity
+      5.1 SCHED_DEADLINE and cpusets HOWTO
+    6. Future plans
+    A. Test suite
+    B. Minimal main()
 
 
 0. WARNING
@@ -44,7 +44,7 @@
 
 
 2. Scheduling algorithm
-==================
+=======================
 
 2.1 Main algorithm
 ------------------
@@ -80,7 +80,7 @@
     a "remaining runtime". These two parameters are initially set to 0;
 
   - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
-    the scheduler checks if
+    the scheduler checks if::
 
                  remaining runtime                  runtime
         ----------------------------------    >    ---------
@@ -97,7 +97,7 @@
     left unchanged;
 
   - When a SCHED_DEADLINE task executes for an amount of time t, its
-    remaining runtime is decreased as
+    remaining runtime is decreased as::
 
          remaining runtime = remaining runtime - t
 
@@ -112,7 +112,7 @@
 
   - When the current time is equal to the replenishment time of a
     throttled task, the scheduling deadline and the remaining runtime are
-    updated as
+    updated as::
 
          scheduling deadline = scheduling deadline + period
          remaining runtime = remaining runtime + runtime
@@ -129,7 +129,7 @@
  Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
  when flag SCHED_FLAG_RECLAIM is set.
 
- The following diagram illustrates the state names for tasks handled by GRUB:
+ The following diagram illustrates the state names for tasks handled by GRUB::
 
                              ------------
                  (d)        |   Active   |
@@ -168,7 +168,7 @@
       breaking the real-time guarantees.
 
       The 0-lag time for a task entering the ActiveNonContending state is
-      computed as
+      computed as::
 
                         (runtime * dl_period)
              deadline - ---------------------
@@ -183,7 +183,7 @@
       the task's utilization must be removed from the previous runqueue's active
       utilization and must be added to the new runqueue's active utilization.
       In order to avoid races between a task waking up on a runqueue while the
-       "inactive timer" is running on a different CPU, the "dl_non_contending"
+      "inactive timer" is running on a different CPU, the "dl_non_contending"
       flag is used to indicate that a task is not on a runqueue but is active
       (so, the flag is set when the task blocks and is cleared when the
       "inactive timer" fires or when the task  wakes up).
@@ -222,36 +222,36 @@
 
 
  Let's now see a trivial example of two deadline tasks with runtime equal
- to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
 
-     A            Task T1
-     |
-     |                               |
-     |                               |
-     |--------                       |----
-     |       |                       V
-     |---|---|---|---|---|---|---|---|--------->t
-     0   1   2   3   4   5   6   7   8
+         A            Task T1
+         |
+         |                               |
+         |                               |
+         |--------                       |----
+         |       |                       V
+         |---|---|---|---|---|---|---|---|--------->t
+         0   1   2   3   4   5   6   7   8
 
 
-     A            Task T2
-     |
-     |                               |
-     |                               |
-     |       ------------------------|
-     |       |                       V
-     |---|---|---|---|---|---|---|---|--------->t
-     0   1   2   3   4   5   6   7   8
+         A            Task T2
+         |
+         |                               |
+         |                               |
+         |       ------------------------|
+         |       |                       V
+         |---|---|---|---|---|---|---|---|--------->t
+         0   1   2   3   4   5   6   7   8
 
 
-     A            running_bw
-     |
-   1 -----------------               ------
-     |               |               |
-  0.5-               -----------------
-     |                               |
-     |---|---|---|---|---|---|---|---|--------->t
-     0   1   2   3   4   5   6   7   8
+         A            running_bw
+         |
+       1 -----------------               ------
+         |               |               |
+      0.5-               -----------------
+         |                               |
+         |---|---|---|---|---|---|---|---|--------->t
+         0   1   2   3   4   5   6   7   8
 
 
   - Time t = 0:
@@ -284,7 +284,7 @@
 
 
 2.3 Energy-aware scheduling
-------------------------
+---------------------------
 
  When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
  GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
@@ -299,15 +299,20 @@
 3. Scheduling Real-Time Tasks
 =============================
 
- * BIG FAT WARNING ******************************************************
- *
- * This section contains a (not-thorough) summary on classical deadline
- * scheduling theory, and how it applies to SCHED_DEADLINE.
- * The reader can "safely" skip to Section 4 if only interested in seeing
- * how the scheduling policy can be used. Anyway, we strongly recommend
- * to come back here and continue reading (once the urge for testing is
- * satisfied :P) to be sure of fully understanding all technical details.
- ************************************************************************
+
+
+ ..  BIG FAT WARNING ******************************************************
+
+ .. warning::
+
+   This section contains a (not-thorough) summary on classical deadline
+   scheduling theory, and how it applies to SCHED_DEADLINE.
+   The reader can "safely" skip to Section 4 if only interested in seeing
+   how the scheduling policy can be used. Anyway, we strongly recommend
+   to come back here and continue reading (once the urge for testing is
+   satisfied :P) to be sure of fully understanding all technical details.
+
+ .. ************************************************************************
 
  There are no limitations on what kind of task can exploit this new
  scheduling discipline, even if it must be said that it is particularly
@@ -329,6 +334,7 @@
  sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
  d_j = r_j + D, where D is the task's relative deadline.
  Summing up, a real-time task can be described as
+
 	Task = (WCET, D, P)
 
  The utilization of a real-time task is defined as the ratio between its
@@ -352,13 +358,15 @@
  between the finishing time of a job and its absolute deadline).
  More precisely, it can be proven that using a global EDF scheduler the
  maximum tardiness of each task is smaller or equal than
+
 	((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
+
  where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
  is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
  utilization[12].
 
 3.2 Schedulability Analysis for Uniprocessor Systems
-------------------------
+----------------------------------------------------
 
  If M=1 (uniprocessor system), or in case of partitioned scheduling (each
  real-time task is statically assigned to one and only one CPU), it is
@@ -370,7 +378,9 @@
  a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
  of all the tasks running on a CPU if the sum of the densities of the tasks
  running on such a CPU is smaller or equal than 1:
+
 	sum(WCET_i / min{D_i, P_i}) <= 1
+
  It is important to notice that this condition is only sufficient, and not
  necessary: there are task sets that are schedulable, but do not respect the
  condition. For example, consider the task set {Task_1,Task_2} composed by
@@ -379,7 +389,9 @@
  (Task_1 is scheduled as soon as it is released, and finishes just in time
  to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
  its response time cannot be larger than 50ms + 10ms = 60ms) even if
+
 	50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+
  Of course it is possible to test the exact schedulability of tasks with
  D_i != P_i (checking a condition that is both sufficient and necessary),
  but this cannot be done by comparing the total utilization or density with
@@ -399,7 +411,7 @@
  4 Linux uses an admission test based on the tasks' utilizations.
 
 3.3 Schedulability Analysis for Multiprocessor Systems
-------------------------
+------------------------------------------------------
 
  On multiprocessor systems with global EDF scheduling (non partitioned
  systems), a sufficient test for schedulability can not be based on the
@@ -428,7 +440,9 @@
  between total utilization (or density) and a fixed constant. If all tasks
  have D_i = P_i, a sufficient schedulability condition can be expressed in
  a simple way:
+
 	sum(WCET_i / P_i) <= M - (M - 1) · U_max
+
  where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
  M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
  just confirms the Dhall's effect. A more complete survey of the literature
@@ -447,7 +461,7 @@
  the tasks are limited.
 
 3.4 Relationship with SCHED_DEADLINE Parameters
-------------------------
+-----------------------------------------------
 
  Finally, it is important to understand the relationship between the
  SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
@@ -473,6 +487,7 @@
  this task, as it is not possible to respect its temporal constraints.
 
  References:
+
   1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
       ming in a hard-real-time environment. Journal of the Association for
       Computing Machinery, 20(1), 1973.
@@ -550,7 +565,7 @@
  The interface used to control the CPU bandwidth that can be allocated
  to -deadline tasks is similar to the one already used for -rt
  tasks with real-time group scheduling (a.k.a. RT-throttling - see
- Documentation/scheduler/sched-rt-group.txt), and is based on readable/
+ Documentation/scheduler/sched-rt-group.rst), and is based on readable/
  writable control files located in procfs (for system wide settings).
  Notice that per-group settings (controlled through cgroupfs) are still not
  defined for -deadline tasks, because more discussion is needed in order to
@@ -596,11 +611,13 @@
  Specifying a periodic/sporadic task that executes for a given amount of
  runtime at each instance, and that is scheduled according to the urgency of
  its own timing constraints needs, in general, a way of declaring:
+
   - a (maximum/typical) instance execution time,
   - a minimum interval between consecutive instances,
   - a time constraint by which each instance must be completed.
 
  Therefore:
+
   * a new struct sched_attr, containing all the necessary fields is
     provided;
   * the new scheduling related syscalls that manipulate it, i.e.,
@@ -652,27 +669,27 @@
 
  -deadline tasks cannot have an affinity mask smaller that the entire
  root_domain they are created on. However, affinities can be specified
- through the cpuset facility (Documentation/cgroup-v1/cpusets.txt).
+ through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).
 
 5.1 SCHED_DEADLINE and cpusets HOWTO
 ------------------------------------
 
  An example of a simple configuration (pin a -deadline task to CPU0)
- follows (rt-app is used to create a -deadline task).
+ follows (rt-app is used to create a -deadline task)::
 
- mkdir /dev/cpuset
- mount -t cgroup -o cpuset cpuset /dev/cpuset
- cd /dev/cpuset
- mkdir cpu0
- echo 0 > cpu0/cpuset.cpus
- echo 0 > cpu0/cpuset.mems
- echo 1 > cpuset.cpu_exclusive
- echo 0 > cpuset.sched_load_balance
- echo 1 > cpu0/cpuset.cpu_exclusive
- echo 1 > cpu0/cpuset.mem_exclusive
- echo $$ > cpu0/tasks
- rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
- task affinity)
+   mkdir /dev/cpuset
+   mount -t cgroup -o cpuset cpuset /dev/cpuset
+   cd /dev/cpuset
+   mkdir cpu0
+   echo 0 > cpu0/cpuset.cpus
+   echo 0 > cpu0/cpuset.mems
+   echo 1 > cpuset.cpu_exclusive
+   echo 0 > cpuset.sched_load_balance
+   echo 1 > cpu0/cpuset.cpu_exclusive
+   echo 1 > cpu0/cpuset.mem_exclusive
+   echo $$ > cpu0/tasks
+   rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
+				  # task affinity
 
 6. Future plans
 ===============
@@ -711,7 +728,7 @@
  rt-app is available at: https://github.com/scheduler-tools/rt-app.
 
  Thread parameters can be specified from the command line, with something like
- this:
+ this::
 
   # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
 
@@ -721,27 +738,27 @@
  of 5 seconds.
 
  More interestingly, configurations can be described with a json file that
- can be passed as input to rt-app with something like this:
+ can be passed as input to rt-app with something like this::
 
   # rt-app my_config.json
 
  The parameters that can be specified with the second method are a superset
  of the command line options. Please refer to rt-app documentation for more
- details (<rt-app-sources>/doc/*.json).
+ details (`<rt-app-sources>/doc/*.json`).
 
  The second testing application is a modification of schedtool, called
  schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a
  certain pid/application. schedtool-dl is available at:
  https://github.com/scheduler-tools/schedtool-dl.git.
 
- The usage is straightforward:
+ The usage is straightforward::
 
   # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
 
  With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
  of 10ms every 100ms (note that parameters are expressed in microseconds).
  You can also use schedtool to create a reservation for an already running
- application, given that you know its pid:
+ application, given that you know its pid::
 
   # schedtool -E -t 10000000:100000000 my_app_pid
 
@@ -750,43 +767,43 @@
 
  We provide in what follows a simple (ugly) self-contained code snippet
  showing how SCHED_DEADLINE reservations can be created by a real-time
- application developer.
+ application developer::
 
- #define _GNU_SOURCE
- #include <unistd.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <string.h>
- #include <time.h>
- #include <linux/unistd.h>
- #include <linux/kernel.h>
- #include <linux/types.h>
- #include <sys/syscall.h>
- #include <pthread.h>
+   #define _GNU_SOURCE
+   #include <unistd.h>
+   #include <stdio.h>
+   #include <stdlib.h>
+   #include <string.h>
+   #include <time.h>
+   #include <linux/unistd.h>
+   #include <linux/kernel.h>
+   #include <linux/types.h>
+   #include <sys/syscall.h>
+   #include <pthread.h>
 
- #define gettid() syscall(__NR_gettid)
+   #define gettid() syscall(__NR_gettid)
 
- #define SCHED_DEADLINE	6
+   #define SCHED_DEADLINE	6
 
- /* XXX use the proper syscall numbers */
- #ifdef __x86_64__
- #define __NR_sched_setattr		314
- #define __NR_sched_getattr		315
- #endif
+   /* XXX use the proper syscall numbers */
+   #ifdef __x86_64__
+   #define __NR_sched_setattr		314
+   #define __NR_sched_getattr		315
+   #endif
 
- #ifdef __i386__
- #define __NR_sched_setattr		351
- #define __NR_sched_getattr		352
- #endif
+   #ifdef __i386__
+   #define __NR_sched_setattr		351
+   #define __NR_sched_getattr		352
+   #endif
 
- #ifdef __arm__
- #define __NR_sched_setattr		380
- #define __NR_sched_getattr		381
- #endif
+   #ifdef __arm__
+   #define __NR_sched_setattr		380
+   #define __NR_sched_getattr		381
+   #endif
 
- static volatile int done;
+   static volatile int done;
 
- struct sched_attr {
+   struct sched_attr {
 	__u32 size;
 
 	__u32 sched_policy;
@@ -802,25 +819,25 @@
 	__u64 sched_runtime;
 	__u64 sched_deadline;
 	__u64 sched_period;
- };
+   };
 
- int sched_setattr(pid_t pid,
+   int sched_setattr(pid_t pid,
 		  const struct sched_attr *attr,
 		  unsigned int flags)
- {
+   {
 	return syscall(__NR_sched_setattr, pid, attr, flags);
- }
+   }
 
- int sched_getattr(pid_t pid,
+   int sched_getattr(pid_t pid,
 		  struct sched_attr *attr,
 		  unsigned int size,
 		  unsigned int flags)
- {
+   {
 	return syscall(__NR_sched_getattr, pid, attr, size, flags);
- }
+   }
 
- void *run_deadline(void *data)
- {
+   void *run_deadline(void *data)
+   {
 	struct sched_attr attr;
 	int x = 0;
 	int ret;
@@ -851,10 +868,10 @@
 
 	printf("deadline thread dies [%ld]\n", gettid());
 	return NULL;
- }
+   }
 
- int main (int argc, char **argv)
- {
+   int main (int argc, char **argv)
+   {
 	pthread_t thread;
 
 	printf("main thread [%ld]\n", gettid());
@@ -868,4 +885,4 @@
 
 	printf("main dies [%ld]\n", gettid());
 	return 0;
- }
+   }

diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.rst
similarity index 96%
rename from Documentation/scheduler/sched-design-CFS.txt
rename to Documentation/scheduler/sched-design-CFS.rst
index edd861c..a96c726 100644
--- a/Documentation/scheduler/sched-design-CFS.txt
+++ b/Documentation/scheduler/sched-design-CFS.rst

@@ -1,9 +1,10 @@
-                      =============
-                      CFS Scheduler
-                      =============
+=============
+CFS Scheduler
+=============
 
 
 1.  OVERVIEW
+============
 
 CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
 scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
@@ -27,6 +28,7 @@
 
 
 2.  FEW IMPLEMENTATION DETAILS
+==============================
 
 In CFS the virtual runtime is expressed and tracked via the per-task
 p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
@@ -49,6 +51,7 @@
 
 
 3.  THE RBTREE
+==============
 
 CFS's design is quite radical: it does not use the old data structures for the
 runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
@@ -84,6 +87,7 @@
 
 
 4.  SOME FEATURES OF CFS
+========================
 
 CFS uses nanosecond granularity accounting and does not rely on any jiffies or
 other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
@@ -113,6 +117,7 @@
 
 
 5. Scheduling policies
+======================
 
 CFS implements three scheduling policies:
 
@@ -137,6 +142,7 @@
 
 
 6.  SCHEDULING CLASSES
+======================
 
 The new CFS scheduler has been designed in such a way to introduce "Scheduling
 Classes," an extensible hierarchy of scheduler modules.  These modules
@@ -197,6 +203,7 @@
 
 
 7.  GROUP SCHEDULER EXTENSIONS TO CFS
+=====================================
 
 Normally, the scheduler operates on individual tasks and strives to provide
 fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
@@ -215,11 +222,11 @@
 
    These options need CONFIG_CGROUPS to be defined, and let the administrator
    create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
-   Documentation/cgroup-v1/cgroups.txt for more information about this filesystem.
+   Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem.
 
 When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
 group created using the pseudo filesystem.  See example steps below to create
-task groups and modify their CPU share using the "cgroups" pseudo filesystem.
+task groups and modify their CPU share using the "cgroups" pseudo filesystem::
 
 	# mount -t tmpfs cgroup_root /sys/fs/cgroup
 	# mkdir /sys/fs/cgroup/cpu

diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.rst
similarity index 97%
rename from Documentation/scheduler/sched-domains.txt
rename to Documentation/scheduler/sched-domains.rst
index 4af80b1..f750422 100644
--- a/Documentation/scheduler/sched-domains.txt
+++ b/Documentation/scheduler/sched-domains.rst

@@ -1,3 +1,7 @@
+=================
+Scheduler Domains
+=================
+
 Each CPU has a "base" scheduling domain (struct sched_domain). The domain
 hierarchy is built from these base domains via the ->parent pointer. ->parent
 MUST be NULL terminated, and domain structures should be per-CPU as they are
@@ -46,7 +50,9 @@
 to our runqueue. The exact number of tasks amounts to an imbalance previously
 computed while iterating over this sched domain's groups.
 
-*** Implementing sched domains ***
+Implementing sched domains
+==========================
+
 The "base" domain will "span" the first level of the hierarchy. In the case
 of SMT, you'll span all siblings of the physical CPU, with each group being
 a single virtual CPU.

diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
new file mode 100644
index 0000000..9580c57
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.rst

@@ -0,0 +1,430 @@
+=======================
+Energy Aware Scheduling
+=======================
+
+1. Introduction
+---------------
+
+Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
+the impact of its decisions on the energy consumed by CPUs. EAS relies on an
+Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
+with a minimal impact on throughput. This document aims at providing an
+introduction on how EAS works, what are the main design decisions behind it, and
+details what is needed to get it to run.
+
+Before going any further, please note that at the time of writing::
+
+   /!\ EAS does not support platforms with symmetric CPU topologies /!\
+
+EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
+because this is where the potential for saving energy through scheduling is
+the highest.
+
+The actual EM used by EAS is _not_ maintained by the scheduler, but by a
+dedicated framework. For details about this framework and what it provides,
+please refer to its documentation (see Documentation/power/energy-model.rst).
+
+
+2. Background and Terminology
+-----------------------------
+
+To make it clear from the start:
+ - energy = [joule] (resource like a battery on powered devices)
+ - power = energy/time = [joule/second] = [watt]
+
+The goal of EAS is to minimize energy, while still getting the job done. That
+is, we want to maximize::
+
+	performance [inst/s]
+	--------------------
+	    power [W]
+
+which is equivalent to minimizing::
+
+	energy [J]
+	-----------
+	instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance.
+
+The idea behind introducing an EM is to allow the scheduler to evaluate the
+implications of its decisions rather than blindly applying energy-saving
+techniques that may have positive effects only on some platforms. At the same
+time, the EM must be as simple as possible to minimize the scheduler latency
+impact.
+
+In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
+for the scheduler to decide where a task should run (during wake-up), the EM
+is used to break the tie between several good CPU candidates and pick the one
+that is predicted to yield the best energy consumption without harming the
+system's throughput. The predictions made by EAS rely on specific elements of
+knowledge about the platform's topology, which include the 'capacity' of CPUs,
+and their respective energy costs.
+
+
+3. Topology information
+-----------------------
+
+EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
+differentiate CPUs with different computing throughput. The 'capacity' of a CPU
+represents the amount of work it can absorb when running at its highest
+frequency compared to the most capable CPU of the system. Capacity values are
+normalized in a 1024 range, and are comparable with the utilization signals of
+tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
+to capacity and utilization values, EAS is able to estimate how big/busy a
+task/CPU is, and to take this into consideration when evaluating performance vs
+energy trade-offs. The capacity of CPUs is provided via arch-specific code
+through the arch_scale_cpu_capacity() callback.
+
+The rest of platform knowledge used by EAS is directly read from the Energy
+Model (EM) framework. The EM of a platform is composed of a power cost table
+per 'performance domain' in the system (see Documentation/power/energy-model.rst
+for futher details about performance domains).
+
+The scheduler manages references to the EM objects in the topology code when the
+scheduling domains are built, or re-built. For each root domain (rd), the
+scheduler maintains a singly linked list of all performance domains intersecting
+the current rd->span. Each node in the list contains a pointer to a struct
+em_perf_domain as provided by the EM framework.
+
+The lists are attached to the root domains in order to cope with exclusive
+cpuset configurations. Since the boundaries of exclusive cpusets do not
+necessarily match those of performance domains, the lists of different root
+domains can contain duplicate elements.
+
+Example 1.
+    Let us consider a platform with 12 CPUs, split in 3 performance domains
+    (pd0, pd4 and pd8), organized as follows::
+
+	          CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
+	          PDs:   |--pd0--|--pd4--|---pd8---|
+	          RDs:   |----rd1----|-----rd2-----|
+
+    Now, consider that userspace decided to split the system with two
+    exclusive cpusets, hence creating two independent root domains, each
+    containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
+    above figure. Since pd4 intersects with both rd1 and rd2, it will be
+    present in the linked list '->pd' attached to each of them:
+
+       * rd1->pd: pd0 -> pd4
+       * rd2->pd: pd4 -> pd8
+
+    Please note that the scheduler will create two duplicate list nodes for
+    pd4 (one for each list). However, both just hold a pointer to the same
+    shared data structure of the EM framework.
+
+Since the access to these lists can happen concurrently with hotplug and other
+things, they are protected by RCU, like the rest of topology structures
+manipulated by the scheduler.
+
+EAS also maintains a static key (sched_energy_present) which is enabled when at
+least one root domain meets all conditions for EAS to start. Those conditions
+are summarized in Section 6.
+
+
+4. Energy-Aware task placement
+------------------------------
+
+EAS overrides the CFS task wake-up balancing code. It uses the EM of the
+platform and the PELT signals to choose an energy-efficient target CPU during
+wake-up balance. When EAS is enabled, select_task_rq_fair() calls
+find_energy_efficient_cpu() to do the placement decision. This function looks
+for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
+each performance domain since it is the one which will allow us to keep the
+frequency the lowest. Then, the function checks if placing the task there could
+save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
+in its previous activation.
+
+find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
+energy consumed by the system if the waking task was migrated. compute_energy()
+looks at the current utilization landscape of the CPUs and adjusts it to
+'simulate' the task migration. The EM framework provides the em_pd_energy() API
+which computes the expected energy consumption of each performance domain for
+the given utilization landscape.
+
+An example of energy-optimized task placement decision is detailed below.
+
+Example 2.
+    Let us consider a (fake) platform with 2 independent performance domains
+    composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
+    are big.
+
+    The scheduler must decide where to place a task P whose util_avg = 200
+    and prev_cpu = 0.
+
+    The current utilization landscape of the CPUs is depicted on the graph
+    below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively
+    Each performance domain has three Operating Performance Points (OPPs).
+    The CPU capacity and power cost associated with each OPP is listed in
+    the Energy Model table. The util_avg of P is shown on the figures
+    below as 'PP'::
+
+     CPU util.
+      1024                 - - - - - - -              Energy Model
+                                               +-----------+-------------+
+                                               |  Little   |     Big     |
+       768                 =============       +-----+-----+------+------+
+                                               | Cap | Pwr | Cap  | Pwr  |
+                                               +-----+-----+------+------+
+       512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
+                             ##     ##         | 341 | 150 | 768  | 800  |
+       341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
+             PP              ##     ##         +-----+-----+------+------+
+       170  -## - - - -      ##     ##
+             ##     ##       ##     ##
+           ------------    -------------
+            CPU0   CPU1     CPU2   CPU3
+
+      Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##
+
+
+    find_energy_efficient_cpu() will first look for the CPUs with the
+    maximum spare capacity in the two performance domains. In this example,
+    CPU1 and CPU3. Then it will estimate the energy of the system if P was
+    placed on either of them, and check if that would save some energy
+    compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
+    (which is coherent with the behaviour of the schedutil CPUFreq
+    governor, see Section 6. for more details on this topic).
+
+    **Case 1. P is migrated to CPU1**::
+
+      1024                 - - - - - - -
+
+                                            Energy calculation:
+       768                 =============     * CPU0: 200 / 341 * 150 = 88
+                                             * CPU1: 300 / 341 * 150 = 131
+                                             * CPU2: 600 / 768 * 800 = 625
+       512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
+                             ##     ##          => total_energy = 1364
+       341  ===========      ##     ##
+                    PP       ##     ##
+       170  -## - - PP-      ##     ##
+             ##     ##       ##     ##
+           ------------    -------------
+            CPU0   CPU1     CPU2   CPU3
+
+
+    **Case 2. P is migrated to CPU3**::
+
+      1024                 - - - - - - -
+
+                                            Energy calculation:
+       768                 =============     * CPU0: 200 / 341 * 150 = 88
+                                             * CPU1: 100 / 341 * 150 = 43
+                                    PP       * CPU2: 600 / 768 * 800 = 625
+       512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
+                             ##     ##          => total_energy = 1485
+       341  ===========      ##     ##
+                             ##     ##
+       170  -## - - - -      ##     ##
+             ##     ##       ##     ##
+           ------------    -------------
+            CPU0   CPU1     CPU2   CPU3
+
+
+    **Case 3. P stays on prev_cpu / CPU 0**::
+
+      1024                 - - - - - - -
+
+                                            Energy calculation:
+       768                 =============     * CPU0: 400 / 512 * 300 = 234
+                                             * CPU1: 100 / 512 * 300 = 58
+                                             * CPU2: 600 / 768 * 800 = 625
+       512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
+                             ##     ##          => total_energy = 1437
+       341  -PP - - - -      ##     ##
+             PP              ##     ##
+       170  -## - - - -      ##     ##
+             ##     ##       ##     ##
+           ------------    -------------
+            CPU0   CPU1     CPU2   CPU3
+
+
+    From these calculations, the Case 1 has the lowest total energy. So CPU 1
+    is be the best candidate from an energy-efficiency standpoint.
+
+Big CPUs are generally more power hungry than the little ones and are thus used
+mainly when a task doesn't fit the littles. However, little CPUs aren't always
+necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
+of the little CPUs can be less energy-efficient than the lowest OPPs of the
+bigs, for example. So, if the little CPUs happen to have enough utilization at
+a specific point in time, a small task waking up at that moment could be better
+of executing on the big side in order to save energy, even though it would fit
+on the little side.
+
+And even in the case where all OPPs of the big CPUs are less energy-efficient
+than those of the little, using the big CPUs for a small task might still, under
+specific conditions, save energy. Indeed, placing a task on a little CPU can
+result in raising the OPP of the entire performance domain, and that will
+increase the cost of the tasks already running there. If the waking task is
+placed on a big CPU, its own execution cost might be higher than if it was
+running on a little, but it won't impact the other tasks of the little CPUs
+which will keep running at a lower OPP. So, when considering the total energy
+consumed by CPUs, the extra cost of running that one task on a big core can be
+smaller than the cost of raising the OPP on the little CPUs for all the other
+tasks.
+
+The examples above would be nearly impossible to get right in a generic way, and
+for all platforms, without knowing the cost of running at different OPPs on all
+CPUs of the system. Thanks to its EM-based design, EAS should cope with them
+correctly without too many troubles. However, in order to ensure a minimal
+impact on throughput for high-utilization scenarios, EAS also implements another
+mechanism called 'over-utilization'.
+
+
+5. Over-utilization
+-------------------
+
+From a general standpoint, the use-cases where EAS can help the most are those
+involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
+being run, they will require all of the available CPU capacity, and there isn't
+much that can be done by the scheduler to save energy without severly harming
+throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
+'over-utilized' as soon as they are used at more than 80% of their compute
+capacity. As long as no CPUs are over-utilized in a root domain, load balancing
+is disabled and EAS overridess the wake-up balancing code. EAS is likely to load
+the most energy efficient CPUs of the system more than the others if that can be
+done without harming throughput. So, the load-balancer is disabled to prevent
+it from breaking the energy-efficient task placement found by EAS. It is safe to
+do so when the system isn't overutilized since being below the 80% tipping point
+implies that:
+
+    a. there is some idle time on all CPUs, so the utilization signals used by
+       EAS are likely to accurately represent the 'size' of the various tasks
+       in the system;
+    b. all tasks should already be provided with enough CPU capacity,
+       regardless of their nice values;
+    c. since there is spare capacity all tasks must be blocking/sleeping
+       regularly and balancing at wake-up is sufficient.
+
+As soon as one CPU goes above the 80% tipping point, at least one of the three
+assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
+is raised for the entire root domain, EAS is disabled, and the load-balancer is
+re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
+wake-up and load balance under CPU-bound conditions. This provides a better
+respect of the nice values of tasks.
+
+Since the notion of overutilization largely relies on detecting whether or not
+there is some idle time in the system, the CPU capacity 'stolen' by higher
+(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
+such, the detection of overutilization accounts for the capacity used not only
+by CFS tasks, but also by the other scheduling classes and IRQ.
+
+
+6. Dependencies and requirements for EAS
+----------------------------------------
+
+Energy Aware Scheduling depends on the CPUs of the system having specific
+hardware properties and on other features of the kernel being enabled. This
+section lists these dependencies and provides hints as to how they can be met.
+
+
+6.1 - Asymmetric CPU topology
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+As mentioned in the introduction, EAS is only supported on platforms with
+asymmetric CPU topologies for now. This requirement is checked at run-time by
+looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling
+domains are built.
+
+The flag is set/cleared automatically by the scheduler topology code whenever
+there are CPUs with different capacities in a root domain. The capacities of
+CPUs are provided by arch-specific code through the arch_scale_cpu_capacity()
+callback. As an example, arm and arm64 share an implementation of this callback
+which uses a combination of CPUFreq data and device-tree bindings to compute the
+capacity of CPUs (see drivers/base/arch_topology.c for more details).
+
+So, in order to use EAS on your platform your architecture must implement the
+arch_scale_cpu_capacity() callback, and some of the CPUs must have a lower
+capacity than others.
+
+Please note that EAS is not fundamentally incompatible with SMP, but no
+significant savings on SMP platforms have been observed yet. This restriction
+could be amended in the future if proven otherwise.
+
+
+6.2 - Energy Model presence
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS uses the EM of a platform to estimate the impact of scheduling decisions on
+energy. So, your platform must provide power cost tables to the EM framework in
+order to make EAS start. To do so, please refer to documentation of the
+independent EM framework in Documentation/power/energy-model.rst.
+
+Please also note that the scheduling domains need to be re-built after the
+EM has been registered in order to start EAS.
+
+
+6.3 - Energy Model complexity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The task wake-up path is very latency-sensitive. When the EM of a platform is
+too complex (too many CPUs, too many performance domains, too many performance
+states, ...), the cost of using it in the wake-up path can become prohibitive.
+The energy-aware wake-up algorithm has a complexity of:
+
+	C = Nd * (Nc + Ns)
+
+with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
+total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
+
+A complexity check is performed at the root domain level, when scheduling
+domains are built. EAS will not start on a root domain if its C happens to be
+higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
+time of writing).
+
+If you really want to use EAS but the complexity of your platform's Energy
+Model is too high to be used with a single root domain, you're left with only
+two possible options:
+
+    1. split your system into separate, smaller, root domains using exclusive
+       cpusets and enable EAS locally on each of them. This option has the
+       benefit to work out of the box but the drawback of preventing load
+       balance between root domains, which can result in an unbalanced system
+       overall;
+    2. submit patches to reduce the complexity of the EAS wake-up algorithm,
+       hence enabling it to cope with larger EMs in reasonable time.
+
+
+6.4 - Schedutil governor
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS tries to predict at which OPP will the CPUs be running in the close future
+in order to estimate their energy consumption. To do so, it is assumed that OPPs
+of CPUs follow their utilization.
+
+Although it is very difficult to provide hard guarantees regarding the accuracy
+of this assumption in practice (because the hardware might not do what it is
+told to do, for example), schedutil as opposed to other CPUFreq governors at
+least _requests_ frequencies calculated using the utilization signals.
+Consequently, the only sane governor to use together with EAS is schedutil,
+because it is the only one providing some degree of consistency between
+frequency requests and energy predictions.
+
+Using EAS with any other governor than schedutil is not supported.
+
+
+6.5 Scale-invariant utilization signals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to make accurate prediction across CPUs and for all performance
+states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
+be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
+callbacks.
+
+Using EAS on a platform that doesn't implement these two callbacks is not
+supported.
+
+
+6.6 Multithreading (SMT)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+EAS in its current form is SMT unaware and is not able to leverage
+multithreaded hardware to save energy. EAS considers threads as independent
+CPUs, which can actually be counter-productive for both performance and energy.
+
+EAS on SMT is not supported.

diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.rst
similarity index 98%
rename from Documentation/scheduler/sched-nice-design.txt
rename to Documentation/scheduler/sched-nice-design.rst
index 3ac1e46..0571f1b 100644
--- a/Documentation/scheduler/sched-nice-design.txt
+++ b/Documentation/scheduler/sched-nice-design.rst

@@ -1,3 +1,7 @@
+=====================
+Scheduler Nice Design
+=====================
+
 This document explains the thinking about the revamped and streamlined
 nice-levels implementation in the new Linux scheduler.
 
@@ -14,7 +18,7 @@
 that change), and we also intentionally calibrated the linear timeslice
 rule so that nice +19 level would be _exactly_ 1 jiffy. To better
 understand it, the timeslice graph went like this (cheesy ASCII art
-alert!):
+alert!)::
 
 
                    A

diff --git a/Documentation/scheduler/sched-pelt.c b/Documentation/scheduler/sched-pelt.c
index e421913..7238b35 100644
--- a/Documentation/scheduler/sched-pelt.c
+++ b/Documentation/scheduler/sched-pelt.c

@@ -20,7 +20,8 @@
 	int i;
 	unsigned int x;
 
-	printf("static const u32 runnable_avg_yN_inv[] = {");
+	/* To silence -Wunused-but-set-variable warnings. */
+	printf("static const u32 runnable_avg_yN_inv[] __maybe_unused = {");
 	for (i = 0; i < HALFLIFE; i++) {
 		x = ((1UL<<32)-1)*pow(y, i);
 

diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.rst
similarity index 94%
rename from Documentation/scheduler/sched-rt-group.txt
rename to Documentation/scheduler/sched-rt-group.rst
index d8fce3e..655a096 100644
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.rst

@@ -1,18 +1,18 @@
-				Real-Time group scheduling
-				--------------------------
+==========================
+Real-Time group scheduling
+==========================
 
-CONTENTS
-========
+.. CONTENTS
 
-0. WARNING
-1. Overview
-  1.1 The problem
-  1.2 The solution
-2. The interface
-  2.1 System-wide settings
-  2.2 Default behaviour
-  2.3 Basis for grouping tasks
-3. Future plans
+   0. WARNING
+   1. Overview
+     1.1 The problem
+     1.2 The solution
+   2. The interface
+     2.1 System-wide settings
+     2.2 Default behaviour
+     2.3 Basis for grouping tasks
+   3. Future plans
 
 
 0. WARNING
@@ -133,7 +133,7 @@
 to control the CPU time reserved for each control group.
 
 For more information on working with control groups, you should read
-Documentation/cgroup-v1/cgroups.txt as well.
+Documentation/admin-guide/cgroup-v1/cgroups.rst as well.
 
 Group settings are checked against the following limits in order to keep the
 configuration schedulable:
@@ -159,9 +159,11 @@
 period is twice the length of B's.
 
 * group A: period=100000us, runtime=50000us
+
 	- this runs for 0.05s once every 0.1s
 
 * group B: period= 50000us, runtime=25000us
+
 	- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
 
 This means that currently a while (1) loop in A will run for the full period of

diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.rst
similarity index 90%
rename from Documentation/scheduler/sched-stats.txt
rename to Documentation/scheduler/sched-stats.rst
index 8259b34..0cb0aa7 100644
--- a/Documentation/scheduler/sched-stats.txt
+++ b/Documentation/scheduler/sched-stats.rst

@@ -1,3 +1,7 @@
+====================
+Scheduler Statistics
+====================
+
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
 identical to version 14.
@@ -35,19 +39,23 @@
 cpu<N> 1 2 3 4 5 6 7 8 9
 
 First field is a sched_yield() statistic:
+
      1) # of times sched_yield() was called
 
 Next three are schedule() statistics:
+
      2) This field is a legacy array expiration count field used in the O(1)
 	scheduler. We kept it for ABI compatibility, but it is always set to zero.
      3) # of times schedule() was called
      4) # of times schedule() left the processor idle
 
 Next two are try_to_wake_up() statistics:
+
      5) # of times try_to_wake_up() was called
      6) # of times try_to_wake_up() was called to wake up the local cpu
 
 Next three are statistics describing scheduling latency:
+
      7) sum of all time spent running by tasks on this processor (in jiffies)
      8) sum of all time spent waiting to run by tasks on this processor (in
         jiffies)
@@ -67,24 +75,23 @@
 The next 24 are a variety of load_balance() statistics in grouped into types
 of idleness (idle, busy, and newly idle):
 
-     1) # of times in this domain load_balance() was called when the
+    1)  # of times in this domain load_balance() was called when the
         cpu was idle
-     2) # of times in this domain load_balance() checked but found
+    2)  # of times in this domain load_balance() checked but found
         the load did not require balancing when the cpu was idle
-     3) # of times in this domain load_balance() tried to move one or
+    3)  # of times in this domain load_balance() tried to move one or
         more tasks and failed, when the cpu was idle
-     4) sum of imbalances discovered (if any) with each call to
+    4)  sum of imbalances discovered (if any) with each call to
         load_balance() in this domain when the cpu was idle
-     5) # of times in this domain pull_task() was called when the cpu
+    5)  # of times in this domain pull_task() was called when the cpu
         was idle
-     6) # of times in this domain pull_task() was called even though
+    6)  # of times in this domain pull_task() was called even though
         the target task was cache-hot when idle
-     7) # of times in this domain load_balance() was called but did
+    7)  # of times in this domain load_balance() was called but did
         not find a busier queue while the cpu was idle
-     8) # of times in this domain a busier queue was found while the
+    8)  # of times in this domain a busier queue was found while the
         cpu was idle but no busier group was found
-
-     9) # of times in this domain load_balance() was called when the
+    9)  # of times in this domain load_balance() was called when the
         cpu was busy
     10) # of times in this domain load_balance() checked but found the
         load did not require balancing when busy
@@ -117,21 +124,25 @@
         was just becoming idle but no busier group was found
 
    Next three are active_load_balance() statistics:
+
     25) # of times active_load_balance() was called
     26) # of times active_load_balance() tried to move a task and failed
     27) # of times active_load_balance() successfully moved a task
 
    Next three are sched_balance_exec() statistics:
+
     28) sbe_cnt is not used
     29) sbe_balanced is not used
     30) sbe_pushed is not used
 
    Next three are sched_balance_fork() statistics:
+
     31) sbf_cnt is not used
     32) sbf_balanced is not used
     33) sbf_pushed is not used
 
    Next three are try_to_wake_up() statistics:
+
     34) # of times in this domain try_to_wake_up() awoke a task that
         last ran on a different cpu in this domain
     35) # of times in this domain try_to_wake_up() moved a task to the
@@ -139,10 +150,11 @@
     36) # of times in this domain try_to_wake_up() started passive balancing
 
 /proc/<pid>/schedstat
-----------------
+---------------------
 schedstats also adds a new /proc/<pid>/schedstat file to include some of
 the same information on a per-process level.  There are three fields in
 this file correlating for that process to:
+
      1) time spent on the cpu
      2) time spent waiting on a runqueue
      3) # of timeslices run on this cpu
@@ -151,4 +163,5 @@
 report on how well a particular process or set of processes is faring
 under the scheduler's policies.  A simple version of such a program is
 available at
+
     http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c

diff --git a/Documentation/scheduler/text_files.rst b/Documentation/scheduler/text_files.rst
new file mode 100644
index 0000000..0bc5030
--- /dev/null
+++ b/Documentation/scheduler/text_files.rst

@@ -0,0 +1,5 @@
+Scheduler pelt c program
+------------------------
+
+.. literalinclude:: sched-pelt.c
+    :language: c
commit	0f672f6c0b52b7b0700b0915c72b540721af4465	[log] [tgz]
author	David Brazdil <dbrazdil@google.com>	Tue Dec 10 10:32:29 2019 +0000
committer	David Brazdil <dbrazdil@google.com>	Tue Dec 10 19:03:18 2019 +0000
tree	85c8cba019caa205e4f8920d72d93f6d6deaf29c
parent	3a0ad55d848b50499b68d7141d4eca997fce28ef [diff]