Update Linux to v5.4.2

Change-Id: Idf6911045d9d382da2cfe01b1edff026404ac8fd
diff --git a/Documentation/accounting/cgroupstats.txt b/Documentation/accounting/cgroupstats.rst
similarity index 77%
rename from Documentation/accounting/cgroupstats.txt
rename to Documentation/accounting/cgroupstats.rst
index d16a984..b9afc48 100644
--- a/Documentation/accounting/cgroupstats.txt
+++ b/Documentation/accounting/cgroupstats.rst
@@ -1,3 +1,7 @@
+==================
+Control Groupstats
+==================
+
 Control Groupstats is inspired by the discussion at
 http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as
 suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
@@ -19,9 +23,9 @@
 information will not be available.
 
 To extract cgroup statistics a utility very similar to getdelays.c
-has been developed, the sample output of the utility is shown below
+has been developed; sample output of the utility is shown below::
 
-~/balbir/cgroupstats # ./getdelays  -C "/sys/fs/cgroup/a"
-sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
-~/balbir/cgroupstats # ./getdelays  -C "/sys/fs/cgroup"
-sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
+  ~/balbir/cgroupstats # ./getdelays  -C "/sys/fs/cgroup/a"
+  sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+  ~/balbir/cgroupstats # ./getdelays  -C "/sys/fs/cgroup"
+  sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.rst
similarity index 77%
rename from Documentation/accounting/delay-accounting.txt
rename to Documentation/accounting/delay-accounting.rst
index 042ea59..7cc7f58 100644
--- a/Documentation/accounting/delay-accounting.txt
+++ b/Documentation/accounting/delay-accounting.rst
@@ -1,5 +1,6 @@
+================
 Delay accounting
-----------------
+================
 
 Tasks encounter delays in execution when they wait
 for some kernel resource to become available e.g. a
@@ -39,7 +40,9 @@
 generic data structure to userspace corresponding to per-pid and per-tgid
 statistics. The delay accounting functionality populates specific fields of
 this structure. See
+
      include/linux/taskstats.h
+
 for a description of the fields pertaining to delay accounting.
 It will generally be in the form of counters returning the cumulative
 delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
@@ -61,13 +64,16 @@
 Usage
 -----
 
-Compile the kernel with
+Compile the kernel with::
+
 	CONFIG_TASK_DELAY_ACCT=y
 	CONFIG_TASKSTATS=y
 
 Delay accounting is enabled by default at boot up.
-To disable, add
+To disable, add::
+
    nodelayacct
+
 to the kernel boot options. The rest of the instructions
 below assume this has not been done.
 
@@ -78,40 +84,43 @@
 executed and the corresponding delays to be
 seen.
 
-General format of the getdelays command
+General format of the getdelays command::
 
-getdelays [-t tgid] [-p pid] [-c cmd...]
+	getdelays [-t tgid] [-p pid] [-c cmd...]
 
 
-Get delays, since system boot, for pid 10
-# ./getdelays -p 10
-(output similar to next case)
+Get delays, since system boot, for pid 10::
 
-Get sum of delays, since system boot, for all pids with tgid 5
-# ./getdelays -t 5
+	# ./getdelays -p 10
+	(output similar to next case)
+
+Get sum of delays, since system boot, for all pids with tgid 5::
+
+	# ./getdelays -t 5
 
 
-CPU	count	real total	virtual total	delay total
-	7876	92005750	100000000	24001500
-IO	count	delay total
-	0	0
-SWAP	count	delay total
-	0	0
-RECLAIM	count	delay total
-	0	0
+	CPU	count	real total	virtual total	delay total
+		7876	92005750	100000000	24001500
+	IO	count	delay total
+		0	0
+	SWAP	count	delay total
+		0	0
+	RECLAIM	count	delay total
+		0	0
 
-Get delays seen in executing a given simple command
-# ./getdelays -c ls /
+Get delays seen in executing a given simple command::
 
-bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
-boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var
+  # ./getdelays -c ls /
+
+  bin   data1  data3  data5  dev  home  media  opt   root  srv        sys  usr
+  boot  data2  data4  data6  etc  lib   mnt    proc  sbin  subdomain  tmp  var
 
 
-CPU	count	real total	virtual total	delay total
+  CPU	count	real total	virtual total	delay total
 	6	4000250		4000000		0
-IO	count	delay total
+  IO	count	delay total
 	0	0
-SWAP	count	delay total
+  SWAP	count	delay total
 	0	0
-RECLAIM	count	delay total
+  RECLAIM	count	delay total
 	0	0
diff --git a/Documentation/accounting/index.rst b/Documentation/accounting/index.rst
new file mode 100644
index 0000000..9369d8b
--- /dev/null
+++ b/Documentation/accounting/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Accounting
+==========
+
+.. toctree::
+   :maxdepth: 1
+
+   cgroupstats
+   delay-accounting
+   psi
+   taskstats
+   taskstats-struct
diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
new file mode 100644
index 0000000..621111c
--- /dev/null
+++ b/Documentation/accounting/psi.rst
@@ -0,0 +1,182 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact they have on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such::
+
+	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO::
+
+	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+	full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
+
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger, the user has to open the psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used::
+
+	<some|full> <stall amount in us> <time window in us>
+
+For example, writing "some 150000 1000000" into /proc/pressure/memory
+would add a 150ms threshold for partial memory stall measured within
+a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
+would add a 50ms threshold for full io stall measured within a 1sec time window.
+
+Triggers can be set on more than one psi metric and more than one trigger
+for the same psi metric can be specified. However for each trigger a separate
+file descriptor is required to be able to poll it separately from others,
+therefore for each trigger a separate open() syscall should be made even
+when opening the same psi interface file.
+
+Monitors activate only when the system enters the stall state for the monitored
+psi metric and deactivate upon exit from the stall state. While the system is
+in the stall state, psi signal growth is monitored at a rate of 10 times per
+tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, so the minimum
+monitoring update interval is 50ms and the maximum is 1s. The minimum is set to
+prevent overly frequent polling. The maximum is chosen as a high enough number
+after which monitors are most likely not needed and psi averages can be used
+instead.
+
+When activated, a psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when the system is
+bouncing in and out of the stall state.
+
+Notifications to userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+::
+
+  #include <errno.h>
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <poll.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /*
+   * Monitor memory partial stall with 1s tracking window size
+   * and 150ms threshold.
+   */
+  int main() {
+	const char trig[] = "some 150000 1000000";
+	struct pollfd fds;
+	int n;
+
+	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+	if (fds.fd < 0) {
+		printf("/proc/pressure/memory open error: %s\n",
+			strerror(errno));
+		return 1;
+	}
+	fds.events = POLLPRI;
+
+	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+		printf("/proc/pressure/memory write error: %s\n",
+			strerror(errno));
+		return 1;
+	}
+
+	printf("waiting for events...\n");
+	while (1) {
+		n = poll(&fds, 1, -1);
+		if (n < 0) {
+			printf("poll error: %s\n", strerror(errno));
+			return 1;
+		}
+		if (fds.revents & POLLERR) {
+			printf("got POLLERR, event source is gone\n");
+			return 0;
+		}
+		if (fds.revents & POLLPRI) {
+			printf("event triggered!\n");
+		} else {
+			printf("unknown event received: 0x%x\n", fds.revents);
+			return 1;
+		}
+	}
+
+	return 0;
+  }
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
+
+Per-cgroup psi monitors can be specified and used the same way as
+system-wide ones.
diff --git a/Documentation/accounting/taskstats-struct.txt b/Documentation/accounting/taskstats-struct.rst
similarity index 78%
rename from Documentation/accounting/taskstats-struct.txt
rename to Documentation/accounting/taskstats-struct.rst
index e7512c0..ca90fd4 100644
--- a/Documentation/accounting/taskstats-struct.txt
+++ b/Documentation/accounting/taskstats-struct.rst
@@ -1,5 +1,6 @@
+====================
 The struct taskstats
---------------------
+====================
 
 This document contains an explanation of the struct taskstats fields.
 
@@ -10,16 +11,24 @@
     the common fields and basic accounting fields are collected for
     delivery at do_exit() of a task.
 2) Delay accounting fields
-    These fields are placed between
-    /* Delay accounting fields start */
-    and
-    /* Delay accounting fields end */
+    These fields are placed between::
+
+	/* Delay accounting fields start */
+
+    and::
+
+	/* Delay accounting fields end */
+
     Their values are collected if CONFIG_TASK_DELAY_ACCT is set.
 3) Extended accounting fields
-    These fields are placed between
-    /* Extended accounting fields start */
-    and
-    /* Extended accounting fields end */
+    These fields are placed between::
+
+	/* Extended accounting fields start */
+
+    and::
+
+	/* Extended accounting fields end */
+
     Their values are collected if CONFIG_TASK_XACCT is set.
 
 4) Per-task and per-thread context switch count statistics
@@ -31,31 +40,33 @@
 Future extension should add fields to the end of the taskstats struct, and
 should not change the relative position of each field within the struct.
 
+::
 
-struct taskstats {
+  struct taskstats {
 
-1) Common and basic accounting fields:
+1) Common and basic accounting fields::
+
 	/* The version number of this struct. This field is always set to
 	 * TAKSTATS_VERSION, which is defined in <linux/taskstats.h>.
 	 * Each time the struct is changed, the value should be incremented.
 	 */
 	__u16	version;
 
-  	/* The exit code of a task. */
+	/* The exit code of a task. */
 	__u32	ac_exitcode;		/* Exit status */
 
-  	/* The accounting flags of a task as defined in <linux/acct.h>
+	/* The accounting flags of a task as defined in <linux/acct.h>
 	 * Defined values are AFORK, ASU, ACOMPAT, ACORE, and AXSIG.
 	 */
 	__u8	ac_flag;		/* Record flags */
 
-  	/* The value of task_nice() of a task. */
+	/* The value of task_nice() of a task. */
 	__u8	ac_nice;		/* task_nice */
 
-  	/* The name of the command that started this task. */
+	/* The name of the command that started this task. */
 	char	ac_comm[TS_COMM_LEN];	/* Command name */
 
-  	/* The scheduling discipline as set in task->policy field. */
+	/* The scheduling discipline as set in task->policy field. */
 	__u8	ac_sched;		/* Scheduling discipline */
 
 	__u8	ac_pad[3];
@@ -64,26 +75,27 @@
 	__u32	ac_pid;			/* Process ID */
 	__u32	ac_ppid;		/* Parent process ID */
 
-  	/* The time when a task begins, in [secs] since 1970. */
+	/* The time when a task begins, in [secs] since 1970. */
 	__u32	ac_btime;		/* Begin time [sec since 1970] */
 
-  	/* The elapsed time of a task, in [usec]. */
+	/* The elapsed time of a task, in [usec]. */
 	__u64	ac_etime;		/* Elapsed time [usec] */
 
-  	/* The user CPU time of a task, in [usec]. */
+	/* The user CPU time of a task, in [usec]. */
 	__u64	ac_utime;		/* User CPU time [usec] */
 
-  	/* The system CPU time of a task, in [usec]. */
+	/* The system CPU time of a task, in [usec]. */
 	__u64	ac_stime;		/* System CPU time [usec] */
 
-  	/* The minor page fault count of a task, as set in task->min_flt. */
+	/* The minor page fault count of a task, as set in task->min_flt. */
 	__u64	ac_minflt;		/* Minor Page Fault Count */
 
 	/* The major page fault count of a task, as set in task->maj_flt. */
 	__u64	ac_majflt;		/* Major Page Fault Count */
 
 
-2) Delay accounting fields:
+2) Delay accounting fields::
+
 	/* Delay accounting fields start
 	 *
 	 * All values, until the comment "Delay accounting fields end" are
@@ -134,7 +146,8 @@
 	/* version 1 ends here */
 
 
-3) Extended accounting fields
+3) Extended accounting fields::
+
 	/* Extended accounting fields start */
 
 	/* Accumulated RSS usage in duration of a task, in MBytes-usecs.
@@ -145,15 +158,15 @@
 	 */
 	__u64	coremem;		/* accumulated RSS usage in MB-usec */
 
-  	/* Accumulated virtual memory usage in duration of a task.
+	/* Accumulated virtual memory usage in duration of a task.
 	 * Same as acct_rss_mem1 above except that we keep track of VM usage.
 	 */
 	__u64	virtmem;		/* accumulated VM usage in MB-usec */
 
-  	/* High watermark of RSS usage in duration of a task, in KBytes. */
+	/* High watermark of RSS usage in duration of a task, in KBytes. */
 	__u64	hiwater_rss;		/* High-watermark of RSS usage */
 
-  	/* High watermark of VM  usage in duration of a task, in KBytes. */
+	/* High watermark of VM  usage in duration of a task, in KBytes. */
 	__u64	hiwater_vm;		/* High-water virtual memory usage */
 
 	/* The following four fields are I/O statistics of a task. */
@@ -164,17 +177,23 @@
 
 	/* Extended accounting fields end */
 
-4) Per-task and per-thread statistics
+4) Per-task and per-thread statistics::
+
 	__u64	nvcsw;			/* Context voluntary switch counter */
 	__u64	nivcsw;			/* Context involuntary switch counter */
 
-5) Time accounting for SMT machines
+5) Time accounting for SMT machines::
+
 	__u64	ac_utimescaled;		/* utime scaled on frequency etc */
 	__u64	ac_stimescaled;		/* stime scaled on frequency etc */
 	__u64	cpu_scaled_run_real_total; /* scaled cpu_run_real_total */
 
-6) Extended delay accounting fields for memory reclaim
+6) Extended delay accounting fields for memory reclaim::
+
 	/* Delay waiting for memory reclaim */
 	__u64	freepages_count;
 	__u64	freepages_delay_total;
-}
+
+::
+
+  }
diff --git a/Documentation/accounting/taskstats.txt b/Documentation/accounting/taskstats.rst
similarity index 95%
rename from Documentation/accounting/taskstats.txt
rename to Documentation/accounting/taskstats.rst
index ff06b73..2a28b7f 100644
--- a/Documentation/accounting/taskstats.txt
+++ b/Documentation/accounting/taskstats.rst
@@ -1,5 +1,6 @@
+=============================
 Per-task statistics interface
------------------------------
+=============================
 
 
 Taskstats is a netlink-based interface for sending per-task and
@@ -65,7 +66,7 @@
 
 The data exchanged between user and kernel space is a netlink message belonging
 to the NETLINK_GENERIC family and using the netlink attributes interface.
-The messages are in the format
+The messages are in the format::
 
     +----------+- - -+-------------+-------------------+
     | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
@@ -167,15 +168,13 @@
 To avoid losing statistics, userspace should do one or more of the following:
 
 - increase the receive buffer sizes for the netlink sockets opened by
-listeners to receive exit data.
+  listeners to receive exit data.
 
 - create more listeners and reduce the number of cpus being listened to by
-each listener. In the extreme case, there could be one listener for each cpu.
-Users may also consider setting the cpu affinity of the listener to the subset
-of cpus to which it listens, especially if they are listening to just one cpu.
+  each listener. In the extreme case, there could be one listener for each cpu.
+  Users may also consider setting the cpu affinity of the listener to the subset
+  of cpus to which it listens, especially if they are listening to just one cpu.
 
 Despite these measures, if the userspace receives ENOBUFS error messages
 indicated overflow of receive buffers, it should take measures to handle the
 loss of data.
-
-----