Update Linux to v5.10.109
Sourced from [1]
[1] https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.10.109.tar.xz
Change-Id: I19bca9fc6762d4e63bcf3e4cba88bbe560d9c76c
Signed-off-by: Olivier Deprez <olivier.deprez@arm.com>
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst
new file mode 100644
index 0000000..89b6433
--- /dev/null
+++ b/Documentation/trace/boottime-trace.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Boot-time tracing
+=================
+
+:Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+Overview
+========
+
+Boot-time tracing allows users to trace boot-time process including
+device initialization with full features of ftrace including per-event
+filter and actions, histograms, kprobe-events and synthetic-events,
+and trace instances.
+Since kernel command line is not enough to control these complex features,
+this uses bootconfig file to describe tracing feature programming.
+
+Options in the Boot Config
+==========================
+
+Here is the list of available options list for boot time tracing in
+boot config file [1]_. All options are under "ftrace." or "kernel."
+prefix. See kernel parameters for the options which starts
+with "kernel." prefix [2]_.
+
+.. [1] See :ref:`Documentation/admin-guide/bootconfig.rst <bootconfig>`
+.. [2] See :ref:`Documentation/admin-guide/kernel-parameters.rst <kernelparameters>`
+
+Ftrace Global Options
+---------------------
+
+Ftrace global options have "kernel." prefix in boot config, which means
+these options are passed as a part of kernel legacy command line.
+
+kernel.tp_printk
+ Output trace-event data on printk buffer too.
+
+kernel.dump_on_oops [= MODE]
+ Dump ftrace on Oops. If MODE = 1 or omitted, dump trace buffer
+ on all CPUs. If MODE = 2, dump a buffer on a CPU which kicks Oops.
+
+kernel.traceoff_on_warning
+ Stop tracing if WARN_ON() occurs.
+
+kernel.fgraph_max_depth = MAX_DEPTH
+ Set MAX_DEPTH to maximum depth of fgraph tracer.
+
+kernel.fgraph_filters = FILTER[, FILTER2...]
+ Add fgraph tracing function filters.
+
+kernel.fgraph_notraces = FILTER[, FILTER2...]
+ Add fgraph non-tracing function filters.
+
+
+Ftrace Per-instance Options
+---------------------------
+
+These options can be used for each instance including global ftrace node.
+
+ftrace.[instance.INSTANCE.]options = OPT1[, OPT2[...]]
+ Enable given ftrace options.
+
+ftrace.[instance.INSTANCE.]tracing_on = 0|1
+ Enable/Disable tracing on this instance when starting boot-time tracing.
+ (you can enable it by the "traceon" event trigger action)
+
+ftrace.[instance.INSTANCE.]trace_clock = CLOCK
+ Set given CLOCK to ftrace's trace_clock.
+
+ftrace.[instance.INSTANCE.]buffer_size = SIZE
+ Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
+ for that SIZE.
+
+ftrace.[instance.INSTANCE.]alloc_snapshot
+ Allocate snapshot buffer.
+
+ftrace.[instance.INSTANCE.]cpumask = CPUMASK
+ Set CPUMASK as trace cpu-mask.
+
+ftrace.[instance.INSTANCE.]events = EVENT[, EVENT2[...]]
+ Enable given events on boot. You can use a wild card in EVENT.
+
+ftrace.[instance.INSTANCE.]tracer = TRACER
+ Set TRACER to current tracer on boot. (e.g. function)
+
+ftrace.[instance.INSTANCE.]ftrace.filters
+ This will take an array of tracing function filter rules.
+
+ftrace.[instance.INSTANCE.]ftrace.notraces
+ This will take an array of NON-tracing function filter rules.
+
+
+Ftrace Per-Event Options
+------------------------
+
+These options are setting per-event options.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.enable
+ Enable GROUP:EVENT tracing.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.filter = FILTER
+ Set FILTER rule to the GROUP:EVENT.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.actions = ACTION[, ACTION2[...]]
+ Set ACTIONs to the GROUP:EVENT.
+
+ftrace.[instance.INSTANCE.]event.kprobes.EVENT.probes = PROBE[, PROBE2[...]]
+ Defines new kprobe event based on PROBEs. It is able to define
+ multiple probes on one event, but those must have same type of
+ arguments. This option is available only for the event which
+ group name is "kprobes".
+
+ftrace.[instance.INSTANCE.]event.synthetic.EVENT.fields = FIELD[, FIELD2[...]]
+ Defines new synthetic event with FIELDs. Each field should be
+ "type varname".
+
+Note that kprobe and synthetic event definitions can be written under
+instance node, but those are also visible from other instances. So please
+take care for event name conflict.
+
+
+When to Start
+=============
+
+All boot-time tracing options starting with ``ftrace`` will be enabled at the
+end of core_initcall. This means you can trace the events from postcore_initcall.
+Most of the subsystems and architecture dependent drivers will be initialized
+after that (arch_initcall or subsys_initcall). Thus, you can trace those with
+boot-time tracing.
+If you want to trace events before core_initcall, you can use the options
+starting with ``kernel``. Some of them will be enabled eariler than the initcall
+processing (for example,. ``kernel.ftrace=function`` and ``kernel.trace_event``
+will start before the initcall.)
+
+
+Examples
+========
+
+For example, to add filter and actions for each event, define kprobe
+events, and synthetic events with histogram, write a boot config like
+below::
+
+ ftrace.event {
+ task.task_newtask {
+ filter = "pid < 128"
+ enable
+ }
+ kprobes.vfs_read {
+ probes = "vfs_read $arg1 $arg2"
+ filter = "common_pid < 200"
+ enable
+ }
+ synthetic.initcall_latency {
+ fields = "unsigned long func", "u64 lat"
+ actions = "hist:keys=func.sym,lat:vals=lat:sort=lat"
+ }
+ initcall.initcall_start {
+ actions = "hist:keys=func:ts0=common_timestamp.usecs"
+ }
+ initcall.initcall_finish {
+ actions = "hist:keys=func:lat=common_timestamp.usecs-$ts0:onmatch(initcall.initcall_start).initcall_latency(func,$lat)"
+ }
+ }
+
+Also, boot-time tracing supports "instance" node, which allows us to run
+several tracers for different purpose at once. For example, one tracer
+is for tracing functions starting with "user\_", and others tracing
+"kernel\_" functions, you can write boot config as below::
+
+ ftrace.instance {
+ foo {
+ tracer = "function"
+ ftrace.filters = "user_*"
+ }
+ bar {
+ tracer = "function"
+ ftrace.filters = "kernel_*"
+ }
+ }
+
+The instance node also accepts event nodes so that each instance
+can customize its event tracing.
+
+With the trigger action and kprobes, you can trace function-graph while
+a function is called. For example, this will trace all function calls in
+the pci_proc_init()::
+
+ ftrace {
+ tracing_on = 0
+ tracer = function_graph
+ event.kprobes {
+ start_event {
+ probes = "pci_proc_init"
+ actions = "traceon"
+ }
+ end_event {
+ probes = "pci_proc_init%return"
+ actions = "traceoff"
+ }
+ }
+ }
+
+
+This boot-time tracing also supports ftrace kernel parameters via boot
+config.
+For example, following kernel parameters::
+
+ trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M ftrace=function ftrace_filter="vfs*"
+
+This can be written in boot config like below::
+
+ kernel {
+ trace_options = sym-addr
+ trace_event = "initcall:*"
+ tp_printk
+ trace_buf_size = 1M
+ ftrace = function
+ ftrace_filter = "vfs*"
+ }
+
+Note that parameters start with "kernel" prefix instead of "ftrace".
diff --git a/Documentation/trace/coresight-cpu-debug.rst b/Documentation/trace/coresight/coresight-cpu-debug.rst
similarity index 100%
rename from Documentation/trace/coresight-cpu-debug.rst
rename to Documentation/trace/coresight/coresight-cpu-debug.rst
diff --git a/Documentation/trace/coresight/coresight-ect.rst b/Documentation/trace/coresight/coresight-ect.rst
new file mode 100644
index 0000000..a68732c
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-ect.rst
@@ -0,0 +1,226 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+CoreSight Embedded Cross Trigger (CTI & CTM).
+=============================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: November 2019
+
+Hardware Description
+--------------------
+
+The CoreSight Cross Trigger Interface (CTI) is a hardware device that takes
+individual input and output hardware signals known as triggers to and from
+devices and interconnects them via the Cross Trigger Matrix (CTM) to other
+devices via numbered channels, in order to propagate events between devices.
+
+e.g.::
+
+ 0000000 in_trigs :::::::
+ 0 C 0----------->: : +======>(other CTI channel IO)
+ 0 P 0<-----------: : v
+ 0 U 0 out_trigs : : Channels ***** :::::::
+ 0000000 : CTI :<=========>*CTM*<====>: CTI :---+
+ ####### in_trigs : : (id 0-3) ***** ::::::: v
+ # ETM #----------->: : ^ #######
+ # #<-----------: : +---# ETR #
+ ####### out_trigs ::::::: #######
+
+The CTI driver enables the programming of the CTI to attach triggers to
+channels. When an input trigger becomes active, the attached channel will
+become active. Any output trigger attached to that channel will also
+become active. The active channel is propagated to other CTIs via the CTM,
+activating connected output triggers there, unless filtered by the CTI
+channel gate.
+
+It is also possible to activate a channel using system software directly
+programming registers in the CTI.
+
+The CTIs are registered by the system to be associated with CPUs and/or other
+CoreSight devices on the trace data path. When these devices are enabled the
+attached CTIs will also be enabled. By default/on power up the CTIs have
+no programmed trigger/channel attachments, so will not affect the system
+until explicitly programmed.
+
+The hardware trigger connections between CTIs and devices is implementation
+defined, unless the CPU/ETM combination is a v8 architecture, in which case
+the connections have an architecturally defined standard layout.
+
+The hardware trigger signals can also be connected to non-CoreSight devices
+(e.g. UART), or be propagated off chip as hardware IO lines.
+
+All the CTI devices are associated with a CTM. On many systems there will be a
+single effective CTM (one CTM, or multiple CTMs all interconnected), but it is
+possible that systems can have nets of CTIs+CTM that are not interconnected by
+a CTM to each other. On these systems a CTM index is declared to associate
+CTI devices that are interconnected via a given CTM.
+
+Sysfs files and directories
+---------------------------
+
+The CTI devices appear on the existing CoreSight bus alongside the other
+CoreSight devices::
+
+ >$ ls /sys/bus/coresight/devices
+ cti_cpu0 cti_cpu2 cti_sys0 etm0 etm2 funnel0 replicator0 tmc_etr0
+ cti_cpu1 cti_cpu3 cti_sys1 etm1 etm3 funnel1 tmc_etf0 tpiu0
+
+The ``cti_cpu<N>`` named CTIs are associated with a CPU, and any ETM used by
+that core. The ``cti_sys<N>`` CTIs are general system infrastructure CTIs that
+can be associated with other CoreSight devices, or other system hardware
+capable of generating or using trigger signals.::
+
+ >$ ls /sys/bus/coresight/devices/etm0/cti_cpu0
+ channels ctmid enable nr_trigger_cons mgmt power powered regs
+ connections subsystem triggers0 triggers1 uevent
+
+*Key file items are:-*
+ * ``enable``: enables/disables the CTI. Read to determine current state.
+ If this shows as enabled (1), but ``powered`` shows unpowered (0), then
+ the enable indicates a request to enabled when the device is powered.
+ * ``ctmid`` : associated CTM - only relevant if system has multiple CTI+CTM
+ clusters that are not interconnected.
+ * ``nr_trigger_cons`` : total connections - triggers<N> directories.
+ * ``powered`` : Read to determine if the CTI is currently powered.
+
+*Sub-directories:-*
+ * ``triggers<N>``: contains list of triggers for an individual connection.
+ * ``channels``: Contains the channel API - CTI main programming interface.
+ * ``regs``: Gives access to the raw programmable CTI regs.
+ * ``mgmt``: the standard CoreSight management registers.
+ * ``connections``: Links to connected *CoreSight* devices. The number of
+ links can be 0 to ``nr_trigger_cons``. Actual number given by ``nr_links``
+ in this directory.
+
+
+triggers<N> directories
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Individual trigger connection information. This describes trigger signals for
+CoreSight and non-CoreSight connections.
+
+Each triggers directory has a set of parameters describing the triggers for
+the connection.
+
+ * ``name`` : name of connection
+ * ``in_signals`` : input trigger signal indexes used in this connection.
+ * ``in_types`` : functional types for in signals.
+ * ``out_signals`` : output trigger signals for this connection.
+ * ``out_types`` : functional types for out signals.
+
+e.g::
+
+ >$ ls ./cti_cpu0/triggers0/
+ in_signals in_types name out_signals out_types
+ >$ cat ./cti_cpu0/triggers0/name
+ cpu0
+ >$ cat ./cti_cpu0/triggers0/out_signals
+ 0-2
+ >$ cat ./cti_cpu0/triggers0/out_types
+ pe_edbgreq pe_dbgrestart pe_ctiirq
+ >$ cat ./cti_cpu0/triggers0/in_signals
+ 0-1
+ >$ cat ./cti_cpu0/triggers0/in_types
+ pe_dbgtrigger pe_pmuirq
+
+If a connection has zero signals in either the 'in' or 'out' triggers then
+those parameters will be omitted.
+
+Channels API Directory
+~~~~~~~~~~~~~~~~~~~~~~
+
+This provides an easy way to attach triggers to channels, without needing
+the multiple register operations that are required if manipulating the
+'regs' sub-directory elements directly.
+
+A number of files provide this API::
+
+ >$ ls ./cti_sys0/channels/
+ chan_clear chan_inuse chan_xtrigs_out trigin_attach
+ chan_free chan_pulse chan_xtrigs_reset trigin_detach
+ chan_gate_disable chan_set chan_xtrigs_sel trigout_attach
+ chan_gate_enable chan_xtrigs_in trig_filter_enable trigout_detach
+ trigout_filtered
+
+Most access to these elements take the form::
+
+ echo <chan> [<trigger>] > /<device_path>/<operation>
+
+where the optional <trigger> is only needed for trigXX_attach | detach
+operations.
+
+e.g.::
+
+ >$ echo 0 1 > ./cti_sys0/channels/trigout_attach
+ >$ echo 0 > ./cti_sys0/channels/chan_set
+
+Attaches trigout(1) to channel(0), then activates channel(0) generating a
+set state on cti_sys0.trigout(1)
+
+
+*API operations*
+
+ * ``trigin_attach, trigout_attach``: Attach a channel to a trigger signal.
+ * ``trigin_detach, trigout_detach``: Detach a channel from a trigger signal.
+ * ``chan_set``: Set the channel - the set state will be propagated around
+ the CTM to other connected devices.
+ * ``chan_clear``: Clear the channel.
+ * ``chan_pulse``: Set the channel for a single CoreSight clock cycle.
+ * ``chan_gate_enable``: Write operation sets the CTI gate to propagate
+ (enable) the channel to other devices. This operation takes a channel
+ number. CTI gate is enabled for all channels by default at power up. Read
+ to list the currently enabled channels on the gate.
+ * ``chan_gate_disable``: Write channel number to disable gate for that
+ channel.
+ * ``chan_inuse``: Show the current channels attached to any signal
+ * ``chan_free``: Show channels with no attached signals.
+ * ``chan_xtrigs_sel``: write a channel number to select a channel to view,
+ read to show the selected channel number.
+ * ``chan_xtrigs_in``: Read to show the input triggers attached to
+ the selected view channel.
+ * ``chan_xtrigs_out``:Read to show the output triggers attached to
+ the selected view channel.
+ * ``trig_filter_enable``: Defaults to enabled, disable to allow potentially
+ dangerous output signals to be set.
+ * ``trigout_filtered``: Trigger out signals that are prevented from being
+ set if filtering ``trig_filter_enable`` is enabled. One use is to prevent
+ accidental ``EDBGREQ`` signals stopping a core.
+ * ``chan_xtrigs_reset``: Write 1 to clear all channel / trigger programming.
+ Resets device hardware to default state.
+
+
+The example below attaches input trigger index 1 to channel 2, and output
+trigger index 6 to the same channel. It then examines the state of the
+channel / trigger connections using the appropriate sysfs attributes.
+
+The settings mean that if either input trigger 1, or channel 2 go active then
+trigger out 6 will go active. We then enable the CTI, and use the software
+channel control to activate channel 2. We see the active channel on the
+``choutstatus`` register and the active signal on the ``trigoutstatus``
+register. Finally clearing the channel removes this.
+
+e.g.::
+
+ .../cti_sys0/channels# echo 2 1 > trigin_attach
+ .../cti_sys0/channels# echo 2 6 > trigout_attach
+ .../cti_sys0/channels# cat chan_free
+ 0-1,3
+ .../cti_sys0/channels# cat chan_inuse
+ 2
+ .../cti_sys0/channels# echo 2 > chan_xtrigs_sel
+ .../cti_sys0/channels# cat chan_xtrigs_trigin
+ 1
+ .../cti_sys0/channels# cat chan_xtrigs_trigout
+ 6
+ .../cti_sys0/# echo 1 > enable
+ .../cti_sys0/channels# echo 2 > chan_set
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x4
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x40
+ .../cti_sys0/channels# echo 2 > chan_clear
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x0
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x0
diff --git a/Documentation/trace/coresight/coresight-etm4x-reference.rst b/Documentation/trace/coresight/coresight-etm4x-reference.rst
new file mode 100644
index 0000000..b64d9a9
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-etm4x-reference.rst
@@ -0,0 +1,798 @@
+===============================================
+ETMv4 sysfs linux driver programming reference.
+===============================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: October 11th, 2019
+
+Supplement to existing ETMv4 driver documentation.
+
+Sysfs files and directories
+---------------------------
+
+Root: ``/sys/bus/coresight/devices/etm<N>``
+
+
+The following paragraphs explain the association between sysfs files and the
+ETMv4 registers that they effect. Note the register names are given without
+the ‘TRC’ prefix.
+
+----
+
+:File: ``mode`` (rw)
+:Trace Registers: {CONFIGR + others}
+:Notes:
+ Bit select trace features. See ‘mode’ section below. Bits
+ in this will cause equivalent programming of trace config and
+ other registers to enable the features requested.
+
+:Syntax & eg:
+ ``echo bitfield > mode``
+
+ bitfield up to 32 bits setting trace features.
+
+:Example:
+ ``$> echo 0x012 > mode``
+
+----
+
+:File: ``reset`` (wo)
+:Trace Registers: All
+:Notes:
+ Reset all programming to trace nothing / no logic programmed.
+
+:Syntax:
+ ``echo 1 > reset``
+
+----
+
+:File: ``enable_source`` (wo)
+:Trace Registers: PRGCTLR, All hardware regs.
+:Notes:
+ - > 0 : Programs up the hardware with the current values held in the driver
+ and enables trace.
+
+ - = 0 : disable trace hardware.
+
+:Syntax:
+ ``echo 1 > enable_source``
+
+----
+
+:File: ``cpu`` (ro)
+:Trace Registers: None.
+:Notes:
+ CPU ID that this ETM is attached to.
+
+:Example:
+ ``$> cat cpu``
+
+ ``$> 0``
+
+----
+
+:File: ``addr_idx`` (rw)
+:Trace Registers: None.
+:Notes:
+ Virtual register to index address comparator and range
+ features. Set index for first of the pair in a range.
+
+:Syntax:
+ ``echo idx > addr_idx``
+
+ Where idx < nr_addr_cmp x 2
+
+----
+
+:File: ``addr_range`` (rw)
+:Trace Registers: ACVR[idx, idx+1], VIIECTLR
+:Notes:
+ Pair of addresses for a range selected by addr_idx. Include
+ / exclude according to the optional parameter, or if omitted
+ uses the current ‘mode’ setting. Select comparator range in
+ control register. Error if index is odd value.
+
+:Depends: ``mode, addr_idx``
+:Syntax:
+ ``echo addr1 addr2 [exclude] > addr_range``
+
+ Where addr1 and addr2 define the range and addr1 < addr2.
+
+ Optional exclude value:-
+
+ - 0 for include
+ - 1 for exclude.
+:Example:
+ ``$> echo 0x0000 0x2000 0 > addr_range``
+
+----
+
+:File: ``addr_single`` (rw)
+:Trace Registers: ACVR[idx]
+:Notes:
+ Set a single address comparator according to addr_idx. This
+ is used if the address comparator is used as part of event
+ generation logic etc.
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo addr1 > addr_single``
+
+----
+
+:File: ``addr_start`` (rw)
+:Trace Registers: ACVR[idx], VISSCTLR
+:Notes:
+ Set a trace start address comparator according to addr_idx.
+ Select comparator in control register.
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo addr1 > addr_start``
+
+----
+
+:File: ``addr_stop`` (rw)
+:Trace Registers: ACVR[idx], VISSCTLR
+:Notes:
+ Set a trace stop address comparator according to addr_idx.
+ Select comparator in control register.
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo addr1 > addr_stop``
+
+----
+
+:File: ``addr_context`` (rw)
+:Trace Registers: ACATR[idx,{6:4}]
+:Notes:
+ Link context ID comparator to address comparator addr_idx
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo ctxt_idx > addr_context``
+
+ Where ctxt_idx is the index of the linked context id / vmid
+ comparator.
+
+----
+
+:File: ``addr_ctxtype`` (rw)
+:Trace Registers: ACATR[idx,{3:2}]
+:Notes:
+ Input value string. Set type for linked context ID comparator
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo type > addr_ctxtype``
+
+ Type one of {all, vmid, ctxid, none}
+:Example:
+ ``$> echo ctxid > addr_ctxtype``
+
+----
+
+:File: ``addr_exlevel_s_ns`` (rw)
+:Trace Registers: ACATR[idx,{14:8}]
+:Notes:
+ Set the ELx secure and non-secure matching bits for the
+ selected address comparator
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``echo val > addr_exlevel_s_ns``
+
+ val is a 7 bit value for exception levels to exclude. Input
+ value shifted to correct bits in register.
+:Example:
+ ``$> echo 0x4F > addr_exlevel_s_ns``
+
+----
+
+:File: ``addr_instdatatype`` (rw)
+:Trace Registers: ACATR[idx,{1:0}]
+:Notes:
+ Set the comparator address type for matching. Driver only
+ supports setting instruction address type.
+
+:Depends: ``addr_idx``
+
+----
+
+:File: ``addr_cmp_view`` (ro)
+:Trace Registers: ACVR[idx, idx+1], ACATR[idx], VIIECTLR
+:Notes:
+ Read the currently selected address comparator. If part of
+ address range then display both addresses.
+
+:Depends: ``addr_idx``
+:Syntax:
+ ``cat addr_cmp_view``
+:Example:
+ ``$> cat addr_cmp_view``
+
+ ``addr_cmp[0] range 0x0 0xffffffffffffffff include ctrl(0x4b00)``
+
+----
+
+:File: ``nr_addr_cmp`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of address comparator pairs
+
+----
+
+:File: ``sshot_idx`` (rw)
+:Trace Registers: None
+:Notes:
+ Select single shot register set.
+
+----
+
+:File: ``sshot_ctrl`` (rw)
+:Trace Registers: SSCCR[idx]
+:Notes:
+ Access a single shot comparator control register.
+
+:Depends: ``sshot_idx``
+:Syntax:
+ ``echo val > sshot_ctrl``
+
+ Writes val into the selected control register.
+
+----
+
+:File: ``sshot_status`` (ro)
+:Trace Registers: SSCSR[idx]
+:Notes:
+ Read a single shot comparator status register
+
+:Depends: ``sshot_idx``
+:Syntax:
+ ``cat sshot_status``
+
+ Read status.
+:Example:
+ ``$> cat sshot_status``
+
+ ``0x1``
+
+----
+
+:File: ``sshot_pe_ctrl`` (rw)
+:Trace Registers: SSPCICR[idx]
+:Notes:
+ Access a single shot PE comparator input control register.
+
+:Depends: ``sshot_idx``
+:Syntax:
+ ``echo val > sshot_pe_ctrl``
+
+ Writes val into the selected control register.
+
+----
+
+:File: ``ns_exlevel_vinst`` (rw)
+:Trace Registers: VICTLR{23:20}
+:Notes:
+ Program non-secure exception level filters. Set / clear NS
+ exception filter bits. Setting ‘1’ excludes trace from the
+ exception level.
+
+:Syntax:
+ ``echo bitfield > ns_exlevel_viinst``
+
+ Where bitfield contains bits to set clear for EL0 to EL2
+:Example:
+ ``%> echo 0x4 > ns_exlevel_viinst``
+
+ Excludes EL2 NS trace.
+
+----
+
+:File: ``vinst_pe_cmp_start_stop`` (rw)
+:Trace Registers: VIPCSSCTLR
+:Notes:
+ Access PE start stop comparator input control registers
+
+----
+
+:File: ``bb_ctrl`` (rw)
+:Trace Registers: BBCTLR
+:Notes:
+ Define ranges that Branch Broadcast will operate in.
+ Default (0x0) is all addresses.
+
+:Depends: BB enabled.
+
+----
+
+:File: ``cyc_threshold`` (rw)
+:Trace Registers: CCCTLR
+:Notes:
+ Set the threshold for which cycle counts will be emitted.
+ Error if attempt to set below minimum defined in IDR3, masked
+ to width of valid bits.
+
+:Depends: CC enabled.
+
+----
+
+:File: ``syncfreq`` (rw)
+:Trace Registers: SYNCPR
+:Notes:
+ Set trace synchronisation period. Power of 2 value, 0 (off)
+ or 8-20. Driver defaults to 12 (every 4096 bytes).
+
+----
+
+:File: ``cntr_idx`` (rw)
+:Trace Registers: none
+:Notes:
+ Select the counter to access
+
+:Syntax:
+ ``echo idx > cntr_idx``
+
+ Where idx < nr_cntr
+
+----
+
+:File: ``cntr_ctrl`` (rw)
+:Trace Registers: CNTCTLR[idx]
+:Notes:
+ Set counter control value.
+
+:Depends: ``cntr_idx``
+:Syntax:
+ ``echo val > cntr_ctrl``
+
+ Where val is per ETMv4 spec.
+
+----
+
+:File: ``cntrldvr`` (rw)
+:Trace Registers: CNTRLDVR[idx]
+:Notes:
+ Set counter reload value.
+
+:Depends: ``cntr_idx``
+:Syntax:
+ ``echo val > cntrldvr``
+
+ Where val is per ETMv4 spec.
+
+----
+
+:File: ``nr_cntr`` (ro)
+:Trace Registers: From IDR5
+
+:Notes:
+ Number of counters implemented.
+
+----
+
+:File: ``ctxid_idx`` (rw)
+:Trace Registers: None
+:Notes:
+ Select the context ID comparator to access
+
+:Syntax:
+ ``echo idx > ctxid_idx``
+
+ Where idx < numcidc
+
+----
+
+:File: ``ctxid_pid`` (rw)
+:Trace Registers: CIDCVR[idx]
+:Notes:
+ Set the context ID comparator value
+
+:Depends: ``ctxid_idx``
+
+----
+
+:File: ``ctxid_masks`` (rw)
+:Trace Registers: CIDCCTLR0, CIDCCTLR1, CIDCVR<0-7>
+:Notes:
+ Pair of values to set the byte masks for 1-8 context ID
+ comparators. Automatically clears masked bytes to 0 in CID
+ value registers.
+
+:Syntax:
+ ``echo m3m2m1m0 [m7m6m5m4] > ctxid_masks``
+
+ 32 bit values made up of mask bytes, where mN represents a
+ byte mask value for Context ID comparator N.
+
+ Second value not required on systems that have fewer than 4
+ context ID comparators
+
+----
+
+:File: ``numcidc`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of Context ID comparators
+
+----
+
+:File: ``vmid_idx`` (rw)
+:Trace Registers: None
+:Notes:
+ Select the VM ID comparator to access.
+
+:Syntax:
+ ``echo idx > vmid_idx``
+
+ Where idx < numvmidc
+
+----
+
+:File: ``vmid_val`` (rw)
+:Trace Registers: VMIDCVR[idx]
+:Notes:
+ Set the VM ID comparator value
+
+:Depends: ``vmid_idx``
+
+----
+
+:File: ``vmid_masks`` (rw)
+:Trace Registers: VMIDCCTLR0, VMIDCCTLR1, VMIDCVR<0-7>
+:Notes:
+ Pair of values to set the byte masks for 1-8 VM ID comparators.
+ Automatically clears masked bytes to 0 in VMID value registers.
+
+:Syntax:
+ ``echo m3m2m1m0 [m7m6m5m4] > vmid_masks``
+
+ Where mN represents a byte mask value for VMID comparator N.
+ Second value not required on systems that have fewer than 4
+ VMID comparators.
+
+----
+
+:File: ``numvmidc`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of VMID comparators
+
+----
+
+:File: ``res_idx`` (rw)
+:Trace Registers: None.
+:Notes:
+ Select the resource selector control to access. Must be 2 or
+ higher as selectors 0 and 1 are hardwired.
+
+:Syntax:
+ ``echo idx > res_idx``
+
+ Where 2 <= idx < nr_resource x 2
+
+----
+
+:File: ``res_ctrl`` (rw)
+:Trace Registers: RSCTLR[idx]
+:Notes:
+ Set resource selector control value. Value per ETMv4 spec.
+
+:Depends: ``res_idx``
+:Syntax:
+ ``echo val > res_cntr``
+
+ Where val is per ETMv4 spec.
+
+----
+
+:File: ``nr_resource`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of resource selector pairs
+
+----
+
+:File: ``event`` (rw)
+:Trace Registers: EVENTCTRL0R
+:Notes:
+ Set up to 4 implemented event fields.
+
+:Syntax:
+ ``echo ev3ev2ev1ev0 > event``
+
+ Where evN is an 8 bit event field. Up to 4 event fields make up the
+ 32-bit input value. Number of valid fields is implementation dependent,
+ defined in IDR0.
+
+----
+
+:File: ``event_instren`` (rw)
+:Trace Registers: EVENTCTRL1R
+:Notes:
+ Choose events which insert event packets into trace stream.
+
+:Depends: EVENTCTRL0R
+:Syntax:
+ ``echo bitfield > event_instren``
+
+ Where bitfield is up to 4 bits according to number of event fields.
+
+----
+
+:File: ``event_ts`` (rw)
+:Trace Registers: TSCTLR
+:Notes:
+ Set the event that will generate timestamp requests.
+
+:Depends: ``TS activated``
+:Syntax:
+ ``echo evfield > event_ts``
+
+ Where evfield is an 8 bit event selector.
+
+----
+
+:File: ``seq_idx`` (rw)
+:Trace Registers: None
+:Notes:
+ Sequencer event register select - 0 to 2
+
+----
+
+:File: ``seq_state`` (rw)
+:Trace Registers: SEQSTR
+:Notes:
+ Sequencer current state - 0 to 3.
+
+----
+
+:File: ``seq_event`` (rw)
+:Trace Registers: SEQEVR[idx]
+:Notes:
+ State transition event registers
+
+:Depends: ``seq_idx``
+:Syntax:
+ ``echo evBevF > seq_event``
+
+ Where evBevF is a 16 bit value made up of two event selectors,
+
+ - evB : back
+ - evF : forwards.
+
+----
+
+:File: ``seq_reset_event`` (rw)
+:Trace Registers: SEQRSTEVR
+:Notes:
+ Sequencer reset event
+
+:Syntax:
+ ``echo evfield > seq_reset_event``
+
+ Where evfield is an 8 bit event selector.
+
+----
+
+:File: ``nrseqstate`` (ro)
+:Trace Registers: From IDR5
+:Notes:
+ Number of sequencer states (0 or 4)
+
+----
+
+:File: ``nr_pe_cmp`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of PE comparator inputs
+
+----
+
+:File: ``nr_ext_inp`` (ro)
+:Trace Registers: From IDR5
+:Notes:
+ Number of external inputs
+
+----
+
+:File: ``nr_ss_cmp`` (ro)
+:Trace Registers: From IDR4
+:Notes:
+ Number of Single Shot control registers
+
+----
+
+*Note:* When programming any address comparator the driver will tag the
+comparator with a type used - i.e. RANGE, SINGLE, START, STOP. Once this tag
+is set, then only the values can be changed using the same sysfs file / type
+used to program it.
+
+Thus::
+
+ % echo 0 > addr_idx ; select address comparator 0
+ % echo 0x1000 0x5000 0 > addr_range ; set address range on comparators 0, 1.
+ % echo 0x2000 > addr_start ; error as comparator 0 is a range comparator
+ % echo 2 > addr_idx ; select address comparator 2
+ % echo 0x2000 > addr_start ; this is OK as comparator 2 is unused.
+ % echo 0x3000 > addr_stop ; error as comparator 2 set as start address.
+ % echo 2 > addr_idx ; select address comparator 3
+ % echo 0x3000 > addr_stop ; this is OK
+
+To remove programming on all the comparators (and all the other hardware) use
+the reset parameter::
+
+ % echo 1 > reset
+
+
+
+The ‘mode’ sysfs parameter.
+---------------------------
+
+This is a bitfield selection parameter that sets the overall trace mode for the
+ETM. The table below describes the bits, using the defines from the driver
+source file, along with a description of the feature these represent. Many
+features are optional and therefore dependent on implementation in the
+hardware.
+
+Bit assignments shown below:-
+
+----
+
+**bit (0):**
+ ETM_MODE_EXCLUDE
+
+**description:**
+ This is the default value for the include / exclude function when
+ setting address ranges. Set 1 for exclude range. When the mode
+ parameter is set this value is applied to the currently indexed
+ address range.
+
+
+**bit (4):**
+ ETM_MODE_BB
+
+**description:**
+ Set to enable branch broadcast if supported in hardware [IDR0].
+
+
+**bit (5):**
+ ETMv4_MODE_CYCACC
+
+**description:**
+ Set to enable cycle accurate trace if supported [IDR0].
+
+
+**bit (6):**
+ ETMv4_MODE_CTXID
+
+**description:**
+ Set to enable context ID tracing if supported in hardware [IDR2].
+
+
+**bit (7):**
+ ETM_MODE_VMID
+
+**description:**
+ Set to enable virtual machine ID tracing if supported [IDR2].
+
+
+**bit (11):**
+ ETMv4_MODE_TIMESTAMP
+
+**description:**
+ Set to enable timestamp generation if supported [IDR0].
+
+
+**bit (12):**
+ ETM_MODE_RETURNSTACK
+**description:**
+ Set to enable trace return stack use if supported [IDR0].
+
+
+**bit (13-14):**
+ ETM_MODE_QELEM(val)
+
+**description:**
+ ‘val’ determines level of Q element support enabled if
+ implemented by the ETM [IDR0]
+
+
+**bit (19):**
+ ETM_MODE_ATB_TRIGGER
+
+**description:**
+ Set to enable the ATBTRIGGER bit in the event control register
+ [EVENTCTLR1] if supported [IDR5].
+
+
+**bit (20):**
+ ETM_MODE_LPOVERRIDE
+
+**description:**
+ Set to enable the LPOVERRIDE bit in the event control register
+ [EVENTCTLR1], if supported [IDR5].
+
+
+**bit (21):**
+ ETM_MODE_ISTALL_EN
+
+**description:**
+ Set to enable the ISTALL bit in the stall control register
+ [STALLCTLR]
+
+
+**bit (23):**
+ ETM_MODE_INSTPRIO
+
+**description:**
+ Set to enable the INSTPRIORITY bit in the stall control register
+ [STALLCTLR] , if supported [IDR0].
+
+
+**bit (24):**
+ ETM_MODE_NOOVERFLOW
+
+**description:**
+ Set to enable the NOOVERFLOW bit in the stall control register
+ [STALLCTLR], if supported [IDR3].
+
+
+**bit (25):**
+ ETM_MODE_TRACE_RESET
+
+**description:**
+ Set to enable the TRCRESET bit in the viewinst control register
+ [VICTLR] , if supported [IDR3].
+
+
+**bit (26):**
+ ETM_MODE_TRACE_ERR
+
+**description:**
+ Set to enable the TRCCTRL bit in the viewinst control register
+ [VICTLR].
+
+
+**bit (27):**
+ ETM_MODE_VIEWINST_STARTSTOP
+
+**description:**
+ Set the initial state value of the ViewInst start / stop logic
+ in the viewinst control register [VICTLR]
+
+
+**bit (30):**
+ ETM_MODE_EXCL_KERN
+
+**description:**
+ Set default trace setup to exclude kernel mode trace (see note a)
+
+
+**bit (31):**
+ ETM_MODE_EXCL_USER
+
+**description:**
+ Set default trace setup to exclude user space trace (see note a)
+
+----
+
+*Note a)* On startup the ETM is programmed to trace the complete address space
+using address range comparator 0. ‘mode’ bits 30 / 31 modify this setting to
+set EL exclude bits for NS state in either user space (EL0) or kernel space
+(EL1) in the address range comparator. (the default setting excludes all
+secure EL, and NS EL2)
+
+Once the reset parameter has been used, and/or custom programming has been
+implemented - using these bits will result in the EL bits for address
+comparator 0 being set in the same way.
+
+*Note b)* Bits 2-3, 8-10, 15-16, 18, 22, control features that only work with
+data trace. As A-profile data trace is architecturally prohibited in ETMv4,
+these have been omitted here. Possible uses could be where a kernel has
+support for control of R or M profile infrastructure as part of a heterogeneous
+system.
+
+Bits 17, 28-29 are unused.
diff --git a/Documentation/trace/coresight.rst b/Documentation/trace/coresight/coresight.rst
similarity index 84%
rename from Documentation/trace/coresight.rst
rename to Documentation/trace/coresight/coresight.rst
index 72f4b7e..0b73acb 100644
--- a/Documentation/trace/coresight.rst
+++ b/Documentation/trace/coresight/coresight.rst
@@ -241,6 +241,91 @@
system is not unexpected. One must use the "names" as they appear on
the system under specified locations.
+Topology Representation
+-----------------------
+
+Each CoreSight component has a ``connections`` directory which will contain
+links to other CoreSight components. This allows the user to explore the trace
+topology and for larger systems, determine the most appropriate sink for a
+given source. The connection information can also be used to establish
+which CTI devices are connected to a given component. This directory contains a
+``nr_links`` attribute detailing the number of links in the directory.
+
+For an ETM source, in this case ``etm0`` on a Juno platform, a typical
+arrangement will be::
+
+ linaro-developer:~# ls - l /sys/bus/coresight/devices/etm0/connections
+ <file details> cti_cpu0 -> ../../../23020000.cti/cti_cpu0
+ <file details> nr_links
+ <file details> out:0 -> ../../../230c0000.funnel/funnel2
+
+Following the out port to ``funnel2``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel2/connections
+ <file details> in:0 -> ../../../23040000.etm/etm0
+ <file details> in:1 -> ../../../23140000.etm/etm3
+ <file details> in:2 -> ../../../23240000.etm/etm4
+ <file details> in:3 -> ../../../23340000.etm/etm5
+ <file details> nr_links
+ <file details> out:0 -> ../../../20040000.funnel/funnel0
+
+And again to ``funnel0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel0/connections
+ <file details> in:0 -> ../../../220c0000.funnel/funnel1
+ <file details> in:1 -> ../../../230c0000.funnel/funnel2
+ <file details> nr_links
+ <file details> out:0 -> ../../../20010000.etf/tmc_etf0
+
+Finding the first sink ``tmc_etf0``. This can be used to collect data
+as a sink, or as a link to propagate further along the chain::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etf0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20040000.funnel/funnel0
+ <file details> nr_links
+ <file details> out:0 -> ../../../20150000.funnel/funnel4
+
+via ``funnel4``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel4/connections
+ <file details> in:0 -> ../../../20010000.etf/tmc_etf0
+ <file details> in:1 -> ../../../20140000.etf/tmc_etf1
+ <file details> nr_links
+ <file details> out:0 -> ../../../20120000.replicator/replicator0
+
+and a ``replicator0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/replicator0/connections
+ <file details> in:0 -> ../../../20150000.funnel/funnel4
+ <file details> nr_links
+ <file details> out:0 -> ../../../20030000.tpiu/tpiu0
+ <file details> out:1 -> ../../../20070000.etr/tmc_etr0
+
+Arriving at the final sink in the chain, ``tmc_etr0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etr0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20120000.replicator/replicator0
+ <file details> nr_links
+
+As described below, when using sysfs it is sufficient to enable a sink and
+a source for successful trace. The framework will correctly enable all
+intermediate links as required.
+
+Note: ``cti_sys0`` appears in two of the connections lists above.
+CTIs can connect to multiple devices and are arranged in a star topology
+via the CTM. See (:doc:`coresight-ect`) [#fourth]_ for further details.
+Looking at this device we see 4 connections::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/cti_sys0/connections
+ <file details> nr_links
+ <file details> stm0 -> ../../../20100000.stm/stm0
+ <file details> tmc_etf0 -> ../../../20010000.etf/tmc_etf0
+ <file details> tmc_etr0 -> ../../../20070000.etr/tmc_etr0
+ <file details> tpiu0 -> ../../../20030000.tpiu/tpiu0
+
+
How to use the tracer modules
-----------------------------
@@ -489,10 +574,23 @@
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/stm0
root@genericarmv8:~#
-Details on how to use the generic STM API can be found here [#second]_.
+Details on how to use the generic STM API can be found here:- :doc:`../stm` [#second]_.
+
+The CTI & CTM Modules
+---------------------
+
+The CTI (Cross Trigger Interface) provides a set of trigger signals between
+individual CTIs and components, and can propagate these between all CTIs via
+channels on the CTM (Cross Trigger Matrix).
+
+A separate documentation file is provided to explain the use of these devices.
+(:doc:`coresight-ect`) [#fourth]_.
+
.. [#first] Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
.. [#second] Documentation/trace/stm.rst
.. [#third] https://github.com/Linaro/perf-opencsd
+
+.. [#fourth] Documentation/trace/coresight/coresight-ect.rst
diff --git a/Documentation/trace/coresight/index.rst b/Documentation/trace/coresight/index.rst
new file mode 100644
index 0000000..8d31b15
--- /dev/null
+++ b/Documentation/trace/coresight/index.rst
@@ -0,0 +1,9 @@
+==============================
+CoreSight - ARM Hardware Trace
+==============================
+
+.. toctree::
+ :maxdepth: 2
+ :glob:
+
+ *
diff --git a/Documentation/trace/events-msr.rst b/Documentation/trace/events-msr.rst
index e938aa0..810481e 100644
--- a/Documentation/trace/events-msr.rst
+++ b/Documentation/trace/events-msr.rst
@@ -4,7 +4,7 @@
The x86 kernel supports tracing most MSR (Model Specific Register) accesses.
To see the definition of the MSRs on Intel systems please see the SDM
-at http://www.intel.com/sdm (Volume 3)
+at https://www.intel.com/sdm (Volume 3)
Available trace points:
diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst
index 2ef3189..f45bf11 100644
--- a/Documentation/trace/events-power.rst
+++ b/Documentation/trace/events-power.rst
@@ -75,16 +75,6 @@
target/flags update.
::
- pm_qos_add_request "pm_qos_class=%s value=%d"
- pm_qos_update_request "pm_qos_class=%s value=%d"
- pm_qos_remove_request "pm_qos_class=%s value=%d"
- pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld"
-
-The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY").
-The second parameter is value to be added/updated/removed.
-The third parameter is timeout value in usec.
-::
-
pm_qos_update_target "action=%s prev_value=%d curr_value=%d"
pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x"
@@ -92,7 +82,7 @@
The second parameter is the previous QoS value.
The third parameter is the current QoS value to update.
-And, there are also events used for device PM QoS add/update/remove request.
+There are also events used for device PM QoS add/update/remove request.
::
dev_pm_qos_add_request "device=%s type=%s new_value=%d"
@@ -103,3 +93,12 @@
QoS requests.
The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY").
The third parameter is value to be added/updated/removed.
+
+And, there are events used for CPU latency QoS add/update/remove request.
+::
+
+ pm_qos_add_request "value=%d"
+ pm_qos_update_request "value=%d"
+ pm_qos_remove_request "value=%d"
+
+The parameter is the value to be added/updated/removed.
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f7e1fcc..9df29a9 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -198,6 +198,15 @@
prev_comm ~ "*sh*"
prev_comm ~ "ba*sh"
+If the field is a pointer that points into user space (for example
+"filename" from sys_enter_openat), then you have to append ".ustring" to the
+field name::
+
+ filename.ustring ~ "password"
+
+As the kernel will have to know how to retrieve the memory that the pointer
+is at from user space.
+
5.2 Setting filters
-------------------
@@ -230,6 +239,16 @@
the filter string; the error message should still be useful though
even without more accurate position info.
+5.2.1 Filter limitations
+------------------------
+
+If a filter is placed on a string pointer ``(char *)`` that does not point
+to a string on the ring buffer, but instead points to kernel or user space
+memory, then, for safety reasons, at most 1024 bytes of the content is
+copied onto a temporary buffer to do the compare. If the copy of the memory
+faults (the pointer points to memory that should not be accessed), then the
+string compare will be treated as not matching.
+
5.3 Clearing filters
--------------------
@@ -342,7 +361,8 @@
differences and the implementation isn't currently tied to it in any
way, so beware about making generalizations between the two.
-Note: Writing into trace_marker (See Documentation/trace/ftrace.rst)
+.. Note::
+ Writing into trace_marker (See Documentation/trace/ftrace.rst)
can also enable triggers that are written into
/sys/kernel/tracing/events/ftrace/print/trigger
@@ -525,3 +545,529 @@
event counts (hitcount).
See Documentation/trace/histogram.rst for details and examples.
+
+7. In-kernel trace event API
+============================
+
+In most cases, the command-line interface to trace events is more than
+sufficient. Sometimes, however, applications might find the need for
+more complex relationships than can be expressed through a simple
+series of linked command-line expressions, or putting together sets of
+commands may be simply too cumbersome. An example might be an
+application that needs to 'listen' to the trace stream in order to
+maintain an in-kernel state machine detecting, for instance, when an
+illegal kernel state occurs in the scheduler.
+
+The trace event subsystem provides an in-kernel API allowing modules
+or other kernel code to generate user-defined 'synthetic' events at
+will, which can be used to either augment the existing trace stream
+and/or signal that a particular important state has occurred.
+
+A similar in-kernel API is also available for creating kprobe and
+kretprobe events.
+
+Both the synthetic event and k/ret/probe event APIs are built on top
+of a lower-level "dynevent_cmd" event command API, which is also
+available for more specialized applications, or as the basis of other
+higher-level trace event APIs.
+
+The API provided for these purposes is describe below and allows the
+following:
+
+ - dynamically creating synthetic event definitions
+ - dynamically creating kprobe and kretprobe event definitions
+ - tracing synthetic events from in-kernel code
+ - the low-level "dynevent_cmd" API
+
+7.1 Dyamically creating synthetic event definitions
+---------------------------------------------------
+
+There are a couple ways to create a new synthetic event from a kernel
+module or other kernel code.
+
+The first creates the event in one step, using synth_event_create().
+In this method, the name of the event to create and an array defining
+the fields is supplied to synth_event_create(). If successful, a
+synthetic event with that name and fields will exist following that
+call. For example, to create a new "schedtest" synthetic event::
+
+ ret = synth_event_create("schedtest", sched_fields,
+ ARRAY_SIZE(sched_fields), THIS_MODULE);
+
+The sched_fields param in this example points to an array of struct
+synth_field_desc, each of which describes an event field by type and
+name::
+
+ static struct synth_field_desc sched_fields[] = {
+ { .type = "pid_t", .name = "next_pid_field" },
+ { .type = "char[16]", .name = "next_comm_field" },
+ { .type = "u64", .name = "ts_ns" },
+ { .type = "u64", .name = "ts_ms" },
+ { .type = "unsigned int", .name = "cpu" },
+ { .type = "char[64]", .name = "my_string_field" },
+ { .type = "int", .name = "my_int_field" },
+ };
+
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_names contains[] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+Because space for an event is reserved before assigning field values
+to the event, using dynamic arrays implies that the piecewise
+in-kernel API described below can't be used with dynamic arrays. The
+other non-piecewise in-kernel APIs can, however, be used with dynamic
+arrays.
+
+If the event is created from within a module, a pointer to the module
+must be passed to synth_event_create(). This will ensure that the
+trace buffer won't contain unreadable events when the module is
+removed.
+
+At this point, the event object is ready to be used for generating new
+events.
+
+In the second method, the event is created in several steps. This
+allows events to be created dynamically and without the need to create
+and populate an array of fields beforehand.
+
+To use this method, an empty or partially empty synthetic event should
+first be created using synth_event_gen_cmd_start() or
+synth_event_gen_cmd_array_start(). For synth_event_gen_cmd_start(),
+the name of the event along with one or more pairs of args each pair
+representing a 'type field_name;' field specification should be
+supplied. For synth_event_gen_cmd_array_start(), the name of the
+event along with an array of struct synth_field_desc should be
+supplied. Before calling synth_event_gen_cmd_start() or
+synth_event_gen_cmd_array_start(), the user should create and
+initialize a dynevent_cmd object using synth_event_cmd_init().
+
+For example, to create a new "schedtest" synthetic event with two
+fields::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+
+ /* Create a buffer to hold the generated command */
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ /* Before generating the command, initialize the cmd object */
+ synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
+
+ ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE,
+ "pid_t", "next_pid_field",
+ "u64", "ts_ns");
+
+Alternatively, using an array of struct synth_field_desc fields
+containing the same information::
+
+ ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
+ fields, n_fields);
+
+Once the synthetic event object has been created, it can then be
+populated with more fields. Fields are added one by one using
+synth_event_add_field(), supplying the dynevent_cmd object, a field
+type, and a field name. For example, to add a new int field named
+"intfield", the following call should be made::
+
+ ret = synth_event_add_field(&cmd, "int", "intfield");
+
+See synth_field_size() for available types. If field_name contains [n]
+the field is considered to be an array.
+
+A group of fields can also be added all at once using an array of
+synth_field_desc with add_synth_fields(). For example, this would add
+just the first four sched_fields::
+
+ ret = synth_event_add_fields(&cmd, sched_fields, 4);
+
+If you already have a string of the form 'type field_name',
+synth_event_add_field_str() can be used to add it as-is; it will
+also automatically append a ';' to the string.
+
+Once all the fields have been added, the event should be finalized and
+registered by calling the synth_event_gen_cmd_end() function::
+
+ ret = synth_event_gen_cmd_end(&cmd);
+
+At this point, the event object is ready to be used for tracing new
+events.
+
+7.2 Tracing synthetic events from in-kernel code
+------------------------------------------------
+
+To trace a synthetic event, there are several options. The first
+option is to trace the event in one call, using synth_event_trace()
+with a variable number of values, or synth_event_trace_array() with an
+array of values to be set. A second option can be used to avoid the
+need for a pre-formed array of values or list of arguments, via
+synth_event_trace_start() and synth_event_trace_end() along with
+synth_event_add_next_val() or synth_event_add_val() to add the values
+piecewise.
+
+7.2.1 Tracing a synthetic event all at once
+-------------------------------------------
+
+To trace a synthetic event all at once, the synth_event_trace() or
+synth_event_trace_array() functions can be used.
+
+The synth_event_trace() function is passed the trace_event_file
+representing the synthetic event (which can be retrieved using
+trace_get_event_file() using the synthetic event name, "synthetic" as
+the system name, and the trace instance name (NULL if using the global
+trace array)), along with an variable number of u64 args, one for each
+synthetic event field, and the number of values being passed.
+
+So, to trace an event corresponding to the synthetic event definition
+above, code like the following could be used::
+
+ ret = synth_event_trace(create_synth_test, 7, /* number of values */
+ 444, /* next_pid_field */
+ (u64)"clackers", /* next_comm_field */
+ 1000000, /* ts_ns */
+ 1000, /* ts_ms */
+ smp_processor_id(),/* cpu */
+ (u64)"Thneed", /* my_string_field */
+ 999); /* my_int_field */
+
+All vals should be cast to u64, and string vals are just pointers to
+strings, cast to u64. Strings will be copied into space reserved in
+the event for the string, using these pointers.
+
+Alternatively, the synth_event_trace_array() function can be used to
+accomplish the same thing. It is passed the trace_event_file
+representing the synthetic event (which can be retrieved using
+trace_get_event_file() using the synthetic event name, "synthetic" as
+the system name, and the trace instance name (NULL if using the global
+trace array)), along with an array of u64, one for each synthetic
+event field.
+
+To trace an event corresponding to the synthetic event definition
+above, code like the following could be used::
+
+ u64 vals[7];
+
+ vals[0] = 777; /* next_pid_field */
+ vals[1] = (u64)"tiddlywinks"; /* next_comm_field */
+ vals[2] = 1000000; /* ts_ns */
+ vals[3] = 1000; /* ts_ms */
+ vals[4] = smp_processor_id(); /* cpu */
+ vals[5] = (u64)"thneed"; /* my_string_field */
+ vals[6] = 398; /* my_int_field */
+
+The 'vals' array is just an array of u64, the number of which must
+match the number of field in the synthetic event, and which must be in
+the same order as the synthetic event fields.
+
+All vals should be cast to u64, and string vals are just pointers to
+strings, cast to u64. Strings will be copied into space reserved in
+the event for the string, using these pointers.
+
+In order to trace a synthetic event, a pointer to the trace event file
+is needed. The trace_get_event_file() function can be used to get
+it - it will find the file in the given trace instance (in this case
+NULL since the top trace array is being used) while at the same time
+preventing the instance containing it from going away::
+
+ schedtest_event_file = trace_get_event_file(NULL, "synthetic",
+ "schedtest");
+
+Before tracing the event, it should be enabled in some way, otherwise
+the synthetic event won't actually show up in the trace buffer.
+
+To enable a synthetic event from the kernel, trace_array_set_clr_event()
+can be used (which is not specific to synthetic events, so does need
+the "synthetic" system name to be specified explicitly).
+
+To enable the event, pass 'true' to it::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", true);
+
+To disable it pass false::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", false);
+
+Finally, synth_event_trace_array() can be used to actually trace the
+event, which should be visible in the trace buffer afterwards::
+
+ ret = synth_event_trace_array(schedtest_event_file, vals,
+ ARRAY_SIZE(vals));
+
+To remove the synthetic event, the event should be disabled, and the
+trace instance should be 'put' back using trace_put_event_file()::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", false);
+ trace_put_event_file(schedtest_event_file);
+
+If those have been successful, synth_event_delete() can be called to
+remove the event::
+
+ ret = synth_event_delete("schedtest");
+
+7.2.2 Tracing a synthetic event piecewise
+-----------------------------------------
+
+To trace a synthetic using the piecewise method described above, the
+synth_event_trace_start() function is used to 'open' the synthetic
+event trace::
+
+ struct synth_trace_state trace_state;
+
+ ret = synth_event_trace_start(schedtest_event_file, &trace_state);
+
+It's passed the trace_event_file representing the synthetic event
+using the same methods as described above, along with a pointer to a
+struct synth_trace_state object, which will be zeroed before use and
+used to maintain state between this and following calls.
+
+Once the event has been opened, which means space for it has been
+reserved in the trace buffer, the individual fields can be set. There
+are two ways to do that, either one after another for each field in
+the event, which requires no lookups, or by name, which does. The
+tradeoff is flexibility in doing the assignments vs the cost of a
+lookup per field.
+
+To assign the values one after the other without lookups,
+synth_event_add_next_val() should be used. Each call is passed the
+same synth_trace_state object used in the synth_event_trace_start(),
+along with the value to set the next field in the event. After each
+field is set, the 'cursor' points to the next field, which will be set
+by the subsequent call, continuing until all the fields have been set
+in order. The same sequence of calls as in the above examples using
+this method would be (without error-handling code)::
+
+ /* next_pid_field */
+ ret = synth_event_add_next_val(777, &trace_state);
+
+ /* next_comm_field */
+ ret = synth_event_add_next_val((u64)"slinky", &trace_state);
+
+ /* ts_ns */
+ ret = synth_event_add_next_val(1000000, &trace_state);
+
+ /* ts_ms */
+ ret = synth_event_add_next_val(1000, &trace_state);
+
+ /* cpu */
+ ret = synth_event_add_next_val(smp_processor_id(), &trace_state);
+
+ /* my_string_field */
+ ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state);
+
+ /* my_int_field */
+ ret = synth_event_add_next_val(395, &trace_state);
+
+To assign the values in any order, synth_event_add_val() should be
+used. Each call is passed the same synth_trace_state object used in
+the synth_event_trace_start(), along with the field name of the field
+to set and the value to set it to. The same sequence of calls as in
+the above examples using this method would be (without error-handling
+code)::
+
+ ret = synth_event_add_val("next_pid_field", 777, &trace_state);
+ ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
+ &trace_state);
+ ret = synth_event_add_val("ts_ns", 1000000, &trace_state);
+ ret = synth_event_add_val("ts_ms", 1000, &trace_state);
+ ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state);
+ ret = synth_event_add_val("my_string_field", (u64)"thneed_9",
+ &trace_state);
+ ret = synth_event_add_val("my_int_field", 3999, &trace_state);
+
+Note that synth_event_add_next_val() and synth_event_add_val() are
+incompatible if used within the same trace of an event - either one
+can be used but not both at the same time.
+
+Finally, the event won't be actually traced until it's 'closed',
+which is done using synth_event_trace_end(), which takes only the
+struct synth_trace_state object used in the previous calls::
+
+ ret = synth_event_trace_end(&trace_state);
+
+Note that synth_event_trace_end() must be called at the end regardless
+of whether any of the add calls failed (say due to a bad field name
+being passed in).
+
+7.3 Dyamically creating kprobe and kretprobe event definitions
+--------------------------------------------------------------
+
+To create a kprobe or kretprobe trace event from kernel code, the
+kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
+functions can be used.
+
+To create a kprobe event, an empty or partially empty kprobe event
+should first be created using kprobe_event_gen_cmd_start(). The name
+of the event and the probe location should be specfied along with one
+or args each representing a probe field should be supplied to this
+function. Before calling kprobe_event_gen_cmd_start(), the user
+should create and initialize a dynevent_cmd object using
+kprobe_event_cmd_init().
+
+For example, to create a new "schedtest" kprobe event with two fields::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+
+ /* Create a buffer to hold the generated command */
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ /* Before generating the command, initialize the cmd object */
+ kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
+
+ /*
+ * Define the gen_kprobe_test event with the first 2 kprobe
+ * fields.
+ */
+ ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open",
+ "dfd=%ax", "filename=%dx");
+
+Once the kprobe event object has been created, it can then be
+populated with more fields. Fields can be added using
+kprobe_event_add_fields(), supplying the dynevent_cmd object along
+with a variable arg list of probe fields. For example, to add a
+couple additional fields, the following call could be made::
+
+ ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
+
+Once all the fields have been added, the event should be finalized and
+registered by calling the kprobe_event_gen_cmd_end() or
+kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
+or kretprobe command was started::
+
+ ret = kprobe_event_gen_cmd_end(&cmd);
+
+or::
+
+ ret = kretprobe_event_gen_cmd_end(&cmd);
+
+At this point, the event object is ready to be used for tracing new
+events.
+
+Similarly, a kretprobe event can be created using
+kretprobe_event_gen_cmd_start() with a probe name and location and
+additional params such as $retval::
+
+ ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
+ "do_sys_open", "$retval");
+
+Similar to the synthetic event case, code like the following can be
+used to enable the newly created kprobe event::
+
+ gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");
+
+ ret = trace_array_set_clr_event(gen_kprobe_test->tr,
+ "kprobes", "gen_kprobe_test", true);
+
+Finally, also similar to synthetic events, the following code can be
+used to give the kprobe event file back and delete the event::
+
+ trace_put_event_file(gen_kprobe_test);
+
+ ret = kprobe_event_delete("gen_kprobe_test");
+
+7.4 The "dynevent_cmd" low-level API
+------------------------------------
+
+Both the in-kernel synthetic event and kprobe interfaces are built on
+top of a lower-level "dynevent_cmd" interface. This interface is
+meant to provide the basis for higher-level interfaces such as the
+synthetic and kprobe interfaces, which can be used as examples.
+
+The basic idea is simple and amounts to providing a general-purpose
+layer that can be used to generate trace event commands. The
+generated command strings can then be passed to the command-parsing
+and event creation code that already exists in the trace event
+subystem for creating the corresponding trace events.
+
+In a nutshell, the way it works is that the higher-level interface
+code creates a struct dynevent_cmd object, then uses a couple
+functions, dynevent_arg_add() and dynevent_arg_pair_add() to build up
+a command string, which finally causes the command to be executed
+using the dynevent_create() function. The details of the interface
+are described below.
+
+The first step in building a new command string is to create and
+initialize an instance of a dynevent_cmd. Here, for instance, we
+create a dynevent_cmd on the stack and initialize it::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+ int ret;
+
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ dynevent_cmd_init(cmd, buf, maxlen, DYNEVENT_TYPE_FOO,
+ foo_event_run_command);
+
+The dynevent_cmd initialization needs to be given a user-specified
+buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used
+for this purpose - at 2k it's generally too big to be comfortably put
+on the stack, so is dynamically allocated), a dynevent type id, which
+is meant to be used to check that further API calls are for the
+correct command type, and a pointer to an event-specific run_command()
+callback that will be called to actually execute the event-specific
+command function.
+
+Once that's done, the command string can by built up by successive
+calls to argument-adding functions.
+
+To add a single argument, define and initialize a struct dynevent_arg
+or struct dynevent_arg_pair object. Here's an example of the simplest
+possible arg addition, which is simply to append the given string as
+a whitespace-separated argument to the command::
+
+ struct dynevent_arg arg;
+
+ dynevent_arg_init(&arg, NULL, 0);
+
+ arg.str = name;
+
+ ret = dynevent_arg_add(cmd, &arg);
+
+The arg object is first initialized using dynevent_arg_init() and in
+this case the parameters are NULL or 0, which means there's no
+optional sanity-checking function or separator appended to the end of
+the arg.
+
+Here's another more complicated example using an 'arg pair', which is
+used to create an argument that consists of a couple components added
+together as a unit, for example, a 'type field_name;' arg or a simple
+expression arg e.g. 'flags=%cx'::
+
+ struct dynevent_arg_pair arg_pair;
+
+ dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';');
+
+ arg_pair.lhs = type;
+ arg_pair.rhs = name;
+
+ ret = dynevent_arg_pair_add(cmd, &arg_pair);
+
+Again, the arg_pair is first initialized, in this case with a callback
+function used to check the sanity of the args (for example, that
+neither part of the pair is NULL), along with a character to be used
+to add an operator between the pair (here none) and a separator to be
+appended onto the end of the arg pair (here ';').
+
+There's also a dynevent_str_add() function that can be used to simply
+add a string as-is, with no spaces, delimeters, or arg check.
+
+Any number of dynevent_*_add() calls can be made to build up the string
+(until its length surpasses cmd->maxlen). When all the arguments have
+been added and the command string is complete, the only thing left to
+do is run the command, which happens by simply calling
+dynevent_create()::
+
+ ret = dynevent_create(&cmd);
+
+At that point, if the return value is 0, the dynamic event has been
+created and is ready to use.
+
+See the dynevent_cmd function definitions themselves for the details
+of the API.
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index a8e22e0..6893399 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -229,14 +229,6 @@
pass the return address pointer as the 'retp' argument to
ftrace_push_return_trace().
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
HAVE_SYSCALL_TRACEPOINTS
------------------------
diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst
index 1fbc698..a4955f7 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -55,17 +55,17 @@
Both .flags and .private are optional. Only .func is required.
-To enable tracing call:
+To enable tracing call::
-.. c:function:: register_ftrace_function(&ops);
+ register_ftrace_function(&ops);
-To disable tracing call:
+To disable tracing call::
-.. c:function:: unregister_ftrace_function(&ops);
+ unregister_ftrace_function(&ops);
-The above is defined by including the header:
+The above is defined by including the header::
-.. c:function:: #include <linux/ftrace.h>
+ #include <linux/ftrace.h>
The registered callback will start being called some time after the
register_ftrace_function() is called and before it returns. The exact time
@@ -146,7 +146,7 @@
itself or any nested functions that those functions call.
If this flag is set, it is possible that the callback will also
- be called with preemption enabled (when CONFIG_PREEMPT is set),
+ be called with preemption enabled (when CONFIG_PREEMPTION is set),
but this is not guaranteed.
FTRACE_OPS_FL_IPMODIFY
@@ -170,6 +170,14 @@
a callback may be executed and RCU synchronization will not protect
it.
+FTRACE_OPS_FL_PERMANENT
+ If this is set on any ftrace ops, then the tracing cannot disabled by
+ writing 0 to the proc sysctl ftrace_enabled. Equally, a callback with
+ the flag set cannot be registered if ftrace_enabled is 0.
+
+ Livepatch uses it not to lose the function redirection, so the system
+ stays protected.
+
Filtering which functions to trace
==================================
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index e3060ee..87cf5c0 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -34,7 +34,7 @@
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
-See events.txt for more information.
+See events.rst for more information.
Implementation Details
@@ -95,7 +95,8 @@
current_tracer:
This is used to set or display the current tracer
- that is configured.
+ that is configured. Changing the current tracer clears
+ the ring buffer content as well as the "snapshot" buffer.
available_tracers:
@@ -124,9 +125,13 @@
trace:
This file holds the output of the trace in a human
- readable format (described below). Note, tracing is temporarily
- disabled when the file is open for reading. Once all readers
- are closed, tracing is re-enabled.
+ readable format (described below). Opening this file for
+ writing with the O_TRUNC flag clears the ring buffer content.
+ Note, this file is not a consumer. If tracing is off
+ (no tracer running, or tracing_on is zero), it will produce
+ the same output each time it is read. When tracing is on,
+ it may produce inconsistent results as it tries to read
+ the entire buffer without consuming it.
trace_pipe:
@@ -140,9 +145,7 @@
will not be read again with a sequential read. The
"trace" file is static, and if the tracer is not
adding more data, it will display the same
- information every time it is read. Unlike the
- "trace" file, opening this file for reading will not
- temporarily disable tracing.
+ information every time it is read.
trace_options:
@@ -185,7 +188,8 @@
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
- If the last page allocated has room for more bytes
+ A few extra pages may be allocated to accommodate buffer management
+ meta-data. If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested or shown.
( Note, the size may not be a multiple of the page size
@@ -235,7 +239,7 @@
This interface also allows for commands to be used. See the
"Filter commands" section for more details.
- As a speed up, since processing strings can't be quite expensive
+ As a speed up, since processing strings can be quite expensive
and requires a check of all functions registered to tracing, instead
an index can be written into this file. A number (starting with "1")
written will instead select the same corresponding at the line position
@@ -259,6 +263,20 @@
traced by the function tracer as well. This option will also
cause PIDs of tasks that exit to be removed from the file.
+ set_ftrace_notrace_pid:
+
+ Have the function tracer ignore threads whose PID are listed in
+ this file.
+
+ If the "function-fork" option is set, then when a task whose
+ PID is listed in this file forks, the child's PID will
+ automatically be added to this file, and the child will not be
+ traced by the function tracer as well. This option will also
+ cause PIDs of tasks that exit to be removed from the file.
+
+ If a PID is in both this file and "set_ftrace_pid", then this
+ file takes precedence, and the thread will not be traced.
+
set_event_pid:
Have the events only trace a task with a PID listed in this file.
@@ -270,6 +288,19 @@
cause the PIDs of tasks to be removed from this file when the task
exits.
+ set_event_notrace_pid:
+
+ Have the events not trace a task with a PID listed in this file.
+ Note, sched_switch and sched_wakeup will trace threads not listed
+ in this file, even if a thread's PID is in the file if the
+ sched_switch or sched_wakeup events also trace a thread that should
+ be traced.
+
+ To have the PIDs of children of tasks with their PID in this file
+ added on fork, enable the "event-fork" option. That option will also
+ cause the PIDs of tasks to be removed from this file when the task
+ exits.
+
set_graph_function:
Functions listed in this file will cause the function graph
@@ -345,11 +376,11 @@
kprobe_events:
- Enable dynamic trace points. See kprobetrace.txt.
+ Enable dynamic trace points. See kprobetrace.rst.
kprobe_profile:
- Dynamic trace points stats. See kprobetrace.txt.
+ Dynamic trace points stats. See kprobetrace.rst.
max_graph_depth:
@@ -382,7 +413,7 @@
By default, 128 comms are saved (see "saved_cmdlines" above). To
increase or decrease the amount of comms that are cached, echo
- in a the number of comms to cache, into this file.
+ the number of comms to cache into this file.
saved_tgids:
@@ -490,6 +521,9 @@
# echo global > trace_clock
+ Setting a clock clears the ring buffer content as well as the
+ "snapshot" buffer.
+
trace_marker:
This is a very useful file for synchronizing user space
@@ -527,14 +561,14 @@
trace_marker_raw:
- This is similar to trace_marker above, but is meant for for binary data
+ This is similar to trace_marker above, but is meant for binary data
to be written to it, where a tool can be used to parse the data
from trace_pipe_raw.
uprobe_events:
Add dynamic tracepoints in programs.
- See uprobetracer.txt
+ See uprobetracer.rst
uprobe_profile:
@@ -555,19 +589,19 @@
files at various levels that can enable the tracepoints
when a "1" is written to them.
- See events.txt for more information.
+ See events.rst for more information.
set_event:
By echoing in the event into this file, will enable that event.
- See events.txt for more information.
+ See events.rst for more information.
available_events:
A list of events that can be enabled in tracing.
- See events.txt for more information.
+ See events.rst for more information.
timestamp_mode:
@@ -1119,6 +1153,12 @@
the trace displays additional information about the
latency, as described in "Latency trace format".
+ pause-on-trace
+ When set, opening the trace file for read, will pause
+ writing to the ring buffer (as if tracing_on was set to zero).
+ This simulates the original behavior of the trace file.
+ When the file is closed, tracing will be enabled again.
+
record-cmd
When any event or tracer is enabled, a hook is enabled
in the sched_switch trace point to fill comm cache
@@ -1170,6 +1210,8 @@
tasks fork. Also, when tasks with PIDs in set_event_pid exit,
their PIDs will be removed from the file.
+ This affects PIDs listed in set_event_notrace_pid as well.
+
function-trace
The latency tracers will enable function tracing
if this option is enabled (default it is). When
@@ -1184,6 +1226,8 @@
set_ftrace_pid exit, their PIDs will be removed from the
file.
+ This affects PIDs in set_ftrace_notrace_pid as well.
+
display-graph
When set, the latency tracers (irqsoff, wakeup, etc) will
use function graph tracing instead of function tracing.
@@ -1350,7 +1394,7 @@
=> x86_64_start_reservations
=> x86_64_start_kernel
-Here we see that that we had a latency of 16 microseconds (which is
+Here we see that we had a latency of 16 microseconds (which is
very good). The _raw_spin_lock_irq in run_timer_softirq disabled
interrupts. The difference between the 16 and the displayed
timestamp 25us occurred because the clock was incremented
@@ -1409,7 +1453,7 @@
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> __ext3_get_inode_loc
@@ -1480,7 +1524,7 @@
=> remove_vma
=> exit_mmap
=> mmput
- => flush_old_exec
+ => begin_new_exec
=> load_elf_binary
=> search_binary_handler
=> __do_execve_file.isra.32
@@ -1694,7 +1738,7 @@
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> ext3_bread
@@ -2120,6 +2164,8 @@
# cat trace
# tracer: hwlat
#
+ # entries-in-buffer/entries-written: 13/13 #P:8
+ #
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
@@ -2127,12 +2173,18 @@
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
- <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
- <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
- <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
- <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
- <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
- <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
+ <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6
+ <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1
+ <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5
+ <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1
+ <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1
+ <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2
+ <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1
+ <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1
+ <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1
+ <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
+ <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
+ <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1
The above output is somewhat the same in the header. All events will have
@@ -2142,7 +2194,7 @@
This is the count of events recorded that were greater than the
tracing_threshold (See below).
- inner/outer(us): 12/14
+ inner/outer(us): 11/11
This shows two numbers as "inner latency" and "outer latency". The test
runs in a loop checking a timestamp twice. The latency detected within
@@ -2150,11 +2202,15 @@
after the previous timestamp and the next timestamp in the loop is
the "outer latency".
- ts:1499801089.066141940
+ ts:1581527483.343962693
- The absolute timestamp that the event happened.
+ The absolute timestamp that the first latency was recorded in the window.
- nmi-total:4 nmi-count:1
+ count:6
+
+ The number of times a latency was detected during the window.
+
+ nmi-total:7 nmi-count:1
On architectures that support it, if an NMI comes in during the
test, the time spent in NMI is reported in "nmi-total" (in
@@ -2976,7 +3032,9 @@
function tracer. By default it is enabled (when function tracing is
enabled in the kernel). If it is disabled, all function tracing is
disabled. This includes not only the function tracers for ftrace, but
-also for any other uses (perf, kprobes, stack tracing, profiling, etc).
+also for any other uses (perf, kprobes, stack tracing, profiling, etc). It
+cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
+registered.
Please disable this with care.
@@ -3322,7 +3380,7 @@
As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
-events are agnostic from the main director, or from any other
+events are agnostic from the main directory, or from any other
instances that are created.
The files in the new directory work just like the files with the
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
new file mode 100644
index 0000000..088c8cc
--- /dev/null
+++ b/Documentation/trace/histogram-design.rst
@@ -0,0 +1,2115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Histogram Design Notes
+======================
+
+:Author: Tom Zanussi <zanussi@kernel.org>
+
+This document attempts to provide a description of how the ftrace
+histograms work and how the individual pieces map to the data
+structures used to implement them in trace_events_hist.c and
+tracing_map.c.
+
+Note: All the ftrace histogram command examples assume the working
+ directory is the ftrace /tracing directory. For example::
+
+ # cd /sys/kernel/debug/tracing
+
+Also, the histogram output displayed for those commands will be
+generally be truncated - only enough to make the point is displayed.
+
+'hist_debug' trace event files
+==============================
+
+If the kernel is compiled with CONFIG_HIST_TRIGGERS_DEBUG set, an
+event file named 'hist_debug' will appear in each event's
+subdirectory. This file can be read at any time and will display some
+of the hist trigger internals described in this document. Specific
+examples and output will be described in test cases below.
+
+Basic histograms
+================
+
+First, basic histograms. Below is pretty much the simplest thing you
+can do with histograms - create one with a single key on a single
+event and cat the output::
+
+ # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
+
+ # cat events/sched/sched_waking/hist
+
+ { pid: 18249 } hitcount: 1
+ { pid: 13399 } hitcount: 1
+ { pid: 17973 } hitcount: 1
+ { pid: 12572 } hitcount: 1
+ ...
+ { pid: 10 } hitcount: 921
+ { pid: 18255 } hitcount: 1444
+ { pid: 25526 } hitcount: 2055
+ { pid: 5257 } hitcount: 2055
+ { pid: 27367 } hitcount: 2055
+ { pid: 1728 } hitcount: 2161
+
+ Totals:
+ Hits: 21305
+ Entries: 183
+ Dropped: 0
+
+What this does is create a histogram on the sched_waking event using
+pid as a key and with a single value, hitcount, which even if not
+explicitly specified, exists for every histogram regardless.
+
+The hitcount value is a per-bucket value that's automatically
+incremented on every hit for the given key, which in this case is the
+pid.
+
+So in this histogram, there's a separate bucket for each pid, and each
+bucket contains a value for that bucket, counting the number of times
+sched_waking was called for that pid.
+
+Each histogram is represented by a hist_data struct.
+
+To keep track of each key and value field in the histogram, hist_data
+keeps an array of these fields named fields[]. The fields[] array is
+an array containing struct hist_field representations of each
+histogram val and key in the histogram (variables are also included
+here, but are discussed later). So for the above histogram we have one
+key and one value; in this case the one value is the hitcount value,
+which all histograms have, regardless of whether they define that
+value or not, which the above histogram does not.
+
+Each struct hist_field contains a pointer to the ftrace_event_field
+from the event's trace_event_file along with various bits related to
+that such as the size, offset, type, and a hist_field_fn_t function,
+which is used to grab the field's data from the ftrace event buffer
+(in most cases - some hist_fields such as hitcount don't directly map
+to an event field in the trace buffer - in these cases the function
+implementation gets its value from somewhere else). The flags field
+indicates which type of field it is - key, value, variable, variable
+reference, etc., with value being the default.
+
+The other important hist_data data structure in addition to the
+fields[] array is the tracing_map instance created for the histogram,
+which is held in the .map member. The tracing_map implements the
+lock-free hash table used to implement histograms (see
+kernel/trace/tracing_map.h for much more discussion about the
+low-level data structures implementing the tracing_map). For the
+purposes of this discussion, the tracing_map contains a number of
+buckets, each bucket corresponding to a particular tracing_map_elt
+object hashed by a given histogram key.
+
+Below is a diagram the first part of which describes the hist_data and
+associated key and value fields for the histogram described above. As
+you can see, there are two fields in the fields array, one val field
+for the hitcount and one key field for the pid key.
+
+Below that is a diagram of a run-time snapshot of what the tracing_map
+might look like for a given run. It attempts to show the
+relationships between the hist_data fields and the tracing_map
+elements for a couple hypothetical keys and values.::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +----------------+
+ | .fields[] |---->| val = hitcount |----------------------------+
+ +----------------+ +----------------+ |
+ | .map | | .size | |
+ +----------------+ +--------------+ |
+ | .offset | |
+ +--------------+ |
+ | .fn() | |
+ +--------------+ |
+ . |
+ . |
+ . |
+ +----------------+ <--- n_vals |
+ | key = pid |----------------------------|--+
+ +----------------+ | |
+ | .size | | |
+ +--------------+ | |
+ | .offset | | |
+ +--------------+ | |
+ | .fn() | | |
+ +----------------+ <--- n_fields | |
+ | unused | | |
+ +----------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ n_keys = n_fields - n_vals | |
+
+The hist_data n_vals and n_fields delineate the extent of the fields[] | |
+array and separate keys from values for the rest of the code. | |
+
+Below is a run-time representation of the tracing_map part of the | |
+histogram, with pointers from various parts of the fields[] array | |
+to corresponding parts of the tracing_map. | |
+
+The tracing_map consists of an array of tracing_map_entrys and a set | |
+of preallocated tracing_map_elts (abbreviated below as map_entry and | |
+map_elt). The total number of map_entrys in the hist_data.map array = | |
+map->max_elts (actually map->map_size but only max_elts of those are | |
+used. This is a property required by the map_insert() algorithm). | |
+
+If a map_entry is unused, meaning no key has yet hashed into it, its | |
+.key value is 0 and its .val pointer is NULL. Once a map_entry has | |
+been claimed, the .key value contains the key's hash value and the | |
+.val member points to a map_elt containing the full key and an entry | |
+for each key or value in the map_elt.fields[] array. There is an | |
+entry in the map_elt.fields[] array corresponding to each hist_field | |
+in the histogram, and this is where the continually aggregated sums | |
+corresponding to each histogram value are kept. | |
+
+The diagram attempts to show the relationship between the | |
+hist_data.fields[] and the map_elt.fields[] with the links drawn | |
+between diagrams::
+
+ +-----------+ | |
+ | hist_data | | |
+ +-----------+ | |
+ | .fields | | |
+ +---------+ +-----------+ | |
+ | .map |---->| map_entry | | |
+ +---------+ +-----------+ | |
+ | .key |---> 0 | |
+ +---------+ | |
+ | .val |---> NULL | |
+ +-----------+ | |
+ | map_entry | | |
+ +-----------+ | |
+ | .key |---> pid = 999 | |
+ +---------+ +-----------+ | |
+ | .val |--->| map_elt | | |
+ +---------+ +-----------+ | |
+ . | .key |---> full key * | |
+ . +---------+ +---------------+ | |
+ . | .fields |--->| .sum (val) |<-+ |
+ +-----------+ +---------+ | 2345 | | |
+ | map_entry | +---------------+ | |
+ +-----------+ | .offset (key) |<----+
+ | .key |---> 0 | 0 | | |
+ +---------+ +---------------+ | |
+ | .val |---> NULL . | |
+ +-----------+ . | |
+ | map_entry | . | |
+ +-----------+ +---------------+ | |
+ | .key | | .sum (val) or | | |
+ +---------+ +---------+ | .offset (key) | | |
+ | .val |--->| map_elt | +---------------+ | |
+ +-----------+ +---------+ | .sum (val) or | | |
+ | map_entry | | .offset (key) | | |
+ +-----------+ +---------------+ | |
+ | .key |---> pid = 4444 | |
+ +---------+ +-----------+ | |
+ | .val | | map_elt | | |
+ +---------+ +-----------+ | |
+ | .key |---> full key * | |
+ +---------+ +---------------+ | |
+ | .fields |--->| .sum (val) |<-+ |
+ +---------+ | 65523 | |
+ +---------------+ |
+ | .offset (key) |<----+
+ | 0 |
+ +---------------+
+ .
+ .
+ .
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ map_entry = struct tracing_map_entry
+ map_elt = struct tracing_map_elt
+ map_elt.fields = struct tracing_map_field
+
+Whenever a new event occurs and it has a hist trigger associated with
+it, event_hist_trigger() is called. event_hist_trigger() first deals
+with the key: for each subkey in the key (in the above example, there
+is just one subkey corresponding to pid), the hist_field that
+represents that subkey is retrieved from hist_data.fields[] and the
+hist_field_fn_t fn() associated with that field, along with the
+field's size and offset, is used to grab that subkey's data from the
+current trace record.
+
+Once the complete key has been retrieved, it's used to look that key
+up in the tracing_map. If there's no tracing_map_elt associated with
+that key, an empty one is claimed and inserted in the map for the new
+key. In either case, the tracing_map_elt associated with that key is
+returned.
+
+Once a tracing_map_elt available, hist_trigger_elt_update() is called.
+As the name implies, this updates the element, which basically means
+updating the element's fields. There's a tracing_map_field associated
+with each key and value in the histogram, and each of these correspond
+to the key and value hist_fields created when the histogram was
+created. hist_trigger_elt_update() goes through each value hist_field
+and, as for the keys, uses the hist_field's fn() and size and offset
+to grab the field's value from the current trace record. Once it has
+that value, it simply adds that value to that field's
+continually-updated tracing_map_field.sum member. Some hist_field
+fn()s, such as for the hitcount, don't actually grab anything from the
+trace record (the hitcount fn() just increments the counter sum by 1),
+but the idea is the same.
+
+Once all the values have been updated, hist_trigger_elt_update() is
+done and returns. Note that there are also tracing_map_fields for
+each subkey in the key, but hist_trigger_elt_update() doesn't look at
+them or update anything - those exist only for sorting, which can
+happen later.
+
+Basic histogram test
+--------------------
+
+This is a good example to try. It produces 3 value fields and 2 key
+fields in the output::
+
+ # echo 'hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+To see the debug data, cat the kmem/kmalloc's 'hist_debug' file. It
+will show the trigger info of the histogram it corresponds to, along
+with the address of the hist_data associated with the histogram, which
+will become useful in later examples. It then displays the number of
+total hist_fields associated with the histogram along with a count of
+how many of those correspond to keys and how many correspond to values.
+
+It then goes on to display details for each field, including the
+field's flags and the position of each field in the hist_data's
+fields[] array, which is useful information for verifying that things
+internally appear correct or not, and which again will become even
+more useful in further examples::
+
+ # cat events/kmem/kmalloc/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=common_pid,call_site.sym:vals=hitcount,bytes_req,bytes_alloc:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000005e48c9a5
+
+ n_vals: 3
+ n_keys: 2
+ n_fields: 5
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_req
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[2]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_alloc
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: common_pid
+ type: int
+ size: 8
+ is_signed: 1
+
+ hist_data->fields[4]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: call_site
+ type: unsigned long
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+Variables
+=========
+
+Variables allow data from one hist trigger to be saved by one hist
+trigger and retrieved by another hist trigger. For example, a trigger
+on the sched_waking event can capture a timestamp for a particular
+pid, and later a sched_switch event that switches to that pid event
+can grab the timestamp and use it to calculate a time delta between
+the two events::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >>
+ events/sched/sched_switch/trigger
+
+In terms of the histogram data structures, variables are implemented
+as another type of hist_field and for a given hist trigger are added
+to the hist_data.fields[] array just after all the val fields. To
+distinguish them from the existing key and val fields, they're given a
+new flag type, HIST_FIELD_FL_VAR (abbreviated FL_VAR) and they also
+make use of a new .var.idx field member in struct hist_field, which
+maps them to an index in a new map_elt.vars[] array added to the
+map_elt specifically designed to store and retrieve variable values.
+The diagram below shows those new elements and adds a new variable
+entry, ts0, corresponding to the ts0 variable in the sched_waking
+trigger above.
+
+sched_waking histogram
+----------------------::
+
+ +------------------+
+ | hist_data |<-------------------------------------------------------+
+ +------------------+ +-------------------+ |
+ | .fields[] |-->| val = hitcount | |
+ +----------------+ +-------------------+ |
+ | .map | | .size | |
+ +----------------+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags | |
+ +-----------------+ |
+ | .var.idx | |
+ +-------------------+ |
+ | var = ts0 | |
+ +-------------------+ |
+ | .size | |
+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags & FL_VAR | |
+ +-----------------+ |
+ | .var.idx |----------------------------+-+ |
+ +-----------------+ | | |
+ . | | |
+ . | | |
+ . | | |
+ +-------------------+ <--- n_vals | | |
+ | key = pid | | | |
+ +-------------------+ | | |
+ | .size | | | |
+ +-----------------+ | | |
+ | .offset | | | |
+ +-----------------+ | | |
+ | .fn() | | | |
+ +-----------------+ | | |
+ | .flags & FL_KEY | | | |
+ +-----------------+ | | |
+ | .var.idx | | | |
+ +-------------------+ <--- n_fields | | |
+ | unused | | | |
+ +-------------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ n_keys = n_fields - n_vals | | |
+ | | |
+
+This is very similar to the basic case. In the above diagram, we can | | |
+see a new .flags member has been added to the struct hist_field | | |
+struct, and a new entry added to hist_data.fields representing the ts0 | | |
+variable. For a normal val hist_field, .flags is just 0 (modulo | | |
+modifier flags), but if the value is defined as a variable, the .flags | | |
+contains a set FL_VAR bit. | | |
+
+As you can see, the ts0 entry's .var.idx member contains the index | | |
+into the tracing_map_elts' .vars[] array containing variable values. | | |
+This idx is used whenever the value of the variable is set or read. | | |
+The map_elt.vars idx assigned to the given variable is assigned and | | |
+saved in .var.idx by create_tracing_map_fields() after it calls | | |
+tracing_map_add_var(). | | |
+
+Below is a representation of the histogram at run-time, which | | |
+populates the map, along with correspondence to the above hist_data and | | |
+hist_field data structures. | | |
+
+The diagram attempts to show the relationship between the | | |
+hist_data.fields[] and the map_elt.fields[] and map_elt.vars[] with | | |
+the links drawn between diagrams. For each of the map_elts, you can | | |
+see that the .fields[] members point to the .sum or .offset of a key | | |
+or val and the .vars[] members point to the value of a variable. The | | |
+arrows between the two diagrams show the linkages between those | | |
+tracing_map members and the field definitions in the corresponding | | |
+hist_data fields[] members.::
+
+ +-----------+ | | |
+ | hist_data | | | |
+ +-----------+ | | |
+ | .fields | | | |
+ +---------+ +-----------+ | | |
+ | .map |---->| map_entry | | | |
+ +---------+ +-----------+ | | |
+ | .key |---> 0 | | |
+ +---------+ | | |
+ | .val |---> NULL | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 999 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ . +---------+ | 2345 | | | |
+ . +--| .vars | +---------------+ | | |
+ . | +---------+ | .offset (key) | | | |
+ . | | 0 | | | |
+ . | +---------------+ | | |
+ . | . | | |
+ . | . | | |
+ . | . | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | | |
+ . +---------------->+---------------+ | | |
+ . | ts0 |<--+ | |
+ . | 113345679876 | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . . | | |
+ . . | | |
+ . . | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 4444 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ +---------+ | 2345 | | | |
+ +--| .vars | +---------------+ | | |
+ | +---------+ | .offset (key) | | | |
+ | | 0 | | | |
+ | +---------------+ | | |
+ | . | | |
+ | . | | |
+ | . | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | | |
+ | +---------------+ | | |
+ +---------------->| ts0 |<--+ | |
+ | 213499240729 | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ . | |
+ . | |
+ . | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+
+For each used map entry, there's a map_elt pointing to an array of | |
+.vars containing the current value of the variables associated with | |
+that histogram entry. So in the above, the timestamp associated with | |
+pid 999 is 113345679876, and the timestamp variable in the same | |
+.var.idx for pid 4444 is 213499240729. | |
+
+sched_switch histogram | |
+---------------------- | |
+
+The sched_switch histogram paired with the above sched_waking | |
+histogram is shown below. The most important aspect of the | |
+sched_switch histogram is that it references a variable on the | |
+sched_waking histogram above. | |
+
+The histogram diagram is very similar to the others so far displayed, | |
+but it adds variable references. You can see the normal hitcount and | |
+key fields along with a new wakeup_lat variable implemented in the | |
+same way as the sched_waking ts0 variable, but in addition there's an | |
+entry with the new FL_VAR_REF (short for HIST_FIELD_FL_VAR_REF) flag. | |
+
+Associated with the new var ref field are a couple of new hist_field | |
+members, var.hist_data and var_ref_idx. For a variable reference, the | |
+var.hist_data goes with the var.idx, which together uniquely identify | |
+a particular variable on a particular histogram. The var_ref_idx is | |
+just the index into the var_ref_vals[] array that caches the values of | |
+each variable whenever a hist trigger is updated. Those resulting | |
+values are then finally accessed by other code such as trace action | |
+code that uses the var_ref_idx values to assign param values. | |
+
+The diagram below describes the situation for the sched_switch | |
+histogram referred to before::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> | |
+ events/sched/sched_switch/trigger | |
+ | |
+ +------------------+ | |
+ | hist_data | | |
+ +------------------+ +-----------------------+ | |
+ | .fields[] |-->| val = hitcount | | |
+ +----------------+ +-----------------------+ | |
+ | .map | | .size | | |
+ +----------------+ +---------------------+ | |
+ +--| .var_refs[] | | .offset | | |
+ | +----------------+ +---------------------+ | |
+ | | .fn() | | |
+ | var_ref_vals[] +---------------------+ | |
+ | +-------------+ | .flags | | |
+ | | $ts0 |<---+ +---------------------+ | |
+ | +-------------+ | | .var.idx | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var.hist_data | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var_ref_idx | | |
+ | | | | +-----------------------+ | |
+ | +-------------+ | | var = wakeup_lat | | |
+ | . | +-----------------------+ | |
+ | . | | .size | | |
+ | . | +---------------------+ | |
+ | +-------------+ | | .offset | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .fn() | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .flags & FL_VAR | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +---------------------+ | |
+ | | | .var.hist_data | | |
+ | | +---------------------+ | |
+ | | | .var_ref_idx | | |
+ | | +---------------------+ | |
+ | | . | |
+ | | . | |
+ | | . | |
+ | | +-----------------------+ <--- n_vals | |
+ | | | key = pid | | |
+ | | +-----------------------+ | |
+ | | | .size | | |
+ | | +---------------------+ | |
+ | | | .offset | | |
+ | | +---------------------+ | |
+ | | | .fn() | | |
+ | | +---------------------+ | |
+ | | | .flags | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +-----------------------+ <--- n_fields | |
+ | | | unused | | |
+ | | +-----------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | n_keys = n_fields - n_vals | |
+ | | | |
+ | | | |
+ | | +-----------------------+ | |
+ +---------------------->| var_ref = $ts0 | | |
+ | +-----------------------+ | |
+ | | .size | | |
+ | +---------------------+ | |
+ | | .offset | | |
+ | +---------------------+ | |
+ | | .fn() | | |
+ | +---------------------+ | |
+ | | .flags & FL_VAR_REF | | |
+ | +---------------------+ | |
+ | | .var.idx |--------------------------+ |
+ | +---------------------+ |
+ | | .var.hist_data |----------------------------+
+ | +---------------------+
+ +---| .var_ref_idx |
+ +---------------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+When a hist trigger makes use of a variable, a new hist_field is
+created with flag HIST_FIELD_FL_VAR_REF. For a VAR_REF field, the
+var.idx and var.hist_data take the same values as the referenced
+variable, as well as the referenced variable's size, type, and
+is_signed values. The VAR_REF field's .name is set to the name of the
+variable it references. If a variable reference was created using the
+explicit system.event.$var_ref notation, the hist_field's system and
+event_name variables are also set.
+
+So, in order to handle an event for the sched_switch histogram,
+because we have a reference to a variable on another histogram, we
+need to resolve all variable references first. This is done via the
+resolve_var_refs() calls made from event_hist_trigger(). What this
+does is grabs the var_refs[] array from the hist_data representing the
+sched_switch histogram. For each one of those, the referenced
+variable's var.hist_data along with the current key is used to look up
+the corresponding tracing_map_elt in that histogram. Once found, the
+referenced variable's var.idx is used to look up the variable's value
+using tracing_map_read_var(elt, var.idx), which yields the value of
+the variable for that element, ts0 in the case above. Note that both
+the hist_fields representing both the variable and the variable
+reference have the same var.idx, so this is straightforward.
+
+Variable and variable reference test
+------------------------------------
+
+This example creates a variable on the sched_waking event, ts0, and
+uses it in the sched_switch trigger. The sched_switch trigger also
+creates its own variable, wakeup_lat, but nothing yet uses it::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking 'hist_debug' output, in addition to the
+normal key and value hist_fields, in the val fields section we see a
+field with the HIST_FIELD_FL_VAR flag, which indicates that that field
+represents a variable. Note that in addition to the variable name,
+contained in the var.name field, it includes the var.idx, which is the
+index into the tracing_map_elt.vars[] array of the actual variable
+location. Note also that the output shows that variables live in the
+same part of the hist_data->fields[] array as normal values::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 000000009536f554
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Moving on to the sched_switch trigger hist_debug output, in addition
+to the unused wakeup_lat variable, we see a new section displaying
+variable references. Variable references are displayed in a separate
+section because in addition to being logically separate from
+variables and values, they actually live in a separate hist_data
+array, var_refs[].
+
+In this example, the sched_switch trigger has a reference to a
+variable on the sched_waking trigger, $ts0. Looking at the details,
+we can see that the var.hist_data value of the referenced variable
+matches the previously displayed sched_waking trigger, and the var.idx
+value matches the previously displayed var.idx value for that
+variable. Also displayed is the var_ref_idx value for that variable
+reference, which is where the value for that variable is cached for
+use when the trigger is invoked::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000f4ee8006
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000009536f554
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Actions and Handlers
+====================
+
+Adding onto the previous example, we will now do something with that
+wakeup_lat variable, namely send it and another field as a synthetic
+event.
+
+The onmatch() action below basically says that whenever we have a
+sched_switch event, if we have a matching sched_waking event, in this
+case if we have a pid in the sched_waking histogram that matches the
+next_pid field on this sched_switch event, we retrieve the
+variables specified in the wakeup_latency() trace action, and use
+them to generate a new wakeup_latency event into the trace stream.
+
+Note that the way the trace handlers such as wakeup_latency() (which
+could equivalently be written trace(wakeup_latency,$wakeup_lat,next_pid)
+are implemented, the parameters specified to the trace handler must be
+variables. In this case, $wakeup_lat is obviously a variable, but
+next_pid isn't, since it's just naming a field in the sched_switch
+trace event. Since this is something that almost every trace() and
+save() action does, a special shortcut is implemented to allow field
+names to be used directly in those cases. How it works is that under
+the covers, a temporary variable is created for the named field, and
+this variable is what is actually passed to the trace handler. In the
+code and documentation, this type of variable is called a 'field
+variable'.
+
+Fields on other trace event's histograms can be used as well. In that
+case we have to generate a new histogram and an unfortunately named
+'synthetic_field' (the use of synthetic here has nothing to do with
+synthetic events) and use that special histogram field as a variable.
+
+The diagram below illustrates the new elements described above in the
+context of the sched_switch histogram using the onmatch() handler and
+the trace() action.
+
+First, we define the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events
+
+Next, the sched_waking hist trigger as before::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+Finally, we create a hist trigger on the sched_switch event that
+generates a wakeup_latency() trace event. In this case we pass
+next_pid into the wakeup_latency synthetic event invocation, which
+means it will be automatically converted into a field variable::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid)' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+The diagram for the sched_switch event is similar to previous examples
+but shows the additional field_vars[] array for hist_data and shows
+the linkages between the field_vars and the variables and references
+created to implement the field variables. The details are discussed
+below::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +-----------------------+
+ | .fields[] |-->| val = hitcount |
+ +----------------+ +-----------------------+
+ | .map | | .size |
+ +----------------+ +---------------------+
+ +---| .field_vars[] | | .offset |
+ | +----------------+ +---------------------+
+ |+--| .var_refs[] | | .offset |
+ || +----------------+ +---------------------+
+ || | .fn() |
+ || var_ref_vals[] +---------------------+
+ || +-------------+ | .flags |
+ || | $ts0 |<---+ +---------------------+
+ || +-------------+ | | .var.idx |
+ || | $next_pid |<-+ | +---------------------+
+ || +-------------+ | | | .var.hist_data |
+ ||+>| $wakeup_lat | | | +---------------------+
+ ||| +-------------+ | | | .var_ref_idx |
+ ||| | | | | +-----------------------+
+ ||| +-------------+ | | | var = wakeup_lat |
+ ||| . | | +-----------------------+
+ ||| . | | | .size |
+ ||| . | | +---------------------+
+ ||| +-------------+ | | | .offset |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .fn() |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .flags & FL_VAR |
+ ||| | | +---------------------+
+ ||| | | | .var.idx |
+ ||| | | +---------------------+
+ ||| | | | .var.hist_data |
+ ||| | | +---------------------+
+ ||| | | | .var_ref_idx |
+ ||| | | +---------------------+
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| +--------------+ | | .
+ +-->| field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +--------------+ | | .
+ || | field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +------------+ | | .
+ || . | | .
+ || . | | .
+ || . | | +-----------------------+ <--- n_vals
+ || +--------------+ | | | key = pid |
+ || | field_var | | | +-----------------------+
+ || +--------------+ | | | .size |
+ || | var |--+| +---------------------+
+ || +------------+ ||| | .offset |
+ || | val |-+|| +---------------------+
+ || +------------+ ||| | .fn() |
+ || ||| +---------------------+
+ || ||| | .flags |
+ || ||| +---------------------+
+ || ||| | .var.idx |
+ || ||| +---------------------+ <--- n_fields
+ || |||
+ || ||| n_keys = n_fields - n_vals
+ || ||| +-----------------------+
+ || |+->| var = next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .flags & FL_VAR |
+ || | | +---------------------+
+ || | | | .var.idx |
+ || | | +---------------------+
+ || | | | .var.hist_data |
+ || | | +-----------------------+
+ || +-->| val for next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .fn() |
+ || | | +---------------------+
+ || | | | .flags |
+ || | | +---------------------+
+ || | | | |
+ || | | +---------------------+
+ || | |
+ || | |
+ || | | +-----------------------+
+ +|------------------|-|>| var_ref = $ts0 |
+ | | | +-----------------------+
+ | | | | .size |
+ | | | +---------------------+
+ | | | | .offset |
+ | | | +---------------------+
+ | | | | .fn() |
+ | | | +---------------------+
+ | | | | .flags & FL_VAR_REF |
+ | | | +---------------------+
+ | | +---| .var_ref_idx |
+ | | +-----------------------+
+ | | | var_ref = $next_pid |
+ | | +-----------------------+
+ | | | .size |
+ | | +---------------------+
+ | | | .offset |
+ | | +---------------------+
+ | | | .fn() |
+ | | +---------------------+
+ | | | .flags & FL_VAR_REF |
+ | | +---------------------+
+ | +-----| .var_ref_idx |
+ | +-----------------------+
+ | | var_ref = $wakeup_lat |
+ | +-----------------------+
+ | | .size |
+ | +---------------------+
+ | | .offset |
+ | +---------------------+
+ | | .fn() |
+ | +---------------------+
+ | | .flags & FL_VAR_REF |
+ | +---------------------+
+ +------------------------| .var_ref_idx |
+ +---------------------+
+
+As you can see, for a field variable, two hist_fields are created: one
+representing the variable, in this case next_pid, and one to actually
+get the value of the field from the trace stream, like a normal val
+field does. These are created separately from normal variable
+creation and are saved in the hist_data->field_vars[] array. See
+below for how these are used. In addition, a reference hist_field is
+also created, which is needed to reference the field variables such as
+$next_pid variable in the trace() action.
+
+Note that $wakeup_lat is also a variable reference, referencing the
+value of the expression common_timestamp-$ts0, and so also needs to
+have a hist field entry representing that reference created.
+
+When hist_trigger_elt_update() is called to get the normal key and
+value fields, it also calls update_field_vars(), which goes through
+each field_var created for the histogram, and available from
+hist_data->field_vars and calls val->fn() to get the data from the
+current trace record, and then uses the var's var.idx to set the
+variable at the var.idx offset in the appropriate tracing_map_elt's
+variable at elt->vars[var.idx].
+
+Once all the variables have been updated, resolve_var_refs() can be
+called from event_hist_trigger(), and not only can our $ts0 and
+$next_pid references be resolved but the $wakeup_lat reference as
+well. At this point, the trace() action can simply access the values
+assembled in the var_ref_vals[] array and generate the trace event.
+
+The same process occurs for the field variables associated with the
+save() action.
+
+Abbreviations used in the diagram::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ field_var = struct field_var
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+trace() action field variable test
+----------------------------------
+
+This example adds to the previous test example by finally making use
+of the wakeup_lat variable, but in addition also creates a couple of
+field variables that then are all passed to the wakeup_latency() trace
+action via the onmatch() handler.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, the sched_waking trigger from previous examples::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Finally, as in the previous test example, we calculate and assign the
+wakeup latency using the $ts0 reference from the sched_waking trigger
+to the wakeup_lat variable, and finally use it along with a couple
+sched_switch event fields, next_pid and next_comm, to generate a
+wakeup_latency trace event. The next_pid and next_comm event fields
+are automatically converted into field variables for this purpose::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test example::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000d60ff61f
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows the same key and value fields
+as in the previous test example - note that wakeup_lat is still in the
+val fields section, but that the new field variables are not there -
+although the field variables are variables, they're held separately in
+the hist_data's field_vars[] array. Although the field variables and
+the normal variables are located in separate places, you can see that
+the actual variable locations for those variables in the
+tracing_map_elt.vars[] do have increasing indices as expected:
+wakeup_lat takes the var.idx = 0 slot, while the field variables for
+next_pid and next_comm have values var.idx = 1, and var.idx = 2. Note
+also that those are the same values displayed for the variable
+references corresponding to those variables in the variable reference
+fields section. Since there are two triggers and thus two hist_data
+addresses, those addresses also need to be accounted for when doing
+the matching - you can see that the first variable refers to the 0
+var.idx on the previous hist trigger (see the hist_data address
+associated with that trigger), while the second variable refers to the
+0 var.idx on the sched_switch hist trigger, as do all the remaining
+variable references.
+
+Finally, the action tracking variables section just shows the system
+and event name for the onmatch() handler::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000008f551b7
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000d60ff61f
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->field_vars[1]:
+
+ field_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[1].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+action_data and the trace() action
+----------------------------------
+
+As mentioned above, when the trace() action generates a synthetic
+event, all the parameters to the synthetic event either already are
+variables or are converted into variables (via field variables), and
+finally all those variable values are collected via references to them
+into a var_ref_vals[] array.
+
+The values in the var_ref_vals[] array, however, don't necessarily
+follow the same ordering as the synthetic event params. To address
+that, struct action_data contains another array, var_ref_idx[] that
+maps the trace action params to the var_ref_vals[] values. Below is a
+diagram illustrating that for the wakeup_latency() synthetic event::
+
+ +------------------+ wakeup_latency()
+ | action_data | event params var_ref_vals[]
+ +------------------+ +-----------------+ +-----------------+
+ | .var_ref_idx[] |--->| $wakeup_lat idx |---+ | |
+ +----------------+ +-----------------+ | +-----------------+
+ | .synth_event | | $next_pid idx |---|-+ | $wakeup_lat val |
+ +----------------+ +-----------------+ | | +-----------------+
+ . | +->| $next_pid val |
+ . | +-----------------+
+ . | .
+ +-----------------+ | .
+ | | | .
+ +-----------------+ | +-----------------+
+ +--->| $wakeup_lat val |
+ +-----------------+
+
+Basically, how this ends up getting used in the synthetic event probe
+function, trace_event_raw_event_synth(), is as follows::
+
+ for each field i in .synth_event
+ val_idx = .var_ref_idx[i]
+ val = var_ref_vals[val_idx]
+
+action_data and the onXXX() handlers
+------------------------------------
+
+The hist trigger onXXX() actions other than onmatch(), such as onmax()
+and onchange(), also make use of and internally create hidden
+variables. This information is contained in the
+action_data.track_data struct, and is also visible in the hist_debug
+output as will be described in the example below.
+
+Typically, the onmax() or onchange() handlers are used in conjunction
+with the save() and snapshot() actions. For example::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+or::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).snapshot()' >>
+ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+
+save() action field variable test
+---------------------------------
+
+For this example, instead of generating a synthetic event, the save()
+action is used to save field values whenever an onmax() handler
+detects that a new max latency has been hit. As in the previous
+example, the values being saved are also field values, but in this
+case, are kept in a separate hist_data array named save_vars[].
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+In this case, however, we set up the sched_switch trigger to save some
+sched_switch field values whenever we hit a new maximum latency. For
+both the onmax() handler and save() action, variables will be created,
+which we can use the hist_debug files to examine::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test examples::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000e6290f48
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The output of the sched_switch trigger shows the same val and key
+values as before, but also shows a couple new sections.
+
+First, the action tracking variables section now shows the
+actions[].track_data information describing the special tracking
+variables and references used to track, in this case, the running
+maximum value. The actions[].track_data.var_ref member contains the
+reference to the variable being tracked, in this case the $wakeup_lat
+variable. In order to perform the onmax() handler function, there
+also needs to be a variable that tracks the current maximum by getting
+updated whenever a new maximum is hit. In this case, we can see that
+an auto-generated variable named ' __max' has been created and is
+visible in the actions[].track_data.track_var variable.
+
+Finally, in the new 'save action variables' section, we can see that
+the 4 params to the save() function have resulted in 4 field variables
+being created for the purposes of saving the values of the named
+fields when the max is hit. These variables are kept in a separate
+save_vars[] array off of hist_data, so are displayed in a separate
+section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) [active]
+ #
+
+ hist_data: 0000000057bcd28d
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000e6290f48
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].track_data.var_ref:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->actions[0].track_data.track_var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: __max
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ save action variables (save() params):
+
+ hist_data->save_vars[0]:
+
+ save_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ save_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ hist_data->save_vars[1]:
+
+ save_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_pid
+ var.idx (into tracing_map_elt.vars[]): 3
+
+ save_vars[1].val:
+ ftrace_event_field name: prev_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[2]:
+
+ save_vars[2].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_prio
+ var.idx (into tracing_map_elt.vars[]): 4
+
+ save_vars[2].val:
+ ftrace_event_field name: prev_prio
+ type: int
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[3]:
+
+ save_vars[3].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_comm
+ var.idx (into tracing_map_elt.vars[]): 5
+
+ save_vars[3].val:
+ ftrace_event_field name: prev_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+A couple special cases
+======================
+
+While the above covers the basics of the histogram internals, there
+are a couple of special cases that should be discussed, since they
+tend to create even more confusion. Those are field variables on other
+histograms, and aliases, both described below through example tests
+using the hist_debug files.
+
+Test of field variables on other histograms
+-------------------------------------------
+
+This example is similar to the previous examples, but in this case,
+the sched_switch trigger references a hist trigger field on another
+event, namely the sched_waking event. In order to accomplish this, a
+field variable is created for the other event, but since an existing
+histogram can't be used, as existing histograms are immutable, a new
+histogram with a matching variable is created and used, and we'll see
+that reflected in the hist_debug output shown below.
+
+First, we create the wakeup_latency synthetic event. Note the
+addition of the prio field::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Here we set up a hist trigger on sched_switch to send a wakeup_latency
+event using an onmatch handler naming the sched_waking event. Note
+that the third param being passed to the wakeup_latency() is prio,
+which is a field name that needs to have a field variable created for
+it. There isn't however any prio field on the sched_switch event so
+it would seem that it wouldn't be possible to create a field variable
+for it. The matching sched_waking event does have a prio field, so it
+should be possible to make use of it for this purpose. The problem
+with that is that it's not currently possible to define a new variable
+on an existing histogram, so it's not possible to add a new prio field
+variable to the existing sched_waking histogram. It is however
+possible to create an additional new 'matching' sched_waking histogram
+for the same event, meaning that it uses the same key and filters, and
+define the new prio field variable on that.
+
+Here's the sched_switch trigger::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+And here's the output of the hist_debug information for the
+sched_waking hist trigger. Note that there are two histograms
+displayed in the output: the first is the normal sched_waking
+histogram we've seen in the previous examples, and the second is the
+special histogram we created to provide the prio field variable.
+
+Looking at the second histogram below, we see a variable with the name
+synthetic_prio. This is the field variable created for the prio field
+on that sched_waking histogram::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000349570e4
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:synthetic_prio=prio:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000006920cf38
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: prio
+ var.name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: int
+ size: 4
+ is_signed: 1
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Looking at the sched_switch histogram below, we can see a reference to
+the synthetic_prio variable on sched_waking, and looking at the
+associated hist_data address we see that it is indeed associated with
+the new histogram. Note also that the other references are to a
+normal variable, wakeup_lat, and to a normal field variable, next_pid,
+the details of which are in the field variables section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio) [active]
+ #
+
+ hist_data: 00000000a73b67df
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000349570e4
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000006920cf38
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: int
+ size: 4
+ is_signed: 1
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+Alias test
+----------
+
+This example is very similar to previous examples, but demonstrates
+the alias flag.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, we create a sched_waking trigger similar to previous examples,
+but in this case we save the pid in the waking_pid variable::
+
+ # echo 'hist:keys=pid:waking_pid=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+For the sched_switch trigger, instead of using $waking_pid directly in
+the wakeup_latency synthetic event invocation, we create an alias of
+$waking_pid named $woken_pid, and use that in the synthetic event
+invocation instead::
+
+ # echo 'hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking hist_debug output, in addition to the
+normal fields, we can see the waking_pid variable::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:waking_pid=pid,ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000a250528c
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: pid
+ var.name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows that a variable named
+woken_pid has been created but that it also has the
+HIST_FIELD_FL_ALIAS flag set. It also has the HIST_FIELD_FL_VAR flag
+set, which is why it appears in the val field section.
+
+Despite that implementation detail, an alias variable is actually more
+like a variable reference; in fact it can be thought of as a reference
+to a reference. The implementation copies the var_ref->fn() from the
+variable reference being referenced, in this case, the waking_pid
+fn(), which is hist_field_var_ref() and makes that the fn() of the
+alias. The hist_field_var_ref() fn() requires the var_ref_idx of the
+variable reference it's using, so waking_pid's var_ref_idx is also
+copied to the alias. The end result is that when the value of alias
+is retrieved, in the end it just does the same thing the original
+reference would have done and retrieves the same value from the
+var_ref_vals[] array. You can verify this in the output by noting
+that the var_ref_idx of the alias, in this case woken_pid, is the same
+as the var_ref_idx of the reference, waking_pid, in the variable
+reference fields section.
+
+Additionally, once it gets that value, since it is also a variable, it
+then saves that value into its var.idx. So the var.idx of the
+woken_pid alias is 0, which it fills with the value from var_ref_idx 0
+when its fn() is called to update itself. You'll also notice that
+there's a woken_pid var_ref in the variable refs section. That is the
+reference to the woken_pid alias variable, and you can see that it
+retrieves the value from the same var.idx as the woken_pid alias, 0,
+and then in turn saves that value in its own var_ref_idx slot, 3, and
+the value at this position is finally what gets assigned to the
+$woken_pid slot in the trace event invocation::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:woken_pid=$waking_pid,wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000055d65ed0
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ HIST_FIELD_FL_ALIAS
+ var.name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[4]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 4
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 3f3d1b9..f99be80 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -1495,7 +1495,7 @@
#
{ stacktrace:
- _do_fork+0x18e/0x330
+ kernel_clone+0x18e/0x330
kernel_thread+0x29/0x30
kthreadd+0x154/0x1b0
ret_from_fork+0x3f/0x70
@@ -1588,7 +1588,7 @@
SYSC_sendto+0xef/0x170
} hitcount: 88
{ stacktrace:
- _do_fork+0x18e/0x330
+ kernel_clone+0x18e/0x330
SyS_clone+0x19/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 244
@@ -1776,6 +1776,24 @@
variables and their types, which can be any valid field type,
separated by semicolons, to the tracing/synthetic_events file.
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_names contains[] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+A string field can be specified using either the static notation:
+
+ char name[32];
+
+Or the dynamic:
+
+ char name[];
+
+The size limit for either is 256.
+
For instance, the following creates a new event named 'wakeup_latency'
with 3 fields: lat, pid, and prio. Each of those fields is simply a
variable reference to a variable on another event::
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index b7891cb..f634b36 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -9,6 +9,7 @@
tracepoint-analysis
ftrace
ftrace-uses
+ kprobes
kprobetrace
uprobetracer
tracepoints
@@ -19,9 +20,11 @@
events-msr
mmiotrace
histogram
+ histogram-design
+ boottime-trace
hwlat_detector
intel_th
+ ring-buffer-design
stm
sys-t
- coresight
- coresight-cpu-debug
+ coresight/index
diff --git a/Documentation/trace/intel_th.rst b/Documentation/trace/intel_th.rst
index baa12eb..b31818d 100644
--- a/Documentation/trace/intel_th.rst
+++ b/Documentation/trace/intel_th.rst
@@ -44,7 +44,8 @@
MSU can be configured to collect trace data into a system memory
buffer, which can later on be read from its device nodes via read() or
-mmap() interface.
+mmap() interface and directed to a "software sink" driver that will
+consume the data and/or relay it further.
On the whole, Intel(R) Trace Hub does not require any special
userspace software to function; everything can be configured, started
@@ -57,7 +58,7 @@
For each Intel TH device in the system a bus of its own is
created and assigned an id number that reflects the order in which TH
-devices were emumerated. All TH subdevices (devices on intel_th bus)
+devices were enumerated. All TH subdevices (devices on intel_th bus)
begin with this id: 0-gth, 0-msc0, 0-msc1, 0-pti, 0-sth, which is
followed by device's name and an optional index.
@@ -122,3 +123,28 @@
will show up on the intel_th bus. Also, trace configuration and
capture controlling attribute groups of the 'gth' device will not be
exposed. The 'sth' device will operate as usual.
+
+Software Sinks
+--------------
+
+The Memory Storage Unit (MSU) driver provides an in-kernel API for
+drivers to register themselves as software sinks for the trace data.
+Such drivers can further export the data via other devices, such as
+USB device controllers or network cards.
+
+The API has two main parts::
+ - notifying the software sink that a particular window is full, and
+ "locking" that window, that is, making it unavailable for the trace
+ collection; when this happens, the MSU driver will automatically
+ switch to the next window in the buffer if it is unlocked, or stop
+ the trace capture if it's not;
+ - tracking the "locked" state of windows and providing a way for the
+ software sink driver to notify the MSU driver when a window is
+ unlocked and can be used again to collect trace data.
+
+An example sink driver, msu-sink illustrates the implementation of a
+software sink. Functionally, it simply unlocks windows as soon as they
+are full, keeping the MSU running in a circular buffer mode. Unlike the
+"multi" mode, it will fill out all the windows in the buffer as opposed
+to just the first one. It can be enabled by writing "sink" to the "mode"
+file (assuming msu-sink.ko is loaded).
diff --git a/Documentation/trace/kprobes.rst b/Documentation/trace/kprobes.rst
new file mode 100644
index 0000000..b757b6d
--- /dev/null
+++ b/Documentation/trace/kprobes.rst
@@ -0,0 +1,803 @@
+=======================
+Kernel Probes (Kprobes)
+=======================
+
+:Author: Jim Keniston <jkenisto@us.ibm.com>
+:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+:Author: Masami Hiramatsu <mhiramat@redhat.com>
+
+.. CONTENTS
+
+ 1. Concepts: Kprobes, and Return Probes
+ 2. Architectures Supported
+ 3. Configuring Kprobes
+ 4. API Reference
+ 5. Kprobes Features and Limitations
+ 6. Probe Overhead
+ 7. TODO
+ 8. Kprobes Example
+ 9. Kretprobes Example
+ 10. Deprecated Features
+ Appendix A: The kprobes debugfs interface
+ Appendix B: The kprobes sysctl interface
+ Appendix C: References
+
+Concepts: Kprobes and Return Probes
+=========================================
+
+Kprobes enables you to dynamically break into any kernel routine and
+collect debugging and performance information non-disruptively. You
+can trap at almost any kernel code address [1]_, specifying a handler
+routine to be invoked when the breakpoint is hit.
+
+.. [1] some parts of the kernel code can not be trapped, see
+ :ref:`kprobes_blacklist`)
+
+There are currently two types of probes: kprobes, and kretprobes
+(also called return probes). A kprobe can be inserted on virtually
+any instruction in the kernel. A return probe fires when a specified
+function returns.
+
+In the typical case, Kprobes-based instrumentation is packaged as
+a kernel module. The module's init function installs ("registers")
+one or more probes, and the exit function unregisters them. A
+registration function such as register_kprobe() specifies where
+the probe is to be inserted and what handler is to be called when
+the probe is hit.
+
+There are also ``register_/unregister_*probes()`` functions for batch
+registration/unregistration of a group of ``*probes``. These functions
+can speed up unregistration process when you have to unregister
+a lot of probes at once.
+
+The next four subsections explain how the different types of
+probes work and how jump optimization works. They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe. But if you're in a hurry to start using Kprobes, you
+can skip ahead to :ref:`kprobes_archs_supported`.
+
+How Does a Kprobe Work?
+-----------------------
+
+When a kprobe is registered, Kprobes makes a copy of the probed
+instruction and replaces the first byte(s) of the probed instruction
+with a breakpoint instruction (e.g., int3 on i386 and x86_64).
+
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+registers are saved, and control passes to Kprobes via the
+notifier_call_chain mechanism. Kprobes executes the "pre_handler"
+associated with the kprobe, passing the handler the addresses of the
+kprobe struct and the saved registers.
+
+Next, Kprobes single-steps its copy of the probed instruction.
+(It would be simpler to single-step the actual instruction in place,
+but then Kprobes would have to temporarily remove the breakpoint
+instruction. This would open a small time window when another CPU
+could sail right past the probepoint.)
+
+After the instruction is single-stepped, Kprobes executes the
+"post_handler," if any, that is associated with the kprobe.
+Execution then continues with the instruction following the probepoint.
+
+Changing Execution Path
+-----------------------
+
+Since kprobes can probe into a running kernel code, it can change the
+register set, including instruction pointer. This operation requires
+maximum care, such as keeping the stack frame, recovering the execution
+path etc. Since it operates on a running kernel and needs deep knowledge
+of computer architecture and concurrent computing, you can easily shoot
+your foot.
+
+If you change the instruction pointer (and set up other related
+registers) in pre_handler, you must return !0 so that kprobes stops
+single stepping and just returns to the given address.
+This also means post_handler should not be called anymore.
+
+Note that this operation may be harder on some architectures which use
+TOC (Table of Contents) for function call, since you have to setup a new
+TOC for your function in your module, and recover the old one after
+returning from it.
+
+Return Probes
+-------------
+
+How Does a Return Probe Work?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When you call register_kretprobe(), Kprobes establishes a kprobe at
+the entry to the function. When the probed function is called and this
+probe is hit, Kprobes saves a copy of the return address, and replaces
+the return address with the address of a "trampoline." The trampoline
+is an arbitrary piece of code -- typically just a nop instruction.
+At boot time, Kprobes registers a kprobe at the trampoline.
+
+When the probed function executes its return instruction, control
+passes to the trampoline and that probe is hit. Kprobes' trampoline
+handler calls the user-specified return handler associated with the
+kretprobe, then sets the saved instruction pointer to the saved return
+address, and that's where execution resumes upon return from the trap.
+
+While the probed function is executing, its return address is
+stored in an object of type kretprobe_instance. Before calling
+register_kretprobe(), the user sets the maxactive field of the
+kretprobe struct to specify how many instances of the specified
+function can be probed simultaneously. register_kretprobe()
+pre-allocates the indicated number of kretprobe_instance objects.
+
+For example, if the function is non-recursive and is called with a
+spinlock held, maxactive = 1 should be enough. If the function is
+non-recursive and can never relinquish the CPU (e.g., via a semaphore
+or preemption), NR_CPUS should be enough. If maxactive <= 0, it is
+set to a default value. If CONFIG_PREEMPT is enabled, the default
+is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.
+
+It's not a disaster if you set maxactive too low; you'll just miss
+some probes. In the kretprobe struct, the nmissed field is set to
+zero when the return probe is registered, and is incremented every
+time the probed function is entered but there is no kretprobe_instance
+object available for establishing the return probe.
+
+Kretprobe entry-handler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Kretprobes also provides an optional user-specified handler which runs
+on function entry. This handler is specified by setting the entry_handler
+field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the
+function entry is hit, the user-defined entry_handler, if any, is invoked.
+If the entry_handler returns 0 (success) then a corresponding return handler
+is guaranteed to be called upon function return. If the entry_handler
+returns a non-zero error then Kprobes leaves the return address as is, and
+the kretprobe has no further effect for that particular function instance.
+
+Multiple entry and return handler invocations are matched using the unique
+kretprobe_instance object associated with them. Additionally, a user
+may also specify per return-instance private data to be part of each
+kretprobe_instance object. This is especially useful when sharing private
+data between corresponding user entry and return handlers. The size of each
+private data object can be specified at kretprobe registration time by
+setting the data_size field of the kretprobe struct. This data can be
+accessed through the data field of each kretprobe_instance object.
+
+In case probed function is entered but there is no kretprobe_instance
+object available, then in addition to incrementing the nmissed count,
+the user entry_handler invocation is also skipped.
+
+.. _kprobes_jump_optimization:
+
+How Does Jump Optimization Work?
+--------------------------------
+
+If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
+is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
+the "debug.kprobes_optimization" kernel parameter is set to 1 (see
+sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
+instruction instead of a breakpoint instruction at each probepoint.
+
+Init a Kprobe
+^^^^^^^^^^^^^
+
+When a probe is registered, before attempting this optimization,
+Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
+address. So, even if it's not possible to optimize this particular
+probepoint, there'll be a probe there.
+
+Safety Check
+^^^^^^^^^^^^
+
+Before optimizing a probe, Kprobes performs the following safety checks:
+
+- Kprobes verifies that the region that will be replaced by the jump
+ instruction (the "optimized region") lies entirely within one function.
+ (A jump instruction is multiple bytes, and so may overlay multiple
+ instructions.)
+
+- Kprobes analyzes the entire function and verifies that there is no
+ jump into the optimized region. Specifically:
+
+ - the function contains no indirect jump;
+ - the function contains no instruction that causes an exception (since
+ the fixup code triggered by the exception could jump back into the
+ optimized region -- Kprobes checks the exception tables to verify this);
+ - there is no near jump to the optimized region (other than to the first
+ byte).
+
+- For each instruction in the optimized region, Kprobes verifies that
+ the instruction can be executed out of line.
+
+Preparing Detour Buffer
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, Kprobes prepares a "detour" buffer, which contains the following
+instruction sequence:
+
+- code to push the CPU's registers (emulating a breakpoint trap)
+- a call to the trampoline code which calls user's probe handlers.
+- code to restore registers
+- the instructions from the optimized region
+- a jump back to the original execution path.
+
+Pre-optimization
+^^^^^^^^^^^^^^^^
+
+After preparing the detour buffer, Kprobes verifies that none of the
+following situations exist:
+
+- The probe has a post_handler.
+- Other instructions in the optimized region are probed.
+- The probe is disabled.
+
+In any of the above cases, Kprobes won't start optimizing the probe.
+Since these are temporary situations, Kprobes tries to start
+optimizing it again if the situation is changed.
+
+If the kprobe can be optimized, Kprobes enqueues the kprobe to an
+optimizing list, and kicks the kprobe-optimizer workqueue to optimize
+it. If the to-be-optimized probepoint is hit before being optimized,
+Kprobes returns control to the original instruction path by setting
+the CPU's instruction pointer to the copied code in the detour buffer
+-- thus at least avoiding the single-step.
+
+Optimization
+^^^^^^^^^^^^
+
+The Kprobe-optimizer doesn't insert the jump instruction immediately;
+rather, it calls synchronize_rcu() for safety first, because it's
+possible for a CPU to be interrupted in the middle of executing the
+optimized region [3]_. As you know, synchronize_rcu() can ensure
+that all interruptions that were active when synchronize_rcu()
+was called are done, but only if CONFIG_PREEMPT=n. So, this version
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
+
+After that, the Kprobe-optimizer calls stop_machine() to replace
+the optimized region with a jump instruction to the detour buffer,
+using text_poke_smp().
+
+Unoptimization
+^^^^^^^^^^^^^^
+
+When an optimized kprobe is unregistered, disabled, or blocked by
+another kprobe, it will be unoptimized. If this happens before
+the optimization is complete, the kprobe is just dequeued from the
+optimized list. If the optimization has been done, the jump is
+replaced with the original code (except for an int3 breakpoint in
+the first byte) by using text_poke_smp().
+
+.. [3] Please imagine that the 2nd instruction is interrupted and then
+ the optimizer replaces the 2nd instruction with the jump *address*
+ while the interrupt handler is running. When the interrupt
+ returns to original address, there is no valid instruction,
+ and it causes an unexpected result.
+
+.. [4] This optimization-safety checking may be replaced with the
+ stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+ kernel.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel's execution
+path by changing regs->ip and returning 1. However, when the probe
+is optimized, that modification is ignored. Thus, if you want to
+tweak the kernel's execution path, you need to suppress optimization,
+using one of the following techniques:
+
+- Specify an empty function for the kprobe's post_handler.
+
+or
+
+- Execute 'sysctl -w debug.kprobes_optimization=n'
+
+.. _kprobes_blacklist:
+
+Blacklist
+---------
+
+Kprobes can probe most of the kernel except itself. This means
+that there are some functions where kprobes cannot probe. Probing
+(trapping) such functions can cause a recursive trap (e.g. double
+fault) or the nested probe handler may never be called.
+Kprobes manages such functions as a blacklist.
+If you want to add a function into the blacklist, you just need
+to (1) include linux/kprobes.h and (2) use NOKPROBE_SYMBOL() macro
+to specify a blacklisted function.
+Kprobes checks the given probe address against the blacklist and
+rejects registering it, if the given address is in the blacklist.
+
+.. _kprobes_archs_supported:
+
+Architectures Supported
+=======================
+
+Kprobes and return probes are implemented on the following
+architectures:
+
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
+- ppc64
+- ia64 (Does not support probes on instruction slot1.)
+- sparc64 (Return probes not yet implemented.)
+- arm
+- ppc
+- mips
+- s390
+- parisc
+
+Configuring Kprobes
+===================
+
+When configuring the kernel using make menuconfig/xconfig/oldconfig,
+ensure that CONFIG_KPROBES is set to "y". Under "General setup", look
+for "Kprobes".
+
+So that you can load and unload Kprobes-based instrumentation modules,
+make sure "Loadable module support" (CONFIG_MODULES) and "Module
+unloading" (CONFIG_MODULE_UNLOAD) are set to "y".
+
+Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL
+are set to "y", since kallsyms_lookup_name() is used by the in-kernel
+kprobe address resolution code.
+
+If you need to insert a probe in the middle of a function, you may find
+it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
+so you can use "objdump -d -l vmlinux" to see the source-to-object
+code mapping.
+
+API Reference
+=============
+
+The Kprobes API includes a "register" function and an "unregister"
+function for each type of probe. The API also includes "register_*probes"
+and "unregister_*probes" functions for (un)registering arrays of probes.
+Here are terse, mini-man-page specifications for these functions and
+the associated probe handlers that you'll write. See the files in the
+samples/kprobes/ sub-directory for examples.
+
+register_kprobe
+---------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobe(struct kprobe *kp);
+
+Sets a breakpoint at the address kp->addr. When the breakpoint is
+hit, Kprobes calls kp->pre_handler. After the probed instruction
+is single-stepped, Kprobe calls kp->post_handler. If a fault
+occurs during execution of kp->pre_handler or kp->post_handler,
+or during single-stepping of the probed instruction, Kprobes calls
+kp->fault_handler. Any or all handlers can be NULL. If kp->flags
+is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled,
+so, its handlers aren't hit until calling enable_kprobe(kp).
+
+.. note::
+
+ 1. With the introduction of the "symbol_name" field to struct kprobe,
+ the probepoint address resolution will now be taken care of by the kernel.
+ The following will now work::
+
+ kp.symbol_name = "symbol_name";
+
+ (64-bit powerpc intricacies such as function descriptors are handled
+ transparently)
+
+ 2. Use the "offset" field of struct kprobe if the offset into the symbol
+ to install a probepoint is known. This field is used to calculate the
+ probepoint.
+
+ 3. Specify either the kprobe "symbol_name" OR the "addr". If both are
+ specified, kprobe registration will fail with -EINVAL.
+
+ 4. With CISC architectures (such as i386 and x86_64), the kprobes code
+ does not validate if the kprobe.addr is at an instruction boundary.
+ Use "offset" with caution.
+
+register_kprobe() returns 0 on success, or a negative errno otherwise.
+
+User's pre-handler (kp->pre_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+Called with p pointing to the kprobe associated with the breakpoint,
+and regs pointing to the struct containing the registers saved when
+the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
+
+User's post-handler (kp->post_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ void post_handler(struct kprobe *p, struct pt_regs *regs,
+ unsigned long flags);
+
+p and regs are as described for the pre_handler. flags always seems
+to be zero.
+
+User's fault-handler (kp->fault_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr);
+
+p and regs are as described for the pre_handler. trapnr is the
+architecture-specific trap number associated with the fault (e.g.,
+on i386, 13 for a general protection fault or 14 for a page fault).
+Returns 1 if it successfully handled the exception.
+
+register_kretprobe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kretprobe(struct kretprobe *rp);
+
+Establishes a return probe for the function whose address is
+rp->kp.addr. When that function returns, Kprobes calls rp->handler.
+You must set rp->maxactive appropriately before you call
+register_kretprobe(); see "How Does a Return Probe Work?" for details.
+
+register_kretprobe() returns 0 on success, or a negative errno
+otherwise.
+
+User's return-probe handler (rp->handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int kretprobe_handler(struct kretprobe_instance *ri,
+ struct pt_regs *regs);
+
+regs is as described for kprobe.pre_handler. ri points to the
+kretprobe_instance object, of which the following fields may be
+of interest:
+
+- ret_addr: the return address
+- rp: points to the corresponding kretprobe object
+- task: points to the corresponding task struct
+- data: points to per return-instance private data; see "Kretprobe
+ entry-handler" for details.
+
+The regs_return_value(regs) macro provides a simple abstraction to
+extract the return value from the appropriate register as defined by
+the architecture's ABI.
+
+The handler's return value is currently ignored.
+
+unregister_*probe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobe(struct kprobe *kp);
+ void unregister_kretprobe(struct kretprobe *rp);
+
+Removes the specified probe. The unregister function can be called
+at any time after the probe has been registered.
+
+.. note::
+
+ If the functions find an incorrect probe (ex. an unregistered probe),
+ they clear the addr field of the probe.
+
+register_*probes
+----------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobes(struct kprobe **kps, int num);
+ int register_kretprobes(struct kretprobe **rps, int num);
+
+Registers each of the num probes in the specified array. If any
+error occurs during registration, all probes in the array, up to
+the bad probe, are safely unregistered before the register_*probes
+function returns.
+
+- kps/rps: an array of pointers to ``*probe`` data structures
+- num: the number of the array entries.
+
+.. note::
+
+ You have to allocate(or define) an array of pointers and set all
+ of the array entries before using these functions.
+
+unregister_*probes
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobes(struct kprobe **kps, int num);
+ void unregister_kretprobes(struct kretprobe **rps, int num);
+
+Removes each of the num probes in the specified array at once.
+
+.. note::
+
+ If the functions find some incorrect probes (ex. unregistered
+ probes) in the specified array, they clear the addr field of those
+ incorrect probes. However, other probes in the array are
+ unregistered correctly.
+
+disable_*probe
+--------------
+
+::
+
+ #include <linux/kprobes.h>
+ int disable_kprobe(struct kprobe *kp);
+ int disable_kretprobe(struct kretprobe *rp);
+
+Temporarily disables the specified ``*probe``. You can enable it again by using
+enable_*probe(). You must specify the probe which has been registered.
+
+enable_*probe
+-------------
+
+::
+
+ #include <linux/kprobes.h>
+ int enable_kprobe(struct kprobe *kp);
+ int enable_kretprobe(struct kretprobe *rp);
+
+Enables ``*probe`` which has been disabled by disable_*probe(). You must specify
+the probe which has been registered.
+
+Kprobes Features and Limitations
+================================
+
+Kprobes allows multiple probes at the same address. Also,
+a probepoint for which there is a post_handler cannot be optimized.
+So if you install a kprobe with a post_handler, at an optimized
+probepoint, the probepoint will be unoptimized automatically.
+
+In general, you can install a probe anywhere in the kernel.
+In particular, you can probe interrupt handlers. Known exceptions
+are discussed in this section.
+
+The register_*probe functions will return -EINVAL if you attempt
+to install a probe in the code that implements Kprobes (mostly
+kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such
+as do_page_fault and notifier_call_chain).
+
+If you install a probe in an inline-able function, Kprobes makes
+no attempt to chase down all inline instances of the function and
+install probes there. gcc may inline a function without being asked,
+so keep this in mind if you're not seeing the probe hits you expect.
+
+A probe handler can modify the environment of the probed function
+-- e.g., by modifying kernel data structures, or by modifying the
+contents of the pt_regs struct (which are restored to the registers
+upon return from the breakpoint). So Kprobes can be used, for example,
+to install a bug fix or to inject faults for testing. Kprobes, of
+course, has no way to distinguish the deliberately injected faults
+from the accidental ones. Don't drink and probe.
+
+Kprobes makes no attempt to prevent probe handlers from stepping on
+each other -- e.g., probing printk() and then calling printk() from a
+probe handler. If a probe handler hits a probe, that second probe's
+handlers won't be run in that instance, and the kprobe.nmissed member
+of the second probe will be incremented.
+
+As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
+the same handler) may run concurrently on different CPUs.
+
+Kprobes does not use mutexes or allocate memory except during
+registration and unregistration.
+
+Probe handlers are run with preemption disabled or interrupt disabled,
+which depends on the architecture and optimization state. (e.g.,
+kretprobe handlers and optimized kprobe handlers run without interrupt
+disabled on x86/x86-64). In any case, your handler should not yield
+the CPU (e.g., by attempting to acquire a semaphore, or waiting I/O).
+
+Since a return probe is implemented by replacing the return
+address with the trampoline's address, stack backtraces and calls
+to __builtin_return_address() will typically yield the trampoline's
+address instead of the real return address for kretprobed functions.
+(As far as we can tell, __builtin_return_address() is used only
+for instrumentation and error reporting.)
+
+If the number of times a function is called does not match the number
+of times it returns, registering a return probe on that function may
+produce undesirable results. In such a case, a line:
+kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
+gets printed. With this information, one will be able to correlate the
+exact instance of the kretprobe that caused the problem. We have the
+do_exit() case covered. do_execve() and do_fork() are not an issue.
+We're unaware of other specific cases where this could be a problem.
+
+If, upon entry to or exit from a function, the CPU is running on
+a stack other than that of the current task, registering a return
+probe on that function may produce undesirable results. For this
+reason, Kprobes doesn't support return probes (or kprobes)
+on the x86_64 version of __switch_to(); the registration functions
+return -EINVAL.
+
+On x86/x86-64, since the Jump Optimization of Kprobes modifies
+instructions widely, there are some limitations to optimization. To
+explain it, we introduce some terminology. Imagine a 3-instruction
+sequence consisting of a two 2-byte instructions and one 3-byte
+instruction.
+
+::
+
+ IA
+ |
+ [-2][-1][0][1][2][3][4][5][6][7]
+ [ins1][ins2][ ins3 ]
+ [<- DCR ->]
+ [<- JTPR ->]
+
+ ins1: 1st Instruction
+ ins2: 2nd Instruction
+ ins3: 3rd Instruction
+ IA: Insertion Address
+ JTPR: Jump Target Prohibition Region
+ DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the kprobe, because the bytes in DCR are replaced by
+a 5-byte jump instruction. So there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include a call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border between functions.
+
+Anyway, these limitations are checked by the in-kernel instruction
+decoder, so you don't need to worry about that.
+
+Probe Overhead
+==============
+
+On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
+microseconds to process. Specifically, a benchmark that hits the same
+probepoint repeatedly, firing a simple handler each time, reports 1-2
+million hits per second, depending on the architecture. A return-probe
+hit typically takes 50-75% longer than a kprobe hit.
+When you have a return probe set on a function, adding a kprobe at
+the entry to that function adds essentially no overhead.
+
+Here are sample overhead figures (in usec) for different architectures::
+
+ k = kprobe; r = return probe; kr = kprobe + return probe
+ on same function
+
+ i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
+ k = 0.57 usec; r = 0.92; kr = 0.99
+
+ x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
+ k = 0.49 usec; r = 0.80; kr = 0.82
+
+ ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
+ k = 0.77 usec; r = 1.26; kr = 1.45
+
+Optimized Probe Overhead
+------------------------
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures::
+
+ k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
+ r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+ i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
+
+ x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
+
+TODO
+====
+
+a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
+ programming interface for probe-based instrumentation. Try it out.
+b. Kernel return probes for sparc64.
+c. Support for other architectures.
+d. User-space probes.
+e. Watchpoint probes (which fire on data references).
+
+Kprobes Example
+===============
+
+See samples/kprobes/kprobe_example.c
+
+Kretprobes Example
+==================
+
+See samples/kprobes/kretprobe_example.c
+
+Deprecated Features
+===================
+
+Jprobes is now a deprecated feature. People who are depending on it should
+migrate to other tracing features or use older kernels. Please consider to
+migrate your tool to one of the following options:
+
+- Use trace-event to trace target function with arguments.
+
+ trace-event is a low-overhead (and almost no visible overhead if it
+ is off) statically defined event interface. You can define new events
+ and trace it via ftrace or any other tracing tools.
+
+ See the following urls:
+
+ - https://lwn.net/Articles/379903/
+ - https://lwn.net/Articles/381064/
+ - https://lwn.net/Articles/383362/
+
+- Use ftrace dynamic events (kprobe event) with perf-probe.
+
+ If you build your kernel with debug info (CONFIG_DEBUG_INFO=y), you can
+ find which register/stack is assigned to which local variable or arguments
+ by using perf-probe and set up new event to trace it.
+
+ See following documents:
+
+ - Documentation/trace/kprobetrace.rst
+ - Documentation/trace/events.rst
+ - tools/perf/Documentation/perf-probe.txt
+
+
+The kprobes debugfs interface
+=============================
+
+
+With recent kernels (> 2.6.20) the list of registered kprobes is visible
+under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug).
+
+/sys/kernel/debug/kprobes/list: Lists all registered probes on the system::
+
+ c015d71a k vfs_read+0x0
+ c03dedc5 r tcp_v4_rcv+0x0
+
+The first column provides the kernel address where the probe is inserted.
+The second column identifies the type of probe (k - kprobe and r - kretprobe)
+while the third column specifies the symbol+offset of the probe.
+If the probed function belongs to a module, the module name is also
+specified. Following columns show probe status. If the probe is on
+a virtual address that is no longer valid (module init sections, module
+virtual addresses that correspond to modules that've been unloaded),
+such probes are marked with [GONE]. If the probe is temporarily disabled,
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED]. If the probe is ftrace-based, it is marked with
+[FTRACE].
+
+/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
+
+Provides a knob to globally and forcibly turn registered kprobes ON or OFF.
+By default, all kprobes are enabled. By echoing "0" to this file, all
+registered probes will be disarmed, till such time a "1" is echoed to this
+file. Note that this knob just disarms and arms all kprobes and doesn't
+change each probe's disabling state. This means that disabled kprobes (marked
+[DISABLED]) will be not enabled if you turn ON all kprobes by this knob.
+
+
+The kprobes sysctl interface
+============================
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
+a knob to globally and forcibly turn jump optimization (see section
+:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization
+is allowed (ON). If you echo "0" to this file or set
+"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
+unoptimized, and any new probes registered after that will not be optimized.
+
+Note that this knob *changes* the optimized state. This means that optimized
+probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
+removed). If the knob is turned on, they will be optimized again.
+
+References
+==========
+
+For additional information on Kprobes, refer to the following URLs:
+
+- https://www.ibm.com/developerworks/library/l-kprobes/index.html
+- https://www.kernel.org/doc/ols/2006/ols2006v2-pages-109-124.pdf
+
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 5599305..b175d88 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -30,6 +30,7 @@
p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
+ p:[GRP/]EVENT] [MOD:]SYM[+0]%return [FETCHARGS] : Set a return probe
-:[GRP/]EVENT : Clear a probe
GRP : Group name. If omitted, use "kprobes" for it.
@@ -37,10 +38,11 @@
based on SYM+offs or MEMADDR.
MOD : Module name which has given SYM.
SYM[+offs] : Symbol+offset where the probe is inserted.
+ SYM%return : Return address of the symbol
MEMADDR : Address where the probe is inserted.
MAXACTIVE : Maximum number of instances of the specified function that
can be probed simultaneously, or 0 for the default value
- as defined in Documentation/kprobes.txt section 1.3.1.
+ as defined in Documentation/trace/kprobes.rst section 1.3.1.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@@ -97,6 +99,7 @@
For $comm, the default type is "string"; any other type is invalid.
.. _user_mem_access:
+
User Memory Access
------------------
Kprobe events supports user-space memory access. For that purpose, you can use
@@ -252,4 +255,3 @@
Each line shows when the kernel hits an event, and <- SYMBOL means kernel
returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
returns from do_sys_open to sys_open+0x1b).
-
diff --git a/Documentation/trace/mmiotrace.rst b/Documentation/trace/mmiotrace.rst
index 5116e8c..fed13ea 100644
--- a/Documentation/trace/mmiotrace.rst
+++ b/Documentation/trace/mmiotrace.rst
@@ -5,7 +5,7 @@
Home page and links to optional user space tools:
- http://nouveau.freedesktop.org/wiki/MmioTrace
+ https://nouveau.freedesktop.org/wiki/MmioTrace
MMIO tracing was originally developed by Intel around 2003 for their Fault
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
diff --git a/Documentation/trace/ring-buffer-design.rst b/Documentation/trace/ring-buffer-design.rst
new file mode 100644
index 0000000..c5d77fc
--- /dev/null
+++ b/Documentation/trace/ring-buffer-design.rst
@@ -0,0 +1,983 @@
+.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
+
+===========================
+Lockless Ring Buffer Design
+===========================
+
+Copyright 2009 Red Hat Inc.
+
+:Author: Steven Rostedt <srostedt@redhat.com>
+:License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+:Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
+ and Frederic Weisbecker.
+
+
+Written for: 2.6.31
+
+Terminology used in this Document
+---------------------------------
+
+tail
+ - where new writes happen in the ring buffer.
+
+head
+ - where new reads happen in the ring buffer.
+
+producer
+ - the task that writes into the ring buffer (same as writer)
+
+writer
+ - same as producer
+
+consumer
+ - the task that reads from the buffer (same as reader)
+
+reader
+ - same as consumer.
+
+reader_page
+ - A page outside the ring buffer used solely (for the most part)
+ by the reader.
+
+head_page
+ - a pointer to the page that the reader will use next
+
+tail_page
+ - a pointer to the page that will be written to next
+
+commit_page
+ - a pointer to the page with the last finished non-nested write.
+
+cmpxchg
+ - hardware-assisted atomic transaction that performs the following::
+
+ A = B if previous A == C
+
+ R = cmpxchg(A, C, B) is saying that we replace A with B if and only
+ if current A is equal to C, and we put the old (current)
+ A into R
+
+ R gets the previous A regardless if A is updated with B or not.
+
+ To see if the update was successful a compare of ``R == C``
+ may be used.
+
+The Generic Ring Buffer
+-----------------------
+
+The ring buffer can be used in either an overwrite mode or in
+producer/consumer mode.
+
+Producer/consumer mode is where if the producer were to fill up the
+buffer before the consumer could free up anything, the producer
+will stop writing to the buffer. This will lose most recent events.
+
+Overwrite mode is where if the producer were to fill up the buffer
+before the consumer could free up anything, the producer will
+overwrite the older data. This will lose the oldest events.
+
+No two writers can write at the same time (on the same per-cpu buffer),
+but a writer may interrupt another writer, but it must finish writing
+before the previous writer may continue. This is very important to the
+algorithm. The writers act like a "stack". The way interrupts works
+enforces this behavior::
+
+
+ writer1 start
+ <preempted> writer2 start
+ <preempted> writer3 start
+ writer3 finishes
+ writer2 finishes
+ writer1 finishes
+
+This is very much like a writer being preempted by an interrupt and
+the interrupt doing a write as well.
+
+Readers can happen at any time. But no two readers may run at the
+same time, nor can a reader preempt/interrupt another reader. A reader
+cannot preempt/interrupt a writer, but it may read/consume from the
+buffer at the same time as a writer is writing, but the reader must be
+on another processor to do so. A reader may read on its own processor
+and can be preempted by a writer.
+
+A writer can preempt a reader, but a reader cannot preempt a writer.
+But a reader can read the buffer at the same time (on another processor)
+as a writer.
+
+The ring buffer is made up of a list of pages held together by a linked list.
+
+At initialization a reader page is allocated for the reader that is not
+part of the ring buffer.
+
+The head_page, tail_page and commit_page are all initialized to point
+to the same page.
+
+The reader page is initialized to have its next pointer pointing to
+the head page, and its previous pointer pointing to a page before
+the head page.
+
+The reader has its own page to use. At start up time, this page is
+allocated but is not attached to the list. When the reader wants
+to read from the buffer, if its page is empty (like it is on start-up),
+it will swap its page with the head_page. The old reader page will
+become part of the ring buffer and the head_page will be removed.
+The page after the inserted page (old reader_page) will become the
+new head page.
+
+Once the new page is given to the reader, the reader could do what
+it wants with it, as long as a writer has left that page.
+
+A sample of how the reader page is swapped: Note this does not
+show the head page in the buffer, it is for demonstrating a swap
+only.
+
+::
+
+ +------+
+ |reader| RING BUFFER
+ |page |
+ +------+
+ +---+ +---+ +---+
+ | |-->| |-->| |
+ | |<--| |<--| |
+ +---+ +---+ +---+
+ ^ | ^ |
+ | +-------------+ |
+ +-----------------+
+
+
+ +------+
+ |reader| RING BUFFER
+ |page |-------------------+
+ +------+ v
+ | +---+ +---+ +---+
+ | | |-->| |-->| |
+ | | |<--| |<--| |<-+
+ | +---+ +---+ +---+ |
+ | ^ | ^ | |
+ | | +-------------+ | |
+ | +-----------------+ |
+ +------------------------------------+
+
+ +------+
+ |reader| RING BUFFER
+ |page |-------------------+
+ +------+ <---------------+ v
+ | ^ +---+ +---+ +---+
+ | | | |-->| |-->| |
+ | | | | | |<--| |<-+
+ | | +---+ +---+ +---+ |
+ | | | ^ | |
+ | | +-------------+ | |
+ | +-----------------------------+ |
+ +------------------------------------+
+
+ +------+
+ |buffer| RING BUFFER
+ |page |-------------------+
+ +------+ <---------------+ v
+ | ^ +---+ +---+ +---+
+ | | | | | |-->| |
+ | | New | | | |<--| |<-+
+ | | Reader +---+ +---+ +---+ |
+ | | page ----^ | |
+ | | | |
+ | +-----------------------------+ |
+ +------------------------------------+
+
+
+
+It is possible that the page swapped is the commit page and the tail page,
+if what is in the ring buffer is less than what is held in a buffer page.
+
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+This case is still valid for this algorithm.
+When the writer leaves the page, it simply goes into the ring buffer
+since the reader page still points to the next location in the ring
+buffer.
+
+
+The main pointers:
+
+ reader page
+ - The page used solely by the reader and is not part
+ of the ring buffer (may be swapped in)
+
+ head page
+ - the next page in the ring buffer that will be swapped
+ with the reader page.
+
+ tail page
+ - the page where the next write will take place.
+
+ commit page
+ - the page that last finished a write.
+
+The commit page only is updated by the outermost writer in the
+writer stack. A writer that preempts another writer will not move the
+commit page.
+
+When data is written into the ring buffer, a position is reserved
+in the ring buffer and passed back to the writer. When the writer
+is finished writing data into that position, it commits the write.
+
+Another write (or a read) may take place at anytime during this
+transaction. If another write happens it must finish before continuing
+with the previous write.
+
+
+ Write reserve::
+
+ Buffer page
+ +---------+
+ |written |
+ +---------+ <--- given back to writer (current commit)
+ |reserved |
+ +---------+ <--- tail pointer
+ | empty |
+ +---------+
+
+ Write commit::
+
+ Buffer page
+ +---------+
+ |written |
+ +---------+
+ |written |
+ +---------+ <--- next position for write (current commit)
+ | empty |
+ +---------+
+
+
+ If a write happens after the first reserve::
+
+ Buffer page
+ +---------+
+ |written |
+ +---------+ <-- current commit
+ |reserved |
+ +---------+ <--- given back to second writer
+ |reserved |
+ +---------+ <--- tail pointer
+
+ After second writer commits::
+
+
+ Buffer page
+ +---------+
+ |written |
+ +---------+ <--(last full commit)
+ |reserved |
+ +---------+
+ |pending |
+ |commit |
+ +---------+ <--- tail pointer
+
+ When the first writer commits::
+
+ Buffer page
+ +---------+
+ |written |
+ +---------+
+ |written |
+ +---------+
+ |written |
+ +---------+ <--(last full commit and tail pointer)
+
+
+The commit pointer points to the last write location that was
+committed without preempting another write. When a write that
+preempted another write is committed, it only becomes a pending commit
+and will not be a full commit until all writes have been committed.
+
+The commit page points to the page that has the last full commit.
+The tail page points to the page with the last write (before
+committing).
+
+The tail page is always equal to or after the commit page. It may
+be several pages ahead. If the tail page catches up to the commit
+page then no more writes may take place (regardless of the mode
+of the ring buffer: overwrite and produce/consumer).
+
+The order of pages is::
+
+ head page
+ commit page
+ tail page
+
+Possible scenario::
+
+ tail page
+ head page commit page |
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+There is a special case that the head page is after either the commit page
+and possibly the tail page. That is when the commit (and tail) page has been
+swapped with the reader page. This is because the head page is always
+part of the ring buffer, but the reader page is not. Whenever there
+has been less than a full page that has been committed inside the ring buffer,
+and a reader swaps out a page, it will be swapping out the commit page.
+
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+In this case, the head page will not move when the tail and commit
+move back into the ring buffer.
+
+The reader cannot swap a page into the ring buffer if the commit page
+is still on that page. If the read meets the last commit (real commit
+not pending or reserved), then there is nothing more to read.
+The buffer is considered empty until another full commit finishes.
+
+When the tail meets the head page, if the buffer is in overwrite mode,
+the head page will be pushed ahead one. If the buffer is in producer/consumer
+mode, the write will fail.
+
+Overwrite mode::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+Note, the reader page will still point to the previous head page.
+But when a swap takes place, it will use the most recent head page.
+
+
+Making the Ring Buffer Lockless:
+--------------------------------
+
+The main idea behind the lockless algorithm is to combine the moving
+of the head_page pointer with the swapping of pages with the reader.
+State flags are placed inside the pointer to the page. To do this,
+each page must be aligned in memory by 4 bytes. This will allow the 2
+least significant bits of the address to be used as flags, since
+they will always be zero for the address. To get the address,
+simply mask out the flags::
+
+ MASK = ~3
+
+ address & MASK
+
+Two flags will be kept by these two bits:
+
+ HEADER
+ - the page being pointed to is a head page
+
+ UPDATE
+ - the page being pointed to is being updated by a writer
+ and was or is about to be a head page.
+
+::
+
+ reader page
+ |
+ v
+ +---+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+
+The above pointer "-H->" would have the HEADER flag set. That is
+the next page is the next page to be swapped out by the reader.
+This pointer means the next page is the head page.
+
+When the tail page meets the head pointer, it will use cmpxchg to
+change the pointer to the UPDATE state::
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+"-U->" represents a pointer in the UPDATE state.
+
+Any access to the reader will need to take some sort of lock to serialize
+the readers. But the writers will never take a lock to write to the
+ring buffer. This means we only need to worry about a single reader,
+and writes only preempt in "stack" formation.
+
+When the reader tries to swap the page with the ring buffer, it
+will also use cmpxchg. If the flag bit in the pointer to the
+head page does not have the HEADER flag set, the compare will fail
+and the reader will need to look for the new head page and try again.
+Note, the flags UPDATE and HEADER are never set at the same time.
+
+The reader swaps the reader page as follows::
+
+ +------+
+ |reader| RING BUFFER
+ |page |
+ +------+
+ +---+ +---+ +---+
+ | |--->| |--->| |
+ | |<---| |<---| |
+ +---+ +---+ +---+
+ ^ | ^ |
+ | +---------------+ |
+ +-----H-------------+
+
+The reader sets the reader page next pointer as HEADER to the page after
+the head page::
+
+
+ +------+
+ |reader| RING BUFFER
+ |page |-------H-----------+
+ +------+ v
+ | +---+ +---+ +---+
+ | | |--->| |--->| |
+ | | |<---| |<---| |<-+
+ | +---+ +---+ +---+ |
+ | ^ | ^ | |
+ | | +---------------+ | |
+ | +-----H-------------+ |
+ +--------------------------------------+
+
+It does a cmpxchg with the pointer to the previous head page to make it
+point to the reader page. Note that the new pointer does not have the HEADER
+flag set. This action atomically moves the head page forward::
+
+ +------+
+ |reader| RING BUFFER
+ |page |-------H-----------+
+ +------+ v
+ | ^ +---+ +---+ +---+
+ | | | |-->| |-->| |
+ | | | |<--| |<--| |<-+
+ | | +---+ +---+ +---+ |
+ | | | ^ | |
+ | | +-------------+ | |
+ | +-----------------------------+ |
+ +------------------------------------+
+
+After the new head page is set, the previous pointer of the head page is
+updated to the reader page::
+
+ +------+
+ |reader| RING BUFFER
+ |page |-------H-----------+
+ +------+ <---------------+ v
+ | ^ +---+ +---+ +---+
+ | | | |-->| |-->| |
+ | | | | | |<--| |<-+
+ | | +---+ +---+ +---+ |
+ | | | ^ | |
+ | | +-------------+ | |
+ | +-----------------------------+ |
+ +------------------------------------+
+
+ +------+
+ |buffer| RING BUFFER
+ |page |-------H-----------+ <--- New head page
+ +------+ <---------------+ v
+ | ^ +---+ +---+ +---+
+ | | | | | |-->| |
+ | | New | | | |<--| |<-+
+ | | Reader +---+ +---+ +---+ |
+ | | page ----^ | |
+ | | | |
+ | +-----------------------------+ |
+ +------------------------------------+
+
+Another important point: The page that the reader page points back to
+by its previous pointer (the one that now points to the new head page)
+never points back to the reader page. That is because the reader page is
+not part of the ring buffer. Traversing the ring buffer via the next pointers
+will always stay in the ring buffer. Traversing the ring buffer via the
+prev pointers may not.
+
+Note, the way to determine a reader page is simply by examining the previous
+pointer of the page. If the next pointer of the previous page does not
+point back to the original page, then the original page is a reader page::
+
+
+ +--------+
+ | reader | next +----+
+ | page |-------->| |<====== (buffer page)
+ +--------+ +----+
+ | | ^
+ | v | next
+ prev | +----+
+ +------------->| |
+ +----+
+
+The way the head page moves forward:
+
+When the tail page meets the head page and the buffer is in overwrite mode
+and more writes take place, the head page must be moved forward before the
+writer may move the tail page. The way this is done is that the writer
+performs a cmpxchg to convert the pointer to the head page from the HEADER
+flag to have the UPDATE flag set. Once this is done, the reader will
+not be able to swap the head page from the buffer, nor will it be able to
+move the head page, until the writer is finished with the move.
+
+This eliminates any races that the reader can have on the writer. The reader
+must spin, and this is why the reader cannot preempt the writer::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The following page will be made into the new head page::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+After the new head page has been set, we can set the old head page
+pointer back to NORMAL::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+After the head page has been moved, the tail page may now move forward::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+
+The above are the trivial updates. Now for the more complex scenarios.
+
+
+As stated before, if enough writes preempt the first write, the
+tail page may make it all the way around the buffer and meet the commit
+page. At this time, we must start dropping writes (usually with some kind
+of warning to the user). But what happens if the commit was still on the
+reader page? The commit page is not part of the ring buffer. The tail page
+must account for this::
+
+
+ reader page commit page
+ | |
+ v |
+ +---+ |
+ | |<----------+
+ | |
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ tail page
+
+If the tail page were to simply push the head page forward, the commit when
+leaving the reader page would not be pointing to the correct page.
+
+The solution to this is to test if the commit page is on the reader page
+before pushing the head page. If it is, then it can be assumed that the
+tail page wrapped the buffer, and we must drop new writes.
+
+This is not a race condition, because the commit page can only be moved
+by the outermost writer (the writer that was preempted).
+This means that the commit will not move while a writer is moving the
+tail page. The reader cannot swap the reader page if it is also being
+used as the commit page. The reader can simply check that the commit
+is off the reader page. Once the commit page leaves the reader page
+it will never go back on it unless a reader does another swap with the
+buffer page that is also the commit page.
+
+
+Nested writes
+-------------
+
+In the pushing forward of the tail page we must first push forward
+the head page if the head page is the next page. If the head page
+is not the next page, the tail page is simply updated with a cmpxchg.
+
+Only writers move the tail page. This must be done atomically to protect
+against nested writers::
+
+ temp_page = tail_page
+ next_page = temp_page->next
+ cmpxchg(tail_page, temp_page, next_page)
+
+The above will update the tail page if it is still pointing to the expected
+page. If this fails, a nested write pushed it forward, the current write
+does not need to push it::
+
+
+ temp page
+ |
+ v
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Nested write comes in and moves the tail page forward::
+
+ tail page (moved by nested writer)
+ temp page |
+ | |
+ v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The above would fail the cmpxchg, but since the tail page has already
+been moved forward, the writer will just try again to reserve storage
+on the new tail page.
+
+But the moving of the head page is a bit more complex::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The write converts the head page pointer to UPDATE::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+But if a nested writer preempts here, it will see that the next
+page is a head page, but it is also nested. It will detect that
+it is nested and will save that information. The detection is the
+fact that it sees the UPDATE flag instead of a HEADER or NORMAL
+pointer.
+
+The nested writer will set the new head page pointer::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+But it will not reset the update back to normal. Only the writer
+that converted a pointer from HEAD to UPDATE will convert it back
+to NORMAL::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+After the nested writer finishes, the outermost writer will convert
+the UPDATE pointer to NORMAL::
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+
+It can be even more complex if several nested writes came in and moved
+the tail page ahead several pages::
+
+
+ (first writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The write converts the head page pointer to UPDATE::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Next writer comes in, and sees the update and sets up the new
+head page::
+
+ (second writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The nested writer moves the tail page forward. But does not set the old
+update page to NORMAL because it is not the outermost writer::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Another writer preempts and sees the page after the tail page is a head page.
+It changes it from HEAD to UPDATE::
+
+ (third writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The writer will move the head page forward::
+
+
+ (third writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+But now that the third writer did change the HEAD flag to UPDATE it
+will convert it to normal::
+
+
+ (third writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+
+Then it will move the tail page, and return back to the second writer::
+
+
+ (second writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+
+The second writer will fail to move the tail page because it was already
+moved, so it will try again and add its data to the new tail page.
+It will return to the first writer::
+
+
+ (first writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The first writer cannot know atomically if the tail page moved
+while it updates the HEAD page. It will then update the head page to
+what it thinks is the new head page::
+
+
+ (first writer)
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Since the cmpxchg returns the old value of the pointer the first writer
+will see it succeeded in updating the pointer from NORMAL to HEAD.
+But as we can see, this is not good enough. It must also check to see
+if the tail page is either where it use to be or on the next page::
+
+
+ (first writer)
+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+If tail page != A and tail page != B, then it must reset the pointer
+back to NORMAL. The fact that it only needs to worry about nested
+writers means that it only needs to check this after setting the HEAD page::
+
+
+ (first writer)
+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Now the writer can update the head page. This is also why the head page must
+remain in UPDATE and only reset by the outermost writer. This prevents
+the reader from seeing the incorrect head page::
+
+
+ (first writer)
+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
deleted file mode 100644
index ff747b6..0000000
--- a/Documentation/trace/ring-buffer-design.txt
+++ /dev/null
@@ -1,955 +0,0 @@
- Lockless Ring Buffer Design
- ===========================
-
-Copyright 2009 Red Hat Inc.
- Author: Steven Rostedt <srostedt@redhat.com>
- License: The GNU Free Documentation License, Version 1.2
- (dual licensed under the GPL v2)
-Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
- and Frederic Weisbecker.
-
-
-Written for: 2.6.31
-
-Terminology used in this Document
----------------------------------
-
-tail - where new writes happen in the ring buffer.
-
-head - where new reads happen in the ring buffer.
-
-producer - the task that writes into the ring buffer (same as writer)
-
-writer - same as producer
-
-consumer - the task that reads from the buffer (same as reader)
-
-reader - same as consumer.
-
-reader_page - A page outside the ring buffer used solely (for the most part)
- by the reader.
-
-head_page - a pointer to the page that the reader will use next
-
-tail_page - a pointer to the page that will be written to next
-
-commit_page - a pointer to the page with the last finished non-nested write.
-
-cmpxchg - hardware-assisted atomic transaction that performs the following:
-
- A = B iff previous A == C
-
- R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
- current A is equal to C, and we put the old (current) A into R
-
- R gets the previous A regardless if A is updated with B or not.
-
- To see if the update was successful a compare of R == C may be used.
-
-The Generic Ring Buffer
------------------------
-
-The ring buffer can be used in either an overwrite mode or in
-producer/consumer mode.
-
-Producer/consumer mode is where if the producer were to fill up the
-buffer before the consumer could free up anything, the producer
-will stop writing to the buffer. This will lose most recent events.
-
-Overwrite mode is where if the producer were to fill up the buffer
-before the consumer could free up anything, the producer will
-overwrite the older data. This will lose the oldest events.
-
-No two writers can write at the same time (on the same per-cpu buffer),
-but a writer may interrupt another writer, but it must finish writing
-before the previous writer may continue. This is very important to the
-algorithm. The writers act like a "stack". The way interrupts works
-enforces this behavior.
-
-
- writer1 start
- <preempted> writer2 start
- <preempted> writer3 start
- writer3 finishes
- writer2 finishes
- writer1 finishes
-
-This is very much like a writer being preempted by an interrupt and
-the interrupt doing a write as well.
-
-Readers can happen at any time. But no two readers may run at the
-same time, nor can a reader preempt/interrupt another reader. A reader
-cannot preempt/interrupt a writer, but it may read/consume from the
-buffer at the same time as a writer is writing, but the reader must be
-on another processor to do so. A reader may read on its own processor
-and can be preempted by a writer.
-
-A writer can preempt a reader, but a reader cannot preempt a writer.
-But a reader can read the buffer at the same time (on another processor)
-as a writer.
-
-The ring buffer is made up of a list of pages held together by a linked list.
-
-At initialization a reader page is allocated for the reader that is not
-part of the ring buffer.
-
-The head_page, tail_page and commit_page are all initialized to point
-to the same page.
-
-The reader page is initialized to have its next pointer pointing to
-the head page, and its previous pointer pointing to a page before
-the head page.
-
-The reader has its own page to use. At start up time, this page is
-allocated but is not attached to the list. When the reader wants
-to read from the buffer, if its page is empty (like it is on start-up),
-it will swap its page with the head_page. The old reader page will
-become part of the ring buffer and the head_page will be removed.
-The page after the inserted page (old reader_page) will become the
-new head page.
-
-Once the new page is given to the reader, the reader could do what
-it wants with it, as long as a writer has left that page.
-
-A sample of how the reader page is swapped: Note this does not
-show the head page in the buffer, it is for demonstrating a swap
-only.
-
- +------+
- |reader| RING BUFFER
- |page |
- +------+
- +---+ +---+ +---+
- | |-->| |-->| |
- | |<--| |<--| |
- +---+ +---+ +---+
- ^ | ^ |
- | +-------------+ |
- +-----------------+
-
-
- +------+
- |reader| RING BUFFER
- |page |-------------------+
- +------+ v
- | +---+ +---+ +---+
- | | |-->| |-->| |
- | | |<--| |<--| |<-+
- | +---+ +---+ +---+ |
- | ^ | ^ | |
- | | +-------------+ | |
- | +-----------------+ |
- +------------------------------------+
-
- +------+
- |reader| RING BUFFER
- |page |-------------------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | | | |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
- +------+
- |buffer| RING BUFFER
- |page |-------------------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | | | |-->| |
- | | New | | | |<--| |<-+
- | | Reader +---+ +---+ +---+ |
- | | page ----^ | |
- | | | |
- | +-----------------------------+ |
- +------------------------------------+
-
-
-
-It is possible that the page swapped is the commit page and the tail page,
-if what is in the ring buffer is less than what is held in a buffer page.
-
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-This case is still valid for this algorithm.
-When the writer leaves the page, it simply goes into the ring buffer
-since the reader page still points to the next location in the ring
-buffer.
-
-
-The main pointers:
-
- reader page - The page used solely by the reader and is not part
- of the ring buffer (may be swapped in)
-
- head page - the next page in the ring buffer that will be swapped
- with the reader page.
-
- tail page - the page where the next write will take place.
-
- commit page - the page that last finished a write.
-
-The commit page only is updated by the outermost writer in the
-writer stack. A writer that preempts another writer will not move the
-commit page.
-
-When data is written into the ring buffer, a position is reserved
-in the ring buffer and passed back to the writer. When the writer
-is finished writing data into that position, it commits the write.
-
-Another write (or a read) may take place at anytime during this
-transaction. If another write happens it must finish before continuing
-with the previous write.
-
-
- Write reserve:
-
- Buffer page
- +---------+
- |written |
- +---------+ <--- given back to writer (current commit)
- |reserved |
- +---------+ <--- tail pointer
- | empty |
- +---------+
-
- Write commit:
-
- Buffer page
- +---------+
- |written |
- +---------+
- |written |
- +---------+ <--- next position for write (current commit)
- | empty |
- +---------+
-
-
- If a write happens after the first reserve:
-
- Buffer page
- +---------+
- |written |
- +---------+ <-- current commit
- |reserved |
- +---------+ <--- given back to second writer
- |reserved |
- +---------+ <--- tail pointer
-
- After second writer commits:
-
-
- Buffer page
- +---------+
- |written |
- +---------+ <--(last full commit)
- |reserved |
- +---------+
- |pending |
- |commit |
- +---------+ <--- tail pointer
-
- When the first writer commits:
-
- Buffer page
- +---------+
- |written |
- +---------+
- |written |
- +---------+
- |written |
- +---------+ <--(last full commit and tail pointer)
-
-
-The commit pointer points to the last write location that was
-committed without preempting another write. When a write that
-preempted another write is committed, it only becomes a pending commit
-and will not be a full commit until all writes have been committed.
-
-The commit page points to the page that has the last full commit.
-The tail page points to the page with the last write (before
-committing).
-
-The tail page is always equal to or after the commit page. It may
-be several pages ahead. If the tail page catches up to the commit
-page then no more writes may take place (regardless of the mode
-of the ring buffer: overwrite and produce/consumer).
-
-The order of pages is:
-
- head page
- commit page
- tail page
-
-Possible scenario:
- tail page
- head page commit page |
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-There is a special case that the head page is after either the commit page
-and possibly the tail page. That is when the commit (and tail) page has been
-swapped with the reader page. This is because the head page is always
-part of the ring buffer, but the reader page is not. Whenever there
-has been less than a full page that has been committed inside the ring buffer,
-and a reader swaps out a page, it will be swapping out the commit page.
-
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
-In this case, the head page will not move when the tail and commit
-move back into the ring buffer.
-
-The reader cannot swap a page into the ring buffer if the commit page
-is still on that page. If the read meets the last commit (real commit
-not pending or reserved), then there is nothing more to read.
-The buffer is considered empty until another full commit finishes.
-
-When the tail meets the head page, if the buffer is in overwrite mode,
-the head page will be pushed ahead one. If the buffer is in producer/consumer
-mode, the write will fail.
-
-Overwrite mode:
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-Note, the reader page will still point to the previous head page.
-But when a swap takes place, it will use the most recent head page.
-
-
-Making the Ring Buffer Lockless:
---------------------------------
-
-The main idea behind the lockless algorithm is to combine the moving
-of the head_page pointer with the swapping of pages with the reader.
-State flags are placed inside the pointer to the page. To do this,
-each page must be aligned in memory by 4 bytes. This will allow the 2
-least significant bits of the address to be used as flags, since
-they will always be zero for the address. To get the address,
-simply mask out the flags.
-
- MASK = ~3
-
- address & MASK
-
-Two flags will be kept by these two bits:
-
- HEADER - the page being pointed to is a head page
-
- UPDATE - the page being pointed to is being updated by a writer
- and was or is about to be a head page.
-
-
- reader page
- |
- v
- +---+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The above pointer "-H->" would have the HEADER flag set. That is
-the next page is the next page to be swapped out by the reader.
-This pointer means the next page is the head page.
-
-When the tail page meets the head pointer, it will use cmpxchg to
-change the pointer to the UPDATE state:
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-"-U->" represents a pointer in the UPDATE state.
-
-Any access to the reader will need to take some sort of lock to serialize
-the readers. But the writers will never take a lock to write to the
-ring buffer. This means we only need to worry about a single reader,
-and writes only preempt in "stack" formation.
-
-When the reader tries to swap the page with the ring buffer, it
-will also use cmpxchg. If the flag bit in the pointer to the
-head page does not have the HEADER flag set, the compare will fail
-and the reader will need to look for the new head page and try again.
-Note, the flags UPDATE and HEADER are never set at the same time.
-
-The reader swaps the reader page as follows:
-
- +------+
- |reader| RING BUFFER
- |page |
- +------+
- +---+ +---+ +---+
- | |--->| |--->| |
- | |<---| |<---| |
- +---+ +---+ +---+
- ^ | ^ |
- | +---------------+ |
- +-----H-------------+
-
-The reader sets the reader page next pointer as HEADER to the page after
-the head page.
-
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ v
- | +---+ +---+ +---+
- | | |--->| |--->| |
- | | |<---| |<---| |<-+
- | +---+ +---+ +---+ |
- | ^ | ^ | |
- | | +---------------+ | |
- | +-----H-------------+ |
- +--------------------------------------+
-
-It does a cmpxchg with the pointer to the previous head page to make it
-point to the reader page. Note that the new pointer does not have the HEADER
-flag set. This action atomically moves the head page forward.
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | |<--| |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
-After the new head page is set, the previous pointer of the head page is
-updated to the reader page.
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | | | |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
- +------+
- |buffer| RING BUFFER
- |page |-------H-----------+ <--- New head page
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | | | |-->| |
- | | New | | | |<--| |<-+
- | | Reader +---+ +---+ +---+ |
- | | page ----^ | |
- | | | |
- | +-----------------------------+ |
- +------------------------------------+
-
-Another important point: The page that the reader page points back to
-by its previous pointer (the one that now points to the new head page)
-never points back to the reader page. That is because the reader page is
-not part of the ring buffer. Traversing the ring buffer via the next pointers
-will always stay in the ring buffer. Traversing the ring buffer via the
-prev pointers may not.
-
-Note, the way to determine a reader page is simply by examining the previous
-pointer of the page. If the next pointer of the previous page does not
-point back to the original page, then the original page is a reader page:
-
-
- +--------+
- | reader | next +----+
- | page |-------->| |<====== (buffer page)
- +--------+ +----+
- | | ^
- | v | next
- prev | +----+
- +------------->| |
- +----+
-
-The way the head page moves forward:
-
-When the tail page meets the head page and the buffer is in overwrite mode
-and more writes take place, the head page must be moved forward before the
-writer may move the tail page. The way this is done is that the writer
-performs a cmpxchg to convert the pointer to the head page from the HEADER
-flag to have the UPDATE flag set. Once this is done, the reader will
-not be able to swap the head page from the buffer, nor will it be able to
-move the head page, until the writer is finished with the move.
-
-This eliminates any races that the reader can have on the writer. The reader
-must spin, and this is why the reader cannot preempt the writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The following page will be made into the new head page.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the new head page has been set, we can set the old head page
-pointer back to NORMAL.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the head page has been moved, the tail page may now move forward.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The above are the trivial updates. Now for the more complex scenarios.
-
-
-As stated before, if enough writes preempt the first write, the
-tail page may make it all the way around the buffer and meet the commit
-page. At this time, we must start dropping writes (usually with some kind
-of warning to the user). But what happens if the commit was still on the
-reader page? The commit page is not part of the ring buffer. The tail page
-must account for this.
-
-
- reader page commit page
- | |
- v |
- +---+ |
- | |<----------+
- | |
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- tail page
-
-If the tail page were to simply push the head page forward, the commit when
-leaving the reader page would not be pointing to the correct page.
-
-The solution to this is to test if the commit page is on the reader page
-before pushing the head page. If it is, then it can be assumed that the
-tail page wrapped the buffer, and we must drop new writes.
-
-This is not a race condition, because the commit page can only be moved
-by the outermost writer (the writer that was preempted).
-This means that the commit will not move while a writer is moving the
-tail page. The reader cannot swap the reader page if it is also being
-used as the commit page. The reader can simply check that the commit
-is off the reader page. Once the commit page leaves the reader page
-it will never go back on it unless a reader does another swap with the
-buffer page that is also the commit page.
-
-
-Nested writes
--------------
-
-In the pushing forward of the tail page we must first push forward
-the head page if the head page is the next page. If the head page
-is not the next page, the tail page is simply updated with a cmpxchg.
-
-Only writers move the tail page. This must be done atomically to protect
-against nested writers.
-
- temp_page = tail_page
- next_page = temp_page->next
- cmpxchg(tail_page, temp_page, next_page)
-
-The above will update the tail page if it is still pointing to the expected
-page. If this fails, a nested write pushed it forward, the current write
-does not need to push it.
-
-
- temp page
- |
- v
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Nested write comes in and moves the tail page forward:
-
- tail page (moved by nested writer)
- temp page |
- | |
- v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The above would fail the cmpxchg, but since the tail page has already
-been moved forward, the writer will just try again to reserve storage
-on the new tail page.
-
-But the moving of the head page is a bit more complex.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The write converts the head page pointer to UPDATE.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But if a nested writer preempts here, it will see that the next
-page is a head page, but it is also nested. It will detect that
-it is nested and will save that information. The detection is the
-fact that it sees the UPDATE flag instead of a HEADER or NORMAL
-pointer.
-
-The nested writer will set the new head page pointer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But it will not reset the update back to normal. Only the writer
-that converted a pointer from HEAD to UPDATE will convert it back
-to NORMAL.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the nested writer finishes, the outermost writer will convert
-the UPDATE pointer to NORMAL.
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-It can be even more complex if several nested writes came in and moved
-the tail page ahead several pages:
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The write converts the head page pointer to UPDATE.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Next writer comes in, and sees the update and sets up the new
-head page.
-
-(second writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The nested writer moves the tail page forward. But does not set the old
-update page to NORMAL because it is not the outermost writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Another writer preempts and sees the page after the tail page is a head page.
-It changes it from HEAD to UPDATE.
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The writer will move the head page forward:
-
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But now that the third writer did change the HEAD flag to UPDATE it
-will convert it to normal:
-
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-Then it will move the tail page, and return back to the second writer.
-
-
-(second writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The second writer will fail to move the tail page because it was already
-moved, so it will try again and add its data to the new tail page.
-It will return to the first writer.
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The first writer cannot know atomically if the tail page moved
-while it updates the HEAD page. It will then update the head page to
-what it thinks is the new head page.
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Since the cmpxchg returns the old value of the pointer the first writer
-will see it succeeded in updating the pointer from NORMAL to HEAD.
-But as we can see, this is not good enough. It must also check to see
-if the tail page is either where it use to be or on the next page:
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-If tail page != A and tail page != B, then it must reset the pointer
-back to NORMAL. The fact that it only needs to worry about nested
-writers means that it only needs to check this after setting the HEAD page.
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Now the writer can update the head page. This is also why the head page must
-remain in UPDATE and only reset by the outermost writer. This prevents
-the reader from seeing the incorrect head page.
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
diff --git a/Documentation/trace/stm.rst b/Documentation/trace/stm.rst
index 99f9996..1ed49dd 100644
--- a/Documentation/trace/stm.rst
+++ b/Documentation/trace/stm.rst
@@ -33,8 +33,8 @@
have a name (string identifier) and a range of masters and channels
associated with it, located in "stp-policy" subsystem directory in
configfs. The topmost directory's name (the policy) is formatted as
-the STM device name to which this policy applies and and arbitrary
-string identifier separated by a stop. From the examle above, a rule
+the STM device name to which this policy applies and an arbitrary
+string identifier separated by a stop. From the example above, a rule
may look like this::
$ ls /config/stp-policy/dummy_stm.my-policy/user
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index 6e3ce3b..0cb8d9c 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -146,3 +146,30 @@
define tracepoints. Check http://lwn.net/Articles/379903,
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
for a series of articles with more details.
+
+If you require calling a tracepoint from a header file, it is not
+recommended to call one directly or to use the trace_<tracepoint>_enabled()
+function call, as tracepoints in header files can have side effects if a
+header is included from a file that has CREATE_TRACE_POINTS set, as
+well as the trace_<tracepoint>() is not that small of an inline
+and can bloat the kernel if used by other inlined functions. Instead,
+include tracepoint-defs.h and use tracepoint_enabled().
+
+In a C file::
+
+ void do_trace_foo_bar_wrapper(args)
+ {
+ trace_foo_bar(args);
+ }
+
+In the header file::
+
+ DECLARE_TRACEPOINT(foo_bar);
+
+ static inline void some_inline_function()
+ {
+ [..]
+ if (tracepoint_enabled(foo_bar))
+ do_trace_foo_bar_wrapper(args);
+ [..]
+ }
diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst
index 98cde99..a8e5938 100644
--- a/Documentation/trace/uprobetracer.rst
+++ b/Documentation/trace/uprobetracer.rst
@@ -28,6 +28,7 @@
p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
+ p[:[GRP/]EVENT] PATH:OFFSET%return [FETCHARGS] : Set a return uprobe (uretprobe)
-:[GRP/]EVENT : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
@@ -35,6 +36,7 @@
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
+ OFFSET%return : Offset where the return probe is inserted.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG