=========================
NXP SJA1105 switch driver
=========================

Overview
========

The NXP SJA1105 is a family of 6 devices:

- SJA1105E: First generation, no TTEthernet
- SJA1105T: First generation, TTEthernet
- SJA1105P: Second generation, no TTEthernet, no SGMII
- SJA1105Q: Second generation, TTEthernet, no SGMII
- SJA1105R: Second generation, no TTEthernet, SGMII
- SJA1105S: Second generation, TTEthernet, SGMII

These are SPI-managed automotive switches, with all ports being gigabit
capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.

Being automotive parts, their configuration interface is geared towards
set-and-forget use, with minimal dynamic interaction at runtime. They
require a static configuration to be composed by software and packed
with CRC and table headers, and sent over SPI.

The static configuration is composed of several configuration tables. Each
table takes a number of entries. Some configuration tables can be (partially)
reconfigured at runtime, some not. Some tables are mandatory, some not:

============================= ================== =============================
Table                         Mandatory          Reconfigurable
============================= ================== =============================
Schedule                      no                 no
Schedule entry points         if Scheduling      no
VL Lookup                     no                 no
VL Policing                   if VL Lookup       no
VL Forwarding                 if VL Lookup       no
L2 Lookup                     no                 no
L2 Policing                   yes                no
VLAN Lookup                   yes                yes
L2 Forwarding                 yes                partially (fully on P/Q/R/S)
MAC Config                    yes                partially (fully on P/Q/R/S)
Schedule Params               if Scheduling      no
Schedule Entry Points Params  if Scheduling      no
VL Forwarding Params          if VL Forwarding   no
L2 Lookup Params              no                 partially (fully on P/Q/R/S)
L2 Forwarding Params          yes                no
Clock Sync Params             no                 no
AVB Params                    no                 no
General Params                yes                partially
Retagging                     no                 yes
xMII Params                   yes                no
SGMII                         no                 yes
============================= ================== =============================


Also, the configuration is write-only: with very few exceptions, software
cannot read it back from the switch.

The driver creates a static configuration at probe time, and keeps it in
memory at all times, as a shadow of the hardware state. When a hardware
setting must be changed, the static configuration is updated as well.
If the changed setting can be transmitted to the switch through the dynamic
reconfiguration interface, it is; otherwise the switch is reset and
reprogrammed with the updated static configuration.
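
The update policy described above can be sketched as a lookup over the
"Reconfigurable" column of the table. This is only an illustration and not the
driver's code: the function name ``update_method`` and its return strings are
hypothetical, and tables that are only partially reconfigurable are reported
as such because the outcome depends on the field and device generation.

```shell
#!/bin/bash
# Hypothetical helper: how would a change to the given table reach the
# hardware? The answers mirror the "Reconfigurable" column of the table.
update_method() {
        local table="$1"

        case "${table}" in
        "VLAN Lookup"|"Retagging"|"SGMII")
                # Listed as reconfigurable at runtime.
                echo "dynamic"
                ;;
        "L2 Forwarding"|"MAC Config"|"L2 Lookup Params"|"General Params")
                # Only some fields (or only P/Q/R/S devices) are dynamic.
                echo "dynamic-or-reset"
                ;;
        *)
                # Everything else: reset and upload the full static config.
                echo "reset"
                ;;
        esac
}

update_method "VLAN Lookup"
update_method "xMII Params"
```

In this sketch, any table name that is not explicitly listed falls through to
the reset path, which matches the conservative behavior described above.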

Traffic support
===============

The switches do not support switch tagging in hardware, but they do support
customizing the TPID by which VLAN traffic is identified as such. The switch
driver leverages ``CONFIG_NET_DSA_TAG_8021Q`` by requesting that special
VLANs (with a custom TPID of ``ETH_P_EDSA`` instead of ``ETH_P_8021Q``) are
installed on its ports when not in ``vlan_filtering`` mode. This does not
interfere with the reception and transmission of real 802.1Q-tagged traffic,
because the switch no longer parses those packets as VLAN after the TPID
change.
The TPID is restored when ``vlan_filtering`` is requested by the user through
the bridge layer, and general IP termination is no longer possible through
the switch netdevices in this mode.
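
The TPID switch-over can be summarized with a small sketch. The two EtherType
values are the ones defined in ``include/uapi/linux/if_ether.h``; the helper
name ``switch_tpid`` is hypothetical and only illustrates the rule, it is not
part of the driver.

```shell
#!/bin/bash
# EtherType constants as defined in include/uapi/linux/if_ether.h.
ETH_P_8021Q=0x8100
ETH_P_EDSA=0xDADA

# Hypothetical helper: which TPID does the switch treat as VLAN for a given
# vlan_filtering setting (0 or 1)?
switch_tpid() {
        local vlan_filtering="$1"

        if [ "${vlan_filtering}" -eq 1 ]; then
                # VLAN-aware: real 802.1Q tags are parsed.
                printf "0x%04X\n" "$((ETH_P_8021Q))"
        else
                # VLAN-unaware: only the driver's tagging VLANs are parsed.
                printf "0x%04X\n" "$((ETH_P_EDSA))"
        fi
}

switch_tpid 0
switch_tpid 1
```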

The switches have two programmable filters for link-local destination MACs.
These are used to trap BPDUs and PTP traffic to the master netdevice, and are
further used to support STP and 1588 ordinary clock/boundary clock
functionality.

The following traffic modes are supported over the switch netdevices:

+--------------------+------------+------------------+------------------+
|                    | Standalone | Bridged with     | Bridged with     |
|                    | ports      | vlan_filtering 0 | vlan_filtering 1 |
+====================+============+==================+==================+
| Regular traffic    | Yes        | Yes              | No (use master)  |
+--------------------+------------+------------------+------------------+
| Management traffic | Yes        | Yes              | Yes              |
| (BPDU, PTP)        |            |                  |                  |
+--------------------+------------+------------------+------------------+

Switching features
==================

The driver supports the configuration of L2 forwarding rules in hardware for
port bridging. The forwarding, broadcast and flooding domain between ports can
be restricted through two methods: either at the L2 forwarding level (isolate
one bridge's ports from another's) or at the VLAN port membership level
(isolate ports within the same bridge). The final forwarding decision taken by
the hardware is a logical AND of these two sets of rules.
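
The logical AND of the two rule sets can be pictured with port bitmasks. The
sketch below assumes a representation where bit N stands for port N; the
function name and the mask layout are illustrative only and do not reflect
the hardware's actual table format.

```shell
#!/bin/bash
# Hypothetical helper: combine two port bitmasks (bit N set = port N allowed).
#   $1: destination ports reachable per the L2 forwarding rules
#   $2: destination ports that are members of the frame's VLAN
allowed_dest_ports() {
        local l2_fwd_mask="$1"
        local vlan_member_mask="$2"

        printf "0x%02x\n" "$(( l2_fwd_mask & vlan_member_mask ))"
}

# Ports 0-3 share a bridge (0x0f), but the VLAN only contains ports 1 and 2
# (0x06), so frames in that VLAN can only be forwarded to ports 1 and 2.
allowed_dest_ports 0x0f 0x06
```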

The hardware tags all traffic internally with a port-based VLAN (pvid), or it
decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification
is not possible. Once attributed a VLAN tag, frames are checked against the
port's membership rules and dropped at ingress if they don't match any VLAN.
This behavior is available when switch ports are enslaved to a bridge with
``vlan_filtering 1``.

Normally the hardware is not configurable with respect to VLAN awareness, but
by changing what TPID the switch searches 802.1Q tags for, the semantics of a
bridge with ``vlan_filtering 0`` can be kept (accept all traffic, tagged or
untagged), and therefore this mode is also supported.

Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
all bridges should have the same level of VLAN awareness (either both have
``vlan_filtering`` 0, or both 1). Also, because VLAN awareness is global at the
switch level, once a bridge with ``vlan_filtering 1`` enslaves at least one
switch port, the other un-bridged ports are no longer available for standalone
traffic termination.

Topology and loop detection through STP is supported.

L2 FDB manipulation (add/delete/dump) is currently possible for the first
generation devices. Aging time of FDB entries, as well as enabling fully static
management (no address learning and no flooding of unknown traffic) is not yet
configurable in the driver.

A special comment about bridging with other netdevices (illustrated with an
example):

A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1.
The switch ports (swp0-3) are under br0.
It is desired that eth0 is turned into another switched port that communicates
with swp0-3.

If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the
intended results.
If br0 has vlan_filtering 1, then a new br1 interface needs to be created that
enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
this mode, the switch ports beneath br0 are not capable of regular traffic, and
are only used as a conduit for switchdev operations.
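
The two cases of the board example can be written down as iproute2 commands.
The sketch below only prints the commands instead of running them, so it can
be inspected safely on any machine. The interface names are the ones from the
example; the function name ``plan_bridges`` is made up for illustration.

```shell
#!/bin/bash
# Print (do not execute) the iproute2 commands for the example board.
# $1: the vlan_filtering setting of br0 (0 or 1).
plan_bridges() {
        local vlan_filtering="$1"
        local port

        echo "ip link add name br0 type bridge vlan_filtering ${vlan_filtering}"
        for port in swp0 swp1 swp2 swp3; do
                echo "ip link set dev ${port} master br0"
        done

        if [ "${vlan_filtering}" -eq 0 ]; then
                # VLAN-unaware: eth0 joins the same bridge as the switch ports.
                echo "ip link set dev eth0 master br0"
        else
                # VLAN-aware: bridge eth0 with the DSA master (eth1) instead.
                echo "ip link add name br1 type bridge"
                echo "ip link set dev eth0 master br1"
                echo "ip link set dev eth1 master br1"
        fi
}

plan_bridges 1
```

Piping the output through ``sh`` on the target board would apply the plan,
which requires root privileges and the interfaces named above.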

Offloads
========

Time-aware scheduling
---------------------

The switch supports a variation of the enhancements for scheduled traffic
specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to
ensure deterministic latency for priority traffic that is sent in-band with its
gate-open event in the network schedule.

This capability can be managed through the tc-taprio offload ('flags 2'). The
difference compared to the software implementation of taprio is that the latter
would only be able to shape traffic originated from the CPU, but not
autonomously forwarded flows.

The device has 8 traffic classes, and maps incoming frames to one of them based
on the VLAN PCP bits (if no VLAN is present, the port-based default is used).
As described in the previous sections, depending on the value of
``vlan_filtering``, the EtherType recognized by the switch as being VLAN can
either be the typical 0x8100 or a custom value used internally by the driver
for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone
or bridge mode with ``vlan_filtering=0``, as it will not recognize the 0x8100
EtherType. In these modes, injecting into a particular TX queue can only be
done by the DSA net devices, which populate the PCP field of the tagging header
on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
net devices are no longer able to do that. To inject frames into a hardware TX
queue with VLAN awareness active, it is necessary to create a VLAN
sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
frames towards the switch, with the VLAN PCP bits set appropriately.
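
For reference, the PCP occupies the top 3 bits of the 16-bit 802.1Q Tag
Control Information field (PCP: 3 bits, DEI: 1 bit, VID: 12 bits). The sketch
below shows how the PCP is packed into and extracted from a TCI value; the
helper names are hypothetical, and how the switch maps a given PCP to one of
its 8 traffic classes is not shown here.

```shell
#!/bin/bash
# Extract the 3-bit PCP from a 16-bit 802.1Q TCI (PCP:3 | DEI:1 | VID:12).
vlan_pcp() {
        local tci="$1"

        echo "$(( (tci >> 13) & 0x7 ))"
}

# Build a TCI from a PCP and a VLAN ID (DEI left at 0).
vlan_tci() {
        local pcp="$1"
        local vid="$2"

        printf "0x%04x\n" "$(( ((pcp & 0x7) << 13) | (vid & 0xfff) ))"
}

vlan_tci 5 100      # PCP 5, VID 100: prints 0xa064
vlan_pcp 0xa064     # prints 5
```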

Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
notable exception: the switch always treats it with a fixed priority and
disregards any VLAN PCP bits even if present. The traffic class for management
traffic has a value of 7 (highest priority) at the moment, which is not
configurable in the driver.
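
The two destination MAC ranges named above can be matched with a simple prefix
test. This is a minimal sketch for reasoning about which frames count as
management traffic; the helper name ``is_mgmt_dmac`` is hypothetical and the
check only looks at the first three octets.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a destination MAC falls in one of the two
# ranges named above (01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx).
is_mgmt_dmac() {
        local mac

        # Normalize to lower case with ':' separators.
        mac=$(echo "$1" | tr 'ABCDEF-' 'abcdef:')

        case "${mac}" in
        01:80:c2:*|01:19:1b:*)
                return 0
                ;;
        *)
                return 1
                ;;
        esac
}

if is_mgmt_dmac "01-80-C2-00-00-00"; then
        echo "management traffic: traffic class 7"
fi
```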

Below is an example of configuring a 500 us cyclic schedule on egress port
``swp5``. The traffic class gate for management traffic (7) is open for 100 us,
and the gates for all other traffic classes are open for 400 us::

        #!/bin/bash

        set -e -u -o pipefail

        NSEC_PER_SEC="1000000000"

        # Print a two-digit hex gate mask with one bit set per traffic class.
        gatemask() {
                local tc_list="$1"
                local mask=0

                for tc in ${tc_list}; do
                        mask=$((${mask} | (1 << ${tc})))
                done

                printf "%02x" ${mask}
        }

        if ! systemctl is-active --quiet ptp4l; then
                echo "Please start the ptp4l service" >&2
                exit 1
        fi

        now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
        # Phase-align the base time to the start of the next second.
        sec=$(echo "${now}" | gawk -F. '{ print $1; }')
        base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"

        # One queue per traffic class; priority N maps to traffic class N.
        tc qdisc add dev swp5 parent root handle 100 taprio \
                num_tc 8 \
                map 0 1 2 3 4 5 6 7 \
                queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
                base-time ${base_time} \
                sched-entry S $(gatemask 7) 100000 \
                sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
                flags 2
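
As a quick check of the numbers used in the script, the sketch below reuses
the ``gatemask`` helper and adds a hypothetical ``cycle_time_ns`` helper that
sums the ``sched-entry`` intervals. It does not touch any network device.

```shell
#!/bin/bash
# Same logic as the helper in the script: OR one bit per traffic class.
gatemask() {
        local tc_list="$1"
        local mask=0
        local tc

        for tc in ${tc_list}; do
                mask=$(( mask | (1 << tc) ))
        done

        printf "%02x" "${mask}"
}

# Hypothetical helper: sum the sched-entry intervals (ns) to get the cycle.
cycle_time_ns() {
        local total=0
        local interval

        for interval in "$@"; do
                total=$(( total + interval ))
        done

        echo "${total}"
}

echo "TC 7 only: $(gatemask 7)"                         # 80
echo "TC 0-6:    $(gatemask "0 1 2 3 4 5 6")"           # 7f
echo "cycle:     $(cycle_time_ns 100000 400000) ns"     # 500000 ns = 500 us
```

The two masks are complementary (0x80 and 0x7f), so at any point in the cycle
either the management gate or all of the other gates are open, never both.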

It is possible to apply the tc-taprio offload on multiple egress ports. There
are hardware restrictions related to the fact that no gate event may trigger
simultaneously on two ports. The driver checks the consistency of the schedules
against this restriction and errors out when appropriate. Schedule analysis is
needed to avoid this, which is outside the scope of this document.
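
To give a feel for the restriction, the following sketch checks whether two
cyclic schedules ever fire a gate event at the same instant. It is a
simplified model and not the driver's algorithm: it assumes a gate event
occurs at the start of every schedule entry, that all durations are positive
integers in nanoseconds, and that each port's schedule starts at a
caller-supplied offset (a hypothetical parameter used here only to illustrate
phase shifting between ports).

```shell
#!/bin/bash
# Greatest common divisor of two positive integers.
gcd() {
        local a="$1" b="$2" t

        while [ "${b}" -ne 0 ]; do
                t=$(( a % b ))
                a="${b}"
                b="${t}"
        done
        echo "${a}"
}

# Sum of a space-separated list of entry durations (ns).
cycle_time() {
        local total=0 d

        for d in $1; do
                total=$(( total + d ))
        done
        echo "${total}"
}

# Print the time (ns) of each gate event of one schedule.
# $1: start offset, $2: horizon, $3: space-separated entry durations.
gate_events() {
        local t="$1" horizon="$2" durations="$3" d

        while [ "${t}" -lt "${horizon}" ]; do
                for d in ${durations}; do
                        if [ "${t}" -lt "${horizon}" ]; then
                                echo "${t}"
                        fi
                        t=$(( t + d ))
                done
        done
}

# Succeed (return 0) if the two schedules ever fire a gate event together.
schedules_conflict() {
        local off_a="$1" sched_a="$2" off_b="$3" sched_b="$4"
        local ca cb lcm start dups

        ca=$(cycle_time "${sched_a}")
        cb=$(cycle_time "${sched_b}")
        lcm=$(( ca / $(gcd "${ca}" "${cb}") * cb ))
        # Once both schedules run, the combined pattern repeats every lcm ns.
        start=$(( off_a > off_b ? off_a : off_b ))
        dups=$( { gate_events "${off_a}" $(( start + lcm )) "${sched_a}"
                  gate_events "${off_b}" $(( start + lcm )) "${sched_b}"; } |
                sort -n | uniq -d )
        [ -n "${dups}" ]
}

if schedules_conflict 0 "100000 400000" 50000 "100000 400000"; then
        echo "conflict"
else
        echo "no conflict"
fi
```

With the 500 us schedule from the example on both ports, starting the second
port 50 us later keeps the gate events apart, whereas starting both at the
same instant (or 100 us apart) makes them coincide.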

At the moment, the time-aware scheduler can only be triggered based on a
standalone clock and not based on PTP time. This means the base-time argument
from tc-taprio is ignored and the schedule starts right away. It also means it
is more difficult to phase-align the scheduler with the other devices in the
network.

Device Tree bindings and board design
=====================================

This section references ``Documentation/devicetree/bindings/net/dsa/sja1105.txt``
and aims to showcase some potential switch caveats.

RMII PHY role and out-of-band signaling
---------------------------------------

In the RMII spec, the 50 MHz clock signals are either driven by the MAC or by
an external oscillator (but not by the PHY).
But the spec is rather loose and devices go outside it in several ways.
Some PHYs go against the spec and may provide an output pin where they source
the 50 MHz clock themselves, in an attempt to be helpful.
On the other hand, the SJA1105 is only binary configurable - when in the RMII
MAC role it will also attempt to drive the clock signal. To prevent this from
happening it must be put in RMII PHY role.
But doing so has some unintended consequences.
In the RMII spec, the PHY can transmit extra out-of-band signals via RXD[1:0].
These are practically some extra code words (/J/ and /K/) sent prior to the
preamble of each frame. The MAC does not have this out-of-band signaling
mechanism defined by the RMII spec.
So when the SJA1105 port is put in PHY role to avoid having 2 drivers on the
clock signal, inevitably an RMII PHY-to-PHY connection is created. The SJA1105
emulates a PHY interface fully and generates the /J/ and /K/ symbols prior to
frame preambles, which the real PHY is not expected to understand. So the PHY
simply encodes the extra symbols received from the SJA1105-as-PHY onto the
100Base-Tx wire.
On the other side of the wire, some link partners might discard these extra
symbols, while others might choke on them and discard the entire Ethernet
frames that follow along. This looks like packet loss with some link partners
but not with others.
The take-away is that in RMII mode, the SJA1105 must be allowed to drive the
reference clock if connected to a PHY.

RGMII fixed-link and internal delays
------------------------------------

As mentioned in the bindings document, the second generation of devices has
tunable delay lines as part of the MAC, which can be used to establish the
correct RGMII timing budget.
When powered up, these can shift the Rx and Tx clocks with a phase difference
between 73.8 and 101.7 degrees.
The catch is that the delay lines need to lock onto a clock signal with a
stable frequency. This means that when the clock frequency changes, there must
be at least 2 microseconds of silence between the clock at the old frequency
and the clock at the new one. Otherwise the lock is lost and the delay lines
must be reset (powered down and back up).
In RGMII the clock frequency changes with link speed (125 MHz at 1000 Mbps, 25
MHz at 100 Mbps and 2.5 MHz at 10 Mbps), and link speed might change during the
AN process.
In the situation where the switch port is connected through an RGMII fixed-link
to a link partner whose link state life cycle is outside the control of Linux
(such as a different SoC), then the delay lines would remain unlocked (and
inactive) until there is manual intervention (ifdown/ifup on the switch port).
The take-away is that in RGMII mode, the switch's internal delays are only
reliable if the link partner never changes link speeds, or if it does, it does
so in a way that is coordinated with the switch port (practically, both ends of
the fixed-link are under control of the same Linux system).
As to why a fixed-link interface would ever change link speeds: there are
Ethernet controllers out there which come out of reset in 100 Mbps mode, and
their driver inevitably needs to change the speed and clock frequency if it's
required to work at gigabit.
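
The figures above can be related to each other with a little arithmetic: a
phase shift is a fraction of one clock period, so the same angle corresponds
to a different time delay at each link speed. The sketch below uses integer
math only (tenths of a degree in, picoseconds out); both helper names are
hypothetical.

```shell
#!/bin/bash
# RGMII clock frequency in kHz for a given link speed in Mbps.
rgmii_clock_khz() {
        case "$1" in
        1000) echo 125000 ;;
        100)  echo 25000 ;;
        10)   echo 2500 ;;
        *)    return 1 ;;
        esac
}

# Convert a phase shift, given in tenths of a degree, to picoseconds at the
# given link speed. One clock period (360 degrees) lasts 10^9 / kHz ps.
phase_to_ps() {
        local tenths_deg="$1"
        local speed="$2"
        local khz period_ps

        khz=$(rgmii_clock_khz "${speed}") || return 1
        period_ps=$(( 1000000000 / khz ))
        echo "$(( tenths_deg * period_ps / 3600 ))"
}

phase_to_ps 738 1000    # 73.8 degrees at 125 MHz: prints 1640
phase_to_ps 1017 1000   # 101.7 degrees at 125 MHz: prints 2260
```

At gigabit speed the quoted phase range therefore corresponds to a delay of
roughly 1.64 ns to 2.26 ns.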

MDIO bus and PHY management
---------------------------

The SJA1105 does not have an MDIO bus and does not perform in-band AN either.
Therefore there is no link state notification coming from the switch device.
A board would need to hook up the PHYs connected to the switch to any other
MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
bus). Link state management then works by the driver manually keeping in sync
(over SPI commands) the MAC link speed with the settings negotiated by the PHY.