# Hafnium architecture

The purpose of Hafnium is to provide memory isolation between a set of security
domains, to better separate untrusted code from security-critical code. It is
implemented as a type-1 hypervisor, where each security domain is a VM.

On AArch64 (currently the only supported architecture) it runs at EL2, while the
VMs it manages run at EL1 (and user space applications within those VMs at EL0).
A Secure Monitor such as
[Trusted Firmware-A](https://www.trustedfirmware.org/about/) runs underneath it
at EL3.

Hafnium provides memory isolation between these VMs by managing their stage 2
page tables, and using IOMMUs to restrict how DMA devices can be used to access
memory. It must also prevent them from accessing system resources in a way which
would allow them to escape this containment. It also provides:

*   Means for VMs to communicate with each other through message passing and
    memory sharing, according to the
    [Arm Firmware Framework for Arm A-profile (FF-A)](https://developer.arm.com/documentation/den0077/latest/).
*   Emulation of some basic hardware features such as timers.
*   A simple paravirtualised interrupt controller for secondary VMs, as they
    don't have access to hardware interrupts.
*   A simple logging API for bringup and low-level debugging of VMs.

See the [VM interface](VmInterface.md) documentation for more details.
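
As a rough illustration of the message-passing model (and not Hafnium's actual
API, which is described in the documents above), the sketch below shows how a
guest might send a message: it copies the payload into its one page TX buffer
and then asks the hypervisor to deliver it. `hyp_msg_send`, `send_to_vm` and
`HF_PAGE_SIZE` are hypothetical names made up for this example.

```c
#include <stdint.h>
#include <string.h>

#define HF_PAGE_SIZE 4096 /* Assumed page size; TX/RX buffers are one page each. */

/* The VM's one page TX buffer, set up with the hypervisor at boot. */
static uint8_t tx_buffer[HF_PAGE_SIZE];

/*
 * Placeholder for the real send hypercall: a guest would trap to EL2 here
 * (e.g. via an HVC/SMC conduit) so that Hafnium delivers the TX buffer
 * contents to the receiver VM's RX buffer.
 */
static int64_t hyp_msg_send(uint16_t receiver_vm_id, uint32_t size)
{
	(void)receiver_vm_id;
	(void)size;
	return 0;
}

/* Copy a payload into the TX buffer and ask the hypervisor to deliver it. */
int64_t send_to_vm(uint16_t receiver_vm_id, const void *payload, uint32_t size)
{
	if (size > HF_PAGE_SIZE) {
		return -1; /* A message must fit in the one page TX buffer. */
	}
	memcpy(tx_buffer, payload, size);
	return hyp_msg_send(receiver_vm_id, size);
}
```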

Hafnium makes a distinction between a **primary VM**, which would typically run
the main user-facing operating system such as Android, and a number of
**secondary VMs** which are smaller and exist to provide various services to the
primary VM. The primary VM typically owns the majority of the system resources,
and is likely to be more latency-sensitive as it is running user-facing tasks.
Some of the differences between primary and secondary VMs are explained below.

[TOC]

## Security model

Hafnium runs a set of VMs without trusting any of them. Neither do the VMs trust
each other. Hafnium aims to prevent malicious software running in one VM from
compromising any of the other VMs. Specifically, we guarantee
**confidentiality** and **memory integrity** of each VM: no other VM should be
able to read or modify the memory that belongs to a VM without that VM's
consent.

We do not make any guarantees of **availability** of VMs, except for the primary
VM. In other words, a compromised primary VM may prevent secondary VMs from
running, but not gain unauthorised access to their memory. A compromised
secondary VM should not be able to prevent the primary VM or other secondary VMs
from running.

## Design principles

Hafnium is designed with the following principles in mind:

*   Open design
    *   Hafnium is developed as open source, available for all to use,
        contribute and scrutinise.
*   Economy of mechanism
    *   Hafnium strives to be as small and simple as possible, to reduce the
        attack surface.
    *   This also makes Hafnium more amenable to formal verification.
*   Least privilege
    *   Each VM is a separate security domain and is given access only to what
        it needs, to reduce the impact if it is compromised.
    *   Everything that doesn't strictly need to be part of Hafnium itself (in
        EL2) should be moved to a VM (in EL1).
*   Defence in depth
    *   Hafnium provides an extra layer of security isolation on top of those
        provided by the OS kernel, to better isolate sensitive workloads from
        untrusted code.

## VM model

A [VM](../inc/hf/vm.h) in Hafnium consists of:

*   A set of memory pages owned by and/or available to the VM, stored in the
    stage 2 page table managed by Hafnium.
*   One or more vCPUs. (The primary VM always has the same number of vCPUs as
    the system has physical CPUs; secondary VMs have a configurable number.)
*   A one page TX buffer used for sending messages to other VMs.
*   A one page RX buffer used for receiving messages from other VMs.
*   Some configuration information (VM ID, whitelist of allowed SMCs).
*   Some internal state maintained by Hafnium (locks, mailbox wait lists,
    mailbox state, log buffer).

Each [vCPU](../inc/hf/vcpu.h) also has:

*   A set of saved registers, for when it isn't being run on a physical CPU.
*   A current state (switched off, ready to run, running, waiting for a message
    or interrupt, aborted).
*   A set of virtual interrupts which may be enabled and/or pending.
*   Some internal locking state.

VMs and their vCPUs are configured statically from a [manifest](Manifest.md)
read at boot time. There is no way to create or destroy VMs at run time.
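
For orientation, the sketch below models the state listed above as C
structures. It is a simplified illustration rather than the actual definitions
in the headers linked above; the field names, types and the `MAX_VCPUS` bound
are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096 /* Assumed 4 KiB granule. */
#define MAX_VCPUS 8    /* Arbitrary bound chosen for the sketch. */

/* Possible execution states of a vCPU (simplified). */
enum vcpu_state {
	VCPU_STATE_OFF,
	VCPU_STATE_READY,
	VCPU_STATE_RUNNING,
	VCPU_STATE_WAITING, /* Blocked on a message or interrupt. */
	VCPU_STATE_ABORTED,
};

/* Per-vCPU state kept by the hypervisor (simplified). */
struct vcpu {
	uint64_t saved_regs[32];     /* Registers saved while not on a physical CPU. */
	enum vcpu_state state;
	uint64_t interrupts_enabled; /* Bitmask of enabled virtual interrupts. */
	uint64_t interrupts_pending; /* Bitmask of pending virtual interrupts. */
	bool locked;                 /* Stand-in for internal locking state. */
};

/* Per-VM state kept by the hypervisor (simplified). */
struct vm {
	uint16_t id;                  /* VM ID from the manifest. */
	void *stage2_table_root;      /* Root of the stage 2 page table. */
	uint8_t tx_buffer[PAGE_SIZE]; /* One page mailbox for outgoing messages. */
	uint8_t rx_buffer[PAGE_SIZE]; /* One page mailbox for incoming messages. */
	uint32_t vcpu_count;
	struct vcpu vcpus[MAX_VCPUS];
	uint64_t allowed_smcs[16];    /* Whitelist of permitted SMC function IDs. */
};
```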

## System resources

### CPU

Unlike many other type-1 hypervisors, Hafnium does not include a scheduler.
Instead, we rely on the primary VM to handle scheduling, calling Hafnium when it
wants to run a secondary VM's vCPU. This is because:

*   In line with our design principles of _economy of mechanism_ and _least
    privilege_, we prefer to avoid complexity in Hafnium and instead rely on VMs
    to handle complex tasks.
*   According to our security model, we don't guarantee availability of
    secondary VMs, so it is acceptable for a compromised primary VM to deny CPU
    time to secondary VMs.
*   A lot of effort has been put into making the Linux scheduler work well to
    maintain a responsive user experience without jank, manage power
    efficiently, and handle heterogeneous CPU architectures such as big.LITTLE.
    We would rather avoid re-implementing this.

Hafnium therefore maintains a 1:1 mapping of physical CPUs to vCPUs for the
primary VM, and allows the primary VM to control the power state of physical
CPUs directly through the standard Arm Power State Coordination Interface
(PSCI). The primary VM should then create kernel threads for each secondary VM
vCPU and schedule them to run the vCPUs according to the
[interface expectations defined by Hafnium](SchedulerExpectations.md). PSCI
calls made by secondary VMs are handled by Hafnium, to change the state of the
VM's vCPUs. In the case of (Android) Linux running in the primary VM this is
handled by the Hafnium kernel module.
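
The sketch below shows roughly what such a per-vCPU kernel thread in the
primary VM might look like. `hyp_vcpu_run`, its return values and the
scheduling helpers are hypothetical placeholders rather than the real interface
(see [SchedulerExpectations.md](SchedulerExpectations.md)); the point is that
the primary VM's scheduler decides when this thread runs, and whenever it does
run it simply hands the physical CPU to the secondary vCPU via Hafnium.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reasons why control returned to the primary VM. */
enum run_result {
	RUN_PREEMPTED,  /* Quantum expired; run the vCPU again later. */
	RUN_WAITING,    /* vCPU is blocked on a message or interrupt. */
	RUN_TURNED_OFF, /* vCPU was turned off, e.g. via PSCI. */
};

/* Hypothetical hypercall wrapper: ask Hafnium to run one vCPU of a VM. */
extern enum run_result hyp_vcpu_run(uint16_t vm_id, uint16_t vcpu_index);

/* Placeholders for the primary VM kernel's own scheduling primitives. */
extern void yield_to_scheduler(void);
extern void block_until_vcpu_runnable(uint16_t vm_id, uint16_t vcpu_index);

/* Body of the kernel thread created for one secondary VM vCPU. */
void vcpu_thread(uint16_t vm_id, uint16_t vcpu_index)
{
	bool running = true;

	while (running) {
		switch (hyp_vcpu_run(vm_id, vcpu_index)) {
		case RUN_PREEMPTED:
			/*
			 * A physical interrupt brought us back to the primary
			 * VM; let its scheduler pick the next thread to run.
			 */
			yield_to_scheduler();
			break;
		case RUN_WAITING:
			block_until_vcpu_runnable(vm_id, vcpu_index);
			break;
		case RUN_TURNED_OFF:
			running = false;
			break;
		}
	}
}
```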

#### Example

Consider a simple system with a single physical CPU and a single secondary VM
with one vCPU, where the primary VM kernel has created **thread 1** to run the
secondary VM's vCPU while **thread 2** is some other normal thread:

![](scheduling.png)

1.  Scheduler chooses thread 1 to run.
2.  Scheduler runs thread 1, and configures a physical timer to expire once the
    quantum runs out.
3.  Thread 1 is responsible for running a vCPU, so it asks Hafnium to run it.
4.  Hafnium switches to the secondary VM vCPU.
5.  Eventually the quantum runs out and the physical timer interrupts the CPU.
6.  Hafnium traps the interrupt. Physical interrupts are owned by the primary
    VM, so it switches back to the primary VM.
7.  The interrupt handler in the primary VM gets invoked, and calls the
    scheduler.
8.  Scheduler chooses a different thread to run (thread 2).
9.  Scheduler runs thread 2.

### Memory

At boot time each VM owns a mutually exclusive subset of memory pages, as
configured by the [manifest](Manifest.md). These pages are all identity mapped
in the stage 2 page table which Hafnium manages for the VM, so that it has full
access to use them however it wishes.

Hafnium keeps track of which VM **owns** each page, and which VMs have
**access** to it. It does this using the stage 2 page tables of the VMs, with
some extra application-defined bits in the page table entries. A VM may share,
lend or donate memory pages to another VM using the appropriate FF-A requests. A
given page of memory may never be shared with more than two VMs, either in terms
of ownership or access. Thus, the following states are possible for each page,
for some values of X and Y:

*   Owned by VM X, accessible only by VM X
    *   This is the initial state for each page, and also the state of a page
        that has been donated.
*   Owned by VM X, accessible only by VM Y
    *   This state is reached when a page is lent.
*   Owned by VM X, accessible by VMs X and Y
    *   This state is reached when a page is shared.

For now, in the interests of simplicity, Hafnium always uses identity mapping in
all page tables it manages (stage 2 page tables for VMs, and stage 1 for itself)
– i.e. the IPA (intermediate physical address) is always equal to the PA
(physical address) in the stage 2 page table, if it is mapped at all.
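
To make the ownership and access rules above concrete, the sketch below models
them as a small state machine. It is purely illustrative: the names are
invented for this example, and the real implementation tracks this state in
stage 2 page table entry bits rather than in a separate structure.

```c
#include <stdint.h>

/* Possible ownership/access states of a page, from the owner VM's view. */
enum page_state {
	PAGE_OWNED_EXCLUSIVE, /* Owned by X, accessible only by X. */
	PAGE_LENT,            /* Owned by X, accessible only by Y. */
	PAGE_SHARED,          /* Owned by X, accessible by both X and Y. */
};

/* Illustrative per-page record. */
struct page_record {
	uint16_t owner_vm;
	uint16_t borrower_vm; /* Only meaningful when lent or shared. */
	enum page_state state;
};

/* Lend: the owner gives up access and the borrower gains it. */
static int page_lend(struct page_record *page, uint16_t borrower)
{
	if (page->state != PAGE_OWNED_EXCLUSIVE) {
		return -1; /* At most two VMs may ever be involved with a page. */
	}
	page->borrower_vm = borrower;
	page->state = PAGE_LENT;
	return 0;
}

/* Share: both the owner and the borrower can access the page. */
static int page_share(struct page_record *page, uint16_t borrower)
{
	if (page->state != PAGE_OWNED_EXCLUSIVE) {
		return -1;
	}
	page->borrower_vm = borrower;
	page->state = PAGE_SHARED;
	return 0;
}

/* Donate: ownership moves to the receiver and the page is exclusive again. */
static int page_donate(struct page_record *page, uint16_t receiver)
{
	if (page->state != PAGE_OWNED_EXCLUSIVE) {
		return -1;
	}
	page->owner_vm = receiver;
	page->borrower_vm = 0;
	page->state = PAGE_OWNED_EXCLUSIVE;
	return 0;
}
```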

### Devices

From Hafnium's point of view a device consists of:

*   An MMIO address range (i.e. a set of pages).
*   A set of interrupts that the device may generate.
*   Some IOMMU configuration associated with the device.

For now, each device is associated with exactly one VM, which is statically
assigned at boot time (through the manifest) and cannot be changed at runtime.
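
As a rough sketch of the per-device information this implies (the field names
and types are assumptions made for illustration, not Hafnium's actual manifest
format or internal representation):

```c
#include <stdint.h>

/* One contiguous MMIO region belonging to a device (illustrative). */
struct mmio_range {
	uint64_t base_pa; /* Physical base address, page aligned. */
	uint64_t size;    /* Size in bytes, a multiple of the page size. */
};

/* Illustrative static description of a device, fixed at boot time. */
struct device {
	struct mmio_range mmio;   /* Mapped into the owner's stage 2 page table. */
	uint32_t interrupts[4];   /* Interrupt IDs the device may generate. */
	uint32_t interrupt_count;
	uint32_t iommu_stream_id; /* Used to program the IOMMU for the device's DMA. */
	uint16_t owner_vm_id;     /* The single VM the device is assigned to. */
};
```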

Hafnium is responsible for mapping the device's MMIO pages into the owning VM's
stage 2 page table with the appropriate attributes, and for configuring the
IOMMU so that the device can only access the memory that is accessible by its
owning VM. This needs to be kept in sync as the VM's memory access changes with
memory sharing operations. Hafnium may also need to re-initialise the IOMMU if
the device is powered off and powered on again.

If a device is owned by a secondary VM, the primary VM is responsible for
forwarding its interrupts to that VM. This does mean that a compromised primary
VM may choose not to forward interrupts, or to inject spurious interrupts, but
this is consistent with our security model that secondary VMs are not guaranteed
any level of availability.