Document Hafnium architecture and code structure.

Bug: 123318812
Change-Id: If78d6c2ff4ab95f8dbbe5ce38c727b45ec201a03

commit: 32731a685130beb0861a87781654a41789a2d27e
author: Andrew Walbran <qwandor@google.com> Wed Nov 13 18:04:31 2019 +0000
committer: Andrew Walbran <qwandor@google.com> Wed Nov 20 11:26:03 2019 +0000
tree: 327466d0f42224250594ef0a6c57d0a4a3bb244d
parent: 67f88e8e3e974f37aba082e2e52a698a6fcb2b5d
diff --git a/README.md b/README.md
index eed18d8..5ee74bf 100644
--- a/README.md
+++ b/README.md

@@ -17,6 +17,8 @@
 
 More documentation is available on:
 
+*   [Hafnium architecture](docs/Architecture.md)
+*   [Code structure](docs/CodeStructure.md)
 *   [Hafnium test infrastructure](docs/Testing.md)
 *   [Running Hafnium under the Arm Fixed Virtual Platform](docs/FVP.md)
 *   [How to build a RAM disk containing VMs for Hafnium to run](docs/HafniumRamDisk.md)

diff --git a/docs/Architecture.md b/docs/Architecture.md
new file mode 100644
index 0000000..461d51f
--- /dev/null
+++ b/docs/Architecture.md

@@ -0,0 +1,199 @@
+# Hafnium architecture
+
+The purpose of Hafnium is to provide memory isolation between a set of security
+domains, to better separate untrusted code from security-critical code. It is
+implemented as a type-1 hypervisor, where each security domain is a VM.
+
+On AArch64 (currently the only supported architecture) it runs at EL2, while the
+VMs it manages run at EL1 (and user space applications within those VMs at EL0).
+A Secure Monitor such as
+[Trusted Firmware-A](https://www.trustedfirmware.org/about/) runs underneath it
+at EL3.
+
+Hafnium provides memory isolation between these VMs by managing their stage 2
+page tables, and using IOMMUs to restrict how DMA devices can be used to access
+memory. It must also prevent them from accessing system resources in a way which
+would allow them to escape this containment. It also provides:
+
+*   Means for VMs to communicate with each other through message passing and
+    memory sharing, according to the Arm
+    [Secure Partition Communication Interface (SPCI)](https://developer.arm.com/docs/den0077/a).
+*   Emulation of some basic hardware features such as timers.
+*   A simple paravirtualised interrupt controller for secondary VMs, as they
+    don't have access to hardware interrupts.
+*   A simple logging API for bringup and low-level debugging of VMs.
+
+See the [VM interface](VmInterface.md) documentation for more details.
+
+Hafnium makes a distinction between a **primary VM**, which would typically run
+the main user-facing operating system such as Android, and a number of
+**secondary VMs** which are smaller and exist to provide various services to the
+primary VM. The primary VM typically owns the majority of the system resources,
+and is likely to be more latency-sensitive as it is running user-facing tasks.
+Some of the differences between primary and secondary VMs are explained below.
+
+[TOC]
+
+## Security model
+
+Hafnium runs a set of VMs without trusting any of them. Neither do the VMs trust
+each other. Hafnium aims to prevent malicious software running in one VM from
+compromising any of the other VMs. Specifically, we guarantee
+**confidentiality** and **memory integrity** of each VM: no other VM should be
+able to read or modify the memory that belongs to a VM without that VM's
+consent.
+
+We do not make any guarantees of **availability** of VMs, except for the primary
+VM. In other words, a compromised primary VM may prevent secondary VMs from
+running, but not gain unauthorised access to their memory. A compromised
+secondary VM should not be able to prevent the primary VM or other secondary VMs
+from running.
+
+## Design principles
+
+Hafnium is designed with the following principles in mind:
+
+*   Open design
+    *   Hafnium is developed as open source, available for all to use,
+        contribute to and scrutinise.
+*   Economy of mechanism
+    *   Hafnium strives to be as small and simple as possible, to reduce the
+        attack surface.
+    *   This also makes Hafnium more amenable to formal verification.
+*   Least privilege
+    *   Each VM is a separate security domain and is given access only to what
+        it needs, to reduce the impact if it is compromised.
+    *   Everything that doesn't strictly need to be part of Hafnium itself (in
+        EL2) should be moved to a VM (in EL1).
+*   Defence in depth
+    *   Hafnium provides an extra layer of security isolation on top of those
+        provided by the OS kernel, to better isolate sensitive workloads from
+        untrusted code.
+
+## VM model
+
+A [VM](../inc/hf/vm.h) in Hafnium consists of:
+
+*   A set of memory pages owned by and/or available to the VM, stored in the
+    stage 2 page table managed by Hafnium.
+*   One or more vCPUs. (The primary VM always has the same number of vCPUs as
+    the system has physical CPUs; secondary VMs have a configurable number.)
+*   A one page TX buffer used for sending messages to other VMs.
+*   A one page RX buffer used for receiving messages from other VMs.
+*   Some configuration information (VM ID, whitelist of allowed SMCs).
+*   Some internal state maintained by Hafnium (locks, mailbox wait lists,
+    mailbox state, log buffer).
+
+Each [vCPU](../inc/hf/cpu.h) also has:
+
+*   A set of saved registers, for when it isn't being run on a physical CPU.
+*   A current state (switched off, ready to run, running, waiting for a message
+    or interrupt, aborted).
+*   A set of virtual interrupts which may be enabled and/or pending.
+*   Some internal locking state.
+
+VMs and their vCPUs are configured statically from a [manifest](Manifest.md)
+read at boot time. There is no way to create or destroy VMs at run time.
+
+## System resources
+
+### CPU
+
+Unlike many other type-1 hypervisors, Hafnium does not include a scheduler.
+Instead, we rely on the primary VM to handle scheduling, calling Hafnium when it
+wants to run a secondary VM's vCPU. This is because:
+
+*   In line with our design principles of _economy of mechanism_ and _least
+    privilege_, we prefer to avoid complexity in Hafnium and instead rely on VMs
+    to handle complex tasks.
+*   According to our security model, we don't guarantee availability of
+    secondary VMs, so it is acceptable for a compromised primary VM to deny CPU
+    time to secondary VMs.
+*   A lot of effort has been put into making the Linux scheduler work well to
+    maintain a responsive user experience without jank, manage power
+    efficiently, and handle heterogeneous CPU architectures such as big.LITTLE.
+    We would rather avoid re-implementing this.
+
+Hafnium therefore maintains a 1:1 mapping of physical CPUs to vCPUs for the
+primary VM, and allows the primary VM to control the power state of physical
+CPUs directly through the standard Arm Power State Coordination Interface
+(PSCI). The primary VM should then create kernel threads for each secondary VM
+vCPU and schedule them to run the vCPUs according to the
+[interface expectations defined by Hafnium](SchedulerExpectations.md). PSCI
+calls made by secondary VMs are handled by Hafnium, to change the state of the
+VM's vCPUs. In the case of (Android) Linux running in the primary VM this is
+handled by the Hafnium kernel module.
+
+#### Example
+
+Consider a simple system with a single physical CPU and a single secondary
+VM with one vCPU. The primary VM kernel has created **thread 1** to run the
+secondary VM's vCPU, while **thread 2** is some other normal thread. The
+sequence is as follows:
+
+![scheduler example sequence diagram](scheduler.png)
+
+1.  Scheduler chooses thread 1 to run.
+2.  Scheduler runs thread 1, and configures a physical timer to expire once the
+    quantum runs out.
+3.  Thread 1 is responsible for running a vCPU, so it asks Hafnium to run it.
+4.  Hafnium switches to the secondary VM vCPU.
+5.  Eventually the quantum runs out and the physical timer interrupts the CPU.
+6.  Hafnium traps the interrupt. Physical interrupts are owned by the primary
+    VM, so it switches back to the primary VM.
+7.  The interrupt handler in the primary VM gets invoked, and calls the
+    scheduler.
+8.  Scheduler chooses a different thread to run (thread 2).
+9.  Scheduler runs thread 2.
+
+### Memory
+
+At boot time each VM owns a mutually exclusive subset of memory pages, as
+configured by the [manifest](Manifest.md). These pages are all identity mapped
+in the stage 2 page table which Hafnium manages for the VM, so that it has full
+access to use them however it wishes.
+
+Hafnium keeps track of which VM **owns** each page, and which VMs have
+**access** to it. It does this using the stage 2 page tables of the VMs, with
+some extra application-defined bits in the page table entries. A VM may share,
+lend or donate memory pages to another VM using the appropriate SPCI requests. A
+given page of memory may never involve more than two VMs at once, whether as
+owner or as a VM with access. Thus, the following states are possible for each
+page, for some values of X and Y:
+
+*   Owned by VM X, accessible only by VM X
+    *   This is the initial state for each page, and also the state of a page
+        that has been donated.
+*   Owned by VM X, accessible only by VM Y
+    *   This state is reached when a page is lent.
+*   Owned by VM X, accessible by VMs X and Y
+    *   This state is reached when a page is shared.
+
+For now, in the interests of simplicity, Hafnium always uses identity mapping in
+all page tables it manages (stage 2 page tables for VMs, and stage 1 for itself)
+– i.e. the IPA (intermediate physical address) is always equal to the PA
+(physical address) in the stage 2 page table, if it is mapped at all.
+
+### Devices
+
+From Hafnium's point of view a device consists of:
+
+*   An MMIO address range (i.e. a set of pages).
+*   A set of interrupts that the device may generate.
+*   Some IOMMU configuration associated with the device.
+
+For now, each device is associated with exactly one VM, which is statically
+assigned at boot time (through the manifest) and cannot be changed at runtime.
+
+Hafnium is responsible for mapping the device's MMIO pages into the owning VM's
+stage 2 page table with the appropriate attributes, and for configuring the
+IOMMU so that the device can only access the memory that is accessible by its
+owning VM. This needs to be kept in sync as the VM's memory access changes with
+memory sharing operations. Hafnium may also need to re-initialise the IOMMU if
+the device is powered off and powered on again.
+
+The primary VM is responsible for forwarding interrupts to the owning VM if
+the device is owned by a secondary VM. This does mean that a compromised
+primary VM may choose not to forward interrupts, or to inject spurious
+interrupts, but this is consistent with our security model that secondary VMs
+are not guaranteed any level of availability.
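
As a reading aid for the Memory section of `docs/Architecture.md` above, the three page states (exclusive, lent, shared) can be modelled as a small state machine. This is a minimal sketch in C, not Hafnium's implementation: Hafnium keeps this state in the stage 2 page table entries, whereas the sketch uses a plain struct. The names `page_state`, `page_share`, `page_lend`, `page_donate` and `VM_NONE` are hypothetical, and the rule that a transfer is only allowed from the exclusive state is an assumption made here to enforce the two-VM limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel meaning "no VM"; hypothetical, chosen for this sketch only. */
#define VM_NONE UINT16_MAX

/* Ownership and access state of one page, involving at most two VMs. */
struct page_state {
	uint16_t owner;        /* VM that owns the page */
	bool owner_has_access; /* false only while the page is lent */
	uint16_t borrower;     /* second VM with access, or VM_NONE */
};

/* Initial state: owned by and accessible only to `owner`. */
struct page_state page_init(uint16_t owner)
{
	struct page_state p = {owner, true, VM_NONE};

	assert(owner != VM_NONE);
	return p;
}

/* Returns true if `vm` can currently access the page. */
bool page_can_access(const struct page_state *p, uint16_t vm)
{
	if (vm == VM_NONE) {
		return false;
	}
	return (vm == p->owner && p->owner_has_access) || vm == p->borrower;
}

/*
 * Assumption of this sketch: only the owner may transfer a page, only while
 * it has exclusive access, and only to a different VM.
 */
static bool transfer_allowed(const struct page_state *p, uint16_t from,
			     uint16_t to)
{
	return from == p->owner && p->owner_has_access &&
	       p->borrower == VM_NONE && to != from && to != VM_NONE;
}

/* Share: owned by `from`, accessible by both `from` and `to`. */
bool page_share(struct page_state *p, uint16_t from, uint16_t to)
{
	if (!transfer_allowed(p, from, to)) {
		return false;
	}
	p->borrower = to;
	return true;
}

/* Lend: still owned by `from`, but accessible only by `to`. */
bool page_lend(struct page_state *p, uint16_t from, uint16_t to)
{
	if (!transfer_allowed(p, from, to)) {
		return false;
	}
	p->owner_has_access = false;
	p->borrower = to;
	return true;
}

/* Donate: ownership moves to `to`, which then has exclusive access. */
bool page_donate(struct page_state *p, uint16_t from, uint16_t to)
{
	if (!transfer_allowed(p, from, to)) {
		return false;
	}
	p->owner = to;
	return true;
}
```

For instance, after `page_lend(&p, 1, 2)` on a page initially owned by VM 1, VM 1 remains the owner but only VM 2 can access the page. A subsequent `page_share(&p, 1, 3)` is rejected in this sketch, because it would involve a third VM.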

diff --git a/docs/CodeStructure.md b/docs/CodeStructure.md
new file mode 100644
index 0000000..b3fa8d6
--- /dev/null
+++ b/docs/CodeStructure.md

@@ -0,0 +1,75 @@
+# Code structure
+
+The Hafnium repository contains Hafnium itself, along with unit tests and
+integration tests, a small client library for VMs, a Linux kernel module for the
+primary VM, and prebuilt binaries of tools needed for building it and running
+tests. Everything is built with [GN](https://gn.googlesource.com/gn/).
+
+Hafnium can be built for an **architecture**, currently including:
+
+*   `aarch64`: 64-bit Armv8
+*   `fake`: A dummy architecture used for running unit tests on the host system.
+
+And for a **platform**, such as:
+
+*   `aem_v8a_fvp`: The Arm [Fixed Virtual Platform](FVP.md) emulator.
+*   `qemu_aarch64`: QEMU emulating an AArch64 device.
+*   `rpi4`: A Raspberry Pi 4 board.
+
+Each platform has a single associated architecture.
+
+The source tree is organised as follows:
+
+*   [`build`](../build): Common GN configuration, build scripts, and linker
+    script.
+*   [`docs`](.): Documentation
+*   [`driver/linux`](../driver/linux): Linux kernel driver for Hafnium, for use
+    in the primary VM.
+*   [`inc`](../inc): Header files...
+    *   [`hf`](../inc/hf): ... internal to Hafnium
+        *   [`arch`](../inc/hf/arch): Architecture-dependent modules, which have
+            a common interface but separate implementations per architecture.
+            This includes details of CPU initialisation, exception handling,
+            timers, page table management, and other system registers.
+        *   [`plat`](../inc/hf/plat): Platform-dependent modules, which have a
+            common interface but separate implementations per platform. This
+            includes details of the boot flow, and a UART driver for the debug
+            log console.
+    *   [`system`](../inc/system): ... which are included by the `stdatomic.h`
+        that we use from Android Clang but are not really needed, so we use
+        dummy empty versions.
+    *   [`vmapi/hf`](../inc/vmapi/hf): ... for the interface exposed to VMs.
+*   [`kokoro`](../kokoro): Scripts and configuration for continuous integration
+    and presubmit checks.
+*   [`prebuilts`](../prebuilts): Prebuilt binaries needed for building Hafnium
+    or running tests.
+*   [`project`](../project): Configuration and extra code for each **project**.
+    A project is a set of one or more _platforms_ (see above) that are built
+    together. Hafnium comes with the [`reference`](../project/reference) project
+    for running it on some common emulators and development boards. To port
+    Hafnium to a new board, you can create a new project under this directory
+    with the platform or platforms you want to add, without affecting the core
+    Hafnium code.
+*   [`src`](../src): Source code for Hafnium itself in C and assembly, and
+    [unit tests](Testing.md) in C++.
+    *   [`arch`](../src/arch): Implementation of architecture-dependent modules.
+*   [`test`](../test): [Integration tests](Testing.md)
+    *   [`arch`](../test/arch): Tests for components of Hafnium that need to be
+        run on a real architecture.
+    *   [`hftest`](../test/hftest): A simple test framework that supports
+        running tests standalone on bare metal, in VMs under Hafnium, or as
+        user-space binaries under Linux under Hafnium.
+    *   [`linux`](../test/linux): Tests which are run in a Linux VM under
+        Hafnium.
+    *   [`vmapi`](../test/vmapi): Tests which are run in minimal test VMs under
+        Hafnium.
+        *   [`arch`](../test/vmapi/arch): Tests which rely on specific
+            architectural details such as the GIC version.
+        *   [`primary_only`](../test/vmapi/primary_only): Tests which run only a
+            single (primary) VM.
+        *   [`primary_with_secondaries`](../test/vmapi/primary_with_secondaries):
+            Tests which run with a primary VM and one or more secondary VMs to
+            test how they interact.
+*   [`third_party`](../third_party): Third party code needed for building
+    Hafnium.
+*   [`vmlib`](../vmlib): A small client library for VMs running under Hafnium.

diff --git a/docs/scheduler.png b/docs/scheduler.png
new file mode 100644
index 0000000..afce871
--- /dev/null
+++ b/docs/scheduler.png
Binary files differ
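
The sequence depicted in `docs/scheduler.png` (steps 1 to 9 of the scheduling example in `docs/Architecture.md`) can be traced with a small simulation. This is a hedged sketch in C rather than the real interface: `hyp_run_vcpu`, `sched_switch_to`, `run_example` and the tick-based timer are hypothetical stand-ins for the hypervisor call, the primary VM's scheduler and the physical timer, none of which are named in the document.

```c
#include <assert.h>

/* What the single physical CPU is executing at a given moment. */
enum context {
	CTX_THREAD_1,       /* primary VM thread that runs the secondary vCPU */
	CTX_THREAD_2,       /* some other primary VM thread */
	CTX_SECONDARY_VCPU, /* the secondary VM's vCPU */
};

/* Why the hypothetical "run vCPU" call returned to the primary VM. */
enum run_exit { RUN_EXIT_PREEMPTED };

struct cpu_sim {
	enum context current; /* context currently on the physical CPU */
	int ticks_left;       /* remaining quantum; 0 means timer not armed */
};

/* Steps 1-2 and 8-9: the scheduler picks a thread and arms the timer. */
void sched_switch_to(struct cpu_sim *cpu, enum context thread, int quantum)
{
	assert(thread != CTX_SECONDARY_VCPU); /* scheduler only sees threads */
	assert(quantum > 0);
	cpu->current = thread;
	cpu->ticks_left = quantum;
}

/* Steps 4-6: the hypervisor runs the vCPU until the timer interrupts. */
enum run_exit hyp_run_vcpu(struct cpu_sim *cpu)
{
	enum context caller = cpu->current;

	assert(cpu->ticks_left > 0); /* the scheduler must have armed the timer */
	cpu->current = CTX_SECONDARY_VCPU; /* step 4 */
	while (cpu->ticks_left > 0) {
		cpu->ticks_left--; /* the secondary VM uses up the quantum */
	}
	/* Steps 5-6: the interrupt is trapped and the primary VM resumes. */
	cpu->current = caller;
	return RUN_EXIT_PREEMPTED;
}

/* Runs the nine-step sequence and returns the context left on the CPU. */
enum context run_example(struct cpu_sim *cpu)
{
	sched_switch_to(cpu, CTX_THREAD_1, 3); /* steps 1-2 */
	if (hyp_run_vcpu(cpu) == RUN_EXIT_PREEMPTED) { /* steps 3-6 */
		/* Step 7: the primary VM's interrupt handler calls the scheduler. */
		sched_switch_to(cpu, CTX_THREAD_2, 3); /* steps 8-9 */
	}
	return cpu->current;
}
```

The point the sketch tries to capture is that the hypervisor never chooses what runs next: it only switches to the vCPU when asked and hands control back to the primary VM when the physical interrupt arrives.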