##############
Virtualization
##############
OP-TEE has experimental virtualization support, meaning that one OP-TEE
instance can run TAs from multiple virtual machines. OP-TEE isolates all
VM-related state, so one VM cannot affect another in any way.

With virtualization support enabled, OP-TEE relies on a hypervisor, because
only the hypervisor knows which VM is calling OP-TEE. Naturally, the
hypervisor should also inform OP-TEE about the creation and destruction of
VMs. Besides, in almost all cases the hypervisor enables two-stage MMU
translation, so VMs do not see the real physical addresses of memory; instead
they work with intermediate physical addresses (IPAs). OP-TEE, on the other
hand, cannot translate an IPA to a PA, so it is the hypervisor's
responsibility to do this kind of translation. Thus, the hypervisor should
include a component that knows about OP-TEE protocol internals and can do this
translation. We call this component the "TEE mediator" and right now only the
Xen hypervisor has an OP-TEE mediator.

Configuration
*************
Virtualization support is enabled with the ``CFG_VIRTUALIZATION`` configuration
option. When this option is enabled, OP-TEE will **not** work without a
compatible hypervisor. This is because the hypervisor must send the
``OPTEE_SMC_VM_CREATED`` SMC with a VM ID before any standard SMC can be
received from a client.

``CFG_VIRT_GUEST_COUNT`` controls the maximum number of supported VMs. As
OP-TEE has a limited amount of memory available, increasing this count will
decrease the amount of memory available to each VM. Because we want VMs to be
independent, OP-TEE splits the available memory into equal portions for every
VM, so one VM cannot consume all the memory and cause a DoS for the other VMs.

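The split itself is nothing more than an equal division of the TA memory pool.
The snippet below only illustrates that arithmetic (it is not the actual
OP-TEE code); the 6 MiB pool size is an arbitrary example value:

.. code-block:: c

    #include <stdio.h>

    #define CFG_VIRT_GUEST_COUNT 3
    #define TA_RAM_SIZE (6 * 1024 * 1024)   /* example: 6 MiB TA pool */

    int main(void)
    {
        /* Every VM gets the same fixed share, used or not. */
        size_t per_guest = TA_RAM_SIZE / CFG_VIRT_GUEST_COUNT;

        printf("each VM may use at most %zu MiB\n",
               per_guest / (1024 * 1024));
        return 0;
    }
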
Requirements for hypervisor
***************************
As said earlier, the hypervisor should be aware of OP-TEE and of SMCs from
virtual guests to OP-TEE. This is the list of things that a compatible
hypervisor should perform (a sketch of such a mediator follows the list):

1. When a new OP-TEE-capable VM is created, the hypervisor should inform
   OP-TEE about it with the SMC ``OPTEE_SMC_VM_CREATED``. The ``a1``
   parameter should contain the VM ID. ID 0 is defined as ``HYP_CLNT_ID``
   and is reserved for the hypervisor itself.

2. When an OP-TEE-capable VM is being destroyed, the hypervisor should stop
   all its VCPUs (this ensures that OP-TEE has no active threads for that
   VM) and send the SMC ``OPTEE_SMC_VM_DESTROYED`` with the same parameters
   as for ``OPTEE_SMC_VM_CREATED``.

3. Any SMC to OP-TEE should have the VM ID in the ``a7`` parameter. This is
   either ``HYP_CLNT_ID`` if the call originates from the hypervisor, or the
   VM ID that was passed in the ``OPTEE_SMC_VM_CREATED`` call.

4. The hypervisor should perform IPA<->PA address translation for all SMCs.
   This includes both arguments in the ``a1``-``a6`` registers and in-memory
   command buffers.

5. The hypervisor should pin memory pages that a VM shares with OP-TEE. This
   means that the hypervisor should ensure that a pinned page resides at the
   original PA for as long as it is shared with OP-TEE, and that it still
   belongs to the VM that shared it. For example, the hypervisor should not
   swap out this page, transfer its ownership to another VM, unmap it from
   the VM address space and so on.

6. Naturally, the hypervisor should correctly handle the OP-TEE protocol, so
   for any VM it should look like it is working with OP-TEE directly.

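The fragment below sketches what such a mediator could look like on the
hypervisor side. Only the register layout (function ID in ``a0``, VM ID in
``a1`` for ``OPTEE_SMC_VM_CREATED`` and in ``a7`` for every call) follows the
protocol above; the ``struct smc_args`` layout and the helpers
``issue_smc()``, ``ipa_to_pa()`` and ``pin_guest_pages()`` are hypothetical
stand-ins for hypervisor-specific services:

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <optee_smc.h>  /* OPTEE_SMC_VM_CREATED, HYP_CLNT_ID (assumed location) */

    struct smc_args {
        uint64_t a0, a1, a2, a3, a4, a5, a6, a7;
    };

    /* Hypervisor-specific primitives, assumed to exist. */
    uint64_t issue_smc(struct smc_args *args);
    uint64_t ipa_to_pa(uint16_t vm_id, uint64_t ipa);
    void pin_guest_pages(uint16_t vm_id, uint64_t ipa, size_t size);

    /* Point 1: announce a new guest to OP-TEE. */
    static void mediator_vm_created(uint16_t vm_id)
    {
        struct smc_args args = {
            .a0 = OPTEE_SMC_VM_CREATED,
            .a1 = vm_id,          /* VM ID travels in a1 for this call */
            .a7 = HYP_CLNT_ID,    /* the call itself comes from the hypervisor */
        };

        issue_smc(&args);
    }

    /* Points 3-5: forward a guest SMC to OP-TEE. */
    static void mediator_forward(uint16_t vm_id, struct smc_args *args)
    {
        /* Keep the shared buffer resident at a stable PA while OP-TEE uses it. */
        pin_guest_pages(vm_id, args->a1, 4096);

        /*
         * Registers that carry buffer addresses hold IPAs and must be
         * rewritten to PAs. Which registers hold addresses (and how a
         * 64-bit address is split) depends on the particular call; a
         * single address in a1 is assumed here purely for illustration.
         */
        args->a1 = ipa_to_pa(vm_id, args->a1);

        args->a7 = vm_id;         /* every SMC is tagged with the VM ID */
        issue_smc(args);
    }
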
Limitations
***********
Virtualization support is in an experimental state and it has some limitations
that the user should be aware of.

Platform support
================
Only the Armv8 architecture is supported. There is no hard restriction, but
currently the Armv7-specific code (like MMU or thread manipulation) just knows
nothing about virtualization. Only one platform has been tested right now and
that is QEMU-V8 (i.e. QEMU emulating an Arm Versatile Express with the Armv8
architecture). Support for R-Car Gen3 should be added soon.

Static VM guest count and memory allocation
============================================
Currently, a user should configure the maximum number of guests. OP-TEE will
split the memory into equal chunks, so every VM will have the same amount of
memory. For example, if you have 6MB for your TAs, you can set
``CFG_VIRT_GUEST_COUNT`` to 3 and every VM would be able to use 2MB at most,
even if no other VMs are running. This is okay for embedded setups where you
know the exact number and roles of the VMs, but it can be inconvenient for
server applications. Also, it is impossible to configure the amount of memory
available for a given VM. Every VM instance will have exactly the same amount
of memory.

Sharing hardware resources and PTAs
===================================
Right now the only hardware that can be used by multiple VMs simultaneously is
the serial console, used for logging. Devices like hardware crypto
accelerators, secure storage devices (e.g. external flash storage, accessed
directly from OP-TEE) and others are not supported right now. Drivers should
be made virtualization-aware before they can be used with virtualization
extensions.

Every VM will have its own PTA states, which is a good thing in most cases.
But if one wants a PTA to have some global state that is shared between VMs,
the PTA needs to be written accordingly.

No compatibility with "normal" mode
===================================
OP-TEE built with ``CFG_VIRTUALIZATION=y`` will not work without a hypervisor,
because before executing any standard SMC, ``OPTEE_SMC_VM_CREATED`` must be
called. This can be inconvenient if one wants to switch between virtualized
and non-virtualized environments frequently. On the other hand, it is not a
big deal in a production environment. A simple workaround could be made for
this: if OP-TEE receives a standard SMC prior to ``OPTEE_SMC_VM_CREATED``, it
implicitly creates a VM context and uses it for all subsequent calls, as
sketched below.

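The following fragment only illustrates that idea; it is not implemented in
OP-TEE. ``virt_guest_created()`` stands for whatever routine handles
``OPTEE_SMC_VM_CREATED`` internally, and both its signature and
``DEFAULT_GUEST_ID`` are assumptions made for this sketch:

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define DEFAULT_GUEST_ID 1   /* arbitrary ID for the implicit VM context */

    /* Assumed entry point that normally backs OPTEE_SMC_VM_CREATED. */
    void virt_guest_created(uint16_t guest_id);

    static bool implicit_guest;       /* set when the fallback VM is created */
    static bool any_guest_registered; /* set by a real OPTEE_SMC_VM_CREATED */

    /* Called at the start of every standard SMC. */
    static uint16_t effective_guest_id(uint64_t a7)
    {
        if (!any_guest_registered && !implicit_guest) {
            /* A standard SMC arrived before any VM was announced. */
            virt_guest_created(DEFAULT_GUEST_ID);
            implicit_guest = true;
        }

        if (implicit_guest)
            return DEFAULT_GUEST_ID; /* ignore a7, use the implicit context */

        return (uint16_t)a7;
    }
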
Implementation details
======================
OP-TEE as a whole can be split into two entities. Let us call them "nexus" and
TEE. The nexus is the core part of OP-TEE that takes care of low-level things:
SMC handling, memory management, thread creation and so on. The TEE is the
part that does the actual job: it handles requests, loads TAs, executes them,
and so on. So, it is natural to have one nexus instance and multiple instances
of the TEE, one TEE instance per registered VM. This can be done either
explicitly or implicitly.

The explicit way is to move the TEE state into some sort of structure and make
all code access the fields of this structure, something like ``struct
task_struct`` and ``current`` in the Linux kernel (see the sketch below). Then
it is easy to allocate such a structure for every VM instance. But this
approach basically requires rewriting all of the OP-TEE code.

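For comparison, here is a minimal sketch of what that explicit approach could
look like; every name (``struct tee_instance``, ``current_tee`` and the
fields) is invented for illustration only:

.. code-block:: c

    #include <stdint.h>

    #ifndef CFG_VIRT_GUEST_COUNT
    #define CFG_VIRT_GUEST_COUNT 2    /* normally comes from the build configuration */
    #endif

    struct ta_ctx;                    /* opaque per-TA context, illustrative */

    struct tee_instance {
        uint16_t guest_id;
        void *heap;                   /* per-VM heap */
        struct ta_ctx *ta_ctxs;       /* per-VM list of loaded TAs */
        /* ... every other piece of formerly global TEE state ... */
    };

    /* One instance per possible guest, selected on every SMC entry. */
    static struct tee_instance instances[CFG_VIRT_GUEST_COUNT];
    static struct tee_instance *current_tee;

    /* Each access to a former global has to be rewritten along these lines. */
    static struct ta_ctx *loaded_tas(void)
    {
        return current_tee->ta_ctxs;
    }
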
The implicit way is to have banked memory sections for the TEE/VM instances.
The memory layout can then look something like this:

.. code-block:: none

    +--------------------------------------------------+
    | Nexus: .nex_bss, .nex_data, ...                  |
    +--------------------------------------------------+
    | TEE states                                       |
    |                                                  |
    | VM 1 TEE state | VM 2 TEE state | VM 3 TEE state |
    | .bss, .data    | .bss, .data    | .bss, .data    |
    +--------------------------------------------------+

This approach requires no changes in the TEE code and only some changes in the
nexus code. The idea is that the nexus state resides in separate sections
(``.nex_data``, ``.nex_bss``, ``.nex_nozi``, ``.nex_heap`` and others) and is
always mapped.

The TEE state resides in the standard sections (like ``.data``, ``.bss``,
``.heap`` and so on). There is a separate set of these sections for every
registered VM and the nexus maps them only when it receives a call from the
corresponding VM.

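Whether a given variable ends up in a banked TEE section or in an always-mapped
nexus section is decided at build time. The snippet below shows the general
idea using a plain GCC section attribute; whether OP-TEE provides a dedicated
macro for this, and what it is called, is not covered here:

.. code-block:: c

    /* Ordinary globals land in .data/.bss and are banked per VM. */
    static unsigned int ta_load_count;

    /*
     * Data the nexus needs regardless of which VM (if any) is currently
     * mapped has to be placed in a .nex_* section instead.
     */
    static unsigned int registered_guests __attribute__((section(".nex_bss")));
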
As the nexus and the TEE have separate heaps, the ``bget`` allocator was
extended to work with multiple "contexts". ``malloc()``, ``free()`` and
friends work with one context, and ``nex_malloc()`` (and the other ``nex_``
functions) were added. They use a different context, so the nexus can use a
separate heap which is always mapped into the OP-TEE address space. When
virtualization support is disabled, all those ``nex_`` functions are defined
to point to their standard ``malloc()`` counterparts.

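A small usage sketch, assuming ``nex_free()`` is among the ``nex_``
counterparts mentioned above; the ``struct guest_book`` type and the scenario
are invented for illustration:

.. code-block:: c

    #include <stdint.h>
    #include <stdlib.h>   /* malloc()/free() for a standalone build of the sketch */

    /* Declared here for the sketch; in OP-TEE these come with the nexus heap. */
    void *nex_malloc(size_t size);
    void nex_free(void *ptr);

    struct guest_book {
        uint16_t guest_id;
        /* ... bookkeeping the nexus needs even when no VM is mapped ... */
    };

    void example(uint16_t guest_id, size_t msg_size)
    {
        /* Nexus-lifetime data: allocated from the always-mapped nexus heap. */
        struct guest_book *book = nex_malloc(sizeof(*book));

        /* TEE-lifetime data: allocated from the per-VM heap of the caller. */
        void *msg_buf = malloc(msg_size);

        if (book && msg_buf) {
            book->guest_id = guest_id;
            /* ... use both buffers ... */
        }

        free(msg_buf);
        nex_free(book);
    }
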
To change memory mappings at run-time, a new entity named "partition" has been
added to the MMU code. It is defined by ``struct mmu_partition`` and holds
information about all the page tables, so the whole MMU mapping can be
switched by one write to the ``TTBR`` register.

There is a default partition that holds the MMU state when no VM context is
active, so no TEE state is mapped. When OP-TEE receives the
``OPTEE_SMC_VM_CREATED`` call, it copies the default partition into a new one
and then maps the sections with TEE data. This is done by the
``prepare_memory_map()`` function in ``virtualization.c``.

When OP-TEE receives a standard call it checks that the supplied VM ID is
valid and then activates the corresponding MMU partition, so the TEE code can
access its own data. This is basically how virtualization support works; the
sketch below summarizes the flow.
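A condensed sketch of that flow, not the actual implementation (error
unwinding is trimmed): ``prepare_memory_map()`` is the function named above,
while ``struct guest_ctx``, ``clone_default_partition()``,
``set_current_partition()``, ``register_guest()`` and ``find_guest()`` are
illustrative stand-ins:

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct mmu_partition;   /* described above, holds a full set of page tables */

    struct guest_ctx {
        uint16_t id;
        struct mmu_partition *prtn;
    };

    /* Illustrative helpers standing in for the real nexus/MMU routines. */
    struct mmu_partition *clone_default_partition(void);
    void prepare_memory_map(struct mmu_partition *prtn);
    void set_current_partition(struct mmu_partition *prtn);
    void register_guest(struct guest_ctx *ctx);
    struct guest_ctx *find_guest(uint16_t id);
    void *nex_malloc(size_t size);

    /* OPTEE_SMC_VM_CREATED: set up a banked TEE state for the new guest. */
    static bool handle_vm_created(uint16_t guest_id)
    {
        struct guest_ctx *ctx = nex_malloc(sizeof(*ctx));

        if (!ctx)
            return false;

        ctx->id = guest_id;
        ctx->prtn = clone_default_partition();
        prepare_memory_map(ctx->prtn);      /* map this VM's .data/.bss/heap */
        register_guest(ctx);
        return true;
    }

    /* Entry of every standard call: switch to the caller's partition. */
    static bool enter_guest(uint16_t guest_id)
    {
        struct guest_ctx *ctx = find_guest(guest_id);

        if (!ctx)
            return false;                   /* unknown or already destroyed VM */

        set_current_partition(ctx->prtn);   /* effectively one TTBR write */
        return true;
    }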