
Commit edb7b4f

CoW Developer docs (#1305)
* Add doc on guest-aided CoW design
* Update docs to include some diagrams and current behaviors
* Respond to feedback
* typo

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Co-authored-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
1 parent 99d2ec2 commit edb7b4f

1 file changed: +170 -76 lines changed

docs/paging-development-notes.md

# Guest-aided Copy-on-Write snapshots

When running on a Type 1 hypervisor, servicing a Stage 2 translation page fault is relatively expensive, since it requires a large number of context switches. To help alleviate this, Hyperlight uses a design in which the guest is aware of a readonly snapshot from which it is being run, and manages its own copy-on-write.

Because of this, there are two fundamental regions of the guest physical address space, which are always populated: one, at the very bottom of memory, is a (hypervisor-enforced) readonly mapping of the base snapshot from which this guest is being evolved. The other, at the top of memory, is simply a large bag of blank pages: scratch memory into which this VM can write.

For the detailed layout of each region, including field offsets, see the diagrams and comments in [`src/hyperlight_host/src/mem/layout.rs`](../src/hyperlight_host/src/mem/layout.rs) and the constants in [`hyperlight_common::layout`](../src/hyperlight_common/src/layout.rs).

## The scratch map

Whenever the guest needs to write to a page in the snapshot region, it will need to copy it into a page in the scratch region, and change the original virtual address to point to the new page.

```
CoW page fault flow:

BEFORE (guest writes to CoW page -> fault)

  PTE for VA 0x5000:
  +----------+-----+-----+
  | GPA      | CoW | R/O |    Points to snapshot page
  | 0x5000   |  1  |  1  |
  +----------+-----+-----+
       |
       v
  Snapshot region (readonly)
  +--------------------+
  | original content   |    GPA 0x5000
  +--------------------+

AFTER (fault handler resolves)

  1. Allocate fresh page from scratch (bump allocator)
  2. Copy snapshot page -> new scratch page
  3. Update PTE to point to scratch page

  PTE for VA 0x5000:
  +----------+-----+-----+
  | GPA      | CoW | R/W |    Points to scratch page
  | 0xf_ff.. |  0  |  1  |
  +----------+-----+-----+
       |
       v
  Scratch region (writable)
  +--------------------+
  | copied content     |    (new GPA in scratch)
  +--------------------+

Snapshot page at GPA 0x5000 is untouched.
```

The page table entries to do this will likely need to be copied themselves, and so a ready supply of already-mapped scratch pages to use for replacement page tables is needed. Currently, the guest accomplishes this by keeping an identity mapping of the entire scratch memory around.
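
The fault flow in the diagram above can be sketched as a small simulation in Rust. This is illustrative only: `Pte`, `PhysMem`, and `handle_cow_fault` are hypothetical names over plain Rust data, not Hyperlight's guest code, and real PTEs are architecture-defined bitfields.

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 0x1000;

/// Simplified PTE: target guest-physical address plus CoW/writable bits.
/// (Real PTEs are hardware-defined bitfields; this is illustrative.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pte {
    pub gpa: u64,
    pub cow: bool,
    pub writable: bool,
}

/// Simulated guest-physical memory: one buffer per page, keyed by GPA,
/// plus the bump-allocator cursor into the scratch region.
pub struct PhysMem {
    pub pages: HashMap<u64, Vec<u8>>,
    pub next_free: u64,
}

impl PhysMem {
    /// Step 1 of the flow: allocate a fresh page from scratch.
    /// There is no free(); unused pages simply drop out at snapshot time.
    pub fn alloc_scratch(&mut self) -> u64 {
        let gpa = self.next_free;
        self.next_free += PAGE_SIZE as u64;
        self.pages.insert(gpa, vec![0u8; PAGE_SIZE]);
        gpa
    }
}

/// Steps 1-3: copy the snapshot page into a fresh scratch page and
/// retarget the PTE as read/write. The snapshot page itself is untouched.
pub fn handle_cow_fault(pte: &mut Pte, mem: &mut PhysMem) {
    assert!(pte.cow && !pte.writable, "only CoW pages fault like this");
    let copy = mem.pages[&pte.gpa].clone();  // step 2: read snapshot contents
    let new_gpa = mem.alloc_scratch();       // step 1: fresh scratch page
    mem.pages.insert(new_gpa, copy);         // step 2: write the copy
    *pte = Pte { gpa: new_gpa, cow: false, writable: true }; // step 3
}

fn main() {
    let mut mem = PhysMem { pages: HashMap::new(), next_free: 0xf_f000 };
    mem.pages.insert(0x5000, vec![0xAA; PAGE_SIZE]); // snapshot page
    let mut pte = Pte { gpa: 0x5000, cow: true, writable: false };

    handle_cow_fault(&mut pte, &mut mem);

    assert!(pte.writable && !pte.cow);
    assert_eq!(pte.gpa, 0xf_f000); // now points into scratch
    assert_eq!(mem.pages[&pte.gpa], mem.pages[&0x5000]); // same contents
    println!("CoW fault resolved: PTE now -> {:#x}", pte.gpa);
}
```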

The host and the guest need to agree on the location of this mapping, so that (a) the host can create it when first setting up a blank guest and (b) the host can ignore it when taking a snapshot (see below).

Currently, the host always creates the scratch map at the top of virtual memory. In the future, we may add support for a guest to request that it be moved.

## The snapshot mapping

The snapshot page tables must be mapped at some virtual address so that the guest can read and copy them during CoW operations. The preferred approach is to map the snapshot page tables directly from the snapshot region into the guest's virtual address space.

However, on amd64, this is complicated by architectural constraints. Currently, the host simply copies the page tables into scratch when restoring a sandbox, and the guest works on those scratch copies directly. In the near future, we expect to be able to use the preferred approach on aarch64, and, with some minor hypervisor changes, on amd64 as well.

## Top-of-scratch metadata layout

The top of the scratch region contains structured metadata at fixed offsets, such as the scratch size, the allocator state, and where the exception stack starts. These offsets are defined as `SCRATCH_TOP_*` constants in [`hyperlight_common::layout`](../src/hyperlight_common/src/layout.rs), which has detailed comments on each field.

## The physical page allocator

The host needs to be able to reset the state of the physical page allocator when resuming from a snapshot. Currently, we use a simple bump allocator as a physical page allocator, with no support for free, since pages not in use will automatically be omitted from a snapshot. The allocator state is a single `u64` tracking the address of the first free page, located below the metadata at the top of scratch. The guest advances it atomically.
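
As a sketch, that state amounts to one atomic `u64`; the `BumpAllocator` type below is a hypothetical illustration, not the actual Hyperlight implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PAGE_SIZE: u64 = 0x1000;

/// Sketch of the allocator state: a single u64 holding the address of
/// the first free page. There is no free(); unused pages are simply
/// omitted when the next snapshot is taken.
pub struct BumpAllocator {
    first_free: AtomicU64,
}

impl BumpAllocator {
    pub fn new(base: u64) -> Self {
        Self { first_free: AtomicU64::new(base) }
    }

    /// Guest side: atomically claim `n` contiguous pages and return the
    /// address of the first one.
    pub fn alloc_pages(&self, n: u64) -> u64 {
        self.first_free.fetch_add(n * PAGE_SIZE, Ordering::Relaxed)
    }

    /// Host side: reset the cursor wholesale when resuming a snapshot.
    pub fn reset(&self, base: u64) {
        self.first_free.store(base, Ordering::Relaxed);
    }
}

fn main() {
    let alloc = BumpAllocator::new(0x10_0000);
    let a = alloc.alloc_pages(1);
    let b = alloc.alloc_pages(2);
    assert_eq!(a, 0x10_0000);
    assert_eq!(b, 0x10_1000);
    alloc.reset(0x10_0000); // host restores from snapshot
    assert_eq!(alloc.alloc_pages(1), 0x10_0000);
    println!("bump allocator ok");
}
```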

## The guest exception stack

Similarly, the guest needs a stack that is always writable, in order to be able to take exceptions to it. The exception stack begins below the metadata at the top of the scratch region and grows downward.

## Taking a snapshot

When the host takes a snapshot of a guest, it will traverse the guest page tables, collecting every (non-page-table) physical page that is mapped (outside of the scratch map) in the guest. It will write out a new compacted snapshot with precisely those pages in order, and a new set of page tables which produce precisely the same virtual memory layout, except for the scratch map.
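
The walk-and-compact step might look roughly like this. This is a simulation over hypothetical types; the real traversal walks hardware page tables and skips both the scratch map and the page-table pages themselves.

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 0x1000;

/// Given the guest's virtual-to-physical page mappings (with the scratch
/// map and page-table pages already excluded), emit the mapped pages in
/// order plus new mappings that reproduce the same virtual layout
/// against the densely packed snapshot.
pub fn compact_snapshot(
    mappings: &[(u64, u64)],      // (GVA, GPA) pairs from the walk
    phys: &HashMap<u64, Vec<u8>>, // simulated physical pages
) -> (Vec<Vec<u8>>, Vec<(u64, u64)>) {
    let mut pages = Vec::new();
    let mut new_mappings = Vec::new();
    for &(gva, gpa) in mappings {
        let new_gpa = (pages.len() as u64) * PAGE_SIZE; // densely packed
        pages.push(phys[&gpa].clone());
        new_mappings.push((gva, new_gpa));
    }
    (pages, new_mappings)
}

fn main() {
    let mut phys = HashMap::new();
    phys.insert(0x5000u64, vec![1u8; PAGE_SIZE as usize]);
    phys.insert(0x9000u64, vec![2u8; PAGE_SIZE as usize]);
    // Sparse physical pages become a dense, ordered snapshot with the
    // same virtual layout.
    let (pages, maps) =
        compact_snapshot(&[(0x40_0000, 0x5000), (0x40_1000, 0x9000)], &phys);
    assert_eq!(pages.len(), 2);
    assert_eq!(maps, vec![(0x40_0000, 0x0000), (0x40_1000, 0x1000)]);
    println!("compacted {} pages", pages.len());
}
```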

### Pre-sizing the scratch region

When creating a snapshot, the host must provide the size of the scratch region that will be used when this snapshot is next restored into a sandbox. This will then be baked into the guest page tables created in the snapshot.

TODO: add support, if found to be useful operationally, for either dynamically growing the scratch region or changing its size between taking a snapshot and restoring it.

### Call descriptors

Taking a snapshot is presently only supported in between top-level calls, i.e. there may be no calls in flight at the time of snapshotting. This is not enforced, but odd things may happen if it is violated.

Buffer management between the host and guest is needed to pass call arguments and return values. Ideally, buffers would be dynamically allocated from the scratch region as needed.

Currently, I/O buffers are statically allocated at the bottom of the scratch region. This is a stopgap pending improved physical allocation and buffer management.

The minimum scratch size is calculated by `min_scratch_size()` in the architecture-specific layout modules under `hyperlight_common`; see that function for the detailed breakdown of required overhead.

## Creating a fresh guest

When a fresh guest is created, the snapshot region will contain the loadable pages of the input ELF and an initial set of page tables, which simply map the segments of that ELF to the appropriate places in virtual memory. If the ELF has segments whose virtual addresses overlap with the scratch map, an error will be returned.
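
That overlap check reduces to standard half-open interval arithmetic; a minimal sketch follows, where the scratch-map bounds are made-up values for illustration, not Hyperlight's actual layout.

```rust
/// Two half-open ranges [a_start, a_end) and [b_start, b_end) overlap
/// iff each one starts before the other ends.
pub fn ranges_overlap(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> bool {
    a_start < b_end && b_start < a_end
}

fn main() {
    // Hypothetical scratch map occupying the top 1 GiB of a 48-bit
    // address space (illustrative bounds only).
    let scratch_start = 0xffff_c000_0000u64;
    let scratch_end = 0x1_0000_0000_0000u64;

    // An ELF segment far below the scratch map is fine...
    assert!(!ranges_overlap(0x40_0000, 0x50_0000, scratch_start, scratch_end));
    // ...but a segment whose range reaches into it must be rejected.
    assert!(ranges_overlap(
        scratch_start - 0x1000,
        scratch_start + 0x1000,
        scratch_start,
        scratch_end
    ));
    println!("overlap checks ok");
}
```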

In the current startup path, the host enters the guest with the stack pointer pointing to the exception stack. Early guest init then allocates the main stack at `MAIN_STACK_TOP_GVA`, switches to it, and continues generic initialization. Note that exception stack overflows can be difficult to detect, since there is no guard page below the exception stack within the scratch region.

# Architecture-specific details of virtual memory setup

## amd64

Hyperlight unconditionally uses 48-bit virtual addresses (4-level paging) and enables PAE. The guest is always entered in long mode.
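
Under 48-bit, 4-level paging, a virtual address splits into four 9-bit table indices plus a 12-bit page offset; a quick sketch of the decomposition:

```rust
/// Split a canonical 48-bit virtual address into its 4-level paging
/// indices (9 bits each) and the 12-bit offset within the 4 KiB page.
pub fn split_va(va: u64) -> [u64; 5] {
    [
        (va >> 39) & 0x1ff, // PML4 index
        (va >> 30) & 0x1ff, // PDPT index
        (va >> 21) & 0x1ff, // PD index
        (va >> 12) & 0x1ff, // PT index
        va & 0xfff,         // byte offset within the page
    ]
}

fn main() {
    // 0x5000 lives in the very first tables: only the PT index is set.
    assert_eq!(split_va(0x5000), [0, 0, 0, 5, 0]);
    // One entry into each level, plus a byte offset:
    let va = (1u64 << 39) | (1 << 30) | (1 << 21) | (1 << 12) | 0x123;
    assert_eq!(split_va(va), [1, 1, 1, 1, 0x123]);
    println!("va split ok");
}
```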
