Required reading: Disco
What is a virtual machine? IBM definition: a fully protected and isolated copy of the underlying machine's hardware.
What's the basic idea behind virtualization? Having the CPU do most of the work natively and making the programs running in the virtual machine think that they are running on a real machine.
Virtual machines can be useful for a number of reasons:
What are the alternatives? Processor emulation (e.g. bochs) or binary emulation (WINE). Emulation runs instructions purely in software; virtualization gets out of the way whenever possible. Emulation therefore gives portability, whereas virtualization focuses on performance. The price of emulation is that you need to model the hardware very carefully in software. Binary emulation focuses on getting just the system call interface of a particular operating system right. It can be hard because it is targeted at a particular operating system (and even that interface can change between revisions).
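To make "runs instructions purely in software" concrete, here is a minimal sketch of an interpreter loop. The three-instruction ISA, the toy_cpu struct and toy_emulate are all invented for illustration and do not correspond to bochs or any real emulator. The point is only that every guest instruction costs a trip through a host-side fetch/decode/execute loop, which is the overhead virtualization tries to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA: every instruction except HALT is 3 bytes. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

typedef struct {
    uint32_t      regs[4]; /* guest registers live in ordinary host memory */
    size_t        pc;      /* guest program counter (byte offset into code) */
    unsigned long steps;   /* number of guest instructions interpreted */
} toy_cpu;

/* Interpret guest code until HALT.
   Returns 0 on a clean HALT, -1 on a bad opcode or a truncated program. */
int toy_emulate(toy_cpu *cpu, const uint8_t *code, size_t len)
{
    assert(cpu != NULL && code != NULL);
    while (cpu->pc < len) {
        uint8_t op = code[cpu->pc];          /* fetch */
        cpu->steps++;
        switch (op) {                        /* decode */
        case OP_HALT:
            return 0;
        case OP_LOADI:                       /* LOADI reg, imm8 */
            if (cpu->pc + 2 >= len) return -1;
            cpu->regs[code[cpu->pc + 1] & 3] = code[cpu->pc + 2];
            cpu->pc += 3;
            break;
        case OP_ADD:                         /* ADD dst, src */
            if (cpu->pc + 2 >= len) return -1;
            cpu->regs[code[cpu->pc + 1] & 3] += cpu->regs[code[cpu->pc + 2] & 3];
            cpu->pc += 3;
            break;
        default:
            return -1;                       /* unknown opcode */
        }
    }
    return -1;                               /* ran off the end without HALT */
}
```

A caller would zero a toy_cpu, hand toy_emulate a byte array of guest code, and read the results back out of cpu.regs. A VMM, by contrast, would let the real CPU run such instructions directly and step in only for the privileged ones.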
What needs to be virtualized? What might that entail?
Types of virtualization
read()).
Single vs multiprocessor
Understanding memory virtualization. Let's consider the MIPS example from the paper. Ideally, we'd be able to intercept and rewrite all memory address references (e.g. by intercepting virtual memory calls). Why can't we do this on the MIPS? (There are addresses that don't go through address translation, but we don't want the virtual machine to directly access memory!) What does Disco do to get around this problem? (Relink the kernel so that it runs outside this untranslated address range.)
Having gotten around that problem, how do we handle things in general?
    // Disco's TLB miss handler.
    // Called when a memory reference for virtual address 'VA' is made,
    // but there is no VA->MA (virtual -> machine) mapping in the CPU's TLB.
    void
    tlb_miss_handler (VA)
    {
      // See if we have a mapping in our "shadow" TLB (which includes
      // the "main" TLB).
      tlb_entry *t = tlb_lookup (thiscpu->l2tlb, VA);
      if (t) {
        // The shadow TLB records physical addresses, so translate
        // PA -> MA before loading the hardware TLB.
        tlbwrite (VA, thiscpu->pmap[t->pa], t->otherdata);
      } else {
        // trap to the virtual CPU/OS's handler
      }
    }

    // Disco's procedure which emulates the MIPS
    // instruction which writes to the TLB.
    //
    // VA -- virtual address
    // PA -- physical address (NOT MA machine address!)
    // otherdata -- perms and stuff
    void
    emulate_tlbwrite_instruction (VA, PA, otherdata)
    {
      tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache
      if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically
        MA = allocate_machine_page ();
        thiscpu->pmap[PA] = MA;         // See 4.2.2
        thiscpu->pmapbackmap[MA] = PA;
        thiscpu->memmap[MA] = VA;       // See 4.2.3 (for TLB shootdowns)
      }
      tlbwrite (VA, thiscpu->pmap[PA], otherdata);
    }

    // Disco's procedure which emulates the MIPS
    // instruction which reads the TLB.
    tlb_entry *
    emulate_tlbread_instruction (VA)
    {
      // Must return a TLB entry that has a "Physical" address;
      // this is recorded in our secondary TLB cache.
      // (We don't have to read from the hardware TLB since
      // all writes to the hardware TLB are mediated by Disco.
      // Thus we can always keep the l2tlb up to date.)
      return tlb_lookup (thiscpu->l2tlb, VA);
    }
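The pseudocode above can be exercised as a small user-space model. What follows is only a sketch under loud assumptions: pages are plain integers, the shadow TLB is a tiny linearly searched array, machine pages come from a bump allocator, and all names (vmm, vmm_guest_tlbwrite, and so on) are hypothetical rather than Disco's. It shows the two invariants that matter: the guest only ever sees physical page numbers, and the "hardware" only ever sees machine page numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPHYS    16           /* guest "physical" pages (toy size) */
#define L2SIZE   64           /* shadow (second-level) TLB entries */
#define UNMAPPED UINT32_MAX

typedef struct { uint32_t vpn, ppn; int valid; } l2_entry;

typedef struct {
    uint32_t pmap[NPHYS];     /* PA page -> MA page, filled in lazily */
    l2_entry l2tlb[L2SIZE];   /* what the guest believes is in its TLB */
    int      l2next;          /* round-robin replacement cursor */
    uint32_t next_free_ma;    /* trivial bump allocator for machine pages */
} vmm;

void vmm_init(vmm *m, uint32_t first_ma)
{
    assert(m != NULL);
    for (int i = 0; i < NPHYS; i++) m->pmap[i] = UNMAPPED;
    for (int i = 0; i < L2SIZE; i++) m->l2tlb[i].valid = 0;
    m->l2next = 0;
    m->next_free_ma = first_ma;
}

static l2_entry *l2_find(vmm *m, uint32_t vpn)
{
    for (int i = 0; i < L2SIZE; i++)
        if (m->l2tlb[i].valid && m->l2tlb[i].vpn == vpn)
            return &m->l2tlb[i];
    return NULL;
}

/* Guest executes "TLB write vpn -> ppn". Returns the MA page that would be
   loaded into the hardware TLB, or UNMAPPED if ppn is out of range. */
uint32_t vmm_guest_tlbwrite(vmm *m, uint32_t vpn, uint32_t ppn)
{
    if (ppn >= NPHYS) return UNMAPPED;
    l2_entry *e = l2_find(m, vpn);
    if (e == NULL) {                       /* no entry yet: take a slot */
        e = &m->l2tlb[m->l2next];
        m->l2next = (m->l2next + 1) % L2SIZE;
    }
    e->vpn = vpn; e->ppn = ppn; e->valid = 1;
    if (m->pmap[ppn] == UNMAPPED)          /* allocate backing page on demand */
        m->pmap[ppn] = m->next_free_ma++;
    return m->pmap[ppn];
}

/* Hardware TLB miss: refill from the shadow TLB if the guest had a mapping.
   UNMAPPED means "forward the miss to the guest OS's own handler". */
uint32_t vmm_tlb_miss(vmm *m, uint32_t vpn)
{
    l2_entry *e = l2_find(m, vpn);
    return e ? m->pmap[e->ppn] : UNMAPPED;
}

/* Guest reads its TLB: it must see the physical page, never the machine page. */
uint32_t vmm_guest_tlbread(vmm *m, uint32_t vpn)
{
    l2_entry *e = l2_find(m, vpn);
    return e ? e->ppn : UNMAPPED;
}
```

Two virtual pages that the guest maps to the same physical page end up backed by the same machine page, because the PA -> MA step goes through the single pmap. The back maps used for TLB shootdowns are left out of this sketch.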
On the x86, the VMM must intercept any modifications to the page table and substitute appropriate responses. It must also keep things like the accessed/dirty bits up to date.
Requirements:
The MIPS didn't quite meet the second criterion, as discussed above. But it does have a supervisor mode, between user mode and kernel mode, in which any privileged instruction will trap.
What might the VMM trap handler look like?
    void
    privilege_trap_handler (addr)
    {
      instruction, args = decode_instruction (addr);
      switch (instruction) {
      case foo:
        emulate_foo (thiscpu, args, ...);
        break;
      case bar:
        emulate_bar (thiscpu, args, ...);
        break;
      case ...:
        ...
      }
    }
The emulate_foo routines will have to evaluate the state of the virtual CPU and compute the appropriate "fake" answer.
What sort of state is needed in order to appropriately emulate all of these things?
- all user registers
- CPU specific regs (e.g. on x86, %crN, debugging, FP...)
- page tables (or tlb)
- interrupt tables

This is needed for each virtual processor.
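Putting the trap handler and the per-processor state together, a toy version might look like the following. This is a sketch only: the vcpu fields are a small subset of the list above, the four "privileged instructions" and the one-byte instruction length are invented for illustration, and privilege_trap stands in for the handler after decoding has already happened. The thing to notice is that each case reads or updates the virtual CPU's saved state and never touches the real control registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, already-decoded privileged instructions. */
typedef enum { INSN_CLI, INSN_STI, INSN_MOV_TO_CR3, INSN_MOV_FROM_CR3 } priv_insn;

/* Per-virtual-processor state (a small subset of what a real VMM keeps). */
typedef struct {
    uint32_t gpr[8];              /* user-visible registers */
    uint32_t cr3;                 /* guest's idea of the page table base */
    uint32_t idt_base;            /* guest's interrupt table */
    int      interrupts_enabled;  /* virtual interrupt flag */
    uint32_t pc;                  /* guest program counter */
} vcpu;

/* Emulate one trapped instruction against the virtual CPU state.
   Returns 0 if emulated, -1 if the instruction is unknown. */
int privilege_trap(vcpu *v, priv_insn insn, int reg)
{
    assert(reg >= 0 && reg < 8);
    switch (insn) {
    case INSN_CLI:          v->interrupts_enabled = 0;  break;
    case INSN_STI:          v->interrupts_enabled = 1;  break;
    case INSN_MOV_TO_CR3:   v->cr3 = v->gpr[reg];       break;
    case INSN_MOV_FROM_CR3: v->gpr[reg] = v->cr3;       break; /* the "fake" answer */
    default:
        return -1;
    }
    v->pc += 1;   /* skip the emulated instruction (toy: all are 1 byte) */
    return 0;
}
```

A guest that writes %cr3 and reads it back gets its own value returned, even though the real register holds whatever the VMM chose to load.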
What about in the x86? We know that it meets the first two criteria above. If you run the CPU in ring 3, most x86 instructions will be fine.
    // addr is a physical address
    void
    emulate_lcr3 (thiscpu, addr)
    {
      thiscpu->cr3 = addr;
      Pte *fakepdir = lookup (addr, oldcr3cache);
      if (!fakepdir) {
        fakepdir = ppage_alloc ();
        store (oldcr3cache, addr, fakepdir);
        // May wish to scan through supplied page directory to see if
        // we have to fix up anything in particular.
        // Exact settings will depend on how we want to handle
        // problem cases below and our own MM.
      }
      asm ("movl fakepdir,%cr3");
      // Must make sure our page fault handler is in sync with what we do here.
    }
However, there are some that are bad. Examples?
- pushf/popf: FL_IF is handled differently, for example.
- Any instruction (push, pop, mov) that reads or writes %cs.
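The popf case can be modelled in a few lines. On the x86, popf executed in ring 3 without I/O privilege does not trap; it silently leaves the interrupt flag alone. The sketch below models that behaviour and what a VMM would do instead once it gets control. The vflags struct and the function names are hypothetical, and other privileged flag bits (IOPL and friends) are ignored to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FL_IF 0x00000200u   /* x86 EFLAGS interrupt-enable bit */

/* What the hardware does for popf in ring 3 with IOPL < 3: the popped IF
   bit is silently dropped and the real IF is kept. No trap occurs. */
uint32_t hw_popf_ring3(uint32_t real_eflags, uint32_t popped)
{
    return (popped & ~FL_IF) | (real_eflags & FL_IF);
}

/* Hypothetical per-vcpu flag state kept by the VMM. */
typedef struct {
    uint32_t eflags;   /* flags the real CPU runs the guest with */
    int      virt_if;  /* the IF value the guest believes it has */
} vflags;

/* VMM emulation of popf: record the guest's IF in the virtual flag,
   leave the real IF under the VMM's control. */
void vmm_emulate_popf(vflags *v, uint32_t popped)
{
    assert(v != NULL);
    v->virt_if = (popped & FL_IF) != 0;
    v->eflags  = (popped & ~FL_IF) | (v->eflags & FL_IF);
}

/* VMM emulation of pushf: the guest sees its virtual IF, not the real one. */
uint32_t vmm_emulate_pushf(const vflags *v)
{
    assert(v != NULL);
    return (v->eflags & ~FL_IF) | (v->virt_if ? FL_IF : 0u);
}
```

Without the VMM's help, a guest kernel that clears IF with popf and then reads the flags back with pushf would see IF still set. This is why such instructions have to be found and replaced rather than left to trap on their own.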
The basic idea is to decode the instruction stream that is provided by the user and look for bad instructions. When we find one, we replace it with an interrupt (INT 3) that will allow the VMM to handle it correctly. This might look something like:
    void
    initcode ()
    {
      scan_for_nonvirtualizable (thiscpu, 0x7c00);
    }

    void
    scan_for_nonvirtualizable (thiscpu, startaddr)
    {
      addr = startaddr;
      instr = disassemble (addr);
      while (instr is not branch or bad) {
        addr += len (instr);
        instr = disassemble (addr);
      }
      // remember that we wanted to execute this instruction.
      replace (addr, "int 3");
      record (thiscpu->rewrites, addr, instr);
    }

    void
    breakpoint_handler (tf)
    {
      // (On real hardware the saved eip points just past the INT 3 byte,
      // so the lookup key has to be adjusted accordingly.)
      oldinstr = lookup (thiscpu->rewrites, tf->eip);
      if (oldinstr is branch) {
        newcs:neweip = evaluate branch;
        scan_for_nonvirtualizable (thiscpu, newcs:neweip);
        return;
      } else {
        // something non virtualizable
        // dispatch to appropriate emulation
      }
    }
All pages must be scanned in this way. Fortunately, most pages probably are okay and don't really need any special handling so after scanning them once, we can just remember that the page is okay and let it run natively.
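The scan-and-patch step can be tried out on a byte buffer. This is a deliberately tiny sketch: the decoder knows only five one-byte x86 opcodes (nop, pushf, popf, jmp rel8, int 3) and treats every other byte as a harmless one-byte instruction, which a real x86 decoder cannot do because instructions are variable length. The rewrite_log type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The few x86 opcodes this toy decoder understands. */
enum { X_NOP = 0x90, X_PUSHF = 0x9C, X_POPF = 0x9D, X_JMP8 = 0xEB, X_INT3 = 0xCC };

#define MAX_REWRITES 32
typedef struct { size_t addr; uint8_t orig; } rewrite;
typedef struct { rewrite r[MAX_REWRITES]; int n; } rewrite_log;

/* A scan stops at a branch or at a non-virtualizable instruction. */
static int stops_scan(uint8_t op)
{
    return op == X_PUSHF || op == X_POPF || op == X_JMP8;
}

/* Scan forward from 'start' and patch the first branch or bad instruction
   with INT 3, remembering the original opcode. Returns the address that is
   (or already was) patched, or 'len' if the scan reached the end. */
size_t scan_and_patch(uint8_t *code, size_t len, size_t start, rewrite_log *rl)
{
    assert(start <= len);
    for (size_t addr = start; addr < len; addr++) {   /* toy: 1-byte insns */
        uint8_t op = code[addr];
        if (op == X_INT3)
            return addr;                  /* patched by an earlier scan */
        if (stops_scan(op)) {
            if (rl->n == MAX_REWRITES)
                return len;               /* toy: give up when the log is full */
            rl->r[rl->n].addr = addr;
            rl->r[rl->n].orig = op;
            rl->n++;
            code[addr] = X_INT3;
            return addr;
        }
    }
    return len;
}

/* Breakpoint handler's lookup: which instruction did we replace at 'addr'?
   Returns 1 and fills *orig if found, 0 otherwise. */
int lookup_original(const rewrite_log *rl, size_t addr, uint8_t *orig)
{
    for (int i = 0; i < rl->n; i++)
        if (rl->r[i].addr == addr) { *orig = rl->r[i].orig; return 1; }
    return 0;
}
```

Everything between the scan start and the patched address runs natively. The VMM regains control only at the patched byte, where it either emulates the bad instruction or evaluates the branch and scans again from the target.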
What about writes? We must detect self-modifying code (e.g. we must simulate buffer overflow attacks correctly). When a write hits a physical page that happens to be in a code segment, we must trap the write and then rescan the affected portions of the page.
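One way to picture the bookkeeping is as a per-page state machine: a page is scanned before it first runs, then write-protected, and a trapped write sends it back to the unscanned state. The sketch below is an assumption-laden simplification with hypothetical names; it invalidates the whole page rather than rescanning only the affected portion, and it models the write-protection fault as a return value.

```c
#include <assert.h>

#define NPAGES 8

typedef enum { PG_UNSCANNED = 0, PG_SCANNED } page_state;

typedef struct {
    page_state st[NPAGES];  /* PG_SCANNED pages are write-protected */
    unsigned   scans;       /* how many page scans we have performed */
} code_tracker;

/* Called before the guest executes from 'page': scan it if needed,
   then leave it marked scanned (and therefore write-protected). */
void on_execute(code_tracker *t, unsigned page)
{
    assert(page < NPAGES);
    if (t->st[page] == PG_UNSCANNED) {
        t->scans++;                 /* stand-in for the real instruction scan */
        t->st[page] = PG_SCANNED;
    }
}

/* Called when the guest writes to 'page'. Returns 1 if the write hit a
   scanned page and therefore trapped to the VMM, 0 if it ran natively. */
int on_write(code_tracker *t, unsigned page)
{
    assert(page < NPAGES);
    if (t->st[page] == PG_SCANNED) {
        t->st[page] = PG_UNSCANNED; /* must be rescanned before it runs again */
        return 1;
    }
    return 0;
}
```

Pages that are only ever data never get scanned and never trap on writes, so the cost is paid only by pages that are both executed and modified.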
What about self-examining code? We need to protect it somehow, possibly by playing tricks with instruction/data TLB caches, or by introducing a private segment for code (%cs) that is different from the segment used for reads/writes (%ds).
The above can be slow! So sometimes you want the guest operating system to be aware that it is a guest and allow it to avoid the slow path, either with special device drivers or by changing instructions that would cause traps into memory read/write instructions. XXX how does the latter work?
We intercept all communication to the I/O devices: reads/writes to reserved memory addresses cause page faults into special handlers, which will emulate or pass through I/O as appropriate.
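A fault handler for such a reserved range might dispatch as in the sketch below. The device is a made-up "console" with a data register and a status register; the base address, register layout and all names are assumptions chosen for illustration, not those of any real device or of Disco.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CONSOLE_BASE   0xF0000000u  /* hypothetical reserved address range */
#define CONSOLE_DATA   0x0u         /* write: output one character */
#define CONSOLE_STATUS 0x4u         /* read: 1 means "ready" */
#define CONSOLE_SIZE   0x8u

/* Emulated device state kept by the VMM. */
typedef struct { char out[64]; size_t n; } vconsole;

/* Called from the page fault path. Returns 1 if the access was device I/O
   and has been emulated, 0 if it is an ordinary memory fault.
   For writes *val is the value stored; for reads *val receives the result. */
int mmio_fault(vconsole *c, uint32_t addr, int is_write, uint32_t *val)
{
    assert(c != NULL && val != NULL);
    if (addr < CONSOLE_BASE || addr - CONSOLE_BASE >= CONSOLE_SIZE)
        return 0;                               /* not in the device range */

    uint32_t off = addr - CONSOLE_BASE;
    if (is_write) {
        if (off == CONSOLE_DATA && c->n < sizeof c->out - 1) {
            c->out[c->n++] = (char)*val;        /* emulate the output */
            c->out[c->n] = '\0';
        }                                       /* other writes are ignored */
    } else {
        *val = (off == CONSOLE_STATUS) ? 1u : 0u;
    }
    return 1;
}
```

A pass-through device would forward the access to the real hardware at this point instead of updating an in-memory model.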
In a system like Disco, the sequence would look something like:
Interrupts will require some additional work:
This is more complex when in a hosted state since it involves more transitions between different modules. However, it may be easier to code: the VMM can emulate by calling write on the appropriate device and select'ing for read.
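In a hosted design the two halves of that sentence map onto two ordinary POSIX calls. The sketch below assumes the guest's virtual device is backed by some host file descriptor (a pipe, socket or tty); the function names vdev_write and vdev_poll_readable are hypothetical, while write and select are the standard calls.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Forward a guest device write to the host descriptor backing the device.
   Returns the number of bytes written, or -1 on error (as write does). */
ssize_t vdev_write(int host_fd, const void *buf, size_t len)
{
    return write(host_fd, buf, len);
}

/* Check whether the host descriptor has data for the guest.
   Returns 1 if readable, 0 on timeout, -1 on error. When this returns 1,
   the VMM would raise a virtual interrupt in the guest. */
int vdev_poll_readable(int host_fd, int timeout_ms)
{
    assert(host_fd >= 0 && host_fd < FD_SETSIZE);  /* select's limit */
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(host_fd, &rfds);
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int r = select(host_fd + 1, &rfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return r > 0 && FD_ISSET(host_fd, &rfds);
}
```

The host operating system's own drivers do the real device work, which is why this path is easier to code and also why it involves the extra transitions mentioned above.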
Disco has some I/O specific optimizations.
Disco developers clearly had access to IRIX source code.
Performance?
John Scott Robin, Cynthia E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor.
Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 Usenix Technical Conference.
Kevin Lawton, Drew Northup. Plex86 Virtual Machine.