Required reading: Mach external pager
Virtual memory is at minimum the translation of virtual addresses to physical addresses. This translation involves page tables, their entries (PTEs), the OS setting up the tables, the VM hardware using them, and the page-fault handler when the hardware cannot find a PTE for a virtual address.
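The translation step can be sketched in software. This is a minimal sketch, assuming 32-bit virtual addresses, 4 KB pages, and an x86-style two-level table. The names translate, struct page_dir, and PTE_PRESENT are hypothetical and not taken from any real kernel. A return value of -1 stands for the case where the hardware would raise a page fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12      /* assumed 4 KB pages */
#define NENTRIES    1024    /* entries per table level (assumed) */
#define PTE_PRESENT 0x1u    /* hypothetical "valid" bit in a PTE */

typedef uint32_t pte_t;

/* Hypothetical top-level table: one pointer per second-level table. */
struct page_dir {
    pte_t *tables[NENTRIES];    /* NULL means no second-level table */
};

/*
 * Translate va to a physical address.
 * Returns 0 and stores the result in *pa on success.
 * Returns -1 when no valid PTE exists (the page-fault case).
 */
int translate(const struct page_dir *pd, uint32_t va, uint32_t *pa)
{
    assert(pd != NULL && pa != NULL);

    uint32_t dir_idx = va >> 22;                     /* top 10 bits */
    uint32_t tbl_idx = (va >> PAGE_SHIFT) & 0x3FFu;  /* middle 10 bits */
    uint32_t offset  = va & 0xFFFu;                  /* low 12 bits */

    const pte_t *tbl = pd->tables[dir_idx];
    if (tbl == NULL)
        return -1;

    pte_t pte = tbl[tbl_idx];
    if ((pte & PTE_PRESENT) == 0)
        return -1;

    /* The upper 20 bits of the PTE hold the physical page frame. */
    *pa = (pte & ~0xFFFu) | offset;
    return 0;
}
```

Real hardware does this walk itself (or the OS does it on a TLB miss); the sketch only shows which bits of the address select which table entry.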
More generally, we can view virtual memory as a layer of indirection between virtual addresses and physical addresses that we can use for different purposes:
These functions are hardware-independent, but they require an interface to the translation module. The OS cannot rely on the translation module to do much: it might be just a TLB, and hardware page tables are unlikely to be able to express everything the OS wants.
To cope with the complexity of a virtual-memory system, it is useful to split the design into multiple mostly-independent modules:
Don't view VM as a thin layer just above the memory system. It's actually an important program/OS interface: it allows the OS to control what memory references refer to. Most of the implementation is in the OS, not in hardware. The OS uses this flexible control to improve performance, and applications can do the same.
Mach is a microkernel operating system developed at CMU in the 1980s. At the time, there was a lot of activity around Mach. One of its lasting impacts is its virtual memory design, which has been adopted by 4.4BSD and OS X.
The kernel implementation picture is as follows (in 4.4BSD terminology):
Pagers (or data managers) provide the data for objects. Example data managers: file system server and network shared memory.
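To make the pager role concrete, here is a minimal sketch of a pager as a pair of callbacks the VM system could invoke to fetch and write back page-sized chunks of an object. It is not Mach's actual interface, which is message-based. struct pager, page_in, page_out, and the in-memory "file" backing it are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* Hypothetical pager interface: supplies and accepts data for one object. */
struct pager {
    int (*page_in)(struct pager *self, size_t offset, unsigned char *buf);
    int (*page_out)(struct pager *self, size_t offset, const unsigned char *buf);
    void *state;        /* pager-private data */
};

/* Toy backing store: a byte array standing in for a file. */
struct mem_file {
    unsigned char *data;
    size_t size;
};

/* Fill buf with the page at offset, zero-padding past end of file. */
static int mem_page_in(struct pager *self, size_t offset, unsigned char *buf)
{
    struct mem_file *f = self->state;
    assert(buf != NULL);
    if (offset % PAGE_SIZE != 0 || offset >= f->size)
        return -1;
    size_t n = f->size - offset;
    if (n > PAGE_SIZE)
        n = PAGE_SIZE;
    memcpy(buf, f->data + offset, n);
    memset(buf + n, 0, PAGE_SIZE - n);
    return 0;
}

/* Write the page in buf back to the backing store at offset. */
static int mem_page_out(struct pager *self, size_t offset, const unsigned char *buf)
{
    struct mem_file *f = self->state;
    assert(buf != NULL);
    if (offset % PAGE_SIZE != 0 || offset >= f->size)
        return -1;
    size_t n = f->size - offset;
    if (n > PAGE_SIZE)
        n = PAGE_SIZE;
    memcpy(f->data + offset, buf, n);
    return 0;
}

/* Build a pager backed by the given in-memory file. */
struct pager mem_pager(struct mem_file *f)
{
    struct pager p = { mem_page_in, mem_page_out, f };
    return p;
}
```

A file system server or a network shared memory server would sit behind the same two operations; only the way they obtain the bytes differs.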
Example: fork
Points:
At a page fault:
/*
 * O/S page fault handling code, from 4.4BSD, from Mach.
 *
 * map is the process' list of vm_map_entries.
 * addr is the virtual address that caused the fault.
 * fault_type is read or write.
 */
vm_fault(map, addr, fault_type)
{
  /* find addr in the vm_map_entry list */
  for(m = map; m != NULL; m = m->next){
    if(addr >= m->start && addr < m->end){
      object = m->object;
      offset = m->offset;
      protection = m->protection;
      break;
    }
  }

  /* signal a fault to the user? */
  if(m == NULL){
    raise an unmapped page fault signal;
    return;
  }
  if((fault_type == WRITE && (protection & WRITE) == 0) ||
     (fault_type == READ && (protection & READ) == 0)){
    raise a protection fault signal;
    return;
  }

  first_object = object;

  /* walk down the chain of shadow objects */
  while(1) {
    page = find cached physical page at offset in object;
    if(page found)
      break;
    if(object has file/disk storage) {
      /* might be a file or a shadow object with disk backing store */
      page = read page from disk or file at offset;
      if(page found)
        break;
    }
    if(object->next)
      object = object->next;
    else
      break;
  }

  if(page == NULL){
    page = allocate a physical page of memory;
    add page to first_object;
    zero-fill page;
  } else if(object != first_object){
    if(fault_type == WRITE){
      /* copy-on-write */
      page = copy(page);
      add page to first_object;
    } else {
      /* prepare for future copy-on-write */
      change page protection to read-only;
    }
  }

  /* update VM hardware */
  pmap_enter(addr, physical address of page, protection);
}

/*
 * Interface to machine-dependent "pmap" VM layer.
 * Manages h/w PTE arrays and/or TLB entries.
 * Operates on the current process's map.
 * Not intended to keep much state beyond h/w PTE tables or TLB.
 */
pmap_enter(vaddr, paddr, protection)
{
  /*
   * vaddr is a virtual address (in the current process).
   * paddr is the physical address of a real page in real memory.
   *
   * 1. Enter new mapping into current PTE array, or TLB.
   *    Or change existing mapping.
   * 2. On machines with virtually-indexed caches, see if paddr
   *    is mapped anywhere else, and delete all other mappings.
   * 3. Flush stale entries from virtually-indexed caches.
   * 4. Flush stale entry from TLB.
   *
   * This may require allocating physical memory for a new PTE entry.
   */
}

pmap_remove(vaddr)
{
  /* delete a mapping */
}

pmap_protect(vaddr, mode)
{
  /* change a mapping's protection mode */
}

pmap_page_protect(paddr, mode)
{
  /* change protection of all mappings that point to paddr */
}

pmap_is_dirty(vaddr)
{
  /* has h/w marked the PTE as dirty? */
}

pmap_clear_dirty(vaddr)
{
  /* clear dirty bit */
}
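The shadow-chain walk in vm_fault can be exercised in user space with a toy model. The following is a rough sketch under several assumptions: each object caches at most NPAGES pages in a fixed array, there is no disk or file backing store, the map entry permits both reads and writes, and the pmap layer is replaced by a returned "writable" flag. resolve_fault, struct fault_result, and the field names are hypothetical. Only the three outcomes (zero-fill, copy-on-write, read-only sharing) follow the pseudocode above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed page size */
#define NPAGES    8     /* pages per object in this toy model */

enum fault_type { FAULT_READ, FAULT_WRITE };

struct vm_object {
    unsigned char *pages[NPAGES];   /* cached pages, NULL if absent */
    struct vm_object *next;         /* object this one shadows, or NULL */
};

/* What the fault handler would hand to pmap_enter in a real kernel. */
struct fault_result {
    unsigned char *page;    /* page to map */
    int writable;           /* 0 means map read-only */
};

/* Resolve a fault on page `index`; returns 0 on success, -1 if out of memory. */
int resolve_fault(struct vm_object *first, size_t index,
                  enum fault_type type, struct fault_result *out)
{
    assert(first != NULL && out != NULL && index < NPAGES);

    /* Walk down the chain of shadow objects looking for the page. */
    struct vm_object *obj = first;
    unsigned char *page = NULL;
    for (; obj != NULL; obj = obj->next) {
        if (obj->pages[index] != NULL) {
            page = obj->pages[index];
            break;
        }
    }

    int writable = 1;
    if (page == NULL) {
        /* Nobody has the page: allocate a zero-filled one in the first object. */
        page = calloc(1, PAGE_SIZE);
        if (page == NULL)
            return -1;
        first->pages[index] = page;
    } else if (obj != first) {
        if (type == FAULT_WRITE) {
            /* Copy-on-write: give the first object a private copy. */
            unsigned char *copy = malloc(PAGE_SIZE);
            if (copy == NULL)
                return -1;
            memcpy(copy, page, PAGE_SIZE);
            first->pages[index] = copy;
            page = copy;
        } else {
            /* Share the lower object's page read-only for now. */
            writable = 0;
        }
    }

    out->page = page;
    out->writable = writable;
    return 0;
}
```

With a child object shadowing a parent, a read fault shares the parent's page read-only, a later write fault gives the child its own copy, and a fault on a page neither object holds produces a fresh zero-filled page in the child.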