Virtual Memory in Linux
Aaron Hall, 11 December 2002
General Virtual Memory
Virtual memory is dependent on certain key hardware mechanisms, including:
- dynamic runtime address translation
- noncontiguous pages and/or segments
The main benefits of virtual memory are the ability to keep more processes loaded at any time, and the ability to execute
a process larger than main memory. It's effective because of the principle of locality: references between and within
programs and data tend to cluster.
Virtual memory can use pages, segments, or both. Pages are fixed-size and typically smaller. Their use can lead to internal
fragmentation but avoids external fragmentation. Segments are variable-sized, typically larger, and have no internal
fragmentation but are prone to external fragmentation. There is typically one page or segment table for each process,
which can be indexed by one or more directories for efficiency. If pages and segments are combined, each segment will
consist of multiple pages.
To mitigate the inefficiency of multilevel tables, many processors have a translation lookaside buffer (TLB) that
caches table entries in associative memory. Its success relies on the same principle of locality of reference mentioned
before. Because it is fast but expensive, the TLB is usually fairly small (on the order of 64-256 entries).
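The TLB's behavior can be pictured as a small associative cache of page-to-frame entries with least-recently-used eviction. The sketch below is an illustration of that idea, not the behavior of any particular processor; the 64-entry default follows the range given above, and the tiny capacity in the example is chosen only to force an eviction.

```python
from collections import OrderedDict

# Sketch: a fully associative TLB modeled as an LRU cache of
# page-number -> frame-number entries. A simplification for
# illustration, not real hardware behavior.

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()        # page number -> frame number

    def lookup(self, page):
        """Return the cached frame, or None on a TLB miss."""
        if page in self.entries:
            self.entries.move_to_end(page)  # refresh LRU position
            return self.entries[page]
        return None

    def insert(self, page, frame):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[page] = frame

tlb = TLB(capacity=2)
tlb.insert(0x10, 0x3)
tlb.insert(0x20, 0x7)
print(tlb.lookup(0x10))   # hit: 3
tlb.insert(0x30, 0x9)     # evicts 0x20, the least recently used
print(tlb.lookup(0x20))   # miss: None
```

A miss in a real system triggers the multilevel table walk; here it simply returns None.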
Key Implementation Issues
Linux uses a three-level page table structure with a page directory, page middle directory, and page table.
Because the x86 architecture only supports two levels, on those platforms the middle directory is given
a size of one. Virtual addresses on that architecture are 32 bits: a 10-bit directory index, a 10-bit table index,
and 12-bit offset. On the PowerPC architecture an inverted page table is used, in which the page number is used
in a hash table to find a pointer to the corresponding entry in the page table. In Linux the top 1GB of each 4GB
virtual address space maps kernel data; the rest is user space. There has been some exploration of a variable page size
to make better use of the scarce TLB (Irix uses this), but for now page size is a compile-time constant.
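The 10/10/12 split on x86 can be made concrete with a few lines of bit manipulation. This is only an illustration of the field widths given above, not kernel code; the example address is arbitrary (chosen to fall in the kernel's top 1GB, i.e. at or above 0xC0000000).

```python
# Sketch: decomposing a 32-bit x86 virtual address into the
# 10-bit directory index, 10-bit table index, and 12-bit offset
# described above.

def split_vaddr(vaddr):
    """Return (directory_index, table_index, offset) for a 32-bit address."""
    offset = vaddr & 0xFFF                 # low 12 bits
    table = (vaddr >> 12) & 0x3FF          # next 10 bits
    directory = (vaddr >> 22) & 0x3FF      # top 10 bits
    return directory, table, offset

# Example: 0xC0ABC123 lies in the kernel's top 1GB.
d, t, o = split_vaddr(0xC0ABC123)
print(d, t, o)  # 770 700 291
```

With 10-bit indices, each directory and each table holds 1024 entries, and the 12-bit offset gives the 4KB page size.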
Two primary allocation algorithms are used in Linux: buddy and slab.
Buddy Allocation Algorithm
Under the buddy algorithm, each memory zone has a list of blocks of each power-of-two size, the blocks being arranged
in a binary-tree fashion, recorded in the kernel's free_area lists. When the
allocation routine is called the free block list for
the closest corresponding size is searched. If a block is found it is allocated; if not, the next-higher list is searched.
If a free block is found there it is split in half, with one half allocated and the other moved to the next-lower list.
When the deallocation routine is called the block of pages
is deallocated. In addition, if that block's buddy is also free, they are merged. In this way the buddy algorithm's
tendency toward fragmentation is balanced somewhat.
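The split-and-merge behavior described above can be sketched in a few dozen lines. This is a minimal illustration of the algorithm, not the kernel's free_area bookkeeping; sizes and addresses are page counts and page numbers, and MAX_ORDER here is arbitrarily small.

```python
# Sketch: a power-of-two buddy allocator over a block of
# 2**MAX_ORDER pages, mirroring the split/merge steps above.

MAX_ORDER = 4  # manage 16 pages total (illustrative)

class Buddy:
    def __init__(self):
        # free_lists[k] holds start addresses of free blocks of 2**k pages
        self.free_lists = {k: [] for k in range(MAX_ORDER + 1)}
        self.free_lists[MAX_ORDER].append(0)

    def alloc(self, order):
        """Allocate a block of 2**order pages; return its start address."""
        for k in range(order, MAX_ORDER + 1):
            if self.free_lists[k]:
                addr = self.free_lists[k].pop()
                # Split repeatedly, keeping the lower half and freeing the upper.
                while k > order:
                    k -= 1
                    self.free_lists[k].append(addr + (1 << k))
                return addr
        raise MemoryError("no block large enough")

    def free(self, addr, order):
        """Free a block, merging with its buddy while possible."""
        while order < MAX_ORDER:
            buddy = addr ^ (1 << order)  # buddy address differs in one bit
            if buddy in self.free_lists[order]:
                self.free_lists[order].remove(buddy)
                addr = min(addr, buddy)
                order += 1
            else:
                break
        self.free_lists[order].append(addr)

b = Buddy()
a = b.alloc(2)   # 4 pages: splits the single 16-page block twice
print(a)         # 0
b.free(a, 2)     # merges all the way back to one 16-page block
print(b.free_lists[MAX_ORDER])  # [0]
```

The XOR trick works because a block and its buddy differ only in the bit corresponding to the block size, which is what makes merging cheap.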
Slab Allocation Algorithm
The kernel works with real (physical) rather than virtual memory addresses. On architectures like the x86 this is
accomplished by mapping the kernel's region of the address space directly onto physical memory, so a kernel address
corresponds to a real address. To allocate memory for itself the kernel uses the slab algorithm, which maintains
caches of several small, volatile data structures
such as dentries or inodes that the kernel routinely uses. Each cache has several
slabs in three lists: fully used, partial, and fully free.
A third type of allocation the kernel uses is based on a set of
size caches. These caches are used for rapidly allocating small
buffers with the kmalloc function, which has an associated kfree function for releasing them.
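The three-list structure of a slab cache can be sketched as follows. This is an illustration of the bookkeeping only: object construction, alignment, and per-slab freelists are all omitted, and SLAB_CAPACITY is an arbitrary number chosen for the example, not a kernel constant.

```python
# Sketch: a slab cache keeping its slabs on the three lists named
# above (fully used, partial, fully free).

SLAB_CAPACITY = 4  # objects per slab, illustrative only

class Slab:
    def __init__(self):
        self.used = 0

class SlabCache:
    def __init__(self):
        self.full, self.partial, self.free = [], [], []

    def alloc(self):
        """Take an object from a partial slab, a free slab, or a new one."""
        if self.partial:
            slab = self.partial[0]
        elif self.free:
            slab = self.free.pop()
            self.partial.append(slab)
        else:
            slab = Slab()               # grow the cache by one slab
            self.partial.append(slab)
        slab.used += 1
        if slab.used == SLAB_CAPACITY:  # slab became fully used
            self.partial.remove(slab)
            self.full.append(slab)
        return slab

cache = SlabCache()
for _ in range(5):                       # 5 allocations fill one slab
    cache.alloc()                        # and start a second
print(len(cache.full), len(cache.partial))  # 1 1
```

Keeping partial slabs at the front of the search order is what lets the cache hand out an object without touching the allocator proper on the common path.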
Placement is the problem of deciding where in real memory to place a piece of virtual memory when it is loaded or
swapped in. This is significant in a purely segmented system, but since Linux uses paging as well, memory can be accessed
equally well no matter where it's put, so placement is essentially a nonissue. However, the recent development of systems
with NUMA architecture has made placement an important consideration again. This is an area of active development in
Linux, complicated because there are at least nine different NUMA architectures in use, with little standardization of
behavior. Under the current design each process, virtual memory area, and the kernel itself have a cpumemmap, which
identifies which resources may and may not be used by that entity. Currently the kernel itself doesn't know the actual
topology of the system, expecting that the cpumemmap will be created externally with the topology in mind. A kernel-level
understanding of topology is one of many improvements planned.
Replacement is the problem of deciding what to swap out of real memory and when. Some common algorithms include:
- FIFO: the oldest page swapped in is the one swapped out. This is usually not the best page to swap.
- LRU: the least recently accessed page is the one swapped out. This tends to be expensive to track exactly.
- Clock: a circular list of pages is maintained, and a "hand" sweeps around the list, clearing the use bit in
each page's data structure. Every time a page is accessed its use bit, if not set, is set by the processor. If the
"hand" finds a page that doesn't have its use bit set, that page is swapped out. This can give near-LRU behavior at
a cost more like FIFO.
Linux uses a variation of the clock algorithm: the kernel thread
kswapd runs periodically (or is invoked if the
"memory pressure" is found to be too high). If the number of free pages is less than free_pages_high, it will attempt
to free 3 pages; if it's less than free_pages_low, it will attempt to free 6 pages and will sleep for only half the usual
interval before running again. When it runs it scans a portion of mem_map, a list of all
pages in memory. Each page has an age byte, initialized
to 3. Every time the page is accessed its age is incremented (to a maximum of 20); every time kswapd scans it the age is
decremented. If the age of a page is 0 it's considered swappable.
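The age-byte scheme described above can be simulated in a few lines. This is an illustration of the aging rule only (start at 3, increment on access up to 20, decrement on each kswapd scan, swappable at 0); the class and its names are invented for the sketch, and real kernels tuned these constants differently.

```python
# Sketch: page aging as described above. A frequently touched
# page keeps a positive age; an untouched page decays to 0
# and becomes a candidate for swapping.

AGE_START, AGE_MAX = 3, 20

class Page:
    def __init__(self):
        self.age = AGE_START

    def touch(self):
        self.age = min(self.age + 1, AGE_MAX)  # accessed: age grows

    def scan(self):
        self.age = max(self.age - 1, 0)        # kswapd pass: age decays

    @property
    def swappable(self):
        return self.age == 0

hot, cold = Page(), Page()
for _ in range(5):
    hot.touch()    # referenced between every scan
    hot.scan()
    cold.scan()    # never referenced after being loaded
print(hot.swappable, cold.swappable)  # False True
```

The interplay of increments and decrements is what approximates LRU: a page's age is a rough record of how recently and how often it has been referenced.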
When kswapd is called on to free pages, it first attempts to clear
slab caches, then it tries to free pages from the page cache. If that fails it shrinks the filesystem caches, and finally
it kills a process as a last resort.
As an added complication, RISC processors generally don't support a reference bit in hardware, so this behavior must be
simulated in software.
Fetch Policy is the problem of deciding when pages should be loaded or swapped into physical memory. The simplest
policy is demand paging, where pages are brought in when they are referenced. By the principle of locality, the
initially high rate of page faults should settle down over time as the most-used pages stay in RAM. A more sophisticated
policy is prepaging, where multiple pages are loaded at once on the idea that if one page faults, its neighbors may be
needed next. Prepaging can be done when a process is created, or on every fault. The usefulness of prepaging is unclear.
Linux's fetch policy is essentially demand paging, but the
page cache reads one page ahead. Each process has an
mm_struct, which stores information about the
process's address space as a whole. It contains pointers to several
vm_area_structs, stored in an AVL tree. Each vm_area_struct
stores information about one such area, including its access restrictions and
operations for opening,
closing, handling no-page exceptions and such. On a page fault, the AVL tree is searched. If no corresponding area is
found a SIGSEGV is raised.
Otherwise, the area's permissions are checked. If the memory access is consistent with the permissions for that area,
control is passed to different functions based on the nature of the fault. Control typically passes through
handle_mm_fault and from there to
do_no_page (for loading a page from storage),
do_swap_page (for swapping in a page), or
do_wp_page (for writes to a shared page or a page marked for copy-on-write).
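The dispatch logic above can be summarized as a small decision function. This is a simplification for illustration: the vm_area_struct search is reduced to a linear scan over a list, the function names merely echo the kernel's, and the condition checks are far coarser than the real fault handler's.

```python
# Sketch: page-fault dispatch as described above. Returns the
# name of the action taken, standing in for the kernel routines.

class VMArea:
    def __init__(self, start, end, writable):
        self.start, self.end, self.writable = start, end, writable

def find_area(areas, addr):
    """Stand-in for the tree search over vm_area_structs."""
    for a in areas:
        if a.start <= addr < a.end:
            return a
    return None

def handle_fault(areas, addr, is_write, page_present, page_in_swap):
    area = find_area(areas, addr)
    if area is None:
        return "SIGSEGV"            # no area covers this address
    if is_write and not area.writable:
        return "SIGSEGV"            # access violates area permissions
    if is_write and page_present:
        return "do_wp_page"         # write to a write-protected present page
    if page_in_swap:
        return "do_swap_page"       # bring the page back from swap
    return "do_no_page"             # load the page from storage

areas = [VMArea(0x1000, 0x2000, writable=True)]
print(handle_fault(areas, 0x1500, False, False, False))  # do_no_page
print(handle_fault(areas, 0x9000, False, False, False))  # SIGSEGV
```

Note that a write to a present but write-protected page is not necessarily an error: that is exactly the copy-on-write case that do_wp_page resolves.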
Working Set Management
The working set is the set of pages used most often by a process, a reflection of the principle of locality. A simple policy
for managing the number of frames made available to a process is the fixed-allocation policy, in which the number of
frames is constant. A variable-allocation policy might change the number of frames based on the frequency of page faults,
or on a measure of the working set over some variable interval. Linux uses no explicit working set management policy,
relying instead on the global page-aging and replacement mechanism described above.
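The page-fault-frequency idea mentioned above can be sketched as a simple feedback rule: grow a process's allocation when it faults too often, shrink it when it faults rarely. The thresholds and step sizes below are arbitrary illustrations, not values from any real system.

```python
# Sketch: variable allocation driven by page-fault frequency.
# Hypothetical thresholds, expressed as faults per reference.

HIGH, LOW = 0.5, 0.1

def adjust_frames(frames, faults, references):
    """Return a new frame allocation based on the observed fault rate."""
    rate = faults / references
    if rate > HIGH:
        return frames + 1              # thrashing: grant another frame
    if rate < LOW:
        return max(frames - 1, 1)      # headroom: reclaim a frame
    return frames                      # rate acceptable: leave it alone

print(adjust_frames(8, 60, 100))  # 9
print(adjust_frames(8, 5, 100))   # 7
print(adjust_frames(8, 30, 100))  # 8
```

The attraction of this policy is that it measures the working set indirectly, through its effect (faults), rather than trying to track page references directly.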
Windows NT (2000, XP) shares many similarities with Linux in its implementation of virtual memory. It also uses a 32-bit
linear address space, but divided into 2GB system and 2GB application areas. It uses a two-level page table. Paging is
nominally demand-driven; however, on each page fault a cluster of nearby pages is loaded. Copy-on-write optimization is used as in
Linux. Instead of a clock algorithm, Windows uses an LRU replacement algorithm on uniprocessor x86 architectures, and a
FIFO algorithm on multiprocessor x86 and Alpha architectures. It uses a page-fault frequency algorithm to manage the
working set of its processes.
The problem is to modify the memory manager so that it collects statistics on the rate of page faults of one process or
of the system as a whole, to add a system call that initializes the statistics and reports the fault frequency, and to
write a user program that exercises that call.