Virtual Memory in Linux

Aaron Hall, 11 December 2002


General Virtual Memory


Virtual memory depends on certain key hardware mechanisms, including hardware translation of virtual addresses to physical addresses through page and/or segment tables, a translation lookaside buffer to cache recent translations, and the ability to raise a fault on a reference to a page not in memory and to resume the faulting instruction once the page has been brought in.

The main benefits of virtual memory are the ability to keep more processes loaded at any one time and the ability to execute a process larger than main memory. It is effective because of the principle of locality: references between and within programs and data tend to cluster, so over any short interval a process touches only a small fraction of its pages.

Virtual memory can use pages, segments, or both. Pages are fixed-size and typically smaller; their use can lead to internal fragmentation but avoids external fragmentation. Segments are variable-sized, typically larger, and have no internal fragmentation but are prone to external fragmentation. There is typically one page or segment table per process, which can be indexed by one or more directories for efficiency. If pages and segments are combined, each segment consists of multiple pages.

To mitigate the inefficiency of walking multilevel tables, many processors have a translation lookaside buffer (TLB) that caches table entries in associative memory. Its success relies on the same principle of locality of reference mentioned above. Because associative memory is fast but expensive, the TLB is usually fairly small (on the order of 64-256 entries).

Key Implementation Issues

Page Tables



Linux uses a three-level page table structure with a page directory, page middle directory, and page table. Because the x86 architecture only supports two levels, on those platforms the middle directory is given a size of one. Virtual addresses on that architecture are 32 bits: a 10-bit directory index, a 10-bit table index, and a 12-bit offset. On the PowerPC architecture an inverted page table is used, in which the page number is hashed to find a pointer to the corresponding entry in the page table. In Linux the top 1GB of each 4GB virtual address space maps kernel data; the rest is user space. There has been some exploration of variable page sizes to make better use of the scarce TLB (Irix does this), but for now the page size is a compile-time constant.
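
As a concrete illustration, the sketch below splits a 32-bit x86 virtual address into those three fields. The example addresses are only illustrative; this is not kernel code.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: split a 32-bit x86 virtual address into the three
     * fields used by the two-level translation described above: a 10-bit
     * directory index, a 10-bit table index, and a 12-bit offset. */
    static void decode_vaddr(uint32_t vaddr)
    {
        unsigned dir    = (vaddr >> 22) & 0x3ff;  /* bits 31..22 */
        unsigned table  = (vaddr >> 12) & 0x3ff;  /* bits 21..12 */
        unsigned offset =  vaddr        & 0xfff;  /* bits 11..0  */

        printf("vaddr 0x%08x -> dir %u, table %u, offset 0x%03x\n",
               (unsigned)vaddr, dir, table, offset);
    }

    int main(void)
    {
        decode_vaddr(0xc0100123u);  /* falls in the top 1GB: kernel space */
        decode_vaddr(0x08048000u);  /* a typical user-space text address */
        return 0;
    }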

Allocation


Two primary allocation algorithms are used in Linux: buddy and slab.

Buddy Allocation Algorithm


Under the buddy algorithm, each memory zone has a list of free blocks of each power-of-two size, tracked in the zone's free_area array and arranged in a binary-tree fashion. When the allocation routine is called, the free list for the closest corresponding size is searched. If a block is found it is allocated; if not, the next-higher list is searched. If a free block is found there, it is split in half, with one half allocated and the other moved to the next-lower list. When the deallocation routine is called, the block of pages is returned to the appropriate free list; in addition, if that block's buddy (the adjacent block of the same size) is also free, the two are merged into a larger block. In this way the buddy algorithm's tendency toward fragmentation is balanced somewhat.
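
Below is a minimal user-space sketch of this split-and-merge behavior. The data structures (a simple linked free list per order, MAX_ORDER of 4, page frame numbers only) are simplifications chosen for the example, not the kernel's actual free_area implementation.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_ORDER 4               /* block sizes: 1, 2, 4, 8, 16 pages */

    struct block { unsigned long pfn; struct block *next; };
    static struct block *free_list[MAX_ORDER + 1];

    static void push(int order, unsigned long pfn)
    {
        struct block *b = malloc(sizeof(*b));
        b->pfn = pfn; b->next = free_list[order]; free_list[order] = b;
    }

    /* Remove a block with the given PFN from a free list; return 1 if found. */
    static int pop_pfn(int order, unsigned long pfn)
    {
        struct block **p = &free_list[order];
        for (; *p; p = &(*p)->next)
            if ((*p)->pfn == pfn) {
                struct block *victim = *p;
                *p = victim->next; free(victim); return 1;
            }
        return 0;
    }

    /* Allocate a block of 2^order pages, splitting larger blocks as needed. */
    static long buddy_alloc(int order)
    {
        for (int o = order; o <= MAX_ORDER; o++) {
            if (!free_list[o]) continue;
            unsigned long pfn = free_list[o]->pfn;
            pop_pfn(o, pfn);
            while (o > order) {            /* split, keeping the lower half */
                o--;
                push(o, pfn + (1UL << o)); /* upper half goes back on a list */
            }
            return (long)pfn;
        }
        return -1;                         /* out of memory */
    }

    /* Free a block, merging with its buddy whenever the buddy is also free. */
    static void buddy_free(unsigned long pfn, int order)
    {
        while (order < MAX_ORDER) {
            unsigned long buddy = pfn ^ (1UL << order);
            if (!pop_pfn(order, buddy))
                break;                     /* buddy busy: stop merging */
            pfn &= ~(1UL << order);        /* merged block starts at lower PFN */
            order++;
        }
        push(order, pfn);
    }

    int main(void)
    {
        push(MAX_ORDER, 0);                /* one 16-page block to start */
        long a = buddy_alloc(0), b = buddy_alloc(2);
        printf("alloc 1 page at pfn %ld, 4 pages at pfn %ld\n", a, b);
        buddy_free((unsigned long)a, 0);
        buddy_free((unsigned long)b, 2);   /* everything coalesces back */
        return 0;
    }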

Slab Allocation Algorithm


The kernel allocates its own memory in terms of real (physical) rather than pageable virtual memory; on the x86 this is done through the kernel region of the address space, which maps directly onto physical memory, so kernel allocations are never swapped out. To allocate memory for its own data structures the kernel uses the slab algorithm, which maintains a cache for each kind of small, frequently allocated and freed object the kernel routinely uses, such as dentries or inodes. Each cache holds several slabs, kept on three lists: fully used, partially used, and fully free.
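
As a rough illustration of how a slab cache is used, the following kernel-module sketch creates a cache for a made-up structure and allocates from it. The five-argument kmem_cache_create shown here is the form in current kernels (the 2.4-era interface also took constructor/destructor arguments), so treat the exact signature as an assumption rather than the interface of any particular version.

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    /* "my_object" is a made-up example type, not a kernel structure. */
    struct my_object {
        int id;
        char name[32];
    };

    static struct kmem_cache *my_cache;

    static int __init slab_demo_init(void)
    {
        struct my_object *obj;

        /* One cache per object type; the kernel packs many objects per slab. */
        my_cache = kmem_cache_create("my_object_cache",
                                     sizeof(struct my_object),
                                     0, SLAB_HWCACHE_ALIGN, NULL);
        if (!my_cache)
            return -ENOMEM;

        obj = kmem_cache_alloc(my_cache, GFP_KERNEL); /* taken from a partial slab */
        if (obj) {
            obj->id = 1;
            kmem_cache_free(my_cache, obj);           /* object returns to its slab */
        }
        return 0;
    }

    static void __exit slab_demo_exit(void)
    {
        kmem_cache_destroy(my_cache);
    }

    module_init(slab_demo_init);
    module_exit(slab_demo_exit);
    MODULE_LICENSE("GPL");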

Size Caches

A third allocation mechanism the kernel uses is a set of size caches, one for each of a range of small buffer sizes. They back the kmalloc function, which rapidly allocates small buffers and is paired with kfree.
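
A sketch of typical kmalloc/kfree usage follows; copy_to_small_buffer is a made-up helper written for illustration, not a kernel function.

    #include <linux/slab.h>
    #include <linux/string.h>
    #include <linux/types.h>

    /* Sketch: kmalloc rounds the request up to the nearest size cache
     * and hands back an object from it. */
    static char *copy_to_small_buffer(const char *src, size_t len)
    {
        char *buf = kmalloc(len + 1, GFP_KERNEL);  /* served from a size cache */
        if (!buf)
            return NULL;
        memcpy(buf, src, len);
        buf[len] = '\0';
        return buf;                                /* caller releases with kfree(buf) */
    }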

Placement


Placement is the problem of deciding where in real memory to put a piece of virtual memory when it is loaded or swapped in. This is significant in a purely segmented system, but because Linux uses paging, memory can be accessed equally well no matter where it is placed, so placement is essentially a nonissue. However, the recent development of systems with NUMA architectures has made placement an important consideration again. This is an area of active development in Linux, complicated by the fact that there are at least nine different NUMA architectures in use, with little standardization of behavior. Under the current design each process, each virtual memory area, and the kernel itself has a cpumemmap, which identifies which resources may and may not be used by that entity. Currently the kernel itself doesn't know the actual topology of the system; it expects the cpumemmap to be created externally with the topology in mind. A kernel-level understanding of topology is one of many planned improvements.

Replacement


Replacement is the problem of deciding what to swap out of real memory and when. Some common algorithms include optimal, least recently used (LRU), first-in first-out (FIFO), and clock. Linux uses a variation of the clock algorithm: the kernel thread kswapd is run periodically (or is invoked when the "memory pressure" is found to be too high). If the number of free pages is less than free_pages_high, it attempts to free 3 pages; if it is less than free_pages_low, it attempts to free 6 pages and sleeps for only half the usual time before running again. When it runs it scans a portion of mem_map, a list of all pages in memory. Each page has an age byte, initialized to 3. Every time the page is accessed its age is incremented (to a maximum of 20); every time kswapd scans it, the age is decremented. When a page's age reaches 0, it is considered swappable.

When kswapd is called on to free pages, it first attempts to clear slab caches, then it tries to free pages from the page cache. If that fails it shrinks the filesystem caches, and finally it kills a process as a last resort.
As an added complication, RISC processors generally don't support a reference bit, so this behavior must be simulated in software.
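
The following user-space sketch mimics the age-byte behavior and the software simulation of a reference bit. The structures and the point at which the age is incremented are simplifications for illustration, not the kernel's actual mem_map handling; the constants (start at 3, cap at 20, swappable at 0) come from the description above.

    #include <stdio.h>

    #define NPAGES    8
    #define AGE_START 3
    #define AGE_MAX   20

    struct page { int age; int referenced; };
    static struct page mem_map[NPAGES];

    /* Called on each access to a page (the "reference bit" set in software). */
    static void touch_page(int i)
    {
        mem_map[i].referenced = 1;
    }

    /* One pass of a kswapd-style scan: age pages and report victims. */
    static void scan_pages(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            struct page *p = &mem_map[i];
            if (p->referenced) {
                if (p->age < AGE_MAX)
                    p->age++;           /* recently used: harder to evict */
                p->referenced = 0;      /* clear the simulated reference bit */
            } else if (p->age > 0) {
                p->age--;               /* not used since the last scan */
            }
            if (p->age == 0)
                printf("page %d is swappable\n", i);
        }
    }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i++)
            mem_map[i].age = AGE_START;
        touch_page(0);                  /* page 0 stays "young" */
        for (int pass = 0; pass < 4; pass++)
            scan_pages();
        return 0;
    }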

Fetch Policy


Fetch policy is the problem of deciding when pages should be loaded or swapped into physical memory. The simplest policy is demand paging, where pages are brought in only when they are referenced. By the principle of locality, the initially high rate of page faults should settle down over time as the most-used pages stay in RAM. A more sophisticated policy is prepaging, where multiple pages are loaded at once on the theory that if one page faults, its neighbor may be needed next. Prepaging can be done when a process is created or on every fault. The usefulness of prepaging is unclear.
Linux's fetch policy is essentially demand paging, but the page cache reads one page ahead. Each process has an mm_struct, which stores information about the process's address space as a whole. It contains pointers to several vm_area_structs, stored in an AVL tree. Each vm_area_struct stores information about its area, including access restrictions and operations for opening, closing, handling no-page exceptions, and so on. On a page fault, the AVL tree is searched. If no corresponding area is found, a SIGSEGV is raised. Otherwise, the area's permissions are checked. If the memory access is consistent with the permissions for that area, control is passed to different functions based on the nature of the fault. Control typically passes through handle_mm_fault and from there to do_no_page (for loading a page from storage), do_swap_page (for swapping in a page), or do_wp_page (for writes to a shared page or a page marked for copy-on-write).
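
The user-space sketch below mirrors that dispatch logic. The types and handler functions are stand-ins named after their kernel counterparts; they are not the kernel's actual code.

    #include <stdio.h>
    #include <stdbool.h>

    struct vm_area {               /* stand-in for vm_area_struct */
        unsigned long start, end;
        bool may_write;
        bool copy_on_write;
    };

    enum fault_kind { FAULT_NOT_PRESENT, FAULT_SWAPPED_OUT, FAULT_WRITE_PROTECT };

    static int do_no_page(void)   { puts("load page from backing store"); return 0; }
    static int do_swap_page(void) { puts("swap page back in");            return 0; }
    static int do_wp_page(void)   { puts("copy-on-write / shared write"); return 0; }

    /* Rough analogue of handle_mm_fault: pick a handler based on the fault. */
    static int handle_fault(const struct vm_area *vma, unsigned long addr,
                            bool is_write, enum fault_kind kind)
    {
        if (!vma || addr < vma->start || addr >= vma->end) {
            puts("no matching area: raise SIGSEGV");
            return -1;
        }
        if (is_write && !vma->may_write && !vma->copy_on_write) {
            puts("protection violation: raise SIGSEGV");
            return -1;
        }
        switch (kind) {
        case FAULT_NOT_PRESENT:   return do_no_page();
        case FAULT_SWAPPED_OUT:   return do_swap_page();
        case FAULT_WRITE_PROTECT: return do_wp_page();
        }
        return -1;
    }

    int main(void)
    {
        struct vm_area text = { 0x08048000, 0x08050000, false, false };
        handle_fault(&text, 0x08049123, false, FAULT_NOT_PRESENT);
        handle_fault(&text, 0x08049123, true,  FAULT_NOT_PRESENT); /* write to read-only text */
        return 0;
    }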

Working Set Management


The working set is the set of pages a process is using most heavily at a given time, a consequence of the principle of locality. A simple policy for managing the number of frames made available to a process is a fixed-allocation policy, in which the number of frames is constant. A variable-allocation policy might change the number of frames based on the frequency of page faults, or on a measure of the working set over some interval, as in the sketch below. Linux uses no working set management policy other than first-come, first-served.
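
For comparison, a page-fault-frequency style variable-allocation policy might look like the following sketch. The thresholds and interface are illustrative assumptions, not taken from any particular system.

    #include <stdio.h>

    #define PFF_LOW  0.02   /* faults per reference: below this, shrink */
    #define PFF_HIGH 0.10   /* above this, grow the allocation */

    /* Decide how many frames a process should have based on its fault rate. */
    static int adjust_frames(int frames, long faults, long references)
    {
        double rate = (double)faults / (double)references;

        if (rate > PFF_HIGH)
            return frames + 1;          /* faulting too often: give it a frame */
        if (rate < PFF_LOW && frames > 1)
            return frames - 1;          /* plenty of headroom: take one back */
        return frames;
    }

    int main(void)
    {
        int frames = 8;
        frames = adjust_frames(frames, 120, 1000);  /* 12% fault rate -> grow */
        frames = adjust_frames(frames, 10, 1000);   /* 1% fault rate -> shrink */
        printf("frames now: %d\n", frames);
        return 0;
    }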

Windows NT


Windows NT (2000, XP) shares many similarities with Linux in its implementation of virtual memory. It also uses a 32-bit linear address space, but divided into a 2GB system area and a 2GB application area. It uses a two-level page table. Paging is nominally on demand; however, on each page fault a cluster of nearby pages is loaded. The copy-on-write optimization is used, as in Linux. Instead of a clock algorithm, Windows uses an LRU replacement algorithm on uniprocessor x86 systems and a FIFO algorithm on multiprocessor x86 and Alpha systems. It uses a page-fault frequency algorithm to manage the working sets of its processes.

Problem


The problem is to modify the memory manager so that it collects statistics on the rate of page faults of one process or of the system as a whole, to add a system call that initializes the statistics and reports the fault frequency, and to write a user program that exercises that call.
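
A user program along these lines might look like the sketch below. The system call number, its name, and its argument convention are placeholders invented for illustration; the real values would depend on how the call is actually added to the kernel.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define __NR_pf_stats   294   /* placeholder number, not a real syscall */
    #define PF_STATS_RESET  0
    #define PF_STATS_REPORT 1

    int main(void)
    {
        long faults;

        /* Initialize (reset) the fault statistics. */
        if (syscall(__NR_pf_stats, PF_STATS_RESET, 0) < 0) {
            perror("pf_stats reset");
            return 1;
        }

        /* ... do some memory-intensive work here ... */

        /* Report the fault frequency for this process. */
        faults = syscall(__NR_pf_stats, PF_STATS_REPORT, getpid());
        if (faults < 0) {
            perror("pf_stats report");
            return 1;
        }
        printf("page faults since reset: %ld\n", faults);
        return 0;
    }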

Sources