
Virtualization of CPU and memory

Why virtualize CPU and memory#

Virtualizing memory makes application programming simpler. The operating system provides an illusion to each program that it has a large, private address space, without the program needing to consider how the actual small and crowded physical memory is allocated. Virtualization also protects each program's memory from being modified by other, possibly malicious, programs.

Virtualizing the CPU allows the operating system to use scheduling algorithms to ensure that the CPU is not monopolized by a single program during multi-program execution.

The three goals of virtual memory are transparency, efficiency, and protection.

Zombie State#

A process can be in a final state where it has exited but has not yet been cleaned up (known as a zombie state in UNIX-based systems). This final state is useful because it allows other processes (usually the parent process that created the process) to check the return code of the process and see if the completed process executed successfully (typically, in UNIX-based systems, a process returns zero when it successfully completes a task, otherwise it returns a non-zero value).
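A minimal sketch of this in Python (on a UNIX-like system): the parent fork()s a child, the child exits with a status code, and the exited child remains a zombie until the parent reaps it with waitpid() and reads its return code.

```python
import os

pid = os.fork()
if pid == 0:
    # Child: exit with 0 to signal success (the UNIX convention).
    os._exit(0)
else:
    # Until the parent calls waitpid(), the exited child is a zombie.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status)
    print("child exit code:", exit_code)
```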

Shell Invocation Process, fork, exec#

The shell is also a user program. It first displays a prompt and waits for user input. You can input a command (the name of an executable program and any required parameters). In most cases, the shell can find this executable program in the file system, create a new process using fork(), and execute the executable program using one of the variants of exec(). It then waits for the command to complete using wait(). After the child process finishes execution, the shell returns from wait() and outputs another prompt, waiting for the next command from the user.
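The fork()/exec()/wait() sequence described above can be sketched as follows. A real shell would parse the command from a prompt; here the command is hard-coded ("true", a standard UNIX utility that exits 0) to keep the example self-contained.

```python
import os

def run_command(argv):
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        os.execvp(argv[0], argv)
        os._exit(127)  # reached only if exec() failed
    # Parent (the shell): wait for the child, then return its exit code.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

code = run_command(["true"])
```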

The separation of fork() and exec() allows the shell to easily implement many useful features. For example:

wc p3.c > newfile.txt

In the example above, the output of wc is redirected to the file newfile.txt (indicated by the greater than sign before newfile.txt). The shell implements output redirection by closing the standard output before calling exec() and opening the file newfile.txt. This way, the output of the running program wc is sent to the file instead of being printed on the screen.

Key underlying mechanisms of CPU virtualization, restricted direct execution#

System Call Process, trap instruction, trap table#

To execute a system call, a program must execute a special trap instruction. This instruction jumps into the kernel and elevates the privilege level to kernel mode. Once in the kernel, the system can perform any necessary privileged operations (if allowed) to perform the work required by the calling process. After completion, the operating system calls a special return-from-trap instruction, which returns to the user program that initiated the call and lowers the privilege level back to user mode.

There is an important detail that has not been discussed: how does the trap know which code to run inside the OS?

Obviously, the calling process cannot specify the address to jump to (like you do when making a procedure call), as this would allow the program to jump to any location in the kernel, which is clearly a bad idea. In fact, this ability would likely allow a clever programmer to make the kernel run any sequence of code. Therefore, the kernel must carefully control the code executed on a trap.

The kernel achieves this by setting up a trap table at startup. When the machine starts up, it runs in privileged (kernel) mode and can configure the machine hardware as needed. The first thing the operating system does is tell the hardware where to run code when certain exceptional events occur.

For example, when a disk interrupt occurs, a keyboard interrupt occurs, or a program makes a system call, where should the code run? The operating system typically notifies the hardware of the location of these trap handlers using some special instruction. Once the hardware is notified, it remembers the location of these handlers until the next machine reboot, and the hardware knows what to do (i.e., where to jump) when a system call or other exceptional event occurs.

One last note: being able to execute an instruction to tell the hardware where the trap table is located is a very powerful capability. Therefore, as you might guess, it is also a privileged operation. If you try to execute this instruction in user mode, the hardware will not allow it.
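The trap-table idea can be modeled as a toy dispatch table: at "boot" (while still privileged) the kernel registers a handler for each event; afterwards, a trap can only jump through the table, never to a caller-chosen address. All names here are illustrative, and Python stands in for what is really hardware behavior.

```python
trap_table = {}

def register_handler(event, handler):
    # Privileged: done once at boot, before dropping to user mode.
    trap_table[event] = handler

def trap(event, *args):
    # The hardware jumps via the table; the caller cannot pick the target.
    return trap_table[event](*args)

# A hypothetical "write" system call handler that returns bytes written.
register_handler("syscall_write", lambda s: len(s))
n = trap("syscall_write", "hello")
```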

Cooperative Scheduling System#

In a cooperative scheduling system, the operating system regains control of the CPU by waiting for a system call or some kind of illegal operation to occur.

Non-Cooperative Scheduling System#

In a non-cooperative scheduling system, the operating system regains control of the CPU by waiting for a clock interrupt, system call, or some kind of illegal operation to occur. The scheduling algorithm of the operating system determines which program to run next.

Virtual Memory#

Hardware-Based Dynamic Relocation#

Each CPU requires two hardware registers: a base register and a bounds (limit) register. When a program is written and compiled, it assumes its address space starts at 0. When the program actually runs, the operating system decides where to load it in physical memory and records that starting address in the base register. While the process runs, the hardware translates every memory reference it generates as physical address = virtual address + base, checking each virtual address against the bounds register. Because the whole address space is allocated as one contiguous block, this method leads to internal fragmentation (allocated space inside the block, such as the region between stack and heap, goes unused).
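This translation is simple enough to sketch directly; the register values below are illustrative.

```python
BASE = 32 * 1024    # where the OS loaded this address space
BOUNDS = 16 * 1024  # size of the address space

def translate(vaddr):
    # Hardware check: any reference past the bounds is a fault.
    if vaddr >= BOUNDS:
        raise MemoryError("out-of-bounds access")
    return vaddr + BASE  # physical = virtual + base

paddr = translate(1000)
```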

Segmentation, Generalized Base/Limit#

A typical address space has three different logical segments: code, stack, and heap. These three parts are divided into three segments and placed in physical memory, with each segment having a set of base/limit registers. The hardware uses segment registers to convert virtual addresses.

For example, with a 14-bit virtual address, the top 2 bits identify which segment the address belongs to, and the remaining 12 bits are the offset within that segment. The physical address is then offset + base of the corresponding segment, and the offset is checked against the segment's limit register to catch out-of-bounds accesses. Besides the base/limit registers, the hardware also needs to know each segment's direction of growth (the stack, for example, may grow in the opposite direction).
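A sketch of that segmented translation, assuming 14-bit virtual addresses (top 2 bits select the segment, low 12 bits are the offset). The segment bases and limits are illustrative, and growth direction is ignored for brevity.

```python
SEG_SHIFT = 12
OFFSET_MASK = 0xFFF

# segment number -> (base, limit); 0b00 = code, 0b01 = heap
segments = {0b00: (32 * 1024, 2 * 1024), 0b01: (34 * 1024, 2 * 1024)}

def translate(vaddr):
    seg = (vaddr >> SEG_SHIFT) & 0b11   # top 2 bits: which segment
    offset = vaddr & OFFSET_MASK        # low 12 bits: offset in segment
    base, limit = segments[seg]
    if offset >= limit:
        raise MemoryError("segmentation fault")
    return base + offset

paddr = translate(0x1068)  # segment 0b01 (heap), offset 0x068
```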

Segmentation leads to the problem of quickly filling up physical memory with small holes of unused space, making it difficult to allocate new segments or expand existing ones. This problem is known as external fragmentation.

Paging#

Paging divides a process's address space into fixed-size units called pages. Correspondingly, physical memory is divided into page frames, each of which can hold one virtual page.

To record the location of each virtual page in physical memory, the operating system typically maintains a data structure called a page table for each process.

The main purpose of the page table is to save address translations for each virtual page of the address space, so that we know the location of each page in physical memory. The page table is a per-process data structure.

To translate a virtual address, we must first split it into two components: the virtual page number (VPN) and the offset within the page. Assuming a virtual address space of 64 bytes, a virtual address needs 6 bits in total (2^6 = 64). If the page size is 16 bytes, the offset within the page needs 4 bits (2^4 = 16), and the remaining 2 bits are the VPN (2 bits can address 4 pages, and 4 * 16 = 64).
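The split for this 64-byte / 16-byte-page example, in code (address 21 is an arbitrary illustrative value):

```python
PAGE_SIZE = 16
OFFSET_BITS = 4  # 2^4 = 16-byte pages

def split(vaddr):
    vpn = vaddr >> OFFSET_BITS        # top 2 bits of a 6-bit address
    offset = vaddr & (PAGE_SIZE - 1)  # low 4 bits
    return vpn, offset

vpn, offset = split(21)  # 21 = 0b010101
```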

When performing address translation, the system queries the page table to convert the VPN to the physical frame number (PFN), while keeping the offset unchanged.

The page table is a data structure used to map virtual addresses to physical addresses, so any data structure can be used. The simplest form is called a linear page table, which is an array. The operating system uses the VPN to index into this array and looks up the page table entry (PTE) at that index to find the desired physical frame number (PFN).
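A sketch of the full lookup for the same 64-byte address space: the VPN indexes the array of PTEs, the PFN replaces the VPN, and the offset is carried over unchanged. The mappings are illustrative.

```python
OFFSET_BITS = 4

# One PTE per virtual page: (valid bit, PFN).
page_table = [(1, 3), (1, 7), (0, 0), (1, 2)]

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & 0xF
    valid, pfn = page_table[vpn]  # index the linear array by VPN
    if not valid:
        raise MemoryError("invalid page access")
    return (pfn << OFFSET_BITS) | offset  # PFN replaces VPN; offset kept

paddr = translate(21)  # VPN 1 -> PFN 7, offset 5
```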

The PTE contains fields such as valid bit, protection bit, and present bit, as well as the most important field, the physical frame number (PFN).

Paging Problems#

Linear page tables occupy a large amount of memory because they store entries for all virtual pages, regardless of whether they are mapped to physical addresses or not.

In addition, paging requires an extra memory reference for every memory reference (whether it is an instruction fetch or an explicit load or store) to first fetch the address translation from the page table. The extra memory reference overhead is significant and can slow down a process by a factor of two or more.

So, there are two practical problems that must be addressed. If not carefully designed in hardware and software, page tables can slow down the system and consume too much memory.

Fast Address Translation - TLB#

The TLB (translation-lookaside buffer) is a hardware cache that solves the problem of slow address translation. When the hardware translates a virtual address, it first checks whether the translation is in the TLB; only on a miss does it access the page table in memory.

The virtual-to-physical address mappings stored in the TLB are only valid for the current process and have no meaning for other processes. Therefore, on a context switch, the hardware or operating system (or both) must ensure that the incoming process does not mistakenly use the previous process's translations. There are two common solutions:

  1. Simply flush the TLB on every context switch.
  2. Add an address space identifier (ASID) to each TLB entry to record which process the translation belongs to.
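The second option can be sketched as a TLB keyed by (ASID, VPN): entries from different processes coexist because the identifier disambiguates them. The mappings are illustrative.

```python
tlb = {}  # (asid, vpn) -> pfn

def lookup(asid, vpn, page_table):
    if (asid, vpn) in tlb:       # TLB hit: no memory access needed
        return tlb[(asid, vpn)]
    pfn = page_table[vpn]        # TLB miss: walk the page table in memory
    tlb[(asid, vpn)] = pfn       # cache the translation for next time
    return pfn

# Two processes map VPN 0 to different frames; neither evicts the other.
pfn_a = lookup(asid=1, vpn=0, page_table={0: 10})
pfn_b = lookup(asid=2, vpn=0, page_table={0: 20})
```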

Multi-Level Page Tables#

Page tables occupy a large amount of memory. One simple way to reduce their size is to use larger pages (fewer virtual pages, so fewer VPN bits and more offset bits), but larger pages also mean more internal fragmentation. Another approach is to use a hybrid of segmentation and paging.

We mentioned that page tables are a data structure, and we can use a hash table or tree to store valid page table entries (PTEs). However, using a binary tree has a high cost, as it requires multiple memory accesses during the lookup process.

Modern operating systems often use multi-level page tables, which are effectively a shallow tree. The idea is to chop the linear page table into page-sized units and track those units with a new, smaller table called the page directory; if an entire page of PTEs is invalid, that page of the table need not be allocated at all. This is a time-space trade-off: with a two-level table, a TLB miss requires one memory access to read the page directory and a second to read the PTE, plus a third access for the data itself once the physical address is known, which is costlier than a simple linear table. Fortunately, we still have the TLB.
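A sketch of the two-level lookup: the top bits of the VPN index the page directory, whose entry points to a page of PTEs, and the low bits of the VPN index within that page. The sizes and mappings are illustrative.

```python
PD_SHIFT = 4   # assume 16 PTEs fit in one page-table page
PT_MASK = 0xF

# Page directory: most slots are None, so unmapped regions cost no memory.
page_directory = [None] * 16
page_directory[0] = {0: 3, 1: 7}  # the one allocated page of PTEs

def translate_vpn(vpn):
    pde = page_directory[vpn >> PD_SHIFT]  # memory access 1: page directory
    if pde is None or (vpn & PT_MASK) not in pde:
        raise MemoryError("invalid page access")
    return pde[vpn & PT_MASK]              # memory access 2: page of PTEs

pfn = translate_vpn(1)
```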
