Context Switching on x86

Context switching is the method an operating system employs to implement "multitasking". It's the practice of having multiple contexts of execution in your system, often called "processes", and switching between them really, really quickly to create the illusion that multiple things are happening at the same time.

It's also a good method for efficiently dealing with processes that need to wait for an IO request, such as a hard disk read, to return. While one process is waiting, another can execute.

Prerequisite Knowledge: There will be some C and assembler in this post, so some familiarity with those languages would be good. Also, knowledge of what processor registers are will help.

You may also want to read this post at least twice. Just saying.

# Before we begin

First, I want to talk about the ideas behind context switching, then I want to look at the xv6 implementation of context switching, then I'd like to talk a little bit about hardware vs software context switching.

If you're unfamiliar with xv6, it's well worth checking out. It's an educational reimplementation of the Unix V6 kernel with a very well written and well documented codebase. I'll be referring to it throughout this post, so I'd recommend grabbing the source code. Instructions for doing that are on their website.

# The Basics

Context switching mechanisms are platform specific. Depending on what processor architecture you're using, your context switching will look very different. I'm going to cover x86 (32 bit), as that's the only one I know about in any detail.

The general idea is that your processor has a state: the current contents of all of its registers. Each process keeps a copy of that state from the moment it was swapped off the processor, and when it later gets swapped back on, that state gets restored. This allows processes to be paused and resumed at a later time.

Your processor doesn't have any knowledge of processes, though. Not really. The processor only really executes one set of instructions at a time in a (mostly) linear fashion, so how do we convince it to jump between multiple processes? We need to have a list of processes, each containing processor state. We also need to have a means of jumping out of a running process and doing all of the logic necessary to swap it off the processor and swap its successor on.
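
To make the idea concrete, here's a deliberately tiny sketch in C. None of this is xv6 code; the names (cpu_state, switch_state and so on) are made up for illustration, and the actual register save/restore has to be written in assembly, as we'll see later.

// Not xv6 -- just the shape of the idea. Every process owns a saved
// copy of the CPU's registers; "context switching" means stashing the
// live registers into the outgoing process's slot and loading the
// incoming process's slot back into the CPU.
struct cpu_state {
  unsigned int edi, esi, ebx, ebp, esp, eip;   // whatever the platform needs
};

struct process {
  int pid;
  struct cpu_state saved;   // filled in when this process is switched out
};

struct process table[64];   // the system's list of processes
int current;                // index of the process on the CPU right now

// The save/restore itself can't be written in portable C -- on x86 it's
// a short assembly routine (xv6's swtch.S, shown later in this post).
void switch_state(struct cpu_state *save_into, struct cpu_state *load_from);

// Swap a successor onto the processor.
void switch_to(int next)
{
  struct cpu_state *old = &table[current].saved;
  current = next;
  switch_state(old, &table[next].saved);
}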

# What is a process?

From a high level, we all know that a process is a running program in the system. It contains some binary executable code, some data, it can be sent signals, it can spawn children and it can die. But what data makes this possible? Let's take a look at what xv6 considers enough information to represent a process:

// Per-process state
struct proc {
  uint sz;                     // Size of process memory (bytes)
  pde_t* pgdir;                // Page table
  char *kstack;                // Bottom of kernel stack for this process
  enum procstate state;        // Process state
  volatile int pid;            // Process ID
  struct proc *parent;         // Parent process
  struct trapframe *tf;        // Trap frame for current syscall
  struct context *context;     // swtch() here to run process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, have been killed
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
};

Lots of interesting things in there. A lot of them are pretty obvious, such as the process ID, the size of the process memory, the parent process and so on. Other things are a little less obvious. Why does it need a kernel stack, for example? And what's a trap frame?

# Kernel Stacks

Every process in xv6 needs a kernel stack for handling system calls. If, for example, a process gets stopped midway through a system call and a new process starts running, it would be a really bad idea for the new process to start working in the middle of a shared kernel stack. That would be equivalent to all user processes sharing one big stack, and because sharing stacks is a bad idea, each process gets its own user stack and its own kernel stack.

Another reason it's really useful to have a kernel stack for each user-mode process is that even if the user mangles its user-mode stack, the kernel stack will be intact and the operating system will be able to handle the failure gracefully.

# Interrupts and Trap Frames

An integral part of the x86 architecture is its interrupt system. The processor can, at any time and for a variety of reasons, be interrupted and asked to do something else. Examples of interrupts include dividing by zero, accessing memory that doesn't belong to you or is protected, IO requests completing, keyboard events and a special "timer" interrupt that fires periodically. That timer interrupt is how "pre-emptive process scheduling" is implemented. All of these things require context switching in and out of the kernel. (Every single keypress you do requires at least 2 context switches!)

There is a piece of hardware inside your computer called a "programmable interval timer" (PIT), which can be told to generate an interrupt at a regular rate. In xv6 it's configured to fire roughly 100 times per second, and the timer's interrupt vector leads into xv6's process scheduling code, which handles the context switch.
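
For the curious, programming the 8253 PIT is only a handful of port writes. The snippet below is roughly what xv6's timer.c does, reproduced from memory, so treat the exact constants as a sketch; multiprocessor builds of xv6 use the local APIC timer in lapic.c instead, but the idea is the same: divide the chip's input clock down to 100 interrupts per second.

// Roughly xv6's timer.c: program channel 0 of the 8253 PIT to raise
// IRQ 0 about 100 times a second.
#define IO_TIMER1     0x040              // 8253 channel 0 data port
#define TIMER_MODE    (IO_TIMER1 + 3)    // mode/command register
#define TIMER_SEL0    0x00               // select channel 0
#define TIMER_RATEGEN 0x04               // mode 2: rate generator
#define TIMER_16BIT   0x30               // write low byte, then high byte
#define TIMER_FREQ    1193182            // the chip's input clock, in Hz
#define TIMER_DIV(x)  ((TIMER_FREQ + (x)/2) / (x))

void
timerinit(void)
{
  // Ask for interrupts at 100Hz.
  outb(TIMER_MODE, TIMER_SEL0 | TIMER_RATEGEN | TIMER_16BIT);
  outb(IO_TIMER1, TIMER_DIV(100) % 256);    // divisor, low byte
  outb(IO_TIMER1, TIMER_DIV(100) / 256);    // divisor, high byte
  picenable(IRQ_TIMER);                     // unmask IRQ 0 at the PIC
}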

The trap frame stored in the proc struct, however, is concerned only with system calls. xv6 implements its system calls in a similar way to Linux: a software interrupt vector is reserved for them (Linux famously uses 0x80, which is decimal 128; xv6 uses 64, defined as T_SYSCALL in traps.h). The user program stores the system call number in the eax register and fires an int instruction with that vector, the processor looks up the handler for that interrupt, and the OS transitions into kernel mode.
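
From the user program's side, making a system call really is just "load eax, fire the interrupt". xv6 generates its user-side stubs with an assembler macro in usys.S; the snippet below expresses the same idea in C using GCC inline assembly, so it's an illustration rather than xv6's actual code (and check syscall.h for the real syscall numbers).

// A user-side system call wrapper, written here with GCC inline asm.
// xv6's real stubs live in usys.S, but they boil down to exactly this:
// syscall number in eax, then a software interrupt on vector T_SYSCALL.
#define T_SYSCALL  64    // from traps.h
#define SYS_getpid 11    // from syscall.h (check the source for the real value)

static inline int
my_getpid(void)
{
  int ret;
  asm volatile("int %1"
               : "=a" (ret)                        // return value comes back in eax
               : "i" (T_SYSCALL), "0" (SYS_getpid) // vector as an immediate, number in eax
               : "memory", "cc");
  return ret;
}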

Part of the kernel mode transition involves creating a "trap frame", which looks like this (x86.h file in xv6):

// Layout of the trap frame built on the stack by the
// hardware and by trapasm.S, and passed to trap().
struct trapframe {
  // registers as pushed by pusha
  uint edi;
  uint esi;
  uint ebp;
  uint oesp;      // useless & ignored
  uint ebx;
  uint edx;
  uint ecx;
  uint eax;

  // rest of trap frame
  ushort gs;
  ushort padding1;
  ushort fs;
  ushort padding2;
  ushort es;
  ushort padding3;
  ushort ds;
  ushort padding4;
  uint trapno;

  // below here defined by x86 hardware
  uint err;
  uint eip;
  ushort cs;
  ushort padding5;
  uint eflags;

  // below here only when crossing rings, such as from user to kernel
  uint esp;
  ushort ss;
  ushort padding6;
};

This is built on the stack for every interrupt that happens and records everything the operating system and processor need in order to pick up exactly where execution left off once the interrupt has been handled.

When xv6 handles a system call, the current process's proc->tf is set to the interrupt's trap frame, which contains the register values that will be restored to the processor when the system call returns. This way, system calls can interact with the saved registers and pass return values back to user mode.

System call code in xv6, found in syscall.c:

void
syscall(void)
{
  int num;

  // Get the syscall number from eax
  num = proc->tf->eax;

  // Check that the syscall number is greater than 0, inside the range of total
  // syscalls and that a function pointer exists for that syscall number.
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    // Set eax to the return value of the syscall.
    proc->tf->eax = syscalls[num]();
  } else {
    // Else print an error and return -1 to the user program.
    cprintf("%d %s: unknown sys call %d\n",
            proc->pid, proc->name, num);
    proc->tf->eax = -1;
  }
}
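
The NELEM(syscalls) and syscalls[num] checks work because syscalls is just an array of function pointers indexed by syscall number. Abridged, and reproduced from memory, the table in syscall.c looks something like this:

// Abridged sketch of the dispatch table in xv6's syscall.c. Each entry
// maps a syscall number (from syscall.h) to the kernel function that
// implements it; missing entries are null pointers, which is why
// syscall() checks syscalls[num] before calling through it.
extern int sys_fork(void);
extern int sys_exit(void);
extern int sys_getpid(void);
extern int sys_write(void);
// ... one declaration per system call ...

static int (*syscalls[])(void) = {
  [SYS_fork]   = sys_fork,
  [SYS_exit]   = sys_exit,
  [SYS_getpid] = sys_getpid,
  [SYS_write]  = sys_write,
  // ... and so on for every syscall xv6 implements ...
};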

# Disabling and re-enabling interrupts

Sometimes we don't want the processor to be interrupted, because an interrupt could catch us with really important data structures in an inconsistent state. x86 offers two instructions to deal with this, cli and sti: clear interrupt flag and set interrupt flag respectively. They manipulate the IF bit in the eflags register, which tells the processor whether or not we currently want to handle maskable interrupts.
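
In xv6 these instructions are wrapped in tiny inline functions in x86.h so that C code can use them directly. Roughly (uint is xv6's typedef for unsigned int):

// Thin C wrappers around the relevant instructions, approximately as
// they appear in xv6's x86.h.
static inline void
cli(void)
{
  asm volatile("cli");
}

static inline void
sti(void)
{
  asm volatile("sti");
}

static inline uint
readeflags(void)
{
  uint eflags;
  asm volatile("pushfl; popl %0" : "=r" (eflags));
  return eflags;
}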

When we've cleared interrupts we stop handling them, but maskable interrupts don't just vanish: they're held pending by the interrupt hardware, and once interrupts are re-enabled the pending ones get delivered. This makes sense. You wouldn't want them thrown away completely, otherwise you'd have processes waiting forever on IO results that never arrive.

In spinlock.c there are definitions for pushcli() and popcli(), which are a "matched" interface to the x86 cli and sti instructions. So it takes two calls to popcli() to restore interrupts after two calls to pushcli(). This makes it easier to handle the clearing of interrupts when multiple parts of the code base are calling cli and sti. Without them, you could end up with functions turning interrupts back on before other parts of the code base are ready for them.

If that's still unclear, imagine the following code:

void func1(void) {
  cli();
  // important stuff
  func2();
  // more important stuff
  sti();
  return;
}

void func2(void) {
  cli();
  // important stuff
  sti();
  return;
}

func1();

There is a gigantic problem in the above code. func2() is calling sti(), but when it returns, func1() still expects interrupts to be off and does some important things. To get around that problem, we would rewrite the above code like so:

void func1(void) {
  pushcli();
  // important stuff
  func2();
  // more important stuff
  popcli();
  return;
}

void func2(void) {
  pushcli();
  // important stuff
  popcli();
  return;
}

func1();

This way, interrupts would only be turned back on when the popcli() inside of func1() has been called.
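
Under the hood, pushcli() and popcli() are just a nesting counter plus a note of whether interrupts were on to begin with. This is roughly what spinlock.c does, lightly abridged:

// Roughly xv6's pushcli()/popcli() from spinlock.c. cpu->ncli counts
// how deeply nested we are; cpu->intena remembers whether interrupts
// were enabled before the first pushcli(), so the final popcli() knows
// whether it should sti() at all.
void
pushcli(void)
{
  int eflags;

  eflags = readeflags();    // sample the IF bit *before* clearing it
  cli();
  if(cpu->ncli++ == 0)
    cpu->intena = eflags & FL_IF;
}

void
popcli(void)
{
  if(readeflags()&FL_IF)
    panic("popcli - interruptible");  // someone re-enabled interrupts behind our back
  if(--cpu->ncli < 0)
    panic("popcli");                  // more pops than pushes
  if(cpu->ncli == 0 && cpu->intena)
    sti();                            // back at depth zero, and they were on before
}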

# Detailed steps for context switching

Let's envisage a scenario. Process 1 is running, process 2 is waiting to run. The PIT throws an interrupt. This causes the processor to stop what it's doing, abandon process 1, create a trap frame and transition into kernel mode. When an interrupt is caught, there is a generic catch-all that sets up the trap frame and any trap information the processor has given us and then calls the C function trap(). Here's the assembler for that:

#include "mmu.h"

  # vectors.S sends all traps here.
.globl alltraps
alltraps:
  # Build trap frame.
  pushl %ds
  pushl %es
  pushl %fs
  pushl %gs
  pushal

  # Set up data and per-cpu segments.
  movw $(SEG_KDATA<<3), %ax
  movw %ax, %ds
  movw %ax, %es
  movw $(SEG_KCPU<<3), %ax
  movw %ax, %fs
  movw %ax, %gs

  # Call trap(tf), where tf=%esp
  pushl %esp
  call trap # <- ENTRY INTO C TRAP FUNCTION
  addl $4, %esp

  # Return falls through to trapret...
.globl trapret
trapret:
  popal
  popl %gs
  popl %fs
  popl %es
  popl %ds
  addl $0x8, %esp  # trapno and errcode
  iret

# vectors.S

The comment at the top of the above assembler snippet mentions vectors.S, but you probably won't find that file inside the xv6 repo. Why? The reason is that x86 doesn't make interrupt management easy: to handle all interrupts in a way that lets you identify which interrupt happened, you need to set up 256 interrupt handler entry points in assembly. Not fun. xv6 generates them at build time with a Perl script:

# Generate vectors.S, the trap/interrupt entry points.
# There has to be one entry point per interrupt number
# since otherwise there's no way for trap() to discover
# the interrupt number.

print "# generated by vectors.pl - do not edit\n";
print "# handlers\n";
print ".globl alltraps\n";
for(my $i = 0; $i < 256; $i++){
    print ".globl vector$i\n";
    print "vector$i:\n";
    if(!($i == 8 || ($i >= 10 && $i <= 14) || $i == 17)){
        print "  pushl \$0\n";
    }
    print "  pushl \$$i\n";
    print "  jmp alltraps\n";
}

print "\n# vector table\n";
print ".data\n";
print ".globl vectors\n";
print "vectors:\n";
for(my $i = 0; $i < 256; $i++){
    print "  .long vector$i\n";
}

The important bit to note is that everything jumps into the alltraps assembler label seen above, which passes control into the trap() function in C.

# Continuing the example...

We're now in process 1's kernel stack in the trap handler and we need to find a new process to run. Because this is a timer interrupt, the following bit of code inside of the trap() function in trap.c will run:

// Force process to give up CPU on clock tick.
// If interrupts were on while locks held, would need to check nlock.
if(proc && proc->state == RUNNING && tf->trapno == T_IRQ0+IRQ_TIMER)
  yield();

Let's dig into what yield() is doing (proc.c):

// Give up the CPU for one scheduling round.
void
yield(void)
{
  acquire(&ptable.lock);
  proc->state = RUNNABLE;
  sched();
  release(&ptable.lock);
}

So first we're acquiring a lock on the process table. This is to avoid race conditions when multiple processors are running scheduling code. Then we set the current process's state to RUNNABLE, and we call a function called sched() (in proc.c):

// Enter scheduler.  Must hold only ptable.lock
// and have changed proc->state.
void
sched(void)
{
  int intena;

  // If we aren't holding the ptable lock, panic
  if(!holding(&ptable.lock))
    panic("sched ptable.lock");

  // If we aren't 1 pushcli level deep, panic
  if(cpu->ncli != 1)
    panic("sched locks");

  // If the current process is in the running state, panic
  if(proc->state == RUNNING)
    panic("sched running");

  // If the processor can be interrupted, panic
  if(readeflags()&FL_IF)
    panic("sched interruptible");

  intena = cpu->intena;
  swtch(&proc->context, cpu->scheduler);
  cpu->intena = intena;
}

The majority of sched() is concerned with making sure it's safe to schedule a new process. First it checks that we're holding the process table lock, then it checks to make sure we're only one pushcli() level deep, then it makes sure the process being swapped off is not still running, then it checks if interrupts have been cleared. The last check feels a lot like a "just in case", because in theory the cpu->ncli check should cover it.

If any of those checks fail, it isn't safe to schedule a new process and the kernel panics.

cpu->intena records whether interrupts were enabled before the first pushcli() on this CPU. sched() saves it into a local and restores it after swtch() because it lives in per-CPU storage but is really a property of the kernel thread being switched out, and other processes will run (and overwrite it) before this one gets the CPU back. It isn't particularly important for this explanation, though, and we can skip over it without much trouble.

The next important call inside of sched() is the call to swtch(). This is the meat and potatoes of the operation, where the actual context switching happens. It passes in a pointer to the current process's context, so that the current registers can be saved, and it passes in the scheduler's context to be switched to. It makes sense, in that case, that this part is implemented in assembler (swtch.S):

# Context switch
#
#   void swtch(struct context **old, struct context *new);
#
# Save current register context in old
# and then load register context from new.

.globl swtch
swtch:
  movl 4(%esp), %eax
  movl 8(%esp), %edx

  # Save old callee-save registers
  pushl %ebp
  pushl %ebx
  pushl %esi
  pushl %edi

  # Switch stacks
  movl %esp, (%eax)
  movl %edx, %esp

  # Load new callee-save registers
  popl %edi
  popl %esi
  popl %ebx
  popl %ebp
  ret

Tada! This is what we've been looking for. swtch() deals in context structs, which look like this (proc.h):

// Saved registers for kernel context switches.
// Don't need to save all the segment registers (%cs, etc),
// because they are constant across kernel contexts.
// Don't need to save %eax, %ecx, %edx, because the
// x86 convention is that the caller has saved them.
// Contexts are stored at the bottom of the stack they
// describe; the stack pointer is the address of the context.
// The layout of the context matches the layout of the stack in swtch.S
// at the "Switch stacks" comment. Switch doesn't save eip explicitly,
// but it is on the stack and allocproc() manipulates it.
struct context {
  uint edi;
  uint esi;
  uint ebx;
  uint ebp;
  uint eip;
};

It contains only the registers necessary for performing a kernel-to-kernel switch, and you can see in swtch() that all we need to do is push the current callee-save registers, swap the stack pointer, then pop the values on the new stack into those same registers. Notice that eip is never saved explicitly: the call to swtch() left a return address on the old stack, and the ret at the end pops whatever return address sits on the new stack, which is what actually transfers control into the new context. Magic.
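
A nice consequence of this design is how brand new processes get started: allocproc() hand-crafts the bottom of a fresh kernel stack so that the very first swtch() into the new process "returns" into the right place. Roughly, abridged from proc.c (treat it as a sketch): the context sits lowest on the stack with eip pointing at forkret(), and just above it sits the address of trapret, so when forkret() returns, the trap-return code pops the trap frame and drops the process into user mode.

// Roughly how xv6's allocproc() (proc.c) lays out a new process's
// kernel stack. When the scheduler later swtch()es into p->context,
// the ret at the end of swtch pops p->context->eip and starts running
// forkret(); when forkret() returns, it "returns" into trapret, which
// restores the trap frame and irets out to user space.
p->kstack = kalloc();                 // one fresh page for the kernel stack
sp = p->kstack + KSTACKSIZE;          // stacks grow downwards, so start at the top

sp -= sizeof *p->tf;                  // room for the trap frame
p->tf = (struct trapframe*)sp;

sp -= 4;                              // fake return address for forkret()
*(uint*)sp = (uint)trapret;

sp -= sizeof *p->context;             // the context that swtch() will pop
p->context = (struct context*)sp;
memset(p->context, 0, sizeof *p->context);
p->context->eip = (uint)forkret;      // the first swtch() "returns" here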

The rest of the process is less concerned with context switching and more concerned with picking a new process to run, which is a completely different kettle of fish.

# Hardware context switching

x86 offers a method for doing context switching in hardware, which sounds as if it would be faster than software methods but in practice this doesn't turn out to be the case. (According to various reading around the web. I've not run any benchmarks on this personally).

The problem with hardware context switching is that it saves and reloads more state than a modern OS actually needs. For example, Linux uses what's called a "flat memory model", in which the segment registers are loaded once with descriptors whose base is zero and whose limit spans the whole address space, and they essentially never change after that.

# Segment Selectors

You're probably thinking "what the fuck are segment selectors?". x86 uses a segmented memory model: memory is accessed through segments, each described by a base and a limit, and every access goes through a segment register (%cs for the code segment, %ds for the data segment and so on) plus an offset. In 16-bit real mode a segment could cover at most 64KB; protected mode descriptors are more flexible. At least, that's how people used to use them. These days it is far preferable to give every segment a base of 0 and a limit spanning all of memory, and access memory as if it were "flat".

Whenever a segment register is loaded with a new selector, the processor performs protection checks on the new value, plus a whole host of other stuff that I don't fully understand. That's overhead a flat-memory OS doesn't need, and it's a big part of why hardware context switching ends up slower than software context switching in most cases.

Also, the x86-64 architecture has dropped support for hardware context switching according to the OSDev wiki.

# Wrapping up

So that's context switching in about as much detail as I can really offer. xv6 is a really awesome code base and I've learnt a lot about relatively simple yet realistic kernel implementations without having to go through the pain of reading the Linux source code.

Once I'm at the stage where I fully understand xv6 and can tinker around with it confidently, I fully intend to move up to the Linux code base and start messing with that.

I think a natural follow-on topic from this post would be to talk about scheduling algorithms. I might do that, but it does feel a bit too much like being back in University. I dunno, I'll see how I feel :)
