Stack Overflows and Tail Call Optimisation

In this post I want to walk through what a stack is in terms of the processor, what function calls are, why your stack can overflow or get too deep and how all of this relates to tail-call optimisation.

Prerequisite Knowledge: It helps if you've run into a stack overflow error before. If you haven't, congratulations, but it'll happen at some point, I promise, and when it does it'll be nice to understand what causes it.

Other than that, I'm hoping this post is accessible to anybody with reasonable knowledge of pointers and one programming language (C is ideal) under their belt, but useful for even the seasoned programmer.

# Program Memory Layout

To understand the stack, and the rest of this post, we must first have a good idea of how a program is laid out in memory. When the binary image of a program is initially laid out into memory, it is done in a very specific way that looks something like this:

high memory +----------------+
            | Command line   |
            | args and env   |
            | variables      |
            | Stack          |
            | (Grows down)   |
            |       |        |
            |       V        |
            |                |
            |                |
            |       ^        |
            |       |        |
            | Heap           |
            | (Grows up)     |
            | Uninitialised  |
            | data (BSS)     |
            | Initialised    |
            | data           |
            | Text           |
            |                |
low memory  +----------------+

This diagram is not to scale, but that's how things get laid out. For the unfamiliar, this explanation is tied very heavily to the ELF binary format. Here's a small refresher on what each of the segments means: the text segment holds the executable machine code of the program. The initialised data segment holds global and static variables that are given an explicit initial value, while the uninitialised data segment (the BSS) holds those without one; the BSS is zeroed when the program starts. The heap is where dynamic allocations (malloc and friends) live, growing towards higher addresses, and the stack holds function call frames and local variables, growing towards lower addresses.

# Tangent: Clearing up some niggling questions

So it's all well and good saying that this is how our program is laid out in memory, and saying that part of it is at high memory and part of it is in low memory, but what does that even mean? Where is high memory? Where is low memory? Why doesn't it start at the bottom? What happens when other processes get involved?

# Virtual Memory

The above questions are hard to answer because it's tempting, but wrong, to think that memory is accessed in a direct manner. It isn't. Memory, for all user programs, is accessed through a layer of abstraction called "virtual memory". If you're unfamiliar with virtual memory, I won't go into great detail here, but Googling the topic can lead to a very enlightening afternoon well spent.

The gist of it is that your program never sees physical memory. It sees what the operating system shows it, which is this idealistic view of memory. Idealistic because the process is made to think that it is the only thing running in the system and it is potentially shown that it has access to an amount of memory that is larger than the physical memory available.

Why? Because having a uniform view of memory that does not depend on the physical memory available and believing your program has free rein over all of memory makes writing software about 8 million times easier. It also means that memory can be allocated in a safe way and you can control access to regions of memory.

A process sees only itself in memory, and the operating system deals with mapping virtual addresses into physical ones. This makes the layout of a program easy to manage when you are running multiple processes. The application programmer doesn't have to worry because the operating system will take care of the mapping and your program will always think it's in the same place.

# Back to reality

Phew. Sorry about that. Where were we?

The stack! Of course. Now that we know roughly how your program gets laid out in memory, we can see that the stack starts at the top and grows down. We can also see that the heap starts low and grows up. So, in theory, won't they eventually meet in the middle? Therein lies the problem.

Realistically, they probably won't meet; that would be difficult to keep track of. Instead, the stack has a limited size, determined at the start of the program or thread (yup, threads get their own stacks) and dependent on a number of factors, that it cannot grow beyond. If it does try to grow past that limit, say because you put a huge array on the stack or recursed too deeply, your program will throw a stack overflow error.

# Basic stack operations

Let's take a look at what some stack operations look like and how they affect the stack. I also want to stop referring to "the stack" as some abstract thing and show you exactly how it looks at each stage of these operations.

At the start of your program, the stack will contain the following things:

bottom of stack    +----------------+
                   | null value     |
                   | *envp[n]       |
                   | *envp[...]     |
                   | *envp[0]       |
                   | null value     |
                   | *argv[argc-1]  |
                   | *argv[...]     |
                   | *argv[0]       | <--- Pointer to name of program
                   | argc           | <--- %esp
top of stack       +----------------+

When the execve system call is made, the program loader and operating system program initialisation routines will make sure that we have the above values on the stack, in that order.

# Pushing onto the stack

If we want to add something to the stack, x86 offers the push instruction:

push dword 0xff

This will push the value 0xff onto the stack. We declare this value as a dword, which means it will be 4 bytes in size. The stack now looks like this:

bottom of stack    +----------------+
                   | **envp         | <-- May contain many items
                   | **argv         | <-- May contain many items
                   | argc           |
                   | 0x000000ff     | <--- %esp
top of stack       +----------------+

Notice that the %esp register has moved to now point at the new top of the stack. This is part of what x86's push is doing for us behind the scenes (yes, there is still a lot of abstraction at the assembler level!).

I say "the %esp register has moved", which isn't quite right. What I mean is that the value inside the %esp register has changed to be 4 less than it was previously. Subtracting from the stack pointer advances it forward (remember, the stack starts at the top and grows down).

# Popping off the stack

In a similar fashion, x86 offers the pop instruction to remove things from the stack:

pop eax

This will pop the value at the top of the stack into the register %eax. The new stack looks like this:

bottom of stack    +----------------+
                   | **envp         | <-- May contain many items
                   | **argv         | <-- May contain many items
                   | argc           | <--- %esp
top of stack       +----------------+

And %eax now contains 0x000000ff. Magic.

# Function calls and calling conventions

When you call a function in C, there is a convention followed to ensure binary compatibility between libraries. The convention is defined by the System V Application Binary Interface and it can differ between processor architectures. I'm going to focus on x86, 32-bit, SVR4 (the spec for which can be found here; relevant calling convention information starts on page 35).

If you don't want to read the spec, and I totally understand why you wouldn't, here's a brief run down of the very basics.

If you want to make sure your assembly function can be called from C code, you must push all arguments to that function onto the stack in reverse order (right to left), then call the function. Simple. Here's a contrived example:

push 5
push 10
call add

Equivalent C:

add(10, 5);

That's the general gist of it. Push things onto the stack in reverse order and call the function. Here's a more realistic example, though, because it's not necessarily that simple in all cases:

; printf.S - Hello world in ASM!

section .data
    string db "Hello, %s", 10, 0 ; Format string, newline, null terminator
    world  db "world!", 0        ; Needs its own null terminator too

section .text
    global _main     ; Make our main function visible outside of this file.
    extern _printf   ; Declare printf from libc

_main:
    sub esp, 4   ; Stack alignment
    push world   ; Second arg
    push string  ; First arg
    call _printf ; Call printf
    add esp, 12  ; Deallocate our stack (4 alignment bytes + 2 args)
    xor eax, eax ; Zero eax, serves as return value from main
    ret          ; return

If you want to run this and you're not on a Mac, you may have to fiddle around with the following instructions so that it works on your platform. You'll also likely need to remove the _ from the front of _printf, that's apparently a Mac thing. Assembling and linking required a few flags:

$ nasm -f macho printf.S
$ ld -macosx_version_min 10.7 /usr/lib/libc.dylib /usr/lib/crt1.o printf.o

The nasm command required the filetype to be specified, which is a Mach-O i386 binary. The ld command required the OSX min version specified, and it also needed us to link in libc and crt1 (if you're unfamiliar with what crt1 is, I talked about it in my previous blog post).

Once that's done, we will have a file called a.out and when we run it:

$ ./a.out
Hello, world!


# Stack alignment? What?

You'll notice in the printf asm snippet that one of the lines mentions "stack alignment", and is seemingly doing an arbitrary subtraction from the stack pointer. Why is this?

When the operating system drops us into our _main function, the stack pointer will be evenly divisible by 16. Why is this? I don't know the exact details, it's an area of processors I'm yet to dive into, but a Stack Overflow question I found suggests that it is something to do with "SIMD" instructions. They're a type of instruction that does the same operation to multiple pieces of data in parallel, and they apparently require the stack to be aligned on 16 byte boundaries.

# Stack frames

If we're following the above calling convention properly, every time we call a function we'll be creating a "stack frame". This stack frame will contain the following:

| Arg N          | <-- ebp + 4n + 8
| ...            |
| Arg 0          | <-- ebp + 8
| Return addr    | <-- ebp + 4
| Previous ebp   | <-- ebp
| Func locals    | <-- ebp - 4
| ...            | <-- variable length section for your local vars
| ...            | <-- esp

The args in the above diagram are technically inside the previous stack frame, because the %ebp register is referred to as the "frame pointer" and points at the base of the current frame. When you write your own assembler functions you will have a section at the start of every function called the "prologue" (you'll sometimes see it called the preamble), and a matching clean up section, the "epilogue", at the end. It'll look a bit like this:

    push ebp
    mov  ebp, esp

    ; Do your own thing here

    mov  esp, ebp
    pop  ebp
    ret

We're pushing the current frame pointer onto the stack, then using the stack pointer as the new frame pointer. Then, if we wished, we could use the stack however we wanted, as long as we clean it up when we're done. Then at the end, we pop the previous frame pointer back out and return.

The call instruction pushes the address of the instruction following it onto the stack, and this is what ret uses to know where to return to. It's very important that you correctly clean up your stack, otherwise you will be returning into who-knows-where and you'll likely segfault.

# Stack local variables

If you're familiar with C or more or less any other modern programming language, you'll be aware of "variable scope". It's the idea that variables defined inside a function can only be referenced inside that function and are deallocated when you leave. This is why it's always a bad idea to return a pointer to something that was defined on the stack, and why you should use malloc instead (because malloc returns a pointer that doesn't point into the stack).

Based on the above description, it should start to be clear why this is. When we leave a function, we deallocate its part of the stack behind us, so the stack shrinks. When we go into another function, the stack grows back over the section we've just deallocated! Nothing cleans that area up, so the previous frame's data is still there, waiting to be overwritten. This is why local variables that are declared but not initialised are not guaranteed to be 0 (globals, by contrast, live in the BSS and are zeroed at startup).

# Recursion

Right! We know about the stack, we know about stack frames, so what about recursion? Recursion is the act of calling a function inside itself. The quintessential example is a naive Fibonacci sequence generator:

int fib(int n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}
So inside of fib we're calling fib, giving us recursion. Consider what this means in terms of stack frames. What's going to happen when this gets called? For the sake of example, let's say you call fib(100000). Because the order of growth of the above algorithm is O( 2^N ), you'd be making on the order of 2^100000 calls in total, but the stack overflow arrives long before that: the chain of fib(n - 1) calls alone tries to put 100000 frames on the stack at once, and you'll run out of stack space well before you get there. That's... a lot of stack frames. Good luck.

# Tail Call Optimisation

Now, the fib example above isn't a candidate for tail call optimisation. To be eligible for TCO, a function's recursive call must be the very last thing it does. The above function is a bit deceptive because it looks like it's calling itself as the last thing, but it isn't. Think about what the processor has to do to service that request.

fib called.
Check n < 2.
    Test fails.
Return fib(n - 1) + fib(n - 2)
    Calculate n - 1.
    Call fib(n - 1).
    Calculate n - 2.
    Call fib(n - 2).
    Add the two results together.

So the last thing that's done is the addition, thus TCO cannot apply.

# Why?

Let's consider a simpler, contrived example. Something that is eligible for TCO.

#include <stdio.h>

void countdown(int n) {
    printf("%d\n", n);
    if (n == 0) return;
    countdown(n - 1);
}
So this function counts down from any number until it gets to 0. In this example, the self referential call is the last thing to run, so we can do something very clever.

Consider for a second what's about to happen. We push a new return address onto the stack, we push our new argument onto the stack and then we jump to the same code we just executed. Why bother? We aren't going to use the previous stack frame ever again, apart from the return address back to the original caller. If we recurse deeply, we just end up with a massive daisy chain of return addresses that is very, very inefficient to unwind. Why don't we just overwrite the existing frame and leave the return address intact?

And that, ladies and gents, is TCO in a nutshell. You just overwrite the existing stack and leave the return address as is, then jump into the start of the function again. Dead simple.

# Conclusion

As always, there are likely lots of omissions and simplifications that hide some of the complexities of what I'm explaining. Specifically, I think there are a lot of complications I'm not aware of when it comes to TCO. I have a 10,000 foot view of what TCO is and how it is implemented at the assembler level, but when it comes to how the compiler spots it I really don't have a clue yet :)

Hope you enjoyed the post! I'm thinking the next post might be about context switching.
