The Birth and Death of a Running Program
I've been on a quest over the last year or so to understand fully how a program ends up going from your brain into code, from code into an executable and from an executable into an executing program on your processor. I like the point I've got to in this pursuit, so I'm going to brain dump here :)
Prerequisite Knowledge: Some knowledge of assembler will help. Some knowledge of processors will also help. I wouldn't call either of these necessary, though; I'll try my best to explain what needs explaining. What you will need, though, is a toolchain. If you're on Ubuntu, hopefully this article will help. If you're on another system, Google for "[your os] build essentials", e.g. "arch linux build essentials".
# The Birth of a Program
You have an idea for a program. It's the best program idea you've ever had so you quickly prototype something in C:
```c
#include <stdio.h>

int main(int argc, char* argv[]) {
    printf("Hello, world!\n");
    return 0;
}
```
A work of genius. You quickly compile and run it to make sure all is good:
```
$ gcc hello.c -o hello
$ ./hello
Hello, world!
```
Boom!
But wait... What has happened? How did it go from being a quite understandable high-level program to something that your processor can understand and run? Let's go through what's happening step by step.
GCC is doing a tonne of things behind the scenes in the `gcc hello.c -o hello` command. It is compiling your C code into assembly, optimising lots in the process, then it is creating "object files" out of your assembly (usually in a format called ELF on Linux platforms), then it is linking those object files together into an executable file (again, executable ELF format). At this point we have the `hello` executable and it is in a well-known format with lots of cross-machine considerations baked in.
After we run the executable, the "loader" comes into play. The loader figures out where in memory to put your code, it figures out whether it needs to mess about with any of the pointers in the file, it figures out if the file needs any dynamic libraries linked to it at runtime and all sorts of mental shit like that. Don't worry if none of this makes sense, we're going to go into it in good time.
# Compiling from C to assembly
This is a difficult bit of the process and it's why compilers used to cost you an arm and a leg before Stallman came along with the GNU Compiler Collection (GCC). Commercial compilers do still exist but the free world has standardised on GCC or LLVM, it seems. I won't go into a discussion as to which is better because I honestly don't know enough to comment :)
If you want to see the assembly output of the `hello.c` program, you can run the following command:

```
$ gcc -S hello.c
```

This command will create a file called `hello.s`, which contains assembly code. If you've never worked with assembly code before, this step is going to be a bit of an eye opener. The file generated will be long, difficult to read and probably different to mine depending on your platform.
Now is not the time or place to teach assembly. If you want to learn, this book is a brilliant place to start. I will, however, point out a little bit of weirdness in the file. Do you see stuff like this?
```
EH_frame0:
Lsection_eh_frame:
Leh_frame_common:
Lset0 = Leh_frame_common_end-Leh_frame_common_begin
    .long Lset0
Leh_frame_common_begin:
    .long 0
    .byte 1
    .asciz "zR"
    .byte 1
    .byte 120
    .byte 16
    .byte 1
    .byte 16
    .byte 12
    .byte 7
    .byte 8
    .byte 144
    .byte 1
    .align 3
```
I was initially curious as to what this was as well, so I checked out Stack Overflow and came across a really great explanation of what this bit means, which you can read here.
Also, notice the following:
```
callq _puts
```
The assembly program is calling `puts` instead of `printf`. This is an example of the kind of optimisation GCC will do for you, even on the default level of "no optimisation" (the `-O0` flag on the command line). `printf` is a really heavy function, due to having to deal with a large range of format codes. `puts` is far less heavy. I could only find the NetBSD version of it. `puts` itself is very small and it delegates to `__sfvwrite`, the code of which is here.

If you want more information on how GCC will optimise `printf`, this is a great article.
Also, if assembler is a bit new to you, one thing to note is that this post is using GAS (GNU Assembler) syntax. There are different assemblers out there; a lot of people like the Netwide Assembler (NASM), which has a more human-friendly syntax.
GAS suffixes its commands with a letter that describes what "word size" we're dealing with. Above, you'll see we used `callq`. The `q` stands for "quad", which is a 64 bit value. Here are other suffixes you may run into:

- b = byte (8 bit)
- s = short (16 bit integer) or single (32 bit floating point)
- w = word (16 bit)
- l = long (32 bit integer or 64 bit floating point)
- q = quad (64 bit)
- t = ten bytes (80 bit floating point)
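If it helps to anchor those widths in C terms, here's a little sketch. This assumes a typical x86-64 platform; the fixed-width types from `stdint.h` make the correspondence explicit:

```c
#include <stdint.h>

/* The operand widths behind the GAS suffixes, expressed as C types.
   These sizes hold on a typical x86-64 platform (an assumption;
   other ABIs can differ, particularly for floating point). */
_Static_assert(sizeof(int8_t)  == 1, "b = byte");
_Static_assert(sizeof(int16_t) == 2, "w = word");
_Static_assert(sizeof(int32_t) == 4, "l = long (integer)");
_Static_assert(sizeof(int64_t) == 8, "q = quad");
_Static_assert(sizeof(float)   == 4, "s = single-precision float");
_Static_assert(sizeof(double)  == 8, "l = double-precision float");
```

If any of those assertions fail on your platform, the file simply won't compile, which is a nice way to check your assumptions.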
# Assembling into machine code
By comparison, turning assembly instructions into machine code is pretty simple; compiling is a much more difficult step than assembling. Assembly instructions often map 1-to-1 onto machine code.
At the end of the assembling stage, you would expect to have a file that just contains binary instructions, right? Sadly, that's not quite the case. The operating system needs to know a lot more about your code than just the instructions. To facilitate passing this required meta-information, there are a variety of binary file formats. A very common one on *nix systems is ELF: the Executable and Linkable Format.
Your program will be broken up into lots of sections. For example, a section called `.text` contains your program code. A section called `.bss` stores statically allocated variables (globals, essentially) that are not given a starting value, and thus get zeroed. A section called `.strtab` contains the strings that name your symbols. Your string literals, meanwhile, go into a read-only data section. In our `hello.c` example, the bytes of `"Hello, world!\n"` will end up in one of those read-only data sections.
This article, from issue 13 of Linux Journal in 1995, gives a really good overview of the ELF format from one of the people who created it. It's quite in depth and I didn't understand everything he said (still not sure on relocations), but it's very interesting to see the motivations behind the format.
# Linking into an executable
Coming back from the previous tangent, let's think about linking. When you compile multiple files, the `.c` files get compiled into `.o` files. When I first started doing C code, one thing that continuously baffled me was how a `.c` file referenced a function in another `.c` file. You only reference `.h` files in a `.c` file, so how did it know what code to run?
The way it works is by creating a symbol table. There are a multitude of types of symbols in an executable file, but the general gist is that a symbol is a named reference to something. The `nm` utility allows you to inspect an executable file's symbol table. Here's some example output:
```
$ nm hello
0000000100001048 B _NXArgc
0000000100001050 B _NXArgv
0000000100001060 B ___progname
0000000100000000 A __mh_execute_header
0000000100001058 B _environ
                 U _exit
0000000100000ef0 T _main
                 U _puts
0000000100001000 d _pvars
                 U dyld_stub_binder
0000000100000eb0 T start
```
Look at the symbols labelled with the letter `U`. We have `_exit`, `_puts` and `dyld_stub_binder`. The `_exit` symbol is operating system specific and will be the routine that knows how to return control back to the OS once your program has finished, the `_puts` symbol is very important for our program and exists in whatever libc we have, and `dyld_stub_binder` is an entry point for resolving dynamic loads. All of these symbols are "unresolved", which means if you try and run the program and no suitable match is found for them, your program will fail.
So when you create an object file, the reason you include the header is because everything you use from that header file will become an unresolved symbol. The process of linking multiple object files together will do the job of finding the appropriate function that matches each symbol and linking them together into the final executable.
To demonstrate this, consider the following C file (note the call to `test()`, which has a declaration but no definition):

```c
#include <stdio.h>

extern void test(void);

int main(int argc, char* argv[]) {
    printf("Hello, world!\n");
    test();
    return 0;
}
```
Compiling this file into an object file and then inspecting the contents will show you the following:
```
$ gcc -c hello.c
$ nm hello.o
0000000000000050 r EH_frame0
000000000000003b r L_.str
0000000000000000 T _main
0000000000000068 R _main.eh
                 U _puts
                 U _test
```
We now have an unresolved symbol called `_test`! The linker will expect to find that somewhere else and, if it does not, will throw a bit of a hissy fit. Trying to link this file on its own complains about 2 unresolved symbols, `_test` and `_puts`. Linking it against libc complains about one unresolved symbol, `_test`.
Unfortunately, because we don't actually have a definition for `test()`, we can't use it. This may sound confusing, seeing as we defer the linking of `puts()` until runtime. Why can't we just do the same with `test()`? Build an executable file and let the loader/linker try and figure it out at runtime?
In the linking process you need to specify where the linker will be able to find things on the target system. Let's step through the original `hello.c` example, doing each of the compilation steps ourselves:

```
$ gcc -c hello.c
```

This creates `hello.o` with an unresolved `_puts` symbol.

```
$ ld hello.o
```
This craps out. We need to give it more information. At this point I'm going to mention that I'm on a Mac system and am about to reference libraries that have different names on a Linux system. As a general rule here, you can replace the `.dylib` extension with `.so`:

```
$ ld hello.o /usr/lib/libc.dylib
```
This still craps out. Check out this error message:

```
ld: entry point (start) undefined. Usually in crt1.o for inferred
architecture x86_64
```
What the hell? This is a really good error to come across and learn about, though. It leads us nicely into the next section.
# Running the program
Wait, didn't we finish the last section with an object file that wouldn't link for some arcane reason? Yes, we did. But getting to a point where we can successfully link it requires us to know a little bit more about how our program starts running when it's loaded into memory.
Before every program starts, the operating system needs to set things up for it. Things such as a stack, a heap, a set of page tables for accessing virtual memory and so on. We need to "bootstrap" our process and set up a good environment for it to run in. This setup is usually done in a file called `crt0.o`.
When you started learning programming and you used a language that got compiled, one of the first things you learned was that your program's entry point is `main()`, right? The true story is that your program doesn't start in `main`, it starts in `start`. This detail is abstracted away from you by the OS and the toolchain, though, in the form of the `crt0.o` file.

The osdev wiki shows a great example of a simple `crt0.o` file that I'll copy here:
```
.section .text

.global _start
_start:
    # Set up end of the stack frame linked list.
    movq $0, %rbp
    pushq %rbp
    movq %rsp, %rbp

    # We need those in a moment when we call main.
    pushq %rsi
    pushq %rdi

    # Prepare signals, memory allocation, stdio and such.
    call initialize_standard_library

    # Run the global constructors.
    call _init

    # Restore argc and argv.
    popq %rdi
    popq %rsi

    # Run main
    call main

    # Terminate the process with the exit code.
    movl %eax, %edi
    call exit
```
07/08/2013 UPDATE: In a previous version of this post I got this bit totally wrong, confusing the 32bit x86 calling convention with the x86-64 calling convention. Thanks to Craig in the comments for pointing it out :) The below should now be correct.
The line that's probably most interesting there is where `main` is called. This is the entry point into your code. Before it happens, there is a lot of setup. Also notice that `argc` and `argv` handling is done in this file, but it assumes that the loader has pushed the values into registers beforehand.
Why, you might ask, do `argc` and `argv` live in `%rdi` and `%rsi` before being passed to your main function? Why are those registers so special?
The reason is something called a "calling convention". This convention details how arguments should be passed to a function call before it happens. The calling convention in x86-64 C is a little bit tricky but the explanation (taken from here) is as follows:
Once arguments are classified, the registers get assigned (in left-to-right order) for passing as follows:
- If the class is MEMORY, pass the argument on the stack.
- If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used
For example, take this C code:
```c
int add(int a, int b) {
    return a + b;
}

int main(int argc, char* argv[]) {
    add(1, 12);
    return 0;
}
```
The assembler that would call that function goes something like this:

```
movq $1, %rdi
movq $12, %rsi
call add
```
The `$12` and `$1` there are the literal, decimal values being passed to the function. Easy peasy :) The convention isn't something that needs to be followed in your own assembly code. You're free to put arguments wherever you want, but if you want to interact with existing library functions then you need to do as the Romans do.
With all of this said and done, how do we correctly link and run our `hello.o` file? Like so:

```
$ ld hello.o /usr/lib/libc.dylib /usr/lib/crt1.o -o hello
$ ./hello
Hello, world!
```
Hey! I thought you said it was `crt0.o`? It can be... `crt1.o` is a file with exactly the same purpose but it has more in it. `crt0.o` didn't exist on my system, only `crt1.o` did. I guess it's an OS decision. Here's a short mailing list post that talks about it.
Interestingly, inspecting the symbol table of the executable we just linked together shows this:
```
$ nm hello
0000000000002058 B _NXArgc
0000000000002060 B _NXArgv
                 U ___keymgr_dwarf2_register_sections
0000000000002070 B ___progname
                 U __cthread_init_routine
0000000000001eb0 T __dyld_func_lookup
0000000000001000 A __mh_execute_header
0000000000001d9a T __start
                 U _atexit
0000000000002068 B _environ
                 U _errno
                 U _exit
                 U _mach_init_routine
0000000000001d40 T _main
                 U _puts
                 U dyld_stub_binder
0000000000001e9c T dyld_stub_binding_helper
0000000000001d78 T start
```
The reason is that `.dylib` and `.so` files (they have the same job, but on Mac they have the `.dylib` extension and probably a different internal format) are dynamic or "shared" libraries. They will tell the linker that they are to be linked dynamically, at runtime, rather than statically, at compile time. The `crt*.o` files are normal objects, and link statically, which is why the `start` symbol has an address in the above symbol table.
# The Death of a Running Program
You return a number from `main()` and then your program is done, right? Not quite. There is still a lot of work to be done. For starters, your exit code needs to be propagated up to any parent processes that may be anticipating your death. The exit code tells them something about how your program finished. Exactly what it tells them is entirely up to you, but the standard is that 0 means everything was okay, and anything non-zero (up to a max of 255) signifies that an error occurred.
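That 255 cap is worth seeing in action. Here's a POSIX-only sketch (the helper name `observed_exit_code` is mine): only the low 8 bits of the code survive the trip to the parent process.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that dies with `code`, then return the exit code the
   parent actually observes via waitpid(). */
int observed_exit_code(int code) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);              /* child: die immediately with `code` */

    int status;
    waitpid(pid, &status, 0);     /* parent: reap the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `observed_exit_code(300)` gives back 44, because 300 & 0xFF is 44. The extra bits are simply gone by the time the parent sees them.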
There is also a lot of OS cleanup that happens when your program dies. Things like tidying up file descriptors and deallocating any heap memory you may have forgotten to `free()` before you returned. You should totally get into the habit of cleaning up yourself, though!
# Wrapping up
So that's about the extent of my knowledge on how your code gets turned into a running program. I know I missed some bits out, oversimplified some things and I was probably wrong in places. If you can correct me on any point, or have anything illuminating about how non-x86 or non-ELF systems do the above tasks, I would love to have a discussion about it in the comments :)