
Language Interoperability From the Ground Up

How does a function in Ruby call a function in C? How does a function in Kotlin call a function in Java?

It's no secret that you can call functions in one programming language from another. Not only is it possible, it can be done in multiple ways. But how does it actually work?

This post is going to be a bottom-up, in-depth look at interoperability. Each section will build upon the previous until we have an understanding of the mechanisms that make language interoperability possible.

# Who should read this post?

This post is for you if you have been programming for a little while and have heard people talk about whether a library has "Java bindings" or "Ruby bindings."

It's for you if you have been using a language like Kotlin, have called Java functions from it, and wondered how on earth that works under the hood.

And it's for you if you have a keen interest in the gory details of complicated systems, but want an explanation that doesn't assume much and keeps the examples as simple and hands-on as possible.

Additionally, if you would like to follow along with the examples, you will need the following programs installed on your computer and accessible from the command line: nm, gcc, nasm, gdb, javac, javap, kotlinc, kotlin, and ruby. All of the examples in this post were written and run on an Arch Linux installation; they're not likely to work as shown here on a Mac or Windows machine.

# Why is language interoperability important?

The majority of language interoperability involves higher-level languages calling libraries written in lower-level languages: a Java library that offers you the ability to call into the native OpenSSL library, for example, or a Rust library that provides a more idiomatic Rust API on top of the cURL library, which is written in C.

Duplication is bad. When you've written a complicated library in one language, reimplementing it in another is just one more thing to forget to maintain.

Additionally, it's a good idea for young languages to be able to make the most of existing work. If you create a new language and include a mechanism for using existing native libraries, you can immediately draw upon decades of hard work.

# The benefits of purity

Sometimes you might see a library advertised as a "pure Ruby" implementation, or a "pure Java" implementation of some other library. These libraries will be full reimplementations of their target technology.

For example, there is a pure Java implementation of LevelDB. It would have been a lot of work for the author, but the advantage is that people can use LevelDB from Java without having to install the native LevelDB library on their system and package it up with their Java code for deployment.

While a pure reimplementation can be a lot of up-front effort and ongoing maintenance, it can be easier to use.

# Why start at the bottom and work up?

The key to interoperability is finding common ground. With computers, the common ground is a set of standards and conventions in the lower levels of how programs work that allow languages to speak to each other. Understanding these conventions is key to understanding when languages can interoperate, and when they can't.

# Rock bottom: assembly language

mov eax, 10

That there is a line of "assembly code." It is a single "instruction," and it consists of a "mnemonic," mov, and two "operands," eax and 10.

This is the "move" instruction, and it instructs our CPU to move the value 10 in to the register eax. It is part of an instruction set called "x86," which is used in 32-bit Intel and AMD CPUs1.

[1]: Don't worry too much about these words if they don't mean anything to you. They're just terms you may see in the wild and, when you do, you'll know that the things in this post are what they are referring to.

# What is a register?

Registers are small but fast pieces of storage connected directly to your CPU. If you want to do something to some data, be it addition, subtraction, or some other operation, the data first needs to be loaded into a register.

If you're not used to writing code at this level, the concept of registers might seem silly. Why can't we just store 10 in memory and operate on it there?

Because that isn't physically possible. The CPU isn't connected to memory directly; it is connected through a series of caches, and all memory access must go through the Memory Management Unit, or MMU. The MMU can't process any operations. That happens in the Arithmetic Logic Unit, or ALU,[2] which sits inside the CPU and can only get at data if it is in a register.

[2]: With exceptions, including floating point calculations, which happen on the Floating Point Unit, or FPU.

(Figure: the MMU and friends)

This is not to scale. In reality, there is around a kilobyte of registers, a few hundred kilobytes of L1 cache, a megabyte or two of L2 cache, low tens of megabytes of L3 cache, and then main memory is often tens of gigabytes.

# Program flow

Doing mathematical operations is great and all, but to write code that does anything useful we need to be able to compare things and make decisions based on the outcome.

mov    eax, 10
sub    eax, 5
mov    ebx, 5
cmp    eax, ebx
je    equal

notequal:
jmp    exit

equal:
; do something here

exit:
; end of program

We start this snippet with some moves and a subtract (sub), introducing a new register, ebx, along the way. Then we do a cmp. The cmp instruction compares two values; they can either both be registers, or one of them can be an "immediate" value, e.g. 10. The result of the comparison is stored in a special "flags" register that we don't touch directly. Instead, we use a family of "jump" instructions that read the flags register and decide which instruction to run next.

The je instruction will jump to a point in our code, denoted by the "label" equal, if the comparison found both values to be equal. Otherwise, the code falls through to the next instruction, which will be whatever we decide to put below our notequal label. For now, we just do a jmp, which is an unconditional jump, to the end of our program at exit.
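je is just one member of a whole family. Here are a few common ones, as a non-exhaustive sketch (somewhere is a placeholder label):

je     somewhere    ; jump if the two values were equal
jne    somewhere    ; jump if they were not equal
jl     somewhere    ; jump if the first was less than the second (signed)
jg     somewhere    ; jump if the first was greater than the second (signed)
jmp    somewhere    ; jump unconditionally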

"Labels" are a new concept in this snippet. They aren't instructions and the CPU doesn't attempt to run them. Instead, they're used to mark points in the code that can be jumped to, which is what we've used them for here. You can spot labels by looking for the trailing colon.

# Accessing memory

So far so good, but we're only touching registers at the moment. Eventually we will run out of space in registers and have to fall back to using memory. How does that work in assembly?

mov eax, 0x7f00000
mov dword [eax], 123

The first bit should be familiar: we're storing the hexadecimal value 0x7f00000 into eax. But then we do something funky with the next instruction.

The square brackets around [eax] mean that we want to store a value at the memory address held in eax. The dword keyword signifies that we want to store a 4-byte value (a "double word"). We need dword in there because without it there's no way to infer how large the value 123 should be in memory.

If you're familiar with pointers in C, this is essentially how pointers work. You store the memory address you want to point to in a register and then access that memory by wrapping the register name in square brackets.
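In C, those two instructions might look something like this. This is just a sketch; writing to an arbitrary address like this would crash a real program:

int *p = (int *)0x7f00000;  /* put the address in a variable, like eax */
*p = 123;                   /* store a 4-byte int at that address, like [eax] */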

We can take this a little further:

mov eax, 0x7f00000
mov dword [eax + 1], 123

This will store the value 123 at address 0x7f00001, one byte higher than before. The value of eax isn't modified by doing this; the syntax just lets us access the value at a register plus an offset. This is commonly seen in real-world assembly code, and we'll see why later on.

# The stack

With the fundamentals out of the way, we arrive at our first recognisable concept from higher-level languages. The stack is where languages typically store short-lived variables. But how does it work?

When your program starts, the value of a special register called esp is set to a memory location that represents the top of the stack space you are allowed to access, which is itself near the top of your program's "address space."

You can examine this phenomenon by writing a simple C program and running it in a debugger:

int main() {
    return 42;
}

Compile with gcc and run using gdb, the "GNU Debugger":

$ gcc main.c -o main -m32
$ gdb main
gdb> start
gdb> info register esp
esp    0xffffd468    0xffffd468

When we say start at the GDB prompt, we're asking it to load our program into memory, start running it, and pause it as soon as it starts. Then we inspect the contents of esp with info register esp.

# A brief detour into memory organisation

Why is the stack near the top of the address space? What even is an address space?

When your program gets loaded into memory on a 32-bit system, it is given its own address space that is 4GB in size: 0x00000000 to 0xffffffff. This happens no matter how much physical RAM your machine has installed. Mapping this "virtual" address space to physical memory is another one of the jobs the MMU performs, and the details are beyond the scope of this post.

(Figure: memory layout visualised)

This simplified view of memory shows that our program is loaded somewhere fairly low down in the address space, our stack is up near the top and grows down, then we have this mysterious place called the "heap" that lives just above our program. It's not important for language interoperability, but the heap stores long-lived data, for example things you've allocated using malloc() in C or new in C++.

This allows for a fairly efficient use of the available space. In reality, the stack is limited in size to a handful of megabytes and exceeding that will cause your program to crash. The heap can grow all the way up to the base of the stack but no further. If it ever attempts to overlap with the stack, your malloc() calls will start to fail, which causes many programs to crash.
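This is why careful C code checks the return value of malloc(). A minimal sketch:

#include <stdlib.h>

int main() {
    int *data = malloc(1000 * sizeof(int));
    if (data == NULL) {
        return 1;  /* allocation failed; the heap couldn't grow any further */
    }
    free(data);
    return 0;
}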

# Back to the stack

The x86 instruction set gives us some instructions for adding and removing values from the stack.

push    1
push    2
push    3
pop    eax
pop    ebx
pop    ecx

At the end of this sequence of instructions, the value of eax will be 3, ebx will be 2 and ecx will be 1. Don't trust me? We can verify this for ourselves with a couple of small modifications.

global main
main:
    push    1
    push    2
    push    3
    pop    eax
    pop    ebx
    pop    ecx
    ret

Save this into a file called stack.s and run the following commands:

$ nasm -f elf stack.s
$ gcc -m32 stack.o -o stack

If you don't have nasm you'll need to install it. It's a type of program called an "assembler" and it can take assembly instructions and compile them down to machine code.

We now have an executable file in our working directory called "stack". It's a bona fide program you can run like any other. The only modifications we had to make were giving it a global main label and making sure it correctly returns control back to the operating system with the ret instruction. We'll explore ret in more detail later.

Running this program doesn't really do anything. It will run, but exit silently.[3] To see what's going on inside of it, we will once again need a debugger.

[3]: It's not 100% silent. Check the exit status of the program when it finishes running. Why do you think it exits with the status it does?
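In most shells, $? holds the exit status of the last command run:

$ ./stack
$ echo $?
3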

$ gdb stack
gdb> break *&main+9
gdb> run
gdb> info registers eax ebx ecx

The output of this sequence of commands should verify what I said earlier about the contents of those registers. gdb allows us to load up a program, run it until a certain point (break *&main+9 is us telling gdb to stop just before the ret instruction) and then examine the program state.

# So what do push and pop actually do?

The push and pop instructions are shorthand and can be expanded to the following:

; push
sub    esp, 4
mov    [esp], operand

; pop
mov    operand, [esp]
add    esp, 4

All of this should be familiar to you from previous sections, and it neatly demonstrates how the stack grows downwards and shrinks upwards as things are added and removed.

# Functions in assembly

Language interoperability is, at its most fundamental, the ability to call a function written in one language from a different language. So how do we define and call functions in assembly?

With the knowledge we have so far, you might be tempted to think that function calls look like this:

main:
    jmp myfunction

myfunction:
    ; do things here

A label to denote where your function is in memory, and a jump instruction to call it. This approach has two critical problems: it doesn't handle passing arguments to, or returning values from, the function, and it doesn't handle returning control to the caller when your function ends.

We could solve the first problem by putting arguments and return values on the stack:

main:
    push    1
    push    2
    jmp    add
    pop    eax

add:
    pop    eax
    pop    ebx
    add    eax, ebx
    push    eax

This would work really well if only our add function were able to jump back to the caller when it was finished. At the moment, when add ends, the program ends. In an ideal world it would return back to just after the jmp in main.

What if we saved where we were when we called the function and jumped back to that location when the function was finished?

The eip register holds the location of the currently executing instruction. Using this knowledge, we could do this:

main:
    push    1
    push    2
    push    eip
    jmp    add
    pop    eax

add:
    pop    edx    ; store the return address for later
    pop    eax    ; 2
    pop    ebx    ; 1
    add    eax, ebx
    push    eax
    mov    eip, edx

We're getting there. This approach has a couple of problems, though: we modify esp a lot more than we really have to, and x86 doesn't let you mov things into eip.

Here's what a compiler would actually generate for our example above:

main:
    push    ebp
    mov    ebp, esp
    push    2
    push    1
    call    add
    add    esp, 8
    pop    ebp
    ret

add:
    push    ebp
    mov    ebp, esp
    mov    edx, dword [ebp + 8]
    mov    eax, dword [ebp + 12]
    add    eax, edx
    pop    ebp
    ret

This is a lot to take in, so let's go through it line by line.

We're introducing a new special register: ebp. This is the "base pointer" register, and its purpose is to act as a pointer to the top of the stack at the moment a function is called. Every function starts by saving the old value of the base pointer on the stack and then moving the new top of the stack into ebp.

Next we do the familiar pushing of arguments on to the stack. At least we got that right, though notice that the compiler pushes them in reverse: in this calling convention, arguments go on the stack right to left. Then we use an instruction we haven't seen before called call. call can be expanded to the following:

; call
push    eip + 2
jmp    operand

With eip + 2 meaning the instruction after the jmp below it. The exact value doesn't matter; just think of it as pushing the address of the instruction after the call on to the stack so we can return to it later.

Then control is passed to add, which follows a similar pattern. The base pointer is saved, the stack pointer becomes the new base pointer, and then we get to see the base pointer in action.

mov    edx, dword [ebp + 8]
mov    eax, dword [ebp + 12]

This code is pulling the two arguments to add off of the stack and into registers so that we can operate on them. But why are they 8 bytes and 12 bytes away?

Remember we pushed the arguments on to the stack, then call pushed the address to return to, and the prologue of add pushed the old base pointer. This means that, starting at ebp and moving up, the first 4 bytes are the saved base pointer, the next 4 bytes are the return address, and the 8 bytes after that are our arguments. To get to the first argument you move 8 bytes up from the base pointer (the stack grows downwards, so its contents sit at higher addresses), and to get to the second argument you move 12 bytes up.
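Laid out relative to ebp inside add, with higher addresses at the top, the stack looks like this:

[ebp + 12]    second argument (2)
[ebp + 8]     first argument (1)
[ebp + 4]     return address pushed by call
[ebp]         old base pointer saved by the prologue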

(Figure: the stack frame)

This has the added benefit of not requiring us to modify esp with every single push and pop instruction. It's a small saving, but when you consider that this has to happen for every argument to every function called, it adds up.

When we've done what we need to do in our add function, we perform the steps we did at the start but in reverse order.

pop    ebp
ret

The ret instruction is special because it allows us to set the value of eip. It pops the return address that call pushed for us and jumps to it, returning control of the program to the calling function.

The same thing happens in main, except there's a subtle difference. The add esp, 8 is necessary to "free" the arguments to add that we pushed on to the stack. If we don't do this, the pop ebp will not correctly restore the base pointer, and we'll likely end up referring to memory we never intended to, crashing our program.

Lastly, you'll notice that add doesn't push its result back on to the stack when it's done. It leaves the result in eax. This is because it's conventional for a function's return value to be stored in eax.

# Conventions

We've just done the deepest possible dive on how function calls work in x86. Now let's put names to each of the things we have learnt.

Saving the base pointer and moving the stack pointer at the start of a function is called the function prologue.

Restoring the stack pointer and base pointer at the end of a function is called the function epilogue.

Those two concepts, along with returning values in eax and passing function arguments on the stack, make up what's called a calling convention, and calling conventions are part of a larger concept known as an application binary interface, or ABI. Specifically, everything I have described so far is part of the System V ABI, which is used by almost all Unix and Unix-like operating systems.
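This shared convention is exactly what makes interoperability possible: anything that follows it can call our add function. As a preview of where we're heading, here's a C file, which I've called caller.c, that calls the assembly add. It assumes add lives in its own file marked with a global add directive (we'll meet global shortly) and has been assembled with nasm -f elf:

extern int add(int a, int b);

int main() {
    return add(1, 2);  /* the result, 3, becomes our exit status */
}

$ gcc -m32 caller.c add.o -o caller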

Before we start calling functions written in one language from functions written in another, there's one last thing we need to be aware of.

# Object files

When you compile a C program, quite a lot of things happen under the hood. The most important concept to understand for language interoperability is "object files."

If your program consists of two .c files, the first thing a compiler does is compile each of those .c files into a .o file. There is typically a 1-to-1 mapping between .c files and .o files. The same is true of assembly, or .s, files.
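You can see this mapping for yourself by telling gcc to stop after the compile step with the -c flag. The file names here are just examples:

$ gcc -c foo.c bar.c
$ ls
bar.c  bar.o  foo.c  foo.o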

Object files, on Linux and lots of other Unix-like operating systems, are in a binary format called the "Executable and Linkable Format," or ELF for short. If you're interested in the full range of data found inside an ELF file, you can use the readelf -a <elffile> command to find out. The output is likely to be quite dizzying, though, and we're only really interested in one of its features here.

# Symbol tables

To explain symbol tables, let's split the assembly from earlier into two .s files:

add.s:

add:
    push    ebp
    mov    ebp, esp
    mov    edx, dword [ebp + 8]
    mov    eax, dword [ebp + 12]
    add    eax, edx
    pop    ebp
    ret

main.s:

main:
    push    ebp
    mov    ebp, esp
    push    2
    push    1
    call    add
    add    esp, 8
    pop    ebp
    ret

Assemble both of the files:

$ nasm -f elf add.s
$ nasm -f elf main.s
main.s:6: error: symbol `add' undefined

Whoops. Our assembler isn't happy about us calling an undefined function. This is sticky, because we want to define that function elsewhere and call it in main.s, but it seems here like the assembler doesn't allow that.

The problem is that there is no add symbol defined in this file. If we want to tell the assembler that we intend to find this symbol in another file, we need to say so. Add this line to the top of main.s:

extern add

And now it should assemble without complaint. Before we go further, have a look in your working directory. You should have two .o files: main.o and add.o. We can look at the contents of their symbol tables with a tool called nm:

$ nm add.o
00000000 t add

$ nm main.o
         U add
00000000 t main

The first column is the address of the symbol, the second column is the type of the symbol, and the third column is the name of the symbol. Notice that the symbol names match up with our label names. Also note that main.o has an add symbol of type U, which means "undefined": we will need to find a definition for it when we link these object files together later.

Both of our defined functions have a symbol type of t, which means the symbol points to some code (the "text" section). The lowercase t also means the symbol is local to its object file, which is about to cause us a problem.

To create an executable out of these object files, we run:

$ gcc -m32 main.o add.o -o main

This will shout at you, claiming that it cannot find either main or add. What gives?

Unless we explicitly say so, the symbols in an object file cannot be used by other object files when the compiler links them together. To fix this, we need to add:

global add

To the top of add.s and:

global main

To the top of main.s. This allows the symbols to be linked and the result is that the compiler now takes our object files and creates an executable out of them without complaint.
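If you re-run nm over the freshly assembled object files, you should see that the symbol types have changed from t to an uppercase T, marking them as global:

$ nm add.o
00000000 T add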

$ nm main

This will produce a lot of output now, because the compiler has to link in a lot of administrative stuff, like the libc constructor and destructor handlers, __libc_csu_init and __libc_csu_fini respectively. Don't worry about them; the important thing is that both main and add are defined and the program runs without complaint.

(Figure: assembling and linking the object files)

# Calling a C function from C++

Let's go up a level and look at some C and C++.

main.cpp:

extern int inc(int i);

int main() {
    return inc(2);
}

The first line here is us telling the compiler to expect to find a function called inc in another file.

inc.c:

int inc(int i) {
    return i + 1;
}

Here's the set of steps we need to follow to compile them both separately and then link them together:

$ gcc -c main.cpp -o main.o
$ gcc -c inc.c -o inc.o
$ gcc inc.o main.o

Unfortunately, this fails with a seemingly unfathomable error:

main.o: In function `main':
main.cpp:(.text+0xa): undefined reference to `inc(int)'
collect2: error: ld returned 1 exit status

But we supplied an object file with a definition of inc(int) in it, and we made sure to tell the compiler to expect a function called inc(int). So why can't the linker find it?

# Name mangling

Sadly C++ wasn't able to provide all of the features it wanted to without diverging from the System V ABI a little bit. When you compile a function in C, that function gets given a symbol with the name you gave it so that others can call it by that name.

C++ does not do this by default. As well as the name you give a function, C++ tacks on information about its return type and argument types. If the function is in a class, information about the class is also included. It does this to allow you to overload the name of a function, so you can define multiple variations that take different arguments. This is called name mangling.

As a result, when we compiled our main.cpp file, it was told to look for a function called _Z3inci instead of plain old inc. Our inc.c file provides a function called inc, and as such the two languages cannot interoperate without a little bit of help.
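You can see the mangled name for yourself by running nm over main.o (output trimmed, and your addresses may differ), and a tool like c++filt can demangle it:

$ nm main.o
                 U _Z3inci
0000000000000000 T main

$ echo _Z3inci | c++filt
inc(int)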

Fortunately, the problem is easily solved by adding 4 characters to our main.cpp:

extern "C" int inc(int i);

This addition of "C" tells the C++ compiler to look for a function with plain old C linkage and calling conventions, and this includes using the plain old C name of inc. Attempting to compile the code should now work as expected.

# Calling a C++ function from C

This relationship works similarly in the other direction. If we want to write a function in C++ but expose it in a way that a C program could call it, we would need to use extern "C" on that function:

extern "C" int inc(int i) {
    return i + 1;
}
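The C side then declares and calls inc like any other C function. Here's a sketch, with file names of my choosing; note that the C++ file needs a C++ compiler, and it's safest to let g++ do the final link so the C++ runtime gets pulled in:

int inc(int i);  /* no extern "C" needed here: this is a C file */

int main() {
    return inc(2);
}

$ g++ -c inc.cpp
$ gcc -c main.c
$ g++ main.o inc.o -o main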

# What about Java?

What we've discussed up until now is the fundamental basis for how language interoperability works between native languages, but what about a language that runs on a virtual machine, such as Java?

The process is a little more involved, but doable. The problem arises because Java runs on a thing called the Java Virtual Machine, or JVM, which acts as a layer of indirection between Java code and the machine on which it runs. Because of this, Java cannot link directly to native libraries in the same way that C and C++ can. We have to introduce a layer that translates between the native world and the JVM world.

Fortunately, the people behind Java gave this a lot of thought and they came up with the "Java Native Interface," or JNI. It's the accepted way to get Java code to talk to native code, and here's how it works:

public class Interop {
  static {
    System.loadLibrary("inc");
  }

  public static native int inc(int i);

  public static void main(String... args) {
    System.out.println(inc(2));
  }
}

Notice the use of the native keyword. This tells Java that the implementation for this function is defined in some native code somewhere. The System.loadLibrary("inc") line will search Java's library path for a library called libinc.so and, when it finds it, we will be able to use the Java function inc to call our native code!

But how do we do that?

Step 1: Generate the JNI header file from our code.

$ javac -h . Interop.java

The -h . tells javac to generate a file called Interop.h in the current directory. This file defines the function we have to implement. It looks like this:

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class Interop */

#ifndef _Included_Interop
#define _Included_Interop
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     Interop
 * Method:    inc
 * Signature: (I)I
 */
JNIEXPORT jint JNICALL Java_Interop_inc
  (JNIEnv *, jclass, jint);

#ifdef __cplusplus
}
#endif
#endif

This looks a lot scarier than it is. The lines containing _Included_Interop are just "header guards" that make sure we can't accidentally include this file twice, and the __cplusplus bit checks if we're compiling as C++ code and, if we are, wraps everything in an extern "C" block, which you'll remember from earlier in this post.

The rest is the definition of the JNI function we have to implement:

JNIEXPORT jint JNICALL Java_Interop_inc
  (JNIEnv *, jclass, jint);

It might not look like one, but this is indeed a function declaration. We implement it like so:

inc-jni.c:

#include <jni.h>
#include "Interop.h"

JNIEXPORT jint JNICALL Java_Interop_inc
  (JNIEnv* env, jclass class, jint i) {
    return (jint)(i + 1);
}

The jint cast is to make sure our integer is the size that Java is expecting it to be. The jni.h include is required for all of the Java-specific things we're seeing, such as jint and jclass.

To compile this we need to execute a pretty gnarly gcc call:

$ gcc -I"$JAVA_HOME/include" -I"$JAVA_HOME/include/linux" -fPIC -shared -o libinc.so inc-jni.c

The -I flags tell gcc where to find header files, which we need for the #include <jni.h> line to work. The -fPIC -shared flags create a special type of object file called a "shared object." What's special about this type of object file is that, instead of compiling directly against it, you can load it into your process at runtime. Shared object files are, by convention, named lib<something>.so.
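You can check that the shared object really does export our JNI function with nm. The -D flag lists the symbols made available to other programs at runtime (output trimmed, and the address will differ on your machine):

$ nm -D libinc.so
...
00000660 T Java_Interop_inc
...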

Now we can run the Java code like so:

$ java -Djava.library.path=. Interop
3

Voila! We successfully called our native inc implementation from Java! How cool is that?

# What about Ruby?

Ruby has a similar approach to Java. It exposes an API called "Ruby native extensions" that you can hook into to call native functions from Ruby. Given that we have explored Java's way of doing this, and Ruby's is not too dissimilar, I want to use Ruby to focus on a different, more convenient way of calling native code.

# ffi

ffi stands for "foreign function interface," and it's a Ruby gem that allows us to call functions in existing shared object files with very little setup. First, we install the ffi gem:

$ gem install ffi

Then we need to compile our original inc.c file into a shared object:

$ gcc -fPIC -shared -o libinc.so inc.c

And then we can write the following short snippet of Ruby code and call our inc function:

require 'ffi'

module Native
  extend FFI::Library
  ffi_lib './libinc.so'
  attach_function :inc, [:int], :int
end

puts Native.inc(2)
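Save the snippet as inc.rb, a name I've picked for this example, and run it:

$ ruby inc.rb
3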

This method makes us jump through far fewer hoops than Java's JNI does, and has the benefit of allowing us to call existing shared objects without modifying them. For example, you can call directly into libc.so:

require 'ffi'

module Libc
  extend FFI::Library
  ffi_lib 'c'
  attach_function :puts, [:string], :int
end

Libc.puts "Hello, libc!"

Notice that we only had to specify 'c' as the library name. This is because Unix-like systems have standardised paths in which to find libraries, often including /lib and /usr/lib, and by default ffi will look for libraries with the naming scheme lib<name>.so. If you run ls /usr/lib/libc.so, you should find a file exists at that path.

# Why doesn't Java have FFI?

It does! There's a library called JNA that does the same job that Ruby's FFI library does.

import com.sun.jna.Library;
import com.sun.jna.Native;

class JNA {
  public interface Libc extends Library {
    Libc INSTANCE = (Libc)Native.loadLibrary("c", Libc.class);
    public int puts(String s);
  }

  public static void main(String... args) {
    Libc.INSTANCE.puts("Hello, libc!");
  }
}

We have to download the JNA jar first; jna-4.1.0.jar is the version used here. Then we compile and run the Java code with the JNA library on the classpath:

$ javac -classpath jna-4.1.0.jar JNA.java
$ java -classpath jna-4.1.0.jar:. JNA
Hello, libc!

# Why would anyone use a language's native interface if FFI is so much more convenient?

Lots of languages have some form of FFI library, and they're very convenient for calling into existing native libraries. The problem, though, is that it's one-way communication. The library you're calling into can't, for example, modify a Java object directly. It can't call Java code. The only way to do that is to use Java's JNI or Ruby's native extensions, because they expose an API for doing exactly that.
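To give a flavour of what that two-way API looks like, here's a sketch of our JNI function calling back into Java before it returns. The static method Interop.log is made up for this example, and error handling is omitted, but GetStaticMethodID and CallStaticVoidMethod are real JNI functions:

#include <jni.h>
#include "Interop.h"

/* A sketch: assumes Interop declares "static void log(int i)" */
JNIEXPORT jint JNICALL Java_Interop_inc
  (JNIEnv* env, jclass class, jint i) {
    jmethodID log = (*env)->GetStaticMethodID(env, class, "log", "(I)V");
    (*env)->CallStaticVoidMethod(env, class, log, i);
    return (jint)(i + 1);
}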

If you don't need to have two-way communication between languages, though, and all you want to do is call existing native code, FFI is the way to go.

# Hold up, so how does Kotlin call functions written in Java?

Kotlin is a relatively new language that runs on the JVM.

Wait, doesn't Java run on the JVM?

That's true! But the JVM isn't just for Java; it's a generic virtual machine that compilers can generate "machine code" for, just like a native machine. JVM machine code is more commonly referred to as "bytecode," because every instruction is one byte in length. Yes, this limits the instruction set to 256 instructions, far fewer than the 2,034 instructions you'll find in x86.

In this respect, you could consider the JVM to be a "native" machine. That is the abstraction it aims to present to compilers and users. The only difference is that the native machine you're running code on is being emulated by a piece of software called the JVM.

Let's look at some example Java and Kotlin code.

Hello.java:

public class Hello {
  public static void world() {
    System.out.println("Hello, world!");
  }
}

Main.kt:

fun main(args: Array<String>) {
  Hello.world()
}

Let's compile and run this code.

$ javac Hello.java
$ kotlinc -classpath . Main.kt
$ kotlin MainKt
Hello, world!

And now we disassemble the code to see what's going on under the hood:

$ javap -c MainKt.class
Compiled from "Main.kt"
public final class MainKt {
  public static final void main(java.lang.String[]);
    Code:
      0: aload_0
      1: ldc           #9   // String args
      3: invokestatic  #15  // Method kotlin/jvm/internal/Intrinsics.checkParameterIsNotNull:(Ljava/lang/Object;Ljava/lang/String;)V
      6: invokestatic  #21  // Method Hello.world:()V
      9: return
}

This looks a lot different to disassembled native code. The bit underneath the "Code:" heading is the actual bytecode that gets run. The numbers on the left hand side are the byte offset into the code, the next column is the "opcode," and anything after the opcode is its operands.

"But you said JVM bytecodes were one byte in length, why does the byte index on the left increment more than 1 byte per instruction?"

The opcodes are one byte in length. The operands make up the rest of the space. For example, the invokestatic opcode takes two additional bytes of operands, which together form a 16-bit index into the class's "constant pool," identifying the method to call.
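For example, the invokestatic #21 at offset 6 in our disassembly is encoded as these three bytes:

b8 00 15

0xb8 is the opcode for invokestatic, and 0x0015 is 21, an index into the constant pool. This is also why the next offset in the disassembly is 9.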

How does the number 21 reference our Hello.world() method?

A fundamental concept in JVM class files is the "constant pool." It's not output by default by javap, but we can ask for it:

$ javap -verbose MainKt.class
Classfile /tmp/MainKt.class
  Last modified May 7, 2018; size 735 bytes
  MD5 checksum f3ce23c2362512e852a5c91a1053c198
  Compiled from "Main.kt"
public final class MainKt
  minor version: 0
  major version: 50
  flags: ACC_PUBLIC, ACC_FINAL, ACC_SUPER
Constant pool:
  ...
  #16 = Utf8               Hello
  #17 = Class              #16
  #18 = Utf8               world
  #19 = Utf8               ()V
  #20 = NameAndType        #18:#19
  #21 = Methodref          #17.#20
  ...

I've trimmed a lot of the output, but left the relevant entries in the constant pool. Entries 16, 18 and 19 are the raw UTF-8 strings that give the names "Hello" and "world", plus the "descriptor" ()V, which is a way of saying the method takes no arguments and returns void. The invokestatic opcode specifically takes operands that are indexes into the constant pool, and entries in the constant pool can reference other entries in the constant pool.

If we try to compile Main.kt on its own, without adding Hello.class to the classpath, we get an error very similar to the one we got when we tried to link an object file without all of the symbol definitions it required:

$ kotlinc Main.kt
Main.kt:2:2: error: unresolved reference: Hello

So because these two languages, Java and Kotlin, run using the same ABI and on the same instruction set, we are able to use that ABI's calling conventions to make function calls across language boundaries.

# Conclusion

We've come a long way. From raw, native assembly code up to calling functions between languages running on the same virtual machine.

With the knowledge you have built up of calling conventions, instruction sets, object files and FFI libraries, you should now be well equipped to explore how languages not mentioned in this post would call functions written in other languages.
