Language Interoperability From the Ground Up
How does a function in Ruby call a function in C? How does a function in Kotlin call a function in Java?
It's no secret that you can call functions in one programming language from another. Not only is it possible, it can be done in multiple ways. But how does it actually work?
This post is going to be a bottom-up, in-depth look at interoperability. Each section will build upon the previous until we have an understanding of the mechanisms that make language interoperability possible.
# Who should read this post?
This post is for you if you have been programming for a little while and have heard people talking about whether a library has "Java bindings" or "Ruby bindings."
It's for you if you have been using a language like Kotlin for a while, have called Java functions from Kotlin, and wondered how on earth that works under the hood.
And it's for you if you have a keen interest in the gory details of complicated systems, but want an explanation that doesn't assume much and keeps the examples as simple and hands-on as possible.
Additionally, you will need to have the following programs installed on your computer and accessible from the command line if you would like to follow along with the examples: `nm`, `gcc`, `nasm`, `gdb`, `javac`, `javap`, `kotlinc`, `kotlin`, and `ruby`. All of the examples in this post were written and run on an Arch Linux installation; they're not likely to work as shown here when run on a Mac or Windows machine.
# Why is language interoperability important?
The majority of language interoperability involves higher level languages calling libraries written in lower level languages. A Java library that offers you the ability to call in to the native OpenSSL libraries, for example. Or a Rust library that provides a more idiomatic Rust API in to the cURL library written in C.
Duplication is bad. When you've written a complicated library in one language, reimplementing it in another is just one more thing to forget to maintain.
Additionally, it's a good idea for young languages to be able to make the most of existing work. If you create a new language and include a mechanism for using existing native libraries, you can immediately draw upon decades of hard work.
# The benefits of purity
Sometimes you might see a library advertised as a "pure Ruby" implementation, or a "pure Java" implementation of some other library. These libraries will be full reimplementations of their target technology.
For example, there is a pure Java implementation of LevelDB. This would have been a lot of work for the author, but the advantage is that people can use LevelDB from Java without having to install the native LevelDB library on their system and package it up with their Java code for deployment.
While pure reimplementations can be a lot of up-front effort and maintenance, they can be easier to use.
# Why start at the bottom and work up?
The key to interoperability is finding common ground. With computers, the common ground is a set of standards and conventions in the lower levels of how programs work that allow languages to speak to each other. Understanding these conventions is key to understanding when languages can interoperate, and when they can't.
# Rock bottom: assembly language
mov eax, 10
That there is a line of "assembly code." It is a single "instruction," and it consists of a "mnemonic," `mov`, and two "operands," `eax` and `10`.
This is the "move" instruction, and it instructs our CPU to move the value 10 in to the register `eax`. It is part of an instruction set called "x86," which is used in 32-bit Intel and AMD CPUs.
Don't worry too much about these words if they don't mean anything to you. They're just terms you may see in the wild and, when you do, you'll know that the things in this post are what they are referring to.
# What is a register?
Registers are small but fast bits of storage connected directly to your CPU. If you want to do something to some data, be it addition, subtraction, or some other operation, the data needs to first be loaded in to a register.
If you're not used to writing code at this level, the concept of registers might seem silly. Why can't we just store 10 in memory and operate on it there?
Because that isn't physically possible. The CPU isn't connected to memory directly. It is connected through a series of caches, and all memory access must go through the Memory Management Unit, or MMU. The MMU can't process any operations; that happens in the Arithmetic Logic Unit, or ALU (with exceptions including floating point calculations, which happen on the Floating Point Unit, or FPU), and the ALU can only get at data if it is in a register.
To give a sense of scale: there is around a kilobyte of registers, a few hundred kilobytes of L1 cache, a megabyte or two of L2 cache, low tens of megabytes of L3 cache, and then main memory is often tens of gigabytes.
# Program flow
Doing mathematical operations is great and all, but for us to write code that does anything we need to be able to compare things and make decisions based on the outcome.
mov eax, 10
sub eax, 5
mov ebx, 5
cmp eax, ebx
je equal
notequal:
jmp exit
equal:
; do something here
exit:
; end of program
We start this snippet with some moves and a subtract (`sub`), introducing a new register, `ebx`, along the way. Then we do a `cmp`. The `cmp` instruction compares two values; they can either both be registers, or one of them can be an "immediate" value, e.g. 10. The result of the `cmp` is stored in a special "flags" register that we don't touch directly. Instead, we use a family of "jump" instructions that read the flags register and decide which instruction to run next.
The `je` instruction will jump to a point in our code, denoted by the "label" `equal`, if the result of the comparison was that both values are equal. Otherwise, the code falls through to the next assembly instruction, which will be whatever we decide to put below our `notequal` label. For now, we just do a `jmp`, which is an unconditional jump, to the end of our program at `exit`.
"Labels" are a new concept in this snippet. They aren't instructions and the CPU doesn't attempt to run them. Instead, they're used to mark points in the code that can be jumped to, which is what we've used them for here. You can spot labels by looking for the trailing colon.
# Accessing memory
So far so good, but we're only touching registers at the moment. Eventually we will run out of space in registers and have to fall back to using memory. How does that work in assembly?
mov eax, 0x7f00000
mov dword [eax], 123
The first bit should be familiar: we're storing the hexadecimal value `0x7f00000` in to `eax`, but then we do something funky with the next instruction. The square brackets around `[eax]` mean that we want to store a value in the memory address inside of `eax`. The `dword` keyword signifies that we want to store a 4-byte value ("double word"). We need the `dword` keyword in there because without it there's no way to infer how large the value `123` should be in memory.
If you're familiar with pointers in C, this is essentially how pointers work. You store the memory address you want to point to in a register and then access that memory by wrapping the register name in square brackets.
We can take this a little further:
mov eax, 0x7f00000
mov dword [eax + 1], 123
This will store the value `123` at address `0x7f00001`, one byte higher than before. The value of `eax` isn't modified by doing this; the syntax just lets us access the value at a register plus an offset. This is commonly seen in real-world assembly code, and we'll see why later on.
# The stack
With the fundamentals out of the way, we arrive at our first recognisable concept from higher level languages. The stack is where languages typically store short-lived variables. But how does it work?
When your program starts, the value of a special register called `esp` is set to a memory location that represents the top of the stack space you are allowed to access, which is itself near the top of your program's "address space."
You can examine this phenomenon by writing a simple C program and running it in a debugger:
int main() {
return 42;
}
Compile with `gcc` and run using `gdb`, the "GNU Debugger":
$ gcc main.c -o main -m32
$ gdb main
gdb> start
gdb> info register esp
esp 0xffffd468 0xffffd468
When we say `start` at the GDB prompt, we're asking it to load our program in to memory, start running it, but pause it as soon as it starts. Then we inspect the contents of `esp` with `info register esp`.
# A brief detour into memory organisation
Why is the stack near the top of the address space? What even is an address space?
When your program gets loaded in to memory on a 32-bit system, it is given its own address space that is 4GB in size: 0x00000000 to 0xffffffff. This happens no matter how much physical RAM your machine has installed in it. Mapping this "virtual" address space is another one of the jobs that the MMU performs, and the details of it are beyond the scope of this post.
To paint a simplified view of memory: our program is loaded somewhere fairly low down in the address space, our stack is up near the top and grows down, and then we have this mysterious place called the "heap" that lives just above our program. It's not important for language interoperability, but the heap stores long-lived data, for example things you've allocated using `malloc()` in C or `new` in C++.
This allows for a fairly efficient use of the available space. In reality, the stack is limited in size to a handful of megabytes, and exceeding that will cause your program to crash. The heap can grow all the way up to the base of the stack but no further. If it ever attempts to overlap with the stack, your `malloc()` calls will start to fail, which causes many programs to crash.
# Back to the stack
The x86 instruction set gives us some instructions for adding and removing values from the stack.
push 1
push 2
push 3
pop eax
pop ebx
pop ecx
At the end of this sequence of instructions, the value of `eax` will be 3, `ebx` will be 2 and `ecx` will be 1. Don't trust me? We can verify this for ourselves with a couple of small modifications.
global main
main:
push 1
push 2
push 3
pop eax
pop ebx
pop ecx
ret
Save this in to a file called `stack.s` and run the following commands:
$ nasm -f elf stack.s
$ gcc -m32 stack.o -o stack
If you don't have `nasm` you'll need to install it. It's a type of program called an "assembler," and it can take assembly instructions and compile them down to machine code.
We now have an executable file in our working directory called `stack`. It's a bona fide program you can run like any other program. The only modifications we had to make to it were giving it a `main` label, and making sure it correctly returns control back to the operating system with the `ret` instruction. We'll explore `ret` in more detail later.
Running this program doesn't really do anything. It will run, but exit silently. (Well, not 100% silently: check the exit status of the program when it finishes running. Why do you think it exits with the status it does?) To see what's going on inside of it, we will once again need a debugger.
$ gdb stack
gdb> break *&main+9
gdb> run
gdb> info registers eax ebx ecx
The output of this sequence of commands should verify what I said earlier about the contents of those registers. `gdb` allows us to load up a program, run it until a certain point (`break *&main+9` is us telling `gdb` to stop just before the `ret` instruction) and then examine the program state.
# So what do `push` and `pop` actually do?
The `push` and `pop` instructions are shorthand and can be expanded to the following:
; push
sub esp, 4
mov [esp], operand
; pop
mov operand, [esp]
add esp, 4
All of this should be familiar to you from previous sections, and it neatly demonstrates how the stack grows downwards and shrinks upwards as things are added and removed.
# Functions in assembly
Language interoperability is, at its most fundamental, the ability to call a function written in one language from a different language. So how do we define and call functions in assembly?
With the knowledge we have so far, you might be tempted to think that function calls look like this:
main:
jmp myfunction
myfunction:
; do things here
A label to denote where your function is in memory, and a jump instruction to call it. This approach has two critical problems: it doesn't handle passing arguments to or returning values from the function and it doesn't handle returning control to the caller when your function ends.
We could solve the first problem by putting arguments and return values on the stack:
main:
push 1
push 2
jmp add
pop eax
add:
pop eax
pop ebx
add eax, ebx
push eax
This would work really well if only our `add` function were able to jump back to the caller when it was finished. At the moment, when `add` ends, the program ends. In an ideal world it would return back to just after the `jmp` in `main`.
What if we saved where we were when we called the function and jumped back to that location when the function was finished?
The `eip` register holds the location of the currently executing instruction. Using this knowledge, we could do this:
main:
push 1
push 2
push eip
jmp add
pop eax
add:
pop edx ; store the return address for later
pop eax ; 2
pop ebx ; 1
add eax, ebx
push eax
mov eip, edx
We're getting there. This approach has a couple of problems, though: we modify `esp` a lot more than we really have to, and x86 doesn't let you `mov` things in to `eip`.
Here's what a compiler would actually generate for our example above:
main:
push ebp
mov ebp, esp
push 2
push 1
call add
add esp, 8
pop ebp
ret
add:
push ebp
mov ebp, esp
mov edx, dword [ebp + 8]
mov eax, dword [ebp + 12]
add eax, edx
pop ebp
ret
This is a lot to take in, so let's go through it line by line.
We're introducing a new special register: `ebp`. This is the "base pointer" register, and its purpose is to act as a pointer to the top of the stack at the moment a function is called. Every function starts by saving the old value of the base pointer on the stack, and then moving the new top of the stack in to `ebp`.
Next we do the familiar pushing of arguments on to the stack. At least we got that right. Then we use an instruction we haven't seen before called `call`. `call` can be expanded to the following:
; call
push eip + 2
jmp operand
With `eip + 2` meaning the instruction after the `jmp` below it. It doesn't matter what the exact value is; just think of it as pushing the address of the instruction after the `call` instruction on to the stack so we can refer to it later.
Then control is passed to `add`, which follows a similar pattern. The base pointer is saved, the stack pointer becomes the new base pointer, and then we get to see the base pointer in action.
mov edx, dword [ebp + 8]
mov eax, dword [ebp + 12]
This code is pulling the two arguments to `add` off of the stack and in to registers so that we can operate on them. But why are they 8 bytes and 12 bytes away?
Remember the order of events: we pushed the arguments on to the stack, `call` pushed the address to return to, and the function prologue pushed the old base pointer. Because the stack grows downwards, counting up from `ebp` retraces those pushes: the saved base pointer sits at `[ebp]`, the return address 4 bytes above it at `[ebp + 4]`, and the arguments above that at `[ebp + 8]` and `[ebp + 12]`.
This has the added benefit of not requiring us to modify `esp` with every single `push` and `pop` instruction. It's a small saving, but when you consider that this has to happen for every argument to every function called, it adds up.
When we've done what we need to do in our `add` function, we perform the steps we did at the start but in reverse order.
pop ebp
ret
The `ret` instruction is special because it allows us to set the value of `eip`. It pops the return address that `call` pushed for us and jumps to it, returning control of the program to the calling function.
The same thing happens in `main`, except there's a subtle difference. The `add esp, 8` is necessary to "free" the arguments to `add` we pushed on to the stack. If we don't do this, the `pop ebp` will not correctly restore the base pointer and we'll likely try to refer to memory we never intended to, crashing our program.
Lastly, you'll notice that `add` doesn't push its result back on to the stack when it's done. It leaves the result in `eax`. This is because it's conventional for a function's return value to be stored in `eax`.
# Conventions
We've just done the deepest possible dive on how function calls work in x86, now let's put names to each of the things we have learnt.
Saving the base pointer and moving the stack pointer at the start of a function is called the function prologue.
Restoring the stack pointer and base pointer at the end of a function is called the function epilogue.
Those two concepts, along with returning values in `eax` and passing your function arguments on the stack, make up what's called a calling convention, and calling conventions are part of a larger concept known as an application binary interface, or ABI. Specifically, all of what I have described so far is part of the System V ABI, which is used by almost all Unix and Unix-like operating systems.
Before we start calling functions written in one language from functions written in another, there's one last thing we need to be aware of.
# Object files
When you compile a C program, quite a lot of things happen under the hood. The most important concept to understand for language interoperability is "object files."
If your program consists of 2 `.c` files, the first thing a compiler does is compile each of those `.c` files in to a `.o` file. There is typically a 1-to-1 mapping between `.c` files and `.o` files. The same is true of assembly, or `.s`, files.
Object files, on Linux and lots of other Unix-like operating systems, are in a binary format called the "executable and linkable format," or ELF for short. If you're interested in the full range of data found inside of an ELF file, you can use the `readelf -a <elffile>` command to find out. The output is likely to be quite dizzying, though, and we're only really interested in one of its features here.
# Symbol tables
To explain symbol tables, let's split out our assembly from earlier in to two `.s` files.
`add.s`:
add:
push ebp
mov ebp, esp
mov edx, dword [ebp + 8]
mov eax, dword [ebp + 12]
add eax, edx
pop ebp
ret
`main.s`:
main:
push ebp
mov ebp, esp
push 2
push 1
call add
add esp, 8
pop ebp
ret
Assemble both of the files:
$ nasm -f elf add.s
$ nasm -f elf main.s
main.s:6: error: symbol `add' undefined
Whoops. Our assembler isn't happy about us calling an undefined function. This is sticky, because we want to define that function elsewhere and call it in `main.s`, but it seems here like the assembler doesn't allow that.
The problem is that there is no `add` symbol defined in this file. If we want to tell the assembler that we intend to find this symbol in another file, we need to say so. Add this line to the top of `main.s`:
extern add
And now it should assemble without complaint. Before we go further, have a look in your working directory. You should have two `.o` files: `main.o` and `add.o`. We can look at the contents of their symbol tables with a tool called `nm`:
$ nm add.o
00000000 t add
$ nm main.o
U add
00000000 t main
The first column is the address of the symbol, the second column is the type of the symbol, and the third column is the name of the symbol. Notice that the symbol names match up with our label names. Also note that `main.o` has an `add` symbol of type `U`. `U` means "undefined," which means we need to find a definition for it when we link these object files together later.
Both of our defined functions have a symbol type of `t`. This means that the symbol points to some code.
To create an executable out of these object files, we run:
$ gcc -m32 main.o add.o -o main
This will shout at you, claiming that it cannot find either `main` or `add`. What gives?
Unless we explicitly say so, the symbols in an object file cannot be used by other object files when the compiler links them together. To fix this, we need to add:
global add
To the top of `add.s`, and:
global main
To the top of `main.s`. This allows the symbols to be linked, and the result is that the compiler now takes our object files and creates an executable out of them without complaint.
$ nm main
This will produce a lot of output now, because the compiler has to link in a lot of administrative stuff, like the libc constructor and destructor handlers, `__libc_csu_init` and `__libc_csu_fini` respectively. Don't worry about them; the important thing is that both `main` and `add` are defined and the program runs without complaint.
# Calling a C function from C++
Let's go up a level and look at some C and C++.
`main.cpp`:
extern int inc(int i);
int main() {
return inc(2);
}
The first line here is us telling the compiler to expect to find a function called `inc` in another file.
`inc.c`:
:
int inc(int i) {
return i + 1;
}
Here's the set of steps we need to follow to compile them both separately and then link them together:
$ gcc -c main.cpp -o main.o
$ gcc -c inc.c -o inc.o
$ gcc inc.o main.o
Unfortunately, this fails with a seemingly unfathomable error:
main.o: In function `main':
main.cpp:(.text+0xa): undefined reference to `inc(int)'
collect2: error: ld returned 1 exit status
But we supplied an object file with a definition of `inc(int)` in it, and we made sure to tell the compiler to expect to find a function called `inc(int)`, so why can't it find it?
# Name mangling
Sadly C++ wasn't able to provide all of the features it wanted to without diverging from the System V ABI a little bit. When you compile a function in C, that function gets given a symbol with the name you gave it so that others can call it by that name.
C++ does not do this by default. As well as the name you give it, C++ also tacks on information about the return type and argument types of that function. If the function is in a class, information about the class is also included. It does this to allow you to overload the name of a function, so you can define multiple variations of a function that takes different arguments. This is called name mangling.
As a result, when we compiled our `main.cpp` file, it was told to look for a function called `_Z3inci` instead of plain old `inc`. Our `inc.c` file provides a function called `inc`, and as such the two languages cannot interoperate without a little bit of help.
Fortunately, the problem is easily solved by adding 4 characters to our `main.cpp`:
extern "C" int inc(int i);
This addition of `"C"` tells C++ to compile this file in search of a function with plain old C-style calling conventions, and this includes using the plain old C name of `inc`. Attempting to compile this code should now work as expected.
# Calling a C++ function from C
This relationship works similarly in the other direction. If we want to write a function in C++ but expose it in a way that a C program could call it, we would need to use `extern "C"` on that function:
extern "C" int inc(int i) {
return i + 1;
}
# What about Java?
What we've discussed up until now is the fundamental basis for how language interoperability works between native languages, but what about a language that runs on a virtual machine, such as Java?
The process is a little more involved, but doable. The problem arises because Java runs on a thing called the Java Virtual Machine, or JVM, which acts as a layer of indirection between Java code and the machine on which it runs. Because of this, Java cannot link directly to native libraries in the same way that C and C++ can. We have to introduce a layer that translates between the native world and the JVM world.
Fortunately, the people behind Java gave this a lot of thought and they came up with the "Java Native Interface," or JNI. It's the accepted way to get Java code to talk to native code, and here's how it works:
public class Interop {
static {
System.loadLibrary("inc");
}
public static native int inc(int i);
public static void main(String... args) {
System.out.println(inc(2));
}
}
Notice the use of the `native` keyword. This tells Java that the implementation for this function is defined in some native code somewhere. The `System.loadLibrary("inc")` line will search Java's library path for a library called `libinc.so` and, when it finds it, we will be able to use the Java function `inc` to call our native code!
But how do we do that?
Step 1: Generate the JNI header file from our code.
$ javac -h . Interop.java
The `-h .` tells `javac` to generate a file called `Interop.h` in the current directory. This will define the function we have to implement. The resulting file looks like this:
/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class Interop */
#ifndef _Included_Interop
#define _Included_Interop
#ifdef __cplusplus
extern "C" {
#endif
/*
* Class: Interop
* Method: inc
* Signature: (I)I
*/
JNIEXPORT jint JNICALL Java_Interop_inc
(JNIEnv *, jclass, jint);
#ifdef __cplusplus
}
#endif
#endif
This looks a lot scarier than it is. The lines containing `_Included_Interop` are just "header guards" that make sure we can't accidentally include this file twice, and the `__cplusplus` bit checks if we're compiling as C++ code and, if we are, wraps everything in an `extern "C"` block, which you'll remember from earlier in this post.
The rest is the definition of the JNI function we have to implement:
JNIEXPORT jint JNICALL Java_Interop_inc
(JNIEnv *, jclass, jint);
It might not look like one, but this is indeed a function declaration. We implement it like so:
`inc-jni.c`:
#include <jni.h>
#include "Interop.h"
JNIEXPORT jint JNICALL Java_Interop_inc
(JNIEnv* env, jclass class, jint i) {
return (jint)(i + 1);
}
The `jint` cast is to make sure our integer is the size that Java is expecting it to be. The `jni.h` include is required for all of the Java-specific things we're seeing, such as `jint` and `jclass`.
To compile this we need to execute a pretty gnarly `gcc` call:
$ gcc -I"$JAVA_HOME/include" -I"$JAVA_HOME/include/linux" -fPIC -shared -o libinc.so inc-jni.c
The `-I` flags tell `gcc` where to find header files, which we need for the `#include <jni.h>` line to work. The `-fPIC -shared` flags create a special type of object file called a "shared object." What's special about this type of object file is that it's possible to, instead of compiling directly against it, load it in to your process at runtime. Shared object files are, by convention, named `lib<something>.so`.
Now we can run the Java code like so:
$ java -Djava.library.path=. Interop
3
Voila! We successfully called our native `inc` implementation from Java! How cool is that?
# What about Ruby?
Ruby has a similar approach to Java. It exposes an API called "Ruby native extensions" that you can hook in to to call native functions in Ruby. Given that we have explored Java's way of doing this, and Ruby's is not too dissimilar, I want to use Ruby to focus on a different and more convenient way of calling native code.
# ffi
`ffi` stands for "foreign function interface," and it's a Ruby gem that allows us to call functions in existing shared object files with very little setup. First, we install the ffi gem:
$ gem install ffi
Then we need to compile our original `inc.c` file in to a shared object:
$ gcc -fPIC -shared -o libinc.so inc.c
And then we can write the following short snippet of Ruby code and call our `inc` function:
require 'ffi'
module Native
extend FFI::Library
ffi_lib './libinc.so'
attach_function :inc, [:int], :int
end
puts Native.inc(2)
This method makes us jump through far fewer hoops than Java's JNI does, and has the benefit of allowing us to call existing shared objects without modifying them. For example, you can call directly in to `libc.so`:
require 'ffi'
module Libc
extend FFI::Library
ffi_lib 'c'
attach_function :puts, [:string], :int
end
Libc.puts "Hello, libc!"
Notice that we only had to specify `'c'` as the library name. This is because Unix-like systems have standardised paths to find libraries in, often including `/lib` and `/usr/lib`. By default `ffi` will look for libraries with the naming scheme `lib<name>.so`. If you do `ls /usr/lib/libc.so` you should find a file exists at that path.
# Why doesn't Java have FFI?
It does! There's a library called JNA that does the same job that Ruby's FFI library does.
import com.sun.jna.Library;
import com.sun.jna.Native;
class JNA {
public interface Libc extends Library {
Libc INSTANCE = (Libc)Native.loadLibrary("c", Libc.class);
public int puts(String s);
}
public static void main(String... args) {
Libc.INSTANCE.puts("Hello, libc!");
}
}
We have to download the JNA library separately. Then we compile and run the Java code with the JNA jar on the classpath:
$ javac -classpath jna-4.1.0.jar JNA.java
$ java -classpath jna-4.1.0.jar:. JNA
Hello, libc!
# Why would anyone use a language's native interface if FFI is so much more convenient?
Lots of languages have some form of FFI library, and they're very convenient for calling in to existing native libraries. The problem, though, is that it's one-way communication. The library you're calling in to can't modify, for example, a Java object directly. It can't call Java code. The only way to do that is to use Java's JNI or Ruby's native extensions, because they expose to you an API for doing that.
If you don't need to have two-way communication between languages, though, and all you want to do is call existing native code, FFI is the way to go.
# Hold up, so how does Kotlin call functions written in Java?
Kotlin is a relatively new language that runs on the JVM.
Wait, doesn't Java run on the JVM?
That's true! But the JVM isn't just for Java, it's a generic virtual machine that compilers can generate "machine code" for just like native machines. The JVM machine code is more commonly referred to as "bytecode," because every instruction is one byte in length. Yes, this limits the instruction set to 256 instructions. Far fewer than the 2,034 instructions you'll find in x86.
In this respect, you could consider the JVM to be a "native" machine. That is the abstraction it aims to present to compilers and users. The only difference is that the native machine you're running code on is being emulated by a piece of software called the JVM.
Let's look at some example Java and Kotlin code.
`Hello.java`:
public class Hello {
public static void world() {
System.out.println("Hello, world!");
}
}
`Main.kt`:
fun main(args: Array<String>) {
Hello.world()
}
Let's compile and run this code.
$ javac Hello.java
$ kotlinc -classpath . Main.kt
$ kotlin MainKt
Hello, world!
And now we disassemble the code to see what's going on under the hood:
$ javap -c MainKt.class
Compiled from "Main.kt"
public final class MainKt {
public static final void main(java.lang.String[]);
Code:
0: aload_0
1: ldc #9 // String args
3: invokestatic #15 // Method kotlin/jvm/internal/Intrinsics.checkParameterIsNotNull:(Ljava/lang/Object;Ljava/lang/String;)V
6: invokestatic #21 // Method Hello.world:()V
9: return
}
This looks a lot different to disassembled native code. The bit underneath the "Code:" heading is the actual bytecode that gets run. The numbers on the left hand side are the byte offset in to the code, the next column is the "opcode" and anything after the opcode are the operands.
"But you said JVM bytecodes were one byte in length, why does the byte index on the left increment more than 1 byte per instruction?"
The opcodes are one byte in length. The operands make up the rest of the space. For example, the `invokestatic` opcode takes two additional bytes of operands, which together form a single 16-bit index in to the class's constant pool, identifying the method being called.
How does the number 21 reference our Hello.world() method?
A fundamental concept in JVM class files is the "constant pool." It's not output by default by `javap`, but we can ask for it:
$ javap -verbose MainKt.class
Classfile /tmp/MainKt.class
Last modified May 7, 2018; size 735 bytes
MD5 checksum f3ce23c2362512e852a5c91a1053c198
Compiled from "Main.kt"
public final class MainKt
minor version: 0
major version: 50
flags: ACC_PUBLIC, ACC_FINAL, ACC_SUPER
Constant pool:
...
#16 = Utf8 Hello
#17 = Class #16
#18 = Utf8 world
#19 = Utf8 ()V
#20 = NameAndType #18:#19
#21 = Methodref #17.#20
...
I've trimmed a lot of the output, but left the relevant entries in the constant pool. Entries 16, 18 and 19 are the raw UTF-8 strings that give the names "Hello", "world" and "()V", which is a way of saying it's a function that takes no arguments and returns void. The `invokestatic` opcode specifically takes operands that are indexes in to the constant pool, and entries in the constant pool can reference other entries in the constant pool.
If we tried to compile `Main.kt` on its own, without adding `Hello.class` in to the classpath, we'd get an error very similar to the one we got when we tried to link an object file without all of the symbol definitions it required:
$ kotlinc Main.kt
Main.kt:2:2: error: unresolved reference: Hello
So because these languages, Java and Kotlin, both run using the same ABI and on the same instruction set, we are able to use that ABI's calling conventions to make function calls across language boundaries.
# Conclusion
We've come a long way. From raw, native assembly code up to calling functions between languages running on the same virtual machine.
With the knowledge you have built up of calling conventions, instruction sets, object files and FFI libraries, you should now be well equipped to explore how languages not mentioned in this post would call functions written in other languages.