How does file copying work

I'm a big fan of Quora. One of the questions I came across recently concerned how files get copied from one place to another. I wrote an answer that tried to show empirically how files get copied. I thought it would make for a nice blog post, so I've cleaned up the original answer and extended the parts that I waffled over to be much more concrete and correct.

Prerequisite knowledge: If you're not sure what a system call is, this post might confuse you.

First we'll create a file of 1 megabyte in size to work with.

$ dd bs=1024 count=1024 if=/dev/zero of=/tmp/zeroes
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.00184454 s, 568 MB/s

This command is saying copy 1024 blocks of size 1024 bytes from /dev/zero to /tmp/zeroes. So we're going to get a 1 megabyte (1024 * 1024) file filled with zero bytes (\0).

Now what we're going to do is strace the cp command. strace shows you what system calls are being made by a program. Because to copy a file you fundamentally have to execute some system calls, this should tell us exactly how the file is copied.

$ strace -- cp /tmp/zeroes /tmp/cpzeroes

...
open("/tmp/zeroes", O_RDONLY)           = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1048576, ...}) = 0
open("/tmp/cpzeroes", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35abd32000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "", 131072)                     = 0
close(4)                                = 0
close(3)                                = 0
...

I've trimmed the output down to the parts that concern us because strace outputs a lot of startup and shutdown system calls for a process that we're not interested in here (though it's good fun to read and try to figure out what's going on).

Let's start from the top:

open("/tmp/zeroes", O_RDONLY)           = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1048576, ...}) = 0
open("/tmp/cpzeroes", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

We open the file /tmp/zeroes and get back the file descriptor 3. After that, we fstat the file. fstat is a syscall that gets file status information. On my Arch Linux virtual machine, I have version 8.23 of GNU's coreutils cp implementation, so I downloaded the source from here and had a poke around in the internals. The first fstat is done to error check the file being copied, e.g. make sure it isn't a directory. Interestingly, fstat is never called. It's either stat or lstat, which must call fstat at some point or we wouldn't be seeing that in our strace output.

Then we open /tmp/cpzeroes, sending some interesting flags. O_WRONLY is write only, O_CREAT means create the file if it does not exist and O_EXCL means that if the file does already exist, the call to open will fail (we don't want cp to trash any of our files by mistake!). We get the file descriptor 4 back from the open call.

And then another fstat call, but I'm not 100% sure why this one is done. To figure it out, I compiled a version of cp with debugging symbols and run it under gdb:

$ wget "http://ftp.gnu.org/gnu/coreutils/coreutils-8.23.tar.xz"
$ tar xvf coreutils-8.23.tar.xz
$ cd coreutils-8.23
$ ./configure
$ CFLAGS=-g make
$ gdb --args coreutils-8.23/src/cp /tmp/zeroes /tmp/cpzeroes
(gdb) break copy_internal
(gdb) run

This will stop us at a call to copy_internal (the place we found the first fstat) and it looks promising:

Breakpoint 1, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
    dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
    ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
    first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
    at src/copy.c:1746

Let's make sure we don't miss any calls to stat or lstat:

(gdb) break stat
(gdb) break lstat
(gdb) continue

This drops us first at the first fstat call:

Breakpoint 3, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
    dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
    ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
    first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
    at src/copy.c:1767
1767      if (XSTAT (x, src_name, &src_sb) != 0)

Fabulous. Now, where's the one on dst_name?

(gdb) continue
Breakpoint 2, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
    dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
    ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
    first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
    at src/copy.c:1817
1817               ? stat (dst_name, &dst_sb)

Tada! That corresponds to this line. There's a check a few lines later that seems to be ensuring that there is no existing file at dst_name. So it's doing an fstat to ensure it doesn't clobber anything. Neat.

Onwards!

fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0

The fadvise64 call tells the operating system that we intend to read from file descriptor 3 sequentially (so lower offsets read before higher ones). This doesn't need to be done, but doing it helps the operating system optimise our read calls. How will it optimise the calls? Well, it might fetch more of the file into memory than our first read asks for. Why? Because reading larger chunks from disk is better than smaller ones. Why? There's a lot of overhead involved in asking the hard drive to do something for us. With traditional spinning disks, we have to wait for the read head to get to the right place on the platter, something that could take milliseconds (a long time for a computer!). So reading as much as we can while the read head is in the right place is a Good Thing. The operating system may also fetch more of the file we're reading into memory if the disk is idle and we're not doing any IO as a bit of a bet that we will use the data later.

mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35abd32000

The mmap call is probably how the implementation of malloc handles large allocations on the system I've run this on. The MAP_ANONYMOUS flag tells mmap not to back this mapping by a file (so it essentially just becomes a large allocation in memory). Notice also that the allocation is 139264 bytes in size, this will be an important thinking point later.

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072

This is the bit that actually answers the original Quora question. A series of 8 reads and writes, each 131072 (128 * 1024, or 128kb) in size. So the file is indeed being copied in chunks, not all at the same time.

Why are the reads done in chunks of 131072 when the buffer we allocated with mmap is 139264 bytes in size? My original guess on the Quora answer was to do with slab allocation (which might not be the right term for it, but what I mean is memory allocators that have "size classes" for allocations, similar to tcmalloc), but this is only partly right. To figure it out for reals, I dove back into gdb.

$ gdb --args coreutils-8.23/src/cp /tmp/zeroes /tmp/cpzeroes
(gdb) break copy_internal
(gdb) run
(gdb) break malloc
(gdb) continue
(gdb) bt
#0  0x00007ffff76a60d0 in malloc () from /usr/lib/libc.so.6
#1  0x0000000000410359 in xmalloc (n=n@entry=135175) at lib/xmalloc.c:41
#2  0x00000000004081b9 in copy_reg (src_sb=0x7fffffffe2c0, new_dst=<synthetic pointer>,
    omitted_permissions=<optimized out>, dst_mode=<optimized out>, x=0x7fffffffe6a0,
    dst_name=0x7fffffffeb72 "/tmp/cpzeroes", src_name=0x7fffffffeb66 "/tmp/zeroes") at src/copy.c:1156
#3  copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
    dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=<optimized out>, new_dst@entry=false,
    parent=parent@entry=0x0, ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
    first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
    at src/copy.c:2523
#4  0x0000000000408fc9 in copy (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
    dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", nonexistent_dst=nonexistent_dst@entry=false,
    options=options@entry=0x7fffffffe6a0, copy_into_self=copy_into_self@entry=0x7fffffffe5a8,
    rename_succeeded=rename_succeeded@entry=0x0) at src/copy.c:2830
#5  0x000000000040475b in do_copy (n_files=<optimized out>, file=<optimized out>, target_directory=<optimized out>,
    target_directory@entry=0x0, no_target_directory=no_target_directory@entry=false, x=x@entry=0x7fffffffe6a0)
    at src/cp.c:767
#6  0x000000000040342b in main (argc=3, argv=0x7fffffffe898) at src/cp.c:1214

So we've arrived at a malloc call, and it seems like the one we're after. Look at stack frame 2. This is just before we dive into anything that looks like allocation, the filenames look relevant to our interests. Sadly libc wasn't compiled with debugging symbols, so we'll have to figure this one out by having a look through the code.

The code contains lots of interesting maths surrounding the buffer size calculation. If I have a look at the local variables in frame 2, we see some interesting numbers (output truncated for readability):

(gdb) frame 2
(gdb) info locals
buf_alignment_slop = 4103
make_holes = <optimized out>
wrote_hole_at_eof = 112
buf_size = 131072
sparse_src = false
n_read = -4294967246
buf_alloc = 0x0
source_desc = 3
name_alloc = 0x0
dest_desc = 4
dest_errno = <optimized out>
src_mode = 33188
return_val = true

When the call to xmalloc (which is just a wrapper around malloc that terminates the program if malloc fails) is done, the calculation is buf_size + buf_alignment_slop, which comes to 135175. I don't know how malloc is implemented on my system, but what I do know is that allocations this large are almost always done on page boundaries, so multiples of 4096. What's the next multiple of 4096? 139264. I'm 99% sure that's what's at play here.

Why are the reads done in chunks of 131072 anyway? My guess for this one was the "block size" of the hard drive. Again, this was only sort of right. There's a lovely comment on the io_blksize function that explains why 128kb was chosen.

Why are there loads of strings of zero bytes in the read and write calls? In the documentation for mmap, it is mentioned that by default the buffer you get back is zeroed if you supply the argument MAP_ANONYMOUS (which we did). That explains the first one. The rest are explained by the file we're copying being filled entirely by zeroes, and the same buffer being reused for all of the calls to read and write.

read(3, "", 131072)                     = 0
close(4)                                = 0
close(3)                                = 0

The last read (because the return value is less than the number of bytes requested, indicting that we reached the end of the file) and then standard file cleanup.

Phew. Every system call in a file copy using cp explained. I hope that was as enlightening for you as it was for me :)