How does file copying work
I'm a big fan of Quora. One of the questions I came across recently concerned how files get copied from one place to another. I wrote an answer that tried to show empirically how files get copied. I thought it would make for a nice blog post, so I've cleaned up the original answer and extended the parts that I waffled over to be much more concrete and correct.
Prerequisite knowledge: If you're not sure what a system call is, this post might confuse you.
First we'll create a file of 1 megabyte in size to work with.
$ dd bs=1024 count=1024 if=/dev/zero of=/tmp/zeroes
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.00184454 s, 568 MB/s
This command is saying copy 1024 blocks of size 1024 bytes from /dev/zero
to
/tmp/zeroes
. So we're going to get a 1 megabyte (1024 * 1024) file filled with
zero bytes (\0).
Now what we're going to do is strace
the cp
command.
strace
shows you what system calls are being made by a program.
Because to copy a file you fundamentally have to execute some system calls, this
should tell us exactly how the file is copied.
$ strace -- cp /tmp/zeroes /tmp/cpzeroes
...
open("/tmp/zeroes", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1048576, ...}) = 0
open("/tmp/cpzeroes", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35abd32000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "", 131072) = 0
close(4) = 0
close(3) = 0
...
I've trimmed the output down to the parts that concern us because strace
outputs a lot of startup and shutdown system calls for a process that we're not
interested in here (though it's good fun to read and try to figure out what's
going on).
Let's start from the top:
open("/tmp/zeroes", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1048576, ...}) = 0
open("/tmp/cpzeroes", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
We open the file /tmp/zeroes
and get back the file descriptor 3. After that,
we fstat
the file. fstat
is a syscall that gets file status information. On
my Arch Linux virtual machine, I have version 8.23 of GNU's coreutils cp
implementation, so I downloaded the source from here and had a poke around
in the internals. The first fstat is done to error check the file being
copied, e.g. make sure it isn't a directory. Interestingly, fstat
is never
called. It's either stat
or lstat
, which must call fstat
at some point or
we wouldn't be seeing that in our strace
output.
Then we open /tmp/cpzeroes
, sending some interesting flags. O_WRONLY
is
write only, O_CREAT
means create the file if it does not exist and O_EXCL
means that if the file does already exist, the call to open
will fail (we
don't want cp
to trash any of our files by mistake!). We get the file
descriptor 4 back from the open
call.
And then another fstat
call, but I'm not 100% sure why this one is done. To
figure it out, I compiled a version of cp
with debugging symbols and run it
under gdb
:
$ wget "http://ftp.gnu.org/gnu/coreutils/coreutils-8.23.tar.xz"
$ tar xvf coreutils-8.23.tar.xz
$ cd coreutils-8.23
$ ./configure
$ CFLAGS=-g make
$ gdb --args coreutils-8.23/src/cp /tmp/zeroes /tmp/cpzeroes
(gdb) break copy_internal
(gdb) run
This will stop us at a call to copy_internal
(the place we found the first
fstat
) and it looks promising:
Breakpoint 1, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
at src/copy.c:1746
Let's make sure we don't miss any calls to stat
or lstat
:
(gdb) break stat
(gdb) break lstat
(gdb) continue
This drops us first at the first fstat
call:
Breakpoint 3, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
at src/copy.c:1767
1767 if (XSTAT (x, src_name, &src_sb) != 0)
Fabulous. Now, where's the one on dst_name
?
(gdb) continue
Breakpoint 2, copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=new_dst@entry=false, parent=parent@entry=0x0,
ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
at src/copy.c:1817
1817 ? stat (dst_name, &dst_sb)
Tada! That corresponds to this line. There's a check a few lines later that
seems to be ensuring that there is no existing file at dst_name
. So it's doing
an fstat
to ensure it doesn't clobber anything. Neat.
Onwards!
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
The fadvise64
call tells the operating system that we intend to read from
file descriptor 3 sequentially (so lower offsets read before higher ones). This
doesn't need to be done, but doing it helps the operating system optimise our
read
calls. How will it optimise the calls? Well, it might fetch more of the
file into memory than our first read
asks for. Why? Because reading larger
chunks from disk is better than smaller ones. Why? There's a lot of overhead
involved in asking the hard drive to do something for us. With traditional
spinning disks, we have to wait for the read head to get to the right place on
the platter, something that could take milliseconds (a long time for a
computer!). So reading as much as we can while the read head is in the right
place is a Good Thing. The operating system may also fetch more of the file
we're reading into memory if the disk is idle and we're not doing any IO as a
bit of a bet that we will use the data later.
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f35abd32000
The mmap
call is probably how the implementation of malloc
handles large
allocations on the system I've run this on. The MAP_ANONYMOUS
flag tells
mmap
not to back this mapping by a file (so it essentially just becomes a
large allocation in memory). Notice also that the allocation is 139264 bytes in
size, this will be an important thinking point later.
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
This is the bit that actually answers the original Quora question. A series of 8 reads and writes, each 131072 (128 * 1024, or 128kb) in size. So the file is indeed being copied in chunks, not all at the same time.
Why are the reads done in chunks of 131072 when the buffer we allocated with
mmap
is 139264 bytes in size? My original guess on the Quora answer was to do
with slab allocation (which might not be the right term for it, but what I mean
is memory allocators that have "size classes" for allocations, similar to
tcmalloc), but this is only partly right. To figure it out for reals,
I dove back into gdb.
$ gdb --args coreutils-8.23/src/cp /tmp/zeroes /tmp/cpzeroes
(gdb) break copy_internal
(gdb) run
(gdb) break malloc
(gdb) continue
(gdb) bt
#0 0x00007ffff76a60d0 in malloc () from /usr/lib/libc.so.6
#1 0x0000000000410359 in xmalloc (n=n@entry=135175) at lib/xmalloc.c:41
#2 0x00000000004081b9 in copy_reg (src_sb=0x7fffffffe2c0, new_dst=<synthetic pointer>,
omitted_permissions=<optimized out>, dst_mode=<optimized out>, x=0x7fffffffe6a0,
dst_name=0x7fffffffeb72 "/tmp/cpzeroes", src_name=0x7fffffffeb66 "/tmp/zeroes") at src/copy.c:1156
#3 copy_internal (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", new_dst=<optimized out>, new_dst@entry=false,
parent=parent@entry=0x0, ancestors=ancestors@entry=0x0, x=x@entry=0x7fffffffe6a0, command_line_arg=true,
first_dir_created_per_command_line_arg=0x7fffffffe56f, copy_into_self=0x7fffffffe5a8, rename_succeeded=0x0)
at src/copy.c:2523
#4 0x0000000000408fc9 in copy (src_name=src_name@entry=0x7fffffffeb66 "/tmp/zeroes",
dst_name=dst_name@entry=0x7fffffffeb72 "/tmp/cpzeroes", nonexistent_dst=nonexistent_dst@entry=false,
options=options@entry=0x7fffffffe6a0, copy_into_self=copy_into_self@entry=0x7fffffffe5a8,
rename_succeeded=rename_succeeded@entry=0x0) at src/copy.c:2830
#5 0x000000000040475b in do_copy (n_files=<optimized out>, file=<optimized out>, target_directory=<optimized out>,
target_directory@entry=0x0, no_target_directory=no_target_directory@entry=false, x=x@entry=0x7fffffffe6a0)
at src/cp.c:767
#6 0x000000000040342b in main (argc=3, argv=0x7fffffffe898) at src/cp.c:1214
So we've arrived at a malloc call, and it seems like the one we're after. Look at stack frame 2. This is just before we dive into anything that looks like allocation, the filenames look relevant to our interests. Sadly libc wasn't compiled with debugging symbols, so we'll have to figure this one out by having a look through the code.
The code contains lots of interesting maths surrounding the buffer size calculation. If I have a look at the local variables in frame 2, we see some interesting numbers (output truncated for readability):
(gdb) frame 2
(gdb) info locals
buf_alignment_slop = 4103
make_holes = <optimized out>
wrote_hole_at_eof = 112
buf_size = 131072
sparse_src = false
n_read = -4294967246
buf_alloc = 0x0
source_desc = 3
name_alloc = 0x0
dest_desc = 4
dest_errno = <optimized out>
src_mode = 33188
return_val = true
When the call to xmalloc
(which is just a wrapper around malloc
that
terminates the program if malloc
fails) is done, the calculation is
buf_size + buf_alignment_slop
, which comes to 135175. I don't know how
malloc
is implemented on my system, but what I do know is that allocations
this large are almost always done on page boundaries, so multiples of 4096.
What's the next multiple of 4096? 139264. I'm 99% sure that's what's at play
here.
Why are the reads done in chunks of 131072 anyway? My guess for this one was the
"block size" of the hard drive. Again, this was only sort of right. There's a
lovely comment on the io_blksize
function that explains why 128kb was
chosen.
Why are there loads of strings of zero bytes in the read
and write
calls? In
the documentation for mmap
, it is mentioned that by default the buffer you get
back is zeroed if you supply the argument MAP_ANONYMOUS
(which we did). That
explains the first one. The rest are explained by the file we're copying being
filled entirely by zeroes, and the same buffer being reused for all of the calls
to read
and write
.
read(3, "", 131072) = 0
close(4) = 0
close(3) = 0
The last read
(because the return value is less than the number of bytes
requested, indicting that we reached the end of the file) and then standard file
cleanup.
Phew. Every system call in a file copy using cp
explained. I hope that was as
enlightening for you as it was for me :)