I/O performance in Python

The Problem

I am writing a small python script to keep track of various events and messages. It uses a flat file as an index, each record being of the same size and containing details about each message.

This file can get large, in the order of several hundreds of megabytes. The python code is trivial, given that each record is exactly the same size, but what is the fastest way to access and use that index file?

With python (or any programming language, for what is worth), I have plenty of ways to read a file:

I can just rely on read and io.read in python having perfectly good buffering, and just read (or io.read) a record at a time.
I can read it in one go, and then operate in memory (eg, a single read in a string, followed by using offsets within the string).
I can do my own buffering, read a large chunk at a time, and then operate on each chunk as a set of records (eg, multiple read of some multiple of the size of the record, followed by using offsets within each chunk).
I can use fancier libraries that allow me to mmap or use some other crazy approach.

and even more ways to parse it. For now, I just wanted to focus on the reading part, as in my case parsing was a simple split or struct.unpack.

The Quest

Disclaimer: most of the examples below are oversimplified and not particularly beautiful. I should have used the with construct, checked for exceptions, or preferred list operations over string operations at times. I was trying to avoid any distractions, by keeping the examples as similar as possible. Please bear with me, and keep your pythonic feelings at bay, if you can :), but do please provide suggestions if you have any!

The first question I wanted to answer was "is solving this problem even worth spending time on it? Why can't I just write the simplest code, using read, and be done with it?".

So that's where I started from:

f = open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break

print i

The counter, printing i, is just to verify that all the records have been read correctly.

The file I've used for benchmarks is about 1.4 Gb in size:

$ ls -al ./test-index.bin
-rw-r--r-- 1 rabexc users 1493893120 Dec 16  2010 ./test-index.bin

My system has about 4 Gb of ram, only < 1 of which is used in steady state:

$ free
             total       used       free     shared    buffers     cached
Mem:       4039164    2363280    1675884          0      38128    1697472
-/+ buffers/cache:     627680    3411484
Swap:      6291452        140    6291312

Given that I'm not using the system for anything else but this test, this means that after I read the file for a few times, it should live in cache, and always be accessed from memory.

So, how do I time the code? well, a few run of the scripts with time should do it:

$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real    0m54.960s
user    0m53.715s
sys     0m0.620s

(this is like the 3rd run, which is important to make sure that all the important bits, including the test file, are read from cache rather than actual disk, so I can benchmark the code, and not my hard drive).

This means 55 seconds were spent running the python snippet above, and 0.6 seconds were spent by the operating system to provide the content of the file.

This is quite disappointing: I was hoping for my code to take < 10 seconds. How fast can this go at best? Let's try comparing this number with how long cat takes to read the file:

$ time cat ./test-index.bin &> /dev/null
real    0m0.531s
user    0m0.016s
sys     0m0.512s

well, a plain cat is taking about half a second, and I'm sure there are better ways to read files than what cat does! It ought to be possible to speed up my python code!

Given that I don't know anything about python internals, let's start by looking at what cat is doing internally with a simple strace:

fadvise64_64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
...
read(3, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
write(1, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
write(1, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
...

So:

It tells the operating systems it intends to read the file sequentially (fadvise).
It reads the file in chunk of 32K.
Which it immediately writes out to the output file.

The Results

Unless you are interested in spending the next 30 mins of your life learning about experiments that turned out to be of very little help, let me summarize the results first:

I now believe that python is actually quite fast at reading the file. The issue seems to be in accessing the buffer 16 bytes at a time. A simple substring using list slice syntax over a large buffer costs about 300 ns, multiply by a 100,000,000 runs, this gets to ~30 seconds.
I could not find any good way to speed this part up. Anything I did outside the example snippets ended up paying roughly the same cost. Note also that most of the APIs take strings as input, so it is hard to use anything else.
Eliminating this cost in example snippets of code was tricky, but brought the performance to be in the same order of magnitude as that of cat.
Not sure even eliminating this cost would buy much. Sure, it will make the benchmark faster, but any sort of processing in the middle of the loop will likely end up costing something.
Different ways to actually read the file did bring some improvements. Eg, I went from ~30-40 seconds with the best approaches, to about 2 minutes with the worst ones. Given that the naive code is about 50 seconds, those 10 seconds are probably not worth any sort of extra complexity in your code.

The Experiments

Naive open with read

You have already seen this snippet of code:

f = open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break

print i

Running this script takes about 54 seconds. Just to put this in perspective, it is about 100 times slower than cat, although it is really not doing anything with each record.

$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real    0m54.960s
user    0m53.715s
sys     0m0.620s

If I run strace on this script, I see that internally python is reading 4K at a time, and buffering. Good!

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096

this means it is using buffers of about 4K size, or 8 times smaller than cat. If I increase the buffer size to 32K, with:

f = open(sys.argv[1], "rb", 32768)

it is still taking ~48 seconds, which means that most of the time is still, well, spent by python itself, with sys time still ~0.5 seconds.

If I further increase the buffer size to 1 Mb and 10 Mb, performance does not improve.

But if I leave a large buffer, and also increase the f.read() size from 16 bytes to 32 K or 1 Mb, run time goes down to about 3 seconds.

Given that the buffer size controls how many times the read syscall is invoked, while the size of the f.read() controls how many times the loop, and the python code to perform the read is called, it seems like the cost is dominated by the python code to perform the read.

This is where knowning more of the python internals would come in handy, but by peeking at the python process with a profiler, I don't see that much, except for the fact that python itself is spending lots of CPU, which doesn't sound very surprising.

 35.17%  [.] 0x000bd586
 30.86%  [.] PyEval_EvalFrameEx
  9.15%  [.] PyDict_GetItem
  6.22%  [.] _PyObject_GenericGetAttrWithDict
  2.98%  [.] PyDict_SetItem
  2.13%  [.] Py_UniversalNewlineFread
  2.06%  [.] PyCFunction_Call
  1.90%  [.] PyObject_Malloc
  1.83%  [.] PyObject_GetAttr
  1.78%  [.] PyString_FromStringAndSize
  1.76%  [.] PyObject_Free
  1.37%  [.] PyEval_RestoreThread
  1.30%  [.] PyTuple_New
  0.68%  [.] PyEval_SaveThread
  0.58%  [.] _PyArg_ParseTuple_SizeT
  0.24%  [.] getresgid@plt
  0.00%  [.] _PyModule_Clear
  0.00%  [.] PyType_ClearCache

So the question arises: does it get faster if, for example, I use io.read instead of the plain read? And what if I use os.read on file descriptors instead?

io.open and read

This is what the code looks like:

f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  record = f.read(16)
  if not record:
    break
print i

and this is how fast it is:

$ time python /tmp/snippets/snippet01.py ./test-index.bin
93368320

real    0m42.741s
user    0m42.039s
sys     0m0.636s

I was not expecting io module to be any faster, really, especially given that according to strace, they are using the same buffer size. This seems to indicate that the python code path to run io.read() is indeed slightly faster? oh, well, the 10 seconds gain doesn't make it particularly exciting.

open and os.read

Let's try with a simple os.read. Given that it is the lowest level function I have available, I expect it to have the least amount of python code in it. Although I know nothing about python internals, I would expect plain read and io.read to be implemented on top of os.read.

However, calling os.read directly to get 16 bytes at a time will incur in heavy syscall costs, but let's try:

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
  record = os.read(fd, 16)
  if not record:
    break

print i

And here's how long it takes:

$ time python /tmp/snippets/snippet02.py ./test-index.bin
93368320

real    1m53.372s
user    0m58.292s
sys     0m54.927s

Which is roughly twice as slower than any other solution so far. Unsurprisingly, this means that buffering is important. Note, however, how the sys time has gone up significantly from previous tests, from ~half a second to about 54 seconds, while the user time is pretty much unchanged.

This probably means that we are paying roughly the same cost within python itself, plus the cost to perform many more syscalls and actually asking the kernel 16 bytes at a time of the file.

But what if I increase buffering here? And read 1 Mb at a time?

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
  record = os.read(fd, 1048576)
  if not record:
    break

print i

Time goes down significantly:

$ time python ./snippet02.py /opt/vms/pool/test-index.bin 
1425

real  0m1.446s
user  0m0.012s
sys 0m1.424s

So, where is the cost? Is it in the read function codepath, or is it in working 16 bytes at a time in python?

Given that by using a plain read with 1 Mb at a time the run time went down to about 3 seconds, I expect the difference between os.read and read in terms of cost to be that 1.5 second.

So, what about the 40+ seconds caused by reading 16 bytes at a time?

Let's see what happens if I os.read() 1 Mb at a time, and then work on 16 bytes at a time. Here's what the code looks like:

f = open(sys.argv[1], "rb")
fd = f.fileno()
offsets = range(0, 1048765, 16)
i = 0
while True:
  buff = os.read(fd, 1048576)
  if not buff:
    break

  for offset in offsets:
    record = buff[offset:offset + 16]
    i += 1

print i

And here are the timings:

$ time python ./snippet05.py /opt/vms/pool/test-index.bin 
93405900

real    0m38.921s
user    0m37.328s
sys     0m1.520s

What makes these timings interesting is that if I comment out the inner loop, it takes 1.5 seconds once again. So the 37 extra seconds are spent by the inner loop!

If I turn the body of the loop into a simple pass, the run time goes down to about 6 seconds, while if I leave only the record = buff[offset:offset + 16] as instruction, this becomes ~30 seconds.

This means that computing the substring is really what I need to make faster, and costs about 300 ns.

Reading the whole file in one go

Here is what the code looks like:

f = open(sys.argv[1], "rb")
data = f.read()

for i in itertools.count():
  start = i * 16
  end = start + 16
  if start >= len(data):
    break
  record = data[start:end]

print i

and here is the output:

$ time python /tmp/snippets/snippet03.py test-index.bin
93368320

real    1m12.975s
user    1m11.120s
sys     0m1.720s

which is not that surprising. This is actually much slower than any other read operation. If the cost is in string processing, it may be that we are paying this cost multiple times now: once to fill data, reading one buffer at a time, and once every time I create a record object from the buffer.

In facts, if I comment out the record = data[start:end] line the time goes back to almost normal:

$ time python /tmp/snippets/snippet03.py test-index.bin
93368320

real    0m48.553s
user    0m44.427s
sys     0m1.764s

io.open and readinto

By peeking at pydoc io, there seems to be a method readinto that reads data into a bytearray instead of a string, so let's try it:

b = bytearray(16) 
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

The first experiment doesn't really go well, this is worse than anything I've seen before:

$ time python /tmp/snippets/snippet04.py test-index.bin
93368320

real    1m40.463s
user    1m38.794s
sys     0m0.648s

By looking at strace, it seems to be doing lots of read(4096), which probably means read into a buffer, and then copy into the bytearray of 16 bytes:

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096

Which seems to be slower than reading into a plain string? interesting.

Let's try with a larger bytearray, like 1048576, and see what happens instead:

b = bytearray(1048576)
# b = bytearray(16)
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

run time is now:

$ time python /tmp/snippets/snippet04.py test-index.bin
1425

real    0m3.926s
user    0m0.220s
sys     0m1.568s

4 seconds, which is comparable to what I had before with large reads.

If I look with strace, mostly out of curiosity, here's what happens:

mmap2(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7296000
read(3, "\277\267\330\5\301\224\230\374\212\335Y\354\326a\237K\217][\354\334a\257\376c\357,v\352\260\213\364"..., 1048576) = 1048576
munmap(0xb7296000, 1052672)             = 0

which looks like it's reading directly into the bytearray.

What is surprising is the mmap. As pointed out on the reddit thread, the mmap and munmap are used to allocate memory (flags indicate anonymous memory, and file descriptor is -1). This might simply be related to the malloc() call in linux glibc, that according to the man page will use mmap instead of sbrk for any allocation larger than MMAP_THRESHOLD, by default set to 128 k.

What I find suspicious here is that if you look at strace, you see a continuous pattern of mmap and munmap. This probably means a malloc followed by a free shortly after, but I would have expected a good memory allocator to "cache" allocations within the process for some time before returning memory to the operating system, although it's been a while since I've looked at the implementation of malloc.

So, without actually looking at the python code, here's my theory of the internals:

If "bytearray > buffer", read directly into bytearray.
Otherwise read into a buffer of configured size and copy the requested data into bytearray.

Let's try with a bytearray of 8192 bytes, which is twice as much as the buffer size:

$ time python /tmp/snippets/snippet04.py test-index.bin
182360

real    0m1.256s
user    0m0.704s
sys     0m0.548s

From strace, this is reading 8k at a times, presumably into the bytearray directly, given the run time.

Let's try one more time, with buffering disabled (note the 0 in io.open):

b = bytearray(16)
f = io.open(sys.argv[1], "rb", 0)
for i in itertools.count():
  numread = f.readinto(b)
  if not numread:
    break

print i

With strace, I can now clearly see that python is reading 16 bytes at a time directly:

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16

However, this is far slower than I was expecting:

$ time python /tmp/snippets/snippet04.py test-index.bin
93368320

real    2m4.992s
user    1m10.476s
sys     0m54.287s

Note the sys time at 54 seconds, probably because of the repeated read calls. The 1 minute and 10 seconds of user time is probably the high number of iterations? I would not have expected readinto to be slower than other primitives.

With python3

Those were mostly run out of curiosity, with the default buffer size and the most naive code.

Naive open and read

$ time python3 /tmp/snippets/snippet00.py test-index.bin

real    0m32.056s
user    0m30.662s
sys     0m1.300s

io.open and read

$ time python3 /tmp/snippets/snippet01.py test-index.bin
93368320

real    0m38.533s
user    0m37.862s
sys     0m0.588s

open and os.read

$ time python3 /tmp/snippets/snippet02.py test-index.bin 
93368320

real    2m6.902s
user    1m11.432s
sys     0m55.223s

reading the file in one go

$ time python3 /tmp/snippets/snippet03.py test-index.bin
93368320

real    1m36.938s
user    1m33.706s
sys     0m1.712s

io.open and readinto

$ time python3 /tmp/snippets/snippet04.py test-index.bin
45590

real    0m0.595s
user    0m0.084s
sys     0m0.512s

Making GRUB quiet Hanging clothes in windy weather