I am writing a small python script to keep track of various events and messages. It uses a flat file as an index, each record being of the same size and containing details about each message.
This file can get large, in the order of several hundreds of megabytes. The python code is trivial, given that each record is exactly the same size, but what is the fastest way to access and use that index file?
With python (or any programming language, for what is worth), I have plenty of ways to read a file:

- count on read and io.read in python having perfectly good buffering, and just read (or io.read) a record at a time;
- read the whole file into a string, followed by using offsets within the string;
- read some multiple of the size of the record at a time, followed by using offsets within each chunk;
- mmap the file, or use some other crazy approach;

and even more ways to parse it. For now, I just wanted to focus on the reading part, as in my case parsing was a simple split or struct.unpack.
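For what it's worth, the parsing side is not the interesting part here; with fixed-size records it boils down to something like the sketch below. The field layout is made up for illustration (the real index just happens to use 16-byte records):

import struct

# Hypothetical 16-byte record layout: four unsigned 32-bit fields.
RECORD_FORMAT = "<IIII"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 16 bytes

def parse_record(record):
    # record is a 16-byte string read from the index file.
    message_id, event_type, timestamp, offset = struct.unpack(RECORD_FORMAT, record)
    return message_id, event_type, timestamp, offset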
Disclaimer: most of the examples below are oversimplified and not particularly beautiful. I should have used the with construct, checked for exceptions, or preferred list operations over string operations at times. I was trying to avoid any distractions, by keeping the examples as similar as possible. Please bear with me, and keep your pythonic feelings at bay, if you can :), but do please provide suggestions if you have any!
The first question I wanted to answer was "is this problem even worth spending time on? Why can't I just write the simplest code, using read, and be done with it?".
So that's where I started from:
import itertools
import sys

f = open(sys.argv[1], "rb")
for i in itertools.count():
    record = f.read(16)
    if not record:
        break
print i
The counter, printing i, is just to verify that all the records have been read correctly.
The file I've used for benchmarks is about 1.4 Gb in size:
$ ls -al ./test-index.bin
-rw-r--r-- 1 rabexc users 1493893120 Dec 16  2010 ./test-index.bin
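At 16 bytes per record, that is 1493893120 / 16 = 93368320 records, which is exactly the count printed by the snippets below.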
My system has about 4 Gb of ram, of which less than 1 Gb is used in steady state:
$ free
             total       used       free     shared    buffers     cached
Mem:       4039164    2363280    1675884          0      38128    1697472
-/+ buffers/cache:      627680    3411484
Swap:      6291452        140    6291312
Given that I'm not using the system for anything else but this test, this means that after I read the file a few times, it should live in cache, and always be accessed from memory.
So, how do I time the code? Well, a few runs of the scripts with time should do it:
$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real	0m54.960s
user	0m53.715s
sys	0m0.620s
(this is like the 3rd run, which is important to make sure that all the important bits, including the test file, are read from cache rather than actual disk, so I can benchmark the code, and not my hard drive).
This means 55 seconds were spent running the python snippet above, and 0.6 seconds were spent by the operating system to provide the content of the file.
This is quite disappointing: I was hoping for my code to take < 10 seconds. How fast can this go
at best? Let's try comparing this number with how long cat
takes to read the file:
$ time cat ./test-index.bin &> /dev/null
real	0m0.531s
user	0m0.016s
sys	0m0.512s
Well, a plain cat takes about half a second, and I'm sure there are better ways to read files than what cat does! It ought to be possible to speed up my python code!
Given that I don't know anything about python internals, let's start by looking at what
cat
is doing internally with a simple strace:
fadvise64_64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
...
read(3, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
write(1, "QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\1@\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
write(1, "\200\0\0\0\0\6\0\0\200\0\0\0\0\214\0\0\200\0\0\0\1\21\0\0\200\0\0\0\1\226\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
read(3, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
write(1, "\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
...
So, two things stand out: cat asks the kernel to optimize for sequential reads (the fadvise64_64 call with POSIX_FADV_SEQUENTIAL), and it then reads (and writes) the file 32768 bytes at a time.
Unless you are interested in spending the next 30 mins of your life learning about experiments that turned out to be of very little help, let me summarize the results first:
- I now believe that python is actually quite fast at reading the file. The issue seems to be in accessing the buffer 16 bytes at a time: a simple substring using slice syntax over a large buffer costs about 300 ns, and multiplied by ~100,000,000 iterations this adds up to ~30 seconds (see the quick check after this list).
- I could not find any good way to speed this part up. Anything I did outside the example snippets ended up paying roughly the same cost. Note also that most of the APIs take strings as input, so it is hard to use anything else.
- Eliminating this cost in the example snippets was tricky, but brought the performance within the same order of magnitude as that of cat.
- I'm not sure eliminating this cost would buy much anyway. Sure, it would make the benchmark faster, but any sort of processing in the middle of the loop will likely end up costing something.
- Different ways to actually read the file did bring some improvements: I went from ~30-40 seconds with the best approaches to about 2 minutes with the worst ones. Given that the naive code takes about 50 seconds, the 10 or so seconds saved are probably not worth any extra complexity in your code.
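As a rough way to verify that ~300 ns figure (this check is my addition, not one of the original snippets, and numbers will obviously vary by machine and python version), the slicing can be timed in isolation with timeit:

import timeit

# Time a 16-byte slice out of a 1 Mb string, in isolation.
setup = "buff = 'x' * 1048576"
stmt = "record = buff[512:512 + 16]"

iterations = 10000000
elapsed = timeit.timeit(stmt, setup=setup, number=iterations)
print "%.0f ns per slice" % (elapsed / iterations * 1e9)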
You have already seen this snippet of code:
import itertools
import sys

f = open(sys.argv[1], "rb")
for i in itertools.count():
    record = f.read(16)
    if not record:
        break
print i
Running this script takes about 54 seconds. Just to put this in perspective, it is about 100
times slower than cat
, although it is really not doing anything with each record.
$ time python /tmp/snippets/snippet00.py ./test-index.bin
93368320
real	0m54.960s
user	0m53.715s
sys	0m0.620s
If I run strace
on this script, I see that internally python is reading 4K at a time, and buffering. Good!
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
This means it is using buffers of about 4 Kb in size, or 8 times smaller than cat. If I increase the buffer size to 32 Kb, with:
f = open(sys.argv[1], "rb", 32768)
it is still taking ~48 seconds, which means that most of the time is still, well, spent by python itself, with sys time still ~0.5 seconds.
If I further increase the buffer size to 1 Mb and 10 Mb, performance does not improve.
But if I leave a large buffer, and also increase the f.read()
size from 16 bytes to 32 K
or 1 Mb, run time goes down to about 3 seconds.
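To make the change concrete, here is a sketch of that variant (my reconstruction, not one of the numbered snippets): a large open() buffer combined with 1 Mb f.read() calls, and no per-record work at all.

import itertools
import sys

# Large buffer plus large reads; note that i now counts
# 1 Mb chunks rather than 16-byte records.
f = open(sys.argv[1], "rb", 1048576)
for i in itertools.count():
    chunk = f.read(1048576)
    if not chunk:
        break
print i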
Given that the buffer size controls how many times the read syscall is invoked, while the size passed to f.read() controls how many times the loop (and the python code performing the read) runs, it seems like the cost is dominated by the python code that performs the read.
This is where knowing more of the python internals would come in handy, but by peeking at the python process with a profiler, I don't see much, except for the fact that python itself is spending lots of CPU, which doesn't sound very surprising.
35.17%  [.] 0x000bd586
30.86%  [.] PyEval_EvalFrameEx
 9.15%  [.] PyDict_GetItem
 6.22%  [.] _PyObject_GenericGetAttrWithDict
 2.98%  [.] PyDict_SetItem
 2.13%  [.] Py_UniversalNewlineFread
 2.06%  [.] PyCFunction_Call
 1.90%  [.] PyObject_Malloc
 1.83%  [.] PyObject_GetAttr
 1.78%  [.] PyString_FromStringAndSize
 1.76%  [.] PyObject_Free
 1.37%  [.] PyEval_RestoreThread
 1.30%  [.] PyTuple_New
 0.68%  [.] PyEval_SaveThread
 0.58%  [.] _PyArg_ParseTuple_SizeT
 0.24%  [.] getresgid@plt
 0.00%  [.] _PyModule_Clear
 0.00%  [.] PyType_ClearCache
So the question arises: does it get faster if, for example, I use io.read
instead of the plain read
?
And what if I use os.read
on file descriptors instead?
This is what the code looks like:
import io
import itertools
import sys

f = io.open(sys.argv[1], "rb")
for i in itertools.count():
    record = f.read(16)
    if not record:
        break
print i
and this is how fast it is:
$ time python /tmp/snippets/snippet01.py ./test-index.bin
93368320
real	0m42.741s
user	0m42.039s
sys	0m0.636s
I was not really expecting the io module to be any faster, especially given that, according to strace, it uses the same buffer size. This seems to indicate that the python code path to run io.read() is indeed slightly faster? Oh well, a 10-second gain doesn't make it particularly exciting.
Let's try with a simple os.read
. Given that it is the lowest level function I have available,
I expect it to have the least amount of python code in it. Although I know nothing about python
internals, I would expect plain read
and io.read
to be implemented on top of os.read
.
However, calling os.read directly to get 16 bytes at a time will incur heavy syscall costs, but let's try:
import itertools
import os
import sys

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
    record = os.read(fd, 16)
    if not record:
        break
print i
And here's how long it takes:
$ time python /tmp/snippets/snippet02.py ./test-index.bin
93368320
real	1m53.372s
user	0m58.292s
sys	0m54.927s
Which is roughly twice as slow as any other solution so far. Unsurprisingly, this means that buffering is important. Note, however, how the sys time has gone up significantly from previous tests, from about half a second to about 54 seconds, while the user time is pretty much unchanged.
This probably means that we are paying roughly the same cost within python itself, plus the cost of performing many more syscalls and actually asking the kernel for 16 bytes of the file at a time.
But what if I increase buffering here? And read 1 Mb at a time?
import itertools
import os
import sys

f = open(sys.argv[1], "rb")
fd = f.fileno()
for i in itertools.count():
    record = os.read(fd, 1048576)
    if not record:
        break
print i
Time goes down significantly:
$ time python ./snippet02.py /opt/vms/pool/test-index.bin
1425
real	0m1.446s
user	0m0.012s
sys	0m1.424s
So, where is the cost? Is it in the read function codepath, or is it in working 16 bytes at a time in python?
Given that a plain read of 1 Mb at a time brought the run time down to about 3 seconds, I expect the cost difference between os.read and read to be that ~1.5 seconds.
So, what about the 40+ seconds caused by reading 16 bytes at a time?
Let's see what happens if I os.read()
1 Mb at a time, and then work on 16 bytes at a time.
Here's what the code looks like:
import os
import sys

f = open(sys.argv[1], "rb")
fd = f.fileno()
# Note: the range end (1048765) is slightly past 1048576, so a few offsets
# in each chunk yield empty slices that still get counted, which is why the
# count printed below is a bit higher than the real number of records.
offsets = range(0, 1048765, 16)
i = 0
while True:
    buff = os.read(fd, 1048576)
    if not buff:
        break
    for offset in offsets:
        record = buff[offset:offset + 16]
        i += 1
print i
And here are the timings:
$ time python ./snippet05.py /opt/vms/pool/test-index.bin
93405900
real	0m38.921s
user	0m37.328s
sys	0m1.520s
What makes these timings interesting is that if I comment out the inner loop, it takes 1.5 seconds once again. So the 37 extra seconds are spent by the inner loop!
If I turn the body of the loop into a simple pass, the run time goes down to about 6 seconds, while if I leave only record = buff[offset:offset + 16] as the body, it becomes ~30 seconds.
This means that computing the substring is really what I need to make faster, and costs about 300 ns.
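For reference, this is the kind of inner-loop variation I mean (a reconstruction of the experiment just described, not one of the numbered snippets):

import os
import sys

f = open(sys.argv[1], "rb")
fd = f.fileno()
offsets = range(0, 1048576, 16)
while True:
    buff = os.read(fd, 1048576)
    if not buff:
        break
    for offset in offsets:
        # Variant A, empty body: ~6 seconds in total.
        #   pass
        # Variant B, slicing only: ~30 seconds in total.
        record = buff[offset:offset + 16]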
What if I instead read the whole file in memory at once, and then use offsets within the string? Here is what the code looks like:
import itertools
import sys

f = open(sys.argv[1], "rb")
data = f.read()
for i in itertools.count():
    start = i * 16
    end = start + 16
    if start >= len(data):
        break
    record = data[start:end]
print i
and here is the output:
$ time python /tmp/snippets/snippet03.py test-index.bin
93368320
real	1m12.975s
user	1m11.120s
sys	0m1.720s
which is not that surprising. This is actually much slower than any other
read operation. If the cost is in string processing, it may be that we are
paying this cost multiple times now: once to fill data, reading one buffer
at a time, and once every time I create a record
object from the buffer.
In fact, if I comment out the record = data[start:end] line, the time goes back to almost normal:
$ time python /tmp/snippets/snippet03.py test-index.bin
93368320
real	0m48.553s
user	0m44.427s
sys	0m1.764s
By peeking at pydoc io
, there seems to be a method readinto
that reads data
into a bytearray
instead of a string
, so let's try it:
import io
import itertools
import sys

b = bytearray(16)
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
    numread = f.readinto(b)
    if not numread:
        break
print i
The first experiment doesn't really go well, this is worse than anything I've seen before:
$ time python /tmp/snippets/snippet04.py test-index.bin
93368320
real	1m40.463s
user	1m38.794s
sys	0m0.648s
By looking at strace, it seems to be doing lots of read(4096) calls, which probably means it reads into an internal buffer, and then copies 16 bytes at a time into the bytearray:
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
Which seems to be slower than reading into a plain string? Interesting.
Let's try with a larger bytearray, like 1048576 bytes, and see what happens instead:
import io
import itertools
import sys

b = bytearray(1048576)  # b = bytearray(16)
f = io.open(sys.argv[1], "rb")
for i in itertools.count():
    numread = f.readinto(b)
    if not numread:
        break
print i
run time is now:
$ time python /tmp/snippets/snippet04.py test-index.bin
1425
real	0m3.926s
user	0m0.220s
sys	0m1.568s
4 seconds, which is comparable to what I had before with large reads.
If I look with strace, mostly out of curiosity, here's what happens:
mmap2(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7296000
read(3, "\277\267\330\5\301\224\230\374\212\335Y\354\326a\237K\217][\354\334a\257\376c\357,v\352\260\213\364"..., 1048576) = 1048576
munmap(0xb7296000, 1052672) = 0
which looks like it's reading directly into the bytearray.
What is surprising is the mmap. As pointed out on the reddit thread, the mmap and munmap are used to allocate memory (the flags indicate anonymous memory, and the file descriptor is -1). This might simply be related to the malloc() call in linux glibc, which according to the man page will use mmap instead of sbrk for any allocation larger than MMAP_THRESHOLD, 128 Kb by default.
What I find suspicious here is that if you look at strace
, you see a continuous pattern of mmap
and munmap
. This probably means a malloc
followed by a free
shortly after, but I would have expected a good memory allocator to "cache" allocations within the process for some time before returning memory to the operating system, although it's been a while since I've looked at the implementation of malloc.
So, without actually looking at the python code, here's my theory of the internals: when the destination bytearray is at least as large as the internal buffer, readinto reads straight into it (allocating and freeing temporary memory along the way, hence the mmap and munmap); when it is smaller, python fills its internal 4 Kb buffer first, and then copies 16 bytes at a time out of it.
Let's try with a bytearray of 8192 bytes, which is twice as much as the buffer size:
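The code for this run is not shown, but presumably the only change from the previous snippet is the size of the bytearray:

b = bytearray(8192)  # twice the 4 Kb internal buffer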
$ time python /tmp/snippets/snippet04.py test-index.bin
182360
real	0m1.256s
user	0m0.704s
sys	0m0.548s
From strace, this is reading 8 Kb at a time, presumably directly into the bytearray, given the run time.
Let's try one more time, with buffering disabled (note the 0 in io.open):
import io
import itertools
import sys

b = bytearray(16)
f = io.open(sys.argv[1], "rb", 0)
for i in itertools.count():
    numread = f.readinto(b)
    if not numread:
        break
print i
With strace, I can now clearly see that python is reading 16 bytes at a time directly:
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 16
However, this is far slower than I was expecting:
$ time python /tmp/snippets/snippet04.py test-index.bin
93368320
real	2m4.992s
user	1m10.476s
sys	0m54.287s
Note the sys time at 54 seconds, probably because of the repeated read calls, while the 1 minute and 10 seconds of user time is probably due to the sheer number of iterations. I would not have expected readinto to be slower than the other primitives.
Finally, I ran the same snippets with python3, mostly out of curiosity, with the default buffer size and the most naive code:
$ time python3 /tmp/snippets/snippet00.py test-index.bin
real 0m32.056s
user 0m30.662s
sys 0m1.300s
$ time python3 /tmp/snippets/snippet01.py test-index.bin
93368320
real	0m38.533s
user	0m37.862s
sys	0m0.588s
$ time python3 /tmp/snippets/snippet02.py test-index.bin
93368320
real	2m6.902s
user	1m11.432s
sys	0m55.223s
$ time python3 /tmp/snippets/snippet03.py test-index.bin
93368320
real	1m36.938s
user	1m33.706s
sys	0m1.712s
$ time python3 /tmp/snippets/snippet04.py test-index.bin
45590
real	0m0.595s
user	0m0.084s
sys	0m0.512s