2007-10-22

GNU pth instead of pthread: hardcore python tuning

Problem
I've been working on the speed of an application written in Python. While running strace I found that most of the syscalls were futex(). These futexes came from Python's internal synchronization code. But the application never used any kind of threading!

Most users never realize that Python synchronizes between threads even when no threads are in use. That's not so bad in itself: those futex() calls are very quick, about 7 usecs each. But they are executed thousands of times!
Sample "strace -c":

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.62    0.000570           0     15192           futex
 10.91    0.000092           0       896           sendto
  7.00    0.000059           7         8           stat
  5.34    0.000045           0      1936           poll
  3.32    0.000028           0      1076           recvfrom
  2.73    0.000023           1        28           open
  2.02    0.000017           3         6           write
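
To see at a glance how dominant futex() is, the summary can be parsed with a few lines of Python. This is only a sketch: the helper name parse_strace_summary is made up for illustration, and it assumes the column layout shown above (four numeric columns, an optional errors column, then the syscall name).

```python
# Hypothetical helper, not part of strace or Python: parses the body of
# an "strace -c" summary, assuming the column layout shown above.
SAMPLE = """\
67.62 0.000570 0 15192 futex
10.91 0.000092 0 896 sendto
7.00 0.000059 7 8 stat
5.34 0.000045 0 1936 poll
3.32 0.000028 0 1076 recvfrom
2.73 0.000023 1 28 open
2.02 0.000017 3 6 write
"""


def parse_strace_summary(text):
    """Return a list of (syscall, calls, seconds) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, the dashed separator and the total row.
        if len(fields) < 5 or not fields[0][0].isdigit() or fields[-1] == "total":
            continue
        seconds = float(fields[1])
        calls = int(fields[3])
        syscall = fields[-1]  # last column, whether or not "errors" is filled
        rows.append((syscall, calls, seconds))
    return rows


rows = parse_strace_summary(SAMPLE)
total_calls = sum(calls for _, calls, _ in rows)
# Rank the syscalls by how often they were called.
for syscall, calls, seconds in sorted(rows, key=lambda r: r[1], reverse=True):
    share = 100.0 * calls / total_calls
    print("%-10s %6d calls  %5.1f%% of listed calls" % (syscall, calls, share))
```

Counted this way, futex() makes up roughly four out of five of the calls listed in the sample.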

Workaround
To get rid of the futex() calls, one can compile Python without threading support (./configure --without-threads). The downside of this method is that some modules depend on threading, such as the PostgreSQL wrapper psycopg.

There is another method. It seems that Python supports the GNU pth library, which emulates threads in userspace behind a pthread-compatible interface. The only non-trivial part is getting rid of pthread and using pth when compiling Python (even though pth is supposed to be supported by Python out of the box).

Enabling pth is not easy, but doable. Here's the compilation procedure thanks to my friend Primitive:
$ sudo apt-get install libpth-dev
$ wget http://python.org/ftp/python/2.5.1/Python-2.5.1.tar.bz2
$ wget http://ai.pjwstk.edu.pl/~majek/dump/pyconfig.h.in-pth-patch.diff
$ tar xjf Python-2.5.1.tar.bz2
$ cd Python-2.5.1
Python-2.5.1$ cat ../pyconfig.h.in-pth-patch.diff|patch -p0
patching file pyconfig.h.in
Python-2.5.1$ ./configure --with-pth --prefix=/usr
Python-2.5.1$ find . -name Makefile -exec sed -i "s/-lpthread/-lpth/" {} \;
Python-2.5.1$ sed -i "s/-pthread//" Makefile
Python-2.5.1$ sed -i 's/$(srcdir)\/Python\/thread_pthread.h//' Makefile
Python-2.5.1$ make

Benchmarks
In the corner case below, the time saved is about 25%-30%. The test program is not complicated at all. We know that importing modules needs synchronization, so let's do nothing but imports:
#!/usr/bin/python
for i in xrange(1000000):
    import os
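
On a current interpreter xrange no longer exists, so here is a rough equivalent that times itself. It is a sketch under the assumption of Python 3; the function name time_imports is mine, and the numbers it prints are not comparable with the Python 2.5.1 figures in this post.

```python
import time


def time_imports(iterations):
    """Run `import os` repeatedly and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        # os is already cached in sys.modules, so each pass only
        # exercises the import statement itself.
        import os
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 1000000
    elapsed = time_imports(n)
    print("%d imports took %.3f s" % (n, elapsed))
```

Running the script under "strace -c" is still the way to see which syscalls, if any, the loop triggers.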
Average results for standard python 2.5.1, using pthread:
real    0m2.061s
user    0m1.728s
sys     0m0.332s
Average results for patched python 2.5.1 using pth:
real    0m1.572s
user    0m1.560s
sys     0m0.008s
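
How big the gain looks depends on how it is expressed. A quick calculation from the two sets of averages above (the variable names are mine, the numbers are the ones quoted):

```python
# Averages quoted above, in seconds.
pthread = {"real": 2.061, "user": 1.728, "sys": 0.332}
pth = {"real": 1.572, "user": 1.560, "sys": 0.008}

saved = pthread["real"] - pth["real"]
# Share of the original wall-clock time that disappears.
reduction = 100.0 * saved / pthread["real"]
# How much faster the pth build is relative to the pthread build.
speedup = 100.0 * (pthread["real"] / pth["real"] - 1.0)
# Time no longer spent in the kernel.
sys_saved = pthread["sys"] - pth["sys"]

print("wall-clock time saved: %.3f s (%.1f%% less time)" % (saved, reduction))
print("pth build is %.1f%% faster" % speedup)
print("kernel time saved: %.3f s" % sys_saved)
```

By this arithmetic the run takes roughly 24% less wall-clock time, or equivalently the pth build is about 31% faster, and about two thirds of the saving is sys time.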

What's with those futexes?
Well, as I suggested, the futexes are gone when using pth. Here are the strace results for Python with pthreads:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    1.014676           1   1000020           futex
  0.01    0.000067           0       142           read
  0.01    0.000057           0      4504           _llseek
And for the patched Python with pth:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050           0      4504           _llseek
  0.00    0.000000           0       142           read
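
The pthread numbers line up neatly with the loop count of the test program. A small sanity check, using only figures from the summary above (the variable names are mine):

```python
iterations = 1000000       # loop count of the test program
futex_calls = 1000020      # from the pthread strace summary
futex_seconds = 1.014676   # time strace attributed to futex()

# Average number of futex() calls per "import os" statement.
per_iteration = futex_calls / float(iterations)
# Calls left over once one call per iteration is accounted for.
outside_loop = futex_calls - iterations
# Average cost of a single call as measured under strace.
usec_per_call = futex_seconds / futex_calls * 1e6

print("futex calls per import: %.5f" % per_iteration)
print("calls not explained by the loop: %d" % outside_loop)
print("average cost under strace: %.2f usec per call" % usec_per_call)
```

That works out to almost exactly one futex() per import statement. The 20 remaining calls presumably come from interpreter startup and shutdown, but that part is my guess and not something the summary shows.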


5 comments:

Maciek said...

you might try something like this:
LD_PRELOAD=/path/to/pth.so ./single_threaded_python_script.py
Since the API is supposed to be the same, it should be possible to substitute the library at startup time.

Anonymous said...

LD_PRELOAD didn't work ;)

majek said...

Didn't work for me either.

Phillip said...

So I know this post is old, but I am very intrigued by this idea. Seems like a big win for Python and a very logical choice.

I was browsing the Python source to see if system call mapping is enabled, because if it isn't then any I/O will block the entire process.

It doesn't appear to be enabled. Since you have this up and running, wanna give it a shot? I'm going to try it myself soon, with the hopes of speeding up my FastCGI/WSGI processes :)

majek said...

> I was browsing the Python source to
> see if system call mapping is
> enabled, because if it isn't then
> any I/O will block the entire
> process.

I only tried to avoid the futex() overhead. I never really thought about wrapping read(). But it would be very interesting to see whether this really works.

I'm not going to try it myself in the near future, but I'd love to hear your results.