Website loading time debugging (what happens on the wire when you browse?)

One of my responsibilities is optimizing page loading time. There are some very useful tools on the web that help you understand where the bottlenecks are.

Firefox addons like Firebug with YSlow and TamperData are the most important and easiest-to-use tools for every web developer.

On the other hand, I see a lack of tools that are more networking-oriented. It's fine that YSlow can tell me that caching headers are broken, but how can I test whether content is really being cached?

I would like to see a tool that looks more or less like Firebug's networking tab, but shows what is really going on on the wire. I would like to see all the "If-Modified-Since" requests, and whether "ETags" are used. There is also the TCP/IP layer, which is sometimes forgotten: are "Keep-Alives" used, is establishing a TCP/IP connection fast?
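In the meantime, caching can at least be spot-checked by hand: replay a request carrying the validators from an earlier response and see whether the server answers "304 Not Modified". A minimal Python 3 sketch (the URL and validator values are placeholders; note urllib normalizes header capitalization internally):

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a revalidation request: a server with working caching
    headers should answer it with "304 Not Modified" and no body."""
    req = urllib.request.Request(url)
    if etag:
        # urllib stores this internally as "If-none-match"
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# The validators would normally be copied from a previous response's
# ETag / Last-Modified headers; these values are made up.
req = conditional_request("http://example.com/",
                          etag='"abc123"',
                          last_modified="Mon, 01 Jan 2007 00:00:00 GMT")
print(sorted(req.header_items()))
```

Sending the request with urllib.request.urlopen() raises HTTPError for a 304, so the status is easy to check; a 200 with a full body means the server ignored the validators and the caching headers really are broken.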

That's why I'm thinking of a new tool, nothing revolutionary, just something useful for me. I'm currently writing a proof of concept. A sample screenshot from my work:


Set process name in Python

How do you set a process's name from inside the process itself? I know that nmap can fake its name, but the question is how to do this from Python.
The code that does this magic in nmap:

/* from nmap.cc */
if (quashargv) {
  size_t fakeargvlen = strlen(FAKE_ARGV), argvlen = strlen(argv[0]);
  if (argvlen < fakeargvlen)
    fatal("If you want me to fake your argv, you need to call the program with a longer name. Try the full pathname, or rename it fyodorssuperdedouperportscanner");
  strncpy(argv[0], FAKE_ARGV, fakeargvlen);
  memset(&argv[0][fakeargvlen], '\0', strlen(&argv[0][fakeargvlen]));
  for (i = 1; i < argc; i++)
    memset(argv[i], '\0', strlen(argv[i]));
}
There are some possible ways of doing this from Python. Some people suggest using the prctl syscall, but it didn't work for me. The proctitle module didn't do what I expected either.
Ever since I found the rtgraph3d project, I've been a fan of the PyInline module. Here's my Python code that works (for me):
import PyInline, time

m = PyInline.build(code=r"""
#include <Python.h>
#include <stdio.h>
#include <string.h>

void set_argv(char *str){
    int argc;
    char **argv;
    Py_GetArgcArgv(&argc, &argv);
    /* assumes the new name fits into the original argv area,
       cf. the nmap warning above */
    strncpy(argv[0], str, strlen(str));
    memset(&argv[0][strlen(str)], '\0', strlen(&argv[0][strlen(str)]));
}
""", language="C")

m.set_argv('very custom python process name')
It looks funny from ps:
majek@tigger:~$ ps aux|grep very
majek 11688 2.6 0.4 7192 4360 pts/21 S+ 00:20 0:00 very custom python process name
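For completeness, here's a sketch of the prctl approach mentioned above, done via ctypes (Linux-only). One likely reason it seems not to work: PR_SET_NAME changes only the kernel's 15-byte "comm" name shown in /proc/&lt;pid&gt;/comm and by "ps -o comm", not the argv string that "ps aux" prints:

```python
import ctypes

PR_SET_NAME = 15  # constant from <linux/prctl.h>

# CDLL(None) gives access to symbols already loaded into the
# process, which includes libc's prctl().
libc = ctypes.CDLL(None, use_errno=True)

def set_comm(name):
    """Set the kernel thread name ("comm"), not the ps-visible argv."""
    buf = name.encode()[:15]  # the kernel silently truncates comm to 15 bytes
    if libc.prctl(PR_SET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")

set_comm("customname")
print(open("/proc/self/comm").read().strip())
```

So prctl and the argv-overwrite trick above are complementary: the first renames the process for /proc and top's default view, the second for the full command line.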


The mystery of the extremely slow "grep -i"

It's a well-known fact that grep is the fastest way of searching data. The mystery was why "grep -i" took so much longer than standard, case-sensitive grep.

Finally depesz found the reason: case-insensitive matching is slow in multibyte (UTF-8) locales; run grep under LC_ALL=C and the slowdown disappears.

GNU pth instead of pthread: hardcore Python tuning

I've been working on the speed of an application written in Python. While running strace I found that most of the syscalls were futex(). These futexes came from Python's internal synchronization code. But the application never used any kind of threading!

Most users never realize that Python does synchronization between threads even when threads aren't used. It's not so bad: each futex() call is very quick, about 7 usecs, but they are executed thousands of times!
A sample "strace -c":

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.62    0.000570           0     15192           futex
 10.91    0.000092           0       896           sendto
  7.00    0.000059           7         8           stat
  5.34    0.000045           0      1936           poll
  3.32    0.000028           0      1076           recvfrom
  2.73    0.000023           1        28           open
  2.02    0.000017           3         6           write
To get rid of the futex() calls one can compile Python without support for threading (./configure --without-threads). The downside of this method is that some modules depend on threading, like the PostgreSQL wrapper psycopg.
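As a quick sanity check, the two builds can be told apart from inside the interpreter: a --without-threads build simply has no low-level thread module (called thread on Python 2, _thread on Python 3):

```python
# Probe for the low-level thread module; an interpreter built with
# ./configure --without-threads raises ImportError here.
try:
    import _thread  # Python 3 name; on Python 2 try "thread" instead
    has_threads = True
except ImportError:
    has_threads = False

print("thread support:", has_threads)
```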

There is also another method. Python supports the GNU pth library, which emulates threads in userspace using a pthread-compatible interface. The only non-trivial part is getting rid of pthread and using pth during Python compilation (even though pth should be supported by Python out of the box).

Enabling pth is not easy, but it's doable. Here's the compilation procedure, thanks to my friend Primitive:
$ sudo apt-get install libpth-dev
$ wget http://python.org/ftp/python/2.5.1/Python-2.5.1.tar.bz2
$ wget http://ai.pjwstk.edu.pl/~majek/dump/pyconfig.h.in-pth-patch.diff
$ tar xjf Python-2.5.1.tar.bz2
$ cd Python-2.5.1
Python-2.5.1$ cat ../pyconfig.h.in-pth-patch.diff|patch -p0
patching file pyconfig.h.in
Python-2.5.1$ ./configure --with-pth --prefix=/usr
Python-2.5.1$ find . -name Makefile -exec sed -i "s/-lpthread/-lpth/" {} \;
Python-2.5.1$ sed -i "s/-pthread//" Makefile
Python-2.5.1$ sed -i 's/$(srcdir)\/Python\/thread_pthread.h//' Makefile
Python-2.5.1$ make
In a corner case the time gained is about 25-30%. The test program is not complicated at all: we know that importing modules needs synchronization, so let's do only imports:
for i in xrange(1000000):
    import os
Average results for standard Python 2.5.1, using pthread:
real    0m2.061s
user    0m1.728s
sys     0m0.332s
Average results for patched Python 2.5.1, using pth:
real    0m1.572s
user    0m1.560s
sys     0m0.008s
What about those futexes?
Well, as I suggested, the futexes are gone when using pth. Here are the strace results for Python with pthreads:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    1.014676           1   1000020           futex
  0.01    0.000067           0       142           read
  0.01    0.000057           0      4504           _llseek
And for the fixed Python with pth:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050           0      4504           _llseek
  0.00    0.000000           0       142           read


Interesting gettimeofday() on x86_64

While running strace on an application I found that there were no gettimeofday() syscalls. That's nothing new for x86_64 experts, but it was interesting to me.

Here is an example of the code:

#include <stdio.h>
#include <sys/time.h>

int main(){
    printf("start\n");
    gettimeofday(NULL, NULL);
    printf("stop\n");
    return 0;
}
Sample strace from i386:

write(1, "start\n", 6) = 6
gettimeofday(NULL, NULL) = 0
write(1, "stop\n", 5) = 5
And the same code on x86_64:

write(1, "start\n", 6) = 6
write(1, "stop\n", 5) = 5
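The explanation is that on x86_64, glibc services gettimeofday() through the vsyscall/vDSO mechanism: the kernel maps a small page of its own code into every process, and the call completes entirely in userspace, so strace has nothing to show. The mapping is easy to verify from Python on Linux:

```python
# Every Linux process gets the vDSO mapped into its address space;
# gettimeofday() on x86_64 is serviced by this page without a syscall.
with open("/proc/self/maps") as maps:
    vdso = [line.strip() for line in maps if "[vdso]" in line]

print(vdso[0] if vdso else "no vDSO mapping found")
```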


Finding paths between users in a social network graph

For some time I've been working at Grono.net, the biggest Polish social networking website (over 1.3 million registered users). Like on other networking sites, our users have lists of friends they know. We have an interesting feature of finding the shortest path between two users, measured in "handshakes" (by "handshake" I mean a connection between two users).

The problem with the "paths" service was that it didn't scale well. Our servers were overloaded and the path-searching time was unacceptable. It became clear that it was a good time to rewrite our algorithms.

Zero: what's the problem
Social networks are based on the idea of "six degrees of separation": the distance (which we can measure in the number of handshakes) between any two users is quite small. Even though the average number of friends per user is not big, finding the shortest paths between users is a very hard problem from the point of view of algorithmic complexity.

When the social network graph is small, one can use the standard Dijkstra method for finding shortest paths. But as the graph becomes bigger and bigger, the standard algorithm can take a few seconds to run, which is inefficient.

First try: the Dijkstra method
Dijkstra's algorithm is the simplest one for finding shortest paths in graphs (see the Wikipedia article for details). In our case, users can be treated as graph nodes, and connections between users as edges. We also know that the distance between users is always relatively small, at most a dozen handshakes (edges). Knowing this, we can tune the Dijkstra method to our specific needs. The algorithmic cost of a standard Dijkstra implementation is O(V * log(V) + E). In our case we can easily reduce this to O(V + E), by implementing the priority queue as a hash table keyed by distance: all the operations we need on this data structure then cost constant time, instead of the logarithmic cost of a standard priority queue.
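Since every handshake has weight 1, Dijkstra with such a constant-time queue behaves exactly like a breadth-first search over a FIFO queue. A minimal sketch in Python (the graph layout and user names are made up for illustration; our production code was different):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Unit-weight Dijkstra == BFS: O(V + E) with a plain FIFO queue."""
    prev = {start: None}          # how each node was first reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # walk the prev-links back to reconstruct the handshake chain
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in prev:
                prev[neighbour] = node
                queue.append(neighbour)
    return None  # no connection at all

friends = {
    "ala":    ["bob", "celina"],
    "bob":    ["ala", "danuta"],
    "celina": ["ala", "danuta"],
    "danuta": ["bob", "celina", "edek"],
    "edek":   ["danuta"],
}
print(shortest_path(friends, "ala", "edek"))
# prints ['ala', 'bob', 'danuta', 'edek']
```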

Second try: finish counting earlier
The standard Dijkstra algorithm computes paths from a specified node to all other nodes. In our case we don't need to find paths to all possible users; we can stop as soon as the first valid path to the target user is found. One might think this would speed up the program significantly, but it doesn't. In practice the gain was small, because of a specific property of our graph: the number of connections (edges) grows very rapidly at each successive level of connections. It seems that within a distance of four handshakes we can reach most of all users, so by the time we could abandon the computation with the first result, we would already have computed shortest paths to most of the users. Here is sample data showing the growing number of nodes and edges at successive handshake levels (distances from some random node).

Distance      Nodes        Edges
#00               1           59
#01              59         4917
#02            3654       463608
#03          159042     17669289
#04          557105     25225499
#05          261623       564312
#06           25225        28748
#07            1879         2104
#08             190          221
#09              29           33
#10               4            6
#11               2            2
Third try: count path from both sides
The next specific property of our graph is the fact that when I'm a friend of someone, they must be my friend too. In graph terms, the graph is undirected. That means the shortest path from user A to user B is also the shortest path from user B to user A. As I mentioned before, the greater the depth, the more data there is to consider. For a moment let's assume that user A is at a distance of four handshakes from user B. We don't have to visit the expensive nodes at distance four from node A: we can visit nodes to a depth of two from node A and to a depth of two from node B. The shortest path from A to B must go through a node that is present in both sets. Using this method (bidirectional search) we can greatly reduce the complexity of the algorithm, because we never have to touch the expensive data at distance three or more from either node.

Fourth try
This method is much faster than the previous ones, but for some nodes the search time was still incredibly long. The algorithm expands one step at a time from both node A and node B, then compares the results in search of middle points. Instead of making one step on both sides at a time, we modified the program to expand only one side of the path: the side where the number of nodes is smaller. Thanks to this optimization we are guaranteed to do the least possible expansion of nodes. Unfortunately, we pay for it by doing twice as many comparisons of the results as in the previous method.
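A sketch combining the third and fourth tries: a bidirectional search that always grows the smaller frontier and stitches the path together at the first meeting point (the graph and names are again made up; a production version needs a little extra bookkeeping to guarantee strict minimality when meeting points appear on unbalanced levels):

```python
def handshake_path(graph, start, goal):
    """Bidirectional BFS that always expands the smaller frontier."""
    if start == goal:
        return [start]
    # prev_* remember how each node was first reached from its side
    prev_s, prev_g = {start: None}, {goal: None}
    frontier_s, frontier_g = {start}, {goal}

    def stitch(meet):
        path, node = [], meet          # forward half: start .. meet
        while node is not None:
            path.append(node)
            node = prev_s[node]
        path.reverse()
        node = prev_g[meet]            # backward half: meet .. goal
        while node is not None:
            path.append(node)
            node = prev_g[node]
        return path

    while frontier_s and frontier_g:
        # "fourth try": grow only the side whose frontier is smaller
        if len(frontier_s) <= len(frontier_g):
            frontier, prev, other = frontier_s, prev_s, prev_g
        else:
            frontier, prev, other = frontier_g, prev_g, prev_s
        nxt = set()
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in prev:
                    prev[neighbour] = node
                    nxt.add(neighbour)
        meeting = nxt & other.keys()   # did the two searches meet?
        if meeting:
            # several meeting points may appear at once; keep the shortest
            return min((stitch(m) for m in meeting), key=len)
        if frontier is frontier_s:
            frontier_s = nxt
        else:
            frontier_g = nxt
    return None

friends = {
    "ala":    ["bob", "celina"],
    "bob":    ["ala", "danuta"],
    "celina": ["ala", "danuta"],
    "danuta": ["bob", "celina", "edek"],
    "edek":   ["danuta"],
}
print(handshake_path(friends, "ala", "edek"))
```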

As a result, we evolved the standard Dijkstra method from a complexity of roughly O(K^N)
to a complexity of roughly O(K^(N/2)),
where K is the average number of edges per node and N is the average distance between nodes.

Even though it's still exponential (which sounds worse than logarithmic), in this specific case of a social network graph the method reduced the search time from seconds to a few milliseconds.

If you found this article interesting, you may also like my graphical visualization of social network connections.