Asynchronous Django responses (comet), yes, for WSGI

Comet on python?

I was looking for a comet server implementation in Python. There are many projects that claim to provide one, but none of them worked for me. Most are just proofs of concept that have never been deployed for real usage.

Here is the list of projects that my friend found:

The Orbited project looks like the thing I was looking for. The problem is that even the examples from the official documentation don't work. Sorry, I don't want to work with poorly documented projects.

The next candidate is Twisted, but I'm not a fan of it, for religious reasons.

The Fapws project also looks good, but it supports only the basics; there is no possibility of deferring (preempting) a request.

The problem

I want to do an ajax request from the browser that finishes only when some event occurs on the server side. The connection should stall and wait for the server. (This idea of pushing events from server to browser is called comet.)

The next thing is to allow Django applications to use this technology. As far as I know no one has ever done this before, probably because Django is based on the WSGI stack, which is, unfortunately, fully synchronous.

Even so, I think it's possible to make the WSGI stack support asynchronous responses without modifying WSGI at all.
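To illustrate the general idea (this is my own minimal sketch, not the server described below): a plain WSGI app can already stream a response in pieces by returning an iterator, and a server that doesn't buffer the iterator can flush each piece to the client as it is produced.

```python
# A WSGI app that yields its body in chunks. A non-buffering server can
# flush each chunk to the client as it appears (with
# Transfer-Encoding: chunked on HTTP/1.1).
def chunked_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/html')])
    for i in range(5):
        # In a real comet server the pause between chunks would be an
        # event-loop timer, not a blocking sleep.
        yield ('stage%d<br />' % i).encode()

# Minimal driver showing what a server would receive from the app.
def run(app):
    captured = []
    def start_response(status, headers):
        captured.append((status, headers))
    body = list(app({}, start_response))
    return captured[0][0], body
```

The hard part, of course, is the pause between chunks: the iterator above would have to yield control back to an event loop instead of sleeping, which is exactly what the libevent-based server below arranges.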

The solution

The author of FAPWS suggested that it's possible to use libevent directly from Python using ctypes. Libevent has proved to be reliable and fast enough in most cases, so it seemed a good idea to use it.

Using libevent glued with ctypes I created a simple WSGI server (William's examples helped me a lot).

As I mentioned, the idea was to allow Django to produce async responses. For the proof of concept I created a view that produces output in five chunks, separated by delays of one second. Sounds easy, but it's not so simple to do without freezing the server process for the whole time.

Going straight to the example, here's the views.py file from the sample Django app:
# Url handler of the basic url.
# Only the headers from this response are used.
# The same goes for the status code: this function's
# response status code is served to the client.
def comet_main(request):
    response = HttpResponse('First content<br /> ') # return something
    response.stage = 0
    return defer_in_time(response,
                         1, '/stage/') # after one second run the '/stage/' url handler

def comet_stage(request, response_prev=None):
    response = HttpResponse('stage%i<br /> ' % (response_prev.stage + 1))
    response.stage = response_prev.stage + 1
    if response.stage < 3: # loop here three times
        url = '/stage/'
    else:
        url = '/stage_last/' # then end the connection
    return defer_in_time(response,
                         1, url) # after one second run the url

def comet_stage_last(request, response_prev=None):
    # no magic here, just a regular response
    return HttpResponse('last_stage%i<br /> ' % (response_prev.stage + 1))
Here is how it looks from the browser:

And here's what really happens on the wire (notice that Transfer-Encoding: chunked is used):
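For reference, chunked encoding frames every piece of the body as a hex length line followed by the data; here's a minimal encoder sketch (my own illustration, not code from the server):

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings as an HTTP/1.1 chunked body."""
    out = b''
    for chunk in chunks:
        # each chunk is sent as: <hex size>\r\n<data>\r\n
        out += b'%x\r\n%s\r\n' % (len(chunk), chunk)
    # a zero-length chunk terminates the body
    return out + b'0\r\n\r\n'
```

This is what lets the server send each stage as soon as it's ready, without knowing the total Content-Length up front.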

The sources are here.

What's next
The next thing to do is to create an example of a full comet application.


PLUG: Warsaw python meetings

My friends are going to give Python presentations in Warsaw, as part of the PLUG meetings.

Thursday, 29th November 2007, 19:00 at Politechnika Warszawska, sala AL.

More information:


Motoman weekly news #1

If you're interested in my work with Motoman industrial robots this post is for you.

First of all, this photo isn't connected to this post. It was taken by Marianek when the left Motoman got out of my control for a while. But getting back to the point...

We have been suffering from serious errors on the right robot, mostly on the R axis. My guess was that the fault was a broken wire or a damaged encoder on the motor.

We needed to know what was really causing the errors, so we (Creek and I) swapped the R motors between the left and right Motomans.

Now it seems clear that the problem lies in the motor encoder. We need to buy a new one :)

I hope to repair the robots soon.


Trinity choice: nsock over libevent

Everybody knows that Trinity uses nmap.
But what does she have to do with nsock or libevent? Well, nothing.

Recently I wrote some software using libevent and I'm disgusted. Libevent is poorly documented (except for the suggestive, self-explanatory function names), and there are not many examples on the web. Nevertheless I made the effort and wrote a few programs.

About a year ago I got my hands on the nmap project and wrote something against nmap's asynchronous networking library, nsock.

There are some really interesting differences between these two libraries:

  • In libevent you must allocate memory for and set up event structures by hand; from the manual: "It is the responsibility of the caller to provide these functions with pre-allocated event structures." In nsock, event objects are managed by the library.
  • I thought that EV_PERSIST events should be persistent, but I don't quite get the idea of the EV_PERSIST event type in libevent. It doesn't work when I think it should (for example for EV_WRITE). On the other hand, nsock doesn't provide any kind of persistent events.
  • Libevent provides an API for queuing signals. I have no idea what application could make use of such a feature; I can't think of a single real-world use case for queuing signals.
  • Nsock executes the callback function with information about the state of the event: whether it succeeded, failed, or timed out. Libevent doesn't; the user has to check the socket status by hand.
  • In nsock the user can get the current time (the time after the last select() in the main event loop) without a syscall, using nsock_gettimeofday(), which can speed up a program. In libevent you have to use gettimeofday().
  • Libevent supports multiple polling methods, most notably epoll(), while nsock uses only the slow select().
  • Libevent can be used for both the client and the server side. Nsock is built especially for the client side; it's possible to abuse it for the server side, but that's more like hacking.
The thing I like most about nsock is how easy it is to use. I don't have to worry about event structures, and I have the guarantee that sooner or later my callback function will be called. To show the difference I have an example, the simplest program I could think of: it registers a timer callback with a zero timeout, a million times, one after another.

First libevent example:
/* gcc -Wall event_test1.c -levent && time ./a.out */
#include <stdlib.h>
#include <event.h>

int counter = 0;
struct timeval tv = {0, 0};
struct event ev;

void timer_handler(int fd, short event, void *ud){
    event_add(&ev, &tv); /* re-arm the timer */
    if(counter > 1000000)
        exit(0);
    counter++;
}

int main(){
    event_init();
    event_set(&ev, -1, EV_TIMEOUT, timer_handler, NULL);
    event_add(&ev, &tv);
    event_dispatch();
    return 0;
}
And very similar nsock code:
/* Sorry, some magic with compilation. Nsock isn't a standalone library yet!
gcc -Wall nsock_test.c -I/home/majek/nmap/nsock/include /home/majek/nmap/nsock/src/libnsock.a -ldnet /home/majek/nmap/nbase/libnbase.a /home/majek/nmap/libpcap/libpcap.a && time ./a.out */
#include <nsock.h>
#include <stdlib.h>

nsock_pool nsp; /* global */
int counter = 0;

void timer_handler(nsock_pool nsp, nsock_event nse, void *ud){
    nsock_timer_create(nsp, timer_handler, 0 /*ms*/, NULL); /* re-register */
    if(counter > 1000000)
        exit(0);
    counter++;
}

int main(){
    nsp = nsp_new(NULL);
    nsock_timer_create(nsp, timer_handler, 0 /*ms*/, NULL);
    nsock_loop(nsp, 1000000);
    return 0;
}
The execution times of these programs are very interesting. Unbelievably, it seems that nsock is more than two times faster!
Libevent average timings:
real    0m2.271s
user 0m0.640s
sys 0m1.624s
Nsock average time:
real    0m0.941s
user 0m0.392s
sys 0m0.548s
The "user time" shows that libevent implementation of internal structures is much heavier than nsock. Even that nsock should have more work because it's allocating event structures alone, without users help. The "sys time" should be similar for both libraries, but it's not. Strace is going to reveal the issue:
# The loop is reduced to 100 iterations.

# Data for NSOCK
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0       203           gettimeofday

# Data for LIBEVENT
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
   nan    0.000000           0       406           gettimeofday
   nan    0.000000           0       101           epoll_wait
   nan    0.000000           0         1           epoll_create
   nan    0.000000           0         1           epoll_ctl
Well, it seems clear that two things are broken in libevent. First, gettimeofday() is called four times for every event loop iteration, which is an unacceptable waste of resources. Second, epoll_wait() is executed on every loop iteration, while nsock doesn't call even a single select(). In this particular case, where timer events are registered with a zero timeout, there's no need to call epoll_wait() at all.

From a programmer's point of view, I like the nsock programming style. Unfortunately nsock lacks many important features, like epoll or server-side socket handling.

It seems that libevent ain't perfect either.

Maybe there's a need for an async library with an easy API like nsock and as full-featured as libevent?

Libevent under python

So I want to write an asynchronous TCP/IP server in Python.

I really hate overblown Twisted. The things I like most about Python are its simplicity and readability. Well, in my opinion Twisted doesn't have these attributes. It's of course my personal feeling, mostly because I don't know Twisted. But statements like this make me sick; at first sight I really don't understand what they're for (from the core Twisted documentation):

internet.TCPServer(7779, IFingerFactory(f)).setServiceParent(

I remember such cascades from Java rather than Python, and I dislike Java especially for this style of programming.

Getting back to the point. If not Twisted, then what? Let's try libevent: a stable C library which is a production standard by now. Memcached uses it, Tor uses it, IO does too. It must be a really professional asynchronous library.

There are also some other async Linux libraries, like liboop. But wait a moment: libevent gets 500k Google hits, liboop 25k. Sorry, the only asynchronous library for Linux is libevent, for now.
(Btw, liboop's API is, in my opinion, totally broken.)

Are there any Python wrappers for libevent? Google knows about two: libevent-python and pyevent. Both seem abandoned.
I don't like exceptions raised from inside libraries, so it didn't take long to eliminate pyevent:
    Traceback (most recent call last):
File "./tights.py", line 37, in
File "./tights.py", line 34, in main
File "event.pyx", line 262, in event.dispatch
TypeError: exceptions must be strings, classes, or instances, not type

Update: Michael Carter writes in the comments that this bug is the fault of Pyrex rather than pyevent, and that it shouldn't happen on Python 2.4 (I use 2.5). I'll have to take a closer look at pyevent again.

Libevent-python seems to be working. Unfortunately this wrapper removed the concept of 'userdata' passed to the callback in favor of a class-based approach. The difference is that the callback is not called with 'userdata'; instead it's called on a specific class instance, so the callback has a 'self' object. I personally prefer the 'userdata' approach, but it's quite easy to switch.
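Switching between the two styles really is a one-liner; here's a plain-Python sketch (all names here are mine, not from libevent-python):

```python
# Class-based style: the callback is a bound method, state lives on self.
class Handler:
    def __init__(self, userdata):
        self.userdata = userdata
    def on_event(self, fd, events):
        return ('class', self.userdata, fd)

# 'userdata' style recovered with a closure: bind the extra argument
# up front, then hand the wrapper to the event library.
def with_userdata(callback, userdata):
    def wrapper(fd, events):
        return callback(fd, events, userdata)
    return wrapper

def on_event(fd, events, userdata):
    return ('userdata', userdata, fd)
```

So even if a wrapper forces one style on you, a closure (or a small class) converts between them for free.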

Here's an example of a simple libevent-python server which just prints all the data it receives (something like the standard echo_server.py example from libevent-python, but simpler):
# -*- coding: utf-8 -*-
import socket, signal, libevent

def callback_interrupt(signum, events, event_obj):
    raise SystemExit # quit on ctrl+c

def callback_onconnect(fd, events, event_obj):
    # hack to avoid passing sock_srv from main()
    sock_srv = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    sock, (host, port) = sock_srv.accept()
    conn = Connection(sock, (host, port))

class Connection:
    def __init__(self, sock, addr):
        self.sock = sock
        self.addr = addr
        libevent.createEvent(sock, libevent.EV_READ, self.callback_onread).addToLoop()

    def callback_onread(self, fd, events, event_obj):
        buf = self.sock.recv(4096)
        if not buf: # just disconnect
            self.sock.close()
            return
        # Yeah, print!
        print "%r %r" % (self.addr, buf)
        # reuse current event
        event_obj.addToLoop()

if __name__ == '__main__':
    # bind
    sock_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock_srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock_srv.bind(('', 1111))
    sock_srv.listen(4)

    libevent.createSignalHandler(signal.SIGINT, callback_interrupt).addToLoop()
    libevent.createEvent(sock_srv, libevent.EV_READ|libevent.EV_PERSIST, callback_onconnect).addToLoop()
    libevent.dispatch() # enter the main loop

Here's another way of connecting libevent with Python, using raw ctypes: http://william-os4y.livejournal.com/3718.html.


Website loading time debugging (what happens on the wire when you browse?)

One of my responsibilities is optimizing page loading time. There are some very useful tools on the web that help you understand where the bottlenecks are.

Firefox addons like Firebug with YSlow, and TamperData, are the most important and easiest-to-use tools for any web developer.

On the other hand, I see the lack of a tool that is more focused on networking. It's fine that YSlow can tell me that caching headers are broken, but how can I test whether content is really being cached?

I would like to see a tool that looks more or less like the Firebug networking tab, but shows what is really going on on the wire. I would like to see all conditional ("If-Modified-Since") requests, or whether ETags are used. There is also the TCP/IP layer, which is sometimes forgotten: are keep-alives used, is establishing the TCP connection fast?

That's why I'm thinking about a new tool; nothing revolutionary, just a tool useful to me. I'm currently writing the proof of concept. A sample screenshot from my work:


Set process name in python

How do you set a process name from inside the process? I know that nmap can fake its name; the question is how to do this from Python.
The code that does this magic in nmap:

/* from nmap.cc */
if (quashargv) {
    size_t fakeargvlen = strlen(FAKE_ARGV), argvlen = strlen(argv[0]);
    if (argvlen < fakeargvlen)
        fatal("If you want me to fake your argv, you need to call the program with a longer name. Try the full pathname, or rename it fyodorssuperdedouperportscanner");
    strncpy(argv[0], FAKE_ARGV, fakeargvlen);
    memset(&argv[0][fakeargvlen], '\0', strlen(&argv[0][fakeargvlen]));
    for (i = 1; i < argc; i++)
        memset(argv[i], '\0', strlen(argv[i]));
}
There are some possible ways of doing this from Python. Some people suggest using the prctl ioctl, but it didn't work for me. The proctitle module also didn't do what I expected.
Ever since I found the rtgraph3d project, I've been a fan of the PyInline module. Here's my Python code that works (for me):
import PyInline, time

m = PyInline.build(code=r"""
#include <Python.h>
#include <stdio.h>
#include <string.h>

void set_argv(char *str){
    int argc;
    char **argv;
    Py_GetArgcArgv(&argc, &argv);
    strncpy(argv[0], str, strlen(str));
    memset(&argv[0][strlen(str)], '\0', strlen(&argv[0][strlen(str)]));
}
""", language="C")

m.set_argv('very custom python process name')
It looks funny in ps:
majek@tigger:~$ ps aux|grep very
majek 11688 2.6 0.4 7192 4360 pts/21 S+ 00:20 0:00 very custom python process name
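For completeness, the prctl route can also be driven from Python with ctypes. Note that PR_SET_NAME changes only the kernel's comm name (the 15-character name shown in /proc/&lt;pid&gt;/comm and by ps -o comm), not argv[0] as seen by ps aux, which may be exactly why it looks like it "doesn't work". A Linux-only sketch:

```python
import ctypes, sys

PR_SET_NAME = 15  # from linux/prctl.h

def set_comm(name):
    """Set the kernel task name (truncated to 15 bytes). Linux only."""
    if not sys.platform.startswith('linux'):
        raise OSError('prctl(PR_SET_NAME) is Linux-specific')
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(name.encode()[:15])
    if libc.prctl(PR_SET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno())
```

After set_comm('mydaemon'), /proc/self/comm reads 'mydaemon', while the full command line (and ps aux) stays unchanged; the PyInline hack above is still needed for argv[0].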


The mystery of extremly long "grep -i"

It's a well-known fact that grep is the fastest way of searching data. The mystery was why "grep -i" took so much longer than a standard, case-sensitive grep.

Finally depesz found the reason.

GNU pth instead of pthread: hardcore python tuning

I've been working on the speed of an application written in Python. While stracing it I found that most of the syscalls were futex(). These futexes came from Python's internal synchronization code. But the application never used any kind of threading!

Most users never realize that Python does synchronization between threads even when threads aren't used. It's not so bad, those futex() calls are very quick, about 7 µs each. But they are executed thousands of times!
A sample "strace -c":

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.62    0.000570           0     15192           futex
 10.91    0.000092           0       896           sendto
  7.00    0.000059           7         8           stat
  5.34    0.000045           0      1936           poll
  3.32    0.000028           0      1076           recvfrom
  2.73    0.000023           1        28           open
  2.02    0.000017           3         6           write
To get rid of the futex() calls one can compile Python without support for threading (./configure --without-threads). The downside of this method is that some modules depend on threading, like the PostgreSQL wrapper psycopg.

There is also another method. It turns out that Python supports the GNU pth library, which emulates threads in userspace using a pthread-compatible interface. The only non-trivial thing is to get rid of pthread and use pth when compiling Python (even though pth should be supported by Python out of the box).

Enabling pth is not easy, but it's doable. Here's the compilation procedure, thanks to my friend Primitive:
$ sudo apt-get install libpth-dev
$ wget http://python.org/ftp/python/2.5.1/Python-2.5.1.tar.bz2
$ wget http://ai.pjwstk.edu.pl/~majek/dump/pyconfig.h.in-pth-patch.diff
$ tar xjf Python-2.5.1.tar.bz2
$ cd Python-2.5.1
Python-2.5.1$ cat ../pyconfig.h.in-pth-patch.diff|patch -p0
patching file pyconfig.h.in
Python-2.5.1$ ./configure --with-pth --prefix=/usr
Python-2.5.1$ find . -name Makefile -exec sed -i "s/-lpthread/-lpth/" {} \;
Python-2.5.1$ sed -i "s/-pthread//" Makefile
Python-2.5.1$ sed -i 's/$(srcdir)\/Python\/thread_pthread.h//' Makefile
Python-2.5.1$ make
In this corner case the time gained is about 25%-30%. The test program is not complicated at all. We know that importing modules requires synchronization, so let's do only imports:
for i in xrange(1000000):
import os
Average results for standard python 2.5.1, using pthread:
real    0m2.061s
user 0m1.728s
sys 0m0.332s
Average results for patched python 2.5.1 using pth:
real    0m1.572s
user 0m1.560s
sys 0m0.008s
What about those futexes?
Well, as I expected, the futexes are gone when using pth. Here are strace results for Python with pthreads:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99    1.014676           1   1000020           futex
  0.01    0.000067           0       142           read
  0.01    0.000057           0      4504           _llseek
And for fixed python with pth:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050           0      4504           _llseek
  0.00    0.000000           0       142           read


Interesting gettimeofday() on x86_64

While stracing an application I found that there were no gettimeofday() syscalls. That's nothing new for x86_64 experts, but it was interesting for me.

Here is an example of the code:

#include <stdio.h>
#include <sys/time.h>

int main(){
    printf("start\n");
    gettimeofday(NULL, NULL);
    printf("stop\n");
    return 0;
}
Sample strace from i386:

write(1, "start\n", 6) = 6
gettimeofday(NULL, NULL) = 0
write(1, "stop\n", 5) = 5
And the same code on x86_64:

write(1, "start\n", 6) = 6
write(1, "stop\n", 5) = 5

The reason, as far as I can tell, is that on x86_64 gettimeofday() is serviced through the vsyscall/vDSO mechanism: the kernel exports the current time into user-readable memory, so no real syscall is made and strace sees nothing.


Finding paths between users in social network graph.

For some time I've been working at Grono.net, the biggest Polish social networking website (over 1.3 million registered users). Like in other networking sites, our users have lists of friends they know. We have an interesting feature of finding the shortest path between two users, measured in "handshakes" (by a "handshake" I mean a connection between two users).

The problem with that "paths" service was that it was not scaling well. Our servers were overloaded and the path searching time was unacceptable. It became clear that it's good time to rewrite our algorithms.

Zero: what's the problem
Social networks are based on the idea of "six degrees of separation": the distance (which we can measure in the number of handshakes) between any two users is quite small. Even though the average number of friends per user is not big, finding shortest paths between users is very expensive in terms of algorithmic complexity.

When the social network graph is small, one can use the standard Dijkstra method for finding shortest paths. But as the graph gets bigger and bigger, the standard algorithm can take a few seconds and becomes inefficient.

First try: the Dijkstra method
Dijkstra's algorithm is the simplest one for finding shortest paths in graphs (see the Wikipedia article for details). In our case, users can be treated as graph nodes, and connections between users as edges. We know that the distance between users is always relatively small, at most a dozen handshakes (edges). Knowing this, we can optimize the Dijkstra method for our specific needs. The cost of a standard Dijkstra implementation is O(V * log(V) + E). In our case we can easily reduce this to O(V + E), by implementing the priority queue as a hash table keyed by distance: every operation we need on this data structure then takes constant time (instead of the logarithmic cost of a standard priority queue).
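Since every edge here has weight 1, the distance-bucketed Dijkstra degenerates into a plain breadth-first search, which is where the O(V + E) bound comes from. A minimal sketch (my own illustration, not our production code):

```python
from collections import deque

def shortest_path(friends, src, dst):
    """friends: dict mapping node -> iterable of neighbours.
    Returns the list of nodes from src to dst (inclusive),
    or None if the users are not connected."""
    prev = {src: None}          # predecessor map; also the 'visited' set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # walk the predecessor chain back to src
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbour in friends[node]:
            if neighbour not in prev:
                prev[neighbour] = node
                queue.append(neighbour)
    return None
```

Each node enters the queue at most once and each edge is examined at most twice, hence O(V + E).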

Second try: finish counting earlier
The standard Dijkstra algorithm computes paths from a specified node to all other nodes. In our case we don't need paths to all possible users; we can stop as soon as the first valid path to the target user is found. One might think this would speed up the program significantly, but it doesn't. The gain in practice was small, because of specific properties of our graph: the number of connections (edges) grows very rapidly at each successive level. It seems that within a distance of four handshakes we can reach most of the users, so by the time we could abandon the computation with the first result, we'd already have counted shortest paths to most of the users. Here is sample data showing the growing number of nodes and edges at each level of handshakes (the distance from some random node):

Distance      Nodes        Edges
#00 |            1n          59e
#01 |           59n        4917e
#02 |         3654n      463608e
#03 |       159042n    17669289e
#04 |       557105n    25225499e
#05 |       261623n      564312e
#06 |        25225n       28748e
#07 |         1879n        2104e
#08 |          190n         221e
#09 |           29n          33e
#10 |            4n           6e
#11 |            2n           2e
Third try: count path from both sides
The next specific property of our graph is that when I'm a friend of someone, he must be my friend too; in computer terms, the graph is undirected. That means the shortest path from user A to B is also the shortest path from B to A. As I mentioned before, the greater the depth, the more data there is to consider. Assume for a moment that user A is at a distance of four handshakes from user B. We don't have to count the expensive nodes at distance four from node A; we can count the nodes at depth two from node A and at depth two from node B. The shortest path from A to B must pass through a node that appears in both sets. Using this method (bidirectional search) we greatly reduce the complexity of the algorithm, because we never have to touch the expensive data at distance 3 or more from either node.

Fourth try
This method is much faster than the previous ones, but for some nodes the counting time was still incredibly long. The algorithm takes one step at a time from both node A and node B, and then compares the results in search of middle points. Instead of taking one step from both sides per round, we should modify the program to expand only one side at a time: the side where the number of frontier nodes is smaller. Thanks to this optimization we have a guarantee that we expand the fewest possible nodes. Unfortunately we pay for this by doing twice as many comparisons of the results as in the previous method.
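Here's a sketch of that final idea, expanding whichever frontier is currently smaller (my own illustration, counting only the distance rather than reconstructing the path):

```python
def distance(friends, a, b):
    """Length (in handshakes) of the shortest path between a and b in an
    undirected graph, via bidirectional BFS; None if not connected."""
    if a == b:
        return 0
    front_a, front_b = {a}, {b}
    seen_a, seen_b = {a}, {b}
    dist = 0
    while front_a and front_b:
        # always expand the cheaper side
        if len(front_a) > len(front_b):
            front_a, front_b = front_b, front_a
            seen_a, seen_b = seen_b, seen_a
        dist += 1
        next_front = set()
        for node in front_a:
            for neighbour in friends[node]:
                if neighbour in seen_b:   # the frontiers met in the middle
                    return dist
                if neighbour not in seen_a:
                    seen_a.add(neighbour)
                    next_front.add(neighbour)
        front_a = next_front
    return None
```

Because each expansion advances one frontier by exactly one level, the first meeting point yields the exact shortest distance, while only the small frontiers ever get expanded.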

As a result, we evolved the standard Dijkstra method from a complexity of roughly K^N
down to roughly 2 * K^(N/2),
where K is the average number of edges per node and N is the average distance between nodes.

Even though it's exponential (which sounds worse than logarithmic), in this specific case of a social network graph the method reduced the search time from seconds to a few milliseconds.

If you found this article interesting, you may also like my graphical visualization of social network connections.


Die Google! - future search engines

While talking with a friend, I realized that Google isn't the last word in web search engines.

From a technical point of view the Google search engine isn't magic; indexing data is nothing revolutionary.

The only interesting thing about Google's success is their method of promoting good results and punishing bad ones. In Google nomenclature it's called the (mystified) PageRank. They keep the exact algorithm secret, but in the beginning it was some kind of counting of links to a given website.

The development of Sergey and Larry's algorithm was constrained by their input data. The only things they had were the source of a web page and the links between sites. Nothing more. (Maybe the domain name, link URL and hosting center, but that doesn't say anything about the content.)

Currently, in the era of user-generated content, we have a bit more information.

We have a direct link between the content and the author. In the Web 2.0 epoch we know exactly who wrote a given text, and where and when.

If you search for some information and you found someone's post useful, maybe you'd like to look for more posts by this person. Maybe you'd like to see what threads this person reads. Or if you read some forum, it could be interesting to you that most of the people interested in this forum also like some other forum. (This is already implemented in online shops: you can see that people who bought this item also bought something else.) Or when you search for information, maybe you'd like to see content created by your friends first?

I want to emphasize that exact information about the author of the content is something Google will never be able to get.

Maybe the idea is to change the meaning of search. Maybe nowadays the point of 'search' is to link the user not with the content, but with the author?

I'm already using this approach.
When I find something interesting, I want to see other things by the author. I want to know what he is interested in; I search for his profile on LinkedIn, I search for his interesting links on del.icio.us.

How about you?


Graphical representation of connections between users

Recently, I've been looking for a way to graphically show connections between users in a social network. The basic idea is to show it in 3D, in the simplest possible manner.

I tried to code this, but unfortunately I didn't know any good-enough algorithm for laying out the scene.

Other social networks have tried to visualize connections; for example, this graph was created in 2003 by Club Nexus at Stanford University.

A few days ago I found a very interesting application: RTGraph3D, a tool from Philippe Biondi, the author of very successful applications like Scapy.

What I found most surprising is that the results sometimes show something interesting. For example, this image shows someone's friends, with the connections between them. From this image you can easily recognize three groups of friends.

This is an example of the most common view: just a mess of connections between users.

This is yet another try. It shows two levels of links, but without connections between users at the same level.

RTGraph3D seems to be a good tool for such graphs. I have a tool, I have data, but I don't have an idea of what other interesting things could be shown.


Magic popcount (popcnt) command

From Frank de Groot's blog:
Every serious hacker sooner or later needs the popcount instruction.
This "population count" instruction counts the set bits in a register, and is so useful that the NSA demands that all computers they purchase implement it in hardware.

But this instruction is not present in the x86 architecture. The AMD64 documentation (p. 188) describes a popcount extension, but I've never heard of any processor that implements this feature.

You may wonder why one would need a popcount instruction. I needed it for a very simple task: I wanted to change a data structure from a bitfield to a list of bit positions. Basically I needed something like this:

0x1001 -> [0, 12]
0xF000 -> [12,13,14,15]
The simplest solution is to iterate over every bit and check if it's set. A more complicated algorithm could count the number of set bits and use that count to index precomputed results. I wonder if that approach is faster. It's not simple to say, because low-level performance issues can play a big role in this problem.
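The naive bit-by-bit approach is only a few lines in any language; here's a Python sketch of the transformation above:

```python
def bit_positions(x):
    """Return the indices of the set bits in x, lowest first."""
    positions = []
    bit = 0
    while x:
        if x & 1:          # test the lowest bit
            positions.append(bit)
        x >>= 1            # shift the next bit into position 0
        bit += 1
    return positions
```

It matches the examples above: 0x1001 gives [0, 12] and 0xF000 gives [12, 13, 14, 15]. The loop runs once per bit up to the highest set one, which is exactly the cost a hardware popcount would help avoid.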

I wonder what the fastest possible method for this problem is. I bet the popcount instruction is the key to an efficient implementation.

I found some interesting popcount implementations in software.
The GCC approach, summing bits for every byte:
/* 256-entry table: __popcount_tab[i] holds the number of set bits in i */
const UQItype __popcount_tab[256] = { /* ... */ };

int __popcountSI2 (UWtype x)
{
    UWtype i, ret = 0;
    for (i = 0; i < W_TYPE_SIZE; i += 8)
        ret += __popcount_tab[(x >> i) & 0xff];
    return ret;
}
The method based on subtraction, from GnuLib (each iteration clears the lowest set bit):

int popcount(unsigned int x){
    int pop;
    for (pop = 0; x != 0; pop++)
        x &= x - 1;
    return pop;
}
AMD64 Optimization Guide (p. 137) (they also have an MMX version):
unsigned int popcount(unsigned int v)
{
    unsigned int retVal;
    __asm {
        MOV EAX, [v]         ;v
        MOV EDX, EAX
        SHR EAX, 1           ;v >> 1
        AND EAX, 055555555h  ;(v >> 1) & 0x55555555
        SUB EDX, EAX         ;w = v - ((v >> 1) & 0x55555555)
        MOV EAX, EDX
        SHR EDX, 2           ;w >> 2
        AND EAX, 033333333h  ;w & 0x33333333
        AND EDX, 033333333h  ;(w >> 2) & 0x33333333
        ADD EAX, EDX         ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
        MOV EDX, EAX
        SHR EAX, 4           ;x >> 4
        ADD EAX, EDX         ;x + (x >> 4)
        AND EAX, 00F0F0F0Fh  ;y = (x + (x >> 4)) & 0x0F0F0F0F
        IMUL EAX, 001010101h ;y * 0x01010101
        SHR EAX, 24          ;population count = (y * 0x01010101) >> 24
        MOV retVal, EAX      ;store result
    }
    return retVal;
}
There is also an interesting thread about popcount at programming.reddit.com. Yet another source: John-Mark Gurney's tips about popcount implementation.


Python memory hack

Django and most Python frameworks eat a lot of RAM. To serve multiple pages at the same time there must be many running Python processes (unlike light www daemons such as lighttpd or nginx, which serve all requests from a single process).

So every Python process eats RAM. During its life it imports many libraries and files. These imports are never freed by the garbage collector. They are almost identical for every process, but the problem is that they are not shared between processes: if you import some library in Django, it's going to be imported separately in every Django process.

To reduce memory usage, it's possible to import all used libraries before forking the Python processes. Linux fork() uses the copy-on-write technique, so one would think that preloaded libraries would be shared between the Python processes.

Unfortunately, in Python this doesn't work. That's because immutable and mutable data are stored in the same memory segment! For example, the header structure of an immutable string changes often, because it contains the reference counter.

I created a quick memory hack for CPython. Python imports are mostly code objects, which keep their code in string objects. I hacked the string object so that it allocates memory for the immutable data in a separate memory segment from the mutable part. My hack allocates a big memory segment in which immutable string data is stored. Apart from storing new strings, that memory is not changed during execution, so I hope that after fork(2) this memory segment will be successfully shared between processes. The bad side is that this memory is never freed.

Example usage:

'a'.setmajekhack() # enable hack
# do some imports here
import sys,os,time
'a'.setmajekhack() # disable hack

Simple output; this time only 17KB of string data was put into the immutable segment:
MAJEKHACK enabled 0x2b6bc0c57000
MAJEKHACK disabled, allocated 17131 bytes

Unfortunately this hack doesn't give results as good as I expected. Still, I think the Python memory allocator could be highly optimized for cases like this and save our precious memory.


Crisis with asynchronous programming

I'm currently designing an application. It's going to be a fairly simple Linux network daemon that serves some data from disk.

Looks simple. I can easily predict that the bottleneck will be either disk I/O or network bandwidth. Nowadays it's obvious that a high-performance network daemon must be 100% asynchronous.

The problem is that there isn't yet any technology that allows fully asynchronous programming. Okay, there are libevent and liboop for C, and Twisted for Python. But I mean fully asynchronous, with fully async stat(2) and open(2) syscalls.

There are projects that try to handle these issues. For example, lighttpd has a specialized thread pool just for managing stat(2). I don't want to do it this way. I want an easy-to-use, fully async, portable library that will let me code in the 100% right way.
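For the record, the thread-pool workaround can be sketched in a few lines of Python (the names are mine; this is exactly the kind of hack I'd like to avoid):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# The lighttpd-style workaround: fake an asynchronous stat(2) by running
# the blocking call on a small thread pool and firing a callback.
pool = ThreadPoolExecutor(max_workers=4)

def stat_async(path, callback):
    future = pool.submit(os.stat, path)
    future.add_done_callback(lambda f: callback(f.result()))

results = []
stat_async(".", results.append)   # non-blocking for the caller
pool.shutdown(wait=True)          # only for the demo: wait for the callback
```

It works, but the "asynchrony" is simulated with extra threads instead of being provided by the kernel.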

And please don't tell me there is no need for aio_open or aio_stat (example). Disks aren't any faster than five years ago, so access to on-disk metadata is becoming relatively slower. But Linus doesn't think the problem exists.

I can't believe that in the 21st century there isn't any simple solution to such a trivial problem.


Nmap nse descriptors overflow patch

I just released a quick patch for Brandon's bug. It's interesting that this flaw wasn't found earlier.

My patch is a bit hackish, but it should work. I wonder if anyone will find a better way of fixing this issue. The descriptor overflow patch lies here.

Update 1: Hurray! The patch is included in the svn version of nmap (r5739).


Well, we need Google architecture, but how should we implement such a huge project?

Every fast-growing company will have to face these issues sooner or later:

  • storage disks are full
  • it's hard to add more disks to existing systems
  • files are served too slowly because disk seeks are becoming the limitation
  • a simple database grows to millions of rows; the cost and complexity of partitioning reaches the limit of cost-effectiveness

It seems that the Google architecture solves these issues. By their architecture I mean their applications, like the Google File System, the Chubby lock service and, most importantly, the Big Table distributed database (not really a database, just a distributed multidimensional map). Its main properties:

  • scales linearly (add a new machine to the cluster and it just works as it should; no specific configuration or manual partitioning of existing data is needed)
  • auto-partitioning (when one machine is overloaded, data is split and moved to another one)
  • low cost of administration (you care only about hardware failures)
  • designed to be highly available (it's not exactly HA: rather than eliminating single points of failure, Google focused on restoring failed applications fast. In case of failure, an application will be down for a few seconds at most.)

Well, the Google architecture also has some limitations; for example, sometimes you can't afford those few seconds of downtime. But for most users the Google architecture solves many important problems.

Are you going to implement Big Table?

The main problem with this architecture is that it's built from projects that are tightly coupled. You can't build a reliable clone of Big Table without the Google File System and the Chubby lock manager.

The bigger the project, the greater the chance of failure. Nobody wants to make a highly risky investment. We can hear "well, it's easier to hack the current architecture than to build a new one from scratch". That's obviously true up to some point. After that, you need a final solution to your problems right now. And it keeps getting harder to move a dozen or so terabytes of data from the old storage system to a new one.

I think there is a possibility of building something like the Google architecture in small steps. But it's going to have some limitations in the first stages.

First step: "Administrators' nightmare"
Assume that we need only auto-partitioning; every other feature we ignore at this point. We're going to create a basic Big Table. By that I mean a standalone application that answers a simple mapping from key to value. The internals must be built as a simplified version of a Tablet server. The only feature we truly need is auto-partitioning: when a Tablet's size reaches some threshold, it splits into two instances, with a reduced row range for each instance.
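The split rule can be sketched in a few lines (a toy; the threshold and names are made up, and a real Tablet would split on size in bytes, not row count):

```python
class Tablet:
    MAX_ROWS = 4  # made-up threshold; real systems split on size in bytes

    def __init__(self, rows=None):
        self.rows = dict(rows or {})

    def put(self, key, value):
        """Store a row; return a new Tablet if this one had to split."""
        self.rows[key] = value
        if len(self.rows) <= self.MAX_ROWS:
            return None
        # Split the sorted key range in half: keep the lower half here,
        # hand the upper half to a freshly created Tablet.
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        upper = {k: self.rows.pop(k) for k in keys if k >= mid}
        return Tablet(upper)
```

The caller (eventually the METATABLE of a later step) is responsible for registering the returned Tablet and its new row range.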

What we're going to get:
  • administrators must move instances manually from one machine to another
  • manually configured (hardcoded) row ranges at the clients
  • no distributed features
  • fewer disk seeks compared to a filesystem-based key-value mapping

Second step: Multi-dimensional mapping
The idea of column families, timestamps and unlimited columns is just great.

Third step: Chubby lock manager
We need to create some automated way of informing the clients about Tablet changes. Clients shouldn't have any hardcoded configuration. At this step we should create some kind of METATABLE for handling the Tablets' row ranges, with Chubby as the main source of basic metadata. We can assume that we have only one instance of the Chubby server (no replication). But we definitely need Chubby to serve files, locking and asynchronous events.

Fourth step: The File System
We need a file system with a shared namespace for every node. If a Tablet is partitioned, it must be possible to open the new Tablet's data on any other server.

Fifth step: Competing with the Giant :)
At some point Chubby should be replicated, with the Chubby master chosen using the famous Paxos algorithm, to make it fully HA. Some other Big Table features would also be nice, like Bloom filters or compression. Secondary indexes on Big Table would be very nice too (yippee, Google doesn't have that feature).

Pencils Up! I wish you happy coding.

BTW, there is the Hadoop/HBase project you may be interested in.


Some thoughts on hashing

I was looking for fast hash algorithms and I found Paul Hsieh's SuperFastHash implementation. I also found that in the Bloom filter algorithm, hash functions are used to test whether an element is a member of a set.

I found an interesting fact: it's possible to compute any number of independent hashes using only two hash functions, with this formula (f1 and f2 are hash functions, i is any number relatively prime to m):

hash_i(x) = (f1(x) + i * f2(x)) mod m

Even more surprising is the fact that if f1 is something like this:

f1(x) = hash(x) mod m

then f2 can be:

f2(x) = 1 + (hash(x) mod (m - 2))

This means that to compute any number of hash functions for a given element, only one "real" hash(x) has to be computed, plus some additional sums and modulos.

Knuth suggests using an m such that both m and m-2 are primes, like 1021 and 1019.
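A sketch of the trick in Python, using Python's built-in hash() as the single "real" hash (the function name is mine):

```python
def bloom_hashes(x, k, m=1021):
    # m and m-2 (1021 and 1019) are both prime, as Knuth suggests.
    h = hash(x)             # the only "real" hash computation
    f1 = h % m
    f2 = 1 + (h % (m - 2))  # never zero, so successive hashes differ
    return [(f1 + i * f2) % m for i in range(k)]
```

For a Bloom filter, bloom_hashes(x, k) gives k bit positions in an m-bit table; since m is prime and f2 is never a multiple of m, the k positions are all distinct for k < m.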


How a bad password generator can ruin security

This is a true story about a really broken password generator that was used at my college. A good idea and a bad implementation.

On the 30th of December 2005 all students at PJIIT received an email which said:

From today (30.12.2005) new password policy is going to be used:

      Password must contain eight or more characters
      Password must not contain username or any part of it
      Password should contain characters from three of the four specified categories:
      1. Small letters [a-z]
      2. Capital letters [A-Z]
      3. Digits [0-9]
      4. Special characters: [!#$%^&*()_+{}:";'<>,.?]

This policy seems reasonable. Each of the at least eight characters in a password can be one of 84 possible characters ([a-zA-Z0-9!#$%^&*()_+{}:";'<>,.?] = 26+26+10+22 = 84).
The number of possible passwords is huge: 84^8 ≈ 2.5 * 10^15.

But creating a complicated password that fits this policy was too hard for some students. Some of them just added characters to the end of their previous password (the password bart78 became bart78##). Others used the password generator supplied by our college network administrators.

At first glance the generator seems to create reasonable passwords. Let's concentrate on passwords generated at the default "complexity 3".

character family                  [a-zA-Z]   [0-9]   [!"#$%&'()*+,-./]
characters in generated password  4-6        1-2     1-2

The number of characters from each character family is selected randomly from the specified range. Since the password is eight characters long, there are only four possible character patterns: 4-2-2, 5-2-1, 5-1-2 and 6-1-1. I haven't written this password generator myself, but I can bet that every pattern has the same probability of being selected. Let's count the number of possible passwords in each pattern.
letters  digits  special  probability  unique possible passwords (52 letters approximated as 50)
4        2       2        25%          50^4 * 10^2 * 15^2 = 0.140 * 10^12
5        2       1        25%          50^5 * 10^2 * 15^1 = 0.468 * 10^12
5        1       2        25%          50^5 * 10^1 * 15^2 = 0.703 * 10^12
6        1       1        25%          50^6 * 10^1 * 15^1 = 2.344 * 10^12
                                       sum:                 3.656 * 10^12

That's not good. Out of more than 2 * 10^15 possible passwords, the generator can generate only 3.7 * 10^12. That's less than two per mille of the possible passwords!
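The sums above are easy to verify with a few lines of Python (the 52 letters are approximated as 50, as in the table; 84 is the full alphabet counted earlier):

```python
# Count how many passwords each pattern (letters, digits, specials) can
# produce; letters are approximated as 50 instead of 52, as in the table.
patterns = [(4, 2, 2), (5, 2, 1), (5, 1, 2), (6, 1, 1)]
counts = [50**l * 10**d * 15**s for l, d, s in patterns]
total = sum(counts)
print(total)               # 3656250000000, i.e. about 3.66 * 10**12
full_space = 84**8         # every 8-char password over 84 characters
print(total / full_space)  # well under one per cent of the full space
```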

The generator is even worse than that, because of Microsoft's programming framework and its approach to random numbers.
From Microsoft's documentation of the Randomize function:
"[...] This example uses the Randomize statement to initialize the random-number generator. Because the number argument has been omitted, Randomize uses the return value from the Timer function as the new seed value."

And the magic "Timer function" documentation:
"Timer Function returns the number of seconds that have elapsed since 12:00 AM (midnight)"

Seconds from midnight means 24*60*60 = 86400 possible initial seed values for the random number generator. That means only 86400 unique generated passwords.

Take a look at the previous table. Most of the unique passwords are in the fourth pattern, but the probability of selecting that pattern is only 25%.
If we omit it, we still have a 75% probability of guessing the generated password. In our situation there are only about 32,000 unique passwords in the first three patterns.

In conclusion: out of more than 2 * 10^15 possible passwords, the bogus password generator created a situation in which 75% of the passwords fall into a group of only 32 * 10^3 unique passwords.
In this case a brute-force attack has a great chance of success.

UPDATE #1: The Visual Basic random number generator is even worse. Mark Hutchinson writes that in fact there are only 64,000 possible random seed values! (Not the 86,400 I mentioned before.)


What happened to tcp flag URGENT, MSG_OOB and SIGURG?

Nobody uses TCP urgent mode today, so it's a good topic to do some research on.

Usually, when a socket receives a TCP packet with the URG flag set, it treats it as normal TCP data: recv() reads the urgent data as if it were part of the normal TCP stream. The only difference is that the last byte of the urgent data is discarded. That last byte has always been a problem, due to incoherent RFCs.

Pseudocode for this case:

server: send("ab", MSG_OOB)
client: recv() -> "a"
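The pseudocode above can be reproduced with real sockets over the loopback interface (a sketch; with SO_OOBINLINE off, which is the default, the urgent byte is kept out of the normal stream):

```python
import socket

# server: send("ab", MSG_OOB); client: recv() -> "a"
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

conn.send(b"ab", socket.MSG_OOB)       # 'b' becomes the urgent byte

normal = cli.recv(1024)                # the urgent byte is not in here
urgent = cli.recv(1, socket.MSG_OOB)   # it has to be read out-of-band
conn.close(); cli.close(); srv.close()
```

On Linux, normal is b"a" and the urgent byte b"b" is only reachable with MSG_OOB.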
TCP urgent packets should trigger SIGURG for the process that's listening on the socket. This doesn't work until we set the pid of the socket owner (thanks Michał for this tip):

if (fcntl(sd, F_SETOWN, getpid()) < 0) {
    perror("fcntl(F_SETOWN)");
    exit(1);
}
The interesting thing is that once SIGURG is enabled, it's possible to send a signal to a remote process without sending any valid data. This happens when the process receives a TCP URG packet with a one-byte payload: the only byte is discarded and no data is waiting on the socket, but SIGURG is still sent to the process.
client: signal(SIGURG, handler);
client: fcntl(sd, F_SETOWN, getpid());
client: poll([sd], 1, 1000);
server: send("a", MSG_OOB);
client: signal SIGURG is received by a handler
client: poll ends, because of received signal
client: but there is no data waiting on the socket
That's all. Feel free to use the draft source code of the client and server I used for testing.


Peek at new nmap-nse scripts

Well, I'll give you a peek at my new scripts for nmap. These scripts aren't public yet; I hope this post will give me the motivation to finish them.

This time we're going to focus on traceroute. The new traceroute function in nmap looks like this:

# ./nmap -n -sS -p80 scanme.insecure.org --traceroute

TRACEROUTE (using port 80/tcp)
1 0.33 (censor)
2 7.93 (censor)
3 ...
4 7.84
5 23.85
6 4.90
7 5.06
8 209.72
9 191.55
10 196.77
11 198.69
12 200.40
13 200.15

My first script gives similar results. It works almost like standard traceroute, but instead of sending SYN packets with a small TTL, it injects packets into an established connection. The idea is stolen from Lcamtuf's 0trace tool.
# ./nmap -n -sS -p80 --script=0trace.nse scanme.insecure.org -P0
Interesting ports on
80/tcp open http
|_ 0trace:

The next script is quite different. It sends a SYN packet with the Record Route IP option (see ping -R). Note that this time the IP addresses are different from the previous scans. That's because routers have several IP addresses, and Record Route records the IP of the outgoing interface (I think). The disadvantage of this method is that it can record only nine hops.
# ./nmap -n -sS -p80 --script=recordroute.nse scanme.insecure.org -P0
Interesting ports on
80/tcp open http
|_ record route:


Is it possible to abuse icmp?

I wonder what happens when a malicious user injects a crafted ICMP packet carrying some error information (like Port Unreachable) into a TCP connection. Will the connection be closed? You may think that to inject something like this you'd need to guess sequence numbers. That's not exactly correct. The ICMP payload can be relatively small: the specification says it has to carry at least the first 8 bytes of the layer-4 header, so only the TCP source and destination ports must be included.

I can assume that you're using Google. To block your connection I only have to send up to 65K packets with this payload: from Google's address to yours, from port 80, sequentially trying each possible local port. I should guess it after a while. Actually, that's even fewer than 65K possibilities: on standard Linux there are fewer than 32K possible local ports (see the ip_local_port_range proc parameter).

The question is how common TCP stacks interpret such a crafted ICMP message.

There are also other potentially vulnerable applications. For example, most DNS servers use source port 53 for recursive queries (see bind's 'query-source address' option). How do DNS servers react to ICMP errors with a relatively small payload? Is it possible to perform a DoS attack using this method?

Well, of course a port number in the TCP header uses only 16 bits. So the 8 bytes of ICMP payload can fit the source port, the destination port and the sequence number.
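A quick sanity check that those three fields exactly fill the 8 quoted bytes (the port and sequence values are made up):

```python
import struct

# The first 8 bytes of a TCP header: source port (2 bytes),
# destination port (2 bytes), sequence number (4 bytes).
first8 = struct.pack("!HHI", 80, 51324, 0x1A2B3C4D)
print(len(first8))   # 8 bytes: exactly what an ICMP error must quote
```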

Let me change my question: how do common TCP stacks interpret ICMP packets with a too-short payload?

Linux seems to be immune to this kind of attack. The function that parses ICMP errors for TCP connections is tcp_v4_err() in linux/net/ipv4/tcp_ipv4.c. The code requires the payload to be at least 8 bytes long; the source and destination ports in the payload must be correct and the sequence number must be in the correct range. Otherwise the ICMP packet is simply dropped.


Some random Nmap ideas

There have been some interesting projects using nmap, for example Doug Hoyte's Qscan idea. I hope his work won't be forgotten.

But there are also some interesting projects on my desk.

I think there is a possibility to implement Lcamtuf's 0trace as an NSE script. This could be an add-on to my p0f script. With 0trace we're of course interested only in the last hops, because standard traceroute was already implemented in nmap by Eddie.

Adding bind(2) to the nsock library could also be very interesting. The effect won't be tremendous, but implementing it seems really challenging.

The biggest project I would like to finish writing is the General Scanning Engine. Unfortunately, the project needs to be completely rewritten. I would like to see GSE plugins written in the Nmap Scripting Engine. I must think about whether there's a simpler design than the one I proposed during Google's Summer of Code '06.


Pcap support for Nmap Script Engine

Some time ago I wrote the nse-pcap patch for Nmap, which adds some libpcap features to Diman's NSE. Today I ported the changes to the current Nmap. It's definitely time to check whether the code is worth the time I spent on it!

I think pcap support is one of the most promising features in NSE. It gives a "new experience" to advanced programmers. It also introduces an easy-to-use and powerful way of distributing received packets to registered Lua NSE threads. I think it's a very nice way of programming pcap applications: it's based on callbacks from pcap, rather than on iterating over a sequence of pcap results.

A good example of nse-pcap's power is my implementation of Lcamtuf's p0f SYN+ACK scan.

Try out the installation of nse-pcap:

$ svn co --username=guest --password= svn://svn.insecure.org/nmap-exp/soc07/nmap nmap
$ cd nmap
$ wget http://ai.pjwstk.edu.pl/~majek/private/nmap/nse-pcap/nmap-soc07-5184-ncap-try2B2-with-whitespace.diff
$ cat nmap-soc07-5184-ncap-try2B2-with-whitespace.diff|patch -p1
$ ./configure --without-nmapfe && make
And play with the power of p0f.nse script:
$ sudo ./nmap -n -sT -PS80 -p21,22,53,80,443 --script=p0f.nse www.cisco.com

Starting Nmap 4.22SOC1 ( http://insecure.org ) at 2007-07-12 00:04 CEST
Interesting ports on
21/tcp open ftp
|_ p0f signature: UNKNOWN [65500:61:1:64:M1436,N,N,S,N,W0,N,N,T:AT:?:?] (link: IPSec/GRE, up: 1446 hrs, ipid:54702)
22/tcp filtered ssh
53/tcp filtered domain
80/tcp open http
|_ p0f signature: UNKNOWN [8192:238:0:44:M1460:A:?:?] (link: ethernet/modem, up: disabled, fill:4008, ipid:1)
443/tcp open https
|_ p0f signature: UNKNOWN [8192:238:0:44:M1460:A:?:?] (link: ethernet/modem, up: disabled, fill:f8ef, ipid:10)
From this information I can see at least source NAT. And the next try:
$ sudo ./nmap -n -sT -p21,22,53,80,443 --script=p0f.nse www.orkut.com

Starting Nmap 4.22SOC1 ( http://insecure.org ) at 2007-07-12 00:04 CEST
Interesting ports on
21/tcp open ftp
|_ p0f signature: UNKNOWN [65500:61:1:64:M1436,N,N,S,N,W0,N,N,T:AT:?:?] (link: IPSec/GRE, up: 7282 hrs, ipid:30587)
22/tcp filtered ssh
53/tcp filtered domain
80/tcp open http
|_ p0f signature: UNKNOWN [8190:235:0:44:M1400:A:?:?] (link: sometimes DSL (2), up: disabled, fill:0f2d, ipid:28096)
443/tcp closed https

I'm waiting for feedback or bug reports!

Okay, I'm so enthusiastic about this project because I wrote it. But there are also some other promising features in nmap, like Swen's web application detection.

Some useful links:


Filesystems in 21st century

At university I was taught that there are three levels of theoretical filesystems:

  1. The file name is tightly bound to the volume where the file is stored (e.g. on Microsoft Windows the path contains the drive name, like c:\windows).
  2. The file name doesn't say anything about the location, but a specific directory can be mounted as a disk (as in unixes; you don't know on which disk a file lies, but you know that every file in some directory is on one disk).
  3. You can't tell the physical location from the directory. For example, you could have every file in a given directory on a different disk. It would be quite nice to store small files on fast SCSI storage and big ones on a cheap ATA drive, wouldn't it? (As far as I know no vendor implements this.)
But I wasn't told anything about distributed filesystems. Well, there are some levels of development there too.

Let's imagine:
You have everything on one disk. You need more space. You add disks and join them in some kind of RAID or LVM. But what if you can't add any more disks? You have to buy a new computer and somehow split the files between both computers.

The first level is:
  1. You can tell on which machine a file lies just by looking at its name.
    • The simplest method is to compute hash(filename) % number_of_computers. The problem is that if you want to add a new computer, you have to recompute the hashes for _all_ the files. In that case some files must be moved to the new node.
    • The second method is to compute a hash of the filename, but this time take the remainder modulo some arbitrarily chosen number, like 100 (i.e. hash(filename) % 100). Each host serves some ranges of hashes. For example, the first host serves all files with hashes from 0 to 50, the second from 51 to 99. If you need to add another computer to the pool, you only need to split the hash ranges. To do this you recompute the hashes for all files, but on only one node.
  2. At some point recomputing hashes is just too expensive. Nowadays copying disks can take tens of hours. A simpler way to maintain the namespace is to use a database which contains all the metadata.

    The database knows on which node a given file lies. This adds another level of complexity, because to access a file you first have to ask the database. On the other hand, once a file is written it is never moved anywhere.
    Adding new disks is costless.

  3. The next problem occurs when the single metadata database grows too big. Consider having 60 million files. Every file's metadata occupies a row in the database. Roughly counting: 60 million rows * say 64 bytes per row = 4GB of metadata. It would be nice to avoid disk seeks on the database and keep the data in RAM. And what happens if such a huge database goes down?
    Well, I'm not one of those who believe in database replication.

    So what's the solution? Spread the metadata across the nodes.

    Google's solution is to add another abstraction layer. They use a single namespace database in their filesystem, but with the hard requirement that the amount of metadata is relatively small. With such a requirement it's possible to provide a really fast failover in case the master nameserver fails.

    A huge number of files can still be served, but from the next abstraction layer. They split the data and metadata into relatively small contiguous sets, distributed across the nodes. The master knows which range of keys is served by which server.

    Let's assume we want to access the file 'abc.txt'. First we must ask the master server on which server this file lies. The master stores the key ranges for every 'tablet' server.

    Next, we connect to the 'tablet' server and ask for the file. To resolve the file's position, the 'tablet' server uses an in-memory B-tree.

    With this design we have to ask the master server, then a tablet server. In the background there may be some requests to the underlying distributed filesystem, but we have a guarantee that all those requests are answered directly from memory. The only disk seek happens on the tablet server that serves the requested file.
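The two-step lookup can be sketched like this (all the names, ranges and locations are made up; a dict stands in for the tablet server's in-memory B-tree):

```python
import bisect

# Toy sketch of the two-step lookup: the master maps key ranges to
# tablet servers; each tablet server resolves files from RAM.
class TabletServer:
    def __init__(self, files):
        self.index = dict(files)       # stands in for the in-memory B-tree

    def lookup(self, filename):
        return self.index[filename]

class Master:
    def __init__(self, boundaries, servers):
        self.boundaries = boundaries   # sorted upper bounds of key ranges
        self.servers = servers         # len(servers) == len(boundaries) + 1

    def tablet_for(self, key):
        return self.servers[bisect.bisect_left(self.boundaries, key)]

master = Master(["m"], [TabletServer({"abc.txt": "node-1:/data/7"}),
                        TabletServer({"zoo.txt": "node-2:/data/3"})])
server = master.tablet_for("abc.txt")   # step 1: ask the master (in RAM)
location = server.lookup("abc.txt")     # step 2: ask the tablet server
```

Only after these two in-memory steps does anyone touch the disk that actually holds the file.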

I must admit I like Google's solution. The more I think about it, and the more I read about it, the more I'm convinced that's the way it should be.