2007-10-16

Finding paths between users in social network graph.


For some time I'm working at Grono.net the biggest Polish social networking website (over 1.3mln registered users). Like in other networking sites, our users have the list of friends which they know. We have interesting feature of finding the shortest path between two users, measured in "handshakes" (by "handshake" I mean connection between two users).

The problem with that "paths" service was that it was not scaling well. Our servers were overloaded and the path searching time was unacceptable. It became clear that it's good time to rewrite our algorithms.

Zero: what's the problem
Social networks are based on the idea of "six degrees of separation". It means, that the distance (we can measure it in number of handshakes) between any two users is quite small. Even though the average number of friends for users is not big, the problem of finding shortest paths between users is very complex from the view of algorithmic complexity.

When the social network graph is small, one can use standard Dijkstra method for finding shortest paths. But when the graph is becoming bigger and bigger, the standard algorithm can take few seconds to count and is inefficient.

First try: the Dijkstra method
Dijkstra algorithm is the simplest one for finding shortest paths in graphs (see Wikipedia article for details). In our case, users can be treated as graph nodes, and connections between users as edges. At this point we know that the distance between users is always relatively small, at most a dozen of handshakes (edges). Once we know this characteristics, we can optimize Dijkstra method to our specific needs. The algorithmic cost of standard Dijkstra implementations is O(V * log(V) + E). In our case we can easily reduce this to O(V + E). It's possible to achieve that by implementing priority list data structure as hashed table. It's possible to do all operations we need to do on this data structure in linear cost (instead of standard priority list when the cost is logarithmic).

Second try: finish counting earlier
Standard Dijkstra algorithm is counting path from specified node, to all other nodes. In our case we don't need to find paths to all possible users, we can stop counting when the first valid path to the end user is found. One can think that would speed up the program significantly, but it doesn't. The gain in practice was small, because of specific properties of our graph. The number of connections (edges) is growing very rapidly in next levels of connections. It seems that in the distance of four handshakes we can reach the most of all users. On the time when we could leave the computation with the first result we'd already counted shortest paths to most of the users. Here is the sample data, that shows the growing number of nodes and edges on next levels of handshakes (the distance from some random node).

Distance   Nodes     Edges
#00 | 1n 59e
#01 | 59n 4917e
#02 | 3654n 463608e
#03 | 159042n 17669289e
#04 | 557105n 25225499e
#05 | 261623n 564312e
#06 | 25225n 28748e
#07 | 1879n 2104e
#08 | 190n 221e
#09 | 29n 33e
#10 | 4n 6e
#11 | 2n 2e
Third try: count path from both sides
The next specific property of our graph is the fact, that when I'm friend of someone, he must be my friend too. In computer terms we could say that the graph is not directed. That means that the shortest path between user A and B is also the shortest path from user B to A. As I mentioned before, the greater depth of nodes, the more of the data to consider. For a moment let's assume that the user A is in the distance of four handshakes to user B. We don't have to count the expensive nodes in distance of four from node A. We can count nodes in the depth of two from node A, and the depth of two from node B. The shortest path between A to B must go through the nodes that are in both sets. Using this method (bidirectional search) we can greatly reduce the complexity of algorithm, because we didn't have to grab the expensive data in distance 3 or more from specific node.

Fourth try
The specified method is much faster than previous ones. But for some nodes the time of counting was incredibly long. The algorithm is counting one step at a time, from both node A and B, and then it's comparing the results in search of middle points. Instead of doing one step from both sides at a time, we should modify the program, so that it should count the results only at one side of the path, the side where the number of nodes is smaller. Thanks to this optimization, we have guarantee that we do the least possible counting of nodes. Unfortunetaly we have to pay for this by doing twice the number of compares of the results than in previous method.

Summary
As the result, we evolved the standard Dijkstra method, from the complexity of
    O(V*log(V)+E)
to the complexity of
    O(k^(n/2))
where K is the average number of edges to the node and N is the average distance between nodes.

Even though it's exponential (which sounds worse than logarithmic), in this specified case of social network graph, the method reduced the search time from seconds to few milliseconds.

PS:
If you found this article interesting, you may also find interesting my graphical visualization of social network connections.


3 comments:

Anonymous said...

Do you have some code to share on this topic?

majek said...

You can take a look at an example.

sanity said...

I strongly recommend you watch this video, which describes the algorithm developed by Freenet to permit routing in a "darknet".

To cut a long story short, they've developed an algorithm that can find short paths (often the shortest, but this isn't guaranteed) between nodes in a network in logarithmic time.