2008-06-06

LinkedIn graph - I don’t get it



At Grono.net, the network graph had more than 80M edges and about 1.2M nodes. You know how much RAM it used on the machines? 320MB. That's a bit more than 4 bytes per edge (because of alignment), plus a few megabytes of index.
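The post doesn't describe Grono's actual layout, but the numbers work out if the edges live in one flat array of 32-bit ids plus a per-node offset index (a CSR-style layout). A minimal sketch of one such layout that hits roughly 4 bytes per edge:

```python
from array import array

class CompactGraph:
    """CSR-style adjacency: all edges in one flat 32-bit array,
    plus a 32-bit offset per node -- about 4 bytes per edge."""

    def __init__(self, adjacency):
        # adjacency: dict {node_id: iterable of neighbor ids}, ids in 0..n-1
        n = (max(adjacency) + 1) if adjacency else 0
        self.offsets = array('I', [0] * (n + 1))
        self.edges = array('I')
        for node in range(n):
            self.edges.extend(sorted(adjacency.get(node, ())))
            self.offsets[node + 1] = len(self.edges)

    def neighbors(self, node):
        # neighbor list of `node` is a contiguous slice of the edge array
        return self.edges[self.offsets[node]:self.offsets[node + 1]]

g = CompactGraph({0: [1, 2], 1: [0], 2: [0]})
print(list(g.neighbors(0)))  # -> [1, 2]
```

With 80M edges that's 320MB for the edge array and under 5MB for the offsets of 1.2M nodes, which matches the figures above.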

LinkedIn says they need 12GB of RAM to keep their graph with 120M edges. Wow. That's 107 bytes per edge! I think the graph shouldn't consume more than 0.5GB of RAM.

Their software has a few more features - their graph is updated in real time. The biggest problem with this feature is that you need a good garbage collector, and there's a possibility of fragmentation issues. But what about updating the data every few minutes instead? Using my solution, that would require two instances of the 0.5GB graph (one serving, the other loading) - 1GB of RAM.

Okay, I use 32-bit indexes and they can use 64-bit ones. But it's still only 2GB of RAM!


I don't get it. What is the other 10GB of RAM used for?


PS. The graph is easily compressible. With a bit of CPU you can compress it heavily (I bet more than 4 times). CPU is cheaper than RAM, isn't it?

PPS.
The LinkedIn Architecture PDF.



16 comments:

Shaneal Manek said...

It sounds like they keep their nodes (all 22 million of them) in RAM too.

That means that they are only using ~500 bytes/node, which is very reasonable (name, user information, etc).

It makes sense to do that, otherwise you'd have to hit a DB n times to display a user's friends (where n is the number of friends that user has) to get the friend's name and so on.

majek said...

> That means that they are only using
> ~500 bytes/node, which is very
> reasonable

I'd keep this information in some other place (Lucene/Sphinxsearch/Memcached/Whatever), but this explanation seems reasonable.

Anonymous said...

It's a matter of efficiency. If you merely want to keep a table of the edges in memory, well, all you need is two ids, which can be 32-bit numbers -- 8 bytes per edge.

That's great and all, but that's to store an unindexed graph with no secondary information. What are you going to do with it? How are you going to query/modify it?

LinkedIn's queries include identifying all neighbors within 3 degrees, excluding self-references. They also always associate a name with a node if the distance is 1. Rather than resort to going to the DB for all of that, it would seem to make sense to represent it in the in-core model -- if you've got the space for it, why not?

An optimal storage and data model is a matter of determining how it's to be used. Optimizing for space clearly has a lower priority than optimizing for performance or functionality -- as well it should. 12GB is really nothing special in contemporary terms. At my job we have a number of applications that use 2x that, and that's for analytical work for individual biologists, not a popular Internet service.

drjnet said...

I am trying to write some code to show possible routes from one member to another on a social networking site I've developed. Does anyone know any resources where I can learn about building a node graph to enable this, or the best way to do it?

I like the way Xing works - anyone know how they do it?

cheers

PouleJapon said...

Maybe they also keep some kind of info about which people you can get in touch with through your friends...

When you go to the page of somebody you don't know, they can tell you how many branches separate you from this person, up to the third degree. That's a great computational cost if you do that from a regular graph, isn't it?

majek said...

> When you go to the page of somebody
> you don't know, they can tell you
> how many branches separate you
> from this person, up to the third
> degree. That's a great computational
> cost if you do that from a regular
> graph, isn't it?

Actually, if you're using bidirectional search, it's possible to count this rather fast:

http://popcnt.org/2007/10/finding-paths-between-users-in-social.html
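The linked post has the details; roughly, you grow BFS frontiers from both endpoints until they touch, which searches a far smaller volume than a one-sided BFS. A toy sketch (not the actual Grono code; `neighbors` is any adjacency lookup you supply):

```python
def bidirectional_distance(neighbors, src, dst, max_depth=6):
    """Distance between src and dst, or None if it exceeds max_depth.
    `neighbors` is a callable returning the adjacent node ids."""
    if src == dst:
        return 0
    front_a, front_b = {src}, {dst}
    seen_a, seen_b = {src}, {dst}
    dist_a = dist_b = 0
    while front_a and front_b and dist_a + dist_b < max_depth:
        # always expand the smaller frontier: that keeps the search small
        if len(front_a) <= len(front_b):
            front, seen, other_seen = front_a, seen_a, seen_b
        else:
            front, seen, other_seen = front_b, seen_b, seen_a
        nxt = set()
        for node in front:
            for n in neighbors(node):
                if n in other_seen:          # frontiers met
                    return dist_a + dist_b + 1
                if n not in seen:
                    seen.add(n)
                    nxt.add(n)
        if front is front_a:
            front_a, dist_a = nxt, dist_a + 1
        else:
            front_b, dist_b = nxt, dist_b + 1
    return None

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(bidirectional_distance(adj.__getitem__, 0, 4))  # -> 4
```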

PouleJapon said...

Thank you for the fast answer.

I suppose they don't have those 12GB on the same server.
Maybe they wanted a smart way to query the two servers only once, and then give back the result.

(That's a wild guess, though)

John said...

Duh. You forgot about the data.

Anonymous said...

If you check out the slides, they talk about a pre-computed, transient "network cache" for each member. That may be eating up some RAM as well...

majek said...

> If you check out the slides, they
> talk about a pre-computed, transient
> "network cache" for each member. That
> may be eating up some RAM as well...

>>> 10*(2**30)/(2*(2**20))
5120
>>> _ * 40
204800

5k users online per server seems okay.
5k * 40 servers = 204k users online does not seem like a reasonable number.

But I don't get why counting "the user's view of the network" is "computationally expensive". Counting it should take milliseconds; transmitting the data over the network could take more time.
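For the record, the "view of the network" up to the third degree is just a breadth-first expansion cut off at depth 3. A toy sketch (the adjacency dict is made up):

```python
def neighborhood(adj, start, max_depth=3):
    """All nodes within max_depth hops of start, excluding start itself."""
    seen = {start}
    frontier = {start}
    result = set()
    for _ in range(max_depth):
        # one BFS level: every unseen neighbor of the current frontier
        frontier = {n for node in frontier for n in adj[node]} - seen
        seen |= frontier
        result |= frontier
    return result

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(neighborhood(adj, 0)))  # -> [1, 2, 3]
```

On a compact in-memory graph this touches each edge at most once per level, which is why it should be a matter of milliseconds for a typical user.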

And I don't get why there's a need to implement a special cache for this kind of data. As far as I understand, memcached (or whatever) should be perfectly fine.

I'd understand: "we use 400GB of memory for cache". But I don't get: "we use 400GB of memory to cache the graph".

PouleJapon said...

That's not exactly my field, but how do you implement bidirectional search on a network?

My understanding is that at least you will need to store all branches on the servers of both extremities, is that correct? That's something you didn't have to do at Grono.

Then, from what I understand, with a traditional data structure you may need to query all your servers, possibly more than once, and "in rounds".

So if you need to dispatch your data between N servers, you haven't really scaled, in the sense that the more servers you add, the more you suffer from communication between servers.

One solution would be to dispatch your nodes to your servers in a smarter way. (Something equivalent to transforming the graph matrix into a more block-like matrix.)

My bet was that their solution consists of having some redundancy.

For instance, embedding with each branch a Bloom filter of your friend's neighbors.
That's sufficient to detect a relation up to the 4th degree while querying only two servers once... You wouldn't be able to tell the name of the person in the middle.
I think you can scale nicely with that.

Maybe they found a smart data structure, and that's what they called "cache".

I hope a guy from LinkedIn will come and enlighten us.


Great blog!

majek said...

> My understanding is that at least
> you will need to store all
> branchs on the servers of both
> extremities, is that correct?

Well, you have to store all the graph data on every server (in my opinion it's something like 0.5GB, without compression).

From this data you could easily generate my friend list at level 1, 2, or 3 (the main problem is that in the pessimistic case it's 22M users :P).

Using bidirectional search on the graph, you can quickly find the shortest path between two users.

> For instance having embedded with
> each branch a bloom filter of the
> neighbors of your friend.

I think there's no reason to partition the graph.


> You wouldn't be able to tell the
> name of the person in the middle.
> I think you can scale nicely with
> that.

Yes, you would get only the id of that person. But I think that should be enough.

PouleJapon said...

> Well, you have to store all the graph data on every server (in my opinion it's something like 0.5GB, without compression).

Oh, OK! My mistake, I didn't get that part!

Anonymous said...

>"12GB is really nothing special in contemporary terms. At my jobs we have a number of applications that use 2x that"

I think you are right. Nowadays 12GB isn't that much memory at all. I've seen 500 euro entry level PCs for n00bs being equipped with 4GB of RAM these days.

I do notice, though, that among some programmers it is actually a kind of trend not to take advantage of advances in modern hardware. Programmers who have always learned that their app should run on a single-CPU machine, inside at most 1MB, where even a single function call is relatively expensive, are not going to program for multiple cores and are not going to take advantage of larger memories. Since in their minds the function call is still expensive, they don't really like to use 'modern' software techniques like OOP (which by now really isn't modern anymore, but to them it still is), etc.

Of course there is some virtue in being conservative with your resources. On the other hand it's ridiculous to jump through a lot of hoops to save a few bytes when putting many gigabytes of memory in your machine only costs a few bucks.

psyho said...

12 GB is not only for edges, but also vertices (data). It seems like a pretty reasonable amount to me.

Inv said...

> Well, you have to store all the
> graph data on every server
> (in my opinion it's something like
> 0.5GB, without compression).

As I understand it, you would keep the exact same copy of the graph replicated on n servers to distribute the load for shortest path queries, right?

How do you make sure that these replicas are always consistent? Would you reload all the graphs from the database once in a while?