2007-08-28

Well, we need Google architecture, but how should we implement such a huge project?

Every fast-growing company will have to answer this issues sooner or later:

  • storage disks are full
  • it's hard to add more disks to existing systems
  • files are served too slow because disks seeks are becoming limitation
  • simple database is growing to millions rows, cost and complexity of partitioning is reaching the level of cost-effectiveness

It seems that Google architecture is solving this issues. By their architecture I mean their applications, like Google File System, Chubby lock service and the most important Big Table distributed database (not really database, just distributed multidimensional map).

It's:
  • scaling linearly (add new machine to cluster, it's just working as it should, no specific configuration or manual partitioning of existing data is needed)
  • auto partitioning (when one machine is overloaded data are split and moved to other one)
  • low cost of administration (you care only about hardware failures)
  • designed to be high-available (it's not exactly HA. Google designed everything not to reduce single point of failures, they rather focused on fast restoring failed applications. In case of failure, application will be down for few seconds most.)

Well, Google architecture have also some limitations, for example sometimes you can't afford this few seconds of system downtime. But for most users Google architecture is resolving many important problems.

Are you're going to implement Big Table?

The main problem with this architecture is that it's build from projects that are tightly linked. You can't build reliable clone of Big Table without Google file system and Chubby lock manager.

The bigger the project is, the greater chance of failure. Nobody wants to do highly risk investment. We can hear "well, it's easier to hack current architecture than to build new from scratch". It's obviously true to some point. After that you need the final solution for problems right now. And it's getting harder to move a dozen or so terabytes data from old storage system to new.

I think there is a possibility of building something like Google architecture in small steps. But it's going to have some limitations in first stages.

First step: "Administrators nightmare"
Assume that we need only auto partitioning. Every other features at this point we ignore. We're going to create basic Big Table. By that I mean standalone application, that's going to answer simple mapping from key to value. The internals must be created as simplified version of Tablet server. The only feature we truly need is autopartitioning. When Tablet size is reaching some threshold, it splits to two instances, with reduced row ranges for each instance.

What we're going to get:
  • administrators must move instances manually from machine to another
  • manually configured (hardcoded) row-ranges at clients
  • no distributed features
  • reduced disk seeks compared to filesystem based key-value mapping

Second step: Multi-dimensional mapping
The idea of column families, timestamps and not limited columns is just great.

Third step: Chubby lock manager
We need to create some automated way of informing the clients about Tablet changes. Clients shouldn't have any hardcoded configuration. At this step we should create some kind of METATABLE for handling Tablets row ranges and Chubby as the main source of basic metadata. We can assume that we have only one instance of Chubby server (no replication). But we definitely need Chubby to serve files, locking and asynchronous events.

Fourth step: The File System
We need file system with shared name space for every node. If Tablet is partitioned, it must be possible to open new Tablet data on any other server.

Fifth step. Competing with Giant :)
At some point Chubby should be replicated, the Chubby master should be chosen using famous Paxos algorithm to be fully HA. Some other features to Big Tablets are also nice, like Bloom filters or compression. Secondary indices to Big Table are very nice (yuppi, Google doesn't have that feature).


Pencils Up! I wish you happy coding.

BTW. There is a Hadoop/HBase project, you can be interested.


1 comment:

niklasro said...

Don't even use one :-)
6400 @ 2.13GHz GNU/Linux
Filesystem Size Used Avail Use% Mounted on
none 1.7G 17M 1.6G 2% /
[guest@localhost ~]$