2007-07-10

Filesystems in the 21st century

At university I was taught that there are three levels of theoretical filesystems:

  1. The file name is tightly bound to the volume where the file is stored (e.g. on Microsoft Windows the path contains the drive name, like c:\windows).
  2. The file name doesn't say anything about physical location, but a given directory can be mounted as a disk (as in unixes: you don't know which disk a file lies on, but you do know that every file in a given directory is on one disk).
  3. You can't tell the physical location from the directory at all. For example, every file in a given directory could be on a different disk. It would be quite nice to store small files on fast SCSI storage and big ones on a cheap ATA drive, wouldn't it? (As far as I know no vendor implements this.)
But I wasn't told anything about distributed filesystems. Well, there are some levels of development there too.

Let's imagine:
You have everything on one disk. You need more space. You add disks and join them with some kind of RAID or LVM. But what if you can't add any more disks? You have to buy a new computer and somehow split the files between the two machines.

The levels of development are:
  1. You can tell which machine a file lies on just by looking at its name (both methods are sketched in code after this list):
    • The simplest method is to compute hash(filename) % number_of_computers. The problem is that if you want to add a new computer, you have to recompute the hashes for _all_ files, in case some of them have to be moved to the new node.
    • The second method is to hash the filename, but take the remainder modulo some arbitrarily chosen number like 100 (i.e. hash(filename) % 100). Each host serves a range of hashes: for example, the first host serves all files with hashes from 0 to 50, the second from 51 to 99. If you need to add another computer to the pool, you split one of the hash ranges, so you only have to recompute the hashes for the files on a single node.
  2. At some point recomputing hashes is just too expensive; nowadays copying a disk can take tens of hours. A simpler way to maintain the namespace is to use a database which contains all the metadata.

    The database knows which node a given file lies on. This adds another level of complexity, because to access a file you first have to ask the database. But on the other hand, once a file is written it never has to be moved anywhere.
    Adding new disks costs next to nothing (this level is also sketched in code after the list).

  3. The next problem occurs when the single metadata database gets too big. Consider having 60 million files, where each file's metadata occupies a row in the database. Roughly counting, 60 million rows * say 64 bytes per row gives about 4GB of metadata. It would be nice to avoid disk seeks on the database and keep that data in RAM. And what happens if such a huge database goes down?
    Well, I'm not one of those who believe in database replication.

    So what's the solution? Spread the metadata across the nodes.

    Google's solution is to add another abstraction layer. They use a single namespace database in their filesystem, but with the hard requirement that the amount of metadata stays relatively small. With such a requirement it's possible to provide a really fast failover in case the master nameserver fails.

    A huge number of files can still be served, but from the next abstraction layer. The data and metadata are split into relatively small contiguous sets and distributed across nodes. The master knows which range of keys is served by which server.

    Let's assume we want to access the file 'abc.txt'. First we must ask the master server which server this file lies on; the master stores the key ranges for every 'tablet' server.

    Next, we connect to the 'tablet' server and ask for the file. To resolve the file's position, the 'tablet' server uses an in-memory B-tree.

    With this design we have to ask the master server first, then a tablet server (this lookup is also sketched in code below). In the background there can be some requests to the underlying distributed filesystem, but we have a guarantee that all of these requests are answered directly from memory. The only disk seek happens at the tablet server that serves the requested file.
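
To make the first level concrete, here's a minimal sketch of both naming methods (Python; the node names, md5 as the hash function and the fixed space of 100 buckets are just assumptions for illustration):

    import hashlib

    def file_hash(filename, buckets):
        # Stable hash of a file name, reduced to a bucket number.
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return int(digest, 16) % buckets

    # Method 1: hash modulo the number of computers.  Adding a node changes
    # the modulus, so the placement of almost every file may change.
    def node_by_modulo(filename, nodes):
        return nodes[file_hash(filename, len(nodes))]

    # Method 2: hash into a fixed space of 100 buckets and give each node a
    # range of buckets.  Adding a node only splits one range, so only the
    # files held by that single node have to be rehashed and possibly moved.
    ranges = {"node-a": range(0, 51), "node-b": range(51, 100)}

    def node_by_range(filename):
        bucket = file_hash(filename, 100)
        for node, bucket_range in ranges.items():
            if bucket in bucket_range:
                return node
        raise LookupError("no node serves bucket %d" % bucket)

    print(node_by_modulo("abc.txt", ["node-a", "node-b"]))
    print(node_by_range("abc.txt"))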
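
The second level, the metadata database, could look more or less like this (a sketch assuming a plain SQLite table; the schema and node names are made up):

    import sqlite3

    # The metadata database decides placement once, at write time; a file
    # never has to move when new nodes are added later.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, node TEXT)")

    def write_file(name, node):
        # The placement policy is up to us, e.g. whichever node has free space.
        conn.execute("INSERT INTO files (name, node) VALUES (?, ?)", (name, node))

    def locate(name):
        # Every read costs one extra query, but no hashes are ever recomputed.
        row = conn.execute("SELECT node FROM files WHERE name = ?", (name,)).fetchone()
        return row[0] if row else None

    write_file("abc.txt", "node-a")
    write_file("big-movie.avi", "node-b")
    print(locate("abc.txt"))  # -> node-a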
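
Finally, a sketch of the two-step master/tablet lookup. A sorted list of range boundaries stands in for the master's map, and a plain dict stands in for the tablet server's in-memory B-tree, so this is only my simplification of the idea, not Google's actual design:

    import bisect

    # The master only knows which tablet server owns which range of keys.
    # Each entry in range_ends is the last key handled by that server
    # (toy data; "~" sorts after ordinary ASCII file names).
    range_ends = ["g", "p", "~"]
    tablet_servers = ["tablet-1", "tablet-2", "tablet-3"]

    def master_lookup(filename):
        # Hop 1: ask the master which tablet server owns this key.
        return tablet_servers[bisect.bisect_left(range_ends, filename)]

    # Each tablet server keeps its file index in memory (the real design
    # uses a B-tree; a dict per server is enough for the sketch).
    tablet_index = {
        "tablet-1": {"abc.txt": ("disk-3", 4096)},   # name -> (disk, offset)
        "tablet-2": {"holidays.jpg": ("disk-1", 0)},
        "tablet-3": {"zz-backup.tar": ("disk-2", 900)},
    }

    def open_file(filename):
        # Two hops, both answered from memory; only reading the actual
        # data touches a disk.
        server = master_lookup(filename)                # hop 1: the master
        disk, offset = tablet_index[server][filename]   # hop 2: the tablet server
        return server, disk, offset

    print(open_file("abc.txt"))  # -> ('tablet-1', 'disk-3', 4096)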

I must admit I like Google's solution. The more I think about it, and the more I read about it, the more I think that's the way it should be done.

