Faster CouchDB Document Commits

Right now CouchDB maintains a database "header", a 4k region at the front of the file that contains the root of the valid database structures. When the header changes, it is the only thing that finally commits the changes permanently to disk, until that happens any changes to the database will simply be invisible on restart. The header is written twice in 2 identical 2k regions, and each is signed. Upon restart, if the first header is bad, because of power failure during write for example, the second header will still be intact and valid. If the second is also bad, then the database simply cannot be opened.

This is a simple and very reliable design. The problem with this arrangement is that to commit a document, the disk head has to seek at least once or more for any document commit (depending of the underlying filesystem, some journaling filesystems could do it with 0 seeks). Disk head seeks are slow, but once the head is in position a read or write is fast. CouchDB with it's write only storage generally clusters the updates very close to the end of the file, so less seeking to commit the document and associated indexes. But then it always has to seek all the way to beginning of the file write out the header for the final commit.

So unless you batch the document saves, each document save is going to involved multiple seeks, limiting absolute database throughput significantly. (That just made me realize that the storage engine could also batch document commits, committing on a small fixed interval, speeding up concurrent throughout at the cost of an additional fixed latency)

Anyway, if instead of writing the header to the front of the file, how about writing it to the end of the file? That solves the seek issue necessary to commit the header. The only problem now is how do I find the header on the next time we open the database?

When the shutdown is "clean", its easy, the header is always the very last thing in the file. But what if we wrote a bunch of stuff past the last valid header, and then crashed before we could finish?

That's the thing I'd been kicking around in my head for months, I think I finally have a design I like.

I'm thinking that every, lets say 4k, in the file, there is space reserved for an offset pointer that points to the last known good header in the database file. And the database header it points to will have a md5 signature to validate integrity.

Committing to disk is still a 2 flush process. Flush 1 is to flush the data, meta-data, associated indexes, etc to disk. Then flush 2 is to append the database header, the structure that points to the all the correct/updated structures. But since you are writing data adjacent to the last flushed data, the head seeks are greatly reduced.

But what happens upon opening the database? First thing is to look at the end of the database file for the database header. If there is no valid md5 signed header in the last 4k, then we start doing a divide and conquer search, checking the header offset pointers (which are every 4k) to find the most recent valid pointer.

We check the middle most offset pointer and if it points to a valid header, then we pick a offset pointer halfway between it and end of the file and repeat. If instead it points to a invalid header, then we pick an offset pointer between the current location and the beginning of the file. Recurse the algorithm, tracking upper and lower bounds until you've found the most recent database header.

This algorithm will have O log(N) cost, where n is the number of 4k segments in the file.

Then once you've found the most recent valid header, truncate the file past header and then the database is back to normal, ready to be read and written. Cool.

Anyway, I'm not sure when I'll implement it. For one thing it's possible solid state memory with constant random access times become common, so it won't matter so much then. The current design will be fine and it currently has a constant worst case startup cost. This new design has a worst case startup cost of O log(n).

Also, while I think this new design is reliable, I'm not as sure about it as I am the current design. If anyone sees holes in it, failure cases I'm missing, let me know.

Posted February 1, 2008 8:05 PM

Comments

Dan Sickles, February 2, 2008 12:58 AM

The problem with the current design is, if the hd has a problem usually both headers are killed (hds have a quite big internal cluster size these days).
Thats why _no_ fs/db/... writes the header and the backup into nearly the same place.

gebi, February 2, 2008 4:28 AM

gebi, can you provide some citation or supporting evidence for your statements? I'm not saying you're wrong, but I've talked to a lot of different people about it, multiple database people and a hard drive performance and failure specialist, they all said it should work for all the most common filesystems. It could be they misunderstood my design or were just plain wrong themselves, so I'd really like to know if it's an issue.

Damien Katz, February 2, 2008 1:38 PM

Why have the header in the same file? If the header was distinct from the data then it could reside on a different physical device, or different file system. Perhaps a file system optimised for sequential writing or one that is not journaled, or perhaps a small solid state device for the headers would work well.

Alan Bell, February 3, 2008 3:42 PM

Alan, good idea. Putting things in different files would go along way to optimizing document updates when you can put the files on different drives/spindles. It can also slow down the case where you only have one drive, which I believe is the general case. To support both it would make things a lot more complicated internally.

But to optimize CouchDB for the fastest possible commits when you have multiple spindles, you'd definitely put the document storage on it's own (treating it as a journal), and the core indexes on to another and possibly a header on to a third. This will likely happen at some point.

And if you don't mind occasionally going through a fix-up or re-indexing phase on restart, then you only have to worry about committing the core document storage and can commit the indexes asynchronously, again giving nice performance boost.

But all these things significantly complicate the code. And there are many many more ways to optimize the hell out of all this, but most of them significantly complicate the code and failure scenarios. So I am on the lookout for simple optimizations that give good bang for the buck in the general case.

Right now read performance is more important than update performance and simple, reliable designs are more important than optimized designs. So accordingly CouchDB's storage design is optimized most for reading and view indexing performance. Doc commit performance is a secondary concern, except when simple optimizations can be made. The multiple file thing is probably the next major storage optimization that will happen, but probably post 1.0.

Damien Katz, February 3, 2008 5:08 PM

Look at how ZODB implements the journalling. Sounds similar to what you are proposing. They never modify the datastore itself. Only append to it.

Tracy Reed, February 6, 2008 2:03 AM

Post a comment




Remember Me?

(you may use HTML tags for style)