Faster CouchDB Document Commits

Right now CouchDB maintains a database "header", a 4k region at the front of the file that contains the root of the valid database structures. When the header changes, it is the only thing that finally commits the changes permanently to disk; until that happens, any changes to the database are simply invisible on restart. The header is written twice, as two identical 2k regions, and each is signed. Upon restart, if the first header is bad (because of a power failure during the write, for example), the second header will still be intact and valid. If the second is also bad, the database simply cannot be opened.
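To make the fallback concrete, here is a minimal sketch (in Python rather than CouchDB's Erlang) of how a reader could validate the double-written header. The 4k region, the two 2k copies, and the signing come from the description above; the choice of MD5 as the signature and the exact layout of each copy (a 16-byte digest followed by the payload) are assumptions for illustration, not CouchDB's actual format.

```python
import hashlib

HEADER_SIZE = 4096   # total header region at the front of the file
COPY_SIZE = 2048     # two identical, individually signed copies
SIG_LEN = 16         # assumed: 16-byte MD5 digest stored at the start of each copy

def read_header(f):
    """Return the header payload from the first intact copy, or None."""
    f.seek(0)
    region = f.read(HEADER_SIZE)
    for offset in (0, COPY_SIZE):
        copy = region[offset:offset + COPY_SIZE]
        sig, payload = copy[:SIG_LEN], copy[SIG_LEN:]
        if hashlib.md5(payload).digest() == sig:
            return payload       # this copy survived any torn write
    return None                  # both copies bad: the database can't be opened
```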

This is a simple and very reliable design. The problem with this arrangement is that the disk head has to seek at least once for every document commit (depending on the underlying filesystem; some journaling filesystems could do it with zero seeks). Disk head seeks are slow, but once the head is in position a read or write is fast. CouchDB, with its append-only storage, generally clusters updates very close to the end of the file, so there is little seeking to commit the document and its associated indexes. But then it always has to seek all the way back to the beginning of the file to write out the header for the final commit.

So unless you batch the document saves, each save is going to involve multiple seeks, limiting absolute database throughput significantly. (That just made me realize that the storage engine could also batch document commits, committing on a small fixed interval, speeding up concurrent throughput at the cost of an additional fixed latency.)
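As a rough sketch of that parenthetical idea, and not anything CouchDB actually does, a group-commit loop could collect saves for a fixed window and commit them together. The `pending` queue, `storage.commit`, and `save.done` are all hypothetical names:

```python
import queue, time

COMMIT_INTERVAL = 0.010   # hypothetical 10 ms batching window

def committer(pending, storage):
    """Collect saves for a short fixed interval, then commit them with one header write."""
    while True:
        batch = [pending.get()]                    # block until the first save arrives
        deadline = time.monotonic() + COMMIT_INTERVAL
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(pending.get(timeout=remaining))
            except queue.Empty:
                break
        storage.commit(batch)                      # hypothetical: one commit covers the batch
        for save in batch:
            save.done.set()                        # hypothetical: unblock each waiting caller
```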

Anyway, instead of writing the header to the front of the file, how about writing it to the end of the file? That eliminates the seek needed to commit the header. The only problem now is: how do I find the header the next time we open the database?

When the shutdown is "clean", it's easy: the header is always the very last thing in the file. But what if we wrote a bunch of stuff past the last valid header, and then crashed before we could finish?

That's the thing I'd been kicking around in my head for months, and I think I finally have a design I like.

I'm thinking that every, let's say, 4k in the file, there is space reserved for an offset pointer that points to the last known good header in the database file. And the database header it points to will have an MD5 signature to validate its integrity.

Committing to disk is still a two-flush process. Flush 1 writes the data, metadata, associated indexes, etc. to disk. Flush 2 then appends the database header, the structure that points to all the correct/updated structures. But since you are writing adjacent to the last flushed data, head seeks are greatly reduced.
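Here's a sketch of how the per-4k pointer slots and the two-flush commit could fit together. The 4k spacing, the MD5-signed header, and the two flushes are from the description above; the 8-byte pointer encoding and the record framing are my own assumptions, and the sketch ignores the case where a header record straddles a block boundary.

```python
import hashlib, os, struct

BLOCK = 4096    # a pointer slot is reserved at every BLOCK boundary
PTR_LEN = 8     # assumed: slot holds the last good header's offset, 8-byte big-endian

def append_with_pointers(f, payload, last_header_pos):
    """Append payload, filling the pointer slot at each 4k boundary we cross.
    Returns the offset where the payload itself begins."""
    pos = f.seek(0, os.SEEK_END)
    start = None
    while payload:
        if pos % BLOCK == 0:                       # boundary: write the pointer slot first
            f.write(struct.pack(">Q", last_header_pos))
            pos += PTR_LEN
        if start is None:
            start = pos
        room = BLOCK - (pos % BLOCK)               # bytes that fit before the next boundary
        chunk, payload = payload[:room], payload[room:]
        f.write(chunk)
        pos += len(chunk)
    return start

def commit(f, data, header, last_header_pos):
    """Two-flush commit: data first, then the MD5-signed header that points to it."""
    append_with_pointers(f, data, last_header_pos)
    f.flush(); os.fsync(f.fileno())                # flush 1: docs, metadata, indexes

    record = hashlib.md5(header).digest() + header
    header_pos = append_with_pointers(f, record, last_header_pos)
    f.flush(); os.fsync(f.fileno())                # flush 2: the header itself
    return header_pos                              # the "last known good header" from now on
```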

But what happens upon opening the database? The first thing is to look at the end of the database file for the database header. If there is no valid MD5-signed header in the last 4k, then we start a divide-and-conquer search, checking the header offset pointers (which occur every 4k) to find the most recent valid pointer.

We check the middlemost offset pointer; if it points to a valid header, we pick an offset pointer halfway between it and the end of the file and repeat. If instead it points to an invalid header, we pick an offset pointer halfway between the current location and the beginning of the file. Recurse, tracking upper and lower bounds, until you've found the most recent database header.

This algorithm has O(log N) cost, where N is the number of 4k segments in the file.

Then, once you've found the most recent valid header, truncate the file just past the header, and the database is back to normal, ready to be read and written. Cool.
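A sketch of that recovery path, continuing the assumptions from the commit sketch above (4k pointer slots, 8-byte pointers, MD5-signed headers of an assumed fixed size): it performs the divide-and-conquer search over the pointer slots and then truncates just past the header it finds. Checking the last 4k of the file first, per the normal open path, is omitted.

```python
import hashlib, os, struct

BLOCK = 4096       # pointer slot at every 4k boundary, as in the commit sketch
PTR_LEN = 8        # assumed 8-byte big-endian header offset
SIG_LEN = 16       # MD5 digest length
HEADER_LEN = 2032  # assumed fixed header payload size, for illustration only

def header_pos_at(f, block_index):
    """Read the pointer slot at the start of the given 4k block."""
    f.seek(block_index * BLOCK)
    slot = f.read(PTR_LEN)
    return struct.unpack(">Q", slot)[0] if len(slot) == PTR_LEN else None

def valid_header_at(f, pos):
    """True if an MD5-signed header record starts at pos."""
    f.seek(pos)
    record = f.read(SIG_LEN + HEADER_LEN)
    return (len(record) == SIG_LEN + HEADER_LEN and
            hashlib.md5(record[SIG_LEN:]).digest() == record[:SIG_LEN])

def find_last_header(f, file_size):
    """Binary-search the per-block pointer slots for the most recent valid header."""
    lo, hi = 0, file_size // BLOCK - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        pos = header_pos_at(f, mid)
        if pos is not None and valid_header_at(f, pos):
            best, lo = pos, mid + 1    # good pointer: look later in the file
        else:
            hi = mid - 1               # garbage: look earlier
    return best

def recover(path):
    """Open after a crash: find the newest good header and truncate past it."""
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        pos = find_last_header(f, size)
        if pos is None:
            raise IOError("no valid header found")
        f.truncate(pos + SIG_LEN + HEADER_LEN)   # drop everything written after it
        return pos
```

The search relies on the pointer slots being written strictly in file order, so every slot up to the crash point resolves to a valid header and everything after it does not; that monotonic boundary is what the binary search is finding.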

Anyway, I'm not sure when I'll implement it. For one thing, it's possible that solid-state memory with constant random access times will become common, and then it won't matter so much. The current design will be fine, and it has a constant worst-case startup cost. This new design has a worst-case startup cost of O(log N).

Also, while I think this new design is reliable, I'm not as sure about it as I am about the current design. If anyone sees holes in it, or failure cases I'm missing, let me know.

Posted February 1, 2008 8:05 PM