CouchDb Architecture

I'm going to describe some of the core CouchDb architecture.

In the fewest words possible, the purpose of CouchDb is to be an indexable, schema-less database. To briefly explain the properties of a Couch database:

1. CouchDb will store primarily "objects", but these aren't objects with behavior, only structure and data. The object is split into two parts, summary and payload data. The summary data will be the information that is used in generating tables, and will consist of simple name/value pairs and metadata about the object. The payload data will be the bulky data, like file attachments and long html bodies.
2. CouchDbs will have tables and indexes. These tables are built using the summary data from the object. The tables will be built using custom logic specified by the user/developer, and each row in the table will correspond to a single object (see the sketch after this list).
3. CouchDbs will be replicatable.
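
To make the summary/payload split concrete, here's a minimal sketch in Python. The field names and the table function are my own illustrations of the idea, not CouchDb's actual format or API:

    # An object is pure structure and data: a small summary part (name/value
    # pairs plus metadata, used for building tables) and a bulky payload part
    # (attachments, long html bodies). All names here are hypothetical.
    doc = {
        "summary": {
            "id": "order-1001",
            "customer": "Acme Corp",
            "total": 149.95,
        },
        "payload": {
            "invoice_html": "<html>...long body...</html>",
            "attachments": {"scan.pdf": b"%PDF-1.4 ..."},
        },
    }

    # Custom table logic specified by the user/developer: each object
    # contributes one row, computed from its summary data only.
    def orders_by_customer(summary):
        return (summary["customer"], summary["total"])

    print(orders_by_customer(doc["summary"]))  # ('Acme Corp', 149.95)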

If you know the Lotus Notes NSF data model, all of this will sound very familiar. It's very similar to the key underpinnings of Notes and Domino (and still its biggest differentiator from other development products).

From the outside, CouchDb will look and act very similarly to NSF, but the internal architecture will be somewhat different. CouchDb uses a combination of append-only storage and atomic updates. Notes uses atomic updates (it also has transaction logging capability).

Append-only databases (also known as zero-overwrite databases) are databases where the persistent storage is never overwritten. All database actions (insert, update, delete), even index building, are the result of appends to the end of the file. Existing data in the database isn't overwritten; instead it just becomes "outdated", with the newer data taking precedence. Databases designed this way have some interesting properties (a sketch of the technique follows the list):
1. Updates to the database are very fast. Since all writes are at the end of the file, there is much less seeking for the disk heads.
2. Truncated databases aren't corrupt; instead they are simply an earlier version of the database. This makes incremental backups a breeze, and even simple tools can back up an open database.
3. Incomplete writes aren't a problem: if a database update fails (for instance, due to power failure), it can't corrupt the existing state of the database.
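
Here's a toy append-only store in Python that makes the "newer data takes precedence" idea concrete. It's a sketch of the general technique, not CouchDb's actual record format:

    import json

    def append(path, obj_id, action, data=None):
        # Every insert/update/delete is an append; nothing is overwritten.
        with open(path, "a") as f:
            f.write(json.dumps({"id": obj_id, "action": action,
                                "data": data}) + "\n")

    def latest_state(path):
        # Replay front to back; later records outdate earlier ones. A
        # truncated file simply replays to an earlier consistent state.
        state = {}
        with open(path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break  # partial trailing write from a crash: ignore it
                if rec["action"] == "delete":
                    state.pop(rec["id"], None)
                else:
                    state[rec["id"]] = rec["data"]
        return state

    append("toy.db", "a", "update", {"n": 1})
    append("toy.db", "a", "update", {"n": 2})  # outdates the first record
    append("toy.db", "b", "update", {"n": 9})
    append("toy.db", "b", "delete")
    print(latest_state("toy.db"))  # {'a': {'n': 2}}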

This design has one obvious shortcoming: the database file is always growing with every change (even deletes), and old, useless data continues to waste disk space. All append-only databases have some way to compact or "vacuum" the file. Often it is as simple as creating a new file, writing all the current data into it, and then replacing the old file with the new, but more sophisticated schemes are also used to compact the database in place.
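The simple copy-based compaction might look like this sketch (it reuses the append/latest_state helpers from the previous sketch; the atomic file swap is the key step):

    import json, os

    def compact(path):
        live = latest_state(path)          # only the current data survives
        tmp = path + ".compact"
        with open(tmp, "w") as f:
            for obj_id, data in live.items():
                f.write(json.dumps({"id": obj_id, "action": "update",
                                    "data": data}) + "\n")
        os.replace(tmp, path)              # swap the new file for the old

    compact("toy.db")
    print(latest_state("toy.db"))          # same state, smaller file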

Unfortunately, this design has a serious limitation that I think makes it a poor choice for Couch. In my experience, the single biggest performance bottleneck in most Notes applications and servers is the view indexing. Typically when a Notes database is slow to open, it is because one or more needed view indexes are out of date and are being incrementally updated. Now if the data to build these views is scattered all over the file, then it will take many disk reads to load up the data before it can compute the table and index values. Since every update happens at the end of the file, the updates, over time, will be increasingly scattered over the file, meaning many seeks and data reads. Disk IO is the performance killer, and anything that can reduce disk IO helps.

To speed things up, Notes has an optimization where all the summary data for documents is placed into "summary buckets", such that many documents' summary data can be packed into a single bucket. Then when building the index, the summary data for the many documents can be loaded in one shot.

So if Couch utilized an append-only design, then when building its tables and indexes, it would often have to jump around, reading small parts of the file, greatly increasing the disk IO and slowing down the rate at which it can build the tables. So instead Couch adopts a combination of streaming zero-overwrite storage and atomic updates.

Couch Streaming Storage

Inside a CouchDb, the beginning of the file contains a traditional header that contains the metadata about the database and information about where key structures are located inside the database file. Following the header is the raw storage.

All raw storage is just a stream of data, and the stream is made up of file segments, which essentially form a linked list. Each segment has a small header and footer. The header specifies the length of the segment and has a pointer to the previous segment (for navigating a stream backwards), and the footer has a pointer to the next segment. All the space in between is the data for the stream.

Here is a diagram of a stream segment. It is the first segment in the stream (no previous segment), the data in the segment is 14 (0x0E) bytes long, and the location of the next segment is at file position 17772 (0x456C):

|00000000|0000000E|44616D69656E20697320636F6F6C|0000456C|
|  Prev  |  Data  |        ... Data ...        |  Next  |
| Segment| Length |                            | Segment|
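
Decoding that example with Python's struct module shows how the pieces line up. The 4-byte big-endian fields are my reading of the diagram; the real field widths could differ:

    import struct

    raw = bytes.fromhex("00000000" "0000000E"
                        "44616D69656E20697320636F6F6C"
                        "0000456C")

    prev, length = struct.unpack_from(">II", raw, 0)          # header
    data = raw[8:8 + length]                                  # stream data
    (next_seg,) = struct.unpack_from(">I", raw, 8 + length)   # footer

    print(prev, length, data, next_seg)
    # 0 14 b'Damien is cool' 17772 (0x456C)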


And here is a diagram of a CouchDb file with three streams, X, Y and Z, each with multiple segments:

HEADER XXXXX YYYY ZZZ XXXXX YYYYYYY YYYYY ZZZZZ XXXX
         |    |    |    |      |      |     |    |
         +----|----|----+------|------|-----|----+
              +----|-----------+------+     |
                   +------------------------+

(The connectors show each stream's next-segment links: X's segments chain together, as do Y's and Z's.)

When a new segment is allocated (usually at the end of the file), its header is written in and its footer is zeroed, and data can be streamed into it. Once the segment is filled with data, a new segment is allocated and the footer of the existing filled segment is written with the location of the new segment. This means that any storage stream is unbounded in length, which makes it easy to store data where the length isn't known ahead of time.
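
Here's a sketch of that allocation protocol in Python, using the same assumed field layout as the decoding sketch above:

    import struct

    HDR = struct.Struct(">II")   # previous-segment pointer, data length
    FTR = struct.Struct(">I")    # next-segment pointer (0 = not yet linked)

    def alloc_segment(f, prev_pos, data):
        f.seek(0, 2)                         # new segments go at the end
        pos = f.tell()
        f.write(HDR.pack(prev_pos, len(data)))
        f.write(data)
        f.write(FTR.pack(0))                 # footer zeroed until linked
        return pos

    def link(f, filled_pos, filled_len, new_pos):
        # Patch the filled segment's footer to point at the new segment.
        f.seek(filled_pos + HDR.size + filled_len)
        f.write(FTR.pack(new_pos))

    with open("stream.db", "w+b") as f:
        first = alloc_segment(f, 0, b"Damien is cool")
        second = alloc_segment(f, first, b"more stream data")
        link(f, first, len(b"Damien is cool"), second)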

This streaming architecture is the core of the CouchDb storage model. Since streaming data is such a big part of how Couch will work, I thought about renaming it to Streamy. Then I realized I could be really trendy and call it xStream storage. Then I punched myself in the nuts for coming up with such a stupid name.

The main storage streams in Couch are:
Summary stream - This contains all the meta-information about each object. As each object is written to the database, all its summary information is appended to this stream.

Update Log stream - For each object written or deleted, a small record of the update is appended to this stream. Each record contains the object's ID, its summary stream position and payload position, and the action taken (update or delete).

The Update Log stream is intended to be a stream that can be quickly "played back" to determine where the actual summary and payload data for an object resides.
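
As a sketch of what those log records and their playback might look like (the fixed-size record layout here is my own illustration, not CouchDb's actual format):

    import struct

    # object ID, summary stream position, payload position, action
    REC = struct.Struct(">16sQQB")
    UPDATE, DELETE = 0, 1

    def log_append(f, obj_id, summary_pos, payload_pos, action):
        f.seek(0, 2)
        f.write(REC.pack(obj_id.ljust(16, b"\0"),
                         summary_pos, payload_pos, action))

    def play_back(f):
        # Later records win, so one forward pass finds where each
        # object's current summary and payload data reside.
        locations = {}
        f.seek(0)
        while True:
            chunk = f.read(REC.size)
            if len(chunk) < REC.size:
                break                      # end (or torn trailing record)
            obj_id, spos, ppos, action = REC.unpack(chunk)
            if action == DELETE:
                locations.pop(obj_id, None)
            else:
                locations[obj_id] = (spos, ppos)
        return locations

    with open("update.log", "w+b") as f:
        log_append(f, b"order-1001", 1024, 9000, UPDATE)
        log_append(f, b"order-1001", 2048, 9500, UPDATE)  # supersedes it
        print(play_back(f))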

To get around the inherent deficiencies in always playing back the full streams in order to find out the latest state of the stream (including its end), pointers to the ends or other key places of the streams can be stored in a header (they can also be stored in more streaming storage). Also, if streams get too wasteful (the percentage of invalidated entries is high), the streams can be copied out to new streams with the invalidated data removed.

Doubly Written Headers

The headers in CouchDb are doubly written headers. That is, the header is duplicated, one right after the other. This way, when "committing" writes to the database by updating a header, if the power suddenly goes out in the middle of the write, the header should still be in a recoverable state. If one header is partially written, then the other header can be used. When this happens, the data written to the stream since the last committed header is lost, but the stream and database are not corrupt, and no "fixup" or consistency check will be necessary after a crash. And since the headers are typically small, the double writing shouldn't add much disk overhead.
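
Here's a sketch of the recovery logic, assuming a checksum is used to detect a torn write (I haven't said how CouchDb detects a partial header, so the CRC and the sizes here are my assumptions):

    import struct, zlib

    SLOT = 64    # assumed fixed size reserved for each header copy

    def write_header(f, payload):
        rec = struct.pack(">II", len(payload), zlib.crc32(payload)) + payload
        rec = rec.ljust(SLOT, b"\0")
        f.seek(0)
        f.write(rec)        # copy 1
        f.write(rec)        # copy 2, right after it
        f.flush()

    def read_header(f):
        for slot in range(2):               # try each copy in turn
            f.seek(slot * SLOT)
            blob = f.read(SLOT)
            length, crc = struct.unpack_from(">II", blob, 0)
            payload = blob[8:8 + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                return payload              # first intact copy wins
        raise IOError("both header copies are corrupt")

    with open("couch.db", "w+b") as f:
        write_header(f, b"summary_end=4096;log_end=8192")
        print(read_header(f))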

So whenever an index is opened, it will have a record of the last entry in the update log stream that it processed. It will then start reading the log from that point and build a list of all the objects that have been updated or deleted since. Then it will stream out the object summaries from the summary stream and add the new records to the index. Since the summary stream is mostly contiguous on disk, loading the summaries from disk should be pretty efficient.

The indexes it creates will be stored likewise in an append-only storage stream. As it loads and processes the summaries, it will add the modifications to the index in RAM. Once it's used a certain amount of memory, the index changes will be flushed to the index stream.
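
Putting the last two paragraphs together, here's a sketch of the incremental update loop (the names, the flush threshold, and the flushed format are all illustrative):

    import io, json

    FLUSH_AT = 2   # tiny memory budget so the demo flushes; real one larger

    def update_index(log_records, load_summary, index_stream, last_pos):
        # Resume from the last log entry processed on a previous open.
        buffered, new_last = {}, last_pos
        for pos, (obj_id, action) in enumerate(log_records):
            if pos <= last_pos:
                continue
            # Summaries are mostly contiguous on disk, so this is cheap.
            buffered[obj_id] = (None if action == "delete"
                                else load_summary(obj_id))
            new_last = pos
            if len(buffered) >= FLUSH_AT:     # memory budget hit: flush
                index_stream.write(json.dumps(buffered) + "\n")
                buffered = {}
        if buffered:
            index_stream.write(json.dumps(buffered) + "\n")
        return new_last                       # remember for the next open

    log = [("a", "update"), ("b", "update"), ("a", "delete")]
    out = io.StringIO()
    print(update_index(log, lambda oid: {"id": oid}, out, last_pos=-1))
    print(out.getvalue())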

Again, because of the streaming nature of the database, both the reads and the writes should be pretty efficient, with relatively little disk seeking.

Ok, I've written enough for now. During the process of developing CouchDb, I am making many unproven assumptions, so if anyone sees any flaws or mistaken assumptions above, please tell me. Also, I welcome any questions, as they'll help me clarify and explain this stuff better in the future.

Posted April 29, 2005 3:43 PM