CouchDb Update

CouchDb is coming along quite nicely. It is now a replicable store, it features atomic updates, and it is crash-only in design. But it's still very much in its infancy and nowhere near production ready.

CouchDb has a 64-bit file size limit, but currently opening a db means loading a small amount of metadata for every note into memory, which means it really doesn't scale very impressively with the total number of notes. But that problem will be handled later, and I think it will be possible to make it scale to millions of notes with tweaks to the current architecture. Currently it handles thousands quite easily.

The next big work item is the indexing facility. I'm probably going to use a B+ tree design for the on-disk stuff but I'm still doing research. This is difficult stuff and integral for overall speed and scalability. There are trade-offs for every design, and I really need to think about how CouchDb will be used most.

Also, I'm not sure how to let the user specify the selection criteria and how the values get computed in the indexes. In Lotus Notes, users (developers really) write Formula Language expressions for each column and selection formula, which is an effective and proven way to go about it. But the thing is, I don't have a formula engine to drop into CouchDb, and I really don't feel like writing one. I've already done it for Notes, and it's a lot of work.

So one thing I'm considering is a popular scripting language like Python or Ruby, but really, general-purpose languages might be a bit much. If the language is full-featured and imperative, how do I restrict what happens while computing column values? One thing that is vital when computing the values is that they be based only on the data and metadata of the note, or in geek terms, the code must be referentially transparent with the note as the sole input. This way you get consistent results and the tables can be built incrementally, just like Lotus Notes.
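To make the incremental part concrete, here is a rough sketch in Python of what I mean. Everything in it is made up for illustration (the IncrementalTable class, the selection and columns parameters, and the "Form", "LastName" and "FirstName" fields are all assumptions, not CouchDb code). The point is only that if each column value depends on nothing but the note, then a changed note means recomputing exactly one row. Note that Python itself does nothing to enforce that the functions are pure, which is exactly the restriction problem described above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed shape: a note is a flat set of name-value pairs.
Note = Dict[str, object]


class IncrementalTable:
    """Hypothetical table whose rows are pure functions of single notes."""

    def __init__(
        self,
        selection: Callable[[Note], bool],
        columns: List[Callable[[Note], object]],
    ) -> None:
        self.selection = selection  # decides whether a note appears in the table
        self.columns = columns      # one function per column, note in, value out
        self.rows: Dict[str, Tuple[object, ...]] = {}

    def update(self, note_id: str, note: Optional[Note]) -> None:
        """Recompute the row for one changed note. None means the note was deleted."""
        if note is None or not self.selection(note):
            # Deleted or no longer selected: drop its row if it had one.
            self.rows.pop(note_id, None)
        else:
            # Only this note's row is recomputed; all other rows stay valid
            # because no column function can see anything except its own note.
            self.rows[note_id] = tuple(col(note) for col in self.columns)


table = IncrementalTable(
    selection=lambda n: n.get("Form") == "Contact",
    columns=[lambda n: n.get("LastName", ""), lambda n: n.get("FirstName", "")],
)
table.update("n1", {"Form": "Contact", "LastName": "Smith", "FirstName": "Ann"})
table.update("n2", {"Form": "Memo", "Subject": "hello"})  # not selected, no row
```

If a column function peeked at the clock, a global counter, or another note, a row computed earlier could silently go stale and the only safe fix would be a full rebuild.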

XQuery is also an option (also used in Berkeley DB XML and other XML DBs) but that brings other issues into play, such as whether I really want this to be an XML database. What about all the complexity that might bring into it for the developer who's using CouchDb? I prefer the relative simplicity of the name-value pair that's used in Lotus Notes and I think that's an ideal way to go about it. However, XQuery is a standard and there's more likely to be a drop-in engine I can use. And really, XQuery is designed for just such things, but I don't like it because I want things as simple as possible.

Anyway, the query language/engine portion is something that can actually be pluggable, so I don't have to make that decision soon. So I'm going to move ahead with the core indexing stuff and write the table selection and column code in C++ for testing purposes.
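By pluggable I mean something along these lines. This is a sketch in Python rather than the C++ I'll actually be writing, and the names (ViewDefinition, HardCodedContactView, build_index) are invented for the example. The indexing core only talks to a small interface, and the hard-coded class stands in for whatever formula or script engine eventually sits behind it.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ViewDefinition(ABC):
    """Hypothetical plug-in point: anything that can select notes and compute columns."""

    @abstractmethod
    def selects(self, note: dict) -> bool:
        """Return True if the note belongs in the table."""

    @abstractmethod
    def column_values(self, note: dict) -> Tuple[str, ...]:
        """Return the column values for a selected note."""


class HardCodedContactView(ViewDefinition):
    """Stand-in for a real engine: selection and columns written directly in code."""

    def selects(self, note: dict) -> bool:
        return note.get("Form") == "Contact"

    def column_values(self, note: dict) -> Tuple[str, ...]:
        return (str(note.get("LastName", "")),)


def build_index(
    view: ViewDefinition, notes: Dict[str, dict]
) -> List[Tuple[Tuple[str, ...], str]]:
    """Indexing core: knows only the interface, never the engine behind it."""
    rows = [
        (view.column_values(note), note_id)
        for note_id, note in notes.items()
        if view.selects(note)
    ]
    # A plain sorted list stands in here for the on-disk index structure.
    return sorted(rows)


notes = {
    "n1": {"Form": "Contact", "LastName": "Smith"},
    "n2": {"Form": "Memo", "Subject": "hello"},
    "n3": {"Form": "Contact", "LastName": "Adams"},
}
index = build_index(HardCodedContactView(), notes)
```

Swapping in a different engine later would mean writing another ViewDefinition subclass and leaving build_index alone.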

One of these days I'm going to create a proper document that really explains CouchDb in a more comprehensive and understandable fashion. But that won't happen until I actually get the whole architecture nailed down. Hopefully that will be soon.

As always, feedback is much appreciated. (oops, ironically comments were off, fixed now)

Posted June 13, 2005 11:35 PM