CouchDB Performance

Seth Falcon gives CouchDB performance a nice going over in his article A quick look at CouchDB Performance. I actually tried to post a reply on the blog a couple of days ago, but I think something went wrong when I posted it. Anyway, my response is back, in blog form.

Performance-wise, things looks slightly better than I thought they would. A few things to note:

For Test 1, I created new documents in an empty database one after another in a loop.

=== Add single doc in a loop ===
|      N |  sec | Docs/sec |
|   1000 |    9 |      111 |
|  10000 |  102 |       98 |
| 100000 | 1075 |       93 |
The times scale linearly, but I was surprised to see a 1.3GB file size for the database in the 100K case.

I think he means the times scale constantly, not linearly, as a function of the number of documents in the database. Constant scaling is ideal, but in practice never achievable, and linear scaling is typical for unindexed, flat files. The actual update cost as a function of the number of documents is logarithmic, which is hard to see here. I think the fixed costs of the doc write, parsing and network overhead in these samples is outsizing the logarithmic degradation of the core index updates.

The large size of the file is due to the overhead of write-once internal indexes that must be partially copied with every document update. The compaction process will recover this space (the compaction code isn't done yet). In general, the amount of wasted space for a freshly compacted db should be less that 2x the raw document size (though the overhead per document is fixed + logarithmic cost, so large documents have less waste per document).

Here I was surprised both that 100K documents was enough to exhaust the system’s memory and that the result was a complete crash of the server. I wonder how hard it would be to take advantage of Erlang’s touted fault tolerant features to make the server more robust to such situations.

In this case, the high memory usage is due the lack of proper HTTP and JSON stream processing in the CouchDB HTTP server module. Everything is buffered before processing, causing the memory spikes you see. Switching to the mochiweb erlang HTTP server library will help with this.

And the fact that low memory causes the whole VM to crash is something that surprised and saddened me, and we are addressing it with a parent process to monitor and restart the VM process. I'm not happy at all that the Erlang VM is so fragile with memory allocations, I'd assumed it would terminate the virtual Erlang process requesting the memory, not the whole VM process. But I digress.

One thing I'd like to see are some concurrent client load tests, and see how many clients it supports and how performance looks with different read write loads. I think CouchDB will do very well here.

Anyway, great stuff Seth, thanks for the testing and article!

Posted December 18, 2007 8:06 PM