Thoughts on Optimizing CouchDB

As I've stated before, no profiling work has been done on CouchDB whatsoever. It has been quick enough that it hasn't mattered at this still early stage, so nothing has been optimized. That also means CouchDB has lots of room for performance improvement: if it's fast enough now, it should scream once some serious optimization work has been put in. I'd guess a couple of weeks of profiling and optimization, just finding and eliminating the hot spots, will result in a 10x performance increase. I'm just guesstimating, because easy-to-fix hot spots in unprofiled code usually account for ~90% of the low-hanging performance fruit.

CouchDB has a custom storage and view indexing engine with a simple design that is optimized first for reliability and availability, while still having good update, retrieval and indexing performance. It's also designed for high concurrency, using optimistic updates that never block readers. It trades disk space for much of this (fortunately disk space is cheap and getting cheaper all the time).
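To make the optimistic, never-block-readers idea concrete, here's a minimal Python sketch (not CouchDB's actual Erlang implementation, and the class and method names are my own invention). Writers never mutate a version in place; every update appends a new revision and must present the revision it read, or it gets a conflict. Keeping every old version around is the disk-space trade mentioned above.

```python
import itertools

class ConflictError(Exception):
    """Raised when an update's expected revision is stale."""

class OptimisticStore:
    """Toy append-only store: writers never overwrite, readers never block.

    Each update appends a new (rev, doc) version; old versions stay on
    "disk" (here, a list), which is the space-for-concurrency trade-off.
    """
    def __init__(self):
        self._versions = {}           # doc_id -> list of (rev, doc)
        self._rev = itertools.count(1)

    def read(self, doc_id):
        # Readers just grab the latest committed version; no locks needed.
        rev, doc = self._versions[doc_id][-1]
        return rev, doc

    def write(self, doc_id, doc, expected_rev=None):
        history = self._versions.setdefault(doc_id, [])
        current_rev = history[-1][0] if history else None
        if current_rev != expected_rev:
            raise ConflictError(f"doc {doc_id}: rev {expected_rev} is stale")
        new_rev = next(self._rev)
        history.append((new_rev, dict(doc)))  # append, never mutate in place
        return new_rev

store = OptimisticStore()
rev1 = store.write("a", {"x": 1})
rev2 = store.write("a", {"x": 2}, expected_rev=rev1)
```

A concurrent writer still holding `rev1` would get a `ConflictError` rather than silently clobbering the second update.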

Within the design of the core server and storage engine there is opportunity for many optimizations. Things off the top of my head: caching (Erlang has a really good native Judy array implementation ideal for an object cache. EDIT: I'm apparently wrong, the Judy arrays never made it into core Erlang. Anyway, Erlang does have ETS with an unordered set option, which is ideal for an object cache.), unordered storage, data structure compaction and string atomization, binary data compression, more efficient string collation, plus all the things I don't yet know are performance problems because I haven't profiled.
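As an illustration of the object cache idea, here's a tiny Python sketch standing in for an ETS-backed document cache. The capacity bound and LRU eviction policy are my illustrative choices, not anything CouchDB specifies.

```python
from collections import OrderedDict

class ObjectCache:
    """Tiny LRU object cache, a stand-in for an ETS-backed doc cache.

    ETS would give constant-time lookup in shared memory; here an
    OrderedDict tracks recency so we can bound memory use by evicting
    the least recently used entry.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None               # cache miss: caller hits storage
        self._data.move_to_end(key)   # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```

The point isn't the eviction policy; it's that repeated reads of hot documents skip the storage engine entirely.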

Within the front-end HTTP server layer there are more big areas for optimization. Replacing the inets httpd server with something more lightweight, like the MochiWeb server, will allow for HTTP and JSON stream processing. Also, adding more bulk HTTP calls for bulk reads and multi-key view lookups (the core already supports these; they're just not exposed over HTTP). Replication will speed up tremendously when its HTTP calls are performed in bulk.
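Why bulk calls matter is just round-trip arithmetic. Here's a hypothetical sketch (the server class and method names are mine, not CouchDB's API) that counts round trips instead of doing real I/O: fetching 100 documents one at a time costs 100 round trips, while one bulk call costs a single round trip for the same results.

```python
class FakeServer:
    """Stand-in for the HTTP layer; counts round trips instead of doing I/O."""
    def __init__(self, docs):
        self.docs = docs
        self.requests = 0

    def get(self, key):
        # One HTTP round trip per key.
        self.requests += 1
        return self.docs.get(key)

    def bulk_get(self, keys):
        # One HTTP round trip for the whole batch of keys.
        self.requests += 1
        return [self.docs.get(k) for k in keys]

server = FakeServer({f"doc{i}": {"n": i} for i in range(100)})
keys = [f"doc{i}" for i in range(100)]

one_by_one = [server.get(k) for k in keys]   # 100 round trips
after_singles = server.requests
bulk = server.bulk_get(keys)                 # 1 more round trip
assert one_by_one == bulk
```

Replication is exactly this pattern: thousands of per-document fetches collapsing into a handful of batched calls, with network latency paid once per batch instead of once per document.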

View indexing has tons of possible optimizations and could easily become a project in itself: compression, IO optimizations, static analysis of map functions, etc.

So if you ask how fast CouchDB will be, I say measure what it does now and expect it to do it at least 10x faster by 1.0. Being a big project with a lot of layers, I'd say a 10x improvement over the un-optimized baseline is an easy target to hit. In a few years, after more really smart engineers have worked on it, 100x may be possible.

Posted December 13, 2007 1:35 PM