Compaction

File compaction is now checked into the Apache CouchDB SVN repos.

CouchDB databases grow with every document update, even if the update is a document deletion. So file compaction needs to be run occasionally to recover wasted disk space.

The compaction process will purge all old revisions and pack together the documents on disk to make sequential document access faster, like during view rebuilds and replication. Compaction is copy style and happens live or hot, while the database server is actively running. Normal database operations, reads, view index refreshes, document updates, replication, etc can all happen while the database is actively being compacted.

Also it is incremental and restartable, so if the server is shut down or there is a power failure in the middle of compaction, the next time you restart the compaction it will start back near the last spot where it left off.

So that's now checked in with a unit test working, though like most of the code it needs more testing, etc.

And here are still some other enhancements I'd like to see to storage compaction. One is compaction queuing, to make sure only one database at a time is compacting since it's a very disk IO heavy operation. That's fairly easy to implement.

Another enhancement is dealing with long transactions better that overlap the compaction file transition. Currently when a compaction completes, any read or write that started before the compaction completed will have at least 5 seconds to finish before it will be forcibly terminated with an error. That can be fixed to allow any unlimited amounts of time for transactions to complete, and to do so actually ties into some larger changes I'd like to see in the code. But until then clients can just retry the operation and things will be fine, so for now these are low priority things and I'm ok with it if they don't get done before 1.0.

Right now view indexes still do not compact, but that will be fixed later. For now as a stopgap, just delete the index files and the views will rebuild from scratch and hence "compacted". But views definitely will have proper compaction before 1.0.

Right now we are working on getting the Mochiweb branch finished and integrated back into trunk (Mochiweb is a replacement library for the current Inets HTTP library). Christopher Lenz has been doing most of the work, and I'm now going to help out finishing it up and hopefully get it checked in this week.

Once that's done plus a few more small tweaks, we might consider CouchDB 0.8 done. Then the release after that I think we'll target the incremental reduce and security, and then CouchDB will be in beta.

Posted April 7, 2008 7:19 PM