Nice writeup

Harry Fuecks writes about CouchDb:

Working out the last modification time (caching), replication / mirroring, administration and a whole host of other stuff gets much easier to manage, vs. a relational database where what constitutes a complete "document" may be spread across multiple tables. Of course the downside is stuff like searching, sorting and relations gets harder. Enter CouchDb, where (if I've understood right) you can "compile" tables from the contents of your raw documents using its fabric formula language. Assuming the processing done to create the tables is reproducible, replicating databases across systems would then "only" be a matter of copying the raw documents.

Bullseye. Harry nails it.

SitePoint Blogs - CouchDb: document oriented persistence

Posted September 6, 2006 9:25 PM

Comments

Hey Damien, I've got a question. I'm not very familiar with CouchDb (yet), but I was reading over the wiki and the quick intro page and noticed the thing about documents having unique ids. Makes sense and everything - but I was curious if you considered having the docid be a hash of the document (like md5, sha1, etc.)? The primary advantage that I could think of would be a guarantee (well, not 100% obviously, due to the possibility of collisions) that not only is your data uncorrupted (since it could be re-validated very easily), but also that you'd never have to store more than one copy of a given document. There are probably downsides to this that I haven't thought of yet in the context of your system (as I said, just working through the intro docs now), but I figured it wouldn't hurt to toss the question out there. Maybe you have addressed this very idea somewhere else. Or maybe it's a stupid idea. :-) Anyway, love the concept of CouchDb so far from what I've read!

Sean, September 6, 2006 10:18 PM
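To make the idea in this comment concrete, here is a minimal Python sketch of a store whose document id is a hash of the content. It is not how CouchDb works; the class and method names (ContentAddressedStore, put, verify) are hypothetical and SHA-1 is just one of the hashes mentioned above.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store (hypothetical) whose doc id is the SHA-1 of the content."""

    def __init__(self):
        self._docs = {}  # doc id (hex digest) -> raw bytes

    def put(self, content: bytes) -> str:
        doc_id = hashlib.sha1(content).hexdigest()
        # Identical content always hashes to the same id, so it is kept only once.
        self._docs[doc_id] = content
        return doc_id

    def get(self, doc_id: str) -> bytes:
        return self._docs[doc_id]

    def verify(self, doc_id: str) -> bool:
        # Re-hash the stored bytes; a mismatch means the data no longer matches its id.
        return hashlib.sha1(self._docs[doc_id]).hexdigest() == doc_id

    def __len__(self) -> int:
        return len(self._docs)


store = ContentAddressedStore()
first = store.put(b"hello couch")
second = store.put(b"hello couch")  # same bytes, same id, still one stored copy
```

Both advantages from the comment show up here: the second put does not add a copy, and verify can re-check any document against its own id.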

The downside is that you couldn't ever have two identical documents.

OTOH, if having a guarantee that there is just one instance of any given document content is a goal, which is useful in archival systems where data comes in from multiple sources, then using a hash as the document id is a great fit.

Richard Schwartz, September 7, 2006 12:13 AM

The big issue with using a hash as the document ID is not that it would require uniqueness of documents, but that if you modified a document its hash would change. This means that the change could not be replicated to other replicas as an update, because they would see a document with a different ID. If you have some application where editing of documents could and should never happen then this might be a rule in your application logic, but it wouldn't work as a requirement of the core database.

Alan Bell, September 7, 2006 5:28 AM

You'd never modify existing documents when using a hash as the identifier.

You store the modified document under a new hash, and then relate it to the previous version.

I'm experimenting with writing a Wiki that uses a similar process.

Jared "Ren" Williams, September 7, 2006 6:02 AM

Jared, I see nothing wrong with your approach; it's the backend of more than one large-scale storage system. However, to use it, you must also keep an external index of the md5 object identifiers, and some sort of meta info about the object (like a name).

And there is the problem of identical documents, which isn't a problem for the system until you want to delete only one copy of the document. In that case there must be a ref-counting scheme. (This problem may not apply to specific applications, like a Wiki.)

CouchDb could fuse the two concepts into one, but the only advantage I see of going to an MD5 hash scheme is disk space. Given how cheap disk space has become, I don't see that as being much of an advantage.

But it's easy enough for CouchDb to incorporate a full document hash as part of the document to validate integrity. I'll have to think about maybe making the revision id be a document hash. That could have interesting consequences, hmmmm....

Damien, September 7, 2006 1:50 PM
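The ref-counting point above can be sketched in a few lines of Python. This is only an illustration under assumed names (RefCountedStore, add, delete are hypothetical), not CouchDb code: each add of identical content bumps a counter, and the bytes are dropped only when the last reference is deleted.

```python
import hashlib
from collections import Counter


class RefCountedStore:
    """Toy hash-keyed store (hypothetical) that tracks how many copies were added."""

    def __init__(self):
        self._blobs = {}        # md5 hex digest -> raw bytes
        self._refs = Counter()  # md5 hex digest -> number of live references

    def add(self, content: bytes) -> str:
        key = hashlib.md5(content).hexdigest()
        self._blobs.setdefault(key, content)  # store the bytes only once
        self._refs[key] += 1
        return key

    def delete(self, key: str) -> None:
        if self._refs[key] == 0:
            raise KeyError(key)
        self._refs[key] -= 1
        if self._refs[key] == 0:
            # Last reference gone: now it is safe to drop the bytes.
            del self._refs[key]
            del self._blobs[key]

    def __contains__(self, key: str) -> bool:
        return key in self._blobs


blobs = RefCountedStore()
k1 = blobs.add(b"same text")
k2 = blobs.add(b"same text")  # second "copy" shares the stored bytes
blobs.delete(k1)              # one copy deleted, content must survive
still_there = k1 in blobs
```

After the first delete the content is still present because one reference remains; a second delete would remove it.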


The basic structure of my Wiki development so far is:

1) Store documents. Send data to it, and it returns a key (contentKey) to use to retrieve the data. The simplest implementation is to sha1() the data, use that as a path/filename and write it directly to a file.

2) Store document histories. Given a document name, a contentKey, and optional parent revision id(s).

The revision id is created by concatenating the parent revision id and the contentKey and then hashing with sha1.

Design is similar to a few distributed version control systems, mainly monotone and mercurial. See http://www.selenic.com/mercurial/wiki/index.cgi/Design

Jared "Ren" Williams, September 7, 2006 6:11 PM
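The two steps described above could look roughly like this in Python. This is a sketch under assumptions, not Jared's actual code: it keeps everything in memory instead of writing files, the names (WikiStore, commit, history) are hypothetical, and the way parent ids are joined before hashing is a guess at "concatenating".

```python
import hashlib
from typing import Sequence


def content_key(data: bytes) -> str:
    # Step 1: the contentKey is the sha1 of the document data.
    return hashlib.sha1(data).hexdigest()


def revision_id(parent_ids: Sequence[str], key: str) -> str:
    # Step 2: concatenate parent revision id(s) and the contentKey, then sha1.
    return hashlib.sha1(("".join(parent_ids) + key).encode("ascii")).hexdigest()


class WikiStore:
    """Toy in-memory version of the document store plus history store."""

    def __init__(self):
        self.contents = {}   # contentKey -> bytes
        self.revisions = {}  # revision id -> (contentKey, parent revision ids)
        self.heads = {}      # document name -> latest revision id

    def store(self, data: bytes) -> str:
        key = content_key(data)
        self.contents[key] = data
        return key

    def commit(self, name: str, key: str, parents: Sequence[str] = ()) -> str:
        rev = revision_id(parents, key)
        self.revisions[rev] = (key, tuple(parents))
        self.heads[name] = rev
        return rev

    def history(self, name: str) -> list:
        # Walk first parents from the newest revision back to the root.
        out = []
        rev = self.heads.get(name)
        while rev is not None:
            out.append(rev)
            parents = self.revisions[rev][1]
            rev = parents[0] if parents else None
        return out


wiki = WikiStore()
r1 = wiki.commit("HomePage", wiki.store(b"first draft"))
r2 = wiki.commit("HomePage", wiki.store(b"second draft"), parents=[r1])
```

Because each revision id depends on its parent's id, changing any earlier revision would change every id after it, which is the property the distributed version control systems mentioned above rely on.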
