Incremental Map/Reduce

Ok, here is a quick rundown of how the full Map/Reduce functionality in CouchDB will work.

So right now, to create views in CouchDB, you create a function that takes a document and outputs key value pairs.

This is a function that defines a view of documents keyed by their tags:

function (doc) {

for(var i=0; i<doc.tags.length; i++)

map(doc.tags[i], null)
}

To be used, this function must be stored in a design document:

{
"_id":"_design/tags",
"views": {"all": "function (doc) {...snipped...}"}
}

Then we can get the results:

GET /db/_view/tags/all

Result:

{"total_rows":6,
"rows":[
{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"bar",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"bar",
"value":null},
{ "id":"4F36927CF48762DA92AAA628CB415613",
"key":"bar”,
"value":null}

{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"foo",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"foo",
"value":null}
]}

Then get just the docs tags foo:

GET /db/_view/tags/all?key="foo"

Result:


{"total_rows":6,
"rows":[
{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"foo",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"foo",
"value":null}
]}

The restriction on map functions is that they must be referentially transparent. That is, given the same input document, they will always emit the same key/value pairs. This allows CouchDB views to be updated incrementally, only reindexing the documents that have changed since the last index update.

Okay cool, simple and efficient and it's already working. (The only thing I want to change about it is rename the internal javascript map(Key,Value) call to emit(Key,Value). It's less confusing that way.)

Now what about Reduce?

This is a similar example to the previous view map function, but this time we want a view of all the tag counts in the database. (The user programable reduce isn't done yet, but this is how it will work.)

We need a Map function to output doc tags:


function (doc) {

for(var i=0; i<doc.tags.length; i++)

emit(doc.tags[i], 1);

}

And a Reduce function to count the tags:


function (tag, counts) {
int sum=0;

for(var i=0; i<counts.length; i++)

sum+=counts[i];
return sum;

}

And then stored in a design document:


{
"_id":"_design/tags",
"views": {
"all": "function (doc) {...snip...}",
"count": {
"map": "function (doc){...snip...}"
"reduce": "function (tag, counts){...snip...}"
}
}
}

Now to access the results:

GET /db/_view/tags/count

Result:

{"total_rows":2,
"rows":[
{ "key":"bar",
"value":3},
{ "key":"foo",
"value":2}
]}

Then get the count of just the docs tagged "foo":

GET /db/_view/tags/count?key="foo"

Result:

{"total_rows":2,
"rows":[
{ "key":"foo",
"value":2}
]}

To make incremental Map/Reduce possible, the Reduce function has the requirement that not only must it be referentially transparent, but it must also be commutative and associative for the array value input, to be able reduce on it's own output and get the same answer, like this:

f(Key, Values) == f(Key, [ f(Key, Values) ] )

This requirement of reduce functions allows CouchDB to store off intermediated reductions directly into inner nodes of btree indexes, and the view index updates and retrievals will have logarithmic cost. It also allows the indexes to be spread across machines and reduced at query time with logarithmic cost.

The incremental design makes it possible to use map/reduce to query huge partitioned clusters in realtime, instead of having to wait for a whole map/reduce job to complete or having stale, occasionally updated indexes,. The downside is it may be harder to write the Reduce function in an associative and commutative manner.

Update: Here is a followup posting about Combiner functions.

Posted February 7, 2008 8:02 PM