Incremental Map/Reduce
Ok, here is a quick rundown of how the full Map/Reduce functionality in CouchDB will work.
So right now, to create views in CouchDB, you create a function that takes a document and outputs key value pairs.
This is a function that defines a view of documents keyed by their tags:
function (doc) {
for(var i=0; i<doc.tags.length; i++)
map(doc.tags[i], null)
}
To be used, this function must be stored in a design document:
{
"_id":"_design/tags",
"views": {"all": "function (doc) {...snipped...}"}
}
Then we can get the results:
GET /db/_view/tags/all
Result:
{"total_rows":6,
"rows":[
{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"bar",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"bar",
"value":null},
{ "id":"4F36927CF48762DA92AAA628CB415613",
"key":"bar”,
"value":null}
{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"foo",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"foo",
"value":null}
]}
Then get just the docs tags foo:
GET /db/_view/tags/all?key="foo"
Result:
{"total_rows":6,
"rows":[
{ "id":"2DA92AAA628CB4156134F36927CF4876",
"key":"foo",
"value":null},
{ "id":"4156134F36927CF48762DA92AAA628CB",
"key":"foo",
"value":null}
]}
The restriction on map functions is that they must be referentially transparent. That is, given the same input document, they will always emit the same key/value pairs. This allows CouchDB views to be updated incrementally, only reindexing the documents that have changed since the last index update.
Okay cool, simple and efficient and it's already working. (The only thing I want to change about it is rename the internal javascript map(Key,Value) call to emit(Key,Value). It's less confusing that way.)
Now what about Reduce?
This is a similar example to the previous view map function, but this time we want a view of all the tag counts in the database. (The user programable reduce isn't done yet, but this is how it will work.)
We need a Map function to output doc tags:
function (doc) {
for(var i=0; i<doc.tags.length; i++)
emit(doc.tags[i], 1);
}
And a Reduce function to count the tags:
function (tag, counts) {
int sum=0;
for(var i=0; i<counts.length; i++)
sum+=counts[i];
return sum;
}
And then stored in a design document:
{
"_id":"_design/tags",
"views": {
"all": "function (doc) {...snip...}",
"count": {
"map": "function (doc){...snip...}"
"reduce": "function (tag, counts){...snip...}"
}
}
}
Now to access the results:
GET /db/_view/tags/count
Result:
{"total_rows":2,
"rows":[
{ "key":"bar",
"value":3},
{ "key":"foo",
"value":2}
]}
Then get the count of just the docs tagged "foo":
GET /db/_view/tags/count?key="foo"
Result:
{"total_rows":2,
"rows":[
{ "key":"foo",
"value":2}
]}
To make incremental Map/Reduce possible, the Reduce function has the requirement that not only must it be referentially transparent, but it must also be commutative and associative for the array value input, to be able reduce on it's own output and get the same answer, like this:
f(Key, Values) == f(Key, [ f(Key, Values) ] )
This requirement of reduce functions allows CouchDB to store off intermediated reductions directly into inner nodes of btree indexes, and the view index updates and retrievals will have logarithmic cost. It also allows the indexes to be spread across machines and reduced at query time with logarithmic cost.
The incremental design makes it possible to use map/reduce to query huge partitioned clusters in realtime, instead of having to wait for a whole map/reduce job to complete or having stale, occasionally updated indexes,. The downside is it may be harder to write the Reduce function in an associative and commutative manner.
Update: Here is a followup posting about Combiner functions.
Posted February 7, 2008 8:02 PM
Comments
s/map/emit/ is a huge improvement, it was the thing that hung me up most on understanding couch when I first played with it.
Bill Mill, February 8, 2008 12:01 AM
Sounds like a good fit - I can't wait to try it out.
One quick correction - s/values/counts/ in the reduce function:
function (tag, counts) {
int sum=0;
for(var i=0; i sum+=values[i];
return sum;
}
Brian P O'Rourke, February 8, 2008 1:56 AM
Foiled again by comment formatting. More concisely:
sum+=values[i];
should be
sum+=counts[i];
in your reduce function.
Brian P O'Rourke, February 8, 2008 1:59 AM
Thanks Brian, fixed.
Damien Katz, February 8, 2008 3:07 AM
That Commutative / Associative stuff seems convincing as a requirement for scalability. I'm really excited for my first CouchDB project.
Chris Anderson, February 8, 2008 5:27 AM
First, it's really cool to see this moving along. I'm going to be really interested to see the guts of this when you've got something to share.
However, Doesn't the Commutative / Associative requirement make this more analogous to the combiner function from the canonical Google paper. While this is great for certain classes of problem, it doesn't provide for a true MapReduce , as defined by that paper.
Of course the limit of distributability of a true reduce function is per intermediate key, and you have to run the whole reduce for that key for any update.
I'll post some more questions and thought on the Google Group.
Still great work.
Kerr, February 8, 2008 6:01 AM
In the example
the result has "total Rows" as 6 - shouldn't it be 2?This is all quite interesting and exciting stuff - fun times ahead.
Jos, February 8, 2008 8:37 AM
BRILLIANT!
Raj, February 8, 2008 9:53 AM
Thanks everyone. I really should have written this up a while ago.
Joe, the total_rows property in the results is used to indicate the total rows in the view, not the result set returned.
Damien Katz, February 8, 2008 2:25 PM
Damien this looks fantastic.
Have you had any thoughts on incrementally maintaining views which have
contributing documents join and leave passively, instead of through an explicit
update / delete / insert action?
For example - to maintain a count of comments by entry this week, the map might
look something like:
function (doc) { if(doc.type == "comment") { var now = new Date(); var period = 1000*60*60*24*7; // 1 week in millseconds var lastweek = now.getTime() - period; var created = new Date(doc.year, doc.month, doc.day).getTime(); if (created > lastweek) emit(doc.entry, 1); } }and the reduce would be pretty much the same as the count the tags reduce in
the article.
It'd be great if, as time passes, comments which age to be older than 1 week
are incrementally dropped from the map, and the corresponding comments this
week count incrementally lowered.
Andy Gayton, February 9, 2008 5:13 PM
Post a comment