Incremental Map/Reduce/Combiner?

Because people keep asking me about a Combiner function (which is written about in the Google Map/Reduce paper and also used in Hadoop), here's a follow up to yesterdays post. The reason I'm not talking about about a combiner function is because I'm pretty sure the design I propose is isomorphic to a Map/Reduce/Combiner combo design. That is to say it's the same damn thing, just re-jiggered a little.

To illustrate, here is an example that does it's own combiner operations inside the reduce.

Let's say we have a database full of receipts and we want to calculate the total spent in a month and the average amount of each purchase.

Here is an example receipt:

{
"_id":"2DA92AAA628CB4156134F36927CF4876",
"type":"receipt"
"amount":19.95,
"note":"Enzyte, i sure hope this works",
"year":2008,
"month":1,
"day":12,
....
"}
}

Calculating the total spent in a reduction is easy, it's simple addition which is commutative and associative. But calculating an average is not intrinsically associative, so we'll have to make our function a little more complex to make the Javascript function behave properly.

This map function outputs each receipt's year and month as the key, and the amount as the value:


function (doc) {

if (doc.type == "receipt") {

var key = [doc.year, doc.month];
emit(key, doc.amount);
}
}

And a Reduce function to compute the totals of the receipts:


function (date, values) {
var result = {totalSpent:0, numPurchases:0};
for(var i=0; i<values.length; i++)
 {
switch (typeof values[i]) {
case "number":
result.totalSpent += values[i];
result.numPurchases++;
case "object":
// this in an output object from a previous reduction
result.totalSpent += values[i].totalSpent;
result.numPurchases += values[i].numPurchases;
}
}
return result;
}

Once saved into a design document then you can get the values from January of '08:

GET /db/_view/receipts/amount_spent_month?key=[2008,1]

Result:

{"total_rows":12,
"rows":[
{ "key":[2008,1],
"value": { "totalSpent": 1252.1
"numPurchases": 13}}
]}

Now calculating the final average is left to the client.

However, we can avoid extra combiner logic by simply changing the map function to output the values to look like their final form anyway:


function (doc) {

if (doc.type == "receipt") {

var key = [doc.year, doc.month];
emit(key, {totalSpent: doc.amount, numPurchases:1});
}
}

And the Reduce function:


function (date, values) {
var result = {totalSpent:0, numPurchases:0};
for(var i=0; i<values.length; i++)
 {
result.totalSpent += values[i].totalSpent;
result.numPurchases += values[i].numPurchases;
}
}
return result;
}

This code is a lot simpler than the first example, but doing it this way has the disadvantage that the map output values, which can be very numerous, are now containing extra useless information for the sake of algorithmic simplicity, possibly slowing things down or using extra resources.

Anyway, I think the addition of a separate Combiner function makes things more complicated for the general case, and an separate combiner design can be done in this design anyway, using wrapper functions. So I don't see the point of an explicit combiner function, though I might be convinced otherwise.

Posted February 8, 2008 1:47 PM