Recommendations on Map-Reduce implementation


I’ve a map-reduce job that does some calculation for each element and stores the result in a type; no aggregation is needed.
I wonder which of the following implementations is more efficient?

  1. have only map() stage that does calculation then stores result.
  2. have a map() stage that does calculation, output result, then a reduce() stage that does storage.

So basically, is it better to combine everything in the map stage, or will adding a reduce phase be better?
Also, is it possible to have the output of the map stage stored directly into a table (as MapReduce jobs are tied to an underlying table)?



I would write only a map function if no aggregation is needed.
Between the map and reduce phases, key/value pairs are written to Cassandra.

If you put the storage part in the reduce phase, you incur useless I/O cost (the intermediate key/value writes) that could be avoided by storing directly in the map phase.

Moreover, I think the reduce tasks can only start once all the map tasks are done, so you would have to wait for every map task to finish before starting to store elements into your type.
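To illustrate, here is a minimal sketch of a map-only job that persists its own results, so nothing is emitted for a reducer. `c3Type` and `doProcess` are stubbed here purely for illustration (in a real job the platform provides the type accessor, and `doProcess` stands in for your per-element calculation):

```javascript
// Stand-in for the platform type accessor; records what gets persisted.
var stored = [];
function c3Type(typeName) {
  return {
    createBatch: function (objs) { stored = stored.concat(objs); }
  };
}

// Hypothetical per-element calculation (placeholder logic).
function doProcess(objs) {
  return objs.map(function (o) { return { id: o.id, value: o.value * 2 }; });
}

// Map-only job: compute results for this slice of objects and store them
// directly from the map phase -- no reduce phase, no intermediate writes.
function map(batch, objs, job, subBatch) {
  var srcType = c3Type(job.targetType.typeName);
  var results = doProcess(objs);
  srcType.createBatch(results);
}

// Example invocation with two objects.
map(0,
    [{ id: 'a', value: 1 }, { id: 'b', value: 2 }],
    { targetType: { typeName: 'MyResultType' } },
    0);
```

Each map task handles its own slice of the input independently, so storage happens as soon as that slice is processed rather than after all map tasks complete.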



Oh wow, I didn’t know all these details!
The documentation [1] on MapReduce does not mention any of this; it should be detailed further.

[1] https://[domain]/api/1/[tag]/[tenant]/documentation/type/MapReduce



@bachr can you share how you implemented your map function to write directly back to the type?



Nothing fancy, you just need to store your result with create or createBatch (or merge, depending on your use case):

map: ~ js server
function map(batch, objs, job, subBatch) {
  var srcTypeRef = job.targetType;
  var srcType    = c3Type(srcTypeRef.typeName);
  // process objs
  var results = doProcess(objs);
  // store directly from the map phase; no reduce phase needed
  srcType.createBatch(results);
}