MapReduce performance: how are batches created?



When I start a MapReduce, does C3:

  • load all the objects into memory, then split the resulting collection and send the pieces to the map() workers,
  • or does each map() worker load its own subset of the objects?

I have quite a few includes in my MapReduce spec, so I wouldn’t want the master to load all the objects in a single batch.



If an include is specified, it is applied when initially reading the objects to produce each batch. That fetch is done as a stream, so the whole result set is never materialized in memory anywhere. Each batch of objects is persisted in Cassandra, and those objects are ultimately what is passed into the map function.
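As a rough illustration of the streaming-batch behavior described above, here is a minimal Python sketch. All names here (`fetch_stream`, `run_map`, the batch size) are hypothetical stand-ins, not C3 APIs; the point is only that objects are consumed one at a time and grouped into fixed-size batches, so the full result set never sits in memory at once.

```python
from itertools import islice

def fetch_stream():
    # Hypothetical stand-in for the engine's streaming fetch, with any
    # includes already applied; yields objects one at a time rather than
    # loading the whole collection.
    for i in range(10):
        yield {"id": i}

def batches(stream, batch_size):
    # Group a stream into fixed-size batches without materializing
    # the entire result set in memory.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def run_map(map_fn, batch_size=4):
    results = []
    for batch in batches(fetch_stream(), batch_size):
        # In the real engine each batch would be persisted (e.g. in
        # Cassandra) before being handed to a map() worker; here we
        # simply apply map_fn to each object in the batch.
        results.extend(map_fn(obj) for obj in batch)
    return results

print(run_map(lambda obj: obj["id"]))
```

With a batch size of 4 and ten objects, the stream is consumed as batches of 4, 4, and 2 objects, which mirrors the answer: no step ever holds more than one batch in memory at a time.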