MapReduce - multi-stage reduce capability?


For a map reduce job that I’ve written, I’m curious what limits may exist on the reduce phase.

After the map phase, the intermediate values (the array of values with the same outkey) are passed to individual invocations of reduce, per outkey.

Are there any practical limitations to the size of the array of intermediate values? Or, does the platform have some capability to do some multi-stage reduce so that one reducer doesn’t get all of the data at once?




The reduce function takes an array of objects so it’s really only limited by memory, though we don’t have any protection in place. I believe the way we would address map results that are larger than could be handled that way would be to implement a reduce variant that takes a stream of objects instead.

1 Like