Random order for MapReduce



Documentation says of the MapReduce order field:

if not specified, order will be random (done by fetching with md5hash and mod)

However, this is not the behavior I see with 7.6: the objects seem to be sorted by id within the batches (for instance, batch 1 contains objects with ids from 00… to 09…, batch 2 from 10… to 19…).
This makes the batches pathological, as similar objects are grouped together.

Is it possible to enforce a true random ordering with a MapReduce?




Actually the documentation is incorrect here. The md5/% “randomization” approach is quite expensive and we don’t want that to be the default behavior (which it would be if simply not specifying an order for the job triggered it). So by default we do want to order by “id”, which we currently do.
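To make the difference between the two orderings concrete, here is a minimal Python sketch. It is not the platform's implementation: the helper names (`md5_bucket`, `batches_by_id`, `batches_by_md5`) and the zero-padded ids are made up for illustration, and it only assumes that "md5 and mod" means hashing each id with MD5 and taking the remainder modulo the number of batches.

```python
import hashlib


def md5_bucket(obj_id: str, num_batches: int) -> int:
    """Map an id to a batch index via MD5 and modulo (hypothetical helper)."""
    digest = hashlib.md5(obj_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_batches


def batches_by_id(ids, num_batches):
    """Default-style batching: sort by id, then cut into contiguous slices."""
    ordered = sorted(ids)
    size = -(-len(ordered) // num_batches)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def batches_by_md5(ids, num_batches):
    """Hash-style batching: each id lands in the bucket its MD5 hash selects."""
    batches = [[] for _ in range(num_batches)]
    for obj_id in ids:
        batches[md5_bucket(obj_id, num_batches)].append(obj_id)
    return batches


# Illustrative ids only; real ids would come from the type being processed.
ids = [f"{n:04d}" for n in range(1000)]
id_batches = batches_by_id(ids, 10)
md5_batches = batches_by_md5(ids, 10)

# Contiguous id ranges per batch versus ids scattered across batches.
print(id_batches[0][:3], id_batches[0][-1])
print(sorted(md5_batches[0])[:3], [len(b) for b in md5_batches])
```

Note the extra cost in the hashed variant: every id has to be hashed before it can be assigned, whereas the id-ordered variant only needs a sorted scan. Bucket sizes under hashing are also only approximately equal.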

The things that trigger the randomization are specifying either numBatches or samplePct. Can you try that?

I’ve created the ticket PLAT-11774 (Fix documentation surrounding behavior of “random”) to address the documentation issue.



samplePct: 100 seems to achieve the expected result, but is it guaranteed that 100% of the objects will be processed in that case?



No, it isn’t guaranteed. If the data doesn’t distribute evenly across the md5/mod, then some will be missed. Keep in mind that the whole random feature, due to its inherent performance penalty over non-random, was really only intended to allow the creation of a MapReduce job that would give the desired results when processing a representative sample of the overall data, and specifically “not” for processing the entire data set, randomly or otherwise. I would say that a well-formed MapReduce job, if it is intended to process all of the data, should behave fine independent of the order of the entries received, even if they are ordered by id.



The (well-formed :wink: ) MapReduce job behaves fine, but randomness helps to “even out” the batches.

If the type is, say, customers, then obviously, over time, more subscriptions will have been canceled among the earliest customers. So after a while, the first batches have 5% active users while the latest batches have 95% active users. It “works”, but the batches are completely unbalanced: some are very quick to process and some run for a long time.
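The imbalance described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the customer ids, the linear 5% to 95% activity trend, the fixed seed, and the `md5_bucket` helper are all invented for illustration and do not reflect real data or the platform's code.

```python
import hashlib
import random

NUM_CUSTOMERS = 10_000
NUM_BATCHES = 10


def md5_bucket(obj_id: str, num_batches: int) -> int:
    """Map an id to a batch index via MD5 and modulo (hypothetical helper)."""
    return int(hashlib.md5(obj_id.encode("utf-8")).hexdigest(), 16) % num_batches


# Assumption: older customers (low ids) are mostly inactive, newer ones mostly active.
rng = random.Random(0)
customers = []
for n in range(NUM_CUSTOMERS):
    p_active = 0.05 + 0.90 * n / (NUM_CUSTOMERS - 1)
    customers.append((f"{n:05d}", rng.random() < p_active))


def active_fraction(batch):
    """Share of customers in a batch that still need real processing."""
    return sum(active for _, active in batch) / len(batch)


# Batches cut from the id-sorted list (ids are zero-padded, so already in order).
size = NUM_CUSTOMERS // NUM_BATCHES
id_batches = [customers[i:i + size] for i in range(0, NUM_CUSTOMERS, size)]

# Batches assigned by hashing each id.
md5_batches = [[] for _ in range(NUM_BATCHES)]
for customer in customers:
    md5_batches[md5_bucket(customer[0], NUM_BATCHES)].append(customer)

id_fractions = [active_fraction(b) for b in id_batches]
md5_fractions = [active_fraction(b) for b in md5_batches]
id_spread = max(id_fractions) - min(id_fractions)
md5_spread = max(md5_fractions) - min(md5_fractions)

print("id-ordered active share per batch:", [round(f, 2) for f in id_fractions])
print("md5-hashed active share per batch:", [round(f, 2) for f in md5_fractions])
```

In this toy setup the id-ordered batches range from mostly inactive to mostly active, while the hashed batches all sit near the overall average, which is the evening-out effect being asked for.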

Among the (bearable) side effects of high variance across batches, data loads stall more often because of the C3 design (see Integration starved by MapReduce job despite priority). One recommendation in that thread was to reduce the size of the batches, but efforts in this direction are hampered by the highly unbalanced batches.

However, if samplePct: 100 does not achieve the expected result, I guess we’ll have to live with it.



I see your point. For the case you mention, what are you doing in these jobs with the customers whose subscriptions have been cancelled? If you are simply skipping them, then that should be accomplished using a filter on the job itself, which would obviously be preferable to queuing up batches containing entries that don’t require processing. From what we have seen thus far, jobs either need to process everything or have an obvious filter (e.g. > some timestamp or based on some status value(s)).

Anyway, if there is a legitimate need, we can obviously change this. We just want to make sure that anybody who goes down that path understands that the randomizing process is extremely slow on large datasets and may possibly tax the system more than not doing it. Why don’t you add a ticket and add your use case, indicating why randomization of the entire set is the only solution (e.g. you can’t achieve decent throughput/performance without it) and we’ll be happy to consider it.

I think we can at least safely ensure that if you specify 100%, it gets them all. That would only impact the cases that need that behavior.



@lerela In master I changed it so that if you specify samplePct = 100, then you are guaranteed to get all entries. Hope this helps.



Awesome. Sorry I didn’t reply earlier. Filters don’t work in all circumstances, because I had simplified the example. For instance, we might not have the “cancelled subscription” info, so we drop customers for which some evalMetrics return 0. We’ve seen that some batches were almost full of those customers (and therefore almost empty after they’re dropped) while others had none.
Your change will definitely help with this.