Simple stratified sampling during the filter phase of a MapReduce job


Let's say I have 100M objects of type Asset. The Asset type has an attribute districtLocation, which is an enum ('district1', 'district2', 'district3').

The distribution of the 100M Assets across the districts is the following:

  • 85M Assets are in 'district1' (85%)
  • 10M Assets are in 'district2' (10%)
  • 5M Assets are in 'district3' (5%)

I would like to define a MapReduce job which prepares a training dataset by evaluating 300 features (= a lot) on 1M Assets, but I really want to ensure that the proportions of each district are respected, that is to say I want to have in this 1M sample:

  • ~850k Assets in 'district1' (85%)
  • ~100k Assets in 'district2' (10%)
  • ~50k Assets in 'district3' (5%)

In that context, I really do not want to evaluate the 300 features on all 100M Assets and then do the sampling in the map phase of the job. I would like to do the sampling directly in the filter phase (running evalMetrics on 100M Assets will fail anyway).

For a random, uniform (non-stratified) sample, I could proceed as advised in How to fetch a random subset?.

Do you have an idea of how I can perform stratified sampling on the districtLocation attribute under these conditions?


Does a filter like this one work for you?

var f1 = "(districtLocation=='district1' && md5HashKey(id) % 850 == 0)";
var f2 = "(districtLocation=='district2' && md5HashKey(id) % 100 == 0)";
var f3 = "(districtLocation=='district3' && md5HashKey(id) % 50 == 0)";
var filterStr = [f1, f2, f3].join(" || ");
Asset.fetch({filter: filterStr})

Which you can generalize with something like:

var arr = [
  {loc: 'district1', per: 0.85},
  {loc: 'district2', per: 0.1},
  {loc: 'district3', per: 0.05}
];
var filterStr = arr.map(function (elm) {
  return "(districtLocation=='" + elm.loc + "' && md5HashKey(id) % " + Math.floor(1000 * elm.per) + " == 0)";
}).join(" || ");
Asset.fetch({filter: filterStr})
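As a side note, this hash-mod trick only gives the right proportions if the hash spreads the ids uniformly. A standalone sketch (using FNV-1a as a stand-in, since md5HashKey is platform-provided and assumed here to behave like a uniform hash) shows that keeping ids with hash % d == 0 retains roughly 1/d of them:

```javascript
// Stand-in for md5HashKey: 32-bit FNV-1a over the id string.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Sample ids whose hash is divisible by 100 from 100k synthetic ids:
// we expect to keep roughly 1% of them.
const n = 100000;
let kept = 0;
for (let i = 0; i < n; i++) {
  if (fnv1a('asset-' + i) % 100 === 0) kept++;
}
console.log(kept); // roughly n / 100
```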


Thank you very much @bachr :slight_smile: . I think your idea totally works, yes! The numbers do not seem right though (the query you suggested would generate a sample of ~300k elements, with ~100k elements from each district), but that does not matter, I just have to adapt the divisors to get the right proportions!
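For the record, here is the arithmetic behind that ~300k figure (population counts taken from the question):

```javascript
// Each of the original divisors (850, 100, 50) happens to keep ~100k
// Assets from its district, so the three strata add up to ~300k, not 1M.
const populations = { district1: 85000000, district2: 10000000, district3: 5000000 };
const divisors    = { district1: 850,      district2: 100,      district3: 50 };
let total = 0;
for (const d in populations) {
  const expected = populations[d] / divisors[d];
  console.log(d, expected); // 100000 for every district
  total += expected;
}
console.log(total); // 300000
```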

I would do, following your example:

var arr = [
  {loc: 'district1', sample_one_over: 85000000 / 850000},
  {loc: 'district2', sample_one_over: 10000000 / 100000},
  {loc: 'district3', sample_one_over: 5000000 / 50000}
];
var filterStr = arr.map(function (elm) {
  return "(districtLocation=='" + elm.loc + "' && md5HashKey(id) % " + Math.floor(elm.sample_one_over) + " == 0)";
}).join(" || ");
Asset.fetch({filter: filterStr})
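A quick check of those divisors (population counts again assumed from the question): since the overall sampling rate is 1M / 100M = 1%, every stratum ends up with the same divisor of 100, and the expected sample sizes match the targets:

```javascript
const strata = [
  { loc: 'district1', population: 85000000, target: 850000 },
  { loc: 'district2', population: 10000000, target: 100000 },
  { loc: 'district3', population: 5000000,  target: 50000 }
];
for (const s of strata) {
  const divisor = Math.floor(s.population / s.target); // 100 for every stratum here
  console.log(s.loc, divisor, s.population / divisor); // expected size equals the target
}
```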

Thank you very much again ! :slight_smile: