Say I have 100M objects of type Asset. The Asset type has an attribute districtLocation, which is an enum ('district1', 'district2', 'district3').
The distribution of the 100M Assets across the districts is as follows:
- 85M Assets are in 'district1' (85%)
- 10M Assets are in 'district2' (10%)
- 5M Assets are in 'district3' (5%)
I would like to define a MapReduce job that prepares a training dataset by evaluating 300 features (a lot) on 1M Assets, but I really want to ensure that the district proportions are preserved, i.e. the 1M sample should contain:
- ~850k Assets in 'district1' (85%)
- ~100k Assets in 'district2' (10%)
- ~50k Assets in 'district3' (5%)
In that context, I really do not want to evaluate the 300 features on all 100M Assets and only then sample in the map phase of the job. I would like to do the sampling directly in the filter phase (the evalMetrics call on 100M Assets would fail anyway).
For uniform (non-stratified) random sampling, I could proceed as advised in How to fetch a random subset?.
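For reference, that uniform approach amounts to a Bernoulli filter with a single global rate. A minimal Python sketch (the asset argument is just a placeholder for whatever object the filter phase receives):

```python
import random

# Overall sampling rate: 1M out of 100M Assets
SAMPLE_RATE = 1_000_000 / 100_000_000  # = 0.01

def uniform_filter(asset):
    # Keep each Asset independently with probability 1%.
    # This yields ~1M Assets in expectation, but the district
    # proportions only hold approximately, not by construction.
    return random.random() < SAMPLE_RATE
```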
Do you have an idea of how I can perform stratified sampling on the districtLocation attribute under these conditions?
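To make the question concrete, here is the kind of per-district filter I have in mind, sketched in Python. The counts are the ones above; asset.districtLocation stands for the enum attribute. Since every district uses the same 1% overall fraction the rates come out equal here, but deriving them from explicit targets keeps the counts adjustable:

```python
import random

# Target sample size and population per district (from the figures above)
TARGET = {'district1': 850_000, 'district2': 100_000, 'district3': 50_000}
POPULATION = {'district1': 85_000_000, 'district2': 10_000_000,
              'district3': 5_000_000}

# Per-stratum Bernoulli rates; all equal to 0.01 with these numbers
RATES = {d: TARGET[d] / POPULATION[d] for d in TARGET}

def stratified_filter(asset):
    # Keep each Asset with its district's own rate, so each stratum's
    # expected sample size matches its target independently.
    return random.random() < RATES[asset.districtLocation]
```

This gives the right proportions in expectation only; if the counts must be closer to exact, the filter would need to track how many Assets it has already kept per district.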