How to fetch a random subset?

#1

Let’s say I have 1M assets and I want to get a random sample of 1000 assets. How can I do this in fetch?

#2

If you know the ids of those 1000 assets, you can store them in an array and use intersects to fetch them. Something like this:

assetIds = ["123", "234", …]

c3Grid(Asset.fetch({filter: Filter.intersects('id', assetIds)}))
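If you don't have the 1000 ids picked yet but can load the full id list on the client, one way to choose them is a partial Fisher-Yates shuffle. This is only a sketch in plain JavaScript: `sampleIds` and the generated ids are made up for illustration and are not part of any platform API.

```javascript
// Hypothetical helper: pick k distinct ids uniformly at random from an
// in-memory array using a partial Fisher-Yates shuffle.
function sampleIds(allIds, k) {
  const ids = allIds.slice(); // copy so the caller's array is left untouched
  const n = Math.min(k, ids.length);
  for (let i = 0; i < n; i++) {
    // Pick a random position in the not-yet-chosen tail [i, ids.length)
    const j = i + Math.floor(Math.random() * (ids.length - i));
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids.slice(0, n);
}

// Made-up ids "0" .. "9999", standing in for the real asset ids
const allIds = Array.from({ length: 10000 }, (_, i) => String(i));
const assetIds = sampleIds(allIds, 1000);
console.log(assetIds.length);
```

Unlike a hash-based filter, this gives exactly the number of ids you ask for, but it needs the whole id list in memory first.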

#3

You can use an MD5 hash function in your filter to get a random subset of roughly the size you want.

Asset.fetch({filter:"md5HashKey(id) % 1000 == 0"})

MD5 is not perfectly uniform, but it is sufficiently random for most applications.

Notes

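  1. The 1000 in the filter is the sampling ratio: about 1 in 1000 assets is selected, so 1M assets yield roughly 1000.
  2. You can change the remainder (any value from 0 to 999) to get a different subset.
  3. This method also has the advantage of being repeatable: the same filter always returns the same subset, so the remainder acts as a seed.
  4. You are not guaranteed to get exactly 1000 assets, but the count will be very close.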
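To put a number on "very close": if you assume the hash spreads ids evenly over the buckets, the subset size behaves like a binomial count. The sketch below is plain arithmetic in JavaScript; `expectedSampleSize` and `modulusFor` are hypothetical helper names, not platform functions.

```javascript
// Estimate the size of a "hash % modulus == remainder" subset,
// ASSUMING the hash spreads ids evenly across the buckets.
function expectedSampleSize(total, modulus) {
  const mean = total / modulus;                      // expected number of matches
  const stdDev = Math.sqrt(mean * (1 - 1 / modulus)); // binomial standard deviation
  return { mean, stdDev };
}

// Pick the modulus that gets closest to a target subset size.
function modulusFor(total, target) {
  return Math.max(1, Math.round(total / target));
}

const est = expectedSampleSize(1000000, 1000);
console.log(est.mean, est.stdDev.toFixed(1));
console.log(modulusFor(1000000, 1000));
```

For 1M assets and a modulus of 1000, the expected count is 1000 with a standard deviation of about 32, so a result somewhere in the 900 to 1100 range is what you should expect.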
#4

Additionally, if you already have a source and want to know which bucket it falls into, you can compute the same MD5 hash result in JavaScript as follows:
parseInt(MD5.sumString(source.id).slice(0,7), 16)
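If you want to try the bucket logic outside the platform, here is a rough Node.js equivalent. It assumes `MD5.sumString` returns the lowercase hex digest and that the filter's hash uses the same first seven hex digits; neither is confirmed here. `hashPrefix` and `bucketOf` are hypothetical names.

```javascript
const crypto = require("crypto");

// Integer value of the first 7 hex digits (28 bits) of the MD5 digest.
// ASSUMPTION: this mirrors parseInt(MD5.sumString(id).slice(0, 7), 16).
function hashPrefix(id) {
  const hex = crypto.createHash("md5").update(String(id)).digest("hex");
  return parseInt(hex.slice(0, 7), 16);
}

// Bucket an id falls into for a given modulus (1000 for a 1:1000 sample).
function bucketOf(id, modulus) {
  return hashPrefix(id) % modulus;
}

// Made-up ids, used only to see how the buckets behave
const ids = Array.from({ length: 100000 }, (_, i) => "asset-" + i);
const subset = ids.filter((id) => bucketOf(id, 1000) === 0);
const other = ids.filter((id) => bucketOf(id, 1000) === 1);
console.log(subset.length, other.length);
```

A source is in the sample from the filter above when its bucket for modulus 1000 is 0. Changing the remainder selects a disjoint subset, since each id lands in exactly one bucket.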