BatchJob Efficiency Best Practice

#1

I have a BatchJob where, in doStart, I fetch all of my record ids with limit -1 and schedule batches from them; each job then fetches the record in its entirety and performs some operation. However, I am noticing performance issues; I’m assuming this is because it performs a fetch for every individual record.

I’m curious what the best practice is here…

  • Do I group the ids into batches of ~500, scheduleBatch each of those, and then do one fetch for those 500 in each job?
  • Do I load as many full records into memory as I can and scheduleBatch groups of those, so the jobs don’t need to fetch at all?
  • Can fetchObjStream help somehow?

#2

Do you actually need the id values, or just a way to split up your batches? Could you do a fetchCount and then schedule batches with an offset and limit that are queried in processBatch itself?


#3

Can you use fetchObjStream instead to avoid loading all of the initial objects in memory?

Basically, call fetchObjStream in doStart, push each result onto a batch array, and call scheduleBatch whenever the array reaches your batch size (say, 500) or the stream is exhausted.
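Rough, untested sketch (it assumes fetchObjStream accepts a fetch spec with an include of just the id, and that whatever spec you pass to scheduleBatch is available on batch in processBatch):

function doStart(job, options) {
    var batchSize = options.batchSize || 500;
    var ids = [];

    // Stream the ids instead of fetching everything with limit -1
    // (assumes an include spec of just 'id' is accepted here)
    MyType.fetchObjStream({ include: 'id' }).each(function (obj) {
        ids.push(obj.id);
        if (ids.length === batchSize) {
            // assumes the batch spec is passed through to processBatch as-is
            MyBatchJob.scheduleBatch(job, { ids: ids });
            ids = [];
        }
    });

    // Schedule whatever is left once the stream is exhausted
    if (ids.length > 0) {
        MyBatchJob.scheduleBatch(job, { ids: ids });
    }
}

Each processBatch then does a single fetch for its ~500 ids instead of one fetch per record.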


#4

Basically, at the end of the day, I just need to perform an individual operation for each of my 100,000+ records.


#5

Could MapReduce be a good alternative? It takes care of all of the fetch + scheduleBatch calls for you, so you only need to worry about implementing map.


#6

If it knows how to optimize the fetch calls and the batch sizes, then probably!


#7

I would run it like so:

function doStart(job, options) {
    var count = MyType.fetchCount();
    var numBatches = Math.ceil(count / options.batchSize);

    for(var i = 0; i < numBatches; i++) {
        MyBatchJob.scheduleBatch(job, {
            offset: i * options.batchSize,
            limit: options.batchSize
        });
    }
}

function processBatch(batch, job, options) {
    // One fetch per batch, then perform the per-record operation on the results
    var records = MyType.fetch({ offset: batch.offset, limit: batch.limit });
    // ... do the actual work on records here ...
}

#8

MapReduce will definitely let you write the least code possible. It uses fetchObjStream internally, so it essentially does what I described in my first post, scheduling batches efficiently for you. It also lets you specify an include spec whose fields are loaded automatically with each object, if you’d like.

type MyMapReduce extends MapReduce<MyType, string, int, int> type key "MY_MAPR" {
  batchSize: ~ = 500
  include: ~ = 'id, field1, field2' // omit this if you want to fetch in `map`
  map: ~ js server
}
MyMapReduce.make().create().start(); // Using default parameters
MyMapReduce.make({ batchSize: 1000 }).create().start(); // Using custom batchSize
MyMapReduce.make({ filter: "id == 'abc'" }).create().start(); // Using custom filter

If you already specify all the fields in include:

function map(batch, objs, job) {
  objs.each(doTheThing);
}

If you want to fetch in the map step:

function map(batch, objs, job) {
  MyType.fetchObjStream({ ids: objs.at('id') }).each(doTheThing);
}

Also consider what should happen when a batch fails. When you retry a batch, do you want to re-process the same objects, or whatever objects now happen to sit at offset x? New objects may be inserted into the database at any point in time, so scheduling batch jobs with just an offset and a limit might not do what you intend.
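If that matters, scheduling each batch with an explicit id list (like the fetchObjStream approach earlier in the thread) keeps retries deterministic, because the batch carries the exact records it owns rather than a position in the table. A rough sketch, assuming a { ids: [...] } batch spec was passed to scheduleBatch:

function processBatch(batch, job, options) {
    // batch.ids was captured in doStart (assumed { ids: [...] } spec), so a retry
    // re-processes exactly the same records even if new ones were inserted since then
    MyType.fetchObjStream({ ids: batch.ids }).each(doTheThing);
}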
