Batch Job failing on scheduleBatches

Hi,

I have created a batch job by extending BatchJob (some code below). The idea is simple: in doStart, load some data from S3 and then, for each batch, train an MLPipeline on the same data but on a different target (one target per batch).

Unfortunately, the job keeps failing in the doStart function when I execute scheduleBatches. I didn’t get any error message, and in the console the job stays in the “initial” status, so I thought I would post a message here.

Below is how I invoke the job:

var datasetId = "c165f858-ff83-4ac8-a8a1-45d0c9c4865e" // TOY DATASET
var pipelineId = "faultLocalizationPipeline_NotBalanced"

var mybatchjob = ComponentGroupTrainingBatchJob.make({id: 'bjtoy'}).upsert()

var batchjobspec = ComponentGroupTrainingBatchJobSpec.make({
    pipelineId : pipelineId,
    datasetId: datasetId,
    targets: [2,6],
    batchJobName: "noScadaUnbalanced",
    maxConcurrencyPerNode: 1})

ComponentGroupTrainingBatchJob.doStart( mybatchjob, batchjobspec )

datasetId is the id of an instance of the StoredDataset Type that I defined, and pipelineId is the id of an MLPipeline already upserted in the platform. The JavaScript code above executes correctly, but the status of the job remains “initial” forever.


In Splunk, however, I see that the job failed while executing the scheduleBatches line in the doStart function. I don’t get any detail on the error; I can only see the logging message I put right before that line, and I can see that the job never executed processBatch().

I tried executing every line of doStart from the console and everything executes correctly except scheduleBatches, but that error may not be related to what happens when I actually launch the whole job.

Do you have any idea what’s going on?

Thanks!
A

entity type ComponentGroupTrainingBatchJob extends BatchJob<ComponentGroupTrainingBatchJob, ComponentGroupTrainingBatchJobSpec, ComponentGroupTrainingBatchJobBatch> type key 'MTBJ' {
  doStart      : ~ js server
  processBatch : ~ js server
  allComplete  : ~ js server
}

// ComponentGroupTrainingBatchJob.js (Partial - only doStart)
function doStart(job, options) {

  var sd = StoredDataset.get( options.datasetId );
  var ds = sd.toDataset();
  logger.info(log_tenant_tag + 'Dataset loaded');
  var batches = ComponentGroupTrainingBatchJobBatch.array();
  _.each( options.targets, function(x){
    batches.push( ComponentGroupTrainingBatchJobBatch.make( { target : x, dataset : ds } ));
  });

  logger.info(log_tenant_tag + 'Batches Created');
  job.scheduleBatches( batches ); // FAILS HERE!!!!
}

type ComponentGroupTrainingBatchJobBatch {
    target : int
    dataset : Dataset
}


type ComponentGroupTrainingBatchJobSpec extends BatchJobOptions {

    /**
    * ID of the ML Pipeline to be trained.
    */
    pipelineId: !string 

    /**
    * ID of the StoredDataset
    */
    datasetId: !string 

    /**
    * List of target classes. Possible values are defined in the FaultLocalizationLabelEncoder Enum Type
    */
    targets: ![integer] schema suffix 'TARGETLIST'

    /**
    * NAME OF THE JOB
     */
    batchJobName: !string 
}

… adding some info from Splunk
I get the Batches Created message, then the function fails on the scheduleBatches action.

@alessandro.perina I see this error in logs

2019-06-11T14:16:53.188Z level=WARN thread="Hannibal-16" logger=c.s.i.Task a_id="8530.10520744" a_rid="8530.10520744" 
action Action [8530.10520744, Failed, Target [ItInPreMan/a464783/ComponentGroupTrainingBatchJob?action=doStart]] failed
c3.love.exceptions.C3RuntimeException: Job run ComponentGroupTrainingBatchJob:BatchJobId: does not exist

It looks like the id passed refers to a batch job that does not exist, either in memory or persisted.

Thanks a lot Marco,

If you look at my JS code, I am actually upserting the job after I create it. Maybe my syntax is wrong? Then the doStart function executes. How can the job be missing only when doProcess is called?

Mmm… I will have a deeper look tomorrow

Try calling
ComponentGroupTrainingBatchJob.get('bjtoy').start(batchjobspec)
instead of doStart…
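
One possible reading of the "Job run ... does not exist" error, sketched as a toy model in plain JavaScript. This is only an assumption, since the thread does not show the platform internals: the sketch supposes that start() registers a run record before delegating to doStart, and that scheduleBatches needs that record. ToyBatchJob, runs, and everything else below are hypothetical names for illustration, not the platform's BatchJob API.

```javascript
// Toy model only: ToyBatchJob and the runs registry are hypothetical,
// not the platform's BatchJob implementation.
var runs = {}; // job id -> run record, standing in for persisted "job run" state

function ToyBatchJob(id) {
  this.id = id;
}

// Assumed behaviour: start() registers a run before delegating to doStart.
ToyBatchJob.prototype.start = function (spec) {
  runs[this.id] = { status: 'running', scheduled: [] };
  return ToyBatchJob.doStart(this, spec);
};

// Assumed behaviour: scheduling batches needs an existing run to attach them to.
ToyBatchJob.prototype.scheduleBatches = function (batches) {
  var run = runs[this.id];
  if (!run) {
    throw new Error('Job run ' + this.id + ' does not exist');
  }
  run.scheduled = run.scheduled.concat(batches);
  return run.scheduled.length;
};

// Same shape as the doStart in the question: one batch per target.
ToyBatchJob.doStart = function (job, spec) {
  var batches = spec.targets.map(function (t) {
    return { target: t };
  });
  return job.scheduleBatches(batches);
};

var spec = { targets: [2, 6] };

// Calling doStart directly: no run was registered, so scheduling fails.
var directError = null;
try {
  ToyBatchJob.doStart(new ToyBatchJob('direct'), spec);
} catch (e) {
  directError = e.message;
}

// Going through start(): the run exists by the time batches are scheduled.
var scheduledCount = new ToyBatchJob('viaStart').start(spec);

console.log(directError);
console.log(scheduledCount);
```

In this toy model the direct doStart call throws at scheduleBatches, while the start() path schedules both batches. Whether the real platform works this way is not confirmed anywhere in this thread.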

Hi,

I did what you suggested (ComponentGroupTrainingBatchJob.get('bjtoy').start(batchjobspec)), and the job is now running.

However, I cannot find my process in the actionDump, and it has been pending in the invalidationQueue for a couple of hours now…

No error or logging messages on Splunk

Should I kill the job? Is it stuck?

@alessandro.perina we are checking the workers’ status. Do not kill the process for now; we will keep you posted.

Marco, any news here? Can I kill the job?

I had to kill the job :slight_smile: - Will try a different approach

@marcosordi
I am trying again with my code. I realized that the problem was that I was running the batch job using doStart, while using ComponentGroupTrainingBatchJob.get('bjtoy').start(batchjobspec) worked… Do you have any idea why?

A

Hi @alessandro.perina, here is the reason. If something is not clear, let us know.

Thanks. I did not understand the explanation provided :expressionless:
In my case, I tried to do exactly as shown in the slides we received during the course…
Anyhow for now problem solved :), so thanks a lot for your help!