Best way to retrieve large amounts of data

Hi,
I have to train several different machine learning models directly on the C3 platform.

Context:

  • There is a main asset;
  • several useCases are linked to the asset;
  • each useCase has its own machine learning model;
  • the training procedure is written in Python;
  • it is currently run in a Jupyter notebook;
  • I use mapReduce jobs to retrieve the data;
  • the retrieved data are stored in S3 buckets (loaded back roughly as in the sketch after this list);
  • there are about 20 features for each useCase;
  • the data have SECOND time granularity over a 1-month window.
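
For reference, this is roughly how the per-day S3 extracts are stitched back into one DataFrame today. The bucket name, date range, and Parquet format are placeholders, not the real values:

```python
# Minimal sketch of re-assembling the per-day extracts written by the mapReduce jobs.
# Bucket, prefix, dates, and the Parquet format are placeholders / assumptions.
import pandas as pd

BUCKET = "s3://my-bucket/extracts"            # hypothetical bucket/prefix
DAYS = pd.date_range("2023-01-01", "2023-01-31", freq="D")  # one month of daily jobs

frames = []
for day in DAYS:
    # one file per day, written by the corresponding mapReduce job
    df = pd.read_parquet(f"{BUCKET}/{day:%Y-%m-%d}.parquet")  # needs s3fs installed
    frames.append(df)

month_df = pd.concat(frames, ignore_index=True)
```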

Problem:

  • When the mapReduce jobs (one for each day of requested data) are run from Jupyter, some of the returned datasets come back with the wrong number of columns (see the check sketched after this list);
  • this means a lot of time is spent checking the datasets by hand and re-running the jobs that produced bad output.
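
The check itself is simple; the cost is doing it manually every time. A minimal sketch of the kind of validation I run today (the expected column count and the file format are assumptions):

```python
# Hedged sketch of the per-day sanity check: flag extracts whose column count
# does not match the expected ~20 features, so only those days are re-run.
import pandas as pd

EXPECTED_COLUMNS = 20  # assumption: ~20 features per useCase

def days_to_rerun(paths):
    """Return the paths of per-day extracts with an unexpected number of columns."""
    bad = []
    for path in paths:
        df = pd.read_parquet(path)          # Parquet format is an assumption
        if df.shape[1] != EXPECTED_COLUMNS:
            bad.append(path)
    return bad
```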

My idea:

  • in useCase.c3typ, declare trainModel as a member function: trainModel: member function(timeRange : TimeRange, inputDataset : Dataset) : any (the return type is not fixed yet);
  • in useCase.py, implement trainModel(this, timeRange, inputDataset);
  • do the training work I need there (rough sketch after this list);
  • upsert the trained pipelines (please note that I CANNOT use a customPipeline for this project).
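
To make the idea concrete, a minimal sketch of what I have in mind for useCase.py. Everything below is an assumption about how I would wire it, not code I already have working; in particular I am not sure the Dataset-to-pandas conversion works this way, which is exactly my first question:

```python
# Hedged sketch of useCase.py, assuming the server-side Python runtime exposes
# the global `c3` object and passes the receiving useCase instance as `this`.
def trainModel(this, timeRange, inputDataset):
    # 1. Convert the incoming Dataset to a pandas DataFrame (assumed API / signature).
    df = c3.Dataset.toPandas(inputDataset)

    # 2. Fit the model for this useCase; the sklearn pipeline and the "target"
    #    column are purely illustrative.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression

    features = df.drop(columns=["target"])   # "target" column is hypothetical
    target = df["target"]
    pipeline = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
    pipeline.fit(features, target)

    # 3. Upsert / persist the trained pipeline for this useCase; the exact
    #    persistence call is left open, since customPipeline is not an option here.
    return pipeline
```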

My questions:

  • Can I pass a Dataset into the trainModel function and use the c3.Dataset.toPandas(inputDataset) method to convert it into a pandas DataFrame?
  • How can I use mapReduce inside my .py implementation to retrieve one month of data for each useCase?
  • What is the most performant way to retrieve this data?

Thank you

G.