Adding functions to extract features in an ML Pipeline

Hi there,

Let me explain my situation.

I have a Type T which represents my object of interest. Each instance of Type T has a few attributes, { a1, a2, … aN, ev }: one of these, say a1, is the label that I am trying to predict, and another one, say ev, is a list of events.

Type T {
a1: double
a2: double
a3: string

ev: [E]
}

Type E {
e1: datetime
e2: string
}

I have a Python function F that takes an instance of Type T and generates a vector of N features using only the values of the attributes {a2, … aN, ev}. In my case I have approximately 1000 features and 100K-1M instances of Type T.

F: T --> R^N
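To make this concrete, here is a minimal sketch of what T, E and F could look like in plain Python. The four features computed below are made up purely for illustration (my real F produces about 1000), and the dataclasses are only stand-ins for the platform types.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class E:
    e1: datetime  # event timestamp
    e2: str       # event description


@dataclass
class T:
    a1: float  # label to predict, never read by F
    a2: float
    a3: str
    ev: List[E] = field(default_factory=list)


def F(t: T) -> List[float]:
    """Map one instance of T to a fixed-length feature vector."""
    times = sorted(e.e1 for e in t.ev)
    # Seconds between the first and last event; 0.0 when there are no events.
    span_seconds = (times[-1] - times[0]).total_seconds() if times else 0.0
    return [
        t.a2,              # numeric attribute used as-is
        float(len(t.a3)),  # illustrative feature: length of the string attribute
        float(len(t.ev)),  # illustrative feature: number of events
        span_seconds,      # illustrative feature: time span covered by the events
    ]
```

The important property is that F only looks at a single instance, so each call is independent of all the others.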

I had a look at some tutorials, both online and from the C3 courses, and if I understood correctly, I could generate a training set matrix X_train (size 100x100K) by applying F in a Jupyter notebook (so in the container) and then pass X_train and y_train to the pipeline (an sklearn pipeline with a RandomForest classifier). I haven’t tried this yet, but I believe it would be painfully slow because all the feature engineering code would be executed in the container.
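For reference, the notebook-side approach I am describing would look roughly like the sketch below. `build_training_set`, `feature_fn` and `label_fn` are hypothetical names I chose for illustration; the point is only that every instance is processed one after another in a single process.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def build_training_set(
    instances: Iterable[Any],
    feature_fn: Callable[[Any], Sequence[float]],
    label_fn: Callable[[Any], float],
) -> Tuple[List[List[float]], List[float]]:
    """Apply feature_fn to each instance serially; one row of X per instance."""
    X: List[List[float]] = []
    y: List[float] = []
    for t in instances:
        X.append(list(feature_fn(t)))  # feature vector for this instance
        y.append(label_fn(t))          # label (a1 in my case)
    return X, y
```

The resulting X and y could then be passed to `fit` on an sklearn `Pipeline` that ends in a `RandomForestClassifier`. With 100K-1M instances, the loop above is the part I expect to be the bottleneck.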

This is what I want to do:

Can I instead pass a handle to the function F to the ML Pipe, so that the feature engineering code (which only acts on instances of Type T) is executed on the cluster, and hopefully parallelized across all the worker nodes? If so, how can I do it? Where should I embed my Python code?

Basically I am trying to build a pipeline like this:

  • (Step 0: Define the pipeline in Jupyter)
  • Step 1: Fetch data (want to be able to specify filters)
  • Step 2: Apply F to generate features (e.g., Create a C3 Dataset)
  • Step 3: Train the model
  • Step 4: Pass back the trained model to Jupyter.

Steps 1, 2 and 3 should take place on the platform.
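The shape of the steps above can be sketched locally as follows. This is not C3 code: `run_pipeline` and all of its parameters are names I made up, and the `concurrent.futures` executor is only a local stand-in for whatever the platform would use to spread the calls to F across worker nodes.

```python
from concurrent.futures import Executor
from typing import Any, Callable, Iterable, List, Sequence


def run_pipeline(
    fetch: Callable[[], Iterable[Any]],
    keep: Callable[[Any], bool],
    feature_fn: Callable[[Any], Sequence[float]],
    label_fn: Callable[[Any], float],
    train: Callable[[List[List[float]], List[float]], Any],
    executor: Executor,
) -> Any:
    # Step 1: fetch the data and apply the filter.
    instances = [t for t in fetch() if keep(t)]
    # Step 2: apply the feature function to every instance. Each call is
    # independent, so the executor may run them concurrently; map() returns
    # the results in the same order as the inputs.
    X = [list(row) for row in executor.map(feature_fn, instances)]
    y = [label_fn(t) for t in instances]
    # Step 3: train the model. Step 4: return it to the caller.
    return train(X, y)
```

Locally, passing a `ProcessPoolExecutor` would spread step 2 over several cores, provided `feature_fn` is a picklable module-level function. What I am looking for is the platform equivalent of that executor.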

Note: My features have nothing to do with time series, so using metrics does not make much sense.

Thanks a lot!