Upsert Other non-sklearn Model Objects


Can you upsert model objects from packages other than sklearn, such as statsmodels, glmnet-python, or h2o, in the Data Science Notebook? If not, is there interest in developing that functionality?

In some situations where the data are inherently high-variance, a lower-variance generalized linear model could outperform highly flexible models in production. The full glmnet implementation is not yet available in scikit-learn, although there is an open issue discussing adding it.



@AlexXuConEd We recently needed to use the statsmodels module (specifically statsmodels.tsa), and were able to add that by defining the right runtime in package.json as follows:

  "name": "statsmodels",
  "description": "Statistical Model ARIMA",
  "author": "Adrien Bos",
  "dependencies": ["standardDependencies"],
  "runtimes": {
    "py-statsmodels": {
      "language": "Python",
      "runtime": "CPython",
      "modules": {
        "conda.numpy": "=1.12",
        "conda.statsmodels": "=0.9.0",
        "conda.dill": "=0.2.8",
        "conda.pandas": "=0.20.3"
      },
      "repositories": [

We also defined the necessary types and methods. For instance, you can define the pipe type as follows, with the corresponding methods and any new fields (if applicable):

entity type StatsModelsTsaPipe extends MLLeafPipe<Dataset, Dataset> mixes PythonMLHelper type key 'SMT' {

  train: ~

  process: ~

  /**
   * Overrides the field in {@link MLLeafPipe}, giving a more specific type.
   */
  technique: !StatsModelsTsaTechnique
}

Then you need to define the implementations of train and process, plus any helper functions you need, in the implementation file for the type.
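As a rough illustration of that last step, a train/process pair might look like the sketch below. This is a minimal sketch, not the actual platform implementation: the function signatures, the `MeanForecaster` stand-in model, and the JSON-serialized state are all assumptions chosen so the example runs with the standard library alone. In a real pipe the signatures are dictated by the type definition, and the stand-in would be replaced by a model from statsmodels or another package.

```python
import json
from statistics import fmean


class MeanForecaster:
    """Hypothetical stand-in for a third-party model (e.g. one from statsmodels)."""

    def fit(self, series):
        # "Training" here is just remembering the mean of the series.
        self.mean_ = fmean(series)
        return self

    def forecast(self, steps):
        # Predict the stored mean for every future step.
        return [self.mean_] * steps


def train(series):
    """Fit the model and return its fitted state in a storable form."""
    model = MeanForecaster().fit(series)
    return json.dumps({"mean": model.mean_})


def process(trained_state, steps):
    """Rebuild the fitted model from stored state and produce a forecast."""
    model = MeanForecaster()
    model.mean_ = json.loads(trained_state)["mean"]
    return model.forecast(steps)


state = train([1.0, 2.0, 3.0, 6.0])
forecast = process(state, steps=2)
```

The point of the split is that train produces something that can be persisted, and process only needs that persisted state plus new input.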

I am linking @adrienbos who worked on this for further questions you might have. Hope this helps.



Thank you for your informative response.

Can we generalize the approach described above so that other non-sklearn packages fit into the pipe type? Would we just have to define the implementations of train and process?



That is correct. MLPipe is a generic solution for using any library as a step in an ML pipeline.
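One way to picture this generality is a thin adapter that maps each library's own fitting and prediction calls onto the same train/process pair. The sketch below is illustrative only: `LibraryAdapter`, its attributes, and the toy `fit_scale` and `apply_scale` functions are hypothetical names, not part of MLPipe or of any library.

```python
class LibraryAdapter:
    """Hypothetical wrapper exposing any fit/predict pair as train/process."""

    def __init__(self, fit_fn, predict_fn):
        self.fit_fn = fit_fn          # library-specific fitting call
        self.predict_fn = predict_fn  # library-specific prediction call
        self.state = None             # fitted model or parameters

    def train(self, data):
        self.state = self.fit_fn(data)
        return self

    def process(self, data):
        if self.state is None:
            raise RuntimeError("train must be called before process")
        return self.predict_fn(self.state, data)


# Toy "library": fitting finds the largest magnitude, predicting rescales by it.
def fit_scale(data):
    return max(abs(x) for x in data)


def apply_scale(scale, data):
    return [x / scale for x in data]


step = LibraryAdapter(fit_fn=fit_scale, predict_fn=apply_scale).train([2.0, -4.0])
scaled = step.process([1.0, 2.0])
```

Swapping in a different package would then mean supplying a different pair of functions while the surrounding pipeline code stays the same.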