Convert a categorical variable from string to int

During data preprocessing I need to convert some categorical variables from string to int.
This will let me store the resulting table in a C3 dataset and feed a model pipeline.

This is easy to solve in Python by building a mapping, either with a custom function or with a sklearn encoder.
The difficulty is that I need to save this string to int mapping (for each variable) in a C3 Type, so that it is available not only in training but also in prediction.
The ideal structure would be a dictionary of dictionaries, something like:
map<string, map<string, int>>
An example of this mapping structure is: {'var1': {'lev1': 1, 'lev2': 2}, 'var2': {…}}
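For reference, a minimal sketch of how such a nested mapping could be built in plain Python. The function name build_label_mapping and the sample data are made up for illustration, and the choice to sort the levels and start the codes at 1 is an assumption that matches the example above.

```python
def build_label_mapping(columns):
    """Build {variable: {level: int}} from {variable: list of string values}.

    Levels are sorted so the codes are reproducible; codes start at 1
    (an assumption, chosen to match the example structure).
    """
    mapping = {}
    for name, values in columns.items():
        levels = sorted(set(values))
        mapping[name] = {level: code for code, level in enumerate(levels, start=1)}
    return mapping


# Hypothetical sample data: one list of string levels per variable.
sample_columns = {
    "var1": ["lev2", "lev1", "lev2"],
    "var2": ["b", "a", "b", "c"],
}
label_mapping = build_label_mapping(sample_columns)
```

The result has the map<string, map<string, int>> shape described above: one inner dictionary per variable.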

So I created the following Type to host this nested dictionary:

entity type IntLabelEncoder schema name "SCH1" {

  // dictionary structure to store the labels to integer mapping
  labelMapping: map<string, map<string, int>> schema suffix "LBL"
}

When provisioning this Type I get the following error:
Unable to upsert collection for type IntLabelEncoder: value should be primitive or reference in field labelMapping

I tried replacing the inner map<string, int> with any, or with a reference to another field, but with no luck:
Can't convert column VALUE in table C3_2_SCH1_LBL from NUMERIC(20) to TEXT

Maybe I could create an ad hoc Type (similar to a dictionary) to replace the inner dictionary.
Any suggestion is welcome.

EDIT: fixed the Type name in the error message reported

I recommend that you NOT convert from string to int through type conversion, but by writing a metric instead.


  id: 'GlPrClcValue',
  name: "GlPrClcValue",
  path: ...
  srcType: ...
  otherFields: ...
  expression: "data == 'val1' ? 1 : data == 'val2' ? 2 : ... etc"

Then use this metric as your feature/label for training & prediction.

Hi Riley, thank you for your answer.
The string -> int mapping must be dynamically updated from input data during each execution of the training phase of the preprocess+model. The same mapping must be saved somewhere and applied during prediction.
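To make the requirement concrete, here is a rough sketch of one way the update step could work for a single variable. It assumes that codes already assigned must stay stable between training runs and that new levels simply get the next free integers; update_mapping is a hypothetical helper, not part of any C3 API.

```python
def update_mapping(existing, values):
    """Return a copy of `existing` extended with any new levels in `values`.

    Existing codes are never changed (assumption: models trained earlier
    must keep seeing the same integers). New levels are added in sorted
    order, starting from the highest existing code + 1.
    """
    updated = dict(existing)
    next_code = max(updated.values(), default=0) + 1
    for level in sorted(set(values) - set(updated)):
        updated[level] = next_code
        next_code += 1
    return updated


# Mapping saved by a previous training run, then a new batch of input data.
previous = {"lev1": 1, "lev2": 2}
current = update_mapping(previous, ["lev3", "lev1", "lev0"])
```

If the mapping is instead rebuilt from scratch at every training run, this step is not needed and only the save/load part matters.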

The data I’m working with contains several variables, each with hundreds of distinct levels.
This is not feasible with a static definition of mapping logic inside a metric.
Is the metric approach you suggest compatible with a dynamic update of the mapping and a large number of levels?

OK, then your way is probably better. I noticed just now that the error in your initial post is on type GlPrClcMoGenerateDatasetSpec, not on type IntLabelEncoder.

Can we see that type as well?
Does GlPrClcMoGenerateDatasetSpec provision correctly if you remove the IntLabelEncoder type?

Sorry, GlPrClcMoGenerateDatasetSpec was the real name of the Type. In my post I edited the names to improve readability but forgot to change the type name in the error message. I have edited the OP.
I'll try using a JSON format for the field labelMapping and store in it a JSON version of a pandas DataFrame with the string -> int mapping.
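A minimal sketch of that idea, using the nested dictionary directly with the standard json module rather than a DataFrame (a simplification on my part). How the JSON string is written to and read from the C3 field is not shown, and the encode helper and the fallback code 0 for unseen levels are assumptions for illustration.

```python
import json

# Mapping produced during training (same shape as the example in the OP).
label_mapping = {"var1": {"lev1": 1, "lev2": 2}, "var2": {"a": 1, "b": 2}}

# Training side: serialize the whole nested mapping to one JSON string,
# which can then be stored in a single string/json field.
payload = json.dumps(label_mapping, sort_keys=True)

# Prediction side: restore the mapping from the stored string.
restored = json.loads(payload)


def encode(values, level_to_int, unknown=0):
    """Map string levels to ints; unseen levels get `unknown` (assumed policy)."""
    return [level_to_int.get(value, unknown) for value in values]


encoded = encode(["lev2", "lev1", "lev9"], restored["var1"])
```

Storing one JSON string sidesteps the nested map field, at the cost of not being able to query individual levels in the database.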