Data Integration: How can canonical types handle whitespaces in .csv column names?

#1

How could a canonical type be written to handle whitespaces in column names? For example, a .csv header could read, “Meter Name,Meter Id,Parent Asset,Date,Measurement Value,Status”. It’s unclear what the canonical type’s field names should be or if there is a handy annotation field to use that can handle such cases.

1 Like

#2

A csv column name is not supposed to have spaces as per https://tools.ietf.org/html/rfc4180.

0 Likes

#3

@Annaji According to that document (emphasis mine):

Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored.

See https://tools.ietf.org/html/rfc4180#page-3, point #4

0 Likes

#4

Yes … The column values are expected to have spaces. But not the column names.

0 Likes

#5

Not according to that document, which defines the CSV grammar as follows:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name \*(COMMA name)
record = field \*(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234
DQUOTE = %x22 ;as per section 6.1 of RFC 2234
LF = %x0A ;as per section 6.1 of RFC 2234
CRLF = CR LF ;as per section 6.1 of RFC 2234
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

On line 4 above, column name (name in the grammar) is defined as equivalent to a column value (field in the grammar). So if column values can have spaces, then column names can as well. And to confirm that, the last line of the grammar states that TEXTDATA may contain a space (ASCII hex code %x20).

Ideally, incoming data files do not have spaces in the CSV column names—since then defining the canonical type would be simple—but the MIME type technically allows them. C3 field type names have a stricter grammar than CSV column names do, so I think there needs to be a way to map/convert those column names which don’t conform to the C3 field type name grammar into valid field names.

0 Likes

#6

I believe that using " characters will work, so for example:

"my first field",secondField\n
aValue,"another value"\n
0 Likes

#7

Yeah that works for the CSV format, but then how do you define "my first field" as your C3 field type name?

This:

type MyCanonicalType mixes Canonical<MyCanonicalType> {
  "my first field": string
}

doesn’t work. Provisioning fails with error:

[Message] Invalid metadata in tag prod in tenant reference:
in "path/to/file/MyCanonicalType.c3typ":
    path/to/file/MyCanonicalType.c3typ [2,2] E no viable alternative at input '"my first field"'
    path/to/file/MyCanonicalType.c3typ [3] E mismatched input '}' expecting ':'

I also tried escaping the space characters with backslashes, as well as quoting the field name with backticks—but both failed with similar provisioning errors.

0 Likes

#8

was this ever resolved?

0 Likes

#9

I’m new to this so this might already exist but, I would suggest to C3 IoT to implement and annotation system for this. Something similar to @db you might have @map to enable mapping between column names and fields. I can imagine other scenarios that would need this like having column names that would match a “keyword”, not having column names at all (so index might be relevant), etc.

0 Likes

#10

I believe you can use the annotation @ser (stands for serialize) so for example @ser(name=“my field with spaces”)

3 Likes

#11

Delete …

@rileysiebel It worked! Thank you!!!

0 Likes