Inconsistent Whitespace Handling in Transforms

I am noticing inconsistent handling of trailing whitespace in Canonical Transforms.

When I use an environment in Azure (server version, the whitespace is automatically removed and data is persisted without it (a trim was applied).

When I use Docker (server version 7.8.10+g834a657dd4c.SNAPSHOT-1), the whitespace is kept and persisted in the data. This causes reference issues, as some data sources send trailing whitespace and others do not, and there are references between them that now no longer match.

This inconsistency concerns me. Is there a way to force trimming the trailing whitespace (like a server config value that is not set by default), or an undocumented expression engine function that could be added to the transform?

This behavior still exists with 7.9 in Docker (build: 7.9.6+gf33b7ec.SNAPSHOT-1).

It looks like 7.9 added trim expression engine functions, but I would prefer not to go through all my transforms and add them, especially since those functions are not in 7.8 and we need to provision our code to both versions for now.
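One stopgap that works the same on 7.8 and 7.9 would be to normalize the file before it is placed in the Canonical-mapped location. The following is only a minimal sketch in plain JavaScript: `trimTrailingWhitespace` is a hypothetical helper, not part of the platform API, and it assumes a simple comma-delimited file with no quoted fields containing commas or newlines.

```javascript
// Hypothetical helper (not a platform API): strips trailing whitespace from
// every field of a simple CSV string.
// Assumption: fields are comma-delimited and never contain quoted commas
// or embedded newlines.
function trimTrailingWhitespace(csvText) {
  return csvText
    .split(/\r?\n/)                       // split into lines (LF or CRLF)
    .map(function (line) {
      return line
        .split(',')                       // split into fields
        .map(function (field) {
          return field.replace(/\s+$/, ''); // drop trailing whitespace only
        })
        .join(',');
    })
    .join('\n');                          // lines are rejoined with LF
}
```

Leading whitespace is left alone on purpose, since only trailing whitespace differs between the environments described here. Files with quoted fields would need a real CSV parser instead of `split(',')`.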

@mjlovell when you use local Docker, how do you load the data files in 7.8/7.9, compared with how you load them in Azure?

In both cases, the file is placed in a Canonical-mapped location (in Azure: the blob store, in Docker: a directory mapped to a docker mount point, which is set as the FileSourceSystem’s rootUrlOverride).

For Azure, SourceFile.syncFile('url to the file in blob store').process()
For Docker, SourceFile.syncAll(); SourceFile.processAll();

In both cases the file contents are loaded. In Azure, trailing whitespace is stripped. In Docker, it is not.

Note: for Azure, we need to use the specific method SourceFile.syncFile, as it appears to be the only one (unless a previous bug was fixed) that correctly reads the contentType from Blob. Every other sync method overwrites the contentType with application/octet-stream, which then fails to process. In Docker, all the methods appear to guess the correct contentType of text/csv.

@mjlovell we guess the file content type for Docker, which uses the local file system, but when you place files in Azure we use the contentType set on the backing FileSystem, which will be application/octet-stream.

I can’t see why we would strip the content in one case and not in the other. We might have to check what is different between Azure and Docker, maybe in Postgres.

Do you know at which step the trimming is happening: while persisting to Postgres or while transforming?

Let's verify by running

After doing some more testing, I am a bit more concerned.
I was running my Jasmine tests, which start with a string representing the CSV contents, use ContentValue.fromString and Canonical.readObjs to convert that string to the canonical, then call Canonical.transformCanonical and compare the results to the input.

In Docker (7.9), Dev (Azure 7.9), QA (Azure, the result from transformCanonical includes the trailing whitespace. In Production (Azure, same buildRefSpec as QA), it does NOT include the trailing whitespace.

So I made a small test file to curl into a canonical. The same file was sent to the four environments mentioned above, with the same results as the Jasmine test: Docker, Dev, and QA persisted the whitespace from the file, while Production stripped it out. The file sent was identical, and as mentioned, Prod and QA are the same version and build, so why would one strip out trailing whitespace and the other keep it?
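To make the comparison repeatable across environments, a check along these lines could be run against whatever each environment returns. This is a sketch only: `findTrimmedFields` is a hypothetical helper, and it assumes the sent rows and the persisted rows are available as arrays of plain objects in the same order. How the persisted rows are fetched is environment-specific and is not shown.

```javascript
// Hypothetical helper (not a platform API): reports every string field whose
// persisted value equals the sent value with trailing whitespace removed.
// Assumption: `sent` and `persisted` are arrays of plain objects keyed by
// field name, and row i in one corresponds to row i in the other.
function findTrimmedFields(sent, persisted) {
  var diffs = [];
  sent.forEach(function (row, i) {
    var stored = persisted[i] || {};
    Object.keys(row).forEach(function (field) {
      var before = row[field];
      var after = stored[field];
      if (typeof before === 'string' && typeof after === 'string' &&
          before !== after &&
          before.replace(/\s+$/, '') === after) {
        diffs.push({ row: i, field: field, sent: before, persisted: after });
      }
    });
  });
  return diffs;
}
```

An empty result means no field was trimmed in that environment; a non-empty result lists exactly which rows and fields lost trailing whitespace, which could be attached to a ticket for each environment.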

@mjlovell if QA and Prod have the same c3 build, then it seems unlikely to be related to the code. Can you create a ticket with the steps you used to reproduce this in QA and Prod (please specify the environments)? We can then take a look and try to find the root cause of the issue.