Configuring S3 for multi-tenancy etc

#1

I would like to know the recommended way to use S3 with respect to isolation between tags and between APIs.

  1. Why do I see c3/c3 as tenant/tag in the result of FileSourceCollection.get("<some canonical>").inboxUrl()? (I have curled to /file API using my tenant/tag)

  2. Why are files curled via /file and /import seen in the same S3 directory/FileSourceCollection?

  3. It seems that SourceFile is not marked as completed when a file is curled via /import but then it gets reprocessed upon syncAll(fsc.inboxUrl()): am I wrong or how should this be reconciled? (My coworker curled file1 to /import while I was testing SourceFile APIs with file2 of the same canonical. When I called syncAll, file1 was reprocessed. I don’t know at the moment if he used the same tag as myself.)

  4. Does inboxUrl() return a physical URL, or is what is visible actually tenant/tag-isolated?

  5. What is the right way to deal with the usual case of integrating data per canonical? There is one FileSourceCollection per canonical, but SourceFile does not seem closely connected to canonicals. Is there a better/safer way to synchronize per canonical than doing it manually (file by file from a FileSourceCollection, or via tricks like syncAll(fsc.inboxUrl()))?

@garrynigel suggested configuring with

FileSystemConfig.make().getConfig()

and asking the Ops about details.

Thanks

#2

Curling to the /file endpoint sends a file in the request payload from your local machine to the server.
An example from the C3 docs is below:
Upload data through the C3 Platform using the /file API. The command is:

curl -v -H "Content-Type: text/csv;" -X PUT --data-binary @Factory001.csv \
  https://environmentURL/file/1/tenant/tag/s3://c3--dev-dl-custenv/BillingSys/Factory/Factory001.csv -u username:password

If you do not specify the full URL (or encoded path), the file is put in the default location configured for the tenant/tag in the environment:

curl -v -H "Content-Type: text/csv;" -X PUT --data-binary @Factory001.csv \
  https://environmentURL/file/1/tenant/tag/Factory/Factory001.csv -u username:password

which can be checked as below

FileSystem.inst().urlFromEncodedPath()

The configuration for each tenant and tag is generally maintained in FileSystemConfig.
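
As a rough illustration of the two /file request shapes described above, here is a minimal bash sketch that only assembles the request URL. The helper name file_api_url is hypothetical (it is not part of C3), and the host, tenant and tag values are the placeholders from the examples above.

```shell
#!/usr/bin/env bash
# Sketch only: file_api_url is a hypothetical helper, not a C3 command.
# It assembles the /file request URL in the shape used in the examples above.
file_api_url() {
    local env_url="$1" tenant="$2" tag="$3" target="$4"
    # target is either a full URL (s3://bucket/path/file.csv) or a path that
    # the server resolves against the default location for the tenant/tag.
    printf '%s/file/1/%s/%s/%s\n' "${env_url%/}" "${tenant}" "${tag}" "${target#/}"
}

# Full URL: the caller names the exact S3 location.
file_api_url "https://environmentURL" tenant tag \
    "s3://c3--dev-dl-custenv/BillingSys/Factory/Factory001.csv"

# Relative path: the location depends on the tenant/tag default.
file_api_url "https://environmentURL" tenant tag "Factory/Factory001.csv"
```

The sketch does not call the server; the resulting string would be passed to curl as in the commands above.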

The /import endpoint is very similar; in 7.8 the command typically looks like this:

curl -H "Content-Type: text/csv" -X PUT --data-binary @<<file>>.csv http://<<HostName>>/import/1/<<tenant>>/<<tag>>/<<CanonicalTypeName>>/<<filenameidentifier>> -u <<username>>:<<password>> -v

It finds the FileSourceCollection for the CanonicalTypeName, looks up the inboxUrl for that CanonicalTypeName, and writes the file to that location before it syncs and processes it.

inboxUrl is derived from FileSystemConfig, or you can override it for each FileSourceCollection by setting inboxUrlOverride on the FileSourceCollection.

This is the documentation in the type file:

  /**
   * Overrides the #inboxUrl
   * If data needs to be loaded from a folder different than the default one, provide the value via this field.
   * e.g. s3://my-custom-bucket/mypath/inbox
   */
  inboxUrlOverride: private string
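
For comparison with /file, the /import request shape described above can be sketched the same way. import_api_url is a hypothetical helper name, and the argument check is an addition of this sketch, not C3 behaviour.

```shell
#!/usr/bin/env bash
# Sketch only: import_api_url is a hypothetical helper, not a C3 command.
# /import takes the canonical type name and a file name identifier instead of
# a storage path; the server decides where to write (the inboxUrl).
import_api_url() {
    # Refuse to build a URL with a missing segment.
    [ "$#" -eq 5 ] || { echo "usage: import_api_url ENV TENANT TAG CANONICAL FILE_ID" >&2; return 2; }
    local env_url="$1" tenant="$2" tag="$3" canonical="$4" file_id="$5"
    printf '%s/import/1/%s/%s/%s/%s\n' \
        "${env_url%/}" "${tenant}" "${tag}" "${canonical}" "${file_id}"
}

import_api_url "http://localhost:8080" dataIntegration test CanonicalAccount file1.csv
```

Note that the caller never names a bucket here, which is why the effective location depends on inboxUrl or inboxUrlOverride.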

A canonical does not have a one-to-one mapping with SourceCollection. SourceCollection is metadata that describes what type of source the data comes from (a queue message, a file, a direct HTTP request, a SQL table), the properties related to it, and the structure (Source type/Canonical type) used to read the data.

I do not actually understand the use case of integrating data per canonical. I think loading data with priorities on SourceCollection is a feature we are introducing in 7.9; maybe those priorities can be set to determine which Sources need to be processed first, and so forth.

#3

The “per-canonical” use case is what we do all the time in versions before 7.8:

Process.transfer(MyCanonical)

usually via cron, which does transfer from sftp, load and process. Now we want data in S3 instead of sftp; then the transfer and load become sync. And it feels natural to stay in per-canonical mode like before.

#4

Correct. If your case is only for files, you can prioritize each FileSourceCollection by its Source type and schedule crons for each FileSourceCollection accordingly.
Check SourceCollectionProcessMode, which can be set for each SourceCollection. Both on_arrival and on_schedule can be set to determine how you load data.

#5

FWIW, below are two bash functions that we usually use to upload to/via C3. Please let me know if you have suggestions.

  1. targeting /import API
function upload() {
    # Set fname before the test so the error message below can report it.
    local fname="${1}"
    if [ -f "${fname}" ]
    then
        local fnames bname tname
        # The canonical type name is the part before the first '_'.
        IFS='_' read -ra fnames <<< "${fname}"
        bname=$(basename "${fname}")
        tname=$(basename "${fnames[0]}")
        echo "fname: ${fname}, bname: ${bname}, tname: ${tname}"
        curl -H "Content-Type: text/csv" -X PUT --data-binary "@${fname}" \
             "${C3_ENV_URL}/import/1/${C3_TENANT}/${C3_TAG}/${tname}/${bname}" \
             -H "Authorization: ${C3_TOKEN}" -vvv
    else
        echo "$0: File '${fname}' not found." >&2
    fi
}
  2. targeting /file API
function uploadS3() {
    # Set fname before the test so the error message below can report it.
    local fname="${1}"
    if [ -f "${fname}" ]
    then
        local fnames bname tname inbox
        # The canonical type name is the part before the first '_'.
        IFS='_' read -ra fnames <<< "${fname}"
        bname=$(basename "${fname}")
        tname=$(basename "${fnames[0]}")
        # Target folder: <mount>/<canonical type name>/inbox
        inbox="${C3_S3_MOUNT}/${tname}/inbox"
        echo "fname: ${fname}, bname: ${bname}, tname: ${tname}, inbox: ${inbox}"
        curl -H "Content-Type: text/csv" -X PUT --data-binary "@${fname}" \
             "${C3_ENV_URL}/file/1/${C3_TENANT}/${C3_TAG}/${inbox}/${bname}" \
             -H "Authorization: ${C3_TOKEN}" -vvv
    else
        echo "$0: File '${fname}' not found." >&2
    fi
}

For example,

export C3_ENV_URL=<vanityUrl>
export C3_TENANT=myapp
export C3_TAG=test
export C3_TOKEN=deadbeef...
export C3_S3_MOUNT=s3://... # S3.mountName() + '/inbox'
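
One detail in the two functions above may be worth a sketch: they split the whole argument on '_', so a directory name containing an underscore (for example my_data/CanonicalFoo_1.csv) would yield "my" as the type name. A possible alternative, assuming the naming convention is <CanonicalTypeName>_<rest> for the file name only, is to strip the directory first. canonical_from_path is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: derive the canonical type name from the file name alone,
# assuming files are named <CanonicalTypeName>_<rest>.
canonical_from_path() {
    local bname="${1##*/}"          # drop any leading directories
    case "${bname}" in
        *_*) printf '%s\n' "${bname%%_*}" ;;   # text before the first '_'
        *)   return 1 ;;                        # no '_' separator in the name
    esac
}

canonical_from_path "my_data/CanonicalSmartCMeasurement_foo.csv"
```

The function returns a non-zero status when the file name has no '_', so a caller can skip such files instead of uploading them under a wrong type name.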

#6

The above methods look good. If the paths provided to the /file API and the /import API are the same, the files will end up in the same place. For example:

curl -H "Content-Type: text/csv" -X PUT --data-binary @file1.csv \
    "http://localhost:8080/import/1/dataIntegration/test/CanonicalAccount/file1.csv"

and

curl -H "Content-Type: text/csv" -X PUT --data-binary @file1.csv \
    "http://localhost:8080/file/1/dataIntegration/test/s3://c3--local-env/dl/dataIntegration/test/CanonicalAccount/file1.csv"

Both will put the file in the same location: s3://c3--local-env/dl/dataIntegration/test/CanonicalAccount/file1.csv

The only difference is that import will also process the file.

#7

Will import also mark in SourceFile that it has been processed?

#8

Yes, it will schedule it for processing. It should do the same thing as calling SourceFile.syncFile(file).

#9

We are trying to configure and test the inbox url on a stage env. Currently, FileSystem.inst().urlFromEncodedPath() returns a path ending in c3/c3.
We were able to override the inbox URL for a FileSourceCollection corresponding to a canonical type so that it ends with <tenant>/<tag>/inbox. It seems to work, but FileSystem.inst().urlFromEncodedPath() still returns the same path with c3/c3, which seems inconsistent. Additionally, in another project’s QA environment, FileSystem.inst().urlFromEncodedPath() returns a path ending in <tenant>/<tag> (not c3/c3).
How can I reconfigure our stage env so that FileSystem.inst().urlFromEncodedPath() returns a path ending with our <tenant>/<tag>?
Thanks

#10

A related question: with a slightly modified uploadS3 (which uses basic authentication) and the overridden FileSourceCollection URL, we get the following error:

Tue 19Feb19 15:58:25CET > uploadS3up CanonicalSmartCMeasurement_foo.csv
fname: CanonicalSmartCMeasurement_foo.csv, bname: CanonicalSmartCMeasurement_foo.csv, tname: CanonicalSmartCMeasurement, inbox: s3://<clusterPart>/fs/engie-vertuoz/test2/inbox/CanonicalSmartCMeasurement/inbox
...
> PUT /file/1/engie-vertuoz/test2/s3://<clusterPart>/fs/engie-vertuoz/test2/inbox/CanonicalSmartCMeasurement/inbox/CanonicalSmartCMeasurement_foo.csv HTTP/1.1
...
> Content-Type: text/csv
> Content-Length: 92
>
* upload completely sent off: 92 out of 92 bytes
< HTTP/1.1 500 Server Error
< Date: Tue, 19 Feb 2019 15:45:26 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: c3tenant=engie-vertuoz;Path=/;Domain=<VanityUrl>;Secure
<
* Connection #0 to host <VanityUrl> left intact

What is wrong?

Thanks

EDIT: I just tried fsc.listFiles() and got this error:

500 response with invalid data (cannot parse JSON)
  data:

 *** ERROR: Internal error: 

Invalid or inaccessible URL s3://<clusterPart>/fs/engie-vertuoz/test2/inbox for s3 [UnsupportedAction]

So it seems we need to create this directory. I only saw createFile, but am not sure how to use it (it requires a File and I need to create a directory).

EDIT2: This line seems to create a file:

FileSystem.createFile(File.make({url:'<clusterPart>/fs/engie-vertuoz/test2/inbox'}))

but when trying SourceFile.syncFile(<above file>), I get:

FileSourceCollection does not exist for  s3://<clusterPart>/fs/c3/c3/<clusterPart>/fs/engie-vertuoz/test2 doesn't exist!

BTW, the File created by createFile has url with a prefix s3:// – where does it come from? (I have not mentioned it in that line.)
ANSWER: It seems it comes from FileSystem.fileSystemConfig() and its default field. Right?

EDIT: Related: Mount an alternate S3 bucket to a FS mount point
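
Since the 500 above only shows up in the verbose curl trace, a small wrapper that captures the HTTP status may make such failures easier to spot in scripts. This is a sketch, not part of the original functions: put_file and is_success_status are hypothetical names, and C3_TOKEN is the variable from post #5. It relies only on curl's -s, -o and -w '%{http_code}' options.

```shell
#!/usr/bin/env bash
# Sketch only: treat any non-2xx status from the upload as a failure.
is_success_status() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# put_file FILE URL: PUT the file and report the HTTP status.
put_file() {
    local file="$1" url="$2" status
    # -o /dev/null discards the body; -w prints only the status code.
    status=$(curl -s -o /dev/null -w '%{http_code}' \
                  -H "Content-Type: text/csv" -X PUT --data-binary "@${file}" \
                  -H "Authorization: ${C3_TOKEN}" "${url}")
    if is_success_status "${status}"; then
        echo "uploaded ${file} (HTTP ${status})"
    else
        echo "upload of ${file} failed (HTTP ${status})" >&2
        return 1
    fi
}
```

The wrapper only reports that the request failed; the cause (here, a location the server's file system cannot access) still has to be read from the server-side error.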

#11

FileSystem and FileSourceCollection are for two different things.
FileSystem is an interface for interacting with the file system configured for the server on that environment. FileSystemConfig is the configuration for that file system.
FileSourceCollection is for data loading and holds the configuration for loading data into a Source type. So if you override the configuration on the FileSourceCollection, the server must still have access to the file system you point it at.

To make FileSystem point to the location you want, the location returned by FileSystem.inst().urlFromEncodedPath() should be configured on FileSystemConfig.

Check this on how to set FileSystemConfig


But generally Ops manages the FileSystem configuration on that cluster.

Before you create a ticket for them you should plan and provide the list of tenant tags and the buckets that need to be accessed for that environment.

You get the inaccessible error because FileSystem on that environment is not set up to access the URL.
Even if you call FileSystem.inst().listFiles(fsc.inboxUrl()), you will get the same error.