Configuring S3 for multi-tenancy etc


#1

I would like to know the recommended way to use S3 with respect to isolation between tags and between APIs.

  1. Why do I see c3/c3 as tenant/tag in the result of FileSourceCollection.get("<some canonical>").inboxUrl()? (I have curled to /file API using my tenant/tag)

  2. Why are files curled via /file and /import seen in the same S3 directory/ FileSourceCollection?

  3. It seems that SourceFile is not marked as completed when a file is curled via /import but then it gets reprocessed upon syncAll(fsc.inboxUrl()): am I wrong or how should this be reconciled? (My coworker curled file1 to /import while I was testing SourceFile APIs with file2 of the same canonical. When I called syncAll, file1 was reprocessed. I don’t know at the moment if he used the same tag as myself.)

  4. Is inboxUrl() returning a physical url or what is visible is actually tag/tenant-isolated?

  5. What is the right way to deal with the usual case of integrating data per canonical? There is one FileSourceCollection per canonical, but SourceFile does not seem closely connected to canonicals. Is there a better/safer way to synchronize per canonical than doing it manually (file by file from a FileSourceCollection, or via tricks like syncAll(fsc.inboxUrl()))?

@garrynigel suggested configuring with

FileSystemConfig.make().getConfig()

and asking Ops for details.

Thanks


#2

Curling to the /file endpoint sends a file in the request payload from your local machine to the server.
An example from the C3 docs is below:
b. Upload data through the C3 Platform using the /file API. The command is:

curl -v -H "Content-Type: text/csv;" -X PUT --data-binary @Factory001.csv \
  https://environmentURL/file/1/tenant/tag/s3://c3--dev-dl-custenv/BillingSys/Factory/Factory001.csv -u username:password

If you do not specify the full URL or encoded path, the file is put in the default location configured for the tenant/tag in the environment:

curl -v -H "Content-Type: text/csv;" -X PUT --data-binary @Factory001.csv \
  https://environmentURL/file/1/tenant/tag/Factory/Factory001.csv -u username:password

The resolved location can be checked with:

FileSystem.inst().urlFromEncodedPath()

The configuration for each tenant and tag is generally maintained in FileSystemConfig.
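To make the two forms of the /file path easier to compare, here is a minimal sketch that only builds the URL string and does not call the server. The function name file_api_url is hypothetical (not part of C3), and the sketch assumes the URL layout is exactly the one shown in the two curl commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a C3 command): compose a /file URL.
# Usage: file_api_url <envUrl> <tenant> <tag> <target>
# <target> is either a full URL (s3://bucket/path/file.csv) or a relative
# path (Factory/Factory001.csv) that the server resolves against the
# default location configured for the tenant/tag.
file_api_url() {
    local env_url="${1%/}"   # drop one trailing slash, if present
    local tenant="$2"
    local tag="$3"
    local target="${4#/}"    # drop one leading slash, if present
    printf '%s/file/1/%s/%s/%s\n' "${env_url}" "${tenant}" "${tag}" "${target}"
}

# Relative path: lands in the tenant/tag default location.
file_api_url "https://environmentURL" tenant tag "Factory/Factory001.csv"
# Full URL: lands exactly where the s3:// URL says.
file_api_url "https://environmentURL" tenant tag "s3://c3--dev-dl-custenv/BillingSys/Factory/Factory001.csv"
```

The printed URL can be passed to curl as in the commands above.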

The /import endpoint is very similar. In 7.8 the command typically looks like this:

curl -H "Content-Type: text/csv" -X PUT --data-binary @<<file>>.csv http://<<HostName>>/import/1/<<tenant>>/<<tag>>/<<CanonicalTypeName>>/<<filenameidentifier>> -u <<username>>:<<password>> -v

It finds the FileSourceCollection for the CanonicalTypeName, looks up the inboxUrl for that FileSourceCollection, and writes the file to that location before it syncs and processes it.

inboxUrl is derived from FileSystemConfig, or you can override it for an individual FileSourceCollection by setting inboxUrlOverride on that FileSourceCollection.

This is the documentation in the type file:

  /**
   * Overrides the #inboxUrl
   * If data needs to be loaded from a folder different than the default one, provide the value via this field.
   * e.g. s3://my-custom-bucket/mypath/inbox
   */
  inboxUrlOverride: private string

A canonical does not have a one-to-one mapping with a SourceCollection. A SourceCollection is metadata that describes what kind of source the data comes from (a queue message, a file, a direct HTTP request, a SQL table), the properties related to that source, and the structure (Source type / Canonical type) used to read the data.

I do not fully understand the use case of integrating data per canonical. I think loading data with priorities on SourceCollection is a feature we are introducing in 7.9, and maybe those priorities can be set to determine which Sources need to be processed first, and so forth.


#3

The “per-canonical” use case is what we did all the time before v7.8:

Process.transfer(MyCanonical)

usually via cron, which does transfer from sftp, load and process. Now we want data in S3 instead of sftp; then the transfer and load become sync. And it feels natural to stay in per-canonical mode like before.


#4

Correct. If your case is only for files, you can prioritize the Source type for each FileSourceCollection and schedule crons for each FileSourceCollection accordingly.
Check SourceCollectionProcessMode, which can be set for each SourceCollection. Both on_arrival and on_schedule can be set to determine how you load data.


#5

FWIW, below are two bash functions that we usually use to upload to/via C3. Please let me know if you have suggestions.

  1. targeting the /import API
function upload() {
    if [ -f "${1}" ]
    then
        local fname="${1}"
        local bname tname
        bname=$(basename "${fname}")
        # Type name: the part of the base name before the first underscore
        tname="${bname%%_*}"
        echo "fname: ${fname}, bname: ${bname}, tname: ${tname}"
        curl -H "Content-Type: text/csv" -X PUT --data-binary "@${fname}" \
             "${C3_ENV_URL}/import/1/${C3_TENANT}/${C3_TAG}/${tname}/${bname}" \
             -H "Authorization: ${C3_TOKEN}" -vvv
    else
        # fname is only set in the branch above, so report the argument itself
        echo "upload: File '${1}' not found." >&2
        return 1
    fi
}
  2. targeting the /file API
function uploadS3() {
    if [ -f "${1}" ]
    then
        local fname="${1}"
        local bname tname inbox
        bname=$(basename "${fname}")
        # Type name: the part of the base name before the first underscore
        tname="${bname%%_*}"
        inbox="${C3_S3_MOUNT}/${tname}/inbox"
        echo "fname: ${fname}, bname: ${bname}, tname: ${tname}, inbox: ${inbox}"
        curl -H "Content-Type: text/csv" -X PUT --data-binary "@${fname}" \
             "${C3_ENV_URL}/file/1/${C3_TENANT}/${C3_TAG}/${inbox}/${bname}" \
             -H "Authorization: ${C3_TOKEN}" -vvv
    else
        # fname is only set in the branch above, so report the argument itself
        echo "uploadS3: File '${1}' not found." >&2
        return 1
    fi
}

For example,

export C3_ENV_URL=<vanityUrl>
export C3_TENANT=myapp
export C3_TAG=test
export C3_TOKEN=deadbeef...
export C3_S3_MOUNT=s3://... # S3.mountName() + '/inbox'
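
Both functions rely on a file naming convention: the text before the first underscore in the file name is taken as the canonical type name. Below is a small sketch of that derivation in isolation. The function name canonical_from_filename and the sample path are hypothetical. This variant splits the base name rather than the full path, so underscores in directory names do not interfere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the canonical type name from a file path,
# assuming files are named <CanonicalTypeName>_<anything>.
canonical_from_filename() {
    local bname
    bname="$(basename "$1")"
    # Strip the longest suffix starting at the first underscore.
    # A name without any underscore is returned unchanged.
    printf '%s\n' "${bname%%_*}"
}

canonical_from_filename "/data/in_box/CanonicalAccount_2020-01.csv"   # prints CanonicalAccount
```

A file name with no underscore (for example file1.csv) comes back unchanged, so such files would need to be renamed or handled separately before upload.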

#6

The above methods look good. If the paths provided to the /file API and the /import API resolve to the same location, the files will end up in the same place. For example,

curl -H "Content-Type: text/csv" -X PUT --data-binary @file1.csv \
    "http://localhost:8080/import/1/dataIntegration/test/CanonicalAccount/file1.csv"

and

curl -H "Content-Type: text/csv" -X PUT --data-binary @file1.csv \
    "http://localhost:8080/file/1/dataIntegration/test/s3://c3--local-env/dl/dataIntegration/test/CanonicalAccount/file1.csv"

Both will put the file in the same location, s3://c3--local-env/dl/dataIntegration/test/CanonicalAccount/file1.csv

The only difference is that /import will also process the file.
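
As a rough illustration of that equivalence, the sketch below prints the /import URL and the matching /file URL for one file without contacting the server. The function name equivalent_upload_urls and the inbox_root parameter are hypothetical. The sketch assumes the inbox for a canonical is inbox_root followed by the canonical type name, as in the example above; the real location depends on FileSystemConfig and any inboxUrlOverride.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the /import URL (line 1) and the /file URL
# (line 2) that would target the same S3 object, under the assumption that
# the canonical's inbox is <inbox_root>/<canonical>.
# Usage: equivalent_upload_urls <envUrl> <tenant> <tag> <canonical> <file> <inbox_root>
equivalent_upload_urls() {
    local env_url="${1%/}" tenant="$2" tag="$3" canonical="$4" file="$5"
    local inbox_root="${6%/}"   # drop one trailing slash, if present
    printf '%s/import/1/%s/%s/%s/%s\n' \
        "${env_url}" "${tenant}" "${tag}" "${canonical}" "${file}"
    printf '%s/file/1/%s/%s/%s/%s/%s\n' \
        "${env_url}" "${tenant}" "${tag}" "${inbox_root}" "${canonical}" "${file}"
}

equivalent_upload_urls "http://localhost:8080" dataIntegration test \
    CanonicalAccount file1.csv "s3://c3--local-env/dl/dataIntegration/test"
```

If the two uploads do not land in the same place on a given environment, the inbox_root assumption is the first thing to check against fsc.inboxUrl().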


#7

Will import also mark in SourceFile that it has been processed?


#8

Yes, it will schedule it for processing. It should do the same thing as calling SourceFile.syncFile(file).