Calls to SourceFile.syncAll(fsc.rootUrl()) are resetting the status of existing SourceFile records back to initial, causing files to get re-processed repeatedly.
This appears to have started with 7.8.8.
Our current process:
Files are placed in the Azure blob by an external tool. A cron job periodically calls SourceFile.syncAll(), passing in the FileSourceCollection's rootUrl, to create the SourceFile records. A second cron job then periodically fetches SourceFile records whose status is initial, updates their contentType (the external tool is not able to set it), syncs them, and processes them.
That used to work fine, but now old files keep getting reset back to initial, so they keep getting re-processed. Is this expected behavior, or is there a better way of doing this?
Some other notes from experimenting:
If the file in the Azure blob has a contentType set (manually in Azure, via a different external tool that can set it properly, or via a previous File API call), then SourceFile.syncAll wipes out that contentType information from the SourceFile record. However, SourceFile.syncFile retains/restores the contentType information (with the downside of needing to know the specific file name).
SourceFile.syncFile also does not appear to reset the status back to initial like the other methods do.
My current thought is to switch to calling FileSourceCollection.listFiles and then calling SourceFile.syncFile for each file, as that appears to be the only sync method that does not wipe out or reset information for existing files. Is this the correct way of doing things?
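
To make the idea concrete, here is a rough sketch of the loop I have in mind, written in plain Python rather than against the platform itself. The names list_file_urls and sync_file are hypothetical stand-ins for FileSourceCollection.listFiles and SourceFile.syncFile; I am not assuming anything about the real signatures or return types, so both are passed in as plain callables. It only shows the control flow: list once, sync one file at a time, and keep going if a single file fails.

```python
from typing import Callable, Iterable, List, Tuple


def sync_each_file(
    list_file_urls: Callable[[], Iterable[str]],
    sync_file: Callable[[str], object],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Sync files one at a time instead of calling a bulk sync.

    list_file_urls and sync_file are hypothetical stand-ins for
    FileSourceCollection.listFiles and SourceFile.syncFile; the real
    signatures are not known here, so they are injected by the caller.
    Returns (synced_urls, failures), where failures pairs each URL
    with the error message raised for it.
    """
    synced: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in list_file_urls():
        try:
            sync_file(url)
            synced.append(url)
        except Exception as exc:
            # Record the failure and continue so one bad file
            # does not block the rest of the batch.
            failed.append((url, str(exc)))
    return synced, failed
```

The first cron job would call this in place of the bulk sync, with the two callables wrapping whatever the actual platform calls turn out to be. Collecting failures instead of raising is just my preference for a periodic job; raising on the first error would work too.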