Processing of Fetch API Calls



When making a Fetch API call that filters across multiple IDs, are those IDs filtered sequentially or in parallel?

For example, I have the following fetch specification:

    "spec": {
        "filter": "intersects(, ['SMBLB1_Consumption', 'SMBLB2_Consumption', 'SMBLB3_Consumption', 'SMBLB4_Consumption', 'SMBLB5_Consumption']) && start >= dateTime('2019-04-08T00:00:00-04:00')",
        "include": ", start, end, quantity.value"

Will the platform sequentially get measurements for each MeasurementSeries ID (FIFO) or is there some parallel processing done (something like 1 thread per MeasurementSeries ID in the filter criteria)?




It depends on the underlying storage technology.

If these MeasurementSeries are persisted in Cassandra, then my understanding is that some number of series will be read in parallel.

If the data is stored in Postgres or another relational DB, then filtering and fetching are pushed down to the DB layer.




The MeasurementSeries are persisted in Postgres and the Measurements are persisted in Cassandra.

So in that case, the measurements would be retrieved in parallel? Is there a way to determine what degree of parallelism is being used?




@wonga they will all be read serially as part of the same fetch call. Almost all C3 actions are single-threaded; the unit of parallelization comes from the queues (InvalidationQueue).



but @rohit.sureka a single thread on the C3 side might make a multi-threaded push-down query to the database, correct?



@rileysiebel That is not the case. It’s possible that metric evaluation has some parallelization (@rohit.sureka?) but Fetch does not.



@wonga I’m not sure why you care about the sequencing of the filtering…
If you are worried about the ordering of the results, then use the order field in your spec.
If it is for performance reasons, then one fetch call filtering on multiple IDs could be slower than several parallel fetches of one ID each (because the backing store is Cassandra).
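That alternative can be sketched in plain JavaScript. Note that `fetchSeries` here is a hypothetical stand-in for whatever client call issues a single-ID fetch; it is not a C3 API:

```javascript
// Sketch of "several fetches of one ID each, in parallel".
// fetchSeries is a hypothetical async function (id) => measurements;
// substitute your actual client call.
async function fetchAllInParallel(ids, fetchSeries) {
  // Promise.all issues every request concurrently and preserves order.
  const perSeries = await Promise.all(ids.map(id => fetchSeries(id)));
  // Key the results by series ID for convenience.
  const byId = {};
  ids.forEach((id, i) => { byId[id] = perSeries[i]; });
  return byId;
}
```

Whether this actually beats one multi-ID fetch depends on the cluster, so it would need to be measured rather than assumed.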



This question stems from a data integration issue my company is working on. We need to obtain Measurement data, and the concern is that filtering on one ID at a time would be too much for the platform to handle if we issue 1K requests at once, with a risk of timeouts and requests not being serviced.

Ideally, we’d like to know the best chunk size for performance, i.e. how many MeasurementSeries (parent IDs) to filter on in a single fetch call to the Measurement API.
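For the chunking itself, a minimal sketch follows. The right chunk size is an empirical tuning question, and the field being intersected (`parent.id` below) is a guess, since the spec quoted above elides it; use whatever field your type actually filters on:

```javascript
// Split a list of MeasurementSeries IDs into fixed-size chunks so each
// fetch call filters on a bounded number of IDs.
function chunkIds(ids, size) {
  const chunks = [];
  for (let i = 0; i < ids.length; i += size) {
    chunks.push(ids.slice(i, i + size));
  }
  return chunks;
}

// Build a filter string in the same shape as the spec above.
// NOTE: "parent.id" is a hypothetical field name, not taken from the
// original spec, which omits the first argument of intersects().
function buildFilter(ids, startIso) {
  const list = ids.map(id => `'${id}'`).join(', ');
  return `intersects(parent.id, [${list}]) && start >= dateTime('${startIso}')`;
}
```

One filter string per chunk can then be issued as a separate fetch, and the chunk size swept experimentally to find the timeout/throughput sweet spot.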




@wonga why would you not distribute that query over the cluster as a map-reduce job instead of having the call as part of the same fetch with 100 IDs? If you spread it out over the cluster, 1K requests should not be much for the platform to handle.
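The shape of that computation, as a toy illustration in plain JavaScript (this is not the C3 MapReduce API; it only shows the split: map handles one series per unit of work, reduce merges the partials):

```javascript
// Map step: each unit of work handles one series ID and fetches its data.
// fetchFn is a hypothetical stand-in for the real per-series fetch.
function mapStep(seriesId, fetchFn) {
  return { seriesId, measurements: fetchFn(seriesId) };
}

// Reduce step: merge the per-series partial results into one object.
function reduceStep(partials) {
  const merged = {};
  for (const p of partials) {
    merged[p.seriesId] = p.measurements;
  }
  return merged;
}
```

On the platform, the map steps would run on distributed workers rather than in one process, which is what spreads the 1K requests over the cluster.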



This may be helpful if you want to create a MapReduce Job on the fly, @wonga: JS MapReduce job example

but the trick with the MapReduce for you will be how to store the results in a way that lets you read/analyze them, since they won’t be returned synchronously the way your original fetch() result would be; these tasks run on the workers.

Normally, a distributed job performs some task whose result is an action: send an email, write a value to a database table, process a data load, etc. In this case, however, the purpose seems to be gathering distributed data for easier analysis. I presume you need to view the exact data you fetch, without aggregating or processing it much, so you can’t simply do your fetch, perform some action, and be done with it.

If this is the case, I think your best bet may be to build this ad hoc MapReduce job via the link above and then, in the reduce step, write the results to an S3 location from which you can pull down the file and analyze it. Another option would be to store them in a persistable type dedicated to holding the results of your job. I imagine that type doesn’t currently exist, though, and you probably don’t want to wait for its creation and promotion from the repository to the environment you are querying.

Is this accurate?
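If the S3 route is taken, one simple format for the reduce-step output is newline-delimited JSON; the sketch below covers only the serialization, since the actual S3 upload call depends on the environment and is omitted:

```javascript
// Serialize an array of measurement records as newline-delimited JSON
// (one record per line), a convenient format to write to S3 and later
// pull down for analysis.
function toNdjson(records) {
  return records.map(r => JSON.stringify(r)).join('\n') + '\n';
}

// Parse the file back, ignoring the trailing empty line.
function fromNdjson(text) {
  return text.split('\n').filter(line => line.length > 0).map(JSON.parse);
}
```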



After the JSMapReduceJob completes, you can call job.results() to get the results. You should not need to persist them yourself just so that you can read them later.
