Behavior of normalization "all" when integrating multiple measurements

#1

Hi,

We are integrating measurements with auto normalization (“all”). Measurements arrive in batches (i.e. one canonical, multiple measurements). I’ve observed that the normalization queue (as shown in the InvalidationQueue.countAll() output) often grows larger than the total number of distinct Timeseries we have (sometimes up to 4 times larger), and I am trying to understand why. When this happens, we often have multiple measurement canonical uploads in status “PROCESSING”.

Therefore I am wondering if a normalization is triggered for each measurement (or each canonical) or if they are properly grouped together.

For instance, if I have measurements 1 & 2 for timeseries A in canonical C1, and measurement 3 for timeseries A in canonical C2:

  • Will the integration of 1 then 2 (from C1) trigger a single normalization (at least if A has not yet been normalized when 2 is integrated into Cassandra)?
  • Will the integration of 3 add another item to the normalization queue even when A has not yet been normalized (because the normalization of A due to 1 & 2 is still in the queue)?

Thanks

#2

Every measurement that is created puts an entry in the normalization queue.
If the queue already has an entry in it (because all threads on all workers are occupied), entries are MERGED as appropriate when they are put into the queue. For example, if a measurement arrives for TS1 at t=1 and there is already an entry in the queue for TS1 at t=0, the two entries are merged into a single normalization for TS1 over t=(0,1).
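
For illustration, the merge behaviour described above can be sketched in Python. This is only a toy model under stated assumptions, not the platform's implementation: the names MergingQueue, QueueEntry, put, pop and count_all are made up for this example, and it assumes entries are merged at the moment they are put into the queue.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    """Hypothetical queue entry: one timeseries and the time range to normalize."""
    series_id: str
    start: int
    end: int


class MergingQueue:
    """Toy queue that keeps at most one pending entry per timeseries."""

    def __init__(self):
        self._pending = {}  # series_id -> QueueEntry, in insertion order

    def put(self, series_id, t):
        entry = self._pending.get(series_id)
        if entry is None:
            # No pending entry for this series yet: create one.
            self._pending[series_id] = QueueEntry(series_id, t, t)
        else:
            # Widen the pending entry instead of adding a second one.
            entry.start = min(entry.start, t)
            entry.end = max(entry.end, t)

    def pop(self):
        """Hand the oldest pending entry to a worker, or None if empty."""
        series_id = next(iter(self._pending), None)
        if series_id is None:
            return None
        return self._pending.pop(series_id)

    def count_all(self):
        return len(self._pending)


queue = MergingQueue()
queue.put("TS1", 0)
queue.put("TS1", 1)        # merged into the pending TS1 entry
print(queue.count_all())   # 1
print(queue.pop())         # QueueEntry(series_id='TS1', start=0, end=1)
```

If a worker had popped the TS1 entry between the two put calls, the second measurement would have created a fresh entry, so the same two measurements would lead to two normalizations instead of one.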

#3

@rileysiebel - does every measurement create an entry that is subsequently merged, or is it once per batch/series?

The reason I am asking is that we observed substantial performance improvements when we switched from loading data like this:

Series1,2018-01-01T00:00:00
Series2,2018-01-01T00:00:00
Series3,2018-01-01T00:00:00
Series1,2018-01-01T00:00:15
Series2,2018-01-01T00:00:15
Series3,2018-01-01T00:00:15

to this:

Series1,2018-01-01T00:00:00
Series1,2018-01-01T00:00:15
Series2,2018-01-01T00:00:00
Series2,2018-01-01T00:00:15
Series3,2018-01-01T00:00:00
Series3,2018-01-01T00:00:15

According to @rohit.sureka, sorting inbound data helps because we create fewer redundant entries in the queues.

(This, of course, only matters for large loads that span across multiple batches.)

#4

@yaroslav The second ordering is definitely better. Files currently get chunked into batches, and one entry is put in the queue per batch/series combination, so keeping the rows of each series together produces fewer entries.
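
As a rough illustration of the "one entry per batch/series combination" rule, here is a small Python sketch that counts those combinations for both orderings. The function name count_queue_entries and the batch size of 2 are assumptions made for this example only; the actual chunk size is not stated in this thread.

```python
def count_queue_entries(rows, batch_size):
    """Count distinct (batch, series) pairs when rows are chunked in file order."""
    pairs = set()
    for index, (series, _timestamp) in enumerate(rows):
        # Rows 0..batch_size-1 fall in batch 0, the next batch_size in batch 1, etc.
        pairs.add((index // batch_size, series))
    return len(pairs)


interleaved = [
    ("Series1", "2018-01-01T00:00:00"),
    ("Series2", "2018-01-01T00:00:00"),
    ("Series3", "2018-01-01T00:00:00"),
    ("Series1", "2018-01-01T00:00:15"),
    ("Series2", "2018-01-01T00:00:15"),
    ("Series3", "2018-01-01T00:00:15"),
]

# Tuples sort by series first, then by ISO timestamp.
by_series = sorted(interleaved)

print(count_queue_entries(interleaved, batch_size=2))  # 6
print(count_queue_entries(by_series, batch_size=2))    # 3
```

Under these assumptions the interleaved file yields twice as many queue entries as there are series, which would be consistent with a queue count that exceeds the number of distinct timeseries until entries are merged.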

@lerela
Will the integration of 1 then 2 (from C1) trigger a single normalization (at least if A has not been yet normalized when 2 is integrated to Cassandra)?

  • This could result in 1 or 2 normalization calls, depending on whether the entries were merged in the queue before they were processed.

Will the integration of 3 add another item to the normalization queue even when A has not been yet normalized (because the normalization of A due to 1 & 2 is still in the queue)?

  • Yes, a new entry will be put in the queue for 3 as well. All three entries could be merged in the queue, depending on when the compaction job that merges entries runs, but it is also possible that the entries get picked up before the compaction happens.