Some series don't get normalized because the normalization queue never completely drains on 7.6


#1

I’m trying to integrate measurements at MINUTE interval, every quarter hour, on static/console/. Those measurements get normalized in the background (the AFTER_QUERY setting).

We had reduced the number of workers; the NormalizationQueue never drains to 0, but we assumed that was fine because we thought the queue was a FIFO.

However, we (the customer) have just noticed that normalization is sometimes late (more than one day) for some series. Measurements are successfully integrated every 15 minutes, but some normalized time series are not updated.

If we pause the JmsDataLoadQueue and let the NormalizationQueue drain, the issue vanishes and the problematic series are normalized.

This suggests that series are properly added to the queue, but that some entries never (or rarely) get processed because they are superseded by other series. This is not acceptable for our application: we can’t serve customers data that is one or two days stale, even if only (for instance) 5% of customers are impacted.
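The starvation pattern described above can be illustrated with a toy model (this is not the actual NormalizationQueue implementation, just a sketch of the hypothesis): if workers tend to pick up the most recently enqueued work, a series that is updated rarely can wait indefinitely behind series that are re-enqueued every cycle.

```javascript
// Toy model of starvation in a non-FIFO work queue.
// Assumption (hypothetical): workers take the newest entries first
// (LIFO-like behavior) and capacity per cycle is below the arrival rate.

function simulate(ticks, hotSeriesCount, capacityPerTick) {
  const queue = [];          // pending series ids, newest at the end
  const processed = new Set();

  queue.push("cold-series"); // enqueued once, early on

  for (let t = 0; t < ticks; t++) {
    // Hot series get fresh measurements every tick and are re-enqueued.
    for (let i = 0; i < hotSeriesCount; i++) queue.push(`hot-${i}`);

    // Workers process a limited number of entries per tick, newest first.
    for (let c = 0; c < capacityPerTick && queue.length > 0; c++) {
      processed.add(queue.pop());
    }
  }
  return processed;
}

// 10 hot series per tick, capacity for only 8: the queue never drains
// and the cold series is never reached.
console.log(simulate(100, 10, 8).has("cold-series"));  // → false
// With capacity above the arrival rate the queue drains and it is processed.
console.log(simulate(100, 10, 12).has("cold-series")); // → true
```

This matches the symptom reported above: pausing inbound load (so the queue finally drains) lets the starved series get normalized.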

For example, to reproduce:

  • Run the following evalMetrics and check the latest value / last-change timestamp:

    var ems = Facility.evalMetrics({
      ids: ["FCT-000511532637"],
      expressions: ["ElectricityNormalizedIndex"],
      start: "2018-05-18",
      end: "2018-06-01",
      interval: "FIVE_MINUTE"
    });

Then compare with the last measurements:

  • c3Grid(Measurement.fetch({filter: 'parent.id == "ID_NUM"', order: 'descending(start)', limit: 10}))

A 15- to 30-minute delay is expected with AFTER_QUERY. Hours (or days) is not (or if it is, then we need a fix).
For example, we have seen 11204.95 kWh at 1:35 PM in the evalMetrics, while the last measurement is 11205.108 kWh at 2:39 PM.
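To quantify the lag, one can compare the timestamp of the latest normalized point against the latest raw measurement. A minimal sketch in plain JavaScript, with the times hard-coded from the example above (the date itself is an assumption, chosen inside the queried range):

```javascript
// Normalization lag, in minutes, between the latest raw measurement
// and the latest normalized point.
function lagMinutes(lastNormalized, lastMeasurement) {
  const ms = new Date(lastMeasurement) - new Date(lastNormalized);
  return Math.round(ms / 60000);
}

// Normalized index last updated at 1:35 PM; last raw measurement at 2:39 PM.
console.log(lagMinutes("2018-06-01T13:35:00Z", "2018-06-01T14:39:00Z")); // → 64
```

Anything well beyond the expected 15–30 minute window flags a series as affected.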

To resolve this, we tried Cluster.setWorkerMaxConcurrentComputes(12) when the CPU of the workers was slightly above 40%. Normalizing 30 minutes of data with 10 workers now takes 20 minutes (instead of 25 minutes with Cluster.setWorkerMaxConcurrentComputes(8)). CPU is now at 70-80%.

We also tried Cluster.setWorkerMaxConcurrentComputes(16), but CPU hits 100% and we don’t see any improvement in the overall time.
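The figures above suggest sub-linear scaling: raising concurrency 1.5x (8 → 12) only produced a 1.25x speedup, and 16 gave no further gain at 100% CPU, which is consistent with the workers being CPU-bound. A quick sanity check on those numbers (all taken from the measurements above):

```javascript
// Observed effect of raising maxConcurrentComputes from 8 to 12.
const speedup = 25 / 20;          // wall-clock time: 25 min -> 20 min
const concurrencyRatio = 12 / 8;  // 1.5x more concurrent computes
const efficiency = speedup / concurrencyRatio;

console.log(speedup);    // → 1.25
console.log(efficiency); // ≈ 0.83: each extra compute slot is only ~83% effective
```

With efficiency already dropping at 12 and flatlining at 16, further concurrency increases on the same hardware are unlikely to help.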

Two issues we’re looking for more insight on:

  1. We expect the normalization of 15 minutes of data to be done in less than 15 minutes. What can we do now to improve normalization throughput without adding workers? Can we expect a patch of AFTER_QUERY for 7.6 (backporting the fix planned for 7.8)?

  2. From the customer’s discussions with Ops in their COE, Ops observed that the workers may be reinitialized to Cluster.setWorkerMaxConcurrentComputes(8) by cluster autoscaling. How can we/the customer ensure that Cluster.setWorkerMaxConcurrentComputes(12) is maintained over time?


#2

@artliou

One approach you can try is to set up a cron job that compacts the normalization queue entries (it merges the various entries for the same series into one).

You can do that using something like:

{
  "type": "CronJob",
  "description": "Cron Job to trigger the compaction of normalization queue every 1 minute",
  "action": {
    "actionName": "compact",
    "typeName": "NormalizationQueue"
  },
  "concurrent": false,
  "scheduleDef": {
    "cronExpression": "0 0/5 * * * ?",
    "skipOverdue": true
  },
  "inactive": false,
  "trackHistory": true,
  "id": "trigger-normalization-queue-compaction-every-5mins",
  "version": 1,
  "name": "Cron Job to trigger the compaction of normalization queue every 5 minutes"
} 
  1. Regarding keeping setWorkerMaxConcurrentComputes at 12: this should be filed as a ticket; currently these properties are transient and get reset on system restart.