Understanding removeOutliers behaviour


#1

I try to compare tsEval with rollingDiff() only and removeOutliers(rollingDiff()), on production data.
I am dealing here with consumption data on consumption meters.
Time series are PointPhysicalTimeSeries, with previous treatment, with a grain of 15 minutes.

I want to use removeOutliers since I have some unexpected very hudge values in the time series. Unfortunately I also have long period of time with no index increase (consumption = 0).

On the specific case described herebelow, I do not understand why I get mostly 0s using removeOutliers(rollingDiff()) instead of the values returned by rollingDiff().

NEMO Collect : Comparison between rollingDiff and removeOutlisers(rollingDiff())

On PROD https://engie-dhc-datacontinuum.c3iot.com/static/console/

With rollingDiff() only :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, rollingDiff(sum(sum(normalized.data.quantity))) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

With removeOutliers on 6 month span and 300 threshold value : I get only 0s…

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, (rollingDiff(sum(sum(normalized.data.quantity))) < 0 ? 0 : removeOutliers(rollingDiff(sum(sum(normalized.data.quantity))),17280,300) ) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

If I reduce span and threshold values to 30,3 : same result…except for the last value

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, (rollingDiff(sum(sum(normalized.data.quantity))) < 0 ? 0 : removeOutliers(rollingDiff(sum(sum(normalized.data.quantity))),30,3) ) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

Here is the outcome without removeOutliers on the full 2018 year, showing no outliers in 2018 :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, rollingDiff(sum(sum(normalized.data.quantity))) )) *1000)/period(1,grain()))*1000)”, start:“2018-01-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

With removeOutliers() :


#2

To be clear: do you want to remove outliers on the raw time series or on the rollingDiff?


#3

I do not want to remove them on the raw time series, but to remove them on the rollingDiif.


#4

Since the logic is the following:
Logic: TS >= window(‘MEDIAN’, TS, -span, span, 1) - threshold 1.48 window(‘MEDIAN’, abs(TS - window(‘MEDIAN’, TS, -span, span, 1)), -span, span, 1) ? (TS <= window(‘MEDIAN’, TS, -span, span, 1) + threshold 1.48 window(‘MEDIAN’, abs(TS - window(‘MEDIAN’, TS, -span, span, 1)), -span, span, 1) ? TS : 0) : 0

Could you display for the rollingDiff metric:

  • window(‘MEDIAN’, rollingDiff, -span, span, 1)
  • window(‘MEDIAN’, abs(TS - window(‘MEDIAN’, TS, -span, span, 1)), -span, span, 1)

Based on those 2 metrics we can define the upper and lower bounds and we remove what is outside of those bounds.


#5

rollingDiff only with eval and divide by period :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, rollingDiff(sum(sum(normalized.data.quantity))) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

With window and span=30 :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, window(‘MEDIAN’, rollingDiff(sum(sum(normalized.data.quantity))), -30, 30, 1) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

I do not understand why it gives 0s as a median value…

Median of distance to median :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, window(‘MEDIAN’, abs(rollingDiff(sum(sum(normalized.data.quantity))) - window(‘MEDIAN’, rollingDiff(sum(sum(normalized.data.quantity))), -30, 30, 1)), -span, span, 1) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“DAY”}))

I realize the MEDIAN comes to 0s because there are too much 0s in the QUARTER OF HOUR time serie …

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, rollingDiff(sum(sum(normalized.data.quantity))) )) *1000)/period(1,grain()))*1000)”, start:“2018-07-01T00:00:00”,end:“2018-10-01T00:00:00”,grain:“QUARTER_HOUR”}))

How to overcome this knowing that, on the contrary, I have some period of time where I have also unexpected very hudge values ?..


#6

Is it expected to have so many zeros after previous treatment normalization?
If the median does not work in your case you can always build a similar expression using the AVG for example…


#7

Yes, it is expected to have so many zeros as it related to meters which index does not increase continuously. So we have some period of time without consumption (rollingDiff gives 0), and some time to time consumption, which gives a non-zero rollingDiff.

Actually, I managed to cope with this situation (huge outliers and many zeros in rollingDiff), by combining MEDIAN and AVG, and using a 6 month span. I use AVG to calculate the distance to average instead of distance to median.

This ends up like this :

c3Viz(PointPhysicalMeasurementSeries.tsEval({filter:“id==‘ConsumptionMeter1762A1_Volume’”, projection:“treatAsUnit(‘kg_per_s’,((sum(eval(‘QUARTER_HOUR’, ( (rollingDiff(sum(sum(normalized.data.quantity))) - window(‘MEDIAN’, rollingDiff(sum(sum(normalized.data.quantity))), -3, 3, 1) + ( 300 * 1.48 * window(‘MEDIAN’, abs(rollingDiff(sum(sum(normalized.data.quantity))) - window(‘AVG’, rollingDiff(sum(sum(normalized.data.quantity))), -17280, 17280, 1)), -17280, 17280, 1) ) >= 0 ) ? ((rollingDiff(sum(sum(normalized.data.quantity))) - window(‘MEDIAN’, rollingDiff(sum(sum(normalized.data.quantity))), -17280, 17280, 1) - ( 300 * 1.48 * window(‘MEDIAN’, abs(rollingDiff(sum(sum(normalized.data.quantity))) - window(‘AVG’, rollingDiff(sum(sum(normalized.data.quantity))), -17280, 17280, 1)), -17280, 17280, 1) ) <= 0 ) ? rollingDiff(sum(sum(normalized.data.quantity))) : 0) : 0 ) )) *1000)/period(1,grain()))*1000)”, start:“2018-01-01T00:00:00”,end:“2018-10-07T00:00:00”,grain:“DAY”}))

Thanks !