How to guarantee 'exactly once' data processing in DFEs

#1

I have a data flow analytic that is triggered hourly and is defined as follows:

@DFE(period='Hour', interval='Hour')
type SmartBulbLongLifeInput mixes CompoundDataFlowEvent<SmartBulb> {
  . . .
}

I understand that this analytic is triggered on data arrival, which gives it at-least-once semantics: if the same event arrives multiple times, it is processed multiple times.

How can I guarantee exactly-once semantics, so that the DFE is robust to receiving the same event multiple times?

Would a separate type that stores the DFE state, and is used to check whether an event has already been handled, be a viable solution?

#2

Check out the flattenWindows field of the @DFE annotation.

#3

@bachr Currently the analytic is invoked every time a data change happens. In distributed computing it is advisable to make your function idempotent, that is, to have it deterministically produce the same output given the same input.
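To illustrate what idempotent means here, below is a minimal Python sketch, not platform code. The `results` store, the `process` function, and the `(bulb_id, window_start)` key are all hypothetical names chosen for the example. The idea is that the output is written under a deterministic key, so a repeated invocation overwrites the previous result instead of adding a second one.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory result store, keyed by (bulb_id, window_start).
results: Dict[Tuple[str, str], float] = {}


def process(bulb_id: str, window_start: str, readings: List[float]) -> float:
    """Compute the window average and upsert it under a deterministic key."""
    value = sum(readings) / len(readings) if readings else 0.0
    # Overwrite rather than append: re-running with the same input
    # leaves the store in exactly the same state.
    results[(bulb_id, window_start)] = value
    return value


# The same window delivered twice yields one stored result.
first = process("bulb-1", "2020-01-01T10:00", [1.0, 2.0, 3.0])
second = process("bulb-1", "2020-01-01T10:00", [1.0, 2.0, 3.0])
```

With this shape, a duplicate invocation costs some compute but does not change the outcome.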

The system will do its best to invoke the analytic exactly once for the time period affected. But if there are multiple time ranges influencing the analytic, the analytic could get invoked multiple times depending on the data arrival rate.

Alternatively, the analytic could store state in a type that records whether it has already been invoked for a given input, and then use that state in the process function to decide whether to execute the rest of the logic.
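A rough sketch of that state-based guard, written in plain Python rather than as a platform type. `ProcessedWindows`, `mark_if_new`, and `process_once` are hypothetical names, and the in-memory set stands in for whatever persisted type would hold the state.

```python
from dataclasses import dataclass, field
from typing import Callable, Set, Tuple


@dataclass
class ProcessedWindows:
    """Hypothetical stand-in for a persisted state type."""
    seen: Set[Tuple[str, str]] = field(default_factory=set)

    def mark_if_new(self, source_id: str, window_start: str) -> bool:
        """Record the key and return True only the first time it is seen."""
        key = (source_id, window_start)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


def process_once(state: ProcessedWindows, source_id: str, window_start: str,
                 handler: Callable[[str, str], None]) -> bool:
    """Run handler only if this (source, window) pair was not handled before."""
    if not state.mark_if_new(source_id, window_start):
        return False  # duplicate invocation: skip the rest of the logic
    handler(source_id, window_start)
    return True


state = ProcessedWindows()
calls = []
ran_first = process_once(state, "bulb-1", "10:00", lambda s, w: calls.append((s, w)))
ran_again = process_once(state, "bulb-1", "10:00", lambda s, w: calls.append((s, w)))
```

Note that this sketch marks the window before running the handler, so a handler failure would leave the window marked but unprocessed. In a real deployment the check, the mark, and the side effects would need to be made atomic or retried.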

#4

@rohit.sureka It seems that in my case I can use flattenWindows, as suggested by @ColumbusL, and then filter out duplicates, if any.
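For the duplicate filtering step, something along these lines is what I have in mind. This is a plain Python sketch, and it assumes each event can be identified by a couple of fields. The field names `id` and `start` and the function `drop_duplicates` are made up for the example and are not part of the DFE API.

```python
from typing import Iterable, List, Tuple


def drop_duplicates(events: Iterable[dict],
                    key_fields: Tuple[str, ...] = ("id", "start")) -> List[dict]:
    """Keep the first occurrence of each event, preserving arrival order."""
    seen = set()
    unique = []
    for event in events:
        # Build a hashable identity from the chosen (hypothetical) fields.
        key = tuple(event[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


events = [
    {"id": "bulb-1", "start": "10:00", "value": 5},
    {"id": "bulb-1", "start": "10:00", "value": 5},  # same event, delivered twice
    {"id": "bulb-1", "start": "11:00", "value": 7},
]
deduped = drop_duplicates(events)
```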