Ignoring duplicates for Cassandra types


#1

I have PointMeasurements and RegisterMeasurements that I create with the following transforms:

    parent     : ~ expression { id : "TagName" }
    start      : ~ expression "dateTime(DateTime)"
    quantity   : ~ expression { value: "number(Value)" }
    statusCode : ~ expression "OpcQuality"
    dataVersion: ~ expression "toMillis(now())"

The underlying data we're receiving contains many duplicates. We tried to use dataVersion to handle this, but it's not working: we still see duplicate data (i.e. the same parent/start/quantity).

Also, looking at the declaration of those types, these seem to be the defaults:

    @db(datastore='cassandra',
        partitionKeyField='parent',
        persistenceOrder='start',
        persistDuplicates=false,
        compactType=true,
        shortId=true,
        shortIdReservationRange=100000)

With persistDuplicates=false, duplicates should be removed. Why is that not the case?


#2

I believe that to be considered a duplicate, ALL fields have to be the same, so I guess that setting dataVersion to the current time in milliseconds prevents the system from identifying the records as duplicates…
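
To illustrate what I mean, here is a minimal sketch in plain JavaScript. It is not the platform's actual implementation, only my assumption of how an "all fields equal" comparison behaves; `isDuplicate` is a hypothetical helper and the timestamps are made-up values.

```javascript
// Hypothetical helper, not a platform API: two records count as duplicates
// only if every field matches, dataVersion included.
function isDuplicate(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const key of keys) {
    // JSON.stringify is enough here because nested objects are built
    // with the same key order in this example.
    if (JSON.stringify(a[key]) !== JSON.stringify(b[key])) return false;
  }
  return true;
}

const first = {
  parent: { id: 'tag1' },
  start: '2019-01-01',
  quantity: { value: 10 },
  statusCode: '192',
  dataVersion: 1546300800000,
};
// Same reading, loaded one millisecond later.
const second = { ...first, dataVersion: 1546300800001 };

console.log(isDuplicate(first, { ...first })); // true
console.log(isDuplicate(first, second));       // false
```

With a timestamp in dataVersion, two loads of the same reading almost never compare equal, which would explain what you are seeing.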


#3

Thanks, this is what I was thinking too. I will make a test case with the same dataVersion and confirm.


#4

This sample script confirms it. The records must have the same dataVersion to be treated as duplicates:

    > PointMeasurement.create({parent: {id: 'test_BAFYSGSOFA'}, start: '2019-01-01', quantity: {value: 10}, statusCode: '192', dataVersion: 1})
    > PointMeasurement.create({parent: {id: 'test_BAFYSGSOFA'}, start: '2019-01-01', quantity: {value: 10}, statusCode: '192', dataVersion: 1})
    > c3Count(PointMeasurement, 'parent.id=="test_BAFYSGSOFA"')
    1
    > PointMeasurement.create({parent: {id: 'test_BAFYSGSOFA'}, start: '2019-01-01', quantity: {value: 10}, statusCode: '192', dataVersion: 2})
    > c3Count(PointMeasurement, 'parent.id=="test_BAFYSGSOFA"')
    2

#5

I would go further and ignore dataVersion when checking for duplicates (if possible), because dataVersion is computed by the app: real duplicates can end up with different dataVersion values (not only through a millisecond timestamp, but by any other means), and they are then treated as distinct records. In fact, the Vertuoz custom normalizer ignores dataVersion when checking for duplicates.
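
As a rough sketch of the idea (plain JavaScript, not the Vertuoz normalizer's code, which I am not reproducing here), the comparison key could be built from every field except dataVersion. `dedupKey` and `dropDuplicates` are hypothetical names, and keeping the first record seen is just one possible choice.

```javascript
// Hypothetical helper: comparison key built from every field except dataVersion.
// Assumes records are built with a consistent field order, since
// JSON.stringify follows property insertion order.
function dedupKey(record) {
  const { dataVersion, ...rest } = record;
  return JSON.stringify(rest);
}

// Keeps the first record seen for each key and drops later ones.
function dropDuplicates(records) {
  const seen = new Set();
  return records.filter((record) => {
    const key = dedupKey(record);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const records = [
  { parent: { id: 'tag1' }, start: '2019-01-01', quantity: { value: 10 }, statusCode: '192', dataVersion: 1 },
  { parent: { id: 'tag1' }, start: '2019-01-01', quantity: { value: 10 }, statusCode: '192', dataVersion: 2 },
  { parent: { id: 'tag1' }, start: '2019-01-02', quantity: { value: 10 }, statusCode: '192', dataVersion: 3 },
];

const unique = dropDuplicates(records);
console.log(unique.length); // 2
```

Whether to keep the first or the latest dataVersion for a given reading is a separate decision; the sketch only shows that dataVersion stays out of the comparison.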