Order on a Cassandra collection


#1

Does it make sense to put a db annotation on a Cassandra collection field to order it?

For example:
@db(order='descending(start)')
data: [PointMeasurement] (parent)


#2

If you want to change the ordering on a Cassandra type, you will need to remix that type:

@db(compactType=true,
    datastore='cassandra',
    partitionKeyField='parent',
    persistenceOrder='descending(start)',
    persistDuplicates=false,
    shortId=true,
    shortIdReservationRange=100000)
remix type PointMeasurement

I just tried putting @db(order='descending(start)') on the data field of PointPhysicalMeasurementSeries and it didn't work properly.


#3

So, to be clear, we only have to specify the persistence order on the type itself, not on the fkey. Is that correct?


#4

Also, does it matter whether I put persistenceOrder='start' vs. persistenceOrder='descending(start)'?

For example, persistenceOrder='descending(start)' would make sense for fetching the latest measurements, but I would expect persistenceOrder='start' to be better for evaluating metrics.

What are the impacts during upsertion / metric eval / fetch?


#5

we only have to specify the persistence order on the type itself, not on the fkey. Is that correct?

Yep, that’s correct.

What are the impacts during upsertion / metric eval / fetch?

This is an interesting question. My understanding is that persistenceOrder just creates a clustering column in Cassandra. I'll break these down one by one:

  1. upsertion: Cassandra's write path is very efficient, and rows within a partition are kept sorted by clustering column either way, so sorting in ascending vs. descending order shouldn't have a negative impact.
  2. metric eval: metrics are evaluated on a normalized timeseries, so the actual metric evaluation shouldn't be affected at all. I wouldn't expect normalization to be affected either, since you're either a) loading the whole timeseries anyway or b) loading a small range of points that will already be clustered together (when incremental normalization is enabled).
  3. fetch: This is probably where you'll see the most impact on performance, if any. Fetching Cassandra data will still be efficient as long as the field you sort by is the first clustering column you have set, so the following two calls should have comparable performance:
PointMeasurement.fetch({filter: ..., order: 'start'})

PointMeasurement.fetch({filter: ..., order: 'descending(start)'})
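
To make the reasoning in point 3 concrete, here is a toy model in plain JavaScript. It is only an illustration, not Cassandra's storage engine or anything from the C3 API: makePartition and readPartition are hypothetical names, and the model assumes a reverse-order read is just a backward scan over rows that are already stored sorted.

```javascript
// Toy model only: a partition whose rows are kept sorted by the clustering
// column `start`. This is an assumption-laden illustration, not real Cassandra.
function makePartition(rows) {
  // Rows are sorted once, at write time, in ascending order of `start`.
  return [...rows].sort((a, b) => a.start - b.start);
}

// Read rows in stored order (forward scan) or in the exact reverse
// (backward scan). Neither direction needs a re-sort at read time.
function readPartition(partition, descending = false) {
  const out = [];
  if (descending) {
    for (let i = partition.length - 1; i >= 0; i--) out.push(partition[i]);
  } else {
    for (let i = 0; i < partition.length; i++) out.push(partition[i]);
  }
  return out;
}

const partition = makePartition([{start: 3}, {start: 1}, {start: 2}]);
const ascendingStarts = readPartition(partition).map(r => r.start);
const descendingStarts = readPartition(partition, true).map(r => r.start);
console.log(ascendingStarts, descendingStarts);
```

Both reads touch each row exactly once, which is why I'd expect the two orders to cost about the same when the sort field is the leading clustering column. Sorting by a field that is not the leading clustering column is a different story, since the stored order no longer helps.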

For example, DemandForecastMeasurement has persistenceOrder='demandForecastCreationDate, start' and I tested the following:

// sort by the way persistence order is set
console.time()
c3Grid(DemandForecastMeasurement.fetch({
  filter: Filter.eq('parent.id', "my_id"),
  order: 'demandForecastCreationDate'
}))
console.timeEnd()
default: 2656.38623046875ms


// sort by the reverse
console.time()
c3Grid(DemandForecastMeasurement.fetch({
  filter: Filter.eq('parent.id', "my_id"),
  order: 'descending(demandForecastCreationDate)'
}))
console.timeEnd()
default: 2703.223876953125ms
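
The two timings above come from a single run each, so a gap of about 50ms could easily be noise. A minimal sketch of a helper that repeats the measurement and reports min / median / max follows. timeRuns is a hypothetical name, not part of the C3 console, and it assumes the function you pass in is synchronous (as fetch appears to be in the snippets above).

```javascript
// Hypothetical helper (not a C3 API): run `fn` several times and summarize
// the wall-clock time of each run in milliseconds.
function timeRuns(fn, runs = 5) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const t0 = Date.now();
    fn(); // assumed to be synchronous
    samples.push(Date.now() - t0);
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Median: middle sample for odd counts, mean of the two middle ones for even.
  const median = sorted.length % 2
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return {runs, min: sorted[0], median, max: sorted[sorted.length - 1]};
}

// Stand-in workload so the sketch runs anywhere; in the console you would pass
// the fetch call from above instead, wrapped in an arrow function.
const summary = timeRuns(() => {
  let total = 0;
  for (let i = 0; i < 100000; i++) total += i;
  return total;
}, 4);
console.log(summary);
```

In the console, the idea would be to call timeRuns once with order: 'demandForecastCreationDate' and once with order: 'descending(demandForecastCreationDate)' and compare the medians. The first run of either may include warm-up or caching effects, so the median is probably the fairer number to compare.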