Cassandra vs S3

I am working on helping a customer save on storage costs. I have a couple of questions about the choice between S3 and Cassandra. Right now the data has about 50M rows and 9 columns, and it could grow much larger, say 5B rows, in the future.

  1. I know that S3 is much cheaper than Cassandra. Is there any way to estimate the cost of storing data in Cassandra based on the features of the dataset (for example, number of rows, number of unique partition keys, etc.)?

  2. Based on the post When to use FileData, I know that normalized data will still be stored primarily in Cassandra, as it's assumed to be used more frequently. But what if the data won't be used frequently? Is there a threshold for choosing between the hot and cold storage frameworks?

  3. Are there any other rules we should be aware of when choosing a storage framework, besides cost, query frequency, and data size?

Thanks for your advice!

Here are the things you should consider when selecting a datastore (a rough code sketch of these rules of thumb follows the list):

  • Performance, scalability, reliability & availability
    • SQL: when data volume is relatively small (roughly up to 10M–100M rows), access patterns are not heavy, and latency is a concern
    • NoSQL: when data volume is larger (more than ~100M but under ~100B rows), access patterns are heavy, and latency is a concern (a cluster of nodes, so there is no single point of failure)
    • Analytical: when data has gone through ETL and aggregations need to be performed for reporting purposes, and latency is a concern (managed services such as Redshift / Athena)
    • File systems: when data volume is very large (more than ~100B rows) and access patterns are heavy but latency is NOT a concern (a cluster of nodes, so no single point of failure; managed services such as S3 or Azure Blob Storage)
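
To make these rules of thumb concrete, here is a minimal Python sketch that encodes them as a selection function. The thresholds mirror the list above, and the suggest_datastore helper is purely illustrative, not an official guideline:

# Rough encoding of the datastore rules of thumb above.
# Thresholds are approximate row counts, not hard limits.

def suggest_datastore(rows: int, latency_sensitive: bool, analytical: bool = False) -> str:
    """Suggest a datastore family for a dataset of `rows` rows."""
    if analytical:
        return "analytical store (e.g. Redshift / Athena)"
    if rows < 100_000_000:
        return "SQL database"
    if rows < 100_000_000_000 and latency_sensitive:
        return "NoSQL store (e.g. Cassandra)"
    return "file/object store (e.g. S3, Azure Blob Storage)"

# The 5B-row case from the question, under both access assumptions:
print(suggest_datastore(5_000_000_000, latency_sensitive=True))   # NoSQL store
print(suggest_datastore(5_000_000_000, latency_sensitive=False))  # file/object store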

Specifically for question 1, there is no simple way to estimate cost, because it depends on several factors. For example, S3 does not charge much for the storage itself, but if you access the data frequently, the request costs can be very high.
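
Since there is no exact formula, a back-of-the-envelope model is usually the best you can do. Below is a minimal Python sketch for S3; the per-GB and per-request prices are assumed placeholder values (check the current AWS price list for your region), and the real bill also depends on storage class, data transfer, and request mix:

# ASSUMED placeholder prices -- verify against the current AWS price list.
S3_STORAGE_PER_GB_MONTH = 0.023    # $/GB-month, standard tier (assumed)
S3_GET_PER_1K_REQUESTS = 0.0004    # $ per 1,000 GET requests (assumed)

def s3_monthly_cost(total_gb: float, get_requests_per_month: int) -> float:
    """Very rough monthly S3 cost: storage plus GET-request charges."""
    storage = total_gb * S3_STORAGE_PER_GB_MONTH
    requests = (get_requests_per_month / 1_000) * S3_GET_PER_1K_REQUESTS
    return storage + requests

# Example: 50M rows x 9 columns at ~20 bytes per value is about 9 GB raw.
rows, cols, bytes_per_value = 50_000_000, 9, 20
total_gb = rows * cols * bytes_per_value / 1e9
print(f"~{total_gb:.0f} GB raw, ~${s3_monthly_cost(total_gb, 1_000_000):.2f}/month")

The same rows x columns x bytes-per-value arithmetic gives a first-order input for Cassandra sizing as well, though the replication factor and on-disk overhead typically multiply the raw figure several times.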

For hot/cold storage (question 2), you can specify via an annotation how much data to keep in the hot store and how much in the cold store. See the PartitionBucketStrategy type for details.
// Example of storing all data before 2015 in cold storage
@db(datastore='cassandra',
    partitionKeyField='parent',
    persistenceOrder='start,end',
    persistDuplicates=false,
    shortId=true,
    shortIdReservationRange=100000,
    compactType=true,
    defaultPartitionStrategy=PartitionBucketStrategy(maxHotSortKey="start < dateTime('2015-01-01')"))
entity type TestHotColdTSData mixes TimeseriesDataPoint<TestHotColdTS> schema name 'testHotColdTSData' {
  @ts
  quantity: Dimension
}

Thank you so much for your comprehensive explanation @rohit.sureka, that is really helpful! I was wondering: if I access the data daily, will it cost a lot to keep the data on S3? Is it easier to estimate the S3 cost in this case? And is there an optimal access frequency for data on S3?