Basic statistics for loaded raw data

Hi,

I was wondering whether there is a way to compute basic statistics (e.g., mean, min, max, std) for raw time series data (not normalized) on the platform. I'd appreciate any help with this.

Thanks

@SinaPakazad We are working on a feature to generate stats from raw data. It is targeted for 7.10, or possibly later by V8. Sit tight :smiley:

We actually need that now. Can we have a prototype?

Run c3ShowType(DigestSummary) and look at the fromFile() method. It needs a mapping to an existing type in order to parse the data. This feature is still being implemented and has yet to be fully tested, but you can give it a shot.
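
For illustration only, here is a minimal sketch of how this might be invoked from a C3 Jupyter notebook, assuming the usual c3 remote handle. The argument names (the file location and the target type to map onto) are assumptions for readability, not the confirmed fromFile() signature, so check c3ShowType(DigestSummary) for the actual parameters:

```python
# Hypothetical sketch only: the exact fromFile() signature is unconfirmed,
# so treat the argument names below as placeholders.
# Assumes a C3 Jupyter notebook where `c3` is the usual remote type handle.
digest = c3.DigestSummary.fromFile(
    file="raw-data/sensor_readings.csv",  # hypothetical file location
    targetType="SensorReading",           # hypothetical existing type used to parse the data
)

# The digest should hold per-field summaries (min, max, mean, std, etc.),
# as described later in this thread.
print(digest)
```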

@SinaPakazad We have been working with @Jake.Dong to scope out the user stories for these features and have spent the past 2 months laying the groundwork for them. We should be able to make significant progress on, or complete, the remaining features in the next 4 weeks if you can wait. You should sync up with Jake to ensure that all of your requirements are covered.

Thanks @tony.li and @garrynigel. That is good to hear, but we need these ASAP, so we’ll put something together for now and switch over once the feature becomes available.

@SinaPakazad How much data are you talking about and how is it stored?

@garrynigel, @tony.li we’re trying to get data to answer the following questions:

  1. Did the customer truly extract all time-series data files to a C3-hosted FileSystem? Mistakes are often made while extracting data. Currently, we use Jupyter to compute summary and temporal statistics on the raw files and compare them against a baseline or a claim such as “all our tags (sensors) have second-level data all the time” (a rough sketch of that kind of check follows this list).

  2. How did the transformation process change the data distribution? Post-transformation and pre-normalization, what is the transformation loss? We need temporal and summary statistics per measurement series on the raw time-series data.

  3. Temporal and summary statistics on post-normalized persisted data (historical and incremental). CanonicalImportStats and TargetStats are useful, but I believe they log the attempts made to persist; we’ve found the numbers derived from logs to be well above the actuals.
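
For reference, here is a rough, minimal sketch of the kind of check described in item 1, assuming plain CSV extracts with tag, timestamp, and value columns (the file and column names are made up) and pandas available in the Jupyter kernel:

```python
import pandas as pd

# Hypothetical file and column names; assumes raw CSVs with one row per reading.
df = pd.read_csv("raw/tag_readings.csv", parse_dates=["timestamp"])

# Summary statistics per tag (count, mean, std, min, max, quartiles) on the raw values.
summary = df.groupby("tag")["value"].describe()
print(summary)

# Temporal statistics: sampling interval per tag, to check a claim like
# "second-level data all the time" against what was actually extracted.
intervals = (
    df.sort_values("timestamp")
      .groupby("tag")["timestamp"]
      .diff()
      .dt.total_seconds()
)
print(intervals.groupby(df["tag"]).median())
```

Comparing the per-tag sampling intervals against the claimed resolution (e.g., 1 second) is what flags gaps or missing files introduced during extraction.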

@uday-kanwar

  1. SourceStatus and skipped/unchanged stats should help you answer this question.
  2. DigestSummary should be able to take a stream of objs, so we should be able to run summaries on persisted data after transformation and before kicking off normalization.
  3. The new log framework gives counts of how many objs came in and how many were actually persisted. Logs (objCount in Splunk) only report how many came in, hence they are always well above actuals. Again, since DigestSummary will take a stream of objs, running temporal statistics on post-normalized data should be naturally supported.

Hello @uday-kanwar,

As @garrynigel said, you should be able to get the statistics by passing a stream of objects to the DigestSummary APIs. For each numeric field, the digest gives you the min, max, average, standard deviation, and an approximate histogram; for each string field, it gives you the approximate unique count (~1% error) and the top-k most frequent elements (again, an approximation) over the stream of objs.

What does “second-level data” mean?

Shankar

PS: The reason we provide approximations rather than exact quantities is that computing the exact quantities requires O(n) memory, which does not scale well.
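
To make that concrete, here is a small generic Python sketch (not the DigestSummary implementation; the data is made up) contrasting the two cases: min, max, mean, and standard deviation can be maintained in constant memory over a stream with Welford's online update, whereas an exact unique count or exact top-k has to remember every distinct value seen, which is O(n) memory in the worst case:

```python
import math
from collections import Counter

def stream_stats(values):
    """Constant-memory running min/max/mean/std via Welford's online algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        n += 1
        lo, hi = min(lo, x), max(hi, x)
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return {"count": n, "min": lo, "max": hi, "mean": mean, "std": std}

def exact_string_stats(values, k=3):
    """Exact unique count and top-k: must keep every distinct value, i.e. O(n) memory."""
    counts = Counter(values)  # grows with the number of distinct values seen
    return {"unique": len(counts), "top_k": counts.most_common(k)}

print(stream_stats([3.0, 1.5, 4.0, 2.5]))
print(exact_string_stats(["pump_a", "pump_b", "pump_a", "pump_c", "pump_a"]))
```

Sketch-based structures (approximate histograms, distinct-count and top-k sketches) trade that O(n) memory for a small bounded error, which is the trade-off described above.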
