Retrieve CSVs into Dataset

Hi All,
I have N gzip-compressed CSV files stored in S3.
My goal is to read them and create a Dataset from them.

How can I do it?

Thank you

G.

Hi ! What did you try ? What went wrong ?

Hi Nathan!
Unfortunately, I don’t know how. I looked for a way, but so far I haven’t tried any code.
If I have CSV_A at my_path/csv_a.gz and the separator is “,”, which JS commands read it and load it into a c3 dataset?

Thank you

Hi Giantommasso, without seeing the problem you are running into, it’s pretty hard to help you.

Off the top of my head, you could:

  1. Create and open a TemporaryFile (https://docs.python.org/3.6/library/tempfile.html)
  2. Copy the S3 file to that temporary file (check File.readEncoded)
  3. Open that file as you would normally do from python (gzip + pandas)
  4. Do whatever you need to with the data
  5. Close the file

Concatenation can then be done in memory (alternatively, you can append the uncompressed rows of each file to a single temporary CSV file and read it in one go).

Hi Louis,
what I’m looking for is a way to read a CSV and use it as a dataset in the JavaScript environment, not in Python.
Given the path my_path/csv_a.gz, how can I read csv_a as a dataset in the console?

Couple of disclaimers:

  • This assumes the CSV files you mentioned are not large. If they are large, this is definitely not a recommended approach; you should use streams to process large files
  • I’m not all that familiar with the Dataset type, so there may be a bug or two in this code :)

Anyone who is familiar with using streams for reading in files, please chime in with a better recommendation, but here’s something quick and dirty that shows the basic mechanism:

var columnDelimiter = ",";
var filePaths = ["my_path/csv_a.gz", "another_path/csv_a.gz"];
var dataset = null;

_.each(filePaths, function (path) {
  // make a `File` type pointing at the content you want to read
  var file = File.make({ url: path });

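  // Read the gzipped file and decode it
  // Split content by newline char (skipping blank lines, e.g. a trailing newline),
  // then by whatever delimiter is used in the file
  var doubleArry = File.decode(file.readEncoded(), file.contentType, file.safeContentEncoding())
    .split("\n")
    .filter(function (row) {
      return row !== "";
    })
    .map(function (row) {
      return row.split(columnDelimiter).map(function (col) {
        // isNaN("") is false, so keep empty cells as strings instead of turning them into NaN
        return col !== "" && !isNaN(col) ? parseFloat(col) : col;
      });
    });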

  // Create a new Dataset from the 2d array derived from your csv
  var width = doubleArry[0].length;
  var height = doubleArry.length - 1;

  var tempDs = Dataset.make({
    data: _.flatten(doubleArry.slice(1)),
    orient: 'row',
    shape: [height, width],
    indices: {
      '1': doubleArry[0]
    }
  });

  // Add the new Dataset to the rest of your data
  dataset = !!dataset ? Dataset.concatenateDatasets([dataset, tempDs]) : tempDs;
});
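
If it helps, the text-to-rows step can also be pulled out into a standalone helper that you can try in any JavaScript console without touching File or Dataset. This is only a sketch: parseCsvText and toNumberIfNumeric are hypothetical names, not platform functions, and it assumes a simple CSV (no quoted fields, no delimiter inside a value). It skips blank lines and leaves empty cells as empty strings.

```javascript
// Hypothetical helper (not a platform API): convert one cell to a number
// when it looks numeric, otherwise return it unchanged.
function toNumberIfNumeric(cell) {
  var trimmed = cell.trim();
  // isNaN("") is false, so empty cells have to be checked explicitly
  if (trimmed === "" || isNaN(trimmed)) {
    return cell;
  }
  return Number(trimmed);
}

// Hypothetical helper: split decoded CSV text into a header row and data rows.
// Assumes a simple CSV with no quoted fields.
function parseCsvText(text, columnDelimiter) {
  var rows = text
    .split(/\r?\n/) // accept both \n and \r\n line endings
    .filter(function (line) {
      return line.trim() !== ""; // drop blank lines such as a trailing newline
    })
    .map(function (line) {
      return line.split(columnDelimiter);
    });

  var header = rows[0] || [];
  var data = rows.slice(1).map(function (row) {
    return row.map(toNumberIfNumeric);
  });
  return { header: header, data: data };
}

// Example with an empty cell and a trailing newline
var parsed = parseCsvText("id,value\na,1.5\nb,\n", ",");
```

Here parsed.header holds the column names and parsed.data holds one array per row, which is the same shape the snippet above flattens before building the Dataset.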

Hi Scott,
Starting from your post, I changed the method to:

function retriveSingleDataset(sf) {
  var idx = [];
  var columnDelimiter = ",";
  var doubleArry = File.decode(sf.readEncoded(), sf.contentType, sf.safeContentEncoding())
    .split("\n")
    .map(function (row) {
      var list = row.split(columnDelimiter).map(function (col, index) {
        var val;
        index == 0 ? idx.push(col) : (!isNaN(col) ? val = parseFloat(col) : val = col);
        return val;
      });
      return list.slice(1,)
    });
  var width = doubleArry[0].length;
  var height = doubleArry.length - 1;
  var dataset = Dataset.make({
    data: _.flatten(doubleArry.slice(1)),
    orient: 'row',
    shape: [height, width],
    indices: {
      '0': idx.slice(1),
      '1': doubleArry[0]
    }
  });
  return dataset;
}

If I run it in the console, it works properly and I get the right results, but when I provision the same method, the compiler returns a syntax error on the list.slice(1,) call. Do you know why this happens and how to solve it?

Thank you

list.slice(1,) is a function call with a trailing comma. Trailing commas in function calls are only supported starting with ES2017, which is why it runs fine in the console but fails when you provision it: the provisioning compiler does not seem to support that syntax yet. Remove the trailing comma, i.e. write list.slice(1), and everything should run as expected.
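
If you want to check whether a given environment accepts that syntax, here is a small sketch. supportsTrailingCommaInCalls is a hypothetical helper name, and it relies on new Function being allowed in the environment, which is an assumption. The call with the trailing comma is compiled from a string, so an older parser throws a catchable SyntaxError instead of rejecting the whole script.

```javascript
// Hypothetical helper: returns true when the engine parses a trailing comma
// in a function call (ES2017 and later), false when it raises a SyntaxError.
function supportsTrailingCommaInCalls() {
  try {
    // Only compiled, never run: parsing alone is enough to detect support
    new Function("return [1, 2, 3].slice(1,);");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) {
      return false;
    }
    throw e; // anything else (e.g. new Function being blocked) is not a syntax question
  }
}

// The portable form, without the trailing comma, drops the first element
var list = ["row_id", 1.5, 2.5];
var withoutFirst = list.slice(1);
```

Either way, list.slice(1) behaves the same as list.slice(1,) wherever the latter parses, so dropping the comma does not change the results.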
