Using Pandas On Excel FIle in S3


#1

Is there a way to read an .xlsx file from S3 into a pandas dataframe on the platform?

I have tried something like:

s3File = c3.S3File(url=s3_filepath)
pd.read_excel(io.BytesIO(s3File.read()))

but the s3File.read() (and also s3File.readEncoded()) returns a format not supported by io.BytesIO. Is there a better way to go about this?


#2

can you post the error? did you try just passing the filename with pd.read_excel(s3_filepath)?
Otherwise can you try the following:

data = io.BytesIO(s3File.read())
xlsx = pd.ExcelFile(data)
sheets = []
for sheet in xlsx.sheet_names:
    sheets.append(xlsx.parse(sheet))
df = pd.concat(sheets)

#3

For pd.read_excel(s3_filepath), I get a FileNotFoundError. However, that url definitely exists because c3.S3File(url=s3_filepath).read() returns a long unicode string.

For your second suggestion and my original post, the io.BytesIO line throws the following error:
TypeError: 'unicode' does not have the buffer interface.


#4

You can try:


def read_excel(file_path,c3_connector=None):
    if c3_connector == None:
        xls = pd.ExcelFile(file_path)
    else:
        string = c3_connector.File.readEncoded(file_path)
        xls = pd.ExcelFile(StringIO(string))
        
    sheet_names = xls.sheet_names
    if len(sheet_names) == 1:
        return xls.parse(sheet_names[0])
    else:
        output = {}
        for sheet in  sheet_names:
            tmp = xls.parse(sheet)
            output[sheet] = tmp
        return output ```

#5

This code suggested above now throws the error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found u'UEsDBBQA'


Reading Excel File on S3
#6

Here’s what I ended up doing:

  1. JavaScript wrapper function:
function readExcelJSWrapper(s3_filepath) {
  var jsArrayBuffer = S3File.make({url: s3_filepath, contentType: "application/octet-stream"}).readEncoded();
  var jsUint8Array = new Uint8Array(jsArrayBuffer);
  var jsIntArray = _.map(jsUint8Array, x => x);
  return readExcel(jsIntArray);
}
  1. Python Excel processing function
def read_excel_python(jsIntArray) {
  excel_bytes = bytearray(jsIntArray)
  df = pd.read_excel(io.BytesIO(excel_bytes))
  .... more processing ...
  return result
}

Key problem I faced was that S3File.readEncoding returns a unicode string in Python and that unicode string has additional bytes in its bytearray representation from the S3File.readEncoding in JS. This is why I needed to convert to an int array first in JS, so that the bytearray passed to Python would not automatically convert to the wrong unicode string.