Reading and writing files from/to Amazon S3 with Pandas

Using the boto3 library and s3fs-supported pandas APIs

[Image: a pandas data frame being read from and written to files on Amazon S3]

Contents

  • Write pandas data frame to CSV file on S3
      • Using boto3
      • Using s3fs-supported pandas API
  • Read a CSV file on S3 into a pandas data frame
      • Using boto3
      • Using s3fs-supported pandas API
  • Summary

⚠ Please read before proceeding

To follow along, you will need to install the following Python packages:

  • boto3
  • s3fs
  • pandas

There was an issue with dependency resolution when both boto3 and s3fs were specified as dependencies in the same project; see this GitHub issue if you’re interested in the details. Fortunately, the issue has since been resolved, and you can learn more about that on GitHub.

Before the issue was resolved, if you needed both packages (e.g. to run the following examples in the same environment, or more generally to use s3fs for convenient pandas-to-S3 interactions and boto3 for other programmatic interactions with AWS), you had to pin your s3fs to version “≤0.4” as a workaround (thanks Martin Campbell).

Before the issue was resolved:

python -m pip install boto3 pandas "s3fs<=0.4"

After the issue was resolved:

python -m pip install boto3 pandas s3fs


💭 You will notice in the examples below that while we need to import boto3 and pandas, we do not need to import s3fs, despite needing to install the package. The reason is that we use boto3 and pandas directly in our code, but we won’t use s3fs directly. Still, pandas needs it to connect with Amazon S3 under the hood.

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915).
Release notes for pandas version 0.20.1

Write pandas data frame to CSV file on S3

Using boto3
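
Below is a minimal sketch of the boto3 approach, assuming a hypothetical bucket ("my-bucket") and object key ("files/df.csv"): serialize the data frame to an in-memory text buffer with io.StringIO, then upload the buffer’s contents with the S3 client’s put_object method.

import io

import boto3
import pandas as pd

# Example data frame to upload; replace with your own data.
df = pd.DataFrame({"name": ["alice", "bob"], "age": [34, 29]})

# Serialize the data frame to an in-memory text buffer, then upload
# the buffer's contents to S3.
with io.StringIO() as csv_buffer:
    df.to_csv(csv_buffer, index=False)

    s3_client = boto3.client("s3")
    response = s3_client.put_object(
        Bucket="my-bucket",        # hypothetical bucket name
        Key="files/df.csv",        # hypothetical object key
        Body=csv_buffer.getvalue(),
    )

# put_object returns metadata about the request, including the HTTP status.
status = response["ResponseMetadata"]["HTTPStatusCode"]
print(f"S3 put_object returned HTTP status {status}")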

Using s3fs-supported pandas API
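
With s3fs installed, pandas can write straight to an S3 URL. A minimal sketch, again assuming a hypothetical bucket and key:

import pandas as pd

# Example data frame to upload; replace with your own data.
df = pd.DataFrame({"name": ["alice", "bob"], "age": [34, 29]})

# pandas hands the s3:// URL to s3fs under the hood; no boto3 code needed.
df.to_csv("s3://my-bucket/files/df.csv", index=False)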

Read a CSV file on S3 into a pandas data frame

Using boto3
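
A minimal sketch of the boto3 approach to reading, assuming the same hypothetical bucket and key: fetch the object with the S3 client’s get_object method and pass the response’s streaming Body, a file-like object, directly to pd.read_csv.

import boto3
import pandas as pd

s3_client = boto3.client("s3")

# The response's "Body" is a streaming, file-like object that
# pd.read_csv can consume directly.
response = s3_client.get_object(Bucket="my-bucket", Key="files/df.csv")
df = pd.read_csv(response["Body"])

print(df.head())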

Using s3fs-supported pandas API
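
And the s3fs-supported equivalent, a one-liner once the package is installed (same hypothetical bucket and key):

import pandas as pd

# pandas resolves the s3:// URL via s3fs under the hood.
df = pd.read_csv("s3://my-bucket/files/df.csv")

print(df.head())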

Summary

You may want to use boto3 if you are using pandas in an environment where boto3 is already available and you have to interact with other AWS services too.

However, using boto3 requires slightly more code, and makes use of io.StringIO (“an in-memory stream for text I/O”) and Python’s context manager (the with statement). Those are two additional things you may not have known about, or wanted to learn or think about, just to “simply” read/write a file to Amazon S3.

I do recommend learning them, though; they come up fairly often, especially the with statement. But pandas accommodates those of us who “simply” want to read and write files from/to Amazon S3 by using s3fs under the hood to do just that, with code that even novice pandas users would find familiar.

aws_credentials = { "key": "***", "secret": "***", "token": "***" }
df = pd.read_csv("s3://...", storage_options=aws_credentials)

or

aws_credentials = { "key": "***", "secret": "***", "token": "***" }
df.to_csv("s3://...", index=False, storage_options=aws_credentials)

Thank you for reading!

