Reading and writing files from/to Amazon S3 with Pandas
Using the boto3 library and s3fs-supported pandas APIs
Contents
- Write pandas data frame to CSV file on S3
  - Using boto3
  - Using s3fs-supported pandas API
- Read a CSV file on S3 into a pandas data frame
  - Using boto3
  - Using s3fs-supported pandas API
- Summary
⚠ Please read before proceeding
To follow along, you will need to install the following Python packages:
- boto3
- s3fs
- pandas
There was an outstanding issue regarding dependency resolution when both boto3 and s3fs were specified as dependencies in a project. See this GitHub issue if you’re interested in the details. Fortunately, the issue has since been resolved, and you can learn more about that on GitHub.
Before the issue was resolved, if you needed both packages (e.g. to run the following examples in the same environment, or more generally to use s3fs for convenient pandas-to-S3 interactions and boto3 for other programmatic interactions with AWS), you had to pin your s3fs to version “≤0.4” as a workaround (thanks Martin Campbell).
Before the issue was resolved:

```
python -m pip install boto3 pandas "s3fs<=0.4"
```

After the issue was resolved:

```
python -m pip install boto3 pandas s3fs
```
💭 You will notice in the examples below that while we need to import boto3 and pandas, we do not need to import s3fs, despite needing to install the package. The reason is that we use boto3 and pandas directly in our code, but we won't use s3fs directly. Still, pandas needs it to connect with Amazon S3 under the hood.
> pandas now uses s3fs for handling S3 connections. This shouldn't break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915)

(Release notes for pandas version 0.20.1)
Write pandas data frame to CSV file on S3
Using boto3
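As the summary below explains, the boto3 route serializes the data frame into an in-memory io.StringIO buffer inside a with block, then uploads the buffer's contents with the S3 client's put_object method. A minimal sketch of that approach, where the bucket name my-bucket and key files/example.csv are placeholders:

```python
import io

import boto3
import pandas as pd

# Toy data frame to upload; replace with your own data.
df = pd.DataFrame({"name": ["ada", "grace"], "score": [95, 98]})

# Serialize the data frame to an in-memory text buffer rather than a local file.
with io.StringIO() as csv_buffer:
    df.to_csv(csv_buffer, index=False)

    s3_client = boto3.client("s3")
    response = s3_client.put_object(
        Bucket="my-bucket",       # placeholder bucket name
        Key="files/example.csv",  # placeholder object key
        Body=csv_buffer.getvalue(),
    )

# put_object returns response metadata, so we can confirm the upload succeeded.
status = response["ResponseMetadata"]["HTTPStatusCode"]
print(f"S3 put_object HTTP status: {status}")
```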
Using s3fs-supported pandas API
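With s3fs installed, pandas can write straight to an s3:// URL, so the whole operation collapses to a single to_csv call. A minimal sketch, again with a placeholder bucket and key:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ada", "grace"], "score": [95, 98]})

# With s3fs installed, pandas resolves "s3://" URLs directly; credentials
# come from the usual AWS sources (environment variables, shared config
# files, or an IAM role) unless passed explicitly via storage_options.
df.to_csv("s3://my-bucket/files/example.csv", index=False)
```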
Read a CSV file on S3 into a pandas data frame
Using boto3
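Reading with boto3 mirrors the write: fetch the object with get_object, then decode the streamed bytes into an io.StringIO buffer that read_csv can parse. As before, the bucket name and key are placeholders:

```python
import io

import boto3
import pandas as pd

s3_client = boto3.client("s3")

# Download the object; bucket and key are placeholders.
response = s3_client.get_object(Bucket="my-bucket", Key="files/example.csv")

# The response body is a stream of bytes: read it, decode it to text, and
# wrap it in an in-memory buffer that pandas can parse.
with io.StringIO(response["Body"].read().decode("utf-8")) as csv_buffer:
    df = pd.read_csv(csv_buffer)

print(df.head())
```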
Using s3fs-supported pandas API
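The s3fs-backed equivalent is a single read_csv call against the s3:// URL, with credentials picked up from the usual AWS sources (or passed explicitly via storage_options, as shown in the summary). A minimal sketch with a placeholder URL:

```python
import pandas as pd

# With s3fs installed, read_csv accepts an "s3://" URL directly.
# The bucket name and key are placeholders.
df = pd.read_csv("s3://my-bucket/files/example.csv")
print(df.head())
```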
Summary
You may want to use boto3 if you are using pandas in an environment where boto3 is already available and you need to interact with other AWS services as well.
However, using boto3 requires slightly more code, and makes use of io.StringIO ("an in-memory stream for text I/O") and Python's context manager (the with statement). Those are two additional things you may not have already known about, or wanted to learn or think about, to "simply" read/write a file to Amazon S3.

I do recommend learning them, though; they come up fairly often, especially the with statement. But pandas accommodates those of us who "simply" want to read and write files from/to Amazon S3 by using s3fs under the hood to do just that, with code that even novice pandas users would find familiar:

```python
aws_credentials = {"key": "***", "secret": "***", "token": "***"}
df = pd.read_csv("s3://...", storage_options=aws_credentials)
```

or

```python
aws_credentials = {"key": "***", "secret": "***", "token": "***"}
df.to_csv("s3://...", index=False, storage_options=aws_credentials)
```
Thank you for reading!
Articles you should read next
- 4 Cute Python Functions for Working with Dirty Data
- Improving Code Quality in Python Codebases
- How to recursively reverse a linked list
More Computing resources
Watch videos covering a variety of topics in Computing at OnelTalksTech.com