A Common File Format for Python Pandas and R Data Frames

Click here to learn more about author Steve Miller.

I’ve been doing analysis on a Chicago Crime data set off and on for the last few months, using the now ubiquitous Jupyter Notebook to manage my work. Trouble is, I like to switch between data science language leaders R and Python, using the best of each for data munging, visualization, and analytics. Alas, much as I love having two good choices, it can be a challenge ensuring I’m always working with the same data in both environments. What’d be nice is a common file format for both, along with language-specific APIs for read/write.

Enter Feather, “A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow”. Feather provides a common read/write abstraction between Python and R dataframes, so that feather files written in Python can be read in R and vice versa. The library authors are open source data munging luminaries Wes McKinney, author of pandas, and Hadley Wickham, who’s re-invented data management and graphics in R. Working with early releases of the libraries was not without “opportunities”, but later versions have proven more reliable.

In the exercise below, I illustrate writing Feather files from dataframes in both R and Python, subsequently reading the just-written files in each. I “cheat” in this example, switching the kernel between R and Python cell by cell, a practice discouraged by the development team. Though living dangerously now, I’m pretty confident Jupyter will soon add the ability to manage kernels by individual cells.

Starting in R, read a CSV file of 6M+ records and 22 attributes into a dataframe, then write that dataframe to a feather file. Note the CSV read time of 35 seconds and the feather write time of under 7 seconds.

[Image: s-miller-110216-image1]

Next, read the same data into a Python pandas dataframe and write that dataframe to a feather file. The pandas read/write timings of 28 and 7 seconds are similar to those with R.

[Image: s-miller-110216-image2]

Now read both just-written feather files into R dataframes. Note the 20-second total load time for the two 6M+ record structures.

[Image: s-miller-110216-image3]

Finally, read the same two feather files into pandas dataframes, noting the even speedier aggregate performance of 12 seconds.

[Image: s-miller-110216-image4]

Though it’s still in its infancy, I like Feather a lot, and see it as an important step in the evolution of data science programming platforms. I anticipate continued progression of feather performance and capabilities, and look forward to additional interoperability between Python and R.
