I was introduced to the Parquet format back in 2015. At the time, I was tasked with working on an Impala-based system that used Parquet to store its data. My impression was that Parquet was some technology built on top of HDFS and that you needed some sort of distributed, Hadoop-based system to work with it. That impression was not accurate.
A few years later, the Arrow package for R came out and, much to my surprise, it had support for Parquet. Suddenly Parquet seemed to be freed from HDFS, and I could plunk huge swaths of data into cute little folders directly in my computer’s filesystem. What a powerful tool I suddenly had. It was a great alternative to all the other dirty data formats I had dealt with in the past.
Since then, we’ve standardized our data pipeline tools on Parquet. It’s a great format for data storage and transfer. Unlike CSV, it has a true standard for storing data along with their types. It compresses data down very nicely, leaving the files smaller than most of the CSV or proprietary-format files we’ve seen. It is fast to read and write, making it a great intermediate storage format for our data pipelines. And, thanks to Arrow, all the languages we use in our tools have first-class support for working with Parquet. Also, shout out to DuckDB for being the SQLite of Parquet.
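To make the point about types concrete, here is a minimal sketch, using only Python's standard library, of what happens to a record when it round-trips through CSV. The column names (`member_id`, `paid_amount`, `is_inpatient`) are made up for illustration and are not from our actual data.

```python
import csv
import io

# A hypothetical claims-like record with an int, a float, and a bool.
rows = [{"member_id": 1001, "paid_amount": 250.75, "is_inpatient": True}]

# Write the record to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read it back: CSV carries no type information, so every value is a string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
csv_types = {k: type(v).__name__ for k, v in round_tripped[0].items()}

print(original_types)
print(csv_types)
```

Every value comes back as a string, so whoever reads the CSV has to guess or be told the types. A Parquet file stores the schema alongside the data, which removes that guesswork.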
This past winter, one of our data vendors actually offered to send us data in Parquet format. It was amazing. AMAZING. We downloaded the files from them and within a minute I was able to query the data, get row counts and column types, and confirm that we had, indeed, received all the records we expected. It was unlike any other data ingestion experience I’ve ever had. No unzipping, no CSV tools, no proprietary formats. The usual pre-ETL steps were almost completely unnecessary, and we could move right into transforming the data into GDM.
When I first encountered it, I never thought I’d be such a fan of Parquet, but I now sincerely hope that it continues to become the standard for transferring claims data between vendors and researchers.