Parquet: Fast Columnar Storage
Write a Pandas DataFrame to Parquet with to_parquet(), read it back faster than CSV, and use column pruning to load only needed fields.
What Is Parquet?
Apache Parquet is a columnar binary file format designed for analytical workloads. Unlike CSV, which stores data row by row as text, Parquet stores the values of each column together, compresses each column separately, and embeds the schema and per-column statistics as metadata in the file. A query that needs only a few columns of a wide table can skip the rest, which typically makes it much faster than parsing an entire CSV file. Parquet is the de facto standard format in data lakes (on Amazon S3 or Google Cloud Storage, for example) and is supported by Pandas (through pyarrow or fastparquet), Dask, Spark, and BigQuery.
Writing Parquet with to_parquet()
Converting a Pandas DataFrame to Parquet is a single method call: df.to_parquet('output.parquet'). With the default engine='auto', Pandas tries pyarrow first and falls back to fastparquet, so at least one of them must be installed (for example, pip install pyarrow). Parquet stores a schema alongside the data, so most column dtypes survive a round trip; date columns, for instance, come back as datetimes rather than strings. The compression parameter selects the codec: 'snappy' (the default) offers fast reads and writes with moderate compression, while 'gzip' produces smaller files at the cost of speed.
import pandas as pd
import numpy as np

# Create a sample DataFrame with 100,000 rows at one-minute intervals
np.random.seed(0)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100000, freq='min'),  # 'T' is deprecated since pandas 2.2
    'value': np.random.randn(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

# Write to Parquet (index=False leaves the DataFrame index out of the file)
df.to_parquet('data.parquet', index=False, compression='snappy')
print('Written to data.parquet')