Pandas & NumPy Academy · Lesson

Parquet: Fast Columnar Storage

Write a Pandas DataFrame to Parquet with to_parquet(), read it back faster than CSV, and use column pruning to load only needed fields.

What Is Parquet?

Apache Parquet is a columnar binary file format designed for analytical workloads. Unlike CSV, which stores data row by row as text, Parquet stores the values of each column together, compresses and encodes them column by column, and records the schema and per-column statistics in file metadata. A query that needs only a few columns from a wide table can therefore skip the rest of the file, which is usually much faster than parsing every row of a CSV. Parquet is the de facto standard format for data lakes built on object stores such as AWS S3 and Google Cloud Storage, and it is supported by Pandas (through pyarrow or fastparquet), Dask, Spark, and BigQuery.

Writing Parquet with to_parquet()

Converting a Pandas DataFrame to Parquet is a single method call: df.to_parquet('output.parquet'). By default Pandas uses engine='auto', which tries pyarrow first (install with pip install pyarrow) and falls back to fastparquet if pyarrow is unavailable. Parquet stores a typed schema alongside the data, so most column dtypes survive a round trip; a datetime column, for example, comes back as datetime rather than as strings. Compression is controlled by the compression parameter: 'snappy' (the default) offers fast reads and writes with moderate compression, while 'gzip' typically produces smaller files at the cost of speed.

import pandas as pd
import numpy as np

# Create a sample DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100000, freq='min'),  # one row per minute
    'value': np.random.randn(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

# Write to Parquet
df.to_parquet('data.parquet', index=False, compression='snappy')
print('Written to data.parquet')
