Pandas & NumPy Academy · Lesson

Streaming CSV with chunksize

Read a large CSV in fixed-size chunks with pd.read_csv(chunksize=), process each chunk, and concatenate or accumulate results.

The Problem with Large CSV Files

When a CSV file is larger than available RAM (say, a 50 GB log file on a machine with 16 GB of memory), calling pd.read_csv('file.csv') typically either raises a MemoryError or pushes the system into heavy swapping, which makes it unusably slow. The solution is chunked reading: instead of loading the entire file at once, you process it in fixed-size pieces and accumulate results, so the full dataset is never held in memory at the same time.

The chunksize Parameter in read_csv

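Passing chunksize=N to pd.read_csv() returns a TextFileReader iterator rather than a DataFrame. Each iteration yields a DataFrame of at most N rows; typically only the last chunk is shorter. The file is read lazily: rows are parsed only when you ask for the next chunk, and the iterator can be consumed only once. You can use it in a for loop or pass it to pd.concat(), but concatenating every unfiltered chunk rebuilds the whole DataFrame in memory, so pd.concat() only helps after each chunk has been filtered or reduced. Choose a chunksize large enough for efficient I/O (e.g., 10,000–100,000 rows) but small enough that one chunk fits comfortably in memory; the right value depends on how many columns the file has and how wide they are.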

import pandas as pd

# Returns a TextFileReader iterator, NOT a DataFrame
chunks = pd.read_csv('sales_data.csv', chunksize=10000)
print(type(chunks))  # <class 'pandas.io.parsers.readers.TextFileReader'>

for chunk in chunks:
    print(f'Chunk shape: {chunk.shape}')
    # process each chunk independently
    break   # just show the first chunk here
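
The loop above only inspects the first chunk. As a minimal sketch of the "accumulate or concatenate" step, the example below keeps running totals across chunks and collects only the filtered rows before calling pd.concat(). The temporary file, the column names region and amount, the 200.0 threshold, and the tiny chunksize of 3 are all assumptions made up for illustration; a real file would use its own columns and a much larger chunksize.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway CSV so the sketch runs anywhere.
# The column names 'region' and 'amount' are hypothetical.
sample = pd.DataFrame({
    'region': ['north', 'south', 'north', 'east', 'south', 'north', 'east'],
    'amount': [120.0, 80.0, 200.0, 50.0, 300.0, 40.0, 500.0],
})
csv_path = os.path.join(tempfile.mkdtemp(), 'sales_sample.csv')
sample.to_csv(csv_path, index=False)

total_rows = 0
total_amount = 0.0
large_sales = []   # holds filtered pieces only, never whole chunks

# Using the reader as a context manager closes the file handle when the
# loop finishes (available for chunked readers in pandas 1.2 and later).
with pd.read_csv(csv_path, chunksize=3) as reader:
    for chunk in reader:
        # Accumulate: scalars stay small no matter how big the file is
        total_rows += len(chunk)
        total_amount += chunk['amount'].sum()
        # Filter first, then keep only the rows that matter
        large_sales.append(chunk[chunk['amount'] >= 200.0])

# ignore_index=True renumbers the combined rows from 0
large_sales_df = pd.concat(large_sales, ignore_index=True)

print(total_rows, total_amount)
print(large_sales_df)
```

Peak memory here is roughly one chunk plus whatever survives the filter, which is why filtering before pd.concat() matters: appending whole chunks to the list would rebuild the full file in memory.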
