Streaming CSV with chunksize
Read a large CSV in fixed-size chunks with pd.read_csv(chunksize=), process each chunk, and concatenate or accumulate results.
The Problem with Large CSV Files
When a CSV file is larger than available RAM (say, a 50 GB log file on a machine with 16 GB of memory), calling pd.read_csv('file.csv') can fail with a MemoryError or push the system into heavy swapping, making it unusably slow. The solution is chunked reading: instead of loading the entire file at once, you process it in fixed-size pieces, accumulating results without ever holding everything in memory simultaneously.
The chunksize Parameter in read_csv
Passing chunksize=N to pd.read_csv() returns a TextFileReader iterator rather than a DataFrame. Each iteration yields a DataFrame of at most N rows; only the last chunk may be shorter. The file is read lazily: rows are parsed only when you ask for the next chunk. The iterator can be used in a for loop or passed to pd.concat(), but concatenating every chunk unchanged rebuilds the full DataFrame in memory, so filter or reduce each chunk first. Choose a chunksize large enough for efficient I/O (e.g., 10,000–100,000 rows) but small enough that one chunk fits comfortably in memory.
import pandas as pd

# Returns a TextFileReader iterator, NOT a DataFrame
chunks = pd.read_csv('sales_data.csv', chunksize=10000)
print(type(chunks))  # <class 'pandas.io.parsers.readers.TextFileReader'>

for chunk in chunks:
    print(f'Chunk shape: {chunk.shape}')
    # process each chunk independently
    break  # just show the first chunk here
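The two ways of combining results mentioned above, accumulating and concatenating, can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the region and amount columns, the tiny in-memory CSV, and the chunksize of 4 are all made up so the example runs without a file on disk. It also assumes pandas 1.2 or later, where the reader can be used as a context manager.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large file on disk
csv_text = "region,amount\n" + "\n".join(
    f"{'east' if i % 2 == 0 else 'west'},{i}" for i in range(10)
)

total_rows = 0
total_amount = 0
east_parts = []

# Using the reader as a context manager closes it once the loop is done
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    for chunk in reader:
        # Accumulate: keep only running totals, not the chunk itself
        total_rows += len(chunk)
        total_amount += chunk["amount"].sum()
        # Filter first, then collect only the rows of interest
        east_parts.append(chunk[chunk["region"] == "east"])

# Concatenate the filtered pieces into one (much smaller) DataFrame
east = pd.concat(east_parts, ignore_index=True)
print(total_rows, total_amount, len(east))
```

The running totals need only a few numbers no matter how large the file is, while the filtered concat is safe only when the rows that survive the filter fit in memory.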