Pandas & NumPy Academy · Lesson

Chunked Reading for Large Files

Process files that exceed RAM by reading in chunks with chunksize in read_csv and aggregating results incrementally.

When Files Exceed RAM

A common bottleneck in production data pipelines is a CSV file larger than available RAM. If a file is 20 GB and the machine has 16 GB of RAM, pd.read_csv('file.csv') will typically fail with a MemoryError, or the operating system may kill the process first. The parsed DataFrame can also occupy more memory than the file does on disk, so a file somewhat smaller than RAM can fail too. The solution is chunked reading: load the file in fixed-size pieces, process each piece, and accumulate the results. As long as the accumulated results stay small, this lets you analyse files far larger than memory.

import os

import numpy as np
import pandas as pd

# Simulate a large CSV by writing one
np.random.seed(0)
n_rows = 100_000
sample = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=n_rows, freq='h'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'sales': np.random.randint(100, 1000, n_rows)
})
sample.to_csv('/tmp/large_sales.csv', index=False)

# Report the size on disk; this small file stands in for a much larger one
size_mb = os.path.getsize('/tmp/large_sales.csv') / 1e6
print(f'File size: {size_mb:.1f} MB (simulated)')
print('Rows:', len(sample))
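
Before deciding how to read a file, it can help to compare its size on disk with the memory a parsed sample occupies. The following is a minimal sketch, not part of the original lesson: the file name demo_sales.csv, the temporary directory, and the 1,000-row sample size are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in file so the sketch runs on its own;
# a real pipeline would point at its own CSV instead.
rng = np.random.default_rng(0)
n_rows = 10_000
demo = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], n_rows),
    'sales': rng.integers(100, 1000, n_rows),
})
path = os.path.join(tempfile.mkdtemp(), 'demo_sales.csv')
demo.to_csv(path, index=False)

# Parse only the first 1,000 rows and measure their in-memory footprint.
# deep=True counts the actual string contents, not just the pointers.
head = pd.read_csv(path, nrows=1_000)
bytes_per_row = head.memory_usage(deep=True).sum() / len(head)

disk_bytes = os.path.getsize(path)
# n_rows is known here; for a real file it would have to be estimated.
est_memory_bytes = bytes_per_row * n_rows

print(f'On disk:   {disk_bytes / 1e6:.2f} MB')
print(f'In memory: {est_memory_bytes / 1e6:.2f} MB (estimated)')
```

The per-row figure also gives a rough way to size chunks: dividing the memory you are willing to spend per chunk by bytes_per_row suggests an upper bound for the number of rows per chunk.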

chunksize Parameter in read_csv

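Pass chunksize=n to pd.read_csv() to get back a TextFileReader iterator instead of a DataFrame. Each iteration yields the next n rows as a DataFrame (the final chunk may be shorter), so only about n rows need to be in memory at a time, provided you do not keep references to earlier chunks. A reasonable starting value for chunksize is 100,000 rows: small enough to fit in RAM on most machines, but large enough to keep per-chunk overhead manageable. Adjust it based on your available RAM and on the number and types of columns. The example below uses 25,000 so that the 100,000-row sample file splits into four chunks.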

import pandas as pd

# Read the file in chunks of 25,000 rows. The with block closes the
# underlying file handle when done (context-manager support needs pandas 1.2+).
with pd.read_csv('/tmp/large_sales.csv', chunksize=25000) as chunk_iter:
    # Peek at the first chunk
    first_chunk = next(chunk_iter)

print('First chunk shape:', first_chunk.shape)
print(first_chunk.head(3))
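
Peeking at one chunk shows the mechanics; the incremental aggregation described at the top of this lesson could look roughly like the sketch below. It is an illustration under assumptions rather than a definitive recipe: the stand-in file demo_sales.csv, the chunk size of 2,500, and the choice of total sales per region as the aggregate are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in file so the example runs on its own.
rng = np.random.default_rng(0)
n_rows = 10_000
demo = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], n_rows),
    'sales': rng.integers(100, 1000, n_rows),
})
path = os.path.join(tempfile.mkdtemp(), 'demo_sales.csv')
demo.to_csv(path, index=False)

# Keep only the small per-chunk result, never the chunk itself.
partials = []
rows_seen = 0
for chunk in pd.read_csv(path, chunksize=2_500):
    partials.append(chunk.groupby('region')['sales'].sum())
    rows_seen += len(chunk)

# Combine the partial sums: one row per region per chunk, summed by region.
totals = pd.concat(partials).groupby(level=0).sum()

print(totals)
print('Rows processed:', rows_seen)
```

Sums combine cleanly because a total of partial totals equals the overall total. A per-region mean does not combine this way: averaging the chunk means gives the wrong answer when a region has different row counts in different chunks, so accumulate a sum and a count per region and divide at the end.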
