Chunked Reading for Large Files
Process files that exceed RAM by reading in chunks with chunksize in read_csv and aggregating results incrementally.
When Files Exceed RAM
A common bottleneck in production data pipelines is a CSV file that is larger than available RAM. If a file is 20 GB and the machine has 16 GB of RAM, pd.read_csv('file.csv') will typically fail with a MemoryError, and a parsed DataFrame can need even more memory than the file occupies on disk. The solution is chunked reading: load the file in fixed-size pieces, process each piece, and accumulate the results. This lets you analyse very large files without ever holding the whole file in memory, provided the accumulated results themselves stay small.
import os
import pandas as pd
import numpy as np
# Simulate a large CSV by writing one
np.random.seed(0)
sample = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=100000, freq='h'),
    'region': np.random.choice(['North','South','East','West'], 100000),
    'sales': np.random.randint(100, 1000, 100000)
})
sample.to_csv('/tmp/large_sales.csv', index=False)
# Report the on-disk size; this small file stands in for a much larger one
print(f'File size: {os.path.getsize("/tmp/large_sales.csv")/1e6:.1f} MB (simulated)')
print('Rows:', len(sample))
chunksize Parameter in read_csv
Pass chunksize=n to pd.read_csv() to get a TextFileReader iterator instead of a DataFrame. Each iteration yields the next n rows as a DataFrame (the final chunk may be shorter), so only about n rows are held in memory at a time. A reasonable starting value for chunksize is 100,000 rows: small enough to fit in RAM but large enough to keep I/O overhead manageable. Adjust it based on your available RAM and column count.
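One rough way to adjust chunksize for a given machine is to read a small probe chunk, measure how many bytes each row actually occupies in memory, and divide a memory budget by that figure. The sketch below assumes a hypothetical stand-in file (demo_sales.csv), a probe of 1,000 rows, and an arbitrary 50 MiB budget per chunk; none of these values come from the lesson. It also uses the reader as a context manager, which requires pandas 1.2 or later.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small stand-in CSV (hypothetical data mirroring the lesson's columns)
rng = np.random.default_rng(0)
n_rows = 10_000
demo = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], n_rows),
    'sales': rng.integers(100, 1000, n_rows),
})
path = os.path.join(tempfile.mkdtemp(), 'demo_sales.csv')
demo.to_csv(path, index=False)

# Read one small probe chunk and measure its in-memory footprint
probe_rows = 1_000  # assumed probe size
with pd.read_csv(path, chunksize=probe_rows) as reader:
    probe = next(reader)
bytes_per_row = probe.memory_usage(deep=True).sum() / len(probe)

# Assumed budget: how much RAM a single chunk may occupy
budget_bytes = 50 * 1024**2  # 50 MiB, an illustrative figure only
suggested_chunksize = max(1, int(budget_bytes // bytes_per_row))

print(f'~{bytes_per_row:.0f} bytes per row')
print('Suggested chunksize:', suggested_chunksize)
```

Treat the result as a starting point rather than a guarantee: intermediate copies made while processing a chunk can push peak memory well above the size of the chunk itself.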
import pandas as pd
# Read file in chunks of 25,000 rows
chunk_iter = pd.read_csv('/tmp/large_sales.csv', chunksize=25000)
# Peek at the first chunk
first_chunk = next(chunk_iter)
print('First chunk shape:', first_chunk.shape)
print(first_chunk.head(3))
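The lesson summary mentions aggregating results incrementally. A minimal sketch of that step, under stated assumptions, is to keep small running totals and update them once per chunk. The file name demo_sales.csv, the chunk size of 2,500, and the variable names are hypothetical; the region and sales columns mirror the simulated file used earlier.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in file (hypothetical data with region and sales columns)
rng = np.random.default_rng(0)
n_rows = 10_000
demo = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], n_rows),
    'sales': rng.integers(100, 1000, n_rows),
})
path = os.path.join(tempfile.mkdtemp(), 'demo_sales.csv')
demo.to_csv(path, index=False)

# Running totals: only these small Series persist between chunks
sales_sum = pd.Series(dtype='float64')
row_count = pd.Series(dtype='float64')

with pd.read_csv(path, chunksize=2_500) as reader:
    for chunk in reader:
        grouped = chunk.groupby('region')['sales']
        # fill_value=0 covers regions absent from a chunk or from the totals so far
        sales_sum = sales_sum.add(grouped.sum(), fill_value=0)
        row_count = row_count.add(grouped.count(), fill_value=0)

# Per-chunk means cannot simply be averaged; derive the mean from sum and count
summary = pd.DataFrame({
    'total_sales': sales_sum,
    'mean_sales': sales_sum / row_count,
})
print(summary)
```

Sums and counts combine cleanly across chunks, which is why the mean is computed only at the end. Statistics such as the median or an exact distinct count do not decompose this way and need a different approach.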