Pandas & NumPy Academy · Lesson

Incremental Aggregation Across Chunks

Accumulate running counts, sums, and min/max across chunks without storing the full file in memory.

Why Incremental Aggregation?

Incremental aggregation lets you analyse datasets larger than RAM on a single machine, without distributing the computation. Instead of loading all the data to compute a statistic, you maintain running accumulators (partial sums, counts, min/max values) and update them with each chunk. Once the full file has been scanned, the final result is assembled from these lightweight accumulators. Because memory use depends on the chunk size rather than the file size, the pattern works for files far larger than memory, although a full scan is still limited by how fast the disk can be read.
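As a minimal sketch of the accumulator idea, the snippet below tracks a running minimum, maximum, and row count. The inline CSV text, the `id` and `amount` column names, and the helper name `running_min_max` are all hypothetical stand-ins for a real file on disk, and the chunk size of 2 is deliberately tiny so the loop runs several times.

```python
import io
import math

import pandas as pd

# Hypothetical data standing in for a large file; row 3 has a missing amount.
CSV_TEXT = "id,amount\n1,12.5\n2,3.0\n3,\n4,47.25\n5,8.0\n"


def running_min_max(chunks, column="amount"):
    """Scan chunks once, keeping only three small accumulators."""
    running_min = math.inf
    running_max = -math.inf
    row_count = 0
    for chunk in chunks:
        row_count += len(chunk)
        values = chunk[column].dropna()
        if values.empty:
            continue  # an all-missing chunk has nothing to compare
        running_min = min(running_min, values.min())
        running_max = max(running_max, values.max())
    return running_min, running_max, row_count


reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)
lowest, highest, rows = running_min_max(reader)
print(f"min={lowest}, max={highest}, rows={rows}")
```

Dropping missing values before comparing avoids relying on how `min` and `max` order NaN, and starting from plus and minus infinity means the first real value always replaces the initial accumulator.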

Counting Rows and Computing Mean

Computing the mean across chunks requires tracking the running sum and the running count separately. You cannot simply average the per-chunk means: chunks can hold different numbers of valid values (the last chunk is usually shorter, and missing values are skipped), so an unweighted average gives small chunks too much weight. The correct formula is total_sum / total_count. The same idea extends to any statistic that can be decomposed into per-chunk pieces: variance, correlation, and histograms all have incremental formulas.
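To make the pitfall concrete, here is a small illustration with two hypothetical chunks of unequal size, built as in-memory Series rather than read from a file. It compares the naive average of per-chunk means with the sum-and-count approach.

```python
import pandas as pd

# Hypothetical values split into two unequal chunks, as happens when the
# row count is not a multiple of the chunk size.
chunks = [pd.Series([10.0, 20.0, 30.0, 40.0]), pd.Series([100.0])]

# Naive: average the per-chunk means (25.0 and 100.0), weighting them equally.
mean_of_means = sum(c.mean() for c in chunks) / len(chunks)

# Correct: accumulate sum and count, then divide once at the end.
total_sum = sum(c.sum() for c in chunks)
total_count = sum(c.count() for c in chunks)
true_mean = total_sum / total_count

print(mean_of_means)  # 62.5
print(true_mean)      # 40.0
```

The single value in the second chunk counts as much as the four values in the first under the naive approach, which is why the two results differ.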

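The loop below applies this to a transactions.csv file with an amount column:

import pandas as pd

total_sum = 0.0
total_count = 0  # non-missing 'amount' values
total_rows = 0   # all rows, including those with a missing 'amount'

for chunk in pd.read_csv('transactions.csv', chunksize=100_000):
    total_sum += chunk['amount'].sum()      # sum() skips NaN by default
    total_count += chunk['amount'].count()  # count() excludes NaN
    total_rows += len(chunk)

# Guard against an empty file or a column with no valid values
grand_mean = total_sum / total_count if total_count else float('nan')
print(f'Rows processed: {total_rows:,}')
print(f'Grand mean: {grand_mean:.4f}')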
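Variance is mentioned above as another statistic with an incremental formula. One common way to do this, sketched here under stated assumptions, is to keep a (count, mean, M2) summary, where M2 is the sum of squared deviations from the mean, and merge each chunk's summary into the running one using the pairwise update usually attributed to Chan et al. The function names `merge_moments` and `chunked_variance` and the sample data are hypothetical.

```python
import pandas as pd


def merge_moments(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine (count, mean, M2) summaries of two disjoint batches."""
    if n_a == 0:
        return n_b, mean_b, m2_b
    if n_b == 0:
        return n_a, mean_a, m2_a
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def chunked_variance(chunks):
    """Sample variance (ddof=1) computed one chunk at a time."""
    n, mean, m2 = 0, 0.0, 0.0
    for values in chunks:
        values = values.dropna()
        if values.empty:
            continue
        chunk_mean = values.mean()
        chunk_m2 = ((values - chunk_mean) ** 2).sum()
        n, mean, m2 = merge_moments(n, mean, m2, len(values), chunk_mean, chunk_m2)
    return m2 / (n - 1) if n > 1 else float("nan")


# Hypothetical data, sliced into three uneven chunks for illustration.
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
chunks = [data.iloc[:3], data.iloc[3:5], data.iloc[5:]]

print(chunked_variance(chunks))
print(data.var())  # pandas' whole-column sample variance, for comparison
```

Merging summaries this way tends to be more numerically stable than accumulating a sum of squares and subtracting the squared mean at the end, which can lose precision when the values are large relative to their spread.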
