Pandas & NumPy Academy · Lesson

Avoiding iterrows and Python Loops

Replace row-by-row loops with vectorised column operations, np.where, and pd.cut for speedups that are often 100x or more on large DataFrames.

The Cost of Row-by-Row Iteration

Pandas is built on NumPy, which processes entire arrays at once in compiled C code. When you iterate row by row with for _, row in df.iterrows(), you bypass this and fall back to the Python interpreter: each row is materialised as a new Series object, and every operation on it pays Python-level overhead. The loop typically costs roughly 100-1000x as much as the equivalent vectorised operation, with the exact factor depending on the data and the work done per row. For a 1-million-row DataFrame, this can mean minutes instead of milliseconds.

import pandas as pd
import numpy as np
import timeit

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randn(10000), 'b': np.random.randn(10000)})

# Slow: iterrows loop
def loop_version(df):
    result = []
    for _, row in df.iterrows():
        result.append(row['a'] + row['b'])
    return pd.Series(result)

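# Fast: vectorised column addition
def vectorised_version(df):
    return df['a'] + df['b']

# The loop is timed over fewer runs because each call is much slower
t_loop = timeit.timeit(lambda: loop_version(df), number=5)
t_vec = timeit.timeit(lambda: vectorised_version(df), number=500)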

print(f'Loop (5 runs):       {t_loop:.3f}s total')
print(f'Vectorised (500 runs): {t_vec:.3f}s total')
print(f'Speedup: ~{(t_loop/5)/(t_vec/500):.0f}x')
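Part of the per-row cost comes from building a Series for every row. The short sketch below uses a hypothetical two-column frame (the names df_mixed, count, and weight are illustrative assumptions) to show that the row really is a new Series, and that a row mixing integers and floats is upcast to a single dtype along the way.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
df_mixed = pd.DataFrame({'count': [1, 2, 3], 'weight': [0.5, 1.5, 2.5]})

# iterrows yields (index, row) pairs; take the first row only
_, first_row = next(df_mixed.iterrows())

print(type(first_row).__name__)                       # Series
print(pd.api.types.is_integer_dtype(df_mixed['count']))  # True: the column holds integers
print(first_row.dtype)                                # float64: the row was upcast to fit one Series
```

Because the row no longer carries the original column dtypes, values read inside the loop may not have the type you expect, which is a correctness concern on top of the speed cost.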

Replace Conditional Loops with np.where

np.where(condition, value_if_true, value_if_false) is the vectorised equivalent of an element-wise if/else. Instead of iterating over rows and writing an if statement, pass a boolean condition and the two outcomes, each of which may be a scalar or an array that broadcasts against the condition. np.where evaluates the entire column at once in compiled code, which typically makes it 50-200x faster than an equivalent Python loop on large DataFrames, with the exact factor depending on data size and dtype.
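The outcomes do not have to be constants. As a minimal sketch, assuming a hypothetical 10% discount on prices above 50 (the frame, the column names, and the discount rule are illustrative, not part of the lesson's dataset), both branches can be whole columns:

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the discount rule is an illustrative assumption
prices = pd.DataFrame({'price': [20.0, 50.0, 80.0, 100.0]})

# Both outcomes are arrays: the discounted price where price > 50,
# and the unchanged price everywhere else
prices['final'] = np.where(prices['price'] > 50,
                           prices['price'] * 0.9,
                           prices['price'])

print(prices)
```

Only the rows with a price strictly above 50 are discounted; the row at exactly 50 keeps its original value.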

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'price': np.random.uniform(10, 100, 100000)})

# Slow: iterrows
def label_loop(df):
    labels = []
    for _, row in df.iterrows():
        if row['price'] > 50:
            labels.append('expensive')
        else:
            labels.append('cheap')
    return labels

# Fast: np.where
def label_vectorised(df):
    return np.where(df['price'] > 50, 'expensive', 'cheap')

# Verify equivalence on a small subset
assert list(label_loop(df.head(100))) == list(label_vectorised(df.head(100)))
print('Results match!')
print('Fast version result sample:', label_vectorised(df)[:5])
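When there are more than two outcomes, nesting np.where calls quickly becomes hard to read. pd.cut assigns each value to a labelled bin in a single vectorised call. The following is a minimal sketch; the bin edges and the tier names low, mid, and high are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical prices and bin edges, chosen only for illustration
prices = pd.Series([10, 25, 30, 50, 75], name='price')

# Bins are right-inclusive by default: (0, 25], (25, 50], (50, 100]
tiers = pd.cut(prices, bins=[0, 25, 50, 100], labels=['low', 'mid', 'high'])

print(tiers.tolist())  # ['low', 'low', 'mid', 'mid', 'high']
```

The result is a categorical Series, so it also tends to use less memory than a column of repeated Python strings. Values that fall outside every bin become NaN, so the outer edges should cover the full range of the data.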
