Avoiding iterrows and Python Loops
Replace row-by-row loops with vectorised column operations, np.where, and pd.cut to achieve 10-100x speedups.
The Cost of Row-by-Row Iteration
Pandas is built on NumPy, which processes entire arrays at once using compiled C code. When you iterate row by row with for _, row in df.iterrows(), you bypass this and fall back to pure Python: each row is materialised as a new Series object, and the loop body runs in the Python interpreter. That per-row overhead often makes the loop two or more orders of magnitude slower than the equivalent vectorised operation, with the exact ratio depending on the operation and the DataFrame size. For a 1-million-row DataFrame, this can mean minutes instead of milliseconds.
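The per-row Series extraction can be observed directly. The following is a minimal sketch on a made-up two-row frame (the column names and values are illustrative only, not part of the benchmark):

```python
import pandas as pd

# Hypothetical tiny frame: one integer column, one float column
small = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})

# Pull the first (index, row) pair that iterrows() yields
_, first_row = next(small.iterrows())

print(type(first_row))   # each row arrives as a pandas Series
print(small['a'].dtype)  # dtype of the integer column in the frame
print(first_row.dtype)   # dtype of the row Series built by iterrows()
```

Because every row is packed into a single Series, columns of different dtypes are upcast to a common dtype, so the integer in column a comes back as a float. Building that object once per row is part of the overhead that vectorised operations avoid.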
import pandas as pd
import numpy as np
import timeit
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randn(10000), 'b': np.random.randn(10000)})
# Slow: iterrows loop
def loop_version(df):
    result = []
    for _, row in df.iterrows():
        result.append(row['a'] + row['b'])
    return pd.Series(result)
# Fast: vectorised (df['a'] + df['b'], timed inline below)
t_loop = timeit.timeit(lambda: loop_version(df), number=5)
t_vec = timeit.timeit(lambda: df['a'] + df['b'], number=500)
print(f'Loop (5 runs): {t_loop:.3f}s total')
print(f'Vectorised (500 runs): {t_vec:.3f}s total')
print(f'Speedup: ~{(t_loop/5)/(t_vec/500):.0f}x')
Replace Conditional Loops with np.where
np.where(condition, value_if_true, value_if_false) is the vectorised equivalent of an element-wise if/else. Instead of iterating rows and writing an if statement, pass a boolean condition and the two outcome values, each of which may be a scalar or an array. np.where evaluates this in C for the entire column at once, which often makes it 50-200x faster than an equivalent Python loop for large DataFrames.
import pandas as pd
import numpy as np
import timeit
np.random.seed(0)
df = pd.DataFrame({'price': np.random.uniform(10, 100, 100000)})
# Slow: iterrows
def label_loop(df):
    labels = []
    for _, row in df.iterrows():
        if row['price'] > 50:
            labels.append('expensive')
        else:
            labels.append('cheap')
    return labels
# Fast: np.where
def label_vectorised(df):
    return np.where(df['price'] > 50, 'expensive', 'cheap')
# Verify equivalence on a small subset
assert list(label_loop(df.head(100))) == list(label_vectorised(df.head(100)))
print('Results match!')
print('Fast version result sample:', label_vectorised(df)[:5])
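The lesson summary also names pd.cut, which covers the case where a loop assigns one of several labels based on numeric ranges. The following is a rough sketch rather than a benchmark; the bin edges (30 and 70) and the three tier names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'price': rng.uniform(10, 100, 1000)})

# Hypothetical thresholds: (0, 30] cheap, (30, 70] mid, (70, inf) expensive
bins = [0, 30, 70, np.inf]
labels = ['cheap', 'mid', 'expensive']

# Vectorised binning: one call labels the whole column
df['tier'] = pd.cut(df['price'], bins=bins, labels=labels)

# Equivalent Python loop, kept only to check the result
def tier_loop(prices):
    out = []
    for p in prices:
        if p <= 30:
            out.append('cheap')
        elif p <= 70:
            out.append('mid')
        else:
            out.append('expensive')
    return out

assert df['tier'].astype(str).tolist() == tier_loop(df['price'])
print(df['tier'].value_counts())
```

pd.cut returns a categorical column, and by default each bin includes its right edge, which is why the loop uses <= comparisons. A value outside every bin would become NaN, so the outer edges here are chosen to cover the whole price range.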