Pandas & NumPy Academy · Lesson

Efficient Data Types for Memory Reduction

Downcast numeric columns to int32/float32 and convert string columns to Categorical to cut DataFrame memory by up to 70%.

Why Data Types Affect Performance

In Pandas, every column has a dtype (data type) that determines how its values are stored in memory and how fast operations on it run. Pandas defaults to wide types when loading data: int64 (8 bytes per value), float64 (8 bytes per value), and object for strings. An object column stores a pointer per row to a separate Python string object, so its cost varies with string length, often 50-200 bytes per string. For millions of rows, choosing smaller types can reduce memory by 50-80% and can speed up operations by 2-5x, because more values fit in the CPU cache.

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'id': np.arange(1000000, dtype='int64'),
    'score': np.random.uniform(0, 100, 1000000).astype('float64'),
    'category': np.random.choice(['A','B','C','D'], 1000000)
})

mem = df.memory_usage(deep=True)
print('Memory per column:')
print(mem)
print(f'Total: {mem.sum() / 1e6:.1f} MB')
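
To make the per-value cost concrete, the following minimal sketch prints the width of each fixed-size numeric dtype and compares the same one million integers stored as int64 and as int32. The variable names (s64, s32, bytes64, bytes32) are illustrative and not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Bytes used per value by each fixed-width numeric dtype
for name in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']:
    print(f'{name}: {np.dtype(name).itemsize} byte(s) per value')

# The same values stored at two widths (illustrative names)
s64 = pd.Series(np.arange(1_000_000, dtype='int64'))
s32 = s64.astype('int32')  # safe here: every value fits in int32

# index=False counts only the column's values, not the index
bytes64 = s64.memory_usage(index=False)
bytes32 = s32.memory_usage(index=False)
print(f'int64: {bytes64 / 1e6:.1f} MB, int32: {bytes32 / 1e6:.1f} MB')
```

For fixed-width dtypes the total is simply the row count times the item size, which is why halving the width halves the column's memory.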

Downcasting Integer Columns

If a column's integer values fit in a smaller range, downcast it from int64 to a narrower integer type. pd.to_numeric(series, downcast='integer') selects the smallest signed integer type that can hold every value in the column: int8 (-128 to 127, 1 byte), int16 (-32,768 to 32,767, 2 bytes), or int32 (about ±2.1 billion, 4 bytes). If none of these fits, the column stays int64. An int8 column uses one-eighth the memory of the same column stored as int64.

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'age': np.random.randint(18, 90, 500000).astype('int64'),
    'score': np.random.randint(0, 100, 500000).astype('int64'),
    # dtype='int64' avoids an out-of-bounds error on platforms where randint defaults to int32
    'large_id': np.random.randint(0, 2**31, 500000, dtype='int64')
})

# Downcast each int64 column to the smallest integer type that holds all its values
for col in df.select_dtypes('int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')

print('Dtypes after downcasting:')
print(df.dtypes)
print(f'\nMemory: {df.memory_usage(deep=True).sum()/1e6:.2f} MB')
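
The lesson summary also mentions float32 and Categorical conversions. A minimal sketch of both follows, assuming a small synthetic frame with hypothetical column names score and category. Note that float32 keeps only about 7 significant decimal digits, so this conversion is appropriate only when that precision is sufficient.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data for illustration only (hypothetical column names)
df = pd.DataFrame({
    'score': rng.uniform(0, 100, n),                  # float64 by default
    'category': rng.choice(['A', 'B', 'C', 'D'], n),  # strings
})
before = df.memory_usage(deep=True).sum()

# float64 -> float32: half the bytes per value, at reduced precision
df['score'] = df['score'].astype('float32')

# Strings -> category: each distinct label is stored once,
# and every row holds a small integer code pointing to it
df['category'] = df['category'].astype('category')

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f'Before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB')
```

Categorical pays off most when a column has few distinct values relative to its length, as with the four labels here. A column of mostly unique strings gains little and may even grow.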
