Pandas & NumPy Academy · Lesson

Categorical Data Type

Convert low-cardinality string columns to pandas Categorical to reduce memory usage and speed up groupby operations.

What Is the Categorical Dtype?

The Categorical dtype in pandas is designed for columns that contain a limited set of discrete values (low cardinality), such as gender, status, region, or product category. Instead of storing the full string for every row, pandas stores each unique value (called a category) once and keeps a small integer code per row that points to it. This is similar to how databases use lookup tables or enum types. The memory savings shrink as the number of unique values approaches the number of rows, so the conversion pays off mainly for columns with many repeats.

import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'North', 'East', 'South'] * 1000
})

# Without Categorical: object dtype stores every string
print('object dtype memory:', df['region'].memory_usage(deep=True))

# With Categorical: only stores 3 unique values + integer codes
df['region_cat'] = df['region'].astype('category')
print('category dtype memory:', df['region_cat'].memory_usage(deep=True))
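
The lookup-table idea can be checked directly. The following is a minimal sketch, not part of the original lesson, that pulls out the categories and the per-row codes and rebuilds the original strings from them. The variable names (region_cat, rebuilt, and so on) are illustrative only.

```python
import numpy as np
import pandas as pd

region = pd.Series(['North', 'South', 'East', 'North', 'East', 'South'] * 1000)
region_cat = region.astype('category')

categories = region_cat.cat.categories   # unique values, stored once
codes = region_cat.cat.codes             # one small integer per row

# Indexing the array of categories with the codes rebuilds the original strings
rebuilt = np.asarray(categories)[codes.to_numpy()]

n_categories = len(categories)
codes_dtype = str(codes.dtype)
round_trip_ok = bool((rebuilt == region.to_numpy()).all())

print(n_categories, codes_dtype, round_trip_ok)
```

With only three categories, pandas can hold each code in a single byte (int8), which is where most of the memory saving comes from.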

Creating a Categorical Series

Convert a column to Categorical by calling .astype('category') on any Series. The resulting Series exposes the unique values as .cat.categories and the per-row integer codes as .cat.codes. With .astype('category'), pandas infers the categories from the data and sorts them, which is why 'high' comes before 'low' in the output below. To control the categories and their order yourself, create the Categorical directly with pd.Categorical() and pass the categories up front.

import pandas as pd

s = pd.Series(['low', 'high', 'med', 'high', 'low'])
s_cat = s.astype('category')

print(s_cat.cat.categories)  # Index(['high', 'low', 'med'], dtype='object')
print(s_cat.cat.codes)       # integer codes per row
# 0    1
# 1    0
# 2    2
# 3    0
# 4    1
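
To illustrate pd.Categorical() with categories given up front, here is a small sketch that declares a logical order (low < med < high) instead of the alphabetical one inferred by .astype('category'). The levels list and variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical ordering for this example: low < med < high
levels = ['low', 'med', 'high']

s = pd.Series(
    pd.Categorical(
        ['low', 'high', 'med', 'high', 'low'],
        categories=levels,
        ordered=True,
    )
)

# Codes now follow the declared order, not alphabetical order
codes = s.cat.codes.tolist()

# Sorting and comparisons respect the declared order
sorted_values = s.sort_values().tolist()
above_low = (s > 'low').tolist()

print(codes)          # [0, 2, 1, 2, 0]
print(sorted_values)  # ['low', 'low', 'med', 'high', 'high']
print(above_low)      # [False, True, True, True, False]
```

Passing ordered=True is what enables comparisons such as s > 'low'. Without it, the categories still keep the listed order for sorting, but inequality comparisons raise a TypeError.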
