Pandas & NumPy Academy · Lesson

Pattern Matching with Regex

Extract substrings and check patterns with .str.extract(), .str.contains(), and .str.match() using regular expressions.

Why Regex in Pandas?

Regular expressions (regex) are a language for describing string patterns. In Pandas, the .str accessor exposes regex through contains(), match(), extract(), findall(), and replace(). Regex is the right tool when a pattern is too complex for a literal substring or startswith()/endswith() check: for example, validating email formats, extracting phone numbers from free text, or finding all dollar amounts in a notes column.

import pandas as pd

df = pd.DataFrame({
    'text': ['Order #12345 placed', 'No order', 'Ref #67890 paid', 'Refund for #11111']
})

# Find rows that contain an order number (# followed by 5 digits)
has_order = df['text'].str.contains(r'#\d{5}', regex=True)
print(df[has_order])
#                   text
# 0  Order #12345 placed
# 2      Ref #67890 paid
# 3    Refund for #11111
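
The same pattern can pull the order number out of the text instead of only testing for it. The following is a minimal sketch that reuses the sample data above; the column names order_no and all_refs are illustrative and not part of the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['Order #12345 placed', 'No order', 'Ref #67890 paid', 'Refund for #11111']
})

# A capture group (...) tells extract() which part of the match to return.
# expand=False gives a Series rather than a one-column DataFrame.
# Rows without a match get a missing value.
df['order_no'] = df['text'].str.extract(r'#(\d{5})', expand=False)

# findall() returns a list of every match in each row (an empty list if none).
df['all_refs'] = df['text'].str.findall(r'#\d{5}')

print(df['order_no'].dropna().tolist())
# ['12345', '67890', '11111']
print(df['all_refs'].tolist())
# [['#12345'], [], ['#67890'], ['#11111']]
```

extract() returns only the first match per row, so findall() is the better fit when a single string may hold several order numbers.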

.str.contains() with Regex

.str.contains(pattern) returns a boolean Series that is True wherever the pattern matches anywhere in the string; regex=True is the default, so the pattern is interpreted as a regular expression. Use the anchors ^ (start) and $ (end) to restrict where the match may occur. A common use is filtering rows whose field matches a required format, such as a date or a structured code.

import pandas as pd

df = pd.DataFrame({
    'code': ['US-001', 'UK-042', 'DE-18', 'FR-', 'us-003']
})

# Valid format: two uppercase letters, hyphen, 3 digits
valid = df['code'].str.contains(r'^[A-Z]{2}-\d{3}$', regex=True)
print(df['code'][valid].tolist())
# ['US-001', 'UK-042']
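
To see what the anchors contribute, the sketch below runs the same character pattern without them and compares .str.contains() with .str.match() and .str.fullmatch(). The sample codes are made up for illustration, .str.fullmatch() assumes pandas 1.1 or later, and na=False is used so that the missing value counts as a non-match.

```python
import pandas as pd

# Hypothetical sample values, including a missing entry
codes = pd.Series(['US-001', 'xUS-001', 'US-0012', 'us-003', None])

pattern = r'[A-Z]{2}-\d{3}'  # same pattern as above, without ^ and $

# contains(): the pattern may match anywhere in the string.
anywhere = codes.str.contains(pattern, na=False)

# match(): the pattern must match at the start; trailing characters are allowed.
from_start = codes.str.match(pattern, na=False)

# fullmatch(): the whole string must match, roughly like wrapping in ^...$.
whole = codes.str.fullmatch(pattern, na=False)

print(anywhere.tolist())    # [True, True, True, False, False]
print(from_start.tolist())  # [True, False, True, False, False]
print(whole.tolist())       # [True, False, False, False, False]
```

Only the fully anchored check rejects both 'xUS-001' (extra leading character) and 'US-0012' (extra trailing digit), which is why format validation normally needs ^ and $ or .str.fullmatch().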
