Detecting and Removing Duplicates
Find exact and partial duplicates with duplicated() and drop_duplicates(), and decide which duplicate record to keep.
Why Duplicates Matter
Duplicate rows silently inflate counts and revenue totals and skew averages, all without raising an error. A sales dataset containing 500 extra copies of order records overstates revenue by exactly the sum of those 500 orders. Detect and remove duplicates as one of your first cleaning steps, before any aggregation or modelling, because every downstream figure inherits the error.
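To make the overstatement concrete, here is a minimal sketch on a made-up frame. The `orders` data, its column names, and the amounts are assumptions for illustration only, not the contents of orders.csv.

```python
import pandas as pd

# Hypothetical toy data: order 1002 was loaded twice.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [50.0, 20.0, 20.0, 30.0],
})

# Summing before deduplication counts order 1002 twice.
inflated_total = orders["amount"].sum()

# drop_duplicates() keeps the first copy of each fully identical row.
deduped = orders.drop_duplicates()
true_total = deduped["amount"].sum()

print("Total with duplicates:", inflated_total)
print("Total after dedup:", true_total)
print("Overstatement:", inflated_total - true_total)
```

The overstatement equals the amount of the one repeated row, which is the same effect the 500-order example describes at a larger scale.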
import pandas as pd
df = pd.read_csv('orders.csv')
print('Shape before dedup:', df.shape)
print('Exact duplicates:', df.duplicated().sum())
Finding Exact Duplicates
df.duplicated() returns a boolean Series that is True for every row that exactly matches an earlier row across all columns. The default keep='first' flags every occurrence except the first, and keep='last' flags every occurrence except the last. Passing keep=False flags every occurrence, which is useful when you want to inspect all copies of a duplicated record before deciding which to keep.
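A small hand-built frame shows how the keep argument changes which rows are flagged. This is only a sketch: the `toy` frame and its values are invented for illustration, with rows 0, 2 and 3 deliberately identical.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are exact copies of each other.
toy = pd.DataFrame({
    "order_id": [1, 2, 1, 1],
    "amount": [9.5, 4.0, 9.5, 9.5],
})

first_mask = toy.duplicated()             # default keep='first'
last_mask = toy.duplicated(keep="last")   # spare the last copy instead
all_mask = toy.duplicated(keep=False)     # flag every copy

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Row 1 is unique, so it is never flagged. The three masks differ only in which copy of the repeated row, if any, is spared.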
# Flag only the second and later occurrences (default keep='first')
repeat_rows = df[df.duplicated(keep='first')]
print('Rows to remove:', len(repeat_rows))
# Mark ALL copies of any duplicate
all_dups = df[df.duplicated(keep=False)]
print('Rows involved in duplicates:', len(all_dups))