Pandas & NumPy Academy · Lesson

Detecting and Removing Duplicates

Find exact and partial duplicates with duplicated() and drop_duplicates(), and decide which duplicate record to keep.

Why Duplicates Matter

Duplicate rows silently inflate counts and revenue totals and skew averages, and pandas raises no error to warn you. A sales dataset with 500 duplicate order records will overstate revenue by the exact sum of those 500 orders. Detect and remove duplicates as one of your first cleaning steps, before any aggregation or modelling, because every downstream figure inherits the error.

import pandas as pd

df = pd.read_csv('orders.csv')
print('Shape before dedup:', df.shape)
print('Exact duplicates:', df.duplicated().sum())
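
To make the inflation concrete, here is a minimal sketch on a hypothetical four-row table. The order_id and amount columns and their values are invented for illustration and are not taken from orders.csv. It compares the revenue total before and after dropping exact duplicates.

```python
import pandas as pd

# Hypothetical toy data: order 102 was loaded twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.0, 35.0, 10.0],
})

# Summing the raw table counts order 102 twice.
inflated_total = orders["amount"].sum()

# Dropping exact duplicate rows counts each order once.
clean_total = orders.drop_duplicates()["amount"].sum()

# The overstatement equals the sum of the duplicated rows.
overstatement = inflated_total - clean_total

print("Inflated total:", inflated_total)  # 100.0
print("Clean total:", clean_total)        # 65.0
print("Overstated by:", overstatement)    # 35.0
```

The gap between the two totals is exactly the amount on the repeated row, which is why the error is easy to miss: nothing fails, the number is simply too large.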

Finding Exact Duplicates

df.duplicated() returns a boolean Series that is True for every row that is an exact copy of an earlier row, comparing all columns. The default keep='first' leaves the first occurrence unmarked and flags each later copy; keep='last' does the reverse. Passing keep=False flags every occurrence of a duplicate, which is useful when you want to inspect all copies of a duplicated record before deciding which to keep.

# Mark only the second and later occurrences (the first copy is kept)
repeat_rows = df[df.duplicated(keep='first')]
print('Rows to remove:', len(repeat_rows))

# Mark ALL copies of any duplicate
all_dups = df[df.duplicated(keep=False)]
print('Rows involved in duplicates:', len(all_dups))
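
The snippet above depends on orders.csv, so the following is a self-contained sketch of the same idea on a hypothetical five-row table (df_demo and its column names are assumptions made for illustration). It shows the masks produced by each keep option and then, as one possible approach to partial duplicates, uses the subset parameter of duplicated() and drop_duplicates() to match on order_id alone and keep the last record per order.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
# Rows 1 and 2 are exact copies. Rows 3 and 4 share an order_id but
# differ in amount, so they are partial duplicates only.
df_demo = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 3],
    "customer": ["ann", "bob", "bob", "cy", "cy"],
    "amount": [10.0, 25.0, 25.0, 40.0, 45.0],
})

# Exact duplicates: all columns must match.
first_mask = df_demo.duplicated(keep="first").tolist()
last_mask = df_demo.duplicated(keep="last").tolist()
all_mask = df_demo.duplicated(keep=False).tolist()

print(first_mask)  # [False, False, True, False, False]
print(last_mask)   # [False, True, False, False, False]
print(all_mask)    # [False, True, True, False, False]

# Partial duplicates: compare on order_id only.
partial_mask = df_demo.duplicated(subset=["order_id"], keep=False).tolist()
print(partial_mask)  # [False, True, True, True, True]

# Keep the last record for each order_id (assumes later rows are newer).
deduped = df_demo.drop_duplicates(subset=["order_id"], keep="last")
print(deduped["amount"].tolist())  # [10.0, 25.0, 45.0]
```

Whether 'first' or 'last' is the right record to keep depends on how the data was loaded; the "later rows are newer" rule here is an assumption, and sorting by a timestamp column before deduplicating is a common way to make that choice explicit.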
