Deduplicating Rows Safely
Removing exact and near-duplicate rows while keeping one canonical record.
The Deduplication Problem
"This table has duplicate rows. Remove them but keep one copy of each." Some flavor of this request comes up in many data-engineering interviews. The challenge is doing it safely: keeping exactly one canonical row per group of duplicates and not accidentally deleting distinct records that merely look similar.
We will cover detecting duplicates, choosing which copy to keep, deduplicating in a SELECT, and physically deleting duplicates from a table.
Define Duplicate First
The first question to ask the interviewer: "What makes two rows duplicates?" Options include:
- Exact duplicates: every column is identical.
- Key duplicates: same business key (e.g. same email), but other columns may differ.
The technique differs for each. Never assume: clarifying the duplicate definition is the single most important step, and interviewers expect you to ask.
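To see why the definition matters, here is a minimal sketch that runs both checks against the same data. It uses Python's built-in sqlite3 module only so the SQL is runnable as-is; the customers table, its columns, and the sample rows are hypothetical and invented for illustration.

```python
import sqlite3

# Hypothetical table and sample data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("ann@example.com", "Ann", "Oslo"),
        ("ann@example.com", "Ann", "Oslo"),     # exact copy of the row above
        ("bob@example.com", "Bob", "Lima"),
        ("bob@example.com", "Robert", "Lima"),  # same email, different name
        ("cy@example.com", "Cy", "Rome"),
    ],
)

# Exact duplicates: group by every column, keep groups with more than one row.
exact_dups = conn.execute(
    """
    SELECT email, name, city, COUNT(*) AS copies
    FROM customers
    GROUP BY email, name, city
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()

# Key duplicates: group by the business key only (assumed here to be email).
key_dups = conn.execute(
    """
    SELECT email, COUNT(*) AS copies
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()

print(exact_dups)
print(key_dups)
```

Under the exact definition only Ann's rows are flagged. Under the key definition Bob's rows are flagged too, even though their names differ, so removing one of them discards information unless you have decided which copy is canonical.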