Outlier Detection and Treatment
Identify outliers with IQR fencing and Z-scores, decide whether to cap, remove, or flag them, and document decisions.
What Is an Outlier?
An outlier is a data point that differs substantially from the rest of the dataset. Outliers can be genuine (a single enterprise client with a $500,000 order in a dataset of $100 average orders) or erroneous (a negative price, or a typo that turned 120 into 12,000). The first step in outlier treatment is to decide whether the extreme value is real and meaningful or a data quality problem, because the two cases call for different handling: genuine extremes are usually kept and flagged or capped, while errors are corrected or removed.
import pandas as pd
import numpy as np
df = pd.read_parquet('sales_clean.parquet')
print(df['revenue'].describe())
Visual Outlier Detection: Box Plot
A box plot is the fastest visual tool for spotting outliers. The box spans the interquartile range (IQR, from Q1 to Q3), the whiskers extend to the most extreme data points that lie within 1.5 × IQR of the box edges, and points beyond the whiskers are plotted individually as suspected outliers. Use df['revenue'].plot(kind='box') or Seaborn's sns.boxplot() to see the outliers immediately without computing thresholds manually.
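import matplotlib.pyplot as plt
import seaborn as sns  # only needed for the sns.boxplot() alternative
# Create an explicit Axes so the title and layout can be controlled
fig, ax = plt.subplots(figsize=(6, 4))
# pandas delegates to Matplotlib; sns.boxplot(y=df['revenue'], ax=ax) is equivalent
df['revenue'].plot(kind='box', ax=ax)
ax.set_title('Revenue Distribution — Box Plot')
plt.tight_layout()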
plt.show()
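The 1.5 × IQR rule that the box plot applies visually can also be computed directly, which is useful when outliers need to be flagged or capped in the data rather than just seen. The following is a minimal sketch, not a definitive implementation: the `revenue` values are made up to stand in for `df['revenue']`, and the names `lower_fence`, `upper_fence`, `is_outlier`, and `revenue_capped` are illustrative.

```python
import pandas as pd

# Hypothetical sample standing in for df['revenue']; the values are invented.
revenue = pd.Series(
    [90, 95, 100, 105, 110, 120, 98, 102, 500_000], name="revenue"
)

# Quartiles and interquartile range.
q1 = revenue.quantile(0.25)
q3 = revenue.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles (the usual box plot rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences instead of dropping them outright.
is_outlier = (revenue < lower_fence) | (revenue > upper_fence)

# Optional treatment: cap extreme values at the fences.
revenue_capped = revenue.clip(lower=lower_fence, upper=upper_fence)

print(f"fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(revenue[is_outlier])
```

Keeping the boolean flag alongside the original column leaves the decision to cap or remove reversible and easy to document.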