Pandas & NumPy Academy · Lesson

Outlier Detection and Treatment

Identify outliers with IQR fencing and Z-scores, decide whether to cap, remove, or flag them, and document decisions.

What Is an Outlier?

An outlier is a data point that differs substantially from the rest of the dataset. Outliers can be genuine (a single enterprise client placing a $500,000 order in a dataset where the average order is $100) or erroneous (a negative price, or a typo that turned 120 into 12,000). The first step in outlier treatment is to decide whether the extreme value is real and meaningful or a data quality problem, because the appropriate treatment differs dramatically between the two cases: genuine extremes are usually kept, flagged, or capped, while errors are corrected or removed.

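import pandas as pd
import numpy as np

# Load the cleaned sales data (reading Parquet requires pyarrow or fastparquet)
df = pd.read_parquet('sales_clean.parquet')

# Summary statistics: a max far above the 75% quartile hints at outliers
print(df['revenue'].describe())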
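To make the genuine-versus-erroneous distinction concrete, here is a minimal sketch that separates values that are impossible by business rule from values that are merely extreme. The `orders` frame, its values, the column names `is_invalid` and `needs_review`, and the 10× median threshold are all made up for illustration; they are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical order data; the values are invented for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "revenue": [95.0, 110.0, -40.0, 102.0, 500000.0, 98.0],
})

# Rule-based check: revenue cannot be negative, so such rows are data errors.
orders["is_invalid"] = orders["revenue"] < 0

# Among the valid rows, mark values far above the median for manual review.
# The 10x multiplier is an arbitrary illustrative threshold, not a standard.
valid_median = orders.loc[~orders["is_invalid"], "revenue"].median()
orders["needs_review"] = ~orders["is_invalid"] & (orders["revenue"] > 10 * valid_median)

print(orders)
```

Flagging rather than deleting keeps both decisions reversible: the negative row can be corrected at the source, and the very large order can be confirmed as a real enterprise purchase before anything is dropped.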

Visual Outlier Detection: Box Plot

A box plot is the fastest visual tool for spotting outliers. The box spans the interquartile range (IQR), from the first quartile (Q1) to the third (Q3). Each whisker extends to the most extreme data point that still lies within 1.5 × IQR of the box edge, and points beyond the whiskers are plotted individually as suspected outliers. Use df['revenue'].plot(kind='box') or Seaborn's sns.boxplot() to see the outliers immediately without computing thresholds manually.

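import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(6, 4))

# Pandas draws the box plot via Matplotlib; whiskers default to 1.5 x IQR
df['revenue'].plot(kind='box', ax=ax)
# Seaborn alternative: sns.boxplot(y=df['revenue'], ax=ax)

ax.set_title('Revenue Distribution: Box Plot')
plt.tight_layout()
plt.show()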
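The thresholds that the box plot draws can also be computed directly, which is useful when you need a boolean mask rather than a picture. The sketch below is one possible way to do it, using a small synthetic `revenue` series (hypothetical values, not the lesson's data). It computes the 1.5 × IQR fences, a Z-score using the population standard deviation, and a capped copy of the data. The |z| > 3 cutoff is a common convention rather than a fixed rule.

```python
import pandas as pd

# Hypothetical revenue values: 20 ordinary orders (90..109) plus one extreme order.
revenue = pd.Series([float(v) for v in range(90, 110)] + [1000.0], name="revenue")

# IQR fences: the same 1.5 x IQR rule that box plot whiskers are based on.
q1 = revenue.quantile(0.25)
q3 = revenue.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier_iqr = (revenue < lower_fence) | (revenue > upper_fence)

# Z-score: distance from the mean in standard deviations.
# ddof=0 (population standard deviation) is a choice made for this sketch.
z_scores = (revenue - revenue.mean()) / revenue.std(ddof=0)
is_outlier_z = z_scores.abs() > 3

# One treatment option: cap (winsorize) values at the fences instead of removing them.
revenue_capped = revenue.clip(lower=lower_fence, upper=upper_fence)

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(revenue[is_outlier_iqr])
print(revenue[is_outlier_z])
```

On this data both methods mark only the 1000.0 order. They will not always agree: the mean and standard deviation are themselves pulled by extreme values, so on small or heavily skewed samples the Z-score can miss points that the IQR fences catch.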
