Pandas & NumPy Academy · Lesson

Data Cleaning and Feature Engineering

Apply the advanced cleaning checklist, engineer date features and ratio columns, and validate the clean dataset with assertions.

Loading the Audited Dataset

Continuing the capstone project, load the clean Parquet checkpoint saved in the previous step. This decouples the cleaning stage from ingestion — you can iterate on cleaning logic without re-running the slow merge and parse operations. Read the clean_orders Parquet and verify the row count and dtypes match expectations from the audit summary. The Parquet file preserves all dtypes including categorical columns, so no re-casting is needed.

import pandas as pd

# Load from checkpoint
df = pd.read_parquet('output/clean_orders.parquet')
print(f'Loaded: {df.shape}')
print(df.dtypes)
print(f'Date range: {df["order_date"].min().date()} to {df["order_date"].max().date()}')

Applying the Advanced Cleaning Checklist

With the merged dataset loaded, apply the advanced cleaning checklist: remove exact and partial duplicates on order_id, cap outliers in unit_price using IQR fencing, standardise inconsistent category values (e.g., 'Electronics' vs 'electronics' vs 'ELECTRONIC'), and validate that quantity × unit_price is always positive for normal orders. Each step is a function that accepts and returns a DataFrame, making the pipeline testable and reversible.

import pandas as pd
import numpy as np

def remove_duplicates(df):
    n_before = len(df)
    df = df.drop_duplicates(subset=['order_id'], keep='first')
    print(f'Duplicates removed: {n_before - len(df)}')
    return df

def standardise_categories(df):
    df['category'] = (df['category']
                      .astype(str)
                      .str.strip()
                      .str.title())
    return df

def cap_price_outliers(df):
    q1, q3 = df['unit_price'].quantile([0.25, 0.75])
    iqr = q3 - q1
    upper = q3 + 3 * iqr
    df['unit_price'] = df['unit_price'].clip(upper=upper)
    return df

df = remove_duplicates(df)
df = standardise_categories(df)
df = cap_price_outliers(df)
print('Cleaning complete.')

All lessons in this course

← Back to Pandas & NumPy Academy