Machine Learning Academy · Lesson

Handling Missing Values: Drop, Impute, and Flag

Learners will detect nulls, apply mean/median/most-frequent imputation with SimpleImputer, and decide when dropping rows is safer than filling.

Why Missing Values Are Dangerous

Missing values, represented as NaN, None, or empty cells, are among the most common data quality issues in real-world datasets. Left unhandled, they raise errors in most scikit-learn estimators, which expect numeric input with no NaN entries. Address missing values before training. There are three main strategies: drop the affected rows or columns, impute (fill in estimated values), or flag the missingness as its own feature.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 35, np.nan, 50],
    'income': [40000, 55000, np.nan, 72000, 90000],
    'score': [85, 90, 78, np.nan, 95]
})

print(df.isnull().sum())  # Count nulls per column
print(df.isnull().mean())  # Fraction missing per column
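
The three strategies named above can be sketched on the same toy DataFrame. This is a minimal illustration, not a recommended pipeline: the choice of the median strategy, the `_missing` suffix, and the variable names `dropped`, `imputed`, and `flagged` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, np.nan, 35, np.nan, 50],
    'income': [40000, 55000, np.nan, 72000, 90000],
    'score': [85, 90, 78, np.nan, 95]
})

# Strategy 1: drop every row that contains at least one null.
dropped = df.dropna()

# Strategy 2: impute each column with the median of its non-null values.
# 'median' is an illustrative choice; 'mean' and 'most_frequent' also exist.
imputer = SimpleImputer(strategy='median')
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# Strategy 3: record missingness as 0/1 indicator columns next to the
# imputed values. The '_missing' suffix is a naming assumption.
flags = df.isnull().astype(int).add_suffix('_missing')
flagged = pd.concat([imputed, flags], axis=1)

print(len(dropped), 'of', len(df), 'rows survive dropna()')
print(imputed)
print(flagged.columns.tolist())
```

On this small frame, dropping rows discards most of the data, which is one reason imputing or flagging is often considered first when nulls are spread across many rows.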

Detecting Nulls with Pandas

Before deciding how to handle missing values, quantify the problem. Pandas provides df.isnull() to create a boolean mask, df.isnull().sum() to count nulls per column, and df.isnull().mean() to get the fraction missing per column. The df.info() method also shows non-null counts. Always inspect the fraction of missing values: as a rule of thumb, if more than 50% of a column is missing, dropping the column may be safer than imputing it.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('housing.csv')

# Fraction missing per column, sorted from most to least missing
missing_frac = df.isnull().mean().sort_values(ascending=False)
print(missing_frac[missing_frac > 0])  # Show only columns that contain nulls

# Heat map of missing values (rows are records, columns are features)
sns.heatmap(df.isnull(), cbar=False)
plt.show()
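
The 50% rule of thumb can be applied in a few lines. The sketch below uses a small made-up DataFrame in place of housing.csv so that it runs on its own; the column names, the values, and the `threshold` variable are hypothetical, and the cutoff should be tuned per project.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataset; names and values are made up.
df = pd.DataFrame({
    'sqft': [850, 1200, np.nan, 1500],
    'garage': [np.nan, np.nan, np.nan, 2],
    'price': [200000, 310000, 275000, 400000]
})

df.info()  # Prints non-null counts and dtypes per column

threshold = 0.5  # Rule-of-thumb cutoff, not a fixed standard
missing_frac = df.isnull().mean()

# Keep only columns whose missing fraction does not exceed the cutoff.
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()
reduced = df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```

Here only 'garage' (75% missing) crosses the cutoff, while 'sqft' (25% missing) is kept and remains a candidate for imputation.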
