Handling Missing Values: Drop, Impute, and Flag
Learners will detect nulls, apply mean/median/most-frequent imputation with SimpleImputer, and decide when dropping rows is safer than filling.
Why Missing Values Are Dangerous
Missing values, represented as NaN, None, or empty cells, are one of the most common data quality issues in real-world datasets. If left unhandled, they cause most scikit-learn estimators to raise an error, because those estimators expect complete numeric matrices. Address missing values before training. There are three main strategies: drop the affected rows or columns, impute (fill in estimated values), or flag the missingness as its own feature.
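The three strategies can be sketched side by side on a small frame. This is a minimal illustration rather than a recommended pipeline: the toy data and the names dropped, flagged, imputed, and age_missing are assumptions made for the example, and the median is just one of the strategies SimpleImputer supports.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "age": [25, np.nan, 35, np.nan, 50],
    "income": [40000, 55000, np.nan, 72000, 90000],
})

# 1. Drop: keep only the rows that have no missing values
dropped = df.dropna()

# 2. Flag: record where 'age' was missing as a 0/1 feature
flagged = df.copy()
flagged["age_missing"] = df["age"].isna().astype(int)

# 3. Impute: replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped.shape)                     # rows that survived dropping
print(flagged["age_missing"].tolist())   # 1 where 'age' was missing
print(imputed)                           # no NaN values remain
```

Dropping here discards three of the five rows, which shows how quickly row-wise deletion can shrink a small dataset. Imputing keeps every row, and the flag column preserves the fact that a value was originally absent.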
import pandas as pd
import numpy as np
df = pd.DataFrame({
'age': [25, np.nan, 35, np.nan, 50],
'income': [40000, 55000, np.nan, 72000, 90000],
'score': [85, 90, 78, np.nan, 95]
})
print(df.isnull().sum()) # Count nulls per column
print(df.isnull().mean()) # Fraction missing per column
Detecting Nulls with Pandas
Before deciding how to handle missing values, quantify the problem. Pandas provides df.isnull() (df.isna() is an equivalent name) to create a boolean mask, and df.isnull().sum() to count nulls per column. The df.info() method also shows non-null counts. Always inspect the fraction of missing values: as a rule of thumb, if more than 50% of a column is missing, dropping the column is often safer than imputing it.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('housing.csv')
# Fraction missing per column, sorted from most to least missing
missing_frac = df.isnull().mean().sort_values(ascending=False)
print(missing_frac[missing_frac > 0]) # Show only columns that contain nulls
# Heat map of missing values: one row per record, one column per feature
sns.heatmap(df.isnull(), cbar=False)
plt.show()
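The 50% rule of thumb from the detection step can be turned into a small filter. The sketch below is an illustration under assumptions: the toy frame, the MAX_MISSING constant, and the column names are hypothetical, and the cutoff should be tuned per dataset rather than treated as fixed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'notes' is missing in 4 of 5 rows
df = pd.DataFrame({
    "age": [25, np.nan, 35, 41, 50],
    "notes": [np.nan, np.nan, "ok", np.nan, np.nan],
    "score": [85, 90, 78, np.nan, 95],
})

MAX_MISSING = 0.5  # assumed rule-of-thumb cutoff, not a fixed standard

# Fraction of missing values per column
missing_frac = df.isnull().mean()

# Columns whose missing fraction exceeds the cutoff
cols_to_drop = missing_frac[missing_frac > MAX_MISSING].index.tolist()
reduced = df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```

Columns that fall below the cutoff, such as age and score here, stay in the frame and remain candidates for imputation or flagging.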