Data Wrangling and Exploratory Data Analysis
Learners will load a raw dataset, profile distributions and correlations, identify data quality issues, and document findings in a reproducible Jupyter notebook.
What Is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the practice of profiling a dataset before modelling to understand its structure, distributions, relationships, and quality issues. The term was coined by John Tukey, and the approach follows a simple principle: let the data speak. Skipping EDA and jumping straight to modelling is a common reason ML projects fail, alongside poor problem scoping. EDA prevents you from building models on faulty foundations.
Loading and First Inspection
Always start with a structural inspection: how many rows and columns does the dataset have, and what data type does each column hold? Use df.shape, df.dtypes, df.head(), and df.info(). Look for columns that appear numeric but are stored as object (strings), and columns that are categorical but encoded as integers. Mismatched dtypes cause silent bugs downstream, for example when scikit-learn treats an integer-encoded category as a continuous variable.
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv('customer_churn.csv')
print('Shape:', df.shape)
print('\nData types:')
print(df.dtypes)
print('\nFirst 5 rows:')
print(df.head())
print('\nSummary info:')
df.info()
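The dtype checks described above can be partly automated. The following is a minimal sketch, not a definitive method: the toy DataFrame, its column names, the two helper functions, and both thresholds are hypothetical and chosen only for illustration. It flags text columns whose values mostly parse as numbers, and integer columns with so few distinct values that they may be category codes.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Hypothetical toy data standing in for a real dataset (column names are illustrative).
df = pd.DataFrame({
    "monthly_charges": ["29.85", "56.95", "n/a", "42.30", "70.70", "99.65"],  # numbers stored as text
    "contract_type": [0, 1, 2, 1, 0, 2],                                      # category stored as integers
    "tenure_months": [1, 34, 2, 45, 8, 22],                                   # genuinely numeric
    "city": ["Leeds", "York", "Leeds", "Bath", "York", "Hull"],               # genuinely text
})

def numeric_looking_text_columns(frame, min_fraction=0.5):
    """Non-numeric columns where at least min_fraction of non-null values parse as numbers."""
    flagged = []
    for col in frame.columns:
        if is_numeric_dtype(frame[col]):
            continue
        values = frame[col].dropna()
        if values.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising an error.
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= min_fraction:
            flagged.append(col)
    return flagged

def low_cardinality_integer_columns(frame, max_unique=3):
    """Integer columns with at most max_unique distinct values (possible category codes)."""
    return [
        col for col in frame.columns
        if is_integer_dtype(frame[col]) and frame[col].nunique() <= max_unique
    ]

# Both thresholds are arbitrary assumptions; tune them for the dataset at hand.
numeric_as_text = numeric_looking_text_columns(df)
coded_categories = low_cardinality_integer_columns(df)
print("Text columns that look numeric:", numeric_as_text)
print("Integer columns that may be categories:", coded_categories)

# One possible fix once a column has been reviewed by hand.
df["monthly_charges"] = pd.to_numeric(df["monthly_charges"], errors="coerce")
df["contract_type"] = df["contract_type"].astype("category")
print(df.dtypes)
```

The flags are only candidates for review. A low-cardinality integer column can be a genuine count, and unparseable entries such as "n/a" become missing values after conversion, so they should be counted before and after the change.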