Machine Learning Academy · Lesson

Creating New Features: Log Transforms, Binning, and Interactions

Learners will apply log transforms to skewed columns, bin continuous values into ordinal categories, and multiply pairs of features to capture interaction effects.

Why Feature Engineering Matters

Feature engineering is the process of transforming raw data into representations that make patterns more accessible to machine learning models. Even the best algorithm is limited by the quality of its input features. A well-chosen engineered feature can sometimes improve model accuracy more than extensive hyperparameter tuning. The intuition: if you give the model the right numbers to work with, it can learn simpler, more generalisable rules than if it must discover complex transformations on its own.

Log Transforms: Taming Skewed Distributions

Many real-world quantities, such as income, house prices, population, and transaction amounts, follow right-skewed distributions where most values are small but a few extreme values stretch the tail. Such features often hurt linear models, where a handful of extreme values can exert outsized leverage on the fitted coefficients, and distance-based algorithms such as KNN (as well as kernel methods such as RBF SVMs), where the large values dominate distance calculations. Applying np.log1p(), which computes log(1 + x) and is therefore safe for zero values (it requires x > -1), compresses the scale, typically making the distribution more symmetric and reducing the influence of extreme outliers.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California housing data (downloaded and cached on first call)
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Population: compare skewness before and after log(1 + x)
print('Population skewness (raw):', round(df['Population'].skew(), 2))
df['Population_log'] = np.log1p(df['Population'])
print('Population skewness (log): ', round(df['Population_log'].skew(), 2))

# AveRooms: same comparison on a second skewed column
print('AveRooms skewness (raw):', round(df['AveRooms'].skew(), 2))
df['AveRooms_log'] = np.log1p(df['AveRooms'])
print('AveRooms skewness (log): ', round(df['AveRooms_log'].skew(), 2))
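
To make the "safe for zero values" point concrete, here is a minimal sketch comparing np.log1p() with plain np.log() at zero, and showing that np.expm1() maps transformed values back to the original scale. The values array is a made-up illustration, not data from the housing set.

```python
import numpy as np

# Hypothetical right-skewed values, including a zero (illustrative only).
values = np.array([0.0, 1.0, 10.0, 100.0, 10_000.0])

# np.log1p(x) computes log(1 + x), so zero maps to zero instead of -inf.
logged = np.log1p(values)

# np.expm1(y) computes exp(y) - 1, the inverse of log1p.
restored = np.expm1(logged)

# Plain np.log returns -inf at zero; silence the divide-by-zero warning here.
with np.errstate(divide="ignore"):
    plain_log_at_zero = np.log(values[0])

print("log1p values:", np.round(logged, 3))
print("round trip matches:", np.allclose(restored, values))
print("np.log(0):", plain_log_at_zero)
```

The inverse matters when the transformed column is a regression target: predictions made on the log scale can be passed through np.expm1() to report them in the original units.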
