Machine Learning Academy · Lesson

Weight Initialisation: Xavier and He Initialisation

Learners will apply Xavier uniform and He normal initialisation and observe how they prevent vanishing and exploding gradients in deep networks, compared with naive random initialisation.

Why Initialisation Matters

The weights of a neural network must be initialised to non-zero values before training, and the choice of how to initialise them profoundly affects training dynamics. Poor initialisation causes vanishing gradients (the signal shrinks at every layer, so the gradients reaching the early layers become negligible) or exploding gradients (the signal grows at every layer, so gradients can overflow to inf or NaN). Good initialisation keeps activations and gradients in a healthy range from the very first batch, enabling stable and fast training.

import torch
import torch.nn as nn

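# All-zeros init: disaster! Every hidden neuron computes the same
# output and receives the same gradient (symmetry is never broken)
model_bad = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 1))
for layer in (model_bad[0], model_bad[2]):
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)

# .grad is None until a backward pass has run
model_bad(torch.randn(8, 4)).sum().backward()
print('First-layer weight gradients:', model_bad[0].weight.grad)  # all zeros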

# Constant init: same problem
# Random init from N(0,1): works for shallow, fails deep
# Xavier / He: designed for deep networks
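
The comments above can be checked with a small experiment. The sketch below pushes random data through a stack of linear layers and reports the standard deviation of the final activations under different schemes. The helper name final_activation_std, the depth of 20 and the width of 256 are illustrative assumptions, not part of the lesson; nn.init.xavier_uniform_ and nn.init.kaiming_normal_ are the PyTorch functions for Xavier uniform and He normal initialisation.

```python
import torch
import torch.nn as nn

def final_activation_std(init_fn, activation, depth=20, width=256, seed=0):
    """Hypothetical helper: forward random data through `depth` linear
    layers and return the std of the last layer's activations."""
    torch.manual_seed(seed)
    x = torch.randn(512, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)          # overwrite PyTorch's own init
            x = activation(layer(x))
    return x.std().item()

# Naive Gaussian init with a std that is too small: the signal vanishes
tiny_std = final_activation_std(
    lambda w: nn.init.normal_(w, mean=0.0, std=0.01), torch.relu)

# Naive N(0, 1) init: the signal grows at every layer
unit_std = final_activation_std(
    lambda w: nn.init.normal_(w, mean=0.0, std=1.0), torch.relu)

# He normal (paired here with ReLU)
he_std = final_activation_std(
    lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'), torch.relu)

# Xavier uniform (paired here with tanh)
xavier_std = final_activation_std(nn.init.xavier_uniform_, torch.tanh)

print(f'N(0, 0.01^2) + ReLU : {tiny_std:.3e}')
print(f'N(0, 1)      + ReLU : {unit_std:.3e}')
print(f'He normal    + ReLU : {he_std:.3e}')
print(f'Xavier unif. + tanh : {xavier_std:.3e}')
```

With these settings the first value should collapse towards zero and the second should blow up by many orders of magnitude, while the He and Xavier runs should stay at a moderate scale. Exact numbers depend on the seed, depth and width chosen.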

The Symmetry Breaking Problem

If all weights are initialised to the same value (including zero), every neuron in a layer computes exactly the same output and receives exactly the same gradient. All neurons therefore learn the same feature, and the hidden layer effectively collapses to a single neuron. This symmetry problem is why random initialisation is necessary: each neuron must start with different random weights to break symmetry and learn different representations.

import torch
import torch.nn as nn

# Demonstrate symmetry breaking failure
model = nn.Linear(3, 4, bias=False)
nn.init.constant_(model.weight, 0.1)  # all same

x = torch.randn(5, 3)
y = model(x)

# All 4 neurons produce identical outputs!
print('All neurons identical:', torch.allclose(y[:, 0], y[:, 1]))
# True -- the 4 output neurons are indistinguishable
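
The same point can be made about training, not just the forward pass. The sketch below trains a tiny two-layer network for a few SGD steps and checks whether the hidden neurons' weight rows are still identical afterwards. The helper name first_layer_after_training, the network shape, the step count and the learning rate are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3)
target = torch.randn(16, 1)

def first_layer_after_training(init_fn, steps=5, lr=0.1):
    """Hypothetical helper: run a few SGD steps on a 3-4-1 network and
    return the first layer's weight matrix (one row per hidden neuron)."""
    net = nn.Sequential(
        nn.Linear(3, 4, bias=False), nn.Tanh(), nn.Linear(4, 1, bias=False))
    for layer in (net[0], net[2]):
        init_fn(layer.weight)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), target)
        loss.backward()
        opt.step()
    return net[0].weight.detach()

def rows_identical(w):
    # True if every row of w matches the first row
    return torch.allclose(w, w[0].expand_as(w))

w_const = first_layer_after_training(lambda w: nn.init.constant_(w, 0.1))
w_xavier = first_layer_after_training(nn.init.xavier_uniform_)

print('Constant init, rows still identical:', rows_identical(w_const))
print('Xavier init, rows still identical:  ', rows_identical(w_xavier))
```

Under constant initialisation every hidden neuron receives the same update at every step, so the rows stay identical however long training runs. A random scheme such as Xavier uniform gives each neuron a different starting point, so the rows differ from the start and the neurons are free to diverge.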
