Weight Initialisation: Xavier and He Initialisation
Learners will apply Xavier uniform and He normal initialisation and observe how they help prevent vanishing and exploding gradients in deep networks, compared with plain random initialisation.
Why Initialisation Matters
The weights of a neural network must be initialised to non-zero values before training, and the choice of how to initialise them profoundly affects training dynamics. Poor initialisation causes vanishing gradients (activations and backpropagated gradients shrink layer by layer until weight updates become negligible) or exploding gradients (they grow layer by layer until the loss overflows to inf or NaN). Good initialisation keeps activations and gradients in a healthy range from the very first batch, enabling stable and fast training.
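To make the scale argument concrete, here is a minimal sketch that pushes one random batch through a stack of linear layers and reports the standard deviation of the final activations under four initialisation schemes. The helper `final_activation_std`, the depth of 10, the width of 256, and the two naive standard deviations (1.0 and 0.01) are illustrative assumptions, not values taken from this lesson.

```python
import torch
import torch.nn as nn


def final_activation_std(init_fn, activation, depth=10, width=256, seed=0):
    """Hypothetical helper: forward one random batch through `depth`
    bias-free Linear layers and return the std of the last activations."""
    torch.manual_seed(seed)
    x = torch.randn(512, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)  # overwrite PyTorch's default init in place
            x = activation(layer(x))
    return x.std().item()


# Naive N(0, 1) weights with ReLU: activations grow at every layer
naive_std = final_activation_std(
    lambda w: nn.init.normal_(w, mean=0.0, std=1.0), torch.relu
)
# Very small N(0, 0.01^2) weights with tanh: activations shrink at every layer
tiny_std = final_activation_std(
    lambda w: nn.init.normal_(w, mean=0.0, std=0.01), torch.tanh
)
# Xavier uniform is usually paired with tanh or sigmoid
xavier_std = final_activation_std(nn.init.xavier_uniform_, torch.tanh)
# He (Kaiming) normal is usually paired with ReLU
he_std = final_activation_std(
    lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'), torch.relu
)

print(f'N(0, 1) + ReLU:        {naive_std:.3e}')
print(f'N(0, 0.01^2) + tanh:   {tiny_std:.3e}')
print(f'Xavier uniform + tanh: {xavier_std:.3e}')
print(f'He normal + ReLU:      {he_std:.3e}')
```

Under these assumptions the first value should be many orders of magnitude above 1 and the second many orders below it, while the Xavier and He runs should stay within roughly an order of magnitude of 1. Gradients are scaled by the same weight matrices during backpropagation, so they tend to explode or vanish in the same way.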
import torch
import torch.nn as nn
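# All-zeros init: disaster! All neurons compute the same
# output and receive the same (zero) gradient, so symmetry is never broken
model_bad = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 1))
for p in model_bad.parameters():
    nn.init.zeros_(p)  # zero every weight and bias
model_bad(torch.randn(8, 4)).sum().backward()  # run backward so .grad is populated
print('All-zero gradients:', model_bad[0].weight.grad)
# Constant init: same problem
# Random init from N(0,1): works for shallow, fails deep
# Xavier / He: designed for deep networks
The Symmetry Breaking Problem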
If all weights are initialised to the same value (including zero), every neuron in a layer computes exactly the same output and receives exactly the same gradient, so the neurons remain identical after every update. All of them learn the same feature, and the hidden layer collapses to a single neuron for all practical purposes. This symmetry problem is why random initialisation is necessary: each neuron must start with different random weights to break symmetry and learn different representations.
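The claim that identical neurons also receive identical gradients can be checked by training for a few steps. The sketch below assumes a small hypothetical two-layer network, random data, and plain SGD. Every parameter starts at 0.1, and after several updates the rows of the first weight matrix (one row per hidden neuron) are compared.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer network; the sizes are arbitrary
net = nn.Sequential(nn.Linear(3, 4), nn.Tanh(), nn.Linear(4, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.1)  # every weight and bias gets the same value

x = torch.randn(16, 3)
target = torch.randn(16, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

# A few gradient steps on a toy regression loss
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), target)
    loss.backward()
    opt.step()

hidden_w = net[0].weight.detach()  # shape (4, 3): one row per hidden neuron
weights_moved = not torch.allclose(hidden_w, torch.full_like(hidden_w, 0.1))
rows_identical = all(
    torch.allclose(hidden_w[0], hidden_w[i]) for i in range(1, 4)
)
print('Weights changed during training:', weights_moved)
print('Hidden neurons still identical:', rows_identical)
```

The weights do move away from 0.1, but all four hidden neurons move together, so training alone never breaks the symmetry. Only different starting values can.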
import torch
import torch.nn as nn
# Demonstrate symmetry breaking failure
model = nn.Linear(3, 4, bias=False)
nn.init.constant_(model.weight, 0.1) # all same
x = torch.randn(5, 3)
y = model(x)
# All 4 neurons produce identical outputs!
print('All neurons identical:', torch.allclose(y[:, 0], y[:, 1]))
# True -- the 4 output neurons are indistinguishable