Run Training as a Kubernetes Job
Execute batch training to completion on the cluster.
Training Is Not a Server
A model server runs forever, but a training run should start, finish, and stop. Kubernetes has a different object built for that: the Job. 🏁
A Job Runs to Completion
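A Job creates one or more Pods and tracks them until the required number exit successfully, retrying failed Pods up to a configurable limit. Once training succeeds, the Job is marked complete and its Pods stop running, which releases the CPU, memory, and GPU they requested.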
apiVersion: batch/v1
kind: Job
metadata:
  name: train-ranker
spec:
  template:
    spec:
      restartPolicy: Never
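The manifest in this lesson shows only the Job skeleton. To run, a Job also needs at least one container in its Pod template. The following is a minimal sketch of how a fuller version might look, not the course's actual manifest. The container name, image, and command are hypothetical placeholders, and the `backoffLimit` and `ttlSecondsAfterFinished` values are illustrative assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-ranker
spec:
  backoffLimit: 2                  # assumption: retry a failed run at most twice
  ttlSecondsAfterFinished: 3600    # assumption: clean up the finished Job after one hour
  template:
    spec:
      restartPolicy: Never         # do not restart the container in place; the Job creates a new Pod on retry
      containers:
        - name: trainer                                    # hypothetical container name
          image: registry.example.com/train-ranker:latest  # hypothetical image
          command: ["python", "train.py"]                  # hypothetical training entrypoint
```

Assuming the manifest is saved as `job.yaml`, it could be submitted with `kubectl apply -f job.yaml`. `kubectl wait --for=condition=complete job/train-ranker` then blocks until the run finishes, and `kubectl logs job/train-ranker` shows the training output.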