Saving and Loading a Pipeline with joblib
Learners will serialise a fitted Pipeline to disk with joblib.dump and reload it in a fresh Python session to make predictions without re-training.
Why Persist a Trained Pipeline?
Training a machine learning pipeline can take minutes or hours. Once it is fitted, you want to save it to disk so you can reload it later and make predictions without re-training. Persistence is also essential for deployment: you train on a development machine and serve predictions on a production server. The saved file must contain both the fitted preprocessing steps and the fitted model parameters, so that new data is transformed exactly as the training data was before it reaches the model.
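As a minimal sketch of that round trip, the snippet below fits a small pipeline, writes it to disk with joblib.dump, and reads it back with joblib.load. The dataset, the step names "scaler" and "clf", and the file name pipeline.joblib are illustrative assumptions, not part of the lesson. The reload happens in the same process here for brevity; in practice the joblib.load call would run in a fresh Python session.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step names "scaler" and "clf" are illustrative choices.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.joblib"  # hypothetical file name
    joblib.dump(pipe, model_path)      # save the whole fitted pipeline
    loaded = joblib.load(model_path)   # reload without re-training

# The reloaded object carries both the scaler and the classifier.
step_names = [name for name, _ in loaded.steps]
same_predictions = np.array_equal(pipe.predict(X), loaded.predict(X))
print("steps:", step_names)
print("predictions match:", same_predictions)
```

Because the whole Pipeline object is saved, the reloaded copy applies the same scaling before predicting, so raw feature values can be passed to it directly.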
Two Serialisation Options: pickle and joblib
Python's built-in pickle module can serialise most Python objects, including sklearn pipelines. joblib is a third-party library that is installed as a dependency of scikit-learn. It is generally preferred for ML objects because it handles objects that contain large NumPy arrays efficiently, can memory-map those arrays when loading an uncompressed file, and can optionally compress the output file through the compress argument of joblib.dump.
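import pickle
import joblib
import sklearn
# Both approaches work; joblib is recommended for sklearn objects
# HIGHEST_PROTOCOL is the newest pickle protocol number, not a library version
print('highest pickle protocol:', pickle.HIGHEST_PROTOCOL)
print('joblib version:', joblib.__version__)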
print('sklearn version:', sklearn.__version__)
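To compare the two serialisation options side by side, the sketch below saves the same fitted pipeline three ways: with pickle, with joblib, and with joblib using compress=3. The random forest model, the file names, and the compression level are assumptions chosen for illustration. The file sizes it prints depend on the model and library versions, so treat them as something to inspect rather than a fixed result.

```python
import os
import pickle
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # File names are hypothetical; any writable path works.
    pickle_path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib_path = os.path.join(tmp_dir, "pipeline.joblib")
    compressed_path = os.path.join(tmp_dir, "pipeline_compressed.joblib")

    # Option 1: the standard-library pickle module.
    with open(pickle_path, "wb") as f:
        pickle.dump(pipe, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Option 2: joblib, first uncompressed, then with compression level 3.
    joblib.dump(pipe, joblib_path)
    joblib.dump(pipe, compressed_path, compress=3)

    sizes = {
        "pickle": os.path.getsize(pickle_path),
        "joblib": os.path.getsize(joblib_path),
        "joblib (compress=3)": os.path.getsize(compressed_path),
    }

    # Reload one copy from each library to confirm both round-trip.
    with open(pickle_path, "rb") as f:
        from_pickle = pickle.load(f)
    from_joblib = joblib.load(compressed_path)

for label, size in sizes.items():
    print(f"{label}: {size} bytes")

pickle_ok = np.array_equal(pipe.predict(X), from_pickle.predict(X))
joblib_ok = np.array_equal(pipe.predict(X), from_joblib.predict(X))
print("pickle round trip ok:", pickle_ok)
print("joblib round trip ok:", joblib_ok)
```

Compression trades a smaller file for extra time when saving and loading, and a compressed file cannot be memory-mapped on load, so the uncompressed default is a reasonable starting point.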