Scheduling and Logging Pipeline Runs
Run your pipeline as a Python script from the command line, log start and end times, and use cron or a scheduler for automation.
From Notebook to Script
A pipeline that runs only when a developer manually opens a notebook provides no business value beyond the first run. To run automatically every day, the pipeline must be structured as a Python script that can be executed from the command line: python pipeline.py. This requires an if __name__ == '__main__': entry point, command-line argument parsing, and proper logging, the three pillars of a production script.
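# pipeline.py
import argparse
import logging
import pandas as pd  # used by the ETL steps
def main(config_path):
    logging.info(f'Starting pipeline with config: {config_path}')
    # ... run ETL steps ...
    logging.info('Pipeline complete.')
if __name__ == '__main__':
    # Without this call the root logger defaults to WARNING and drops INFO messages.
    logging.basicConfig(level=logging.INFO)
    # Read the config file path from the command line, e.g. --config prod.json
    parser = argparse.ArgumentParser()
    parser.add_argument('--config', default='config.json')
    args = parser.parse_args()
    main(args.config)
Configuring Python Logging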
Python's built-in logging module is the correct tool for pipeline logs, not print() statements. Configure the root logger with both console and file output by passing a list of handlers to logging.basicConfig(). Log at the INFO level for normal progress and at ERROR for failures. File-based logs persist after the process exits, which is essential for debugging scheduled runs that no one was watching.
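import logging
from datetime import date
# One log file per calendar day, e.g. pipeline_2024-01-31.log
log_file = f'pipeline_{date.today()}.log'
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[
        logging.FileHandler(log_file),  # persists after the process exits
        logging.StreamHandler()         # echoes to the console (stderr)
    ]
)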
logging.info('Logger configured.')
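The lesson summary recommends logging start and end times, and the logging section recommends ERROR for failures. A minimal sketch of how the two might be combined is shown below. The helper name run_with_logging is hypothetical (it is not part of the lesson's pipeline.py), and the sketch assumes logging has already been configured with a format that includes %(asctime)s, so each message carries its own timestamp.

```python
import logging
import time


def run_with_logging(step, *args):
    """Run step(*args), logging start, end, and elapsed time.

    Returns 0 on success and 1 on failure, so the value can be used
    as the process exit code. (Hypothetical helper for illustration.)
    """
    start = time.perf_counter()
    logging.info('Run started.')
    try:
        step(*args)
    except Exception:
        # logging.exception logs at ERROR level and appends the traceback.
        elapsed = time.perf_counter() - start
        logging.exception('Run failed after %.2f s.', elapsed)
        return 1
    elapsed = time.perf_counter() - start
    logging.info('Run finished in %.2f s.', elapsed)
    return 0
```

In the entry point of pipeline.py, the call main(args.config) could then become sys.exit(run_with_logging(main, args.config)), with import sys added at the top. A non-zero exit code gives cron or another scheduler a way to detect a failed run. As an illustration only, a crontab entry such as 0 6 * * * python /path/to/pipeline.py (placeholder path) would start the script every day at 06:00.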