Parameterising Pipelines with Config Dicts
Replace hardcoded file paths and column names with a config dict passed at runtime to make the pipeline reusable.
The Problem with Hardcoded Values
A pipeline with hardcoded file paths, column names, and threshold values breaks whenever the environment changes: a different server, a renamed column, or a changed business rule. Every such change means editing the pipeline code itself, which risks introducing bugs. The solution is to move these values into a configuration dictionary that is loaded at runtime and passed to the pipeline functions.
import pandas as pd
# BAD: hardcoded values scattered through code
df = pd.read_csv('/data/orders_2024.csv')
df = df.dropna(subset=['revenue', 'quantity'])
df = df[df['revenue'] < 5000]
df.to_parquet('/output/orders_clean.parquet')
print('Hardcoded paths and thresholds are fragile')
Defining a Config Dictionary
Replace every hardcoded value with an entry in a configuration dictionary. Group related settings logically: input/output paths together, cleaning thresholds together, column name mappings together. The config dict becomes the single source of truth for all pipeline parameters. Changing one value in the config updates every function that uses it without touching the function bodies.
CONFIG = {
    'input_path': '/data/orders_2024.csv',
    'output_path': '/output/orders_clean.parquet',
    'required_cols': ['order_id', 'order_date', 'revenue', 'quantity'],
    'date_cols': ['order_date'],
    'revenue_cap': 5000,
    'min_quantity': 1,
    'categorical_cols': ['region', 'category'],
}
print('Config loaded:', list(CONFIG.keys()))
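To make the idea concrete, here is a minimal sketch of pipeline steps that read every parameter from a config dict instead of hardcoding it. The function names (load_orders, clean_orders, run_pipeline) are hypothetical and not part of the lesson, and treating min_quantity as an inclusive lower bound is an assumption. The column names 'revenue' and 'quantity' are still written inline because the lesson's config has no key for them. The demo runs clean_orders on a small made-up DataFrame so it works without any files on disk.

```python
import pandas as pd


# Hypothetical step functions: names and signatures are illustrative only.
def load_orders(config):
    """Read the input CSV, parsing the configured date columns."""
    return pd.read_csv(config['input_path'], parse_dates=config['date_cols'])


def clean_orders(df, config):
    """Apply cleaning rules whose parameters all come from the config."""
    df = df.dropna(subset=config['required_cols'])
    # Assumption: min_quantity is an inclusive lower bound.
    keep = (df['revenue'] < config['revenue_cap']) & (
        df['quantity'] >= config['min_quantity']
    )
    df = df.loc[keep].copy()
    for col in config['categorical_cols']:
        df[col] = df[col].astype('category')
    return df


def run_pipeline(config):
    """Load, clean, and write the data (needs real files and a parquet engine)."""
    df = clean_orders(load_orders(config), config)
    df.to_parquet(config['output_path'])
    return df


# Demo on in-memory data, so no input file is required.
demo_config = {
    'required_cols': ['order_id', 'order_date', 'revenue', 'quantity'],
    'revenue_cap': 5000,
    'min_quantity': 1,
    'categorical_cols': ['region', 'category'],
}
sample = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
    ),
    'revenue': [100.0, 9000.0, None, 250.0],
    'quantity': [2, 1, 3, 0],
    'region': ['north', 'south', 'east', 'west'],
    'category': ['a', 'b', 'a', 'c'],
})

cleaned = clean_orders(sample, demo_config)

# Changing the rules means changing the config, not the function body.
relaxed_config = {**demo_config, 'revenue_cap': 10000, 'min_quantity': 0}
relaxed = clean_orders(sample, relaxed_config)

print(cleaned['order_id'].tolist())
print(relaxed['order_id'].tolist())
```

The same clean_orders function produces different results for demo_config and relaxed_config, which is the point of the pattern: a changed business rule becomes a changed config value rather than a code edit.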