Pandas & NumPy Academy · Lesson

Structuring Transformation Steps as Functions

Break your notebook into extract, transform, and load functions. Keep every transform step a function that accepts a DataFrame and returns a DataFrame, so each one is easy to test in isolation.

Moving Beyond Notebooks

Jupyter notebooks are great for exploration but a poor fit for production data pipelines. Code scattered across 50 cells, relying on global state and covered by no tests, is fragile: re-running or editing one cell can silently break another. The alternative is to move each transformation into a named function that accepts a DataFrame and returns a new DataFrame, without depending on variables defined elsewhere. This separation of concerns is the foundation of maintainable data engineering code.

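import pandas as pd

# Notebook-style (fragile): every line rebinds the global df,
# so the result depends on which cells ran and in what order.
df = pd.read_csv('orders.csv')
df = df.dropna(subset=['revenue'])
df = df[df['quantity'] > 0]
df['revenue_per_unit'] = df['revenue'] / df['quantity']

# Function-style (robust): no global state, each step returns a new DataFrame.
def extract(path):
    # Read the raw orders file.
    return pd.read_csv(path)

def transform(df):
    # Drop rows without revenue, keep positive quantities, add a derived column.
    return (
        df.dropna(subset=['revenue'])
        .query('quantity > 0')
        .assign(revenue_per_unit=lambda d: d['revenue'] / d['quantity'])
    )

df = transform(extract('orders.csv'))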
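
Because transform only needs a DataFrame, it can be checked without touching orders.csv. The following is a minimal sketch of that idea, assuming a hand-built sample whose rows are made up for illustration; transform is repeated so the block runs on its own.

```python
import pandas as pd


def transform(df):
    # Same steps as the lesson's transform, repeated so this block is self-contained.
    return (
        df.dropna(subset=['revenue'])
        .query('quantity > 0')
        .assign(revenue_per_unit=lambda d: d['revenue'] / d['quantity'])
    )


# Hypothetical sample: one valid row, one with missing revenue, one with zero quantity.
sample = pd.DataFrame({
    'revenue': [100.0, None, 50.0],
    'quantity': [4, 2, 0],
})

result = transform(sample)

# Only the first row should survive both filters.
assert len(result) == 1
assert result['revenue_per_unit'].iloc[0] == 25.0
# The input is untouched because each step returns a new DataFrame.
assert 'revenue_per_unit' not in sample.columns
```

A check like this runs in milliseconds and needs no files, which is the practical payoff of keeping the transform step free of I/O.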

The ETL Pattern: Extract, Transform, Load

The ETL pattern divides a data pipeline into three phases: Extract (read from the source), Transform (clean and enrich), and Load (write to the destination). Each phase is a separate function. Only extract knows where the data comes from and only load knows where it goes, so you can swap the data source (CSV vs. database), change the cleaning logic, or change the output format without touching the other two phases. Most batch pipelines benefit from this structure.

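import pandas as pd

def extract(config):
    # Read the source file and parse order_date as a datetime column.
    return pd.read_csv(config['input_path'], parse_dates=['order_date'])

def transform(df, config):
    # Drop rows missing any required column, keep positive quantities,
    # and compute revenue from quantity and unit price.
    return (
        df
        .dropna(subset=config['required_cols'])
        .query('quantity > 0')
        .assign(revenue=lambda d: d['quantity'] * d['unit_price'])
    )

def load(df, config):
    # Writing Parquet requires the pyarrow or fastparquet package.
    df.to_parquet(config['output_path'], index=False)
    print(f'Saved {len(df)} rows.')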
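
One way to see the benefit of the separation is to swap the load phase while leaving extract and transform alone. The sketch below is illustrative only: run_pipeline and load_csv are hypothetical helpers that do not appear in the lesson, the config keys mirror the ones used above, and the sample rows are made up and written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def extract(config):
    return pd.read_csv(config['input_path'], parse_dates=['order_date'])


def transform(df, config):
    return (
        df
        .dropna(subset=config['required_cols'])
        .query('quantity > 0')
        .assign(revenue=lambda d: d['quantity'] * d['unit_price'])
    )


def load_csv(df, config):
    # Hypothetical alternative loader: same signature as load, different output format.
    df.to_csv(config['output_path'], index=False)


def run_pipeline(config, load=load_csv):
    # Hypothetical helper: chains the three phases; the loader is passed in
    # so it can be replaced without editing extract or transform.
    df = transform(extract(config), config)
    load(df, config)
    return df


with tempfile.TemporaryDirectory() as tmp:
    config = {
        'input_path': os.path.join(tmp, 'orders.csv'),
        'output_path': os.path.join(tmp, 'clean_orders.csv'),
        'required_cols': ['quantity', 'unit_price'],
    }
    # Made-up sample data: one valid row, one zero quantity, one missing price.
    pd.DataFrame({
        'order_date': ['2024-01-01', '2024-01-02', '2024-01-03'],
        'quantity': [2, 0, 3],
        'unit_price': [10.0, 5.0, None],
    }).to_csv(config['input_path'], index=False)

    result = run_pipeline(config)
    written = pd.read_csv(config['output_path'])
```

Passing the Parquet-writing load function instead of load_csv would change the output format with no other edits, assuming a Parquet engine is installed.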
