GroupBy Analysis by Region and Category
Aggregate total revenue and order count by region and product category, and identify the top-5 SKUs by margin.
Setting Up GroupBy Analysis
After computing a revenue column, the natural next step is to aggregate it by business dimensions like region and category. The Pandas groupby() call splits the DataFrame into groups, you apply an aggregation function, and Pandas combines the results into a summary table. This split-apply-combine pattern replaces nested SQL GROUP BY queries.
import pandas as pd
df = pd.read_parquet('sales_features.parquet')
# Total revenue by region
region_rev = df.groupby('region')['revenue'].sum().sort_values(ascending=False)
print(region_rev)Grouping by Category
Aggregate revenue by product category to identify which categories drive the most sales. Use .agg() to compute multiple statistics at once — total revenue, order count, and average order value — in a single pass over the data rather than three separate groupby calls.
cat_summary = df.groupby('category')['revenue'].agg(
total_revenue='sum',
order_count='count',
avg_order_value='mean'
).sort_values('total_revenue', ascending=False)
print(cat_summary)All lessons in this course
- Loading and Auditing the Sales Dataset
- Revenue Calculations and Feature Engineering
- GroupBy Analysis by Region and Category
- Monthly Trend Visualisation