Chi-Squared Test for Independence
Test whether two categorical variables are independent using chi2_contingency on a crosstab frequency table.
Testing Categorical Relationships
The chi-squared test for independence checks whether two categorical variables are statistically independent or associated. For example: 'Is customer churn independent of subscription tier?' or 'Is product preference independent of age group?' Unlike t-tests, which compare numeric means, the chi-squared test compares the observed frequencies in a contingency table with the frequencies we would expect if the variables were independent.
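To make "expected if the variables were independent" concrete, here is a minimal sketch of the calculation. The 2x2 counts are hypothetical numbers chosen for illustration and are not taken from the lesson's data. Under independence, each expected cell count is its row total times its column total, divided by the grand total.

```python
import numpy as np

# Hypothetical observed counts (illustrative only):
# rows = two subscription tiers, columns = churned yes / no
observed = np.array([[30, 70],
                     [20, 80]])

row_totals = observed.sum(axis=1)   # total per tier
col_totals = observed.sum(axis=0)   # total per churn outcome
n = observed.sum()                  # grand total

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_stat, 4))
```

The larger the gap between observed and expected counts, the larger the statistic, and the stronger the evidence against independence.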
Building a Contingency Table with Pandas
A contingency table (also called a cross-tabulation) shows the count of observations for each combination of two categorical variables. pd.crosstab(df['var1'], df['var2']) builds this table directly from a DataFrame. Each cell contains the number of observations in which that row category and that column category occur together. The resulting table is the input to scipy.stats.chi2_contingency().
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
    'subscription': np.random.choice(['free', 'basic', 'pro'], 300),
    'churned': np.random.choice(['yes', 'no'], 300, p=[0.3, 0.7])
})
# Build contingency table
ct = pd.crosstab(df['subscription'], df['churned'])
print(ct)
print()
print('Row totals:', ct.sum(axis=1).to_dict())
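A contingency table like this can be passed straight to scipy.stats.chi2_contingency(). The following is a minimal sketch that rebuilds the same simulated data so it runs on its own. The 0.05 significance level is an assumed convention, not something the lesson specifies.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Rebuild the simulated data so this block is self-contained
np.random.seed(0)
df = pd.DataFrame({
    'subscription': np.random.choice(['free', 'basic', 'pro'], 300),
    'churned': np.random.choice(['yes', 'no'], 300, p=[0.3, 0.7])
})
ct = pd.crosstab(df['subscription'], df['churned'])

# Returns the test statistic, p-value, degrees of freedom,
# and the table of expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(ct)

print(f'chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}')
print(pd.DataFrame(expected, index=ct.index, columns=ct.columns).round(1))

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print('Reject independence: the variables appear to be associated.')
else:
    print('No evidence against independence at this level.')
```

The degrees of freedom equal (rows - 1) * (columns - 1), so a 3x2 table gives 2. Because the two columns here are generated independently of each other, a large p-value would be the unsurprising outcome.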