Multivariate Statistics for Correlated Columns
Create CREATE STATISTICS objects to capture dependencies the planner assumes are independent.
The Independence Assumption
When PostgreSQL estimates how many rows a query will return, it leans on per-column statistics stored in pg_statistic. To combine predicates on multiple columns, the planner makes a crucial simplifying assumption: the columns are statistically independent.
Under independence, the selectivity of WHERE a = 1 AND b = 2 is computed as sel(a=1) * sel(b=2). That multiplication is fast and correct — but only when the columns truly are unrelated.
In real schemas, columns are frequently correlated: a city implies a postal code, a product implies a category, an order date implies a fiscal quarter. When the planner multiplies selectivities for correlated columns, its estimate collapses far below reality.
How Bad Estimates Hurt You
A row-count estimate that is off by orders of magnitude steers the planner toward the wrong plan:
- Underestimate → planner picks a nested loop expecting 3 rows, but 300,000 arrive → the loop executes its inner side hundreds of thousands of times.
- Underestimate → planner chooses an index scan + heap fetches instead of a single sequential scan that would have been cheaper.
- Bad join order → a large intermediate result is materialized early, blowing up memory and spilling to disk.
The symptom you see in EXPLAIN ANALYZE is a wide gap between rows= (estimated) and actual rows=. That gap is your signal that correlation may be the culprit.
EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'New York'
AND state = 'NY';All lessons in this course
- How the Planner Estimates Row Counts
- Multivariate Statistics for Correlated Columns
- MCV and N-Distinct Corrections
- Validating Estimates Against Actual Rows