0Pricing
PostgreSQL Performance & Query Optimization · Lesson

Multivariate Statistics for Correlated Columns

Create CREATE STATISTICS objects to capture dependencies the planner assumes are independent.

The Independence Assumption

When PostgreSQL estimates how many rows a query will return, it leans on per-column statistics stored in pg_statistic. To combine predicates on multiple columns, the planner makes a crucial simplifying assumption: the columns are statistically independent.

Under independence, the selectivity of WHERE a = 1 AND b = 2 is computed as sel(a=1) * sel(b=2). That multiplication is fast and correct — but only when the columns truly are unrelated.

In real schemas, columns are frequently correlated: a city implies a postal code, a product implies a category, an order date implies a fiscal quarter. When the planner multiplies selectivities for correlated columns, its estimate collapses far below reality.

How Bad Estimates Hurt You

A row-count estimate that is off by orders of magnitude steers the planner toward the wrong plan:

  • Underestimate → planner picks a nested loop expecting 3 rows, but 300,000 arrive → the loop executes its inner side hundreds of thousands of times.
  • Underestimate → planner chooses an index scan + heap fetches instead of a single sequential scan that would have been cheaper.
  • Bad join order → a large intermediate result is materialized early, blowing up memory and spilling to disk.

The symptom you see in EXPLAIN ANALYZE is a wide gap between rows= (estimated) and actual rows=. That gap is your signal that correlation may be the culprit.

EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'New York'
  AND state = 'NY';

All lessons in this course

  1. How the Planner Estimates Row Counts
  2. Multivariate Statistics for Correlated Columns
  3. MCV and N-Distinct Corrections
  4. Validating Estimates Against Actual Rows
← Back to PostgreSQL Performance & Query Optimization