Partitioning S3 Data for Query Performance
Learners will define partition attributes on S3 paths so Data Federation can prune irrelevant files and deliver fast analytical queries.
Why Partitioning S3 Data Matters
In Atlas Data Federation, every query against an S3-backed virtual collection potentially reads many files. Without partitioning, even a query asking for a single day's data might scan an entire year of files. Partitioning organises S3 objects into a directory structure that encodes queryable metadata in the path, allowing the query engine to skip irrelevant files — a technique called partition pruning.
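As a rough illustration of what "encoding queryable metadata in the path" can look like on the writing side, the sketch below builds a date-partitioned object key. The helper name `partitionedKey`, the `events` prefix, and the unpadded year/month/day layout are assumptions chosen for illustration; your own pipeline may lay out keys differently.

```javascript
// Hypothetical helper (not part of any Atlas or AWS SDK): builds an S3 object
// key whose path segments carry the year, month, and day of the event.
function partitionedKey(prefix, date, fileName) {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const day = date.getUTCDate();
  return `/${prefix}/${year}/${month}/${day}/${fileName}`;
}

// Example: an event recorded on March 15, 2025 (UTC).
const key = partitionedKey("events", new Date(Date.UTC(2025, 2, 15)), "data.parquet");
console.log(key); // prints "/events/2025/3/15/data.parquet"
```

UTC getters are used here so that the same timestamp always lands in the same partition regardless of the writer's local time zone.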
Partition Pruning: The Core Mechanism
Partition pruning works because Atlas Data Federation parses each S3 object key (path) and extracts the values of the partition attributes defined in the storage configuration. When a query filters on one of these attributes, the query engine reads only the objects whose path values match and skips all others, without even issuing S3 GetObject requests for them.
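To make the idea concrete, here is a small simulation of pruning in plain JavaScript. It is only a sketch of the concept, not how Atlas Data Federation is implemented: the functions `compileTemplate`, `extractPartition`, and `pruneKeys` are hypothetical, the sample keys are made up, and the sketch filters an in-memory list of keys rather than talking to S3.

```javascript
// Illustrative only: mimics the idea of partition pruning, not Atlas internals.

// Turn a template such as "/events/{year int}/{month int}/{day int}/data.parquet"
// into a regular expression with one named capture group per attribute.
function compileTemplate(template) {
  const attributes = [];
  const source = template
    .split(/(\{[^}]+\})/) // keep the {name type} placeholders as separate parts
    .map((part) => {
      const match = part.match(/^\{(\w+) (\w+)\}$/);
      if (!match) {
        return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape literal text
      }
      const [, name, type] = match;
      attributes.push({ name, type });
      return type === "int" ? `(?<${name}>\\d+)` : `(?<${name}>[^/]+)`;
    })
    .join("");
  return { regex: new RegExp(`^${source}$`), attributes };
}

// Extract partition values from one object key, or null if the key
// does not fit the template.
function extractPartition(compiled, key) {
  const match = compiled.regex.exec(key);
  if (!match) return null;
  const values = {};
  for (const { name, type } of compiled.attributes) {
    values[name] = type === "int" ? Number(match.groups[name]) : match.groups[name];
  }
  return values;
}

// Keep only the keys whose partition values agree with the query filter.
// Filter fields that are not partition attributes cannot prune anything.
function pruneKeys(template, keys, filter) {
  const compiled = compileTemplate(template);
  return keys.filter((key) => {
    const values = extractPartition(compiled, key);
    if (values === null) return false;
    return Object.entries(filter).every(
      ([field, wanted]) => !(field in values) || values[field] === wanted
    );
  });
}

const template = "/events/{year int}/{month int}/{day int}/data.parquet";
const sampleKeys = [
  "/events/2024/3/15/data.parquet",
  "/events/2025/3/14/data.parquet",
  "/events/2025/3/15/data.parquet",
];
const survivors = pruneKeys(template, sampleKeys, { year: 2025, month: 3, day: 15 });
console.log(survivors); // prints [ '/events/2025/3/15/data.parquet' ]
```

Only one of the three sample keys survives, which is the point of pruning: the other two objects never need to be read. A filter on a field that is not encoded in the path leaves every key in place, since the path alone cannot rule any file out.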
// Path template with partition attributes
// /events/{year int}/{month int}/{day int}/data.parquet
// Query: fetch March 15, 2025 data
db.events.find({ year: 2025, month: 3, day: 15 })
// Data Federation issues S3 ListObjects only for:
// /events/2025/3/15/
// All other years/months/days are skipped