MongoDB Academy · Lesson

Running Cross-Source Aggregation Pipelines

Learners will write aggregation pipelines that join Atlas collection data with S3-stored JSON or Parquet files in a single query.

What Makes Cross-Source Pipelines Special?

A cross-source aggregation pipeline in Atlas Data Federation uses the same aggregation stages you know from MongoDB, but the data feeding each stage may come from different physical systems: an S3 bucket, a live Atlas cluster, or both. The federated query engine handles routing, fan-out, and result merging, so from your application's perspective the pipeline looks like a query against a single MongoDB collection.
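As a minimal sketch of what such a pipeline might look like, the function below builds a join between an S3-backed collection and an Atlas-backed collection. The collection name 'customers' and the fields customerId and tier are hypothetical and chosen only for illustration; 'events', year, month, and eventType follow the example used later in this lesson. The code only constructs the pipeline array, so it runs without a database connection.

```javascript
// Sketch only: 'customers', 'customerId', and 'tier' are hypothetical names.
// 'events'    -> virtual collection assumed to be backed by S3 files
// 'customers' -> virtual collection assumed to be backed by an Atlas cluster
function buildPurchaseJoinPipeline(year, month) {
  return [
    // Filter first; year and month are assumed to be partition attributes
    { $match: { year: year, month: month, eventType: 'purchase' } },
    // Join each event to its customer document in the Atlas-backed collection
    {
      $lookup: {
        from: 'customers',
        localField: 'customerId',
        foreignField: '_id',
        as: 'customer'
      }
    },
    // $lookup produces an array; unwind it to one customer per event
    { $unwind: '$customer' },
    // Count purchases per customer tier
    { $group: { _id: '$customer.tier', purchases: { $sum: 1 } } }
  ];
}

const pipeline = buildPurchaseJoinPipeline(2025, 1);
console.log(JSON.stringify(pipeline, null, 2));
```

With a Node.js driver handle db connected to the federated database instance, the pipeline would be passed to db.collection('events').aggregate(pipeline).toArray(). The stages are written exactly as they would be for a single cluster; which physical system serves each collection is decided by the federation configuration, not by the pipeline.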

Simple Cross-Source Find

The simplest cross-source query is a find() on a virtual collection backed by S3 files. The query engine reads the files, parses them, and applies the filter. Filter fields that match partition attributes defined in the path let the engine skip files whose paths cannot match (automatic file pruning). Fields that are not partition attributes are applied as a post-read filter, after the file contents have been parsed.

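// Virtual collection 'events' backed by S3 JSON files
// Path: /data/events/{year int}/{month int}/*.json
// Assumes `db` is a Node.js driver Db handle for the federated database
// instance, and that this code runs where `await` is allowed.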

// This query prunes to /data/events/2025/1/ only
const jan2025 = await db.collection('events').find({
  year: 2025,
  month: 1,
  eventType: 'purchase'   // post-read filter (not a partition attr)
}).toArray()
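
To make the distinction between pruning fields and post-read fields concrete, the sketch below models it in plain JavaScript. It is an illustrative model only, not the query engine's actual implementation: the function name splitFilterByPartition is hypothetical, and it handles only simple equality filters and {name type} placeholders like those in the path above.

```javascript
// Illustrative model only; not how the federated query engine is implemented.
// Splits a simple equality filter into fields that can prune files (they match
// a {name type} placeholder in the path) and fields applied after reading.
function splitFilterByPartition(pathTemplate, filter) {
  // Collect attribute names from placeholders such as {year int}
  const partitionAttrs = new Set();
  for (const m of pathTemplate.matchAll(/\{(\w+)(?:\s+\w+)?\}/g)) {
    partitionAttrs.add(m[1]);
  }

  const pruning = {};
  const postRead = {};
  for (const [field, value] of Object.entries(filter)) {
    if (partitionAttrs.has(field)) {
      pruning[field] = value;   // can narrow the set of files to read
    } else {
      postRead[field] = value;  // checked against parsed documents
    }
  }
  return { pruning, postRead };
}

const plan = splitFilterByPartition(
  '/data/events/{year int}/{month int}/*.json',
  { year: 2025, month: 1, eventType: 'purchase' }
);
console.log(plan);
```

For the query above, year and month land in pruning and eventType lands in postRead, which mirrors the comments in the find() example.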
