RDDs and DataFrames
Spark data abstractions.
What Is Apache Spark?
Apache Spark is a distributed engine for large-scale data processing. It splits data into partitions spread across a cluster and runs computations on those partitions in parallel. Spark itself is written in Scala, and its Scala API is concise and statically typed.
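To make the split-and-compute idea concrete, here is a minimal sketch in Scala. It assumes the Spark SQL library is on the classpath and runs in local mode (`local[*]`), so the "cluster" is just the cores of one machine. The object name `ParallelSum`, the helper `parallelSum`, the app name, and the sample range are all illustrative choices, not part of the lesson.

```scala
import org.apache.spark.sql.SparkSession

object ParallelSum {
  // Hypothetical helper: distributes the range 1..1000 over `slices`
  // partitions and adds the numbers up. Each partition is reduced in
  // parallel, then the partial results are combined.
  def parallelSum(spark: SparkSession, slices: Int): Int = {
    val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = slices)
    numbers.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, using every available core on this machine.
    val spark = SparkSession.builder()
      .appName("parallel-sum-sketch")
      .master("local[*]")
      .getOrCreate()

    try {
      println(parallelSum(spark, slices = 4)) // prints 500500
    } finally {
      spark.stop()
    }
  }
}
```

On a real cluster the same code would run unchanged apart from the master setting, which is typically supplied when the job is submitted rather than hard-coded.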
The RDD
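The Resilient Distributed Dataset (RDD) is Spark's low-level abstraction: an immutable, partitioned collection that is processed in parallel. Each RDD records its lineage, the chain of transformations that produced it, so partitions lost when a node fails can be recomputed instead of restored from a replica.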
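The three properties named above (immutability, partitioning, lineage) can be observed directly with a small sketch. It assumes Spark running in local mode; the object name `RddProperties`, the helper `evensFrom`, and the sample data are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddProperties {
  // Hypothetical helper: derives a new RDD from `base` in two steps.
  // Neither step modifies `base`; each returns a new RDD.
  def evensFrom(base: RDD[Int]): RDD[Int] =
    base.map(_ * 2).filter(_ % 4 == 0)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode stands in for a cluster.
    val spark = SparkSession.builder()
      .appName("rdd-properties-sketch")
      .master("local[*]")
      .getOrCreate()

    try {
      // Partitioned: six elements spread over three partitions.
      val base = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6), numSlices = 3)
      val evens = evensFrom(base)

      // Immutable: the source RDD still holds its original elements.
      println(base.collect().mkString(", "))  // 1, 2, 3, 4, 5, 6
      println(evens.collect().mkString(", ")) // 4, 8, 12

      // map and filter keep the partitioning of their parent.
      println(evens.getNumPartitions) // 3

      // Lineage: the chain of parent RDDs Spark would replay to rebuild
      // a lost partition. The exact text varies between Spark versions.
      println(evens.toDebugString)
    } finally {
      spark.stop()
    }
  }
}
```

The output of `toDebugString` lists the filter and map steps back to the original parallelized collection, which is the information Spark uses for recovery.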