AWS Glue: ETL and Data Catalogue
Run serverless ETL jobs with AWS Glue, register table schemas in the Glue Data Catalogue, and crawl new data automatically.
What Is AWS Glue?
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service. You do not provision or manage any servers — Glue allocates Spark or Python workers automatically when a job runs. Glue consists of two main components: the ETL engine for data transformation jobs and the Data Catalogue for storing metadata about your data sources and targets.
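As a rough sketch of what running a job looks like from the command line, the helper functions below wrap `aws glue start-job-run` and `aws glue get-job-run`. The function names and the job name `my_etl_job` are hypothetical and not part of this lesson; the sketch assumes a job with that name already exists and that the AWS CLI is configured with credentials.

```shell
# Start a run of an existing Glue job and print the new run's ID.
# Assumption: the AWS CLI is configured and the job has already been created.
start_glue_job() {
  aws glue start-job-run \
    --job-name "$1" \
    --query 'JobRunId' \
    --output text
}

# Print the current state of a job run (for example RUNNING or SUCCEEDED).
# $1 is the job name, $2 is the run ID returned by start_glue_job.
glue_job_state() {
  aws glue get-job-run \
    --job-name "$1" \
    --run-id "$2" \
    --query 'JobRun.JobRunState' \
    --output text
}

# Example usage (hypothetical job name):
#   run_id=$(start_glue_job my_etl_job)
#   glue_job_state my_etl_job "$run_id"
```

Nothing here provisions workers explicitly: the job definition carries the worker settings, and Glue allocates them when the run starts.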
Glue Data Catalogue Explained
The Glue Data Catalogue is a centralised metadata repository compatible with the Apache Hive Metastore. It stores databases, tables, column definitions, data types, and partition information. Services such as Athena, Redshift Spectrum, and EMR all use the same Data Catalogue, making it a single source of truth for what data exists and where it lives in S3.
# List databases in the Glue Data Catalogue
aws glue get-databases --query 'DatabaseList[*].Name'
# List tables in a database
aws glue get-tables --database-name my_db --query 'TableList[*].Name'
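To look inside a single table, the column definitions and partition information held in the Data Catalogue can be read with `aws glue get-table` and `aws glue get-partitions`. The sketch below is illustrative only: the function names and the table name `my_table` are hypothetical, and `my_db` is reused from the commands above.

```shell
# Print each column's name and data type for one table, tab-separated.
# $1 is the database name, $2 is the table name.
# Note: partition key columns are stored separately under Table.PartitionKeys.
describe_table_columns() {
  aws glue get-table \
    --database-name "$1" \
    --name "$2" \
    --query 'Table.StorageDescriptor.Columns[*].[Name,Type]' \
    --output text
}

# Print the partition values registered for a table.
# $1 is the database name, $2 is the table name.
list_table_partitions() {
  aws glue get-partitions \
    --database-name "$1" \
    --table-name "$2" \
    --query 'Partitions[*].Values' \
    --output text
}

# Example usage (hypothetical table name):
#   describe_table_columns my_db my_table
#   list_table_partitions my_db my_table
```

Because Athena, Redshift Spectrum, and EMR read the same catalogue, the schema these commands print is the one those services see.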