Data Lake ETL Process Overhaul


Architectural Diagram of the ETL Pipeline

Project Description

Background Info

JSON-encoded data was continuously collected through webhooks from various systems, including order, payment gateway, and shipping systems.

This data then flowed through an ingress pipeline and into a data lake.
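As background only, the sketch below shows one way such a webhook receiver could land raw JSON in the data lake. The Flask endpoint, bucket name, and key layout are illustrative assumptions, not the original ingress implementation.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")
DATA_LAKE_BUCKET = "example-data-lake"  # hypothetical bucket name


@app.route("/webhooks/shipping", methods=["POST"])
def shipping_webhook():
    """Accept a JSON webhook payload and store it as a raw object in S3."""
    payload = request.get_json(force=True)
    # Partition raw events by date so downstream crawlers can group them easily
    now = datetime.now(timezone.utc)
    key = f"shipping/dt={now:%Y-%m-%d}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=json.dumps(payload))
    return {"status": "accepted"}, 202
```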

Problem

An AWS cost analysis revealed high costs associated with that data ingress pipeline.

This pipeline was built around a PostgreSQL AWS RDS instance and its associated transformation jobs.

Solution

The PostgreSQL AWS RDS instance and its associated jobs needed to be decommissioned and replaced with AWS Glue ETL jobs, which are better suited to transforming the irregular, unstructured data held in the data lake.

Contributions

My team and I established and refined the design and architecture of the new data pipelines.

I handled the shipping data while my teammates handled the order and payment gateway data.

My shipping data pipeline did the following (a boto3 setup sketch follows the list):

  1. AWS Glue Workflow: triggers a Glue crawler once a day
  2. Glue Crawler: crawls the raw shipping data in S3 and groups it into tables and partitions
  3. Glue ETL: extracts those tables and partitions, transforms the data, and loads it into Parquet files in another S3 bucket
  4. Glue Crawler: crawls the Parquet files in AWS S3 and loads them into the AWS Glue Data Catalog
  5. AWS Athena: exposes the shipping data from the Data Catalog as Athena tables and views
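As a rough illustration of steps 1, 2, and 4, the boto3 sketch below creates the raw-data crawler, the workflow, and a daily scheduled trigger. Every name here (crawler, workflow, IAM role, database, bucket) is a hypothetical placeholder rather than the actual configuration.

```python
import boto3

glue = boto3.client("glue")  # assumes AWS credentials and region are configured

# Crawler over the raw shipping JSON in the data lake (all names are hypothetical)
glue.create_crawler(
    Name="shipping-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="shipping_raw",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/shipping/"}]},
)

# Workflow plus a scheduled trigger that starts the crawler once a day
glue.create_workflow(Name="shipping-etl-workflow")
glue.create_trigger(
    Name="shipping-daily-crawl",
    WorkflowName="shipping-etl-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",  # 06:00 UTC daily
    Actions=[{"CrawlerName": "shipping-raw-crawler"}],
    StartOnCreation=True,
)
```

In the real workflow, the ETL job and the second crawler would be chained with conditional triggers that fire when the preceding step succeeds.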

The Glue ETL job was written in Python and ran on Apache Spark; a condensed sketch of the script follows the list below.

  1. Extract: Imported the previously crawled tables and partitions
  2. Transform: Handled data conversions, resolved choice (ambiguous) types, and mapped fields
  3. Load: Wrote the transformed data to Parquet files on AWS S3
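The sketch below shows the general shape such a Glue ETL script can take, assuming a hypothetical catalog database, table, field mapping, and output bucket; it is not the production script, just the extract/transform/load pattern described above.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, ResolveChoice
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the tables/partitions the first crawler built from the raw JSON
shipping = glue_context.create_dynamic_frame.from_catalog(
    database="shipping_raw",        # hypothetical catalog database
    table_name="shipping_events",   # hypothetical crawled table
)

# Transform: resolve ambiguous (choice) types and map/rename fields.
# Field names and types are hypothetical examples, not the real schema.
shipping = ResolveChoice.apply(
    frame=shipping,
    specs=[("shipmentCost", "cast:double")],  # e.g. a field crawled as int-or-string
)
shipping = ApplyMapping.apply(
    frame=shipping,
    mappings=[
        ("trackingNumber", "string", "tracking_number", "string"),
        ("shippedAt", "string", "shipped_at", "timestamp"),
        ("shipmentCost", "double", "shipment_cost", "double"),
        ("carrier", "string", "carrier", "string"),
    ],
)

# Load: write partitioned Parquet into the curated bucket
glue_context.write_dynamic_frame.from_options(
    frame=shipping,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated-bucket/shipping/",  # hypothetical bucket
        "partitionKeys": ["carrier"],
    },
    format="parquet",
)

job.commit()
```

Once the second crawler registers the Parquet output in the Data Catalog, Athena can query it directly.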

Challenges Overcome

The biggest challenges were:

Accomplishments

Technologies Used

  • Languages & Data Formats: Python, JSON
  • Concepts: Data Lakes, ETL Jobs, Data Crawlers, Glue Workflow, Webhooks
  • Tools: pip3, PySpark, AWS Athena, Jupyter Notebooks
  • Stack: Apache Spark, AWS S3, AWS Glue, AWS Athena