Building Scalable Data Lakes and Data Warehouses on AWS

Client: Personal
Timeframe: May 23 – June 5
Services: Amazon S3, Amazon Redshift, AWS Glue, Amazon EMR, Apache Hudi, Amazon MWAA (Apache Airflow), Amazon Athena, AWS Lake Formation
Project Overview
The project delivered a fully operational, scalable data platform on AWS, combining an S3 data lake with a Redshift data warehouse, handling large-scale data migrations through robust ETL pipelines, and laying the groundwork for future AI/ML workloads.
Implemented Components
- Data Lake & Warehouse Architecture: Used Amazon S3 as a hierarchically organized data lake, paired with Amazon Redshift for structured analytics.
- Large-scale Warehousing: Migrated terabyte-scale datasets from legacy systems (e.g., Teradata, SQL Server) into Redshift, tuning schemas, sort/distribution keys, and query performance.
- ETL & Transformation:
  - AWS Glue jobs performing schema detection, source-data cleansing, deduplication, and Parquet conversion (see the Glue sketch after this list).
  - Apache Hudi on EMR enabling SCD-style operations and incremental writes (see the Hudi sketch after this list).
  - Redshift loading via the COPY command and Glue-based upsert logic.
- Orchestration: Automated workflows via Amazon MWAA (Managed Workflows for Apache Airflow), coordinating Glue jobs, EMR tasks, and Redshift script execution with event-driven triggers and scheduling.
- Data Governance: Centralized metadata in the AWS Glue Data Catalog and enforced lake permissions via Lake Formation (see the permissions sketch after this list).
- AI/ML Foundation: Architecture designed to enable downstream ML workloads with clean, cataloged datasets, S3/Redshift integration, and schema versioning.
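A minimal sketch of the Glue job pattern above, assuming a crawler has already populated the Glue Data Catalog with detected schemas; the database, table, column, and bucket names are hypothetical:

```python
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw zone via the Glue Data Catalog (names are hypothetical).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="orders"
)

# Cleanse and deduplicate with Spark, then write Parquet to the curated zone.
df = raw.toDF().dropna(subset=["order_id"]).dropDuplicates(["order_id"])
(df.write.mode("overwrite")
   .partitionBy("order_date")
   .parquet("s3://example-curated-bucket/orders/"))

job.commit()
```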
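A minimal sketch of the Hudi upsert step, assuming it runs on an EMR cluster with the Hudi bundle on the Spark classpath; the record key, precombine field, and S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-incremental-upsert")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Incremental batch of changed rows (path is hypothetical).
updates = spark.read.parquet("s3://example-curated-bucket/orders_changes/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert merges changed rows into the existing Hudi table on S3,
# giving SCD-style updates without rewriting the full dataset.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake-bucket/hudi/orders/"))
```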
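A minimal sketch of a Lake Formation table grant via boto3; the role ARN, database, and table names are hypothetical:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role read access to one curated table, so lake
# permissions are enforced centrally rather than per-bucket.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "Table": {
            "DatabaseName": "curated_zone",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
)
```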
Tech Stack & Infrastructure
- Cloud Platform: AWS
- Storage: Amazon S3 (data lake), Amazon Redshift (data warehouse)
- ETL & Transformation: AWS Glue, Apache Hudi on EMR
- Orchestration: Amazon MWAA (Apache Airflow)
- Governance: AWS Glue Data Catalog, AWS Lake Formation
- Migration Tools: S3 → Redshift COPY, Glue incremental upsert logic (see the COPY sketch after this list)
- Formats & Modularity: Parquet, SCD, hot/cold storage strategies
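A minimal sketch of the S3 → Redshift path, assuming curated Parquet in S3, an IAM role authorized for COPY, and the sort/distribution-key tuning mentioned above; the cluster endpoint, credentials, and table definition are hypothetical. Redshift speaks the PostgreSQL wire protocol, so psycopg2 works as a client:

```python
import psycopg2

# Connection details are hypothetical.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)

ddl = """
CREATE TABLE IF NOT EXISTS orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
)
DISTKEY (customer_id)  -- co-locate rows joined on customer_id
SORTKEY (order_date);  -- prune range scans on date filters
"""

copy = """
COPY orders
FROM 's3://example-curated-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS PARQUET;
"""

# psycopg2 commits the transaction when the connection block exits cleanly.
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(copy)
```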
Key Innovations & Learnings
- Efficient Incremental Loads: Combining Glue's Parquet conversion with Hudi's upsert capabilities enables cost-effective incremental processing.
- Robust Orchestration: MWAA DAGs manage error recovery, retries, and dependencies across Glue, EMR, and Redshift tasks (see the DAG sketch after this list).
- Governed Lake-to-Warehouse Flow: Central metadata and unified access enabled secure, scalable data processing pipelines.
- Scalable Migration Strategy: Balanced hot (frequently accessed) and cold (historical) data tiers to manage cost and performance.
- Modular Architecture: Clearly separated source, transform, and loading layers with reusable components and orchestration.
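A minimal sketch of an MWAA DAG with the retry and dependency handling described above; the DAG, task, and Glue job names are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Retry defaults applied to every task in the DAG.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="lake_to_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Each task wraps one pipeline stage as a Glue job.
    cleanse = GlueJobOperator(task_id="cleanse_raw", job_name="cleanse_raw_job")
    upsert = GlueJobOperator(task_id="upsert_to_redshift", job_name="redshift_upsert_job")

    # Dependency: the upsert only runs after cleansing succeeds.
    cleanse >> upsert
```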
Outcome & Impact
- Performance & Scalability: Successfully migrated terabyte-scale legacy workloads onto a platform built for petabyte-scale analytics in Redshift.
- Operational Automation: Fully automated ETL pipelines with event-driven triggers and scheduled orchestration.
- Secure & Governed Data: Catalog-driven governance fosters data trust, discoverability, and compliance.
- ML-Ready Platform: Prepared datasets enable future AI/ML workloads through structured access paths and schema evolution support.
- Production-Grade Infrastructure: The system is designed for parallel workloads, component reuse, and cost-effective storage management.