Scalable Data Lakes and Data Warehouses on AWS

Building Scalable Data Lakes and Data Warehouses on AWS


Client

Personal

Timeframe

May 23 - June 5

Services

Amazon S3, Amazon Redshift, AWS Glue, Amazon EMR, Apache Hudi, Amazon MWAA (Apache Airflow), Amazon Athena, AWS Lake Formation

Project Overview

The project delivered a fully operational, scalable AWS data platform combining a data lake (Amazon S3) with a data warehouse (Amazon Redshift). It handled large-scale data migrations, ran robust ETL pipelines, and was built for future AI/ML readiness.

Implemented Components

  • Data Lake & Warehouse Architecture: Used Amazon S3 as a layered data lake, supported by Amazon Redshift for structured analytics.
  • Large-scale Warehousing: Migrated terabyte-scale datasets from legacy systems (e.g., Teradata, SQL Server) into Redshift, optimizing schema design, sort/distribution keys, and query performance.
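As a sketch of the key-tuning step, the helper below assembles a Redshift CREATE TABLE statement with explicit distribution and sort keys. The table and column names are illustrative, not taken from the actual migration:

```python
def redshift_ddl(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement with explicit
    DISTKEY and COMPOUND SORTKEY clauses."""
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n    {cols}\n)\n"
        f"DISTKEY({dist_key})\n"
        f"COMPOUND SORTKEY({', '.join(sort_keys)});"
    )

# Hypothetical fact table migrated from the legacy warehouse:
ddl = redshift_ddl(
    "sales_fact",
    [("order_id", "BIGINT"), ("customer_id", "BIGINT"),
     ("order_date", "DATE"), ("amount", "DECIMAL(12,2)")],
    dist_key="customer_id",    # co-locate rows joined on customer
    sort_keys=["order_date"],  # enable range-restricted scans by date
)
print(ddl)
```

Distributing on the most common join key and sorting on the dominant filter column is what lets Redshift skip blocks and avoid cross-node shuffles on large scans.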

  • ETL & Transformation: AWS Glue jobs performing schema detection, source-data cleansing, deduplication, and Parquet conversion; Apache Hudi on Amazon EMR enabling SCD-style operations and incremental writes; and Redshift loading via COPY and Glue upsert logic.
  • Orchestration: Automated workflows via Amazon MWAA (Managed Workflows for Apache Airflow), coordinating Glue jobs, EMR tasks, and Redshift script execution with event-driven triggers and scheduling.
  • Data Governance: Centralized metadata in the AWS Glue Data Catalog, with lake permissions enforced via AWS Lake Formation.
  • AI/ML Foundation: Architecture designed to enable downstream ML workloads, with clean, cataloged datasets, S3/Redshift integration, and schema versioning.
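The cleansing step keeps only the latest version of each record before Parquet conversion. The real jobs run as PySpark on Glue; this minimal pure-Python sketch (with hypothetical `id`/`updated_at` field names) illustrates the deduplication rule:

```python
def deduplicate(records, key="id", ts="updated_at"):
    """Keep only the most recent record per business key --
    the rule a cleansing job applies before writing Parquet."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[ts] > latest[k][ts]:
            latest[k] = rec
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": "2024-05-01", "status": "new"},
    {"id": 1, "updated_at": "2024-05-03", "status": "shipped"},
    {"id": 2, "updated_at": "2024-05-02", "status": "new"},
]
clean = deduplicate(rows)  # one row per id, latest timestamp wins
```

In the actual pipeline the same "latest wins" semantics come from a window/rank over the key in PySpark, or from Hudi's record-key precombine behavior.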

Tech Stack & Infrastructure

  • Cloud Platform: AWS
  • Storage: Amazon S3 (data lake), Amazon Redshift (data warehouse)
  • ETL & Transformation: AWS Glue, Apache Hudi on EMR
  • Orchestration: Amazon MWAA (Apache Airflow)
  • Governance: AWS Glue Data Catalog, AWS Lake Formation
  • Migration Tools: S3 → Redshift COPY, Glue incremental upsert logic
  • Formats & Modularity: Parquet, SCD, hot/cold storage strategies
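Hudi performs the SCD-style merges at scale on EMR; for intuition, here is an in-memory sketch of a Type-2 upsert. The `start_date`/`end_date` column names are illustrative assumptions, not the project's actual schema:

```python
def scd2_upsert(dim_rows, incoming, key, today):
    """SCD Type-2 upsert: close out the current version of any changed
    record (set its end_date) and append the new version as open-ended.
    Hudi performs the equivalent merge at scale on EMR."""
    out = [dict(r) for r in dim_rows]  # don't mutate the input
    current = {r[key]: r for r in out if r["end_date"] is None}
    for rec in incoming:
        cur = current.get(rec[key])
        unchanged = cur is not None and all(
            cur.get(f) == v for f, v in rec.items() if f != key
        )
        if unchanged:
            continue                    # nothing to do for this key
        if cur is not None:
            cur["end_date"] = today     # close the old version
        out.append({**rec, "start_date": today, "end_date": None})
    return out

dim = [{"id": 1, "city": "NYC", "start_date": "2024-01-01", "end_date": None}]
incoming = [{"id": 1, "city": "LA"}, {"id": 2, "city": "SF"}]
updated = scd2_upsert(dim, incoming, key="id", today="2024-06-01")
```

Keeping closed-out versions (rather than overwriting) is what preserves full history for point-in-time analytics downstream.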

Key Innovations & Learnings

  • Efficient Incremental Loads: Combining Glue's unified Parquet conversion with Hudi's upsert capabilities enables cost-effective incremental processing.
  • Robust Orchestration: MWAA DAGs manage error recovery, retries, and dependencies across Glue, EMR, and Redshift tasks.
  • Governed Lake-to-Warehouse Flow: Central metadata and unified access enabled secure, scalable data processing pipelines.
  • Scalable Migration Strategy: Balanced hot (frequently accessed) vs cold (historical) data to manage costs and performance.
  • Modular Architecture: Clearly separated source, transform, and loading layers with reusable components and orchestration.
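The hot/cold split above can be sketched as a simple tiering rule. The 90-day threshold here is an assumed cutoff for illustration, not the project's actual policy:

```python
from datetime import date, timedelta

def storage_tier(last_access: date, today: date, hot_days: int = 90) -> str:
    """Data touched within `hot_days` stays on the hot path
    (S3 Standard / Redshift local storage); older data moves to the
    cold path (e.g. archival S3 tiers queried via external tables).
    NOTE: the 90-day default is a hypothetical cutoff."""
    age = today - last_access
    return "hot" if age <= timedelta(days=hot_days) else "cold"

today = date(2024, 6, 1)
recent = storage_tier(date(2024, 5, 1), today)   # within 90 days
archive = storage_tier(date(2023, 1, 1), today)  # historical data
```

A rule like this, applied per partition, keeps frequently queried data on fast storage while historical partitions age out to cheaper tiers.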

Outcome & Impact

  • Performance & Scalability: Successfully migrated legacy systems, enabling petabyte-scale analytics in Redshift.
  • Operational Automation: Fully automated ETL pipelines with event-driven triggers and scheduled orchestration.
  • Secure & Governed Data: Catalog-driven governance fosters data trust, discoverability, and compliance.
  • ML-Ready Platform: Prepared datasets support future AI/ML workloads with structured access paths and schema evolution support.
  • Production-Grade Infrastructure: System is designed for parallel workloads, reuse, and cost-effective storage management.