Building Scalable Data Lakes and Data Warehouses on AWS

Client: Personal
Timeframe: May 23 – June 5
Services: Amazon S3, Amazon Redshift, AWS Glue, Amazon EMR, Apache Hudi, Amazon MWAA (Apache Airflow), Amazon Athena, AWS Lake Formation
Project Overview
The project delivered a fully operational, scalable data platform on AWS, combining an S3 data lake with a Redshift data warehouse, handling large-scale data migrations through robust ETL pipelines, and laying the groundwork for future AI/ML workloads.
Implemented Components
- Data Lake & Warehouse Architecture: Used Amazon S3 as a hierarchically organized data lake, paired with Amazon Redshift for structured analytics.
- Large-scale Warehousing: Migrated terabyte-scale datasets from legacy systems (e.g., Teradata, SQL Server) into Redshift, tuning schemas, sort/distribution keys, and query performance.
- ETL & Transformation:
  - AWS Glue jobs performing schema detection, source-data cleansing, deduplication, and Parquet conversion (see the Glue sketch after this list).
  - Apache Hudi on EMR enabling SCD-style operations and incremental writes (see the Hudi sketch after this list).
  - Redshift loading via the COPY command and Glue-based upsert logic.
- Orchestration: Automated workflows via Amazon MWAA (Managed Workflows for Apache Airflow), coordinating Glue jobs, EMR tasks, and Redshift script execution with event-driven triggers and scheduling.
- Data Governance: Centralized metadata in the AWS Glue Data Catalog and enforced lake permissions via Lake Formation (see the permissions sketch after this list).
- AI/ML Foundation: Architecture designed to enable downstream ML workloads with clean, cataloged datasets, S3/Redshift integration, and schema versioning.
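A minimal sketch of the Glue job pattern above, assuming a crawler has already populated the Glue Data Catalog with detected schemas; the database, table, column, and bucket names are hypothetical:

```python
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw zone via the Glue Data Catalog (names are hypothetical).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="orders"
)

# Cleanse and deduplicate with Spark, then write Parquet to the curated zone.
df = raw.toDF().dropna(subset=["order_id"]).dropDuplicates(["order_id"])
(df.write.mode("overwrite")
   .partitionBy("order_date")
   .parquet("s3://example-curated-bucket/orders/"))

job.commit()
```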
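A minimal sketch of the Hudi upsert step, assuming it runs on an EMR cluster with the Hudi bundle on the Spark classpath; the record key, precombine field, and S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-incremental-upsert")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Incremental batch of changed rows (path is hypothetical).
updates = spark.read.parquet("s3://example-curated-bucket/orders_changes/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert merges changed rows into the existing Hudi table on S3,
# giving SCD-style updates without rewriting the full dataset.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake-bucket/hudi/orders/"))
```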
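A minimal sketch of a Lake Formation table grant via boto3; the role ARN, database, and table names are hypothetical:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role read access to one curated table, so lake
# permissions are enforced centrally rather than per-bucket.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "Table": {
            "DatabaseName": "curated_zone",
            "Name": "orders",
        }
    },
    Permissions=["SELECT"],
)
```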
Tech Stack & Infrastructure
- Cloud Platform: AWS
- Storage: Amazon S3 (data lake), Amazon Redshift (data warehouse)
- ETL & Transformation: AWS Glue, Apache Hudi on EMR
- Orchestration: Amazon MWAA (Apache Airflow)
- Governance: AWS Glue Data Catalog, AWS Lake Formation
- Migration Tools: S3 → Redshift COPY, Glue incremental upsert logic (see the COPY sketch after this list)
- Formats & Modularity: Parquet, SCD, hot/cold storage strategies
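A minimal sketch of the S3 → Redshift path, assuming curated Parquet in S3, an IAM role authorized for COPY, and the sort/distribution-key tuning mentioned above; the cluster endpoint, credentials, and table definition are hypothetical. Redshift speaks the PostgreSQL wire protocol, so psycopg2 works as a client:

```python
import psycopg2

# Connection details are hypothetical.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)

ddl = """
CREATE TABLE IF NOT EXISTS orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12, 2)
)
DISTKEY (customer_id)  -- co-locate rows joined on customer_id
SORTKEY (order_date);  -- prune range scans on date filters
"""

copy = """
COPY orders
FROM 's3://example-curated-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS PARQUET;
"""

# psycopg2 commits the transaction when the connection block exits cleanly.
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute(copy)
```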
Key Innovations & Learnings
- Efficient Incremental Loads: Combining Glue's Parquet conversion with Hudi's upsert capabilities enables cost-effective incremental processing.
- Robust Orchestration: MWAA DAGs manage error recovery, retries, and dependencies across Glue, EMR, and Redshift tasks (see the DAG sketch after this list).
- Governed Lake-to-Warehouse Flow: Central metadata and unified access enabled secure, scalable data processing pipelines.
- Scalable Migration Strategy: Balanced hot (frequently accessed) and cold (historical) data tiers to manage cost and performance.
- Modular Architecture: Clearly separated source, transform, and loading layers with reusable components and orchestration.
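A minimal sketch of an MWAA DAG with the retry and dependency handling described above; the DAG, task, and Glue job names are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Retry defaults applied to every task in the DAG.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="lake_to_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Each task wraps one pipeline stage as a Glue job.
    cleanse = GlueJobOperator(task_id="cleanse_raw", job_name="cleanse_raw_job")
    upsert = GlueJobOperator(task_id="upsert_to_redshift", job_name="redshift_upsert_job")

    # Dependency: the upsert only runs after cleansing succeeds.
    cleanse >> upsert
```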
Outcome & Impact
- Performance & Scalability: Successfully migrated terabyte-scale legacy workloads onto a platform built for petabyte-scale analytics in Redshift.
- Operational Automation: Fully automated ETL pipelines with event-driven triggers and scheduled orchestration.
- Secure & Governed Data: Catalog-driven governance fosters data trust, discoverability, and compliance.
- ML-Ready Platform: Prepared datasets enable future AI/ML workloads through structured access paths and schema evolution support.
- Production-Grade Infrastructure: The system is designed for parallel workloads, component reuse, and cost-effective storage management.