Data Engineering Bootcamp
- May 21
Updated: 2 days ago
Build scalable, compliant, enterprise-grade data pipelines with strict governance.
By the end of 60 days, learners will be able to:
01 Python for Data Engineering | Week 1 | |
Topics covered
Module project: Retail Sales Analyser
Load a messy CSV of retail transactions, clean it with Pandas, compute monthly revenue per category, and produce a 4-panel EDA report saved as PNG.
Tools: Python · Pandas · NumPy · Matplotlib
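As a rough illustration of the cleaning and aggregation steps in this project, the sketch below works on a tiny in-memory sample. The column names (`order_date`, `category`, `amount`) and the cleaning rules are assumptions, since the project's actual CSV schema is not given here.

```python
import io

import pandas as pd

# Hypothetical messy sample; the real project CSV and its columns are not specified.
RAW_CSV = """order_date,category,amount
2024-01-05,Toys,10.50
2024-01-17, toys ,4.50
2024-02-02,Books,20.00
not-a-date,Books,5.00
2024-02-10,Books,
"""


def monthly_revenue(csv_text: str) -> pd.DataFrame:
    """Clean the raw rows and return revenue per (month, category)."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Coerce unparseable dates and amounts to NaT/NaN so they can be dropped explicitly.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df.dropna(subset=["order_date", "amount"]).copy()
    # Normalise category labels such as " toys " to "Toys".
    df["category"] = df["category"].str.strip().str.title()
    df["month"] = df["order_date"].dt.to_period("M").astype(str)
    return (
        df.groupby(["month", "category"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )


revenue = monthly_revenue(RAW_CSV)
print(revenue)
```

The resulting frame is already in a shape that one panel of the EDA report could plot directly, for example with `revenue.pivot(index="month", columns="category", values="revenue").plot()`.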
02 SQL for Data Engineering | Week 2 | |
Topics covered
Module project: E-commerce KPI Dashboard Queries
Given an orders + customers + products schema, write a suite of 10 SQL queries covering cohort retention, 30-day rolling revenue, top-N products by region, and customer LTV.
Tools: SQL · Databricks SQL · AWS RDS
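One of the listed queries, 30-day rolling revenue, could look roughly like the sketch below. It runs against an in-memory SQLite database so it is self-contained; the `orders` table and its columns are assumed, and the `date(..., '-29 days')` call is SQLite syntax that would need translating for Databricks SQL or RDS engines.

```python
import sqlite3

# Hypothetical minimal schema; the course's actual tables are not shown.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    INSERT INTO orders (order_date, amount) VALUES
        ('2024-01-01', 100.0),
        ('2024-01-15', 50.0),
        ('2024-01-31', 25.0),
        ('2024-02-20', 10.0);
    """
)

# For each day with orders, sum revenue over that day and the 29 days before it.
ROLLING_30_DAY_REVENUE = """
WITH daily AS (
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
)
SELECT d.order_date,
       (SELECT SUM(p.revenue)
        FROM daily AS p
        WHERE p.order_date BETWEEN date(d.order_date, '-29 days') AND d.order_date
       ) AS revenue_30d
FROM daily AS d
ORDER BY d.order_date;
"""

rolling = conn.execute(ROLLING_30_DAY_REVENUE).fetchall()
for order_date, revenue_30d in rolling:
    print(order_date, revenue_30d)
```

A correlated subquery is used here instead of a window frame because it behaves the same across most engines; on a warehouse that supports range-based window frames over dates, a window function would usually be the more idiomatic choice.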
03 Apache Spark | Week 3 | |
Topics covered
Module project: Web Server Log Processor
Parse 1 GB of Apache access logs with PySpark, count requests per endpoint, flag anomalous IP bursts, and write a partitioned Parquet output for downstream querying.
Tools: PySpark · Databricks · Parquet
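The parsing and counting logic of this project can be prototyped in plain Python before moving it to PySpark. The sketch below is not Spark code: it assumes the logs follow the Apache Common Log Format, and the burst rule (three or more requests from one IP within the same minute) is an arbitrary threshold chosen for illustration.

```python
import re
from collections import Counter

# Pattern for the Common Log Format; the course's actual log layout is an assumption.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<endpoint>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE_LINES = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:55:40 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:55:59 +0000] "POST /login HTTP/1.1" 302 0',
    '10.0.0.2 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 2326',
    "this line is malformed",
]


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarise(lines, burst_threshold=3):
    """Count requests per endpoint and flag (ip, minute) pairs at or above the threshold."""
    endpoint_counts = Counter()
    per_ip_minute = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip malformed lines instead of failing the whole run
        endpoint_counts[record["endpoint"]] += 1
        minute = record["ts"][:17]  # e.g. "10/Oct/2023:13:55"
        per_ip_minute[(record["ip"], minute)] += 1
    bursts = {k: n for k, n in per_ip_minute.items() if n >= burst_threshold}
    return endpoint_counts, bursts


endpoint_counts, bursts = summarise(SAMPLE_LINES)
```

In PySpark the same idea would typically be expressed with `regexp_extract` on the raw text column followed by `groupBy(...).count()`, though the regular expression syntax for named groups differs between Python and Spark's Java-based engine.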
04 Databricks Essentials | Week 4 | |
Topics covered
Module project: Delta Lake Inventory Pipeline
Ingest daily product inventory CSVs into a Delta table, apply SCD Type 2 change tracking, use time travel to audit a mistaken batch load, and VACUUM stale snapshots.
Tools: Databricks · Delta Lake · PySpark
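SCD Type 2 keeps history by closing the current version of a changed row and appending a new one. The plain-Python sketch below shows only that bookkeeping, not Delta Lake itself; the field names (`product_id`, `quantity`, `valid_from`, `valid_to`, `is_current`) are assumed for illustration.

```python
from datetime import date


def apply_scd2(dimension, incoming, effective):
    """Return a new dimension list after applying one day's inventory snapshot."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row["product_id"]: row for row in result if row["is_current"]}
    for record in incoming:
        existing = current.get(record["product_id"])
        if existing is not None and existing["quantity"] == record["quantity"]:
            continue  # unchanged: leave the current version open
        if existing is not None:
            # Changed: close the old version as of the effective date.
            existing["valid_to"] = effective
            existing["is_current"] = False
        new_row = {
            "product_id": record["product_id"],
            "quantity": record["quantity"],
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
        result.append(new_row)
        current[record["product_id"]] = new_row
    return result


dimension = [
    {"product_id": "A", "quantity": 10, "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]
incoming = [{"product_id": "A", "quantity": 7}, {"product_id": "B", "quantity": 3}]
updated = apply_scd2(dimension, incoming, date(2024, 1, 2))
```

On Delta Lake this pattern is commonly implemented with a `MERGE INTO` statement rather than row-by-row logic; the sketch is only meant to make the expected before and after states concrete.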
05 Cloud Fundamentals — AWS | Week 5 | |
Topics covered
Module project: Data Lake Landing Zone
Design a 3-zone S3 architecture (raw / curated / analytics), script automated uploads with boto3, configure IAM roles for Databricks access, and enable S3 server-side encryption.
Tools: AWS S3 · IAM · boto3 · Databricks
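A possible shape for the upload script is sketched below. The key layout (`zone/dataset/dt=YYYY-MM-DD/file`), the bucket name, and the helper names are assumptions. The only boto3 call relied on is the S3 client's `upload_file` with `ExtraArgs={"ServerSideEncryption": "AES256"}`; a small recording stub stands in for the real client so the example runs without AWS credentials.

```python
import os
from datetime import date

ZONES = ("raw", "curated", "analytics")


def build_key(zone, dataset, filename, run_date):
    """Build an object key such as raw/sales/dt=2024-05-21/sales.csv."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{dataset}/dt={run_date.isoformat()}/{filename}"


def upload_to_zone(s3_client, bucket, local_path, zone, dataset, run_date):
    """Upload one file and request SSE-S3 (AES256) encryption at rest."""
    key = build_key(zone, dataset, os.path.basename(local_path), run_date)
    s3_client.upload_file(
        local_path, bucket, key, ExtraArgs={"ServerSideEncryption": "AES256"}
    )
    return key


class RecordingClient:
    """Stand-in for boto3.client("s3") so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def upload_file(self, Filename, Bucket, Key, ExtraArgs=None):
        self.calls.append((Filename, Bucket, Key, ExtraArgs))


client = RecordingClient()
key = upload_to_zone(
    client, "example-bucket", "/tmp/sales.csv", "raw", "sales", date(2024, 5, 21)
)
```

In a real run, `client` would be replaced by `boto3.client("s3")`. Passing the client in as a parameter keeps the key-building logic testable on its own.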
06 Workflow Orchestration — Airflow | Week 6 | |
Topics covered
Module project: Automated ETL Scheduler
Write a multi-task Airflow DAG that polls an S3 prefix for new files, triggers a Databricks notebook run via DatabricksRunNowOperator, validates row counts, and sends a Slack alert on failure.
Tools: Airflow · Databricks · S3 · Slack API
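The row-count validation and the alert message can be written as plain functions and later wired into the DAG. The sketch below contains no Airflow code; the function names, the tolerance parameter, and the message wording are assumptions, and the payload uses only the `text` field accepted by Slack incoming webhooks.

```python
import json


def row_counts_match(source_rows, loaded_rows, tolerance=0.0):
    """Return True if the loaded count is within `tolerance` (a fraction) of the source."""
    if source_rows < 0 or loaded_rows < 0:
        raise ValueError("row counts must be non-negative")
    return abs(source_rows - loaded_rows) <= source_rows * tolerance


def validate_row_counts(source_rows, loaded_rows, tolerance=0.0):
    """Raise so that a calling orchestrator task is marked as failed on a mismatch."""
    if not row_counts_match(source_rows, loaded_rows, tolerance):
        raise ValueError(
            f"row count mismatch: source={source_rows}, loaded={loaded_rows}"
        )


def slack_failure_payload(dag_id, task_id, error):
    """Build the JSON body for a Slack incoming-webhook alert."""
    return json.dumps({"text": f"{dag_id}.{task_id} failed: {error}"})


payload = slack_failure_payload("daily_etl", "validate_counts", "row count mismatch")
```

Within Airflow, `validate_row_counts` could be called from a Python task, and the payload builder from a failure callback that posts to the webhook URL.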
07 End-to-End ETL Pipeline | Week 7 | |
Topics covered
Module project: Financial Transactions Pipeline
Build a complete medallion pipeline: raw bank feeds land in S3, Spark transforms and deduplicates into Delta Silver, dbt models produce Gold aggregate tables, Airflow orchestrates daily runs, and Great Expectations validates each layer.
Tools: Airflow · Spark · Delta Lake · dbt · Great Expectations
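The deduplication step between the raw and Silver layers usually means keeping one record per business key. The plain-Python sketch below illustrates a "latest record wins" rule; the field names `transaction_id` and `ingested_at` are assumed, and the course may use a different rule.

```python
def deduplicate_latest(records, key="transaction_id", order_by="ingested_at"):
    """Keep, for each key, the record with the greatest `order_by` value."""
    latest = {}
    for record in records:
        seen = latest.get(record[key])
        if seen is None or record[order_by] > seen[order_by]:
            latest[record[key]] = record
    # Sort by key so the output order is deterministic.
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical raw feed in which transaction t1 was delivered twice.
raw_feed = [
    {"transaction_id": "t1", "amount": 100.0, "ingested_at": "2024-05-21T08:00:00"},
    {"transaction_id": "t2", "amount": 40.0, "ingested_at": "2024-05-21T08:05:00"},
    {"transaction_id": "t1", "amount": 105.0, "ingested_at": "2024-05-21T09:00:00"},
]
silver = deduplicate_latest(raw_feed)
```

In Spark the equivalent is often a `row_number()` window partitioned by the key and ordered by the timestamp, keeping only the first row of each partition.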
08 Industry Capstone Project | Week 8 | |
Topics covered
Module project: Supply Chain Analytics Platform
Chosen from a real-world domain (supply chain, fintech, or healthcare). Deliver a fully working pipeline with data governance, a Delta Gold layer, a set of SQL analytics queries, test coverage, and a 10-minute recorded demo.
Tools: All prior tools · Pytest · Sphinx docs, etc.
09 Dashboarding & Monitoring — Grafana | Week 9 (Optional) | |
Topics covered
Module project: Pipeline Observability Dashboard
Create a Grafana dashboard with 6 panels: rows processed per run, error rate, job duration trend, data freshness indicator, row count anomaly detector, and a top-10 failed records table. Wire alerts to email.
Tools: Grafana · Databricks SQL · Prometheus
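The row count anomaly detector panel needs some rule for deciding when a run looks unusual. One simple option, shown below as an assumption rather than the course's method, is a z-score check of the latest run against recent history; the threshold of 3 standard deviations is an arbitrary default.

```python
from statistics import mean, stdev


def is_row_count_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than `z_threshold` sample deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly stable history: any change is notable
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical rows-processed counts from the previous five runs.
recent_runs = [1000, 1020, 980, 1010, 990]
normal_run = is_row_count_anomaly(recent_runs, 1005)
suspicious_run = is_row_count_anomaly(recent_runs, 400)
```

The boolean result could be exported as a metric and plotted or alerted on in Grafana; the same calculation can also be written as a SQL query over a run-history table.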
10 Version Control & CI/CD | Week 10 (Optional) | |
Topics covered
Module project: Automated Pipeline Deployment
Set up a GitHub monorepo for the ETL pipeline and write a GitHub Actions workflow that runs pytest on every PR, lints with ruff, and on merge to main deploys the updated DAG to Airflow and the Databricks production workspace.
Tools: Git · GitHub Actions · pytest · ruff · Databricks API
* Modules 09 and 10 are optional and covered if time permits within the 30-day schedule.


