Data Engineering Bootcamp

Updated: Jun 16

Build scalable, compliant, enterprise-grade data pipelines with strict governance.

By the end of 60 days, learners will be able to:

Design, build, orchestrate, monitor, and deploy a complete ETL pipeline
Use Python, SQL, Spark, Databricks, AWS, Airflow, Grafana, Git, and GitHub Actions together
Deliver production-ready pipelines with version control and CI/CD deployment

01 Python for Data Engineering		Week 1
Topics covered:
Python data types, control flow, and functions
NumPy arrays and vectorised operations
Pandas: DataFrames, Series, groupby, merge
Data cleaning: nulls, type coercion, deduplication
Matplotlib & Seaborn for quick EDA visualisations

Module project: Retail Sales Analyser
Load a messy CSV of retail transactions, clean it with Pandas, compute monthly revenue per category, and produce a 4-panel EDA report saved as PNG.
Tools: Python · Pandas · NumPy · Matplotlib

02 SQL for Data Engineering		Week 2
Topics covered:
SELECT, WHERE, ORDER BY, LIMIT
Aggregations: GROUP BY, HAVING, COUNT, SUM
Joins: INNER, LEFT, FULL OUTER, CROSS
CTEs and subqueries for readable logic
Window functions: ROW_NUMBER, RANK, LAG, LEAD

Module project: E-commerce KPI Dashboard Queries
Given an orders + customers + products schema, write a suite of 10 SQL queries covering cohort retention, 30-day rolling revenue, top-N products by region, and customer LTV.
Tools: SQL · Databricks SQL · AWS RDS
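
As an illustration of the "top-N products by region" query in this module, here is a minimal sketch that combines a CTE with ROW_NUMBER. It runs against an in-memory SQLite database from Python so it can be tried without a warehouse; the `orders` table layout, the sample rows, and the function names are assumptions made for the example, not the course schema. Window functions need SQLite 3.25 or later, and the same SQL should carry over to Databricks SQL with little or no change.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            ("EU", "laptop", 1200.0),
            ("EU", "mouse", 25.0),
            ("EU", "laptop", 800.0),
            ("EU", "monitor", 300.0),
            ("US", "mouse", 50.0),
            ("US", "monitor", 450.0),
        ],
    )
    return conn


def top_products_by_region(conn, n):
    """Return the n highest-revenue products within each region."""
    query = """
        WITH revenue AS (
            -- Total revenue per product within each region
            SELECT region, product, SUM(amount) AS total
            FROM orders
            GROUP BY region, product
        ),
        ranked AS (
            -- Rank products inside each region, highest revenue first;
            -- product name breaks ties so the result is deterministic
            SELECT region, product, total,
                   ROW_NUMBER() OVER (
                       PARTITION BY region
                       ORDER BY total DESC, product
                   ) AS rn
            FROM revenue
        )
        SELECT region, product, total
        FROM ranked
        WHERE rn <= ?
        ORDER BY region, rn
    """
    return conn.execute(query, (n,)).fetchall()


if __name__ == "__main__":
    for row in top_products_by_region(build_demo_db(), 2):
        print(row)
```

Filtering on `rn` in an outer query is needed because a window function cannot be referenced in the WHERE clause of the same SELECT that computes it.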

03 Apache Spark		Week 3
Topics covered:
Spark architecture: driver, executors, DAG scheduler
RDDs vs DataFrames vs Datasets
Transformations (lazy) vs actions (eager)
Joins, aggregations, and shuffle optimisation
Reading/writing Parquet, JSON, and Delta files

Module project: Web Server Log Processor
Parse 1 GB of Apache access logs with PySpark, count requests per endpoint, flag anomalous IP bursts, and write a partitioned Parquet output for downstream querying.
Tools: PySpark · Databricks · Parquet
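
The parsing and counting logic behind the log processor can be prototyped in plain Python before moving it to PySpark. The sketch below assumes the Apache Common Log Format; the sample lines, the function names, and the "more than N requests from one IP in the same minute" burst rule are illustrative assumptions, not the project specification. It is a single-machine prototype, not a Spark job.

```python
import re
from collections import Counter

# Common Log Format: ip ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines, including one malformed entry
SAMPLE_LINES = [
    '10.0.0.1 - - [10/Oct/2000:13:55:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:02 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:03 -0700] "GET /login HTTP/1.0" 302 -',
    '10.0.0.2 - - [10/Oct/2000:13:56:10 -0700] "POST /login HTTP/1.0" 200 512',
    'this line is not a valid log entry',
]


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def requests_per_endpoint(lines):
    """Count requests per URL path, skipping lines that fail to parse."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["path"]] += 1
    return counts


def bursty_ips(lines, threshold):
    """Return IPs that sent more than `threshold` requests in any single minute."""
    per_minute = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            minute = record["ts"][:17]  # e.g. "10/Oct/2000:13:55"
            per_minute[(record["ip"], minute)] += 1
    return sorted({ip for (ip, _), count in per_minute.items() if count > threshold})


if __name__ == "__main__":
    print(requests_per_endpoint(SAMPLE_LINES))
    print(bursty_ips(SAMPLE_LINES, threshold=2))
```

In PySpark, the same pattern could be applied column-wise with `regexp_extract`, with the counts expressed as `groupBy` aggregations so the work is distributed across executors.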

04 Databricks Essentials		Week 4
Topics covered:
Databricks workspace: clusters, notebooks, repos
Delta Lake: ACID transactions, time travel, VACUUM
Structured Streaming basics
Optimise & Z-Order for query performance
Unity Catalog: data governance and access control

Module project: Delta Lake Inventory Pipeline
Ingest daily product inventory CSVs into a Delta table, apply SCD Type 2 change tracking, use time travel to audit a mistaken batch load, and VACUUM stale snapshots.
Tools: Databricks · Delta Lake · PySpark
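
SCD Type 2 change tracking keeps every historical version of a row: when a value changes, the current row is closed and a new current row is appended. The plain-Python sketch below shows only that bookkeeping. The field names (`product_id`, `quantity`, `valid_from`, `valid_to`, `is_current`) and the function name are assumptions for illustration, and products missing from a snapshot are simply left untouched, which is a simplification. On Databricks this logic is typically expressed as a Delta MERGE rather than a Python loop.

```python
from datetime import date


def apply_scd2(history, snapshot, load_date):
    """Apply one daily snapshot to an SCD Type 2 history.

    history:  list of dicts, one per row version
    snapshot: dict mapping product_id -> quantity for this load
    Returns a new history list; the input list is not modified.
    """
    result = [dict(row) for row in history]  # copy so the input stays intact
    current = {row["product_id"]: row for row in result if row["is_current"]}

    for product_id, quantity in snapshot.items():
        existing = current.get(product_id)
        if existing is not None and existing["quantity"] == quantity:
            continue  # unchanged: keep the current version open
        if existing is not None:
            # Changed: close the old version as of this load
            existing["valid_to"] = load_date
            existing["is_current"] = False
        # New product or changed value: open a new current version
        result.append(
            {
                "product_id": product_id,
                "quantity": quantity,
                "valid_from": load_date,
                "valid_to": None,
                "is_current": True,
            }
        )
    return result


if __name__ == "__main__":
    day1 = apply_scd2([], {"A": 10, "B": 5}, date(2024, 1, 1))
    day2 = apply_scd2(day1, {"A": 10, "B": 7}, date(2024, 1, 2))
    for row in day2:
        print(row)
```

Because unchanged rows are skipped, loading the same snapshot again adds no new versions, which keeps reruns safe.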

05 Cloud Fundamentals — AWS		Week 5
Topics covered:
S3: buckets, prefixes, lifecycle policies
IAM: roles, policies, least-privilege pattern
EC2 basics and key-pair SSH access
AWS Glue Catalog for schema management
Ingesting data into S3 and mounting in Databricks

Module project: Data Lake Landing Zone
Design a 3-zone S3 architecture (raw / curated / analytics), script automated uploads with boto3, configure IAM roles for Databricks access, and enable S3 server-side encryption.
Tools: AWS S3 · IAM · boto3 · Databricks

06 Workflow Orchestration — Airflow		Week 6
Topics covered:
DAG anatomy: tasks, operators, sensors
TaskFlow API with Python decorators
Scheduling: cron expressions, data intervals
XComs for task-to-task data passing
Retry logic, SLAs, and alerting hooks

Module project: Automated ETL Scheduler
Write a multi-task Airflow DAG that polls an S3 prefix for new files, triggers a Databricks notebook run via DatabricksRunNowOperator, validates row counts, and sends a Slack alert on failure.
Tools: Airflow · Databricks · S3 · Slack API

07 End-to-End ETL Pipeline		Week 7
Topics covered:
Medallion architecture: Bronze to Silver to Gold
Schema enforcement and data quality checks
Idempotent pipeline design (safe reruns)
Handling late-arriving and out-of-order data
End-to-end lineage and audit logging

Module project: Financial Transactions Pipeline
Build a complete medallion pipeline: raw bank feeds land in S3, Spark transforms and deduplicates into Delta Silver, dbt models produce Gold aggregate tables, Airflow orchestrates daily runs, and Great Expectations validates each layer.
Tools: Airflow · Spark · Delta Lake · dbt · Great Expectations
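
Idempotent design and out-of-order handling often come down to one rule: merge by a business key and keep the newest version, so that replaying a batch changes nothing. The sketch below shows that rule in plain Python. The record fields (`txn_id`, `updated_at`, `amount`) and the function name are assumptions for illustration, and timestamps are ISO 8601 strings in a single time zone so they compare correctly as text. In the module project the equivalent step would more likely be a Delta MERGE keyed on the transaction ID.

```python
def merge_batch(silver, batch):
    """Merge a batch of transaction records into the Silver state.

    silver: dict mapping txn_id -> latest known record
    batch:  iterable of records, possibly duplicated or out of order
    Returns a new dict; a record only replaces an existing one when its
    updated_at is strictly newer, so replaying a batch is a no-op.
    """
    merged = dict(silver)
    for record in batch:
        existing = merged.get(record["txn_id"])
        if existing is None or record["updated_at"] > existing["updated_at"]:
            merged[record["txn_id"]] = record
    return merged


if __name__ == "__main__":
    batch = [
        {"txn_id": "t1", "updated_at": "2024-01-01T10:00:00", "amount": 100.0},
        {"txn_id": "t2", "updated_at": "2024-01-01T10:05:00", "amount": 40.0},
        # Late-arriving, older version of t1: must not overwrite the newer one
        {"txn_id": "t1", "updated_at": "2024-01-01T09:00:00", "amount": 90.0},
    ]
    first_run = merge_batch({}, batch)
    rerun = merge_batch(first_run, batch)
    print(first_run == rerun)
```

The strict comparison is the design choice that makes reruns safe: a record with the same timestamp as the stored one never triggers a rewrite.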

08 Industry Capstone Project		Week 8
Topics covered:
End-to-end project scoping and design doc
Stakeholder requirements mapped to a data model
Full pipeline implementation with tests
Documentation: README, data dictionary, runbook
Demo presentation with live query walkthrough

Module project: Supply Chain Analytics Platform
The project is chosen from a real-world domain (supply chain, fintech, or healthcare). Deliver a fully working pipeline with data governance, a Delta Gold layer, a set of SQL analytics queries, test coverage, and a 10-minute recorded demo.
Tools: All prior tools · pytest · Sphinx docs, etc.

09 Dashboarding & Monitoring — Grafana		Week 9 (Optional)
Topics covered:
Grafana architecture: panels, dashboards, datasources
Connecting Grafana to Databricks SQL endpoint
Building pipeline health dashboards
Alerting rules and notification channels
Data quality KPI panels with threshold colouring

Module project: Pipeline Observability Dashboard
Create a Grafana dashboard with 6 panels: rows processed per run, error rate, job duration trend, data freshness indicator, row count anomaly detector, and a top-10 failed records table. Wire alerts to email.
Tools: Grafana · Databricks SQL · Prometheus

10 Version Control & CI/CD		Week 10 (Optional)
Topics covered:
Git: branching strategy (trunk-based vs Gitflow)
Pull requests, code review, and merge policies
GitHub Actions: triggers, jobs, steps, secrets
Automated testing of PySpark code with pytest
Deploy Airflow DAGs & Databricks notebooks via CI/CD

Module project: Automated Pipeline Deployment
Set up a GitHub monorepo for the ETL pipeline and write a GitHub Actions workflow that runs pytest on every PR, lints with ruff, and on merge to main deploys the updated DAG to Airflow and the Databricks production workspace.
Tools: Git · GitHub Actions · pytest · ruff · Databricks API

* Modules 09 and 10 are optional and are covered if time permits within the 60-day schedule.
