
Data Engineering Bootcamp

  • May 21
  • 4 min read

Updated: 2 days ago


Build scalable, compliant, enterprise-grade data pipelines with strict governance.

By the end of 60 days, learners will be able to:

  • Design, build, orchestrate, monitor, and deploy a complete ETL pipeline

  • Use Python, SQL, Spark, Databricks, AWS, Airflow, Grafana, Git, and GitHub Actions together

  • Deliver production-ready pipelines with version control and CI/CD deployment


01 Python for Data Engineering

Week 1

Topics covered

  • Python data types, control flow, and functions

  • NumPy arrays and vectorised operations

  • Pandas: DataFrames, Series, groupby, merge

  • Data cleaning: nulls, type coercion, deduplication

  • Matplotlib & Seaborn for quick EDA visualisations

Module project

Retail Sales Analyser

Load a messy CSV of retail transactions, clean it with Pandas, compute monthly revenue per category, and produce a 4-panel EDA report saved as PNG.

Tools: Python · Pandas · NumPy · Matplotlib
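
The cleaning and aggregation steps of this project can be sketched as follows. This is a minimal illustration, not the course's reference solution: the column names (order_id, order_date, category, amount) and the inline sample data are assumptions made up for the example, and the plotting step is left out.

```python
import io

import pandas as pd

# Hypothetical messy input; a real run would read the project's CSV file instead.
RAW_CSV = """order_id,order_date,category,amount
1,2024-01-05,Toys,10.50
2,2024-01-17,Toys,4.50
2,2024-01-17,Toys,4.50
3,2024-01-20,Books,n/a
4,2024-02-02,Books,20.00
5,,Toys,7.00
"""


def monthly_revenue(csv_text: str) -> pd.DataFrame:
    """Clean the transactions and return revenue per month and category."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Type coercion: values that cannot be parsed become NaN / NaT instead of raising.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    clean = (
        df.dropna(subset=["order_date", "amount"])  # drop rows missing key fields
        .drop_duplicates(subset="order_id")  # keep one row per order
        .assign(month=lambda d: d["order_date"].dt.to_period("M").astype(str))
    )
    return clean.groupby(["month", "category"], as_index=False)["amount"].sum()


report = monthly_revenue(RAW_CSV)
print(report)
```

Dropping incomplete rows is only one possible policy; imputing or quarantining them may suit a real dataset better.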


02 SQL for Data Engineering

Week 2

Topics covered

  • SELECT, WHERE, ORDER BY, LIMIT

  • Aggregations: GROUP BY, HAVING, COUNT, SUM

  • Joins: INNER, LEFT, FULL OUTER, CROSS

  • CTEs and subqueries for readable logic

  • Window functions: ROW_NUMBER, RANK, LAG, LEAD

Module project

E-commerce KPI Dashboard Queries

Given an orders + customers + products schema, write a suite of 10 SQL queries covering cohort retention, 30-day rolling revenue, top-N products by region, and customer LTV.

Tools: SQL · Databricks SQL · AWS RDS
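
As one possible shape for the top-N products by region query, the sketch below combines CTEs with ROW_NUMBER. To keep it runnable anywhere it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer) rather than Databricks SQL or RDS, and the single orders table with region, product, and revenue columns is a simplified, hypothetical stand-in for the project schema.

```python
import sqlite3

# In-memory database with a simplified, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (region TEXT, product TEXT, revenue REAL);
    INSERT INTO orders VALUES
        ('EU', 'lamp', 120.0), ('EU', 'desk', 300.0), ('EU', 'lamp', 80.0),
        ('EU', 'chair', 150.0), ('US', 'desk', 500.0), ('US', 'lamp', 50.0);
    """
)

TOP_N_SQL = """
WITH product_totals AS (
    -- Total revenue per product within each region
    SELECT region, product, SUM(revenue) AS total
    FROM orders
    GROUP BY region, product
),
ranked AS (
    -- Rank products inside each region, highest revenue first
    SELECT region, product, total,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY total DESC) AS rn
    FROM product_totals
)
SELECT region, product, total
FROM ranked
WHERE rn <= ?
ORDER BY region, rn
"""

top_products = conn.execute(TOP_N_SQL, (2,)).fetchall()
for row in top_products:
    print(row)
```

ROW_NUMBER breaks ties arbitrarily; RANK would keep tied products together at the cost of possibly returning more than N rows per region.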


03 Apache Spark

Week 3

Topics covered

  • Spark architecture: driver, executors, DAG scheduler

  • RDDs vs DataFrames vs Datasets

  • Transformations (lazy) vs actions (eager)

  • Joins, aggregations, and shuffle optimisation

  • Reading/writing Parquet, JSON, and Delta files

Module project

Web Server Log Processor

Parse 1 GB of Apache access logs with PySpark, count requests per endpoint, flag anomalous IP bursts, and write a partitioned Parquet output for downstream querying.

Tools: PySpark · Databricks · Parquet
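
Before scaling out, the parsing logic for this project can be prototyped in plain Python. The sketch below is not PySpark: it only shows a regular expression for the Common Log Format and the two counts the project asks for. The sample lines, the helper names, and the fixed burst threshold are assumptions for illustration; a Spark version would apply a similar pattern to a DataFrame column and would normally detect bursts per time window.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical sample lines, including one malformed record.
SAMPLE_LINES = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /api/items HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2023:13:55:37 +0000] "GET /api/items HTTP/1.1" 200 498',
    '10.0.0.2 - - [10/Oct/2023:13:55:38 +0000] "POST /api/login HTTP/1.1" 401 -',
    "not a log line",
]


def count_requests(lines):
    """Return (requests per endpoint, requests per client IP), skipping bad lines."""
    endpoint_counts = Counter()
    ip_counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line: ignore it rather than fail the whole job
        endpoint_counts[match["path"]] += 1
        ip_counts[match["ip"]] += 1
    return endpoint_counts, ip_counts


def flag_bursts(ip_counts, threshold):
    """Return the IPs whose request count reaches the threshold, sorted."""
    return sorted(ip for ip, count in ip_counts.items() if count >= threshold)


endpoint_counts, ip_counts = count_requests(SAMPLE_LINES)
bursts = flag_bursts(ip_counts, threshold=2)
print(endpoint_counts, bursts)
```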


04 Databricks Essentials

Week 4

Topics covered

  • Databricks workspace: clusters, notebooks, repos

  • Delta Lake: ACID transactions, time travel, VACUUM

  • Structured Streaming basics

  • OPTIMIZE and Z-Order for query performance

  • Unity Catalog: data governance and access control

Module project

Delta Lake Inventory Pipeline

Ingest daily product inventory CSVs into a Delta table, apply SCD Type 2 change tracking, use time travel to audit a mistaken batch load, and VACUUM stale snapshots.

Tools: Databricks · Delta Lake · PySpark
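
The SCD Type 2 logic at the heart of this project can be illustrated without Delta Lake. The plain-Python sketch below keeps history as a list of dicts and applies one daily snapshot: unchanged rows stay open, changed rows are closed and a new version is appended. The field names (sku, qty, valid_from, valid_to, is_current) are assumptions for the example, and SKUs missing from a snapshot are simply left open. On Databricks the same idea is usually expressed as a merge into the Delta table.

```python
from datetime import date


def apply_scd2(history, snapshot, load_date):
    """Apply one daily snapshot (sku -> qty) to an SCD Type 2 history list."""
    updated = [dict(row) for row in history]  # copy so the input stays untouched
    current = {row["sku"]: row for row in updated if row["is_current"]}
    for sku, qty in snapshot.items():
        existing = current.get(sku)
        if existing is not None and existing["qty"] == qty:
            continue  # unchanged: keep the open version as it is
        if existing is not None:
            # Changed: close the old version on the load date.
            existing["valid_to"] = load_date
            existing["is_current"] = False
        # New or changed: open a fresh current version.
        updated.append(
            {"sku": sku, "qty": qty, "valid_from": load_date,
             "valid_to": None, "is_current": True}
        )
    return updated


day1 = apply_scd2([], {"A": 5, "B": 1}, date(2024, 1, 1))
day2 = apply_scd2(day1, {"A": 7, "B": 1}, date(2024, 1, 2))
for row in day2:
    print(row)
```

Because every version keeps its validity interval, the quantity of a SKU on any past date can be recovered by filtering on valid_from and valid_to.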


05 Cloud Fundamentals — AWS

Week 5

Topics covered

  • S3: buckets, prefixes, lifecycle policies

  • IAM: roles, policies, least-privilege pattern

  • EC2 basics and key-pair SSH access

  • AWS Glue Catalog for schema management

  • Ingesting data into S3 and mounting it in Databricks

Module project

Data Lake Landing Zone

Design a 3-zone S3 architecture (raw / curated / analytics), script automated uploads with boto3, configure IAM roles for Databricks access, and enable S3 server-side encryption.

Tools: AWS S3 · IAM · boto3 · Databricks
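
To make the least-privilege pattern concrete, the sketch below builds an IAM policy document that grants read-only access to a single zone of the data lake. It only constructs the JSON; attaching it to a role is not shown. The bucket name example-data-lake, the curated prefix, and the helper name are hypothetical.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing list and read access to one S3 prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed by a condition.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, scoped by the resource ARN.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("example-data-lake", "curated")
print(json.dumps(policy, indent=2))
```

A role that writes to the raw zone would get a separate policy with s3:PutObject on the raw prefix, so that no single role can both read and write every zone.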


06 Workflow Orchestration — Airflow

Week 6

Topics covered

  • DAG anatomy: tasks, operators, sensors

  • TaskFlow API with Python decorators

  • Scheduling: cron expressions, data intervals

  • XComs for task-to-task data passing

  • Retry logic, SLAs, and alerting hooks

Module project

Automated ETL Scheduler

Write a multi-task Airflow DAG that polls an S3 prefix for new files, triggers a Databricks notebook run via DatabricksRunNowOperator, validates row counts, and sends a Slack alert on failure.

Tools: Airflow · Databricks · S3 · Slack API
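
Airflow configures retries on the task itself (for example through the retries and retry_delay arguments), so no hand-written loop is needed in a DAG. To show what retrying with exponential backoff means underneath, here is a plain-Python sketch. The function names and the flaky check are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt and try again.

    Re-raises the last exception once `retries` extra attempts are used up.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Hypothetical flaky check that succeeds on the third call.
calls = {"n": 0}


def flaky_row_count_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("row count not available yet")
    return 1000


delays = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky_row_count_check, retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```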


07 End-to-End ETL Pipeline

Week 7

Topics covered

  • Medallion architecture: Bronze to Silver to Gold

  • Schema enforcement and data quality checks

  • Idempotent pipeline design (safe reruns)

  • Handling late-arriving and out-of-order data

  • End-to-end lineage and audit logging

Module project

Financial Transactions Pipeline

Build a complete medallion pipeline: raw bank feeds land in S3, Spark transforms and deduplicates into Delta Silver, dbt models produce Gold aggregate tables, Airflow orchestrates daily runs, and Great Expectations validates each layer.

Tools: Airflow · Spark · Delta Lake · dbt · Great Expectations
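
Idempotency and late-arriving data can be handled by the same rule: merge on a business key and keep only the newest version of each record. The plain-Python sketch below illustrates that rule. The field names txn_id and updated_at are assumptions, and timestamps are ISO-8601 strings in one fixed format so that string comparison matches time order. In the actual pipeline this would typically be a merge into the Delta Silver table.

```python
def upsert_latest(target, batch):
    """Merge batch into target (dict keyed by txn_id), keeping the newest version."""
    merged = dict(target)  # do not mutate the caller's state
    for record in batch:
        key = record["txn_id"]
        existing = merged.get(key)
        # Only overwrite when the incoming record is strictly newer.
        if existing is None or record["updated_at"] > existing["updated_at"]:
            merged[key] = record
    return merged


# Hypothetical batch where an older version of t1 arrives after the newer one.
batch = [
    {"txn_id": "t1", "amount": 100, "updated_at": "2024-03-01T10:00:00"},
    {"txn_id": "t2", "amount": 50, "updated_at": "2024-03-01T11:00:00"},
    {"txn_id": "t1", "amount": 90, "updated_at": "2024-03-01T09:00:00"},
]

once = upsert_latest({}, batch)
twice = upsert_latest(once, batch)  # simulate rerunning the same load
print(once == twice, once["t1"]["amount"])
```

Because the result depends only on the key and the record timestamp, rerunning a failed day or replaying a batch leaves the table in the same state.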


08 Industry Capstone Project

Week 8

Topics covered

  • End-to-end project scoping and design doc

  • Stakeholder requirements mapped to a data model

  • Full pipeline implementation with tests

  • Documentation: README, data dictionary, runbook

  • Demo presentation with live query walkthrough

Module project

Supply Chain Analytics Platform

Choose a real-world domain (supply chain, fintech, or healthcare) and deliver a fully working pipeline with data governance, a Delta Gold layer, a set of SQL analytics queries, test coverage, and a 10-minute recorded demo.

Tools: All prior tools · pytest · Sphinx


09 Dashboarding & Monitoring — Grafana

Week 9 (Optional)

Topics covered

  • Grafana architecture: panels, dashboards, datasources

  • Connecting Grafana to a Databricks SQL endpoint

  • Building pipeline health dashboards

  • Alerting rules and notification channels

  • Data quality KPI panels with threshold colouring

Module project

Pipeline Observability Dashboard

Create a Grafana dashboard with 6 panels: rows processed per run, error rate, job duration trend, data freshness indicator, row count anomaly detector, and a top-10 failed records table. Wire alerts to email.

Tools: Grafana · Databricks SQL · Prometheus
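
The row count anomaly detector panel needs some rule that decides when a run looks abnormal. One simple option, sketched below, is a z-score against recent runs. The function name, the sample history, and the threshold of 3 are assumptions for illustration; in the dashboard itself the same calculation would more likely live in the SQL query behind the panel.

```python
from statistics import mean, stdev


def is_row_count_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than z_threshold sample std devs from the mean."""
    if len(history) < 2:
        return False  # not enough runs to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # constant history: any change is suspicious
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical row counts from the last five pipeline runs.
recent_counts = [1000, 1020, 980, 1010, 990]

print(is_row_count_anomaly(recent_counts, 1015))
print(is_row_count_anomaly(recent_counts, 400))
```

A z-score assumes reasonably stable volumes; pipelines with strong weekly seasonality may need to compare against the same weekday instead.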


10 Version Control & CI/CD

Week 10 (Optional)

Topics covered

  • Git: branching strategy (trunk-based vs Gitflow)

  • Pull requests, code review, and merge policies

  • GitHub Actions: triggers, jobs, steps, secrets

  • Automated testing of PySpark code with pytest

  • Deploy Airflow DAGs & Databricks notebooks via CI/CD

Module project

Automated Pipeline Deployment

Set up a GitHub monorepo for the ETL pipeline and write a GitHub Actions workflow that lints with ruff and runs pytest on every PR and, on merge to main, deploys the updated DAGs to Airflow and the updated notebooks to the production Databricks workspace.

Tools: Git · GitHub Actions · pytest · ruff · Databricks API
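
The tests that CI runs on each PR are easiest to write when transformations are small functions with clear inputs and outputs. The sketch below shows that shape with a plain-Python stand-in: add_revenue and its column names are hypothetical, and a real PySpark test would take and return DataFrames, usually with a local SparkSession created by a pytest fixture. pytest discovers functions whose names start with test_.

```python
def add_revenue(rows):
    """Return new rows with revenue = quantity * unit_price (input is not mutated)."""
    return [{**row, "revenue": row["quantity"] * row["unit_price"]} for row in rows]


def test_add_revenue_computes_product():
    rows = [{"quantity": 3, "unit_price": 2.5}]
    assert add_revenue(rows) == [{"quantity": 3, "unit_price": 2.5, "revenue": 7.5}]


def test_add_revenue_does_not_mutate_input():
    rows = [{"quantity": 1, "unit_price": 4.0}]
    add_revenue(rows)
    assert "revenue" not in rows[0]


def test_add_revenue_handles_empty_input():
    assert add_revenue([]) == []
```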


* Modules 09 and 10 are optional and are covered if time permits within the 60-day schedule.

 
 
 
