Includes AI, RAG & Claude Code

Cloud Data Engineering
Roadmap 2026

A practical, opinionated guide to going from foundations to production-grade cloud data infrastructure, with the tools that actually matter this year.

Melody Egwuchukwu
Cloud Data Engineer
Phase 1 — Foundations
🐍
01 — Programming
Python
🗄
02 — Query Language
SQL
CONCEPTS

Data Modelling and Design

It is important to understand these concepts well before diving into database and warehouse design.

ACID Properties
Primary & Foreign Keys
Surrogate Keys
Normalisation
OLAP vs OLTP
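
To make a few of these concepts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customer, orders, customer_sk) are hypothetical and chosen only for illustration. It shows a surrogate key next to a natural key, a foreign key constraint, and the atomicity part of ACID: when one statement in a transaction fails, the whole transaction rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customer (
    customer_sk    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_email TEXT NOT NULL UNIQUE                -- natural (business) key
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_sk INTEGER NOT NULL REFERENCES customer(customer_sk),
    amount      REAL NOT NULL
);
""")

insert_order = "INSERT INTO orders (order_id, customer_sk, amount) VALUES (?, ?, ?)"

# First transaction: both inserts succeed and are committed together.
with conn:
    cur = conn.execute(
        "INSERT INTO customer (customer_email) VALUES (?)", ("ada@example.com",)
    )
    customer_sk = cur.lastrowid
    conn.execute(insert_order, (1, customer_sk, 42.0))

# Second transaction: order 3 points at a customer that does not exist,
# so the foreign key check fails and order 2 is rolled back with it.
rollback_reason = None
try:
    with conn:
        conn.execute(insert_order, (2, customer_sk, 10.0))
        conn.execute(insert_order, (3, 999, 5.0))
except sqlite3.IntegrityError as exc:
    rollback_reason = str(exc)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(customer_sk, order_count, rollback_reason)
```

Only the order from the first transaction survives. The same ideas carry over to production databases, although the syntax for surrogate keys and constraint enforcement differs between engines.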
⌨️
03 — Operating System
Linux
CONCEPTS

Commands to Note

Filesystem navigation
chmod permissions
top & htop
Shell scripting
Cron jobs
awk & grep
curl & wget
Phase 2 — Core Engineering
04 — Distributed Processing
Spark & Kafka
Video Series

Apache Spark — Complete Guide

Spark architecture, job execution, and hands-on PySpark from scratch.

Book

Spark: The Definitive Guide

Bill Chambers & Matei Zaharia. Read it alongside the video series.

CONCEPTS

Understand these Concepts

Spark architecture
Master & workers
Memory management
Jobs, stages & tasks
Optimisation
Tuning Spark jobs
Apache Kafka
Stream processing
05 — Data Modelling
Warehouse Design
Book

The Data Warehouse Toolkit

Ralph Kimball & Margy Ross. Focus on Chapters 1–3. The foundational mental model for fact tables, dimensions, and star schemas.

Concepts

Medallion Architecture

Bronze → Silver → Gold layering pattern used in Databricks, Delta Lake, and modern lakehouses.
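
The layering can be sketched in plain Python, without any lakehouse tooling. Everything here is an assumption for illustration: the sample records, the function names, and the "hypothetical_api" source label. In a real lakehouse each layer would usually be a table, not an in-memory list.

```python
raw_events = [
    {"user_id": "1", "amount": "10.50", "country": "ng"},
    {"user_id": "2", "amount": "oops", "country": "GB"},
    {"user_id": "1", "amount": "4.50", "country": "NG"},
]


def to_bronze(records):
    """Bronze: keep the data as received, adding only ingestion metadata."""
    return [{**r, "_source": "hypothetical_api"} for r in records]


def to_silver(bronze):
    """Silver: enforce types, standardise values, drop rows that fail validation."""
    silver = []
    for r in bronze:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine bad rows instead
        silver.append(
            {
                "user_id": int(r["user_id"]),
                "amount": amount,
                "country": r["country"].upper(),
            }
        )
    return silver


def to_gold(silver):
    """Gold: aggregate into a business-ready shape (total spend per country)."""
    totals = {}
    for r in silver:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals


gold = to_gold(to_silver(to_bronze(raw_events)))
print(gold)
```

The row with the unparseable amount stays in bronze but never reaches silver, which is the main point of the pattern: raw data is preserved, and each later layer is cleaner and closer to what the business asks for.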

Star schema
Fact tables
Slowly changing dims
dbt modelling
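
As one way to picture slowly changing dimensions, here is a minimal sketch of the Type 2 approach, where history is kept by closing the old row and adding a new one. The function name apply_scd2 and the column names are hypothetical. A warehouse would normally do this with a MERGE statement or a dbt snapshot, not Python lists.

```python
from datetime import date


def apply_scd2(dimension, customer_id, new_city, effective):
    """Close the current row if the tracked attribute changed, then add a new current row."""
    current = next(
        (
            row
            for row in dimension
            if row["customer_id"] == customer_id and row["is_current"]
        ),
        None,
    )
    if current is not None:
        if current["city"] == new_city:
            return dimension  # nothing changed, so no new version is needed
        current["valid_to"] = effective
        current["is_current"] = False

    dimension.append(
        {
            "customer_sk": len(dimension) + 1,  # surrogate key, one per version
            "customer_id": customer_id,         # natural key, repeated across versions
            "city": new_city,
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
    )
    return dimension


dim_customer = []
apply_scd2(dim_customer, "C001", "Lagos", date(2025, 1, 1))
apply_scd2(dim_customer, "C001", "London", date(2026, 1, 1))

for row in dim_customer:
    print(row)
```

After the second call the dimension holds two rows for the same customer: the Lagos row is closed with a valid_to date, and the London row is the current one. Fact tables join on customer_sk, so old facts keep pointing at the version that was true at the time.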
🔄
06 — Orchestration
Apache Airflow
Core Concept

DAGs & Operators

How Airflow scheduling works, what makes a good DAG, and how to avoid common anti-patterns.
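
The idea behind a DAG can be shown without Airflow itself. The sketch below uses Python's standard-library graphlib module, not the Airflow API, and the task names are hypothetical. It shows how dependencies determine run order and why a cycle makes a pipeline impossible to schedule.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
}

# A valid execution order: every task appears after its upstream tasks.
order = list(TopologicalSorter(pipeline).static_order())
print(order)


def has_cycle(graph):
    """Return True if the dependency graph cannot be ordered."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


print(has_cycle(pipeline))
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

An orchestrator adds scheduling, retries, and state tracking on top of this ordering, but the acyclic dependency graph is the core of the model.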

Core Concept

Sensors, Hooks & Connections

Connecting Airflow to your cloud services — S3, BigQuery, Snowflake, Databricks.

Phase 3 — DevOps & Cloud
🏗
07 — DevOps & Infrastructure
Git · Docker · Terraform · CI/CD
Tool

Git & GitHub

Version control, branching strategy, pull requests, and how teams collaborate on pipelines.

Tool

Docker

Containerise your pipelines. Build reproducible, portable environments for local and cloud.

Tool

Terraform

Infrastructure as code. Provision cloud resources on AWS, Azure, or GCP from configuration files kept under version control.

Tool

CI/CD Pipelines

GitHub Actions or GitLab CI to automate testing, linting, and deployment of data pipelines.

Decide which cloud platform you want to learn first. The skills transfer to the others.

Phase 4 — 2026 Additions
08 — AI-Powered Data Engineering
RAG · Vector Databases · Claude Code
New in 2026
AI is no longer optional in Data Engineering
RAG pipelines, vector stores, and AI coding assistants are now part of the production stack.
↗ Video Resource

RAG for Data Engineers

How Retrieval-Augmented Generation works, how to build RAG pipelines, and where data engineers sit in the AI stack. Essential watching for 2026.

Watch on YouTube ↗
New 2026

Vector Databases

Pinecone, Weaviate, pgvector. How embeddings are stored, indexed, and queried at scale. The data layer behind every AI product.
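
To show what "querying embeddings" means at its simplest, here is a brute-force sketch in plain Python. The three-dimensional vectors and document IDs are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and products such as Pinecone, Weaviate, and pgvector typically use approximate indexes instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical store: document ID -> embedding vector.
store = {
    "doc_airflow": [0.9, 0.1, 0.0],
    "doc_spark": [0.1, 0.9, 0.0],
    "doc_terraform": [0.0, 0.1, 0.9],
}


def top_k(query, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


results = top_k([0.8, 0.2, 0.0])
print(results)
```

The query vector points mostly in the same direction as doc_airflow, so that document ranks first. A vector database provides the same operation with indexing, filtering, and persistence added.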

New 2026

Claude Code

Agentic coding with Claude in the terminal. Accelerate pipeline development, debugging, and documentation — without leaving your workflow.

New 2026

Embedding Pipelines

Chunking strategies, embedding models, and how to build the data pipelines that feed RAG systems reliably in production.
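
One of the simplest chunking strategies is fixed-size chunks with overlap, sketched below. The function name chunk_text and its default sizes are assumptions for illustration. Production pipelines often split on sentences, headings, or tokens instead of raw characters.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.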

New 2026

LLM Observability

Monitoring AI pipelines is different from monitoring traditional pipelines. It covers tracing, latency, hallucination detection, and cost tracking.
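
A rough sketch of the tracing, latency, and cost side of this, in plain Python. Everything here is hypothetical: fake_llm stands in for a real model call, the price is made up, and counting whitespace-separated words is only a crude proxy for real token counts. Dedicated observability tools record much more, and hallucination detection is not covered here.

```python
import time
from functools import wraps

traces = []  # in a real system these records would go to a tracing backend


def traced(price_per_1k_tokens):
    """Record latency, approximate token count, and estimated cost per call."""

    def decorator(fn):
        @wraps(fn)
        def wrapper(prompt):
            start = time.perf_counter()
            response = fn(prompt)
            latency = time.perf_counter() - start
            # Crude proxy: whitespace-separated words, not real tokens.
            tokens = len(prompt.split()) + len(response.split())
            traces.append(
                {
                    "call": fn.__name__,
                    "latency_s": latency,
                    "tokens": tokens,
                    "cost": tokens / 1000 * price_per_1k_tokens,
                }
            )
            return response

        return wrapper

    return decorator


@traced(price_per_1k_tokens=0.002)
def fake_llm(prompt):
    """Stand-in for a model call so the sketch runs offline."""
    return "stub answer for: " + prompt


fake_llm("what is a fact table")
print(traces[0])
```

Wrapping every model call this way gives one trace record per request, which is the raw material for latency dashboards and cost reports.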

CONCEPTS

Understand these Concepts

Chunking strategies
Embedding models
Semantic search
pgvector / Pinecone
Prompt engineering
LangChain basics
AI pipeline monitoring
Claude Code CLI
From Melody

One thing I'd tell my past self

Don't try to learn all of this at once. Pick one module, build something real with it, and move forward. The engineers who grow fastest are the ones who ship, not the ones who finish every course.

Follow me for more cloud and data topics