Includes AI, RAG & Claude Code

Cloud Data Engineering
Roadmap 2026

A practical, opinionated guide to going from foundations to production-grade cloud data infrastructure, with the tools that actually matter this year.

Melody Egwuchukwu

Cloud Data Engineer

Phase 1 — Foundations

🐍

01 — Programming

Python

Video Series ↗

Python Full Tutorial

Comprehensive beginner-to-intermediate Python — the right starting point before anything else.

Video Series ↗

Python OOP

Object-oriented programming patterns — essential for writing maintainable pipeline code.

DSA — Easy to Medium

Striver's A2Z sheet. Focus on the topics relevant for data engineering interviews.

Python: Focus for Data Engineering

These are the topics you will work with most often as a data engineer.

DSA: Focus for Data Engineering

These topics help you understand how data systems work.

Arrays & Strings

Hashing (HashMap / Dictionary)

Stacks & Queues

Heaps / Priority Queue

Graphs (BFS / DFS – basic)

Recursion (basic understanding)
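As a small illustration of why hashing and heaps matter in data work, here is a minimal sketch of a "top k most frequent" query over an event log. The function name `top_k_frequent` and the sample user IDs are hypothetical, chosen only for this example.

```python
import heapq
from collections import Counter


def top_k_frequent(events, k):
    """Return the k most frequent items as (item, count) pairs, most frequent first."""
    # Hash map step: count occurrences of each item in a single pass.
    counts = Counter(events)
    # Heap step: nlargest avoids sorting the whole collection to find the top k.
    # Ties on count are broken by the item itself so the result is deterministic.
    return heapq.nlargest(k, counts.items(), key=lambda pair: (pair[1], pair[0]))


# Hypothetical sample data: user IDs seen in an event log.
event_log = ["u1", "u2", "u1", "u3", "u2", "u1"]
top_two = top_k_frequent(event_log, 2)
```

The same count-then-select shape shows up in aggregations across most data tools, which is one reason these two structures are worth knowing well.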

🗄

02 — Query Language

SQL

Full Course ↗

SQL Full Course

Everything from SELECT to window functions, joins, and subqueries in one sitting.
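To give a feel for window functions before starting the course, here is a minimal sketch that computes a running total per customer. It uses Python's built-in `sqlite3` module so it runs without a database server, and assumes the bundled SQLite is version 3.25 or newer (needed for window functions). The `orders` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ada", "2026-01-01", 10.0),
        ("ada", "2026-01-03", 25.0),
        ("bob", "2026-01-02", 40.0),
        ("bob", "2026-01-05", 5.0),
    ],
)

# SUM(...) OVER (...) keeps every row and adds a running total per customer,
# unlike GROUP BY, which would collapse each customer to a single row.
running_totals = conn.execute(
    """
    SELECT customer,
           order_day,
           SUM(amount) OVER (
               PARTITION BY customer
               ORDER BY order_day
           ) AS running_total
    FROM orders
    ORDER BY customer, order_day
    """
).fetchall()
conn.close()
```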

Video Series ↗

Database Design

Schema design principles, ER diagrams, and how data models are built in practice.

Tips Series ↗

SQL Tips & Tricks

Advanced patterns and optimisations that separate good SQL from production SQL.

StrataScratch — SQL Problems

Real interview questions from data engineering and analytics roles. Practice the Medium and Hard problems.

LeetCode — Top SQL 50

The canonical 50 problems that cover every SQL pattern you'll likely face in interviews.

Data Modelling and Design

Get a solid understanding of these concepts before diving into database and warehouse design.

ACID Properties

Primary & Foreign Keys
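The two concepts above can be seen working together in a short sketch: a foreign key rejects an order for a customer that does not exist, and the transaction rolls back so the valid insert made alongside it is not kept either (atomicity). This uses Python's built-in `sqlite3` module; the table names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'ada')")
conn.commit()

failed = False
try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 10.0)")
        # Customer 99 does not exist, so the foreign key check fails here.
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 5.0)")
except sqlite3.IntegrityError:
    failed = True

# Both inserts were part of one transaction, so neither survives the rollback.
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
```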

⌨️

03 — Operating System

Linux

Video Series ↗

Linux for Beginners

Navigate the filesystem, manage permissions, write scripts — all the skills you'll use daily.

Commands to Note

Filesystem navigation

chmod permissions

Shell scripting
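For readers coming from Python, the sketch below shows what `chmod 750` does by applying the same permission bits from code. It assumes a Linux or macOS machine (permission bits behave differently on Windows), and the script name `run_pipeline.sh` is hypothetical.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "run_pipeline.sh")
    with open(script_path, "w") as f:
        f.write("#!/bin/sh\necho 'pipeline ran'\n")

    # Same effect as `chmod 750 run_pipeline.sh` in a shell:
    # owner: read/write/execute (7), group: read/execute (5), others: nothing (0).
    os.chmod(script_path, 0o750)

    file_mode = os.stat(script_path).st_mode
    mode = stat.S_IMODE(file_mode)            # just the permission bits
    mode_string = stat.filemode(file_mode)    # the form `ls -l` prints
```

Each octal digit is the sum of read (4), write (2), and execute (1), which is how 7, 5, and 0 map to `rwx`, `r-x`, and `---`.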

Phase 2 — Core Engineering

⚡

04 — Distributed Processing

Spark & Kafka

Video Series ↗

Apache Spark — Complete Guide

Spark architecture, job execution, and hands-on PySpark from scratch.

Book

Spark: The Definitive Guide

Bill Chambers & Matei Zaharia. Read it alongside the video series.

Understand these Concepts

Spark architecture

Master & workers

Memory management

Jobs, stages & tasks

Tuning Spark jobs

Stream processing

05 — Data Modelling

Warehouse Design

Book

The Data Warehouse Toolkit (Ralph Kimball & Margy Ross)

Focus on Chapters 1–3. The foundational mental model for fact tables, dimensions, and star schemas.

Concepts

Medallion Architecture

Bronze → Silver → Gold layering pattern used in Databricks, Delta Lake, and modern lakehouses.

Star schema

Fact tables

Slowly changing dims

dbt modelling
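Of the concepts above, slowly changing dimensions are often the least intuitive, so here is a minimal sketch of a Type 2 update in plain Python: the current row is closed and a new current row is appended, which preserves history. The column names (`valid_from`, `valid_to`, `is_current`) follow a common convention but are assumptions here, and in practice this logic would usually live in SQL or a dbt snapshot instead of application code.

```python
from datetime import date


def apply_scd2(dimension_rows, customer_id, new_city, effective_date):
    """Return a new list of rows with a Type 2 change applied for one customer."""
    updated = []
    needs_new_row = True
    for row in dimension_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                # Nothing changed: keep the current row as it is.
                needs_new_row = False
                updated.append(row)
            else:
                # Close the old version instead of overwriting it.
                updated.append({**row, "valid_to": effective_date, "is_current": False})
        else:
            updated.append(row)
    if needs_new_row:
        updated.append(
            {
                "customer_id": customer_id,
                "city": new_city,
                "valid_from": effective_date,
                "valid_to": None,
                "is_current": True,
            }
        )
    return updated


# Hypothetical dimension table with a single customer.
dim_customer = [
    {
        "customer_id": 1,
        "city": "Lagos",
        "valid_from": date(2025, 1, 1),
        "valid_to": None,
        "is_current": True,
    }
]
dim_after_move = apply_scd2(dim_customer, 1, "London", date(2026, 3, 1))
```

After the update the table holds two rows for the customer, so a fact recorded in 2025 can still be joined to the city that was correct at the time.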

🔄

06 — Orchestration

Apache Airflow

Core Concept

DAGs & Operators

How Airflow scheduling works, what makes a good DAG, and how to avoid common anti-patterns.
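To make the idea of a DAG concrete without installing Airflow, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). This is not Airflow's API; it only illustrates the underlying idea that tasks run after their upstream dependencies and that cycles are not allowed. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_and_clean": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_and_clean"},
}

# A valid execution order: every task appears after all of its upstream tasks.
run_order = list(TopologicalSorter(pipeline).static_order())


def is_valid_dag(graph):
    """Return True if the dependency graph has no cycles."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False
```

A graph where two tasks depend on each other, such as `{"a": {"b"}, "b": {"a"}}`, fails this check, which is the same reason an orchestrator cannot schedule it.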

Core Concept

Sensors, Hooks & Connections

Connecting Airflow to your cloud services — S3, BigQuery, Snowflake, Databricks.

Phase 3 — DevOps & Cloud

🏗

07 — DevOps & Infrastructure

Git · Docker · Terraform · CI/CD

Tool

Git & GitHub

Version control, branching strategy, pull requests, and how teams collaborate on pipelines.

Tool

Docker

Containerise your pipelines. Build reproducible, portable environments for local and cloud.

Tool

Terraform

Infrastructure as code. Provision cloud resources on AWS, Azure, or GCP with version control.

Tool

CI/CD Pipelines

GitHub Actions or GitLab CI to automate testing, linting, and deployment of data pipelines.

Decide which cloud platform you want to learn first; the skills transfer to the others.

Phase 4 — 2026 Additions

08 — AI-Powered Data Engineering

RAG · Vector Databases · Claude Code

↗ Video Resource

RAG for Data Engineers

How Retrieval-Augmented Generation works, how to build RAG pipelines, and where data engineers sit in the AI stack. Essential watching for 2026.

Watch on YouTube ↗

New 2026

Vector Databases

Pinecone, Weaviate, pgvector. How embeddings are stored, indexed, and queried at scale. The data layer behind every AI product.
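At its core, querying a vector database means finding the stored embeddings closest to a query embedding. The sketch below does this by brute force with cosine similarity in plain Python. The three-dimensional vectors and document IDs are made up for illustration; real embeddings have hundreds or thousands of dimensions, and production systems typically rely on approximate nearest-neighbour indexes instead of comparing against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, index, k=1):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy index: document ID -> embedding.
index = {
    "airflow_docs": [0.9, 0.1, 0.0],
    "spark_docs": [0.1, 0.9, 0.1],
    "sql_docs": [0.0, 0.2, 0.9],
}
matches = nearest([0.8, 0.2, 0.0], index, k=2)
```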

New 2026

Claude Code

Agentic coding with Claude in the terminal. Accelerate pipeline development, debugging, and documentation — without leaving your workflow.

New 2026

Embedding Pipelines

Chunking strategies, embedding models, and how to build the data pipelines that feed RAG systems reliably in production.
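One of the simplest chunking strategies is a fixed-size window with overlap, sketched below. It splits on whitespace and counts words; real pipelines often count model tokens or split on sentence and section boundaries instead, so treat the function name `chunk_text` and its parameters as illustrative assumptions.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk reaches the end, to avoid a trailing fragment
        # that only repeats words already covered.
        if start + chunk_size >= len(words):
            break
    return chunks


sample_chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing and embedding some words twice.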

New 2026

LLM Observability

Monitoring AI pipelines is different from traditional pipelines. Tracing, latency, hallucination detection, and cost tracking.
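Two of the signals mentioned above, latency and cost, can be captured with very little code. The sketch below wraps a model call and records both. Everything here is hypothetical: the `Tracer` class, the `fake_llm_call` stand-in, and the per-token prices are placeholders, and hallucination detection is out of scope for this example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CallTrace:
    name: str
    latency_seconds: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class Tracer:
    # Hypothetical prices per 1,000 tokens; real prices depend on provider and model.
    input_price_per_1k: float = 0.001
    output_price_per_1k: float = 0.002
    traces: list = field(default_factory=list)

    def record(self, name, call):
        """Run `call`, then store its latency, token counts, and estimated cost."""
        start = time.perf_counter()
        result = call()
        latency = time.perf_counter() - start
        cost = (
            result["input_tokens"] / 1000 * self.input_price_per_1k
            + result["output_tokens"] / 1000 * self.output_price_per_1k
        )
        self.traces.append(
            CallTrace(name, latency, result["input_tokens"], result["output_tokens"], cost)
        )
        return result

    def total_cost(self):
        return sum(trace.cost_usd for trace in self.traces)


def fake_llm_call():
    """Stand-in for a real model call; returns text plus token counts."""
    return {"text": "ok", "input_tokens": 500, "output_tokens": 250}


tracer = Tracer()
response = tracer.record("summarise_ticket", fake_llm_call)
```

Unlike a traditional pipeline, cost here scales with the content of each request, which is why per-call token counts are worth logging alongside latency.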

Understand these Concepts

Chunking strategies

Embedding models

Semantic search

pgvector / Pinecone

Prompt engineering

LangChain basics

AI pipeline monitoring

Claude Code CLI

From Melody

One thing I'd tell my past self

Don't try to learn all of this at once. Pick one module, build something real with it, and move forward. The engineers who grow fastest are the ones who ship, not the ones who finish every course.

Follow me for more cloud and data topics