Python Full Tutorial
Comprehensive beginner-to-intermediate Python — the right starting point before anything else.
Python OOP
Object-oriented programming patterns — essential for writing maintainable pipeline code.
DSA — Easy to Medium
Striver's A2Z sheet. Focus on the topics relevant for data engineering interviews.
Python: Focus for Data Engineering
These are the Python tasks you will do most often as a data engineer.
DSA: Focus for Data Engineering
Helps with understanding how data systems work.
SQL Full Course
Everything from SELECT to window functions, joins, and subqueries in one sitting.
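To make the window function idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and its rows are invented for illustration, and the example assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical orders table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-01", 10.0),
        ("alice", "2024-01-05", 20.0),
        ("bob", "2024-01-02", 5.0),
        ("bob", "2024-01-03", 15.0),
    ],
)

# Running total per customer. A window function keeps every input row,
# unlike GROUP BY, which collapses each group into a single row.
query = """
SELECT customer,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer
           ORDER BY order_date
       ) AS running_total
FROM orders
ORDER BY customer, order_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same PARTITION BY and ORDER BY pattern carries over to ranking functions such as ROW_NUMBER and RANK, which come up often in interview problems.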
Database Design
Schema design principles, ER diagrams, and how data models are built in practice.
SQL Tips & Tricks
Advanced patterns and optimisations that separate good SQL from production SQL.
StrataScratch — SQL Problems
Real interview questions from data engineering and analytics roles. Practice the Medium and Hard problems.
LeetCode — Top SQL 50
The canonical 50 problems that cover every SQL pattern you'll likely face in interviews.
Data Modelling and Design
It is important to understand this well before diving into Database and Warehouse Design.
Command to Notes
Apache Spark — Complete Guide
Spark architecture, job execution, and hands-on PySpark from scratch.
Spark: The Definitive Guide
Bill Chambers & Matei Zaharia. Read it alongside the video series.
Understand these Concepts
The Kimball Dimensional Data Warehouse Toolkit
Focus on Chapters 1–3. The foundational mental model for fact tables, dimensions, and star schemas.
Medallion Architecture
Bronze → Silver → Gold layering pattern used in Databricks, Delta Lake, and modern lakehouses.
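As a rough sketch of the layering idea, the plain-Python example below moves a few records through the three stages. The sample records and the to_silver and to_gold helpers are hypothetical; a real lakehouse would typically do this with Spark and Delta tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as they arrived (hypothetical sample data).
bronze = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},
    {"user": "alice", "amount": "4.5"},
    {"user": "BOB", "amount": "3"},
]


def to_silver(records):
    """Clean and type the raw records, dropping rows that fail validation."""
    cleaned = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            # Assumption: bad rows are skipped here; a production pipeline
            # would more likely route them to a quarantine table.
            continue
        cleaned.append({"user": rec["user"].strip().lower(), "amount": amount})
    return cleaned


def to_gold(records):
    """Aggregate cleaned records into a business-level summary per user."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["user"]] += rec["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The point of the split is that each layer has one job: Bronze preserves the source, Silver enforces types and consistency, and Gold answers a specific business question.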
DAGs & Operators
How Airflow schedules work, what makes a good DAG, and how to avoid common anti-patterns.
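The core idea behind a DAG is that a task runs only after everything it depends on has finished. The sketch below illustrates that ordering with Python's standard-library graphlib module. It is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_orders": {"extract"},
    "transform_users": {"extract"},
    "load": {"transform_orders", "transform_users"},
}

# static_order() yields tasks so that every dependency appears before
# the task that needs it. A cycle would raise graphlib.CycleError,
# which is why a workflow graph has to be acyclic.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two transform tasks have no dependency on each other, so a scheduler is free to run them in parallel; only their position relative to extract and load is fixed.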
Sensors, Hooks & Connections
Connecting Airflow to your cloud services — S3, BigQuery, Snowflake, Databricks.
Git & GitHub
Version control, branching strategy, pull requests, and how teams collaborate on pipelines.
Docker
Containerise your pipelines. Build reproducible, portable environments for local and cloud.
Terraform
Infrastructure as code. Provision cloud resources on AWS, Azure, or GCP with version control.
CI/CD Pipelines
GitHub Actions or GitLab CI to automate testing, linting, and deployment of data pipelines.
Decide which cloud platform you want to learn; the skills are transferable to the others.
RAG for Data Engineers
How Retrieval-Augmented Generation works, how to build RAG pipelines, and where data engineers sit in the AI stack. Essential watching for 2026.
Vector Databases
Pinecone, Weaviate, pgvector. How embeddings are stored, indexed, and queried at scale. The data layer behind every AI product.
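To show what "querying embeddings" means at its simplest, here is a brute-force similarity search in plain Python. The three-dimensional vectors and document IDs are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and the databases above generally rely on approximate nearest-neighbour indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "index": document ID mapped to its embedding.
index = {
    "doc_spark": [0.9, 0.1, 0.0],
    "doc_airflow": [0.1, 0.9, 0.0],
    "doc_sql": [0.0, 0.2, 0.9],
}


def search(query_vector, top_k=2):
    """Score every stored vector against the query and return the best matches."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


results = search([0.8, 0.2, 0.0])
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Scanning every vector costs time proportional to the size of the collection, which is the scaling problem that dedicated vector indexes are designed to avoid.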
Claude Code
Agentic coding with Claude in the terminal. Accelerate pipeline development, debugging, and documentation — without leaving your workflow.
Embedding Pipelines
Chunking strategies, embedding models, and how to build the data pipelines that feed RAG systems reliably in production.
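One of the simplest chunking strategies is fixed-size chunks with overlap, sketched below. The chunk_text function and its default sizes are assumptions for illustration; production pipelines often split on tokens, sentences, or document structure instead of raw characters.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so the
        # tail is not repeated as a shorter trailing chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
chunks = chunk_text(sample, chunk_size=4, overlap=2)
print(chunks)
```

The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some characters twice.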
LLM Observability
Monitoring AI pipelines is different from traditional pipelines. Tracing, latency, hallucination detection, and cost tracking.