
Data Engineer Learning Path

A structured 12-week journey through the Knowledge Vault for aspiring and intermediate data engineers. This path covers SQL mastery, data modeling, ETL/ELT patterns, stream processing, the full 25-page data pipeline section, the 69-page exploratory data analysis (EDA) section, real-time analytics, lakehouse architecture, orchestration with Airflow or Prefect, data quality with Great Expectations and Pandera, and production operations.

Who This Is For

  • Software engineers transitioning into data engineering
  • Junior data engineers building towards mid/senior level
  • Backend engineers who want to own data pipelines
  • Anyone preparing for data engineering interviews

Prerequisites

  • Basic programming in Python
  • Familiarity with SQL syntax (SELECT, JOIN, WHERE)
  • Understanding of what a database is
  • No prior data engineering experience required

Total estimated time: ~67 hours of reading across 12 weeks (the sum of the per-section estimates below), plus capstone project time

Learning Progression


Week 1-2: Database Foundations

Estimated reading time: 6 hours

Checkpoint

After this section you should be able to explain: how B-tree indexes speed up queries, what a write-ahead log (WAL) guarantees, why ClickHouse is fast for analytics (columnar storage, vectorized execution), and when to choose a columnar database over a row-based one.
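As a small self-check for the index part of this checkpoint, the sketch below asks SQLite (whose indexes are B-trees) for its query plan before and after an index is created. The table `events`, the index name `idx_events_user`, and the helper `query_plan` are hypothetical names chosen for illustration; the exact plan wording varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

sql = "SELECT SUM(amount) FROM events WHERE user_id = ?"

# Without an index on user_id, SQLite has to scan every row.
plan_before = query_plan(conn, sql, (42,))

# With a B-tree index, it can seek directly to the matching keys.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should report a full scan of `events` and the second a search using `idx_events_user`, which is the difference between reading all rows and descending the tree to the matching entries.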


Week 2-3: SQL Mastery

Estimated reading time: 4 hours

Checkpoint

After this section you should be able to: read an EXPLAIN plan, design indexes for query patterns, write window functions and CTEs fluently, and explain why connection pooling matters for pipelines.
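For a sense of what "window functions and CTEs" looks like in practice, here is a minimal sketch that combines both: a CTE aggregates orders per customer per day, and a window function turns the daily totals into a running total. The `orders` table and its rows are made-up illustration data, and the example assumes the SQLite bundled with your Python is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-01", 10.0),
        ("alice", "2024-01-02", 20.0),
        ("bob", "2024-01-01", 5.0),
        ("alice", "2024-01-02", 5.0),
        ("bob", "2024-01-03", 7.5),
    ],
)

RUNNING_TOTAL_SQL = """
WITH daily AS (                      -- CTE: one row per customer per day
    SELECT customer, order_day, SUM(amount) AS day_total
    FROM orders
    GROUP BY customer, order_day
)
SELECT customer,
       order_day,
       day_total,
       SUM(day_total) OVER (         -- window: running total per customer
           PARTITION BY customer
           ORDER BY order_day
       ) AS running_total
FROM daily
ORDER BY customer, order_day
"""

rows = conn.execute(RUNNING_TOTAL_SQL).fetchall()
for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window function keeps one output row per input row of `daily`, so each day carries both its own total and the cumulative figure.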


Week 3-4: Data Modeling

Estimated reading time: 5 hours


Week 4-5: ETL/ELT Patterns

Estimated reading time: 5 hours


Week 5-6: Data Pipeline (25 pages)

Estimated reading time: 8 hours

The complete data pipeline section, covering ingestion, preprocessing, validation, and orchestration.

Data Ingestion

Preprocessing

Validation & Quality

Orchestration

Checkpoint

After this section you should be able to: build end-to-end data pipelines with ingestion, preprocessing, validation, and orchestration; choose between Airflow and Prefect for a given workload; and implement data quality checks with Great Expectations and Pandera.
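To make the "data quality checks" part concrete, the sketch below shows the general shape of a row-level validation step that splits a batch into valid and rejected rows. It is a hand-rolled stand-in written in plain Python, not the Great Expectations or Pandera API; the field names (`order_id`, `amount`, `currency`) and the rules are assumptions made up for illustration.

```python
from dataclasses import dataclass, field

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical rule set

@dataclass
class ValidationReport:
    valid: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (row, reasons) pairs

def check_row(row):
    """Return a list of human-readable rule violations for one row."""
    reasons = []
    if not isinstance(row.get("order_id"), int):
        reasons.append("order_id must be an integer")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        reasons.append("amount must be a non-negative number")
    if row.get("currency") not in ALLOWED_CURRENCIES:
        reasons.append("currency not in allowed set")
    return reasons

def validate(rows):
    """Split rows into valid and rejected, keeping the reasons for rejection."""
    report = ValidationReport()
    for row in rows:
        reasons = check_row(row)
        if reasons:
            report.rejected.append((row, reasons))
        else:
            report.valid.append(row)
    return report

report = validate([
    {"order_id": 1, "amount": 9.5, "currency": "USD"},
    {"order_id": "x", "amount": -1, "currency": "JPY"},
])
print(len(report.valid), "valid,", len(report.rejected), "rejected")
```

Keeping rejected rows together with their reasons, rather than dropping them silently, is what lets a pipeline route bad records to a quarantine table and alert on the failure rate.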


Week 6-7: Stream Processing

Estimated reading time: 6 hours


Week 7-8: EDA Foundations (Part 1 of the 69-page EDA section)

Estimated reading time: 8 hours

The first half of the EDA section, covering Python tools, univariate analysis, and data cleaning.

Python Tools

Univariate Analysis

Data Cleaning


Week 8-9: EDA Advanced (Part 2 of the 69-page EDA section)

Estimated reading time: 8 hours

Bivariate/multivariate analysis, feature engineering, statistical testing, and domain-specific EDA.

Bivariate & Multivariate

Feature Engineering

Advanced Topics


Week 9-10: Orchestration & Pipeline Patterns

Estimated reading time: 5 hours


Week 10-11: Lakehouse & Real-Time Analytics

Estimated reading time: 5 hours


Week 11: Observability & Operations

Estimated reading time: 4 hours


Week 12: Capstone Projects

Estimated reading time: 3 hours + project time

EDA Projects

Pipeline Projects

Production Blueprints


What You Will Be Able to Do After This Path

  • Design star schemas, Data Vault models, and SCDs for any business domain
  • Build idempotent ETL/ELT pipelines with proper error handling
  • Process streaming data with Kafka, windowing, and watermarks
  • Perform comprehensive EDA with pandas, matplotlib, seaborn, and plotly
  • Orchestrate pipelines with Airflow or Prefect
  • Validate data quality with Great Expectations and Pandera
  • Design lakehouse architectures with medallion layers
  • Build real-time analytics with ClickHouse, Druid, or Pinot
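"Idempotent" in the pipeline outcome above means that re-running a load, for example after a retry, leaves the target in the same state as running it once. A minimal sketch of one common way to get there, an upsert keyed on a primary key, is shown below. The `orders` table and the `load_batch` helper are hypothetical, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

def load_batch(conn, batch):
    """Upsert (order_id, amount) pairs; safe to call repeatedly with the same batch."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            """
            INSERT INTO orders (order_id, amount) VALUES (?, ?)
            ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
            """,
            batch,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 20.0)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch

rows = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
print(rows)
```

A plain `INSERT` would either fail on the retry or, without the primary key, duplicate the rows; keying the write on a stable identifier is what makes the retry harmless.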

Total Progress

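This path contains approximately 130 pages (including 69 EDA pages and 25 data pipeline pages). Budget 12 weeks at 5 to 6 hours per week: the per-section reading estimates above add up to about 67 hours, before capstone project time. The EDA section alone could take 3-4 weeks of focused study.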

"What I cannot create, I do not understand." — Richard Feynman