Data Engineer Learning Path
A structured 12-week journey through the Knowledge Vault for aspiring and intermediate data engineers. This path covers database internals, SQL mastery, data modeling, ETL/ELT patterns, the 25-page data pipeline section, stream processing, the 69-page EDA section, orchestration with Airflow and Prefect, data quality with Great Expectations and Pandera, lakehouse architecture, real-time analytics, and production operations.
Who This Is For
- Software engineers transitioning into data engineering
- Junior data engineers working toward a mid-level or senior role
- Backend engineers who want to own data pipelines
- Anyone preparing for data engineering interviews
Prerequisites
- Basic programming in Python
- Familiarity with SQL syntax (SELECT, JOIN, WHERE)
- Understanding of what a database is
- No prior data engineering experience required
Total estimated time: ~67 hours of reading across 12 weeks (the sum of the per-section estimates), plus capstone project time
Learning Progression
Week 1-2: Database Foundations
Estimated reading time: 6 hours
- [ ] Required -- Database Selection Guide (20 min)
- [ ] Required -- Storage Engines (30 min)
- [ ] Required -- PostgreSQL Internals (35 min)
- [ ] Required -- Write-Ahead Logging (25 min)
- [ ] Required -- Indexing Deep Dive (30 min)
- [ ] Required -- Isolation Levels (25 min)
- [ ] Required -- MVCC (20 min)
- [ ] Required -- Replication (30 min)
- [ ] Required -- ClickHouse Internals (25 min)
- [ ] Optional -- MongoDB Internals (25 min)
- [ ] Optional -- Time-Series Databases (20 min)
Checkpoint
After this section you should be able to explain: how B-tree indexes speed up queries, what WAL guarantees, why ClickHouse is fast for analytics (columnar storage, vectorized execution), and when to choose a columnar database over a row-based one.
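To make the columnar-versus-row distinction concrete, here is a toy sketch in plain Python. It is not how ClickHouse or any real engine stores data; the table, column names, and values are made up for illustration. It only shows that an aggregate over one column can skip every other column when the data is laid out column by column.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and values are hypothetical, chosen only for illustration.

# Row-oriented: each record is stored together, as in a typical OLTP engine.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 20.0},
    {"order_id": 2, "customer": "bo", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "amount": 12.5},
]

# Column-oriented: each column is stored as its own contiguous array.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bo", "ana"],
    "amount": [20.0, 35.5, 12.5],
}


def total_amount_rowwise(table):
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in table)


def total_amount_columnar(table):
    # Only the "amount" array is read; the other columns are never touched.
    return sum(table["amount"])


row_total = total_amount_rowwise(rows)
col_total = total_amount_columnar(columns)
print(row_total, col_total)
```

Both functions return the same total. The difference is how much data each one has to walk through, which is the property analytical engines exploit at scale.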
Week 2-3: SQL Mastery
Estimated reading time: 4 hours
- [ ] Required -- Query Planning & Optimization (30 min)
- [ ] Required -- Index Strategy (25 min)
- [ ] Required -- Query Optimization (25 min)
- [ ] Required -- N+1 Problem (20 min)
- [ ] Required -- Connection Pooling (20 min)
- [ ] Required -- SQL Cheat Sheet (15 min)
- [ ] Required -- Advanced SQL Cheat Sheet (15 min)
- [ ] Optional -- Connection Pool Tuning (20 min)
- [ ] Optional -- VACUUM & ANALYZE (20 min)
- [ ] Optional -- Database Profiling (25 min)
Checkpoint
After this section you should be able to: read an EXPLAIN plan, design indexes for query patterns, write window functions and CTEs fluently, and understand why connection pooling matters for pipelines.
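As a self-check for this checkpoint, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a production database. The `orders` table and the index name are hypothetical, and SQLite's `EXPLAIN QUERY PLAN` output is much simpler than PostgreSQL's `EXPLAIN`, so treat it as a warm-up rather than a substitute. The window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("bo", 35.5), ("ana", 12.5), ("bo", 5.0)],
)

query = "SELECT amount FROM orders WHERE customer = ?"


def plan(sql, params=()):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


# Without an index on customer, the filter needs a full table scan.
plan_before = plan(query, ("ana",))

# After indexing the filtered column, the planner can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = plan(query, ("ana",))

print(plan_before)
print(plan_after)

# A CTE feeding a window function: running total of amount per customer.
running_sql = """
WITH customer_orders AS (
    SELECT customer, amount FROM orders
)
SELECT customer,
       amount,
       SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
FROM customer_orders
ORDER BY customer, amount
"""
running = conn.execute(running_sql).fetchall()
for row in running:
    print(row)
```

The first plan should mention a scan of `orders` and the second a search using `idx_orders_customer`. The exact wording of the detail strings varies between SQLite versions.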
Week 3-4: Data Modeling
Estimated reading time: 5 hours
- [ ] Required -- Data Modeling Overview (15 min)
- [ ] Required -- Normalization & Denormalization (30 min)
- [ ] Required -- Dimensional Modeling (35 min)
- [ ] Required -- Slowly Changing Dimensions (30 min)
- [ ] Required -- Schema Evolution (25 min)
- [ ] Required -- Data Vault (30 min)
- [ ] Optional -- Event Schema Evolution (25 min)
Week 4-5: ETL/ELT Patterns
Estimated reading time: 5 hours
- [ ] Required -- ETL Patterns Overview (15 min)
- [ ] Required -- ETL vs ELT (25 min)
- [ ] Required -- Batch Processing (30 min)
- [ ] Required -- Incremental Loads (25 min)
- [ ] Required -- Idempotent Pipelines (25 min)
- [ ] Required -- Error Handling (25 min)
- [ ] Optional -- CDC Patterns (30 min)
Week 5-6: Data Pipeline (25 pages)
Estimated reading time: 8 hours
The complete data pipeline section, covering ingestion, preprocessing, validation, and orchestration.
Data Ingestion
- [ ] Required -- Data Pipeline Overview (15 min)
- [ ] Required -- Pipeline Patterns (25 min)
- [ ] Required -- Database Extraction (20 min)
- [ ] Required -- API Ingestion (20 min)
- [ ] Required -- Web Scraping (20 min)
- [ ] Required -- Data Contracts (25 min)
- [ ] Required -- File Formats (20 min)
Preprocessing
- [ ] Required -- Preprocessing Pipeline (25 min)
- [ ] Required -- Numerical Preprocessing (20 min)
- [ ] Required -- Categorical Preprocessing (20 min)
- [ ] Required -- Text Preprocessing (20 min)
- [ ] Required -- Datetime Preprocessing (20 min)
- [ ] Required -- String Preprocessing (15 min)
- [ ] Optional -- Missing Imputation (20 min)
- [ ] Optional -- Image Preprocessing (20 min)
- [ ] Optional -- Type Inference (15 min)
- [ ] Optional -- Deduplication (20 min)
Validation & Quality
- [ ] Required -- Great Expectations (25 min)
- [ ] Required -- Pandera Validation (20 min)
- [ ] Required -- Pipeline Monitoring (20 min)
Orchestration
- [ ] Required -- Airflow Pipelines (25 min)
- [ ] Required -- Prefect Pipelines (25 min)
Checkpoint
After this section you should be able to: build end-to-end data pipelines with ingestion, preprocessing, validation, and orchestration; choose between Airflow and Prefect; and implement data quality checks with Great Expectations and Pandera.
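The shape of such a pipeline can be sketched in plain Python before reaching for the tools. The example below is hand-rolled and deliberately does not use the Great Expectations or Pandera APIs. The function names, record fields, and checks are all hypothetical; it only illustrates the ingest, preprocess, validate flow that those libraries formalize, with an orchestrator such as Airflow or Prefect scheduling each step.

```python
from datetime import date

# Hand-rolled sketch of ingest -> preprocess -> validate.
# NOT the Great Expectations or Pandera API; every name here is hypothetical.


def ingest():
    # Stand-in for a database extract or an API call.
    return [
        {"user_id": "1", "signup_date": "2024-01-05", "plan": " Pro "},
        {"user_id": "2", "signup_date": "2024-02-11", "plan": "free"},
        {"user_id": "", "signup_date": "2024-03-01", "plan": "FREE"},
    ]


def preprocess(records):
    # Cast types and normalize strings; a missing id becomes None.
    cleaned = []
    for r in records:
        cleaned.append(
            {
                "user_id": int(r["user_id"]) if r["user_id"] else None,
                "signup_date": date.fromisoformat(r["signup_date"]),
                "plan": r["plan"].strip().lower(),
            }
        )
    return cleaned


# Named row-level checks, similar in spirit to declarative expectations.
CHECKS = {
    "user_id is present": lambda r: r["user_id"] is not None,
    "plan is a known value": lambda r: r["plan"] in {"free", "pro"},
}


def validate(records):
    # Split records into passing rows and (row, failed check names) pairs.
    valid, rejected = [], []
    for r in records:
        failed = [name for name, check in CHECKS.items() if not check(r)]
        if failed:
            rejected.append((r, failed))
        else:
            valid.append(r)
    return valid, rejected


valid, rejected = validate(preprocess(ingest()))
print(len(valid), "valid;", len(rejected), "rejected")
for record, failed in rejected:
    print("rejected:", record, "failed:", failed)
```

Keeping rejected rows together with the names of the checks they failed, rather than silently dropping them, is what makes the quality step observable downstream.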
Week 6-7: Stream Processing
Estimated reading time: 6 hours
- [ ] Required -- Stream Processing Overview (15 min)
- [ ] Required -- Kafka Internals (35 min)
- [ ] Required -- Exactly-Once Processing (25 min)
- [ ] Required -- Windowing (25 min)
- [ ] Required -- Watermarks (25 min)
- [ ] Required -- Backpressure (20 min)
- [ ] Required -- State Management (25 min)
- [ ] Optional -- Kafka Streams (25 min)
- [ ] Optional -- Kafka Connect (20 min)
- [ ] Optional -- Ordering Guarantees (25 min)
- [ ] Optional -- Queue Selection Guide (20 min)
Week 7-8: EDA Foundations (Part 1 of 2; 69 pages total)
Estimated reading time: 8 hours
The first half of the comprehensive EDA section, covering Python tools, univariate analysis, and data cleaning.
Python Tools
- [ ] Required -- EDA Overview (15 min)
- [ ] Required -- EDA Workflow (25 min)
- [ ] Required -- NumPy (25 min)
- [ ] Required -- Pandas Fundamentals (30 min)
- [ ] Required -- Pandas Advanced (30 min)
- [ ] Required -- Matplotlib (25 min)
- [ ] Required -- Seaborn (25 min)
- [ ] Optional -- Plotly (25 min)
- [ ] Optional -- Polars for EDA (25 min)
- [ ] Optional -- SciPy Stats (25 min)
- [ ] Reference -- Pandas EDA Cheat Sheet (10 min)
Univariate Analysis
- [ ] Required -- Data Types Deep Dive (20 min)
- [ ] Required -- Understanding Distributions (25 min)
- [ ] Required -- Univariate Numerical (25 min)
- [ ] Required -- Univariate Categorical (25 min)
- [ ] Required -- Univariate Temporal (20 min)
- [ ] Optional -- Univariate Text (20 min)
- [ ] Optional -- Understanding Scale (20 min)
Data Cleaning
- [ ] Required -- Missing Data (25 min)
- [ ] Required -- Outlier Analysis (25 min)
- [ ] Required -- Data Profiling (20 min)
- [ ] Required -- Data Cleaning Categories (20 min)
- [ ] Optional -- Data Cleaning Dates (20 min)
- [ ] Optional -- Data Cleaning Text (20 min)
- [ ] Optional -- Data Cleaning Edge Cases (20 min)
Week 8-9: EDA Advanced (Part 2 of 2; 69 pages total)
Estimated reading time: 8 hours
The second half of the EDA section, covering bivariate and multivariate analysis, feature engineering, and advanced topics such as sampling, data leakage, statistical testing, and data drift.
Bivariate & Multivariate
- [ ] Required -- Bivariate Num-Num (25 min)
- [ ] Required -- Bivariate Cat-Num (25 min)
- [ ] Required -- Bivariate Cat-Cat (20 min)
- [ ] Required -- Multivariate (25 min)
- [ ] Required -- Correlation Traps (20 min)
- [ ] Required -- Multicollinearity (20 min)
Feature Engineering
- [ ] Required -- Feature Creation (25 min)
- [ ] Required -- Encoding Strategies (25 min)
- [ ] Required -- Scaling & Normalization (20 min)
- [ ] Required -- Transformations (20 min)
- [ ] Optional -- Text Features (20 min)
- [ ] Optional -- Datetime Features (20 min)
- [ ] Optional -- High Cardinality (20 min)
Advanced Topics
- [ ] Required -- Imbalanced Data (20 min)
- [ ] Required -- Data Quality Validation (20 min)
- [ ] Required -- Data Leakage (20 min)
- [ ] Required -- Sampling Strategies (20 min)
- [ ] Optional -- Statistical Test Selector (20 min)
- [ ] Optional -- Statistical Power (20 min)
- [ ] Optional -- Data Drift (20 min)
- [ ] Optional -- Reproducibility (20 min)
- [ ] Optional -- Large Datasets (20 min)
- [ ] Optional -- Small Datasets (15 min)
- [ ] Optional -- EDA Checklist (15 min)
- [ ] Optional -- Visualization Decision Tree (15 min)
- [ ] Optional -- Automated EDA (20 min)
- [ ] Optional -- Communicating Findings (20 min)
Week 9-10: Orchestration & Pipeline Patterns
Estimated reading time: 5 hours
- [ ] Required -- Pipeline Patterns Overview (15 min)
- [ ] Required -- Orchestration (30 min)
- [ ] Required -- Data Lineage (25 min)
- [ ] Required -- Testing Data Pipelines (25 min)
- [ ] Required -- Data Quality Checks (25 min)
- [ ] Required -- Job Queue Blueprint (40 min)
- [ ] Optional -- Docker Overview (15 min)
- [ ] Optional -- Production Dockerfiles (25 min)
- [ ] Optional -- GitHub Actions Deep Dive (30 min)
Week 10-11: Lakehouse & Real-Time Analytics
Estimated reading time: 5 hours
- [ ] Required -- Lakehouse Overview (15 min)
- [ ] Required -- Medallion Architecture (25 min)
- [ ] Required -- Table Formats (25 min)
- [ ] Required -- Query Engines (25 min)
- [ ] Required -- Real-Time Analytics Overview (15 min)
- [ ] Required -- ClickHouse vs Druid vs Pinot (25 min)
- [ ] Optional -- Analytics Pipeline Blueprint (40 min)
- [ ] Optional -- Realtime Pipeline Blueprint (35 min)
Week 11: Observability & Operations
Estimated reading time: 4 hours
- [ ] Required -- Monitoring Overview (15 min)
- [ ] Required -- Metrics Design (25 min)
- [ ] Required -- Structured Logging (25 min)
- [ ] Required -- Alert Design (25 min)
- [ ] Required -- Correlation IDs (20 min)
- [ ] Optional -- Grafana Dashboards (25 min)
Week 12: Capstone Projects
Estimated reading time: 3 hours + project time
EDA Projects
- [ ] Optional -- Project: Titanic (30 min)
- [ ] Optional -- Project: E-Commerce (30 min)
- [ ] Optional -- Project: Financial (30 min)
- [ ] Optional -- Project: Healthcare (30 min)
- [ ] Optional -- Project: NLP (30 min)
Pipeline Projects
- [ ] Optional -- Project: E-Commerce Pipeline (30 min)
- [ ] Optional -- Project: IoT Streaming (30 min)
- [ ] Optional -- Project: Real Estate (30 min)
Production Blueprints
- [ ] Required -- Analytics Pipeline Blueprint (40 min)
- [ ] Required -- Search Service Blueprint (40 min)
- [ ] Optional -- Audit Log Blueprint (35 min)
What You Will Be Able to Do After This Path
- Design star schemas, Data Vault models, and SCDs for any business domain
- Build idempotent ETL/ELT pipelines with proper error handling
- Process streaming data with Kafka, windowing, and watermarks
- Perform comprehensive EDA with pandas, matplotlib, seaborn, and plotly
- Orchestrate pipelines with Airflow or Prefect
- Validate data quality with Great Expectations and Pandera
- Design lakehouse architectures with medallion layers
- Build real-time analytics with ClickHouse, Druid, or Pinot
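As a taste of what "idempotent" means in the list above, here is a minimal sketch using Python's built-in `sqlite3` module. The `daily_sales` table and the `load_batch` helper are hypothetical, and the upsert syntax assumes SQLite 3.24 or newer. The point is only that re-running the same load leaves the target in the same state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_sales ("
    "sale_date TEXT, store TEXT, revenue REAL, "
    "PRIMARY KEY (sale_date, store))"
)


def load_batch(batch):
    # Upsert keyed on (sale_date, store): re-running the same batch
    # overwrites existing rows instead of inserting duplicates.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO daily_sales (sale_date, store, revenue)
            VALUES (?, ?, ?)
            ON CONFLICT (sale_date, store)
            DO UPDATE SET revenue = excluded.revenue
            """,
            batch,
        )


batch = [("2024-01-01", "north", 100.0), ("2024-01-01", "south", 80.0)]
load_batch(batch)
load_batch(batch)  # simulated retry after a partial failure

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(revenue) FROM daily_sales"
).fetchone()
print(row_count, total)
```

A plain `INSERT` would have either failed on the primary key or, without one, doubled the revenue on retry. Keying the write on a natural identifier is one common way to make retries safe.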
Cross-References to Related Paths
- Data Scientist Path -- From EDA into ML/DL modeling
- ML/DL Engineer Path -- Deep learning after data foundations
- AI/ML Engineer Path -- AI engineering and LLM integration
- Backend Engineer Path -- Backend systems that feed data pipelines
- Platform Engineer Path -- Infrastructure for data platforms
Total Progress
This path contains approximately 130 pages (including 69 EDA pages and 25 data pipeline pages). Budget 12 weeks at 5 to 6 hours per week, since the per-section reading estimates sum to about 67 hours. The EDA section alone could take 3-4 weeks of focused study.