DevOps Engineer Learning Path

A structured 12-week journey through the Knowledge Vault for DevOps engineers, SREs, and infrastructure engineers. This path takes you from containerization through orchestration, infrastructure as code, CI/CD pipelines, SRE practices, observability tools, incident response, debugging in production, disaster recovery, runbooks, and release engineering.

Who This Is For

Developers transitioning into DevOps or SRE roles
Junior DevOps engineers building toward mid-level roles
Backend engineers who want to own their production infrastructure
Anyone preparing for SRE interviews

Prerequisites

Basic Linux command line (navigation, processes, permissions)
Understanding of networking (TCP/IP, DNS, HTTP)
Some experience with at least one cloud provider (AWS, GCP, or Azure)
Familiarity with Git and CI/CD concepts

Total estimated time: ~55 hours across 12 weeks

Learning Progression

Week 1-2: Docker & Containers

Estimated reading time: 4 hours

Containers are the foundation of modern infrastructure. Understand how they work, how to build efficient images, and how to secure them.

[ ] Required -- Docker Overview (15 min)
[ ] Required -- Docker Internals (30 min)
[ ] Required -- Production Dockerfiles (25 min)
[ ] Required -- Multi-Stage Builds (25 min)
[ ] Required -- Image Optimization (25 min)
[ ] Required -- Compose Patterns (25 min)
[ ] Required -- Security Hardening (25 min)
[ ] Reference -- Docker Cheat Sheet (10 min)
[ ] Reference -- Docker Compose Cheat Sheet (10 min)

Checkpoint

After this section you should be able to: write production-grade multi-stage Dockerfiles, optimize image size and layer caching, use Docker Compose for local development, and apply security best practices (non-root users, read-only filesystems, minimal base images).

Week 2-3: Kubernetes Fundamentals

Estimated reading time: 4.5 hours

Kubernetes is the standard for container orchestration. Start with the architecture and core resource types.

[ ] Required -- Kubernetes Overview (15 min)
[ ] Required -- Architecture & Internals (35 min)
[ ] Required -- Pod Lifecycle (25 min)
[ ] Required -- Deployments & StatefulSets (30 min)
[ ] Required -- Services & Ingress (25 min)
[ ] Required -- Secrets Management (25 min)
[ ] Required -- Network Policies (25 min)
[ ] Required -- Helm Charts (25 min)
[ ] Reference -- Kubernetes Cheat Sheet (10 min)
[ ] Reference -- Helm Cheat Sheet (10 min)

Checkpoint

After this section you should be able to: explain the K8s control plane components, create Deployments and Services, configure health probes, manage secrets, and write basic Helm charts.

Week 3-4: Kubernetes Production

Estimated reading time: 4 hours

Running Kubernetes in production requires understanding scaling, security, debugging, and operational patterns.

[ ] Required -- RBAC (25 min)
[ ] Required -- HPA, VPA & KEDA (30 min)
[ ] Required -- Production Checklist (25 min)
[ ] Required -- Troubleshooting (30 min)
[ ] Required -- CNI Networking (25 min)
[ ] Optional -- Operators (25 min)
[ ] Optional -- CRDs & Operators (25 min)
[ ] Optional -- Admission Webhooks (25 min)
[ ] Optional -- GitOps (25 min)
[ ] Optional -- ECS vs EKS (25 min)
[ ] Optional -- GKE (25 min)
[ ] Reference -- kubectl Advanced Cheat Sheet (10 min)

Checkpoint

After this section you should be able to: configure RBAC policies, set up horizontal pod autoscaling, debug CrashLoopBackOff and OOMKilled pods, implement GitOps with Argo CD or Flux, and understand CRDs and admission webhooks.
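
The horizontal pod autoscaling named in this checkpoint rests on a simple ratio: the Kubernetes documentation describes the desired replica count as ceil(currentReplicas * currentMetric / targetMetric). The following is a minimal Python sketch of that rule, not the controller's actual code. The function name, the replica bounds, and the 10% tolerance are illustrative assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed values in this sketch).
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods averaging 90% CPU against a 60% target scale out.
scaled_out = desired_replicas(4, 90, 60)
# A small deviation from the target leaves the count unchanged.
unchanged = desired_replicas(4, 62, 60)
```

With these inputs the ratio is 1.5, so 4 replicas become 6; the second call stays at 4 because 62/60 falls inside the tolerance band.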

Week 4-5: Terraform & Infrastructure as Code

Estimated reading time: 5 hours

Manage infrastructure declaratively. Terraform is one of the most widely adopted IaC tools and works across cloud providers.

[ ] Required -- Terraform Overview (15 min)
[ ] Required -- Terraform Fundamentals (30 min)
[ ] Required -- State Management (30 min)
[ ] Required -- Modules (25 min)
[ ] Required -- Workspaces (20 min)
[ ] Required -- Security Hardening (25 min)
[ ] Optional -- AWS Startup Stack (30 min)
[ ] Optional -- GCP Startup Stack (25 min)
[ ] Optional -- Cost Optimization (20 min)
[ ] Optional -- Multi-Region Terraform (25 min)
[ ] Reference -- Terraform Cheat Sheet (10 min)

Checkpoint

After this section you should be able to: write modular Terraform configurations, manage remote state safely, use workspaces for environment separation, and apply security best practices to IaC.

Week 5-6: CI/CD Pipelines & Release Engineering

Estimated reading time: 5 hours

Automate building, testing, and deploying your applications with robust CI/CD pipelines and mature release practices.

CI/CD

[ ] Required -- CI/CD Overview (15 min)
[ ] Required -- Pipeline Patterns (25 min)
[ ] Required -- GitHub Actions Deep Dive (30 min)
[ ] Required -- Environment Promotion (25 min)
[ ] Required -- Artifact Management (20 min)
[ ] Required -- Security Scanning (25 min)
[ ] Optional -- GitLab CI (25 min)

Deployment & Release

[ ] Required -- Deployment Strategies Overview (15 min)
[ ] Required -- Blue-Green Deployments (20 min)
[ ] Required -- Canary Deployments (20 min)
[ ] Required -- Rolling Updates (20 min)
[ ] Required -- Release Engineering (25 min)
[ ] Required -- Feature Flags (25 min)
[ ] Optional -- Feature Flags for Deployment (20 min)
[ ] Optional -- Rollback Procedures (20 min)
[ ] Optional -- Database Migrations (25 min)

Checkpoint

After this section you should be able to: design multi-stage CI/CD pipelines, implement blue-green and canary deployments, integrate security scanning into the pipeline, manage release trains with feature flags, and handle rollbacks safely.
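
The canary deployments in this checkpoint come down to a comparison between the canary and the stable baseline. The sketch below shows one possible promotion gate in Python. The function name, the 1.5x ratio, the 0.1% floor, and the minimum request count are all hypothetical values chosen for illustration, not recommendations from the linked pages.

```python
def canary_verdict(baseline_errors, baseline_total, canary_errors, canary_total,
                   max_ratio=1.5, min_requests=100):
    """Return "wait", "promote", or "rollback" for a canary release."""
    # Too little traffic on the canary to judge it yet.
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Allow the canary some headroom over the baseline; the absolute floor
    # keeps a zero-error baseline from failing a canary on a single error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"


verdict_ok = canary_verdict(10, 10_000, 1, 1_000)
verdict_bad = canary_verdict(10, 10_000, 5, 1_000)
```

A real pipeline would usually also compare latency and run the check repeatedly over the canary window rather than once.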

Week 6-7: Monitoring & Metrics

Estimated reading time: 3.5 hours

You cannot manage what you cannot measure. Build comprehensive observability into your infrastructure.

[ ] Required -- Monitoring Overview (15 min)
[ ] Required -- Metrics Design (25 min)
[ ] Required -- Prometheus Deep Dive (30 min)
[ ] Required -- Grafana Dashboards (25 min)
[ ] Required -- Custom Metrics (20 min)
[ ] Required -- Observability Tools (25 min)
[ ] Optional -- Monitoring Anti-Patterns (20 min)
[ ] Reference -- PromQL Cheat Sheet (10 min)

Checkpoint

After this section you should be able to: design metrics using RED and USE methodologies, set up Prometheus with service discovery, build Grafana dashboards with meaningful alerts, and avoid common monitoring anti-patterns.
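
The RED methodology mentioned here tracks Rate, Errors, and Duration for each service. In practice these come from Prometheus queries; the plain-Python sketch below only illustrates what the three numbers mean. The `Request` type, the field names, and the choice of a nearest-rank p95 are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def red_summary(requests, window_seconds):
    """Summarize a window of requests as Rate, Errors, Duration."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * total + 99) // 100
    p95 = durations[rank - 1] if total else 0.0
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


# 20 requests in a 10-second window, two of them server errors.
sample = [Request(500 if i < 2 else 200, float(i + 1)) for i in range(20)]
summary = red_summary(sample, window_seconds=10)
```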

Week 7-8: Logging & Observability

Estimated reading time: 3 hours

Logs are your forensic trail. Learn structured logging, aggregation, and how to tie everything together with correlation IDs.

[ ] Required -- Logging Overview (10 min)
[ ] Required -- Structured Logging (25 min)
[ ] Required -- Log Levels Strategy (20 min)
[ ] Required -- Correlation IDs (20 min)
[ ] Required -- Log Aggregation (25 min)
[ ] Required -- Sensitive Data Redaction (20 min)
[ ] Required -- Debugging in Production (25 min)

Checkpoint

After this section you should be able to: implement structured JSON logging, design a log level strategy, propagate correlation IDs across services, set up centralized log aggregation, redact PII from logs, and debug production issues systematically.
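
Three of these skills (structured JSON logging, correlation IDs, and PII redaction) fit in one small sketch using only the Python standard library. The logger name `orders`, the `correlation_id` context variable, and the email-only redaction pattern are hypothetical choices for this example; real redaction rules would cover far more than email addresses.

```python
import io
import json
import logging
import re
from contextvars import ContextVar

# Holds the ID of the request currently being handled (assumed convention).
correlation_id = ContextVar("correlation_id", default="-")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with redacted message text."""

    def format(self, record):
        message = EMAIL.sub("[REDACTED]", record.getMessage())
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": message,
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
correlation_id.set("req-123")  # normally taken from an incoming request header
log.info("payment failed for %s", "alice@example.com")
entry = json.loads(buffer.getvalue())
```

Every line emitted this way is machine-parseable, carries the same ID across services that forward it, and never contains the raw email address.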

Week 8-9: SRE Practices

Estimated reading time: 4 hours

SRE gives you the framework to balance reliability with velocity. Learn error budgets, SLOs, toil reduction, and capacity planning.

[ ] Required -- SRE Overview (15 min)
[ ] Required -- SLI, SLO, SLA (25 min)
[ ] Required -- Error Budgets (25 min)
[ ] Required -- Toil Reduction (25 min)
[ ] Required -- Capacity Planning (25 min)
[ ] Required -- On-Call Best Practices (20 min)
[ ] Optional -- On-Call Handbook (25 min)

Checklists

[ ] Required -- Checklists Overview (10 min)
[ ] Required -- Pre-Launch Checklist (20 min)
[ ] Required -- Security Review Checklist (20 min)
[ ] Required -- Observability Readiness (20 min)
[ ] Optional -- Performance Review Checklist (20 min)

Checkpoint

After this section you should be able to: define SLOs and error budgets for your services, identify and reduce toil, plan capacity for growth, run effective on-call rotations, and use pre-launch checklists to prevent outages.
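
An error budget is the arithmetic complement of an SLO: whatever unreliability the target permits over the measurement window. The sketch below works through that calculation in Python. The function names and the 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of unavailability an availability SLO allows in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
# After 21.6 bad minutes, about half of that budget is left.
remaining = budget_remaining(0.999, bad_minutes=21.6)
```

When the remaining fraction approaches zero, the usual response is to slow feature releases and spend the time on reliability work instead.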

Week 9-10: Alerting & Incident Response

Estimated reading time: 4 hours

When things break -- and they will -- you need clear processes for detection, response, and learning.

Alerting

[ ] Required -- Alerting Overview (10 min)
[ ] Required -- Alert Design (25 min)
[ ] Required -- Severity Levels (20 min)
[ ] Required -- Escalation Policies (20 min)
[ ] Optional -- Runbook Templates (20 min)

Incident Response

[ ] Required -- Incident Response Overview (10 min)
[ ] Required -- Incident Classification (20 min)
[ ] Required -- War Room Procedures (20 min)
[ ] Required -- Postmortem Framework (25 min)
[ ] Required -- Chaos Engineering (25 min)
[ ] Optional -- Communication Templates (15 min)

Runbooks

[ ] Required -- Runbooks Overview (10 min)
[ ] Required -- Database Failover (20 min)
[ ] Required -- DDoS Response (20 min)
[ ] Required -- Service Degradation (20 min)
[ ] Optional -- Certificate Rotation (20 min)

Checkpoint

After this section you should be able to: design alerts that reduce on-call fatigue, classify incidents by severity, run a war room, write blameless postmortems, execute runbooks for common failures, and introduce chaos engineering.

Week 10-11: Debugging in Production

Estimated reading time: 3.5 hours

Learn systematic approaches to diagnosing production problems from symptoms rather than guesses.

[ ] Required -- Debugging Playbooks Overview (10 min)
[ ] Required -- API Slow Response (25 min)
[ ] Required -- Intermittent 502 (25 min)
[ ] Required -- Database CPU (25 min)
[ ] Required -- Memory Leak (25 min)
[ ] Required -- High Error Rate (25 min)
[ ] Required -- Pods Restarting (20 min)

Checkpoint

After this section you should be able to: systematically diagnose API latency spikes, intermittent 502 errors, database CPU exhaustion, memory leaks, and pod restart loops using structured playbooks.

Week 11: Disaster Recovery

Estimated reading time: 3 hours

Plan for the worst. Understand disaster recovery strategies, failover, and business continuity.

[ ] Required -- Disaster Recovery Overview (25 min)
[ ] Required -- Multi-Region Overview (15 min)
[ ] Required -- Data Replication (25 min)
[ ] Required -- Failover Strategies (25 min)
[ ] Optional -- Database Migrations (15 min)
[ ] Optional -- Cloud Migration (25 min)
[ ] Optional -- Monolith to Microservices (25 min)

Checkpoint

After this section you should be able to: design disaster recovery plans with RPO/RTO targets, implement cross-region data replication, plan failover strategies, and manage large-scale migrations.
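
RPO (recovery point objective) bounds how much data you can afford to lose, and RTO (recovery time objective) bounds how long you can afford to be down. A plan can be checked against both with very little code. In this sketch the `RecoveryPlan` type and its field names are hypothetical, and worst-case data loss is approximated as the interval between backups.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_min: float  # time between backups, or replication lag
    restore_time_min: float     # measured time to restore and fail over


def meets_targets(plan, rpo_min, rto_min):
    """Compare a plan's worst-case loss and downtime with RPO/RTO targets."""
    return {
        # Worst case, everything written since the last backup is lost.
        "rpo_ok": plan.backup_interval_min <= rpo_min,
        "rto_ok": plan.restore_time_min <= rto_min,
    }


# Hourly backups cannot satisfy a 15-minute RPO, even if restores are fast.
result = meets_targets(RecoveryPlan(60, 20), rpo_min=15, rto_min=30)
```

The restore time only counts if it has been measured in a real drill; an untested estimate is not a reliable RTO input.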

Week 12: Multi-Region & Advanced Infrastructure

Estimated reading time: 4 hours

Scale beyond a single region for high availability and global performance.

[ ] Required -- Architecture Patterns (30 min)
[ ] Required -- Traffic Routing (25 min)
[ ] Optional -- Cost Analysis (20 min)

Cloud Provider Deep Dives

[ ] Optional -- AWS VPC Networking (25 min)
[ ] Optional -- AWS IAM Deep Dive (25 min)
[ ] Optional -- AWS RDS & Aurora (25 min)
[ ] Optional -- AWS Lambda (20 min)
[ ] Optional -- AWS Well-Architected (25 min)
[ ] Optional -- AWS Cost Optimization (20 min)
[ ] Optional -- GCP Cloud Run (20 min)
[ ] Optional -- GCP IAM (20 min)
[ ] Optional -- GCP Pub/Sub (20 min)

Checkpoint

After this section you should be able to: design active-active and active-passive multi-region architectures, configure DNS-based traffic routing with health checks, and optimize cloud costs.
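
For the active-passive case, the decision a failover routing policy makes can be sketched as a priority list filtered by health checks. The function name and region names below are hypothetical, and falling back to the primary when every region is unhealthy is a design choice of this sketch rather than the behavior of any particular DNS provider.

```python
def pick_region(regions_by_priority, healthy):
    """Return the highest-priority region whose health check passes."""
    for region in regions_by_priority:
        if healthy.get(region, False):
            return region
    # Nothing is healthy: fail open to the primary rather than return nothing.
    return regions_by_priority[0]


regions = ["us-east-1", "eu-west-1"]  # primary first, then standby
normal = pick_region(regions, {"us-east-1": True, "eu-west-1": True})
failed_over = pick_region(regions, {"us-east-1": False, "eu-west-1": True})
```

An active-active setup would instead spread traffic across all healthy regions, typically by latency or weight.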

What You Will Be Able to Do After This Path

Build and optimize Docker containers for production workloads
Deploy and manage Kubernetes clusters with RBAC, autoscaling, and GitOps
Write modular Terraform for multi-environment infrastructure
Design CI/CD pipelines with security scanning and progressive deployment
Define and enforce SLOs with error budgets
Build observability stacks (metrics, logs, traces) from scratch
Respond to incidents with runbooks, postmortems, and chaos engineering
Debug production issues systematically using debugging playbooks
Plan disaster recovery and multi-region architectures

Cross-References to Related Paths

Backend Engineer Path -- Deep dive into databases, caching, and application architecture
Platform Engineer Path -- Build internal developer platforms on top of this foundation
Security Engineer Path -- Secure your infrastructure and pipelines
System Design Interview Path -- Apply infrastructure knowledge to interview problems
Cybersecurity Engineer Path -- Offensive and defensive security operations

Total Progress

This path contains approximately 90 pages. At a pace of 5 pages per day, that is 18 days of reading, or about 3 weeks. Weeks 1-6 form the core -- prioritize those if you are short on time.
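
The pacing estimate can be recomputed for a different reading speed. This is a small sketch using the page count stated above; the function name is illustrative.

```python
import math


def reading_days(pages, pages_per_day):
    """Days needed to finish, rounding any partial day up."""
    return math.ceil(pages / pages_per_day)


days = reading_days(90, 5)
weeks = math.ceil(days / 7)
```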
