DevOps Engineer Learning Path
A structured 12-week journey through the Knowledge Vault for DevOps engineers, SREs, and infrastructure engineers. The path starts with containers and Kubernetes, then covers infrastructure as code, CI/CD pipelines and release engineering, monitoring and logging, SRE practices, alerting, incident response and runbooks, debugging in production, disaster recovery, and multi-region infrastructure.
Who This Is For
- Developers transitioning into DevOps or SRE roles
- Junior DevOps engineers building towards mid-level
- Backend engineers who want to own their production infrastructure
- Anyone preparing for SRE interviews
Prerequisites
- Basic Linux command line (navigation, processes, permissions)
- Understanding of networking (TCP/IP, DNS, HTTP)
- Some experience with at least one cloud provider (AWS, GCP, or Azure)
- Familiarity with Git and CI/CD concepts
Total estimated time: ~55 hours across 12 weeks
Learning Progression
Week 1-2: Docker & Containers
Estimated reading time: 4 hours
Containers are the foundation of modern infrastructure. Understand how they work, how to build efficient images, and how to secure them.
- [ ] Required -- Docker Overview (15 min)
- [ ] Required -- Docker Internals (30 min)
- [ ] Required -- Production Dockerfiles (25 min)
- [ ] Required -- Multi-Stage Builds (25 min)
- [ ] Required -- Image Optimization (25 min)
- [ ] Required -- Compose Patterns (25 min)
- [ ] Required -- Security Hardening (25 min)
- [ ] Reference -- Docker Cheat Sheet (10 min)
- [ ] Reference -- Docker Compose Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: write production-grade multi-stage Dockerfiles, optimize image size and layer caching, use Docker Compose for local development, and apply security best practices (non-root users, read-only filesystems, minimal base images).
Week 2-3: Kubernetes Fundamentals
Estimated reading time: 4.5 hours
Kubernetes is the standard for container orchestration. Start with the architecture and core resource types.
- [ ] Required -- Kubernetes Overview (15 min)
- [ ] Required -- Architecture & Internals (35 min)
- [ ] Required -- Pod Lifecycle (25 min)
- [ ] Required -- Deployments & StatefulSets (30 min)
- [ ] Required -- Services & Ingress (25 min)
- [ ] Required -- Secrets Management (25 min)
- [ ] Required -- Network Policies (25 min)
- [ ] Required -- Helm Charts (25 min)
- [ ] Reference -- Kubernetes Cheat Sheet (10 min)
- [ ] Reference -- Helm Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: explain the K8s control plane components, create Deployments and Services, configure health probes, manage secrets, and write basic Helm charts.
Week 3-4: Kubernetes Production
Estimated reading time: 4 hours
Running Kubernetes in production requires understanding scaling, security, debugging, and operational patterns.
- [ ] Required -- RBAC (25 min)
- [ ] Required -- HPA, VPA & KEDA (30 min)
- [ ] Required -- Production Checklist (25 min)
- [ ] Required -- Troubleshooting (30 min)
- [ ] Required -- CNI Networking (25 min)
- [ ] Optional -- Operators (25 min)
- [ ] Optional -- CRDs & Operators (25 min)
- [ ] Optional -- Admission Webhooks (25 min)
- [ ] Optional -- GitOps (25 min)
- [ ] Optional -- ECS vs EKS (25 min)
- [ ] Optional -- GKE (25 min)
- [ ] Reference -- kubectl Advanced Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: configure RBAC policies, set up horizontal pod autoscaling, and debug CrashLoopBackOff and OOMKilled pods. If you also read the optional pages, you should be able to implement GitOps with ArgoCD/Flux and understand CRDs and admission webhooks.
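To make the horizontal pod autoscaling item in this checkpoint concrete, here is a minimal sketch of the replica formula described in the Kubernetes HPA documentation: desired = ceil(current * currentMetric / targetMetric), with no change inside a 10% tolerance band. The function name and the min/max defaults are illustrative assumptions. The real controller also accounts for unready pods, missing metrics, and stabilization windows, none of which are modelled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the core HPA scaling rule (hypothetical helper)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds, as minReplicas/maxReplicas would.
    return max(min_replicas, min(max_replicas, desired))
```

For example, four pods averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for six replicas.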
Week 4-5: Terraform & Infrastructure as Code
Estimated reading time: 5 hours
Manage infrastructure declaratively. Terraform is one of the most widely adopted IaC tools and works across cloud providers.
- [ ] Required -- Terraform Overview (15 min)
- [ ] Required -- Terraform Fundamentals (30 min)
- [ ] Required -- State Management (30 min)
- [ ] Required -- Modules (25 min)
- [ ] Required -- Workspaces (20 min)
- [ ] Required -- Security Hardening (25 min)
- [ ] Optional -- AWS Startup Stack (30 min)
- [ ] Optional -- GCP Startup Stack (25 min)
- [ ] Optional -- Cost Optimization (20 min)
- [ ] Optional -- Multi-Region Terraform (25 min)
- [ ] Reference -- Terraform Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: write modular Terraform configurations, manage remote state safely, use workspaces for environment separation, and apply security best practices to IaC.
Week 5-6: CI/CD Pipelines & Release Engineering
Estimated reading time: 5 hours
Automate building, testing, and deploying your applications with robust CI/CD pipelines and mature release practices.
CI/CD
- [ ] Required -- CI/CD Overview (15 min)
- [ ] Required -- Pipeline Patterns (25 min)
- [ ] Required -- GitHub Actions Deep Dive (30 min)
- [ ] Required -- Environment Promotion (25 min)
- [ ] Required -- Artifact Management (20 min)
- [ ] Required -- Security Scanning (25 min)
- [ ] Optional -- GitLab CI (25 min)
Deployment & Release
- [ ] Required -- Deployment Strategies Overview (15 min)
- [ ] Required -- Blue-Green Deployments (20 min)
- [ ] Required -- Canary Deployments (20 min)
- [ ] Required -- Rolling Updates (20 min)
- [ ] Required -- Release Engineering (25 min)
- [ ] Required -- Feature Flags (25 min)
- [ ] Optional -- Feature Flags for Deployment (20 min)
- [ ] Optional -- Rollback Procedures (20 min)
- [ ] Optional -- Database Migrations (25 min)
Checkpoint
After this section you should be able to: design multi-stage CI/CD pipelines, implement blue-green and canary deployments, integrate security scanning into the pipeline, manage release trains with feature flags, and handle rollbacks safely.
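One common way to combine feature flags with progressive rollout is deterministic percentage bucketing. The sketch below is one possible approach, not the method of any particular flag service: it hashes a flag name and user ID into a stable bucket from 0 to 99 and enables the flag when the bucket falls below the rollout percentage. The function name and the flag and user identifiers are hypothetical.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Return True if this user falls inside the rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash flag and user together so each flag buckets users independently.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket, 0..99
    return bucket < percent
```

Because the bucket depends only on the flag and the user, raising the percentage from 10 to 50 keeps every user who was already enabled, which is what makes a gradual rollout (and a rollback to a lower percentage) predictable.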
Week 6-7: Monitoring & Metrics
Estimated reading time: 3.5 hours
You cannot manage what you cannot measure. Build comprehensive observability into your infrastructure.
- [ ] Required -- Monitoring Overview (15 min)
- [ ] Required -- Metrics Design (25 min)
- [ ] Required -- Prometheus Deep Dive (30 min)
- [ ] Required -- Grafana Dashboards (25 min)
- [ ] Required -- Custom Metrics (20 min)
- [ ] Required -- Observability Tools (25 min)
- [ ] Optional -- Monitoring Anti-Patterns (20 min)
- [ ] Reference -- PromQL Cheat Sheet (10 min)
Checkpoint
After this section you should be able to: design metrics using RED and USE methodologies, set up Prometheus with service discovery, build Grafana dashboards with meaningful alerts, and avoid common monitoring anti-patterns.
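The RED methodology tracks Rate, Errors, and Duration per service. As a rough illustration, the sketch below derives all three from a list of request records for one time window. The `Request` record and `red_summary` function are hypothetical names, treating only 5xx responses as errors is an assumption, and in practice these values would come from Prometheus counters and histograms rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration (p95) for one window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank p95, computed with integers to avoid float rounding.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p95_s": durations[rank - 1],
    }
```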
Week 7-8: Logging & Observability
Estimated reading time: 3 hours
Logs are your forensic trail. Learn structured logging, aggregation, and how to tie everything together with correlation IDs.
- [ ] Required -- Logging Overview (10 min)
- [ ] Required -- Structured Logging (25 min)
- [ ] Required -- Log Levels Strategy (20 min)
- [ ] Required -- Correlation IDs (20 min)
- [ ] Required -- Log Aggregation (25 min)
- [ ] Required -- Sensitive Data Redaction (20 min)
- [ ] Required -- Debugging in Production (25 min)
Checkpoint
After this section you should be able to: implement structured JSON logging, design a log level strategy, propagate correlation IDs across services, set up centralized log aggregation, redact PII from logs, and debug production issues systematically.
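Three of the skills in this checkpoint (structured JSON logging, correlation IDs, and redaction) can be sketched together using only the Python standard library. This is a minimal sketch under stated assumptions: the logger name `orders`, the `req-123` ID, and the email-only redaction pattern are illustrative, and real redaction usually needs to cover more than email addresses.

```python
import contextvars
import io
import json
import logging
import re

# Holds the ID of the request being handled in the current context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

# Deliberately simple pattern; it only catches email-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, with redaction applied."""

    def format(self, record):
        message = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": message,
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
correlation_id.set("req-123")  # normally set from an incoming request header
log.info("payment failed for %s", "alice@example.com")
entry = json.loads(buffer.getvalue())
```

Using a `ContextVar` rather than a global keeps the ID separate per thread or async task, so concurrent requests do not overwrite each other's correlation ID.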
Week 8-9: SRE Practices
Estimated reading time: 4 hours
SRE gives you the framework to balance reliability with velocity. Learn error budgets, SLOs, toil reduction, and capacity planning.
- [ ] Required -- SRE Overview (15 min)
- [ ] Required -- SLI, SLO, SLA (25 min)
- [ ] Required -- Error Budgets (25 min)
- [ ] Required -- Toil Reduction (25 min)
- [ ] Required -- Capacity Planning (25 min)
- [ ] Required -- On-Call Best Practices (20 min)
- [ ] Optional -- On-Call Handbook (25 min)
Checklists
- [ ] Required -- Checklists Overview (10 min)
- [ ] Required -- Pre-Launch Checklist (20 min)
- [ ] Required -- Security Review Checklist (20 min)
- [ ] Required -- Observability Readiness (20 min)
- [ ] Optional -- Performance Review Checklist (20 min)
Checkpoint
After this section you should be able to: define SLOs and error budgets for your services, identify and reduce toil, plan capacity for growth, run effective on-call rotations, and use pre-launch checklists to prevent outages.
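The arithmetic behind an error budget is short enough to work through directly. The sketch below assumes a 30-day window and a request-based SLO; both function names are hypothetical. The budget is the fraction of time or requests that the SLO allows to fail, which is 1 minus the SLO target.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO allows within the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    # A negative result means the budget is overspent.
    return 1 - failed_requests / allowed_failures
```

A 99.9% SLO over 30 days allows roughly 43 minutes of downtime. With the same SLO, 1,000,000 requests, and 250 failures, about three quarters of the budget remains.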
Week 9-10: Alerting & Incident Response
Estimated reading time: 4 hours
When things break -- and they will -- you need clear processes for detection, response, and learning.
Alerting
- [ ] Required -- Alerting Overview (10 min)
- [ ] Required -- Alert Design (25 min)
- [ ] Required -- Severity Levels (20 min)
- [ ] Required -- Escalation Policies (20 min)
- [ ] Optional -- Runbook Templates (20 min)
Incident Response
- [ ] Required -- Incident Response Overview (10 min)
- [ ] Required -- Incident Classification (20 min)
- [ ] Required -- War Room Procedures (20 min)
- [ ] Required -- Postmortem Framework (25 min)
- [ ] Required -- Chaos Engineering (25 min)
- [ ] Optional -- Communication Templates (15 min)
Runbooks
- [ ] Required -- Runbooks Overview (10 min)
- [ ] Required -- Database Failover (20 min)
- [ ] Required -- DDoS Response (20 min)
- [ ] Required -- Service Degradation (20 min)
- [ ] Optional -- Certificate Rotation (20 min)
Checkpoint
After this section you should be able to: design alerts that reduce on-call fatigue, classify incidents by severity, run a war room, write blameless postmortems, execute runbooks for common failures, and introduce chaos engineering.
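One alert design that aims to reduce on-call fatigue is the multi-window burn-rate alert described in the Google SRE Workbook: page only when the error budget is burning fast over both a long and a short window. The sketch below is an assumption-laden illustration, not the path's prescribed method. The function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than allowed the error budget is being spent."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return error_ratio / (1 - slo)


def should_page(long_window_error_ratio, short_window_error_ratio, slo,
                threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)
```

Requiring both windows means a brief spike does not page anyone, and an incident that has already recovered stops paging soon after the short window clears.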
Week 10-11: Debugging in Production
Estimated reading time: 3.5 hours
Systematic approaches to diagnosing production problems based on symptoms, not guesses.
- [ ] Required -- Debugging Playbooks Overview (10 min)
- [ ] Required -- API Slow Response (25 min)
- [ ] Required -- Intermittent 502 (25 min)
- [ ] Required -- Database CPU (25 min)
- [ ] Required -- Memory Leak (25 min)
- [ ] Required -- High Error Rate (25 min)
- [ ] Required -- Pods Restarting (20 min)
Checkpoint
After this section you should be able to: systematically diagnose API latency spikes, intermittent 502 errors, database CPU exhaustion, memory leaks, and pod restart loops using structured playbooks.
Week 11: Disaster Recovery
Estimated reading time: 3 hours
Plan for the worst. Understand disaster recovery strategies, failover, and business continuity.
- [ ] Required -- Disaster Recovery Overview (25 min)
- [ ] Required -- Multi-Region Overview (15 min)
- [ ] Required -- Data Replication (25 min)
- [ ] Required -- Failover Strategies (25 min)
- [ ] Optional -- Database Migrations (15 min)
- [ ] Optional -- Cloud Migration (25 min)
- [ ] Optional -- Monolith to Microservices (25 min)
Checkpoint
After this section you should be able to: design disaster recovery plans with RPO/RTO targets, implement cross-region data replication, plan failover strategies, and manage large-scale migrations.
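RPO bounds how much data you can afford to lose, and RTO bounds how long you can afford to be down. As a small worked example, the sketch below scores a recovery drill against both targets. The function name, the timestamps, and the 15-minute and 1-hour targets are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_good_backup, failure_at, restored_at, rpo, rto):
    """Compare one recovery drill against RPO and RTO targets."""
    data_loss = failure_at - last_good_backup  # writes made since the last backup
    downtime = restored_at - failure_at        # time until service was restored
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


result = evaluate_drill(
    last_good_backup=datetime(2024, 1, 1, 12, 0),
    failure_at=datetime(2024, 1, 1, 12, 10),
    restored_at=datetime(2024, 1, 1, 13, 30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=1),
)
```

In this drill the 10 minutes of lost data fit within the RPO, but the 80 minutes of downtime miss the RTO, which suggests the restore procedure rather than the backup frequency needs work.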
Week 12: Multi-Region & Advanced Infrastructure
Estimated reading time: 4 hours
Scale beyond a single region for high availability and global performance.
- [ ] Required -- Architecture Patterns (30 min)
- [ ] Required -- Traffic Routing (25 min)
- [ ] Optional -- Cost Analysis (20 min)
Cloud Provider Deep Dives
- [ ] Optional -- AWS VPC Networking (25 min)
- [ ] Optional -- AWS IAM Deep Dive (25 min)
- [ ] Optional -- AWS RDS & Aurora (25 min)
- [ ] Optional -- AWS Lambda (20 min)
- [ ] Optional -- AWS Well-Architected (25 min)
- [ ] Optional -- AWS Cost Optimization (20 min)
- [ ] Optional -- GCP Cloud Run (20 min)
- [ ] Optional -- GCP IAM (20 min)
- [ ] Optional -- GCP Pub/Sub (20 min)
Checkpoint
After this section you should be able to: design active-active and active-passive multi-region architectures, configure DNS-based traffic routing with health checks, and optimize cloud costs.
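The idea behind weighted traffic routing with health checks can be sketched in a few lines: unhealthy regions are dropped and the remaining weights are renormalized. This is a simplified model, not the behaviour of any specific DNS provider. The function name and the region names are hypothetical, and an active-passive setup is approximated by weighting the primary at 100 and the standby at a small non-zero value.

```python
def effective_weights(weights, healthy):
    """Share of traffic each region receives once unhealthy ones are removed.

    weights: mapping of region name to configured weight
    healthy: mapping of region name to the latest health check result
    """
    live = {region: weight for region, weight in weights.items()
            if weight > 0 and healthy.get(region, False)}
    total = sum(live.values())
    if total == 0:
        return {}  # nothing healthy; the caller must decide how to fail
    return {region: weight / total for region, weight in live.items()}


active_active = {"us-east": 70, "eu-west": 30}
all_up = effective_weights(active_active, {"us-east": True, "eu-west": True})
failover = effective_weights(active_active, {"us-east": False, "eu-west": True})
```

When `us-east` fails its health check, `eu-west` receives all traffic, so capacity planning needs to assume each region can absorb the others' share.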
What You Will Be Able to Do After This Path
- Build and optimize Docker containers for production workloads
- Deploy and manage Kubernetes clusters with RBAC, autoscaling, and GitOps
- Write modular Terraform for multi-environment infrastructure
- Design CI/CD pipelines with security scanning and progressive deployment
- Define and enforce SLOs with error budgets
- Build observability stacks (metrics, logs, traces) from scratch
- Respond to incidents with runbooks, postmortems, and chaos engineering
- Debug production issues systematically using debugging playbooks
- Plan disaster recovery and multi-region architectures
Cross-References to Related Paths
- Backend Engineer Path -- Deep dive into databases, caching, and application architecture
- Platform Engineer Path -- Build internal developer platforms on top of this foundation
- Security Engineer Path -- Secure your infrastructure and pipelines
- System Design Interview Path -- Apply infrastructure knowledge to interview problems
- Cybersecurity Engineer Path -- Offensive and defensive security operations
Total Progress
This path contains approximately 90 required pages, plus optional and reference material. At a pace of 5 required pages per day, you can read them in about 3 weeks. Weeks 1-6 form the core -- prioritize those if you are short on time.