CI/CD Overview
Why CI/CD Exists
In the early 2000s, many software teams shipped code quarterly or less often. A "release" meant weeks of manual integration, days of testing, and prayer-filled deployment nights. A single failed deployment could mean months of rollback work. The industry needed a fundamentally different approach.
Continuous Integration (CI) emerged from Kent Beck's Extreme Programming practices in the late 1990s. The core insight was simple: if integration is painful, do it more often. Martin Fowler's widely cited article on the practice, first published in 2000 and revised in 2006, defines CI as developers integrating their work frequently (at least daily), with each integration verified by an automated build and tests.
Continuous Delivery (CD) extended this idea: if deploying is painful, automate it. Jez Humble and David Farley's 2010 book "Continuous Delivery" laid the foundation, arguing that software should always be in a deployable state.
Continuous Deployment goes further — every change that passes automated tests is deployed to production automatically, with no human gate.
The Problem Statement
Without CI/CD, teams face:
| Problem | Impact | CI/CD Solution |
|---|---|---|
| Integration hell | Days of merge conflicts | Continuous integration with automated testing |
| Manual testing bottleneck | Weeks of QA cycles | Automated test suites at every commit |
| Deployment fear | Infrequent, risky releases | Automated, repeatable deployment pipelines |
| Environment drift | "Works on my machine" | Infrastructure as Code, containerized builds |
| Slow feedback loops | Bugs found weeks after introduction | Immediate feedback on every commit |
| Configuration inconsistency | Production differs from staging | Environment promotion with immutable artifacts |
First Principles
The Deployment Pipeline as a Value Stream
A CI/CD pipeline is a value stream — the sequence of steps that transforms a code change into running production software. Every step must add value; every delay is waste.
Fundamental Properties
Every effective CI/CD system must satisfy these properties:
Determinism: Given the same inputs (code + dependencies + configuration), the pipeline must produce the same outputs. Non-deterministic builds are the root of most CI/CD suffering.
Idempotency: Running the pipeline twice on the same commit must produce the same result. Deployments must be safe to retry.
Immutability: Build artifacts are never modified after creation. The binary deployed to staging is byte-for-byte identical to what reaches production.
Observability: Every pipeline step must emit structured logs, metrics, and traces. When a deployment fails at 3 AM, you need answers fast.
Isolation: Pipeline steps should not leak state between runs. Today's build must not be affected by yesterday's failure.
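The idempotency and immutability properties can be illustrated with a minimal sketch. The `ReleaseSpec` type and `applyRelease` function are hypothetical names invented for this example, not part of any real deployment tool; the environment is modelled as a plain map from service name to deployed image digest.

```typescript
// Hypothetical types for illustration only.
interface ReleaseSpec {
  readonly service: string;
  readonly imageDigest: string; // content hash of the immutable artifact
}

type EnvironmentState = ReadonlyMap<string, string>; // service -> deployed digest

// Idempotent apply: deploying the same release twice leaves the state
// unchanged, so a retried deployment is safe.
function applyRelease(
  state: EnvironmentState,
  release: ReleaseSpec
): { state: EnvironmentState; changed: boolean } {
  if (state.get(release.service) === release.imageDigest) {
    return { state, changed: false };
  }
  // Never mutate the previous state; build a new map instead.
  const next = new Map(state);
  next.set(release.service, release.imageDigest);
  return { state: next, changed: true };
}

const release: ReleaseSpec = { service: "api", imageDigest: "sha256:abc123" };
const first = applyRelease(new Map<string, string>(), release);
const second = applyRelease(first.state, release);
console.log(first.changed, second.changed); // prints: true false
```

Identifying the artifact by digest rather than by a mutable tag is what lets the retry check compare "what is deployed" with "what was requested" exactly.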
The DORA Metrics Framework
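Google's DevOps Research and Assessment (DORA) team identified four key metrics that predict software delivery performance: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). The published thresholds shift between annual reports, so treat the bands below as indicative: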
| Performance Level | Deploy Frequency | Lead Time | Change Failure Rate | MTTR |
|---|---|---|---|---|
| Elite | On-demand (multiple/day) | < 1 hour | 0-15% | < 1 hour |
| High | Weekly to monthly | 1 day - 1 week | 16-30% | < 1 day |
| Medium | Monthly to 6 months | 1 week - 1 month | 16-30% | 1 day - 1 week |
| Low | > 6 months | > 6 months | 46-60% | > 6 months |
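As a rough sketch of how the four metrics could be derived from deployment records, the function below computes them over a reporting window. The `Deployment` record shape and its field names are assumptions made for this example; real data would come from your VCS and incident tooling.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface Deployment {
  committedAt: number; // epoch ms of the first commit in the change
  deployedAt: number;  // epoch ms when the change reached production
  failed: boolean;     // did this deployment cause a production failure?
  restoredAt?: number; // epoch ms when service was restored, if it failed
}

const HOUR = 60 * 60 * 1000;

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function doraMetrics(deploys: Deployment[], windowDays: number) {
  const leadHours = deploys.map((d) => (d.deployedAt - d.committedAt) / HOUR);
  const restoreHours: number[] = [];
  let failures = 0;
  for (const d of deploys) {
    if (!d.failed) continue;
    failures++;
    if (d.restoredAt !== undefined) {
      restoreHours.push((d.restoredAt - d.deployedAt) / HOUR);
    }
  }
  return {
    deploysPerDay: deploys.length / windowDays,
    medianLeadTimeHours: median(leadHours),
    changeFailureRate: deploys.length > 0 ? failures / deploys.length : 0,
    medianRestoreHours: median(restoreHours),
  };
}

// Four made-up deployments over a two-day window.
const sample: Deployment[] = [
  { committedAt: 0, deployedAt: 1 * HOUR, failed: false },
  { committedAt: 0, deployedAt: 3 * HOUR, failed: true, restoredAt: 4 * HOUR },
  { committedAt: 5 * HOUR, deployedAt: 7 * HOUR, failed: false },
  { committedAt: 20 * HOUR, deployedAt: 24 * HOUR, failed: false },
];
const metrics = doraMetrics(sample, 2);
console.log(metrics);
```

Medians are used here instead of means so that one unusually slow change does not dominate the figure; that is a design choice of the sketch, not a DORA requirement.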
Core Mechanics
Pipeline Architecture
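A modern CI/CD pipeline consists of several interconnected systems: the version control system that emits events, a scheduler, a pool of runners, cache and artifact storage, and the environments being deployed to.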
Build System Internals
At its core, a CI runner is an event-driven job executor:
- Event ingestion: Webhooks from VCS trigger pipeline creation
- Job scheduling: The scheduler matches jobs to available runners based on labels, capacity, and resource requirements
- Workspace preparation: The runner clones the repository, restores caches, and sets up the execution environment
- Step execution: Each step runs in sequence (or parallel where specified) within the job's container/VM
- Artifact collection: Outputs are extracted and stored
- Status reporting: Results are sent back to the VCS via status checks
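The job-scheduling step in the list above can be sketched as simple label matching. This is a simplified illustration: the `Runner` and `QueuedJob` shapes are hypothetical, and real schedulers also weigh capacity and resource requirements, which are left out here.

```typescript
// Hypothetical shapes for illustration only.
interface Runner {
  id: string;
  labels: string[];
  busy: boolean;
}

interface QueuedJob {
  jobName: string;
  requiredLabels: string[];
}

// A runner is eligible when it is idle and carries every label the job asks for.
function pickRunner(job: QueuedJob, runners: Runner[]): Runner | undefined {
  return runners.find(
    (r) => !r.busy && job.requiredLabels.every((l) => r.labels.includes(l))
  );
}

// Assign queued jobs in FIFO order; jobs with no eligible runner stay queued.
function schedule(queue: QueuedJob[], runners: Runner[]) {
  const assignments = new Map<string, string>(); // job name -> runner id
  const stillQueued: QueuedJob[] = [];
  for (const job of queue) {
    const runner = pickRunner(job, runners);
    if (runner) {
      runner.busy = true; // the runner is taken for the rest of this pass
      assignments.set(job.jobName, runner.id);
    } else {
      stillQueued.push(job);
    }
  }
  return { assignments, stillQueued };
}

const runnerPool: Runner[] = [
  { id: "r1", labels: ["linux", "small"], busy: false },
  { id: "r2", labels: ["linux", "gpu"], busy: false },
];
const outcome = schedule(
  [
    { jobName: "lint", requiredLabels: ["linux"] },
    { jobName: "train", requiredLabels: ["gpu"] },
    { jobName: "build", requiredLabels: ["linux"] },
  ],
  runnerPool
);
console.log(outcome.assignments, outcome.stillQueued.map((j) => j.jobName));
```

In this run `lint` takes `r1`, `train` takes `r2`, and `build` waits because no idle runner remains.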
// Conceptual model of a CI pipeline executor.
// TriggerEvent, ContainerConfig, ServiceConfig, ArtifactConfig, RetryConfig,
// StageResult, JobResult, and PipelineResult are assumed to be defined elsewhere.
interface PipelineConfig {
  trigger: TriggerEvent;
  stages: Stage[];
  globalEnv: Record<string, string>;
  timeout: number;
}

interface Stage {
  name: string;
  jobs: Job[];
  dependsOn?: string[];
}

interface Job {
  name: string;
  runner: RunnerSelector;
  container?: ContainerConfig;
  steps: Step[];
  services?: ServiceConfig[];
  cache?: CacheConfig;
  artifacts?: ArtifactConfig;
  timeout: number;
  retries: RetryConfig;
}

interface Step {
  name: string;
  run?: string;
  uses?: string; // reusable action reference
  with?: Record<string, string>;
  env?: Record<string, string>;
  if?: string; // conditional expression
  continueOnError?: boolean;
}

interface RunnerSelector {
  labels: string[];
  group?: string;
}

interface CacheConfig {
  key: string;
  paths: string[];
  restoreKeys?: string[];
}

class PipelineExecutor {
  private stageResults: Map<string, StageResult> = new Map();

  async execute(config: PipelineConfig): Promise<PipelineResult> {
    const startTime = Date.now();
    // Start from a fresh map so results never leak between runs
    this.stageResults = new Map();

    for (const stage of this.topologicalSort(config.stages)) {
      // Skip the stage unless every dependency succeeded
      if (stage.dependsOn?.some(dep =>
        this.stageResults.get(dep)?.status !== 'success'
      )) {
        this.stageResults.set(stage.name, { status: 'skipped' });
        continue;
      }

      // Execute all jobs in the stage concurrently
      const jobResults = await Promise.allSettled(
        stage.jobs.map(job => this.executeJob(job, config.globalEnv))
      );
      const stageStatus = jobResults.every(
        r => r.status === 'fulfilled' && r.value.status === 'success'
      ) ? 'success' : 'failure';
      this.stageResults.set(stage.name, { status: stageStatus });

      // Fail fast: the first failed stage ends the pipeline
      if (stageStatus === 'failure') {
        return {
          status: 'failure',
          duration: Date.now() - startTime,
          stages: this.stageResults,
        };
      }
    }

    return {
      status: 'success',
      duration: Date.now() - startTime,
      stages: this.stageResults,
    };
  }

  private async executeJob(
    job: Job,
    globalEnv: Record<string, string>
  ): Promise<JobResult> {
    const env = { ...globalEnv, ...this.resolveJobEnv(job) };
    const workspace = await this.prepareWorkspace(job);

    // Restore cache
    if (job.cache) {
      await this.restoreCache(job.cache, workspace);
    }

    // Start services (databases, redis, etc.)
    const services = job.services
      ? await this.startServices(job.services)
      : [];

    try {
      for (const step of job.steps) {
        // Evaluate conditional
        if (step.if && !this.evaluateExpression(step.if, env)) {
          continue;
        }
        const stepResult = await this.executeStep(step, workspace, env);
        if (stepResult.exitCode !== 0 && !step.continueOnError) {
          return { status: 'failure', failedStep: step.name };
        }
      }

      // Save cache
      if (job.cache) {
        await this.saveCache(job.cache, workspace);
      }

      // Collect artifacts
      if (job.artifacts) {
        await this.collectArtifacts(job.artifacts, workspace);
      }

      return { status: 'success' };
    } finally {
      // Always tear down services and the workspace, even on failure
      await Promise.all(services.map(s => s.stop()));
      await workspace.cleanup();
    }
  }

  // Depth-first topological sort; throws if the stage graph contains a cycle.
  private topologicalSort(stages: Stage[]): Stage[] {
    const visited = new Set<string>();
    const inProgress = new Set<string>(); // stages on the current DFS path
    const sorted: Stage[] = [];
    const stageMap = new Map<string, Stage>(
      stages.map((s): [string, Stage] => [s.name, s])
    );

    const visit = (stage: Stage) => {
      if (visited.has(stage.name)) return;
      if (inProgress.has(stage.name)) {
        throw new Error(`Dependency cycle detected at stage "${stage.name}"`);
      }
      inProgress.add(stage.name);
      for (const dep of stage.dependsOn ?? []) {
        const depStage = stageMap.get(dep);
        if (depStage) visit(depStage);
      }
      inProgress.delete(stage.name);
      visited.add(stage.name);
      sorted.push(stage);
    };

    stages.forEach(visit);
    return sorted;
  }

  // ... additional helper methods
}

Caching Mechanics
Caching is often the single most impactful optimization in CI/CD. Without it, every build starts from scratch: downloading dependencies, compiling code, and building containers.
Cache key strategies:
# Exact match: fastest, least flexible
cache-key: deps-{{ hashFiles('package-lock.json') }}

# Prefix match: falls back to a partial restore
restore-keys:
  - deps-{{ hashFiles('package-lock.json') }}
  - deps-{{ branch }}
  - deps-main
  - deps-

Cache invalidation formula:
Where
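The exact-match and prefix-match strategies listed under "Cache key strategies" can be sketched as a lookup function. This is an illustrative model only: `CacheEntry` and `resolveCache` are hypothetical names, and the tie-break rule (prefer the most recently saved entry among prefix matches) is an assumption of this sketch.

```typescript
// Hypothetical cache index entry for illustration.
interface CacheEntry {
  key: string;
  savedAt: number; // epoch ms when the entry was written
}

// Try the exact key first, then each restore prefix in order.
// Among several prefix matches, prefer the most recently saved entry.
function resolveCache(
  entries: CacheEntry[],
  key: string,
  restoreKeys: string[]
): { entry: CacheEntry; exact: boolean } | undefined {
  const exactHit = entries.find((e) => e.key === key);
  if (exactHit) return { entry: exactHit, exact: true };

  for (const prefix of restoreKeys) {
    const candidates = entries
      .filter((e) => e.key.startsWith(prefix))
      .sort((a, b) => b.savedAt - a.savedAt);
    if (candidates.length > 0) return { entry: candidates[0], exact: false };
  }
  return undefined; // cold build: nothing to restore
}

const cacheIndex: CacheEntry[] = [
  { key: "deps-aaa", savedAt: 1 },
  { key: "deps-main-bbb", savedAt: 2 },
  { key: "deps-main-ccc", savedAt: 3 },
];
const partial = resolveCache(cacheIndex, "deps-zzz", [
  "deps-feature-",
  "deps-main-",
  "deps-",
]);
const exactMatch = resolveCache(cacheIndex, "deps-aaa", []);
const miss = resolveCache(cacheIndex, "other-key", ["other-"]);
console.log(partial?.entry.key, exactMatch?.exact, miss);
```

A partial restore is only a starting point: the build still has to install whatever changed, and it should then save a fresh cache under the new exact key.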
Implementation: Production Pipeline Configuration
GitHub Actions Multi-Stage Pipeline
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

env:
  NODE_VERSION: '20'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run test -- --shard=${{ matrix.shard }}/4
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/

  integration-tests:
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run test:integration
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

  security-scan:
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck]
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        # @master is a moving target; pin a release tag or commit SHA in production
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

  build:
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, security-scan]
    permissions:
      contents: read
      packages: write
    outputs:
      # tags is a newline-separated list (one line per tag rule below)
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [build]
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        # Deploy by digest: it names exactly one immutable image, whereas
        # image-tag may contain several tags
        run: |
          echo "Deploying ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.build.outputs.image-digest }} to staging"
          # kubectl set image deployment/app app=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ needs.build.outputs.image-digest }}

  deploy-production:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          echo "Deploying to production with canary strategy"

Edge Cases & Failure Modes
Common Pipeline Failures
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Flaky tests | Non-deterministic test order, timing, shared state | Test isolation, retry policies, quarantine |
| Cache poisoning | Corrupted cache restored to new builds | Cache key versioning, periodic invalidation |
| Runner exhaustion | Too many concurrent jobs | Autoscaling runners, queue limits |
| Secret leakage | Secrets printed in logs | Secret masking, audit logging |
| Dependency confusion | Malicious package with internal name | Scoped registries, lockfile pinning |
| Webhook storms | VCS sends duplicate events | Idempotency keys, deduplication |
| Clock skew | Runners have drifted clocks | NTP enforcement, relative timestamps |
| Disk exhaustion | Large artifacts fill runner disk | Cleanup steps, ephemeral runners |
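The "idempotency keys, deduplication" mitigation for webhook storms can be sketched as a small in-memory filter. The `WebhookEvent` shape is hypothetical: `deliveryId` stands in for whatever unique delivery identifier your VCS attaches to each webhook, and a production system would keep the seen-set in shared storage, not process memory.

```typescript
// Hypothetical event shape for illustration only.
interface WebhookEvent {
  deliveryId: string;
  repo: string;
  commit: string;
}

class WebhookDeduplicator {
  private seen = new Map<string, number>(); // deliveryId -> first-seen time (ms)
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true only the first time a delivery is seen within the TTL window.
  shouldProcess(event: WebhookEvent, now: number): boolean {
    // Drop expired entries so the map does not grow without bound.
    this.seen.forEach((firstSeen, id) => {
      if (now - firstSeen > this.ttlMs) this.seen.delete(id);
    });
    if (this.seen.has(event.deliveryId)) return false;
    this.seen.set(event.deliveryId, now);
    return true;
  }
}

const dedup = new WebhookDeduplicator(60_000);
const push: WebhookEvent = { deliveryId: "d-1", repo: "acme/app", commit: "abc" };
const accepted = dedup.shouldProcess(push, 0);
const duplicate = dedup.shouldProcess(push, 1_000);
const afterExpiry = dedup.shouldProcess(push, 120_000);
console.log(accepted, duplicate, afterExpiry); // prints: true false true
```

The TTL bounds memory use at the price of re-accepting a delivery that is replayed after the window; choose it to exceed the longest retry interval you expect from the sender.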
Flaky Test Management
interface FlakyTestConfig {
  maxRetries: number;
  quarantineEnabled: boolean;
  quarantineThreshold: number; // flaky score (0 to 1) above which a test is quarantined
  alertOnNewFlaky: boolean;
}

class FlakyTestDetector {
  private testHistory: Map<string, boolean[]> = new Map();

  recordResult(testName: string, passed: boolean): void {
    const history = this.testHistory.get(testName) ?? [];
    history.push(passed);
    // Keep last 50 runs
    if (history.length > 50) history.shift();
    this.testHistory.set(testName, history);
  }

  getFlakyScore(testName: string): number {
    const history = this.testHistory.get(testName);
    // Too little data to judge: treat as not flaky
    if (!history || history.length < 5) return 0;

    // Count transitions (pass -> fail or fail -> pass)
    let transitions = 0;
    for (let i = 1; i < history.length; i++) {
      if (history[i] !== history[i - 1]) transitions++;
    }

    // Flaky score: ratio of transitions to adjacent run pairs
    // A consistently passing or failing test has score ~0
    // A test that flips on every run approaches 1.0
    return transitions / (history.length - 1);
  }

  shouldQuarantine(testName: string, config: FlakyTestConfig): boolean {
    return config.quarantineEnabled &&
      this.getFlakyScore(testName) > config.quarantineThreshold;
  }
}

Performance Characteristics
Pipeline Duration Benchmarks
| Pipeline Component | Cold (No Cache) | Warm (Cached) | Optimization |
|---|---|---|---|
| Git clone (monorepo, 10GB) | 120s | 5s (shallow) | Shallow clone, sparse checkout |
| npm install (2000 deps) | 90s | 8s | npm ci + cache |
| TypeScript compilation | 45s | 12s | Incremental compilation |
| Unit tests (5000 tests) | 180s | 180s (parallelized: 45s) | Sharding across 4 runners |
| Docker build (Node.js) | 120s | 15s | Layer caching, multi-stage |
| Security scan (Trivy) | 30s | 10s | DB cache |
| E2E tests (100 scenarios) | 600s | 600s (parallelized: 120s) | Test sharding, 5 runners |
| Total (sequential) | ~1185s | ~830s | |
| Total (optimized) | ~480s | ~200s | Parallelism + caching |
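Sharding only approaches the ideal speedup when shards finish at about the same time. One common heuristic, shown here as a sketch and not as what any particular test runner does, is greedy longest-first assignment using historical test durations. The `TestTiming` shape and the durations below are made up for illustration.

```typescript
// Hypothetical timing record, e.g. taken from a previous run's report.
interface TestTiming {
  testName: string;
  seconds: number;
}

// Greedy longest-processing-time-first: sort by duration (descending) and
// always place the next test on the currently lightest shard.
function shardByDuration(tests: TestTiming[], shardCount: number): TestTiming[][] {
  const shards: TestTiming[][] = Array.from(
    { length: shardCount },
    (): TestTiming[] => []
  );
  const totals = new Array<number>(shardCount).fill(0);
  const ordered = [...tests].sort((a, b) => b.seconds - a.seconds);
  for (const test of ordered) {
    const lightest = totals.indexOf(Math.min(...totals));
    shards[lightest].push(test);
    totals[lightest] += test.seconds;
  }
  return shards;
}

const timings: TestTiming[] = [60, 50, 40, 30, 20, 10].map((seconds, i) => ({
  testName: `suite-${i + 1}`,
  seconds,
}));
const shardPlan = shardByDuration(timings, 2);
const shardTotals = shardPlan.map((shard) =>
  shard.reduce((sum, t) => sum + t.seconds, 0)
);
console.log(shardTotals); // prints: [ 110, 100 ]
```

Splitting by test count alone would ignore duration, so one shard holding the slowest suites could set the wall-clock time for the whole job.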
Cost Model
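$$
C_{\text{total}} = \sum_{i=1}^{n} \left( R_i \times M_i \times C_{\text{runner}} \right) + C_{\text{storage}} + C_{\text{transfer}}
$$
Where:
- $n$ = number of pipeline types
- $R_i$ = daily runs of pipeline $i$
- $M_i$ = average runner-minutes for pipeline $i$
- $C_{\text{runner}}$ = cost per runner-minute
- $C_{\text{storage}}$ = artifact storage cost
- $C_{\text{transfer}}$ = data transfer cost
Because $R_i$ counts daily runs, the sum is a daily compute cost; scale it to the billing period before adding storage and transfer charges quoted for that period.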
Example cost calculation for a team of 20 developers:
| Resource | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| CI minutes (GitHub Actions) | 50,000 min | $0.008/min (Linux) | $400 |
| Larger runners (4 vCPU) | 10,000 min | $0.032/min | $320 |
| Artifact storage | 50 GB | $0.25/GB | $12.50 |
| Self-hosted runners (3x m5.xlarge) | 3 instances | $140/mo each | $420 |
| Total | | | ~$1,152 |
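The cost model (compute cost summed over pipeline types, plus storage and transfer) can be turned into a small calculator. This is a sketch under stated assumptions: the `PipelineType` shape, the 30-day month, and the example workload are all invented for illustration, and only the $0.008/min rate and the $12.50 storage figure are taken from the table above.

```typescript
// Hypothetical description of one pipeline type.
interface PipelineType {
  label: string;
  runsPerDay: number;    // daily runs of this pipeline
  minutesPerRun: number; // average runner-minutes per run
}

function monthlyCost(
  pipelines: PipelineType[],
  costPerMinute: number,
  storageCost: number,  // per month
  transferCost: number, // per month
  daysPerMonth = 30     // assumption: a 30-day billing month
): number {
  const dailyMinutes = pipelines.reduce(
    (sum, p) => sum + p.runsPerDay * p.minutesPerRun,
    0
  );
  return dailyMinutes * costPerMinute * daysPerMonth + storageCost + transferCost;
}

// Made-up workload: PR checks and main-branch builds.
const estimate = monthlyCost(
  [
    { label: "pr-checks", runsPerDay: 100, minutesPerRun: 10 },
    { label: "main-build", runsPerDay: 20, minutesPerRun: 25 },
  ],
  0.008, // $/min, the Linux rate from the table
  12.5,  // artifact storage, from the table
  0      // transfer cost assumed to be zero here
);
console.log(estimate.toFixed(2)); // prints: 372.50
```

Here 1,500 runner-minutes per day at $0.008 comes to $12 per day, or $360 over 30 days, plus $12.50 of storage.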
Mathematical Foundations
Queue Theory in CI/CD
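CI runners form a queuing system. Using the M/M/c queue model (Poisson arrivals, exponential service times, c servers), utilization is:
$$
\rho = \frac{\lambda}{c\mu}
$$
Where:
- $\lambda$ = average job arrival rate (jobs/minute)
- $\mu$ = average job service rate per runner (jobs/minute)
- $c$ = number of runners
- $\rho$ = server utilization (must be < 1 for stability)
The probability of waiting (Erlang C formula):
$$
P_W = \frac{\dfrac{(c\rho)^c}{c!} \cdot \dfrac{1}{1-\rho}}{\displaystyle\sum_{k=0}^{c-1} \frac{(c\rho)^k}{k!} + \dfrac{(c\rho)^c}{c!} \cdot \dfrac{1}{1-\rho}}
$$
Average wait time in queue:
$$
W_q = \frac{P_W}{c\mu - \lambda}
$$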
Practical Application
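If your team generates 30 CI jobs/hour (an arrival rate of 0.5 jobs/minute), the M/M/c model gives the expected queue wait for any runner count once the average job duration, and therefore the per-runner service rate, is known.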
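A short calculator makes the runner-count trade-off concrete. It implements the standard Erlang C probability of waiting and the M/M/c mean queue wait. The arrival rate of 0.5 jobs/minute follows from 30 jobs/hour; the service rate of 0.1 jobs/minute (a 10-minute average job) is an assumption chosen for illustration.

```typescript
function factorial(n: number): number {
  let result = 1;
  for (let i = 2; i <= n; i++) result *= i;
  return result;
}

// Erlang C: probability that an arriving job must wait for a free runner.
function erlangC(lambda: number, mu: number, c: number): number {
  const offered = lambda / mu; // offered load, equal to c * rho
  const rho = offered / c;     // utilization
  if (rho >= 1) return 1;      // unstable queue: every job waits
  let sum = 0;
  for (let k = 0; k < c; k++) sum += Math.pow(offered, k) / factorial(k);
  const tail = Math.pow(offered, c) / factorial(c) / (1 - rho);
  return tail / (sum + tail);
}

// Mean time a job spends queued, in the same time unit as lambda and mu.
function averageWait(lambda: number, mu: number, c: number): number {
  if (lambda >= c * mu) return Infinity; // the queue grows without bound
  return erlangC(lambda, mu, c) / (c * mu - lambda);
}

const lambda = 0.5; // jobs/minute, from 30 jobs/hour
const mu = 0.1;     // jobs/minute per runner (assumed 10-minute jobs)
for (const c of [5, 6, 7, 8]) {
  console.log(`${c} runners: wait ${averageWait(lambda, mu, c).toFixed(2)} min`);
}
```

Under these assumptions five runners cannot keep up at all, six give an expected wait of about 5.9 minutes, and a seventh cuts that to about 1.6 minutes. Near full utilization, one extra runner changes the wait far more than its share of capacity suggests.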
Little's Law Applied to Pipelines
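$$
L = \lambda W
$$
Where $L$ is the average number of jobs in the system (queued or running), $\lambda$ is the average arrival rate, and $W$ is the average time a job spends in the system.
As an illustrative case with assumed figures: if jobs arrive at $\lambda = 0.5$ per minute and each spends $W = 10$ minutes in the system, then on average $L = 0.5 \times 10 = 5$ jobs are in flight at any moment.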
Real-World War Stories
War Story — The npm Left-Pad Incident (2016)
When Azer Koçulu unpublished the left-pad package from npm, thousands of CI/CD pipelines worldwide broke almost immediately. Builds that had worked for months suddenly failed because they downloaded every dependency from the public registry on each run.
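Root cause: Every build depended on the public registry being able to serve each package, with no private registry mirror or cache to fall back on. Lockfile pinning alone would not have helped, because a pinned version still has to be downloaded from somewhere.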
Fix: Teams adopted npm ci (which uses lockfiles strictly), private registries like Verdaccio, and artifact caching strategies that didn't rely on public registries being available.
Lesson: Your pipeline is only as reliable as its weakest external dependency. Mirror everything critical.
War Story — The 6-Hour Build Queue
A fintech company with 200 developers on a monorepo experienced build queues exceeding 6 hours during peak development. Developers started batching changes into larger PRs to avoid the queue — which made reviews harder, increased merge conflicts, and ultimately slowed delivery.
Root cause: Fixed pool of 20 self-hosted runners, no autoscaling, all jobs running on the same runner type.
Fix: Migrated to autoscaling runner groups with spot instances. Implemented job-level runner selection (lint on small runners, builds on large runners). Added aggressive caching that cut build times from 25 minutes to 8 minutes. Queue times dropped to under 2 minutes.
Lesson: CI infrastructure must scale with your team. The cost of idle developers waiting for builds far exceeds the cost of additional compute.
Decision Framework
Choosing a CI/CD Platform
| Factor | GitHub Actions | GitLab CI | Jenkins | CircleCI | Buildkite |
|---|---|---|---|---|---|
| Hosted runners | Yes | Yes | No | Yes | No |
| Self-hosted runners | Yes | Yes | Yes | Yes | Yes |
| Container-native | Yes | Yes | Plugin | Yes | Yes |
| Monorepo support | Good | Good | Excellent | Good | Excellent |
| Config format | YAML | YAML | Groovy/YAML | YAML | YAML |
| Marketplace/plugins | 20,000+ | 500+ | 1,800+ | 300+ | 200+ |
| OIDC auth | Yes | Yes | Plugin | Yes | Yes |
| Max concurrency | 500 (enterprise) | Unlimited (self) | Unlimited | 80 | Unlimited |
| Free tier | 2,000 min/mo | 400 min/mo | Self-host | 6,000 min/mo | Free (self-host) |
| Best for | GitHub-native teams | GitLab-native teams | Complex enterprises | Startups | Large-scale |
When to Use Each Approach
Advanced Topics
Pipeline as Code Patterns
Modern CI/CD has evolved toward pipeline-as-code — treating pipeline definitions as first-class software artifacts with the same rigor as application code:
- Reusable workflows: Shared pipeline definitions across repositories
- Pipeline testing: Unit tests for pipeline logic
- Pipeline versioning: Semantic versioning for shared pipeline libraries
- Policy as Code: OPA/Rego policies governing what pipelines can do
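To give a feel for the policy-as-code idea without introducing Rego, here is a minimal sketch written in TypeScript instead of OPA. The `WorkflowStep` and `StepPolicy` shapes and both rules are hypothetical examples, not policies from any real rule set.

```typescript
// Hypothetical, simplified view of a pipeline step.
interface WorkflowStep {
  uses?: string;
  run?: string;
}

interface StepPolicy {
  id: string;
  // Returns a violation message, or null when the step complies.
  check: (step: WorkflowStep) => string | null;
}

const stepPolicies: StepPolicy[] = [
  {
    id: "pin-actions",
    check: (step) =>
      step.uses !== undefined && /@(master|main)$/.test(step.uses)
        ? `"${step.uses}" tracks a mutable branch; pin a tag or commit SHA`
        : null,
  },
  {
    id: "no-pipe-to-shell",
    check: (step) =>
      step.run !== undefined && /curl[^|]*\|\s*(ba)?sh\b/.test(step.run)
        ? "piping a download straight into a shell is not allowed"
        : null,
  },
];

function evaluatePolicies(steps: WorkflowStep[], rules: StepPolicy[]): string[] {
  const violations: string[] = [];
  for (const step of steps) {
    for (const rule of rules) {
      const message = rule.check(step);
      if (message !== null) violations.push(`${rule.id}: ${message}`);
    }
  }
  return violations;
}

const found = evaluatePolicies(
  [
    { uses: "actions/checkout@v4" },
    { uses: "aquasecurity/trivy-action@master" },
    { run: "curl https://example.com/install.sh | sh" },
  ],
  stepPolicies
);
console.log(found);
```

Because the rules are ordinary functions, they can be unit-tested and versioned like any other shared pipeline library.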
GitOps and Declarative CD
GitOps inverts the deployment model: instead of CI pushing to environments, a controller in each environment pulls the desired state from Git.
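The pull-based model boils down to a reconciliation loop: compare the state declared in Git with what is running, then act on the difference. The sketch below shows only the diffing step. `DesiredState` and `ReconcileAction` are hypothetical types, and fetching from Git or applying changes to a cluster is deliberately left out.

```typescript
type DesiredState = Record<string, string>; // service -> image reference

interface ReconcileAction {
  kind: "create" | "update" | "delete";
  service: string;
  image?: string;
}

// Compute the actions needed to make `actual` converge on `desired`.
// A controller would run this on a timer or whenever Git changes.
function reconcile(desired: DesiredState, actual: DesiredState): ReconcileAction[] {
  const actions: ReconcileAction[] = [];
  for (const service of Object.keys(desired)) {
    const image = desired[service];
    if (!(service in actual)) {
      actions.push({ kind: "create", service, image });
    } else if (actual[service] !== image) {
      actions.push({ kind: "update", service, image });
    }
  }
  for (const service of Object.keys(actual)) {
    if (!(service in desired)) actions.push({ kind: "delete", service });
  }
  return actions;
}

const declared: DesiredState = { api: "app/api:v2", web: "app/web:v1" };
const running: DesiredState = { api: "app/api:v1", worker: "app/worker:v1" };
const plan = reconcile(declared, running);
console.log(plan.map((a) => `${a.kind} ${a.service}`));
// prints: [ 'update api', 'create web', 'delete worker' ]
```

When the environment already matches Git the plan is empty, which is what makes running the loop repeatedly safe.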
Emerging Trends
- AI-powered CI/CD: Predictive test selection, automatic flaky test detection, intelligent retry policies
- Supply chain security: SLSA framework, Sigstore signing, SBOM generation
- Ephemeral environments: Per-PR preview environments with full stack
- Platform engineering: Internal developer platforms abstracting CI/CD complexity
- WebAssembly runners: Wasm-based build isolation for faster, more secure builds
Section Overview
This section covers CI/CD in depth across the following pages:
| Page | What You'll Learn |
|---|---|
| GitHub Actions Deep Dive | Workflows, matrix strategies, caching, OIDC, reusable workflows |
| GitLab CI | Stages, jobs, pipelines, runners, DAG pipelines |
| Pipeline Patterns | Trunk-based development, monorepo strategies, parallel testing |
| Artifact Management | Container registries, npm packages, versioning strategies |
| Environment Promotion | Dev to staging to production, GitOps, progressive delivery |
| Security Scanning | SAST, DAST, dependency scanning, Trivy, supply chain security |
Each page builds on the fundamentals covered here. Start with GitHub Actions if you're new to CI/CD, or jump to Pipeline Patterns if you're optimizing existing workflows.