CI/CD Overview

Why CI/CD Exists

In the early 2000s, software teams shipped code quarterly. A "release" meant weeks of manual integration, days of testing, and prayer-filled deployment nights. The cost of a single deployment failure could mean months of rollback work. The industry needed a fundamentally different approach.

Continuous Integration (CI) emerged from Kent Beck's Extreme Programming practices in the late 1990s. The core insight was simple: if integration is painful, do it more often. Martin Fowler formalized the practice in 2006, defining CI as developers integrating their work frequently — at least daily — with each integration verified by automated builds and tests.

Continuous Delivery (CD) extended this idea: if deploying is painful, automate it. Jez Humble and David Farley's 2010 book "Continuous Delivery" laid the foundation, arguing that software should always be in a deployable state.

Continuous Deployment goes further — every change that passes automated tests is deployed to production automatically, with no human gate.

The Problem Statement

Without CI/CD, teams face:

Problem	Impact	CI/CD Solution
Integration hell	Days of merge conflicts	Continuous integration with automated testing
Manual testing bottleneck	Weeks of QA cycles	Automated test suites at every commit
Deployment fear	Infrequent, risky releases	Automated, repeatable deployment pipelines
Environment drift	"Works on my machine"	Infrastructure as Code, containerized builds
Slow feedback loops	Bugs found weeks after introduction	Immediate feedback on every commit
Configuration inconsistency	Production differs from staging	Environment promotion with immutable artifacts

First Principles

The Deployment Pipeline as a Value Stream

A CI/CD pipeline is a value stream — the sequence of steps that transforms a code change into running production software. Every step must add value; every delay is waste.

Fundamental Properties

Every effective CI/CD system must satisfy these properties:

Determinism: Given the same inputs (code + dependencies + configuration), the pipeline must produce the same outputs. Non-deterministic builds are the root of most CI/CD suffering.
Idempotency: Running the pipeline twice on the same commit must produce the same result. Deployments must be safe to retry.
Immutability: Build artifacts are never modified after creation. The binary deployed to staging is byte-for-byte identical to what reaches production.
Observability: Every pipeline step must emit structured logs, metrics, and traces. When a deployment fails at 3 AM, you need answers fast.
Isolation: Pipeline steps should not leak state between runs. Today's build must not be affected by yesterday's failure.

The DORA Metrics Framework

Google's DevOps Research and Assessment (DORA) team identified four key metrics that predict software delivery performance:

Deployment Frequency = \frac{Number of Deployments}{Time Period}

Lead Time for Changes = T_{deploy} - T_{commit}

Change Failure Rate = \frac{Failed Deployments}{Total Deployments} \times 100 %

Mean Time to Recovery (MTTR) = \frac{\sum_{i = 1}^{n} (T_{{restored}_{i}} - T_{{failed}_{i}})}{n}

Performance Level	Deploy Frequency	Lead Time	Change Failure Rate	MTTR
Elite	On-demand (multiple/day)	< 1 hour	0-15%	< 1 hour
High	Weekly to monthly	1 day - 1 week	16-30%	< 1 day
Medium	Monthly to 6 months	1 week - 1 month	16-30%	1 day - 1 week
Low	> 6 months	> 6 months	46-60%	> 6 months

Core Mechanics

Pipeline Architecture

A modern CI/CD pipeline consists of several interconnected systems:

Build System Internals

At its core, a CI runner is an event-driven job executor:

Event ingestion: Webhooks from VCS trigger pipeline creation
Job scheduling: The scheduler matches jobs to available runners based on labels, capacity, and resource requirements
Workspace preparation: The runner clones the repository, restores caches, and sets up the execution environment
Step execution: Each step runs in sequence (or parallel where specified) within the job's container/VM
Artifact collection: Outputs are extracted and stored
Status reporting: Results are sent back to the VCS via status checks

typescript

// Conceptual model of a CI pipeline executor
interface PipelineConfig {
  trigger: TriggerEvent;
  stages: Stage[];
  globalEnv: Record<string, string>;
  timeout: number;
}

interface Stage {
  name: string;
  jobs: Job[];
  dependsOn?: string[];
}

interface Job {
  name: string;
  runner: RunnerSelector;
  container?: ContainerConfig;
  steps: Step[];
  services?: ServiceConfig[];
  cache?: CacheConfig;
  artifacts?: ArtifactConfig;
  timeout: number;
  retries: RetryConfig;
}

interface Step {
  name: string;
  run?: string;
  uses?: string;   // reusable action reference
  with?: Record<string, string>;
  env?: Record<string, string>;
  if?: string;      // conditional expression
  continueOnError?: boolean;
}

interface RunnerSelector {
  labels: string[];
  group?: string;
}

interface CacheConfig {
  key: string;
  paths: string[];
  restoreKeys?: string[];
}

class PipelineExecutor {
  private stageResults: Map<string, StageResult> = new Map();

  async execute(config: PipelineConfig): Promise<PipelineResult> {
    const startTime = Date.now();

    for (const stage of this.topologicalSort(config.stages)) {
      // Check if dependencies succeeded
      if (stage.dependsOn?.some(dep =>
        this.stageResults.get(dep)?.status !== 'success'
      )) {
        this.stageResults.set(stage.name, { status: 'skipped' });
        continue;
      }

      // Execute all jobs in the stage concurrently
      const jobResults = await Promise.allSettled(
        stage.jobs.map(job => this.executeJob(job, config.globalEnv))
      );

      const stageStatus = jobResults.every(
        r => r.status === 'fulfilled' && r.value.status === 'success'
      ) ? 'success' : 'failure';

      this.stageResults.set(stage.name, { status: stageStatus });

      if (stageStatus === 'failure') {
        return {
          status: 'failure',
          duration: Date.now() - startTime,
          stages: this.stageResults,
        };
      }
    }

    return {
      status: 'success',
      duration: Date.now() - startTime,
      stages: this.stageResults,
    };
  }

  private async executeJob(
    job: Job,
    globalEnv: Record<string, string>
  ): Promise<JobResult> {
    const env = { ...globalEnv, ...this.resolveJobEnv(job) };
    const workspace = await this.prepareWorkspace(job);

    // Restore cache
    if (job.cache) {
      await this.restoreCache(job.cache, workspace);
    }

    // Start services (databases, redis, etc.)
    const services = job.services
      ? await this.startServices(job.services)
      : [];

    try {
      for (const step of job.steps) {
        // Evaluate conditional
        if (step.if && !this.evaluateExpression(step.if, env)) {
          continue;
        }

        const stepResult = await this.executeStep(step, workspace, env);

        if (stepResult.exitCode !== 0 && !step.continueOnError) {
          return { status: 'failure', failedStep: step.name };
        }
      }

      // Save cache
      if (job.cache) {
        await this.saveCache(job.cache, workspace);
      }

      // Collect artifacts
      if (job.artifacts) {
        await this.collectArtifacts(job.artifacts, workspace);
      }

      return { status: 'success' };
    } finally {
      await Promise.all(services.map(s => s.stop()));
      await workspace.cleanup();
    }
  }

  private topologicalSort(stages: Stage[]): Stage[] {
    const visited = new Set<string>();
    const sorted: Stage[] = [];
    const stageMap = new Map(stages.map(s => [s.name, s]));

    const visit = (stage: Stage) => {
      if (visited.has(stage.name)) return;
      visited.add(stage.name);
      for (const dep of stage.dependsOn ?? []) {
        const depStage = stageMap.get(dep);
        if (depStage) visit(depStage);
      }
      sorted.push(stage);
    };

    stages.forEach(visit);
    return sorted;
  }

  // ... additional helper methods
}

Caching Mechanics

Caching is the single most impactful optimization in CI/CD. Without it, every build starts from scratch — downloading dependencies, compiling code, and building containers.

Cache key strategies:

// Exact match — fastest, least flexible
cache-key: deps-{{ hashFiles('package-lock.json') }}

// Prefix match — falls back to partial restore
restore-keys:
  - deps-{{ hashFiles('package-lock.json') }}
  - deps-{{ branch }}
  - deps-main
  - deps-

Cache invalidation formula:

Cache Hit Rate = \frac{Cache Hits}{Cache Hits + Cache Misses} \times 100 %

Time Saved = Cache Hits \times (T_{cold} - T_{warm})

Where $T_{cold}$ is the time for a cacheless build and $T_{warm}$ is the time when cache is available.

Implementation: Production Pipeline Configuration

GitHub Actions Multi-Stage Pipeline

yaml

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

env:
  NODE_VERSION: '20'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run test -- --shard=${{ matrix.shard }}/4
      - uses: actions/upload-artifact@v4
        with:
          name: coverage-${{ matrix.shard }}
          path: coverage/

  integration-tests:
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck]
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - run: npm ci
      - run: npm run test:integration
        env:
          DATABASE_URL: postgresql://postgres:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

  security-scan:
    runs-on: ubuntu-latest
    needs: [lint-and-typecheck]
    steps:
      - uses: actions/checkout@v4
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

  build:
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, security-scan]
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [build]
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          echo "Deploying ${{ needs.build.outputs.image-tag }} to staging"
          # kubectl set image deployment/app app=${{ needs.build.outputs.image-tag }}

  deploy-production:
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          echo "Deploying to production with canary strategy"

Edge Cases & Failure Modes

Common Pipeline Failures

Failure Mode	Cause	Mitigation
Flaky tests	Non-deterministic test order, timing, shared state	Test isolation, retry policies, quarantine
Cache poisoning	Corrupted cache restored to new builds	Cache key versioning, periodic invalidation
Runner exhaustion	Too many concurrent jobs	Autoscaling runners, queue limits
Secret leakage	Secrets printed in logs	Secret masking, audit logging
Dependency confusion	Malicious package with internal name	Scoped registries, lockfile pinning
Webhook storms	VCS sends duplicate events	Idempotency keys, deduplication
Clock skew	Runners have drifted clocks	NTP enforcement, relative timestamps
Disk exhaustion	Large artifacts fill runner disk	Cleanup steps, ephemeral runners

Flaky Test Management

typescript

interface FlakyTestConfig {
  maxRetries: number;
  quarantineEnabled: boolean;
  quarantineThreshold: number; // failures in last N runs
  alertOnNewFlaky: boolean;
}

class FlakyTestDetector {
  private testHistory: Map<string, boolean[]> = new Map();

  recordResult(testName: string, passed: boolean): void {
    const history = this.testHistory.get(testName) ?? [];
    history.push(passed);

    // Keep last 50 runs
    if (history.length > 50) history.shift();
    this.testHistory.set(testName, history);
  }

  getFlakyScore(testName: string): number {
    const history = this.testHistory.get(testName);
    if (!history || history.length < 5) return 0;

    // Count transitions (pass -> fail or fail -> pass)
    let transitions = 0;
    for (let i = 1; i < history.length; i++) {
      if (history[i] !== history[i - 1]) transitions++;
    }

    // Flaky score: ratio of transitions to total runs
    // A consistently passing or failing test has score ~0
    // A randomly flipping test approaches 1.0
    return transitions / (history.length - 1);
  }

  shouldQuarantine(testName: string, config: FlakyTestConfig): boolean {
    return config.quarantineEnabled &&
           this.getFlakyScore(testName) > config.quarantineThreshold;
  }
}

Performance Characteristics

Pipeline Duration Benchmarks

Pipeline Component	Cold (No Cache)	Warm (Cached)	Optimization
Git clone (monorepo, 10GB)	120s	5s (shallow)	Shallow clone, sparse checkout
npm install (2000 deps)	90s	8s	npm ci + cache
TypeScript compilation	45s	12s	Incremental compilation
Unit tests (5000 tests)	180s	180s (parallelized: 45s)	Sharding across 4 runners
Docker build (Node.js)	120s	15s	Layer caching, multi-stage
Security scan (Trivy)	30s	10s	DB cache
E2E tests (100 scenarios)	600s	600s (parallelized: 120s)	Test sharding, 5 runners
Total (sequential)	~1185s	~830s
Total (optimized)	~480s	~200s	Parallelism + caching

Cost Model

Monthly CI Cost = \sum_{i = 1}^{N} (D_{i} \times R_{i} \times C_{r}) + S_{storage} + T_{transfer}

Where:

$N$ = number of pipeline types
$D_{i}$ = daily runs of pipeline $i$
$R_{i}$ = average runner-minutes for pipeline $i$
$C_{r}$ = cost per runner-minute
$S_{storage}$ = artifact storage cost
$T_{transfer}$ = data transfer cost

Example cost calculation for a team of 20 developers:

Resource	Quantity	Unit Cost	Monthly Cost
CI minutes (GitHub Actions)	50,000 min	$0.008/min (Linux)	$400
Larger runners (4 vCPU)	10,000 min	$0.032/min	$320
Artifact storage	50 GB	$0.25/GB	$12.50
Self-hosted runners (3x m5.xlarge)	3 instances	$140/mo each	$420
Total			~$1,152

Mathematical Foundations

Queue Theory in CI/CD

CI runners form a queuing system. Using M/M/c queue model (Poisson arrivals, exponential service, c servers):

ρ = \frac{λ}{c μ}

Where:

$λ$ = average job arrival rate (jobs/minute)
$μ$ = average job service rate per runner (jobs/minute)
$c$ = number of runners
$ρ$ = server utilization (must be < 1 for stability)

The probability of waiting (Erlang C formula):

P_{w} = \frac{\frac{(c ρ)^{c}}{c! (1 - ρ)}}{\sum_{k = 0}^{c - 1} \frac{(c ρ)^{k}}{k!} + \frac{(c ρ)^{c}}{c! (1 - ρ)}}

Average wait time in queue:

W_{q} = \frac{P_{w}}{c μ (1 - ρ)}

Practical Application

If your team generates 30 CI jobs/hour ( $λ = 0.5$ jobs/min), each taking an average of 10 minutes ( $μ = 0.1$ jobs/min), you need at least $c = ⌈ λ / μ ⌉ = 5$ runners to avoid unbounded queue growth. In practice, target $ρ < 0.7$ to keep wait times reasonable — meaning 8 runners.

Little's Law Applied to Pipelines

L = λ W

Where $L$ is the average number of in-flight pipelines, $λ$ is the arrival rate, and $W$ is the average pipeline duration. This is useful for capacity planning:

If $λ = 100$ pipelines/day and $W = 15$ minutes (0.01 days), then $L = 1$ pipeline in-flight on average. During peak hours (3x average), $L = 3$ concurrent pipelines.

Real-World War Stories

War Story — The npm Left-Pad Incident (2016)

When Azer Koculu unpublished the left-pad package from npm, thousands of CI/CD pipelines worldwide broke instantly. Builds that had worked for months suddenly failed because they pulled dependencies fresh every time.

Root cause: No lockfile pinning, no private registry mirror.

Fix: Teams adopted npm ci (which uses lockfiles strictly), private registries like Verdaccio, and artifact caching strategies that didn't rely on public registries being available.

Lesson: Your pipeline is only as reliable as its weakest external dependency. Mirror everything critical.

War Story — The 6-Hour Build Queue

A fintech company with 200 developers on a monorepo experienced build queues exceeding 6 hours during peak development. Developers started batching changes into larger PRs to avoid the queue — which made reviews harder, increased merge conflicts, and ultimately slowed delivery.

Root cause: Fixed pool of 20 self-hosted runners, no autoscaling, all jobs running on the same runner type.

Fix: Migrated to autoscaling runner groups with spot instances. Implemented job-level runner selection (lint on small runners, builds on large runners). Added aggressive caching that cut build times from 25 minutes to 8 minutes. Queue times dropped to under 2 minutes.

Lesson: CI infrastructure must scale with your team. The cost of idle developers waiting for builds far exceeds the cost of additional compute.

Decision Framework

Choosing a CI/CD Platform

Factor	GitHub Actions	GitLab CI	Jenkins	CircleCI	Buildkite
Hosted runners	Yes	Yes	No	Yes	No
Self-hosted runners	Yes	Yes	Yes	Yes	Yes
Container-native	Yes	Yes	Plugin	Yes	Yes
Monorepo support	Good	Good	Excellent	Good	Excellent
Config format	YAML	YAML	Groovy/YAML	YAML	YAML
Marketplace/plugins	20,000+	500+	1,800+	300+	200+
OIDC auth	Yes	Yes	Plugin	Yes	Yes
Max concurrency	500 (enterprise)	Unlimited (self)	Unlimited	80	Unlimited
Free tier	2,000 min/mo	400 min/mo	Self-host	6,000 min/mo	Free (self-host)
Best for	GitHub-native teams	GitLab-native teams	Complex enterprises	Startups	Large-scale

When to Use Each Approach

Advanced Topics

Pipeline as Code Patterns

Modern CI/CD has evolved toward pipeline-as-code — treating pipeline definitions as first-class software artifacts with the same rigor as application code:

Reusable workflows: Shared pipeline definitions across repositories
Pipeline testing: Unit tests for pipeline logic
Pipeline versioning: Semantic versioning for shared pipeline libraries
Policy as Code: OPA/Rego policies governing what pipelines can do

GitOps and Declarative CD

GitOps inverts the deployment model: instead of CI pushing to environments, a controller in each environment pulls the desired state from Git.

Emerging Trends

AI-powered CI/CD: Predictive test selection, automatic flaky test detection, intelligent retry policies
Supply chain security: SLSA framework, Sigstore signing, SBOM generation
Ephemeral environments: Per-PR preview environments with full stack
Platform engineering: Internal developer platforms abstracting CI/CD complexity
WebAssembly runners: Wasm-based build isolation for faster, more secure builds

Section Overview

This section covers CI/CD in depth across the following pages:

Page	What You'll Learn
GitHub Actions Deep Dive	Workflows, matrix strategies, caching, OIDC, reusable workflows
GitLab CI	Stages, jobs, pipelines, runners, DAG pipelines
Pipeline Patterns	Trunk-based development, monorepo strategies, parallel testing
Artifact Management	Container registries, npm packages, versioning strategies
Environment Promotion	Dev to staging to production, GitOps, progressive delivery
Security Scanning	SAST, DAST, dependency scanning, Trivy, supply chain security

Each page builds on the fundamentals covered here. Start with GitHub Actions if you're new to CI/CD, or jump to Pipeline Patterns if you're optimizing existing workflows.

CI/CD Overview ​

Why CI/CD Exists ​

The Problem Statement ​

First Principles ​

The Deployment Pipeline as a Value Stream ​

Fundamental Properties ​

The DORA Metrics Framework ​

Core Mechanics ​

Pipeline Architecture ​

Build System Internals ​

Caching Mechanics ​

Implementation: Production Pipeline Configuration ​

GitHub Actions Multi-Stage Pipeline ​

Edge Cases & Failure Modes ​

Common Pipeline Failures ​

Flaky Test Management ​

Performance Characteristics ​

Pipeline Duration Benchmarks ​

Cost Model ​

Mathematical Foundations ​

Queue Theory in CI/CD ​

Little's Law Applied to Pipelines ​

Real-World War Stories ​

Decision Framework ​

Choosing a CI/CD Platform ​

When to Use Each Approach ​

Advanced Topics ​

Pipeline as Code Patterns ​

GitOps and Declarative CD ​

Emerging Trends ​

Section Overview ​

Related Pages