Skip to content
Unverified — AI-generated content. Help verify this page

SLI / SLO / SLA Engineering

Service Level Indicators, Objectives, and Agreements form a three-layer system for defining, measuring, and committing to reliability. Without them, reliability is a vague aspiration — "the system should be fast and available." With them, reliability is a precise engineering constraint — "99.9% of requests complete in under 300ms over a 30-day rolling window, and if we breach this target for three consecutive months, customers receive a 10% service credit."

This precision transforms reliability from an emotional debate ("it feels slow") into a data-driven practice ("p99 latency breached SLO for 47 minutes, consuming 8% of our error budget").

The Three Layers

LayerDefinitionOwned ByExample
SLIA quantitative measure of service behaviorEngineeringRatio of successful HTTP requests to total requests
SLOA target value or range for an SLIEngineering + Product99.9% of requests succeed over 30 days
SLAA contractual commitment to meet specific SLOs, with penalties for breachBusiness + Legal99.9% availability; 10% service credit if breached

The Critical Insight

SLOs should always be stricter than SLAs. If your SLA promises 99.9%, your SLO should target 99.95%. The gap between SLO and SLA is your safety margin — it gives you time to detect and fix problems before they trigger contractual penalties.

Service Level Indicators (SLIs)

What Makes a Good SLI?

A good SLI is:

  1. User-facing — it measures something the user cares about, not an internal metric
  2. Measurable — it can be computed from existing data (logs, metrics, probes)
  3. Proportional — small changes in system behavior produce proportional changes in the SLI
  4. Actionable — when the SLI degrades, engineers know where to look

The Four Golden SLIs

Google recommends focusing on four categories of SLIs:

CategoryWhat It MeasuresExample SLIServices
AvailabilityIs the service up and serving?Ratio of successful requests to total requestsAll services
LatencyHow fast does the service respond?Proportion of requests faster than thresholdAPIs, web apps
ThroughputHow much work does the service handle?Bytes processed per secondData pipelines, batch jobs
CorrectnessDoes the service return the right answer?Proportion of responses that are correctData pipelines, ML models, billing

SLI Specification vs Implementation

SLI Specification — what you want to measure (the ideal):

"The proportion of valid requests served successfully"

SLI Implementation — how you actually measure it:

"The proportion of HTTP requests that return a 2xx or 4xx status code, measured at the load balancer, excluding health checks"

The implementation is always an approximation of the specification. Document both, and acknowledge the gap.

Choosing SLIs by Service Type

Service TypePrimary SLISecondary SLIs
REST APIAvailability (success rate)Latency (p50, p99), Error rate by endpoint
Web applicationPage load timeAvailability, Core Web Vitals
Batch pipelineCorrectness (data quality)Throughput, Freshness (data age)
Streaming systemThroughputLatency (end-to-end), Message loss rate
Storage systemDurabilityAvailability, Latency, Throughput
ML inferenceLatencyAvailability, Prediction accuracy

Where to Measure SLIs

The measurement point matters enormously:

Measurement PointProsCons
Client-side (RUM)Most accurate representation of user experienceNoisy (user device/network issues), hard to attribute
CDN/EdgeCaptures most infra issuesMisses client-side problems
Load balancerReliable, easy to instrumentMisses CDN and network issues
ApplicationMost detailed, per-endpoint metricsMisses infra failures (LB, network)
Synthetic probesConsistent baseline, catches total outagesDoes not reflect real user experience

Measure at the Edge, Not the Application

If your application metric shows 100% success rate but the load balancer is returning 502s because the app is not responding, your SLI is lying. Always include a measurement point that is outside the application — load balancer logs or synthetic probes.

SLI Implementation Examples

Availability SLI (Prometheus)

yaml
# Recording rule for availability SLI
groups:
  - name: sli_availability
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      - record: sli:availability:ratio_rate30d
        expr: |
          sum(increase(http_requests_total{status!~"5.."}[30d]))
          /
          sum(increase(http_requests_total[30d]))

Latency SLI (Prometheus)

yaml
# Recording rule for latency SLI — proportion of requests under threshold
groups:
  - name: sli_latency
    rules:
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

      - record: sli:latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

Correctness SLI (Application-Level)

python
# Correctness SLI for a data pipeline
from prometheus_client import Counter

records_processed = Counter(
    'pipeline_records_processed_total',
    'Total records processed',
    ['result']  # 'correct', 'incorrect', 'skipped'
)

def process_record(record):
    result = transform(record)
    expected = get_expected_output(record)

    if result == expected:
        records_processed.labels(result='correct').inc()
    else:
        records_processed.labels(result='incorrect').inc()
        log_data_quality_issue(record, result, expected)

# SLI: records_processed{result="correct"} / sum(records_processed)

Service Level Objectives (SLOs)

Setting SLOs

SLOs answer the question: "How reliable does this service need to be?" The answer is not "as reliable as possible" — it is "reliable enough that users are satisfied, and no more."

Step 1: Understand User Expectations

MethodWhat It Tells You
Analyze current performanceWhat reliability level are users already getting?
Customer support ticketsAt what point do users start complaining?
Competitor analysisWhat reliability level do alternatives offer?
User researchWhat do users say they need? (Be careful — users often overstate needs)
Business impact analysisWhat is the revenue impact of different reliability levels?

Step 2: Choose an SLO Target

Start with current performance and adjust:

If current availability = 99.95% and users are satisfied:
  → Set SLO at 99.9% (slightly below current, gives breathing room)

If current availability = 99.5% and users are complaining:
  → Set SLO at 99.9% (aspirational, drives improvement work)
  → Set intermediate milestone at 99.7% (achievable short-term)

Do NOT Set SLOs at 100%

An SLO of 100% means zero error budget, which means zero tolerance for any change. No deployments, no experiments, no migrations — because any of these could cause a single error. 100% availability is not just impossible; it is counterproductive. It paralyzes the team.

Step 3: Define the SLO Window

Window TypeDescriptionProsCons
Rolling (30 days)SLO evaluated over trailing 30 daysNo "reset" gaming; always relevantIncidents linger for 30 days
Calendar (monthly)SLO resets on the 1st of each monthClean reporting; aligns with billingTeams can burn budget early and coast
Rolling (7 days)Shorter evaluation windowFaster signalLess tolerant of maintenance

Recommendation: Use 30-day rolling windows for operational SLOs and calendar monthly for business reporting.

The SLO Document

Every service should have an SLO document. Here is a production-ready template:

markdown
## SLO Document — Order Service

### Service Description
The Order Service handles e-commerce order creation, payment processing,
and order status queries. It is a Tier-1 service — revenue-generating
and customer-facing.

### Stakeholders
- Service Owner: Alice Johnson (Engineering)
- Product Owner: Bob Williams (Product)
- SRE Lead: Carol Martinez (SRE)
- Approved by: Dave Chen (VP Engineering)

### SLIs and SLOs

| SLI | Description | Measurement | SLO Target | Window |
|-----|-------------|-------------|------------|--------|
| Availability | Successful requests / total requests | Load balancer access logs, excluding health checks | 99.95% | 30-day rolling |
| Latency (fast) | Requests completing < 200ms | Application metrics (histogram) | 90% of requests | 30-day rolling |
| Latency (slow) | Requests completing < 1000ms | Application metrics (histogram) | 99% of requests | 30-day rolling |
| Correctness | Orders with correct total calculation | End-of-day reconciliation job | 99.999% | 30-day rolling |

### Error Budget
- Availability budget: 0.05% = 21.6 minutes / 30 days
- Error budget policy: see [Error Budgets](/devops/sre/error-budgets)

### Exclusions
- Planned maintenance windows (announced 48h in advance, max 4h/month)
- Client errors (4xx responses) are not counted against availability
- Synthetic monitoring traffic is excluded

### Review Schedule
- Monthly: SLO compliance review with service team
- Quarterly: SLO target review with stakeholders
- Annually: Full SLO redesign if needed

SLO Compliance Tracking

python
# SLO compliance checker
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SLOStatus:
    sli_name: str
    target: float
    current: float
    compliant: bool
    error_budget_remaining_pct: float
    window_start: datetime
    window_end: datetime

def check_slo_compliance(
    sli_value: float,    # Current SLI value (e.g., 0.9993)
    slo_target: float,   # SLO target (e.g., 0.9995)
    window_days: int = 30,
) -> SLOStatus:
    error_budget_total = 1 - slo_target     # e.g., 0.0005
    error_budget_consumed = max(0, 1 - sli_value)  # e.g., 0.0007
    error_budget_remaining = error_budget_total - error_budget_consumed
    remaining_pct = (error_budget_remaining / error_budget_total) * 100

    return SLOStatus(
        sli_name="availability",
        target=slo_target,
        current=sli_value,
        compliant=sli_value >= slo_target,
        error_budget_remaining_pct=round(remaining_pct, 2),
        window_start=datetime.utcnow(),
        window_end=datetime.utcnow(),
    )

Service Level Agreements (SLAs)

SLO vs SLA

AspectSLOSLA
AudienceInternal engineering teamsExternal customers
ConsequencesError budget policy actionsFinancial penalties, contract termination
StringencyStricter (provides safety margin)Less strict (safety margin between SLO and SLA)
Who sets itEngineering + ProductBusiness + Legal
DocumentationInternal SLO documentLegal contract

SLA Components

A well-written SLA includes:

ComponentDescriptionExample
Service descriptionWhat is covered"The API endpoint at api.example.com"
Availability definitionHow uptime is calculated"Percentage of 1-minute intervals where > 95% of health checks succeed"
ExclusionsWhat does not count"Scheduled maintenance, force majeure, customer-caused issues"
Measurement periodEvaluation window"Calendar month"
Credit schedulePenalties for breach"< 99.9%: 10% credit; < 99.0%: 25% credit; < 95%: 50% credit"
Credit capMaximum penalty"Credits shall not exceed 50% of monthly fees"
Claim processHow customers request credits"Submit ticket within 30 days of incident"

SLA Credit Schedule Example

Monthly Uptime %Service Credit
99.0% - 99.9%10% of monthly bill
95.0% - 99.0%25% of monthly bill
< 95.0%50% of monthly bill

Engineering Considerations for SLAs

SLAs are Legal Documents

Engineers should be deeply involved in SLA definition but should not write them alone. Common engineering mistakes in SLAs:

  • Promising SLAs tighter than what the infrastructure can deliver
  • Not defining measurement methodology precisely
  • Not excluding planned maintenance
  • Not accounting for dependency SLAs (your SLA cannot exceed your cloud provider's SLA)
  • Not setting a credit cap (unlimited penalties are an existential business risk)

Dependency SLA Analysis

Your service's SLA is bounded by the SLAs of your dependencies:

python
# Composite SLA calculation
def composite_sla(dependency_slas: list[float], topology: str = "serial") -> float:
    """
    Calculate composite SLA from dependency SLAs.

    Serial topology: all dependencies must be available.
    Parallel (redundant): at least one must be available.
    """
    if topology == "serial":
        # SLA = product of all dependency SLAs
        composite = 1.0
        for sla in dependency_slas:
            composite *= sla
        return round(composite, 6)
    elif topology == "parallel":
        # SLA = 1 - product of all failure probabilities
        composite_failure = 1.0
        for sla in dependency_slas:
            composite_failure *= (1 - sla)
        return round(1 - composite_failure, 6)

# Serial: API → Auth Service → Database
serial_sla = composite_sla([0.999, 0.999, 0.9999], topology="serial")
# => 0.997901 ≈ 99.79%
# You CANNOT promise 99.9% if your serial dependency chain gives 99.79%

# Parallel: Primary DB || Replica DB
parallel_sla = composite_sla([0.999, 0.999], topology="parallel")
# => 0.999999 ≈ 99.9999%
# Redundancy dramatically improves availability

Putting It All Together

The SLI/SLO/SLA Lifecycle

Common Pitfalls

PitfallImpactFix
Too many SLIsAlert fatigue, unclear prioritiesMax 3-5 SLIs per service
SLIs not user-facingSLOs that do not reflect user experienceMeasure at the edge, not internally
SLOs set by fiatUnrealistic targets, team ignores themBase SLOs on data and user research
SLO = SLANo safety margin; any slip triggers penaltiesSLO should be 2-5x stricter than SLA
No SLO review cadenceSLOs become stale as the service evolvesReview monthly, adjust quarterly
Ignoring latency SLOsOnly tracking availability misses slownessAlways include latency SLIs at p50 and p99

Further Reading

  • Error Budgets — what to do with the budget that SLOs create
  • SRE Overview — the broader SRE framework
  • Observability — the monitoring infrastructure that powers SLI measurement
  • On-Call Handbook — incident response when SLOs are breached
  • Implementing Service Level Objectives by Alex Hidalgo — the definitive book on SLOs
  • Google SRE Book, Chapter 4: "Service Level Objectives" — sre.google/sre-book/service-level-objectives
  • Google SRE Workbook, Chapter 2: "Implementing SLOs" — sre.google/workbook/implementing-slos

"What I cannot create, I do not understand." — Richard Feynman