PromQL Cheat Sheet

Quick reference for PromQL selectors, label matchers, range vectors, aggregation operators, common queries, recording rules, and alerting rules.


Data Types

| Type | Description | Example |
| --- | --- | --- |
| Instant vector | Set of time series, single timestamp | http_requests_total |
| Range vector | Set of time series, range of timestamps | http_requests_total[5m] |
| Scalar | Single numeric value | 42, 3.14 |
| String | Single string value (limited use) | "hello" |

Selectors & Label Matchers

Instant Vector Selectors

promql
# Metric name only
http_requests_total

# With label matchers
http_requests_total{method="GET"}
http_requests_total{method="GET", status="200"}
http_requests_total{job="api-server", instance=~"10.0.0.*"}

Label Matching Operators

| Operator | Description | Example |
| --- | --- | --- |
| = | Exact match | {method="GET"} |
| != | Not equal | {status!="200"} |
| =~ | Regex match | {path=~"/api/v[12]/.*"} |
| !~ | Regex not match | {method!~"OPTIONS\|HEAD"} |
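
Two matcher behaviors are easy to miss: regex matchers are anchored at both ends, and an empty-string matcher also selects series that lack the label entirely. A short sketch, reusing the example metric from above; the env label is hypothetical and only there for illustration:

```promql
# Anchored: matches 500, 503, ... but not 1500 or 50
http_requests_total{status=~"5.."}

# To match a substring, add the wildcards explicitly
http_requests_total{path=~".*/users/.*"}

# Series where env is empty or not set at all (env is a hypothetical label)
http_requests_total{env=""}

# Series that carry a non-empty env label
http_requests_total{env!=""}
```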

Range Vector Selectors

promql
# Last 5 minutes of data
http_requests_total[5m]

# Time durations
http_requests_total[30s]      # 30 seconds
http_requests_total[5m]       # 5 minutes
http_requests_total[1h]       # 1 hour
http_requests_total[1d]       # 1 day
http_requests_total[1w]       # 1 week

Offset Modifier

promql
# Value 1 hour ago
http_requests_total offset 1h

# Rate 1 day ago (compare to yesterday)
rate(http_requests_total[5m] offset 1d)

# @ modifier (at specific timestamp)
http_requests_total @ 1704067200
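
A common use of offset is comparing the present against the same moment in an earlier period. The sketch below assumes the same example counter; the one-week window is an arbitrary choice:

```promql
# Ratio of the current request rate to the rate at the same time last week
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 1w))

# offset and @ can be combined on one selector:
# the 5m rate ending one hour before the given timestamp
rate(http_requests_total[5m] @ 1704067200 offset 1h)
```

A result above 1 means more traffic than a week ago, below 1 means less.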

Rate & Counter Functions

| Function | Description | Use With |
| --- | --- | --- |
| rate(v[t]) | Per-second average rate | Counters (always use for counters) |
| irate(v[t]) | Instantaneous rate (last 2 points) | Counters (volatile, dashboards) |
| increase(v[t]) | Total increase over range | Counters |
| delta(v[t]) | Change over range | Gauges |
| idelta(v[t]) | Change between last 2 points | Gauges |
| deriv(v[t]) | Per-second derivative (linear regression) | Gauges |
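
To make the relationship between rate() and increase() concrete, here is a small sketch using the example counter. increase() is rate() multiplied by the number of seconds in the window:

```promql
# These two return the same values (5m = 300 seconds)
increase(http_requests_total[5m])
rate(http_requests_total[5m]) * 300

# Both extrapolate to the edges of the window, so increase() can return
# non-integer values even for a counter that only moves in whole steps
increase(http_requests_total[1h])
```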

TIP

Always apply rate() or increase() to counters before any aggregation. Never use raw counter values directly -- they only go up and reset on restart.

promql
# Correct: rate first, then sum
sum(rate(http_requests_total[5m])) by (method)

# Wrong: sum first, then rate. Summing merges the series, so a reset in
# one of them is no longer detectable. It also only parses as a subquery:
# rate(sum(http_requests_total)[5m:])

Aggregation Operators

Syntax

promql
<aggr_op>([parameter,] <vector>) [by|without (<labels>)]

Operators

| Operator | Description | Example |
| --- | --- | --- |
| sum | Sum values | sum(rate(http_requests_total[5m])) |
| avg | Average | avg(node_load1) |
| min | Minimum | min(node_memory_MemAvailable_bytes) |
| max | Maximum | max(container_memory_usage_bytes) by (pod) |
| count | Count series | count(up == 1) |
| stddev | Population standard deviation | stddev(request_duration_seconds) |
| stdvar | Population variance | stdvar(request_duration_seconds) |
| topk | Top K series | topk(5, rate(http_requests_total[5m])) |
| bottomk | Bottom K series | bottomk(3, up) |
| quantile | Quantile across series | quantile(0.95, request_duration_seconds) |
| count_values | Count series per distinct sample value | count_values("version", build_info) |
| group | Group (all values become 1) | group(up) by (job) |

by vs without

promql
# Keep only these labels in result
sum(rate(http_requests_total[5m])) by (method, status)

# Remove these labels from result (keep everything else)
sum(rate(http_requests_total[5m])) without (instance)

Binary Operators

Arithmetic

| Operator | Description |
| --- | --- |
| + | Addition |
| - | Subtraction |
| * | Multiplication |
| / | Division |
| % | Modulo |
| ^ | Power |

Comparison (Filter)

| Operator | Description |
| --- | --- |
| == | Equal |
| != | Not equal |
| > | Greater than |
| < | Less than |
| >= | Greater or equal |
| <= | Less or equal |

promql
# Filter: keep only series where value > 100
http_requests_total > 100

# Return 0/1 instead of filtering (bool modifier)
http_requests_total > bool 100
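
The bool modifier is most useful inside an aggregation, because every series contributes a 0 or a 1 instead of being filtered out. A minimal sketch using the metrics from this page; the 100 req/s threshold is arbitrary:

```promql
# Number of down targets per job; yields 0 (not an empty result) when all are up
sum by (job) (up == bool 0)

# Fraction of instances whose 5m request rate exceeds 100 req/s
avg(sum by (instance) (rate(http_requests_total[5m])) > bool 100)
```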

Vector Matching

promql
# One-to-one matching (same labels)
http_requests_total / http_requests_duration_seconds

# Ignore labels for matching
method_a{job="a"} / ignoring(job) method_b{job="b"}

# Match on specific labels only
method_a / on(instance, method) method_b

# Many-to-one / one-to-many
http_errors / ignoring(code) group_left http_requests
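
Many-to-one matching is often used to copy a label from an info-style metric onto other series. The sketch below assumes a hypothetical app_build_info metric with value 1, one series per job and instance, and a version label:

```promql
# Attach the version label to each per-instance request rate.
# app_build_info is a hypothetical metric used for illustration.
sum by (job, instance) (rate(http_requests_total[5m]))
* on (job, instance) group_left (version)
app_build_info
```

Multiplying by 1 leaves the rate unchanged; group_left (version) names the label to copy from the right-hand side.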

Built-in Functions

Math

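| Function | Description |
| --- | --- |
| abs(v) | Absolute value |
| ceil(v) | Round up to the nearest integer |
| floor(v) | Round down to the nearest integer |
| round(v, to) | Round to the nearest multiple of to (default 1) |
| clamp(v, min, max) | Clamp values to the range [min, max] |
| clamp_min(v, min) | Raise values below min to min (lower bound) |
| clamp_max(v, max) | Lower values above max to max (upper bound) |
| ln(v) | Natural log |
| log2(v) | Log base 2 |
| log10(v) | Log base 10 |
| sqrt(v) | Square root |
| exp(v) | Exponential |
| sgn(v) | Sign (-1, 0, 1) |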

Range Vector Functions

| Function | Description |
| --- | --- |
| avg_over_time(v[t]) | Average over range |
| min_over_time(v[t]) | Minimum over range |
| max_over_time(v[t]) | Maximum over range |
| sum_over_time(v[t]) | Sum over range |
| count_over_time(v[t]) | Count of samples in range |
| quantile_over_time(q, v[t]) | Quantile over range |
| stddev_over_time(v[t]) | Std deviation over range |
| last_over_time(v[t]) | Most recent value |
| present_over_time(v[t]) | 1 if any sample exists |
| changes(v[t]) | Number of value changes |
| resets(v[t]) | Number of counter resets |
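
A few typical uses of the functions in the table, as a sketch with the example metrics from this page. The second query uses subquery syntax ([range:resolution]) because the inner expression is not a plain selector:

```promql
# Fraction of the last hour that each target was up
avg_over_time(up[1h])

# Peak 5m request rate seen over the last day, evaluated at 1m resolution
max_over_time(
  sum(rate(http_requests_total[5m]))[1d:1m]
)

# Targets whose up value changed more than 3 times in 10 minutes (flapping)
changes(up[10m]) > 3
```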

Label Functions

| Function | Description |
| --- | --- |
| label_replace(v, dst, repl, src, regex) | Regex replace label |
| label_join(v, dst, sep, src1, src2, ...) | Join labels |

promql
# Extract "api" from path="/api/v1/users"
label_replace(metric, "service", "$1", "path", "/([^/]+)/.*")

# Combine labels
label_join(metric, "full_name", "-", "first", "last")

Other Functions

| Function | Description |
| --- | --- |
| histogram_quantile(q, v) | Quantile from histogram buckets |
| predict_linear(v[t], secs) | Predicted value secs seconds from now (linear regression over the range) |
| sort(v) | Sort ascending |
| sort_desc(v) | Sort descending |
| time() | Evaluation time as a Unix timestamp |
| timestamp(v) | Timestamp of each sample |
| vector(scalar) | Scalar to vector |
| scalar(v) | Single-element vector to scalar |
| absent(v) | 1 if vector is empty |
| absent_over_time(v[t]) | 1 if no samples in range |
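
The time and absence functions are easier to read in context. A brief sketch, assuming the standard process_start_time_seconds gauge exposed by most client libraries and the api-server job name used earlier on this page:

```promql
# Seconds since each process started
time() - process_start_time_seconds

# Returns 1 when no series match, e.g. the job is gone from service discovery
absent(up{job="api-server"})

# Age in seconds of the latest sample for each series
time() - timestamp(up)
```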

Common Queries

Error Rate

promql
# Error rate (fraction)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error rate per service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

# Error rate percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

Latency Percentiles (Histograms)

promql
# p50 latency
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# p95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# p99 latency per endpoint
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
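
Histogram buckets also answer "what share of requests finished within a threshold". The sketch below assumes the histogram defines a bucket boundary at exactly 0.3 seconds; use a boundary your instrumentation actually has:

```promql
# Fraction of requests served within 300ms.
# Assumes a bucket with le="0.3" exists.
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```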

Saturation (Resource Usage)

promql
# CPU utilization per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk utilization
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})

# Container CPU usage vs limit
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/
sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod)

# Container memory usage vs limit
sum(container_memory_working_set_bytes) by (pod)
/
sum(container_spec_memory_limit_bytes) by (pod)

Traffic

promql
# Requests per second
sum(rate(http_requests_total[5m]))

# Requests per second by method
sum(rate(http_requests_total[5m])) by (method)

# Top 5 endpoints by request rate
topk(5, sum(rate(http_requests_total[5m])) by (handler))

Availability & Uptime

promql
# Targets that are down
up == 0

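# Fraction of targets that are up (avg(up) gives the same value
# and returns 0 instead of no result when every target is down)
count(up == 1) / count(up)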

# Availability over 30 days
avg_over_time(up[30d])

Prediction

promql
# Predicted free bytes 24 hours from now (linear fit over the last 6h)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)

# Disk will be full within 24 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0

# Rate of change
deriv(process_resident_memory_bytes[1h])
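
predict_linear() returns a predicted value, not a time. To estimate how many seconds remain until the disk is full, one option is to divide the free space by the rate at which it shrinks. This is a rough sketch under the same linear-trend assumption:

```promql
# Estimated seconds until the filesystem is full, for filesystems that are
# currently shrinking (deriv < 0). Growing or flat filesystems are dropped.
-1 * node_filesystem_avail_bytes{mountpoint="/"}
/
(deriv(node_filesystem_avail_bytes{mountpoint="/"}[6h]) < 0)
```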

Recording Rules

Recording rules precompute expensive queries and store them as new time series.

yaml
# Rules file, loaded via rule_files in prometheus.yml
groups:
  - name: http_rules
    interval: 30s
    rules:
      # Request rate by service
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Error ratio by service
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # p95 latency by handler
      - record: handler:http_duration:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
          )

TIP

Naming convention for recording rules: level:metric:operations. Example: job:http_requests:rate5m means aggregated by job, metric is http_requests, operation is rate over 5 minutes.
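
Once recorded, these series are queried like any other metric. A short sketch using the rule names defined above; the 1% threshold is arbitrary:

```promql
# Query the precomputed series directly
job:http_requests:rate5m{job="api-server"}

# Total request rate across all jobs, built on the recorded series
sum(job:http_requests:rate5m)

# Jobs whose recorded error ratio is above 1%
job:http_errors:ratio5m > 0.01
```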


Alerting Rules

yaml
groups:
  - name: app_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      # High latency
      - alert: HighLatency
        expr: handler:http_duration:p95_5m > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.handler }}"
          description: "p95 latency is {{ $value | humanizeDuration }}"

      # Target down
      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"

      # Disk filling up
      - alert: DiskFillingUp
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted full within 24h on {{ $labels.instance }}"

      # Pod crash looping
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

Template Functions (Annotations)

| Function | Description | Example |
| --- | --- | --- |
| humanize | Human-readable number (SI prefixes) | 1234567 to 1.235M |
| humanize1024 | Binary units | Bytes to Ki, Mi |
| humanizeDuration | Duration string | 3661 to 1h 1m 1s |
| humanizePercentage | Ratio to percentage | 0.05 to 5% |
| humanizeTimestamp | Timestamp to date | Unix to readable date |

Gotchas & Best Practices

| Pitfall | Solution |
| --- | --- |
| Aggregating raw counters | Always apply rate() or increase() first |
| rate() returns no data for new or sparse series | rate() needs at least two samples in the window; use a range of at least 4x the scrape interval |
| Missing series gives no result | Use or vector(0) for default |
| histogram_quantile wrong labels | Must keep le label in by() clause |
| High cardinality explosions | Avoid labels with unbounded values (user IDs, URLs) |
| Stale series after restart | Use up to detect target health |
| irate vs rate | rate for alerts/recording rules, irate only for dashboards |

promql
# Default to 0 when no error series exist
sum(rate(http_requests_total{status=~"5.."}[5m])) or vector(0)

# Handle division by zero
sum(rate(errors[5m])) / (sum(rate(total[5m])) > 0)
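
The or vector(0) default only works for queries without grouping labels, because vector(0) carries no labels. For a grouped query, one possible workaround is to fall back to the total series multiplied by zero, which keeps the grouping labels:

```promql
# vector(0) would add a single unlabeled series here instead of a zero per job.
# Fall back to the per-job total scaled to 0 so every job gets a value.
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
or
sum by (job) (rate(http_requests_total[5m])) * 0
```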

When to Use X vs Y

| Decision | Choice A | Choice B | Use A When | Use B When |
| --- | --- | --- | --- | --- |
| Rate function | rate() | irate() | Alerts, recording rules, smooth | Dashboards, spiky detail |
| Counter total | increase() | rate() | Want total count over window | Want per-second rate |
| Gauge change | delta() | deriv() | Actual change | Per-second rate of change |
| Aggregation | by (labels) | without (labels) | Few labels to keep | Few labels to remove |
| Percentile | histogram_quantile | quantile | Histogram metric (buckets) | Across existing series |
| Missing data | absent() | up == 0 | Metric disappeared entirely | Target is down |
| Time window | Short [1m] | Long [15m] | High-resolution, volatile | Smooth, stable for alerts |

Test Yourself
  1. What function should you always apply to a counter before aggregation?
     Answer: rate() or increase()

  2. How do you calculate the p99 latency from a histogram?
     Answer: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

  3. What is the difference between by and without in aggregation?
     Answer: by keeps only the listed labels; without removes the listed labels and keeps everything else.

  4. How do you provide a default value of 0 when a series does not exist?
     Answer: Append or vector(0) to the query.

  5. What function predicts a future value based on linear regression?
     Answer: predict_linear(v[t], seconds_ahead)

  6. What is the naming convention for recording rules?
     Answer: level:metric:operations (e.g., job:http_requests:rate5m)

  7. How do you get the value of a metric from 1 hour ago?
     Answer: metric_name offset 1h

  8. What modifier returns 0 or 1 instead of filtering in comparison operators?
     Answer: bool (e.g., http_requests_total > bool 100)

  9. How do you calculate CPU utilization from node_cpu_seconds_total?
     Answer: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  10. What function detects that a metric has completely disappeared?
     Answer: absent(metric_name)

Common Gotchas

  • Aggregating raw counters without rate(). Counters only go up and reset on restart. Summing raw counters hides resets and gives meaningless numbers.
  • Forgetting le in by() for histogram_quantile. The function reads bucket boundaries from the le label; if the aggregation drops it, no quantile can be computed.
  • Using irate() in alerting rules. irate uses only the last two data points and is too volatile for stable alerts. Use rate() for alerts and recording rules.
  • High-cardinality labels (user IDs, request paths). Each unique label combination creates a new time series. Unbounded labels cause memory explosions in Prometheus.
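
To find where cardinality is coming from, queries along these lines can help. Treat them as an ad hoc sketch: the first one touches every series on the server, and handler is simply the label used in the examples on this page:

```promql
# Ten metric names with the most series. This scans every series, so it can
# be expensive on a large server; run it ad hoc rather than in a dashboard.
topk(10, count by (__name__) ({__name__=~".+"}))

# Number of distinct values of one label on one metric
count(count by (handler) (http_requests_total))
```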

One-Liner Summary

PromQL is Prometheus's query language for slicing time-series metrics -- master rate(), histogram_quantile(), and aggregation operators to build dashboards and alerts that actually catch production issues.
