Prometheus rules configuration

Date: 2025-05-13 18:16:29

Prometheus supports two kinds of rules:

recording rules

alerting rules

After editing a rules file, you can check its syntax with the promtool utility, without restarting Prometheus:

go get /prometheus/prometheus/cmd/promtool
promtool check rules /path/to/

Recording rules

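Recording rules periodically evaluate a calculation or aggregation over the collected metrics and store the result as a new metric. In this example, the per-job sum of http_inprogress_requests is saved as job:http_inprogress_requests:sum.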

groups:
  - name: example
    rules:
    - record: job:http_inprogress_requests:sum
      expr: sum(http_inprogress_requests) by (job)
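To make the aggregation concrete, here is a minimal Go sketch of what `sum(http_inprogress_requests) by (job)` computes at a single evaluation. It is not Prometheus code: the `Sample` type, the `sumByLabel` function, and the series values are hypothetical, and real PromQL operates on labelled time series rather than this simplified slice.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a simplified stand-in for one series at one evaluation:
// a label set plus the current value.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// sumByLabel mimics `sum(metric) by (label)`: series are grouped by the
// value of `label`, all other labels are dropped, and values are added.
func sumByLabel(samples []Sample, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[s.Labels[label]] += s.Value
	}
	return out
}

func main() {
	// Hypothetical http_inprogress_requests series from three targets.
	samples := []Sample{
		{Labels: map[string]string{"job": "api", "instance": "10.0.0.1:8080"}, Value: 3},
		{Labels: map[string]string{"job": "api", "instance": "10.0.0.2:8080"}, Value: 5},
		{Labels: map[string]string{"job": "web", "instance": "10.0.0.3:8080"}, Value: 2},
	}

	result := sumByLabel(samples, "job")

	// Print in a stable order; each line is one series of the new metric.
	jobs := make([]string, 0, len(result))
	for job := range result {
		jobs = append(jobs, job)
	}
	sort.Strings(jobs)
	for _, job := range jobs {
		fmt.Printf("job:http_inprogress_requests:sum{job=%q} %g\n", job, result[job])
	}
}
```

Running it prints one line per job (`api` with 8, `web` with 2). The `instance` label disappears because `by (job)` keeps only the grouping label.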

Alerting rules

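Alerting rules define alert conditions as PromQL expressions. They are configured the same way as recording rules: in rule files, inside groups.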

groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency

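for: how long the expression must keep returning a result before the alert fires. Until then, the alert is in the pending state.

labels: additional labels attached to the alert; any existing labels with the same name are overwritten.

annotations: informational key/value pairs for longer supplementary text, such as a summary and description of the alert.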

Templates

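Label and annotation values can be templated using Go's template syntax. The $labels variable holds the label key/value pairs of the alert instance, and $value holds its evaluated value.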

Usage:

# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
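Since these are Go templates, the variables can be tried out with Go's standard `text/template` package. The sketch below defines `$labels` and `$value` from the alert data in front of the annotation text, roughly mimicking how Prometheus makes them available. `alertData` and `render` are hypothetical names and the label values are made up.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData is a simplified stand-in for the data an alert template sees.
type alertData struct {
	Labels map[string]string
	Value  float64
}

// render declares $labels and $value from the data, then executes the
// annotation text in the same template so those variables are in scope.
func render(text string, d alertData) (string, error) {
	const defs = "{{$labels := .Labels}}{{$value := .Value}}"
	tmpl, err := template.New("annotation").Parse(defs + text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	d := alertData{
		Labels: map[string]string{"instance": "10.0.0.1:9100", "job": "node"},
		Value:  0,
	}
	out, err := render("{{ $labels.instance }} of job {{ $labels.job }} is down (value: {{ $value }})", d)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

This prints `10.0.0.1:9100 of job node is down (value: 0)`. The variable declarations themselves produce no output.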

Example:

groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

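Templates can also loop over query results. For example, the following lists every instance together with its up value: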

{{ range query "up" }}
  {{ .Labels.instance }} {{ .Value }}
{{ end }}

For more examples, see: /docs/prometheus/latest/configuration/template_examples/