Prometheus supports two types of rules:
recording rules
alerting rules
After editing a rule file, you can check its syntax with promtool, without restarting Prometheus:
go get /prometheus/prometheus/cmd/promtool
promtool check rules /path/to/
Recording rules
Recording rules compute or aggregate collected metrics and save the result as a new metric.
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
Alerting rules
Alerting rules define alert conditions with expressions. They are configured in the same way as recording rules.
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
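for: how long the expression must stay true before the alert fires. During this time the alert is in the pending state.
labels: extra labels that are attached to the alert.
annotations: supplementary information and descriptions for the alert.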
Templates
Label and annotation values can use template variables.
Usage:
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
Example:
groups:
  - name: example
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
With a loop:
{{ range query "up" }}
{{ . }} {{ .Value }}
{{ end }}
More examples: /docs/prometheus/latest/configuration/template_examples/