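1 Overview
Alerting with Prometheus is split into two independent parts.
Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts,
including:
- silencing,
- inhibition,
- aggregation,
- and sending out notifications via methods such as email, PagerDuty and HipChat.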
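2 Alertmanager
2.1 Grouping
Grouping categorizes alerts of a similar nature into a single notification.
This is especially useful during larger outages, when many systems fail at once and hundreds to thousands of alerts may fire simultaneously.
For example:
- Dozens or hundreds of instances of a service are running in a cluster when a network partition occurs.
- Half of the service instances can no longer reach the database. The alerting rules in Prometheus are configured to send an alert for each service instance that cannot communicate with the database, so hundreds of alerts are sent to Alertmanager.
- A user only wants to receive a single page, while still being able to see exactly which service instances were affected. Without grouping, the same information arrives as many scattered notifications.
- Alertmanager can therefore be configured to group alerts by their cluster and alertname labels, so that it sends a single compact notification.
How to configure:
- Grouping of alerts, the timing of grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.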
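2.2 Inhibition
Inhibition is the concept of suppressing notifications for certain alerts if certain other alerts are already firing.
For example:
- An alert is firing to report that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning this cluster while that alert is firing.
- This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.
How to configure:
- Inhibitions are configured through the Alertmanager configuration file.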
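2.3 Silencing
Silences are a straightforward way to mute alerts for a given time.
A silence is configured based on matchers, just like the routing tree.
- Incoming alerts are checked against the equality or regular-expression matchers of every active silence.
- If an alert matches, no notifications are sent out for it.
How to configure:
- Silences are configured in the web interface of the Alertmanager.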
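2.4 Client behavior
The Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases where Prometheus is not used to send alerts.
2.5 High Availability
Alertmanager supports configuration to create a cluster for high availability. This can be configured using the --cluster-* flags.
It is important not to load balance traffic between Prometheus and its Alertmanagers, but instead to point Prometheus to a list of all Alertmanagers.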
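As a rough illustration of pointing Prometheus at every cluster member, the fragment below sketches the alerting section of a Prometheus configuration file (prometheus.yml). The three host names are hypothetical, and 9093 is assumed to be the port each Alertmanager listens on.

```yaml
# prometheus.yml (fragment): list every Alertmanager directly,
# with no load balancer in between.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical hosts; replace with the real cluster members.
            - alertmanager-1.example.org:9093
            - alertmanager-2.example.org:9093
            - alertmanager-3.example.org:9093
```

With a list like this, Prometheus sends each alert to all of the listed instances, and the clustered Alertmanagers take care of deduplicating the resulting notifications.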
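3 Configuration
Alertmanager is configured via command-line flags and a configuration file.
- The command-line flags configure immutable system parameters. To view all available flags, run alertmanager -h.
- The configuration file defines inhibition rules, notification routing and notification receivers.
The visual editor can assist in building routing trees.
Alertmanager can reload its configuration at runtime.
- If the new configuration is not well-formed, the changes are not applied and an error is logged.
- A configuration reload is triggered by sending a SIGHUP signal to the process or by sending an HTTP POST request to the /-/reload endpoint.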
3.1 Configuration file
Use the --config.file flag to specify which configuration file to load:
./alertmanager --config.file=<filepath>
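The configuration file is written in YAML. Brackets indicate that a parameter is optional. For non-list parameters, the value is set to the specified default.
Generic placeholders are defined as follows:
- <duration>: a duration matching the regular expression ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0), of which a single-unit value matching [0-9]+(ms|[smhdwy]) is the simplest case, for example: 1d, 1h30m, 5m, 10s
- <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
- <labelvalue>: a string of Unicode characters
- <filepath>: a valid path in the current working directory
- <boolean>: a boolean that can take the values true or false
- <string>: a regular string
- <secret>: a regular string that is a secret, such as a password
- <tmpl_string>: a string which is template-expanded before usage
- <tmpl_secret>: a string which is template-expanded before usage and is a secret
- <int>: an integer value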
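To make the placeholder types a little more concrete, here is a small sketch of a route fragment that uses several of them. It is not a complete configuration file, and the receiver name is made up for illustration.

```yaml
route:
  receiver: example-receiver      # <string>; hypothetical receiver name
  group_by: [cluster, alertname]  # list of <labelname>
  group_wait: 10s                 # <duration>, single unit
  group_interval: 5m              # <duration>, single unit
  repeat_interval: 1h30m          # <duration>, units combined from largest to smallest
```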
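The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.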
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: :587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https:///integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https:///v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https:///" ]
  [ wechat_api_url: <string> | default = "https:///cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  # The default HTTP client configuration
  [ http_config: <http_config> ]
  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]
# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]
# The root node of the routing tree.
route: <route>
# A list of notification receivers.
receivers:
  - <receiver> ...
# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]
# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]
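For orientation, a minimal sketch of how these top-level keys might fit together in one small file is shown below. The SMTP host, the addresses and the receiver name are all hypothetical placeholders, and only a few of the optional keys are used.

```yaml
global:
  # Hypothetical SMTP relay and sender address.
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'
  resolve_timeout: 5m

# The root route must name a receiver that is defined below.
route:
  receiver: default-receiver

receivers:
  - name: default-receiver
    email_configs:
      # Inherits smarthost and from address from the global section.
      - to: 'oncall@example.org'
```

Because the email settings live in the global section, each additional email receiver only needs its own to address.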
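3.2 <route>
A route block defines a node in a routing tree and its children. Its optional configuration parameters are inherited from its parent node if not set.
Every alert enters the routing tree at the configured top-level route, which must match all alerts. It then traverses the child nodes.
- If continue is set to false, the alert stops after the first matching child.
- If continue is set to true, the alert continues matching against subsequent siblings.
- If an alert does not match any children of a node, the alert is handled based on the configuration parameters of the current node.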
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]
# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]
# A list of matchers that an alert has to fulfill to match the node.
matchers:
  [ - <matcher> ... ]
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]
# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section.
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]
# Zero or more child routes.
routes:
  [ - <route> ... ]
Example
# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"
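The effect of the continue option can be sketched with a variation on this routing tree. The audit-log receiver is a hypothetical name introduced only for this illustration, and the receivers section is omitted.

```yaml
route:
  receiver: default-receiver
  routes:
    # A child without matchers matches every alert. Because continue is
    # true, matching does not stop here.
    - receiver: audit-log
      continue: true
    # Evaluated next: database alerts are additionally sent to the pager.
    - receiver: database-pager
      matchers:
        - service=~"mysql|cassandra"
```

In this sketch every alert matches the first child, so no alert falls back to default-receiver at the root. Alerts that match neither mysql nor cassandra are handled by audit-log only.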
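3.3 <mute_time_interval>
A mute_time_interval specifies a named interval of time that can be referenced in the routing tree to mute particular routes for particular times of the day.
name: <string>
time_intervals:
  [ - <time_interval> ... ]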
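3.4 <time_interval>
A time_interval contains the actual definition of an interval of time. The syntax supports the following fields:
- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]
All fields are lists.
Within each non-empty list, at least one element must be satisfied to match the field.
If a field is left unspecified, any value matches the field. For an instant of time to match a complete time interval, all fields must match.
All definitions are in UTC; other time zones are currently not supported.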
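3.4.1 time_range
Ranges are inclusive of the start time and exclusive of the end time, which makes it easy to represent times that start or end on hour boundaries.
For example, a start_time of '17:00' and an end_time of '24:00' begin at 17:00 and finish immediately before 24:00.
times:
- start_time: HH:MM
  end_time: HH:MM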
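3.4.2 days_of_month_range
A list of numerical days in the month. Days begin at 1. Negative values that count from the end of the month are also accepted.
For example:
- -1 during January represents January 31.
- ['1:5', '-3:-1'] selects the first five and the last three days of a month. Ranges that extend past the start or end of the month are clamped.
- ['1:31'] specified in February is clamped to an actual end date of 28 or 29, depending on leap years. Ranges are inclusive on both ends.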
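3.4.3 month_range
A list of calendar months identified by a case-insensitive name (for example 'January') or by number, where January = 1. Ranges are also accepted.
For example:
- ['1:3', 'may:august', 'december']. Ranges are inclusive on both ends.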
3.4.4 year_range
A numerical list of years. Ranges are accepted.
For example:
- ['2020:2022', '2030']. Ranges are inclusive on both ends.
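Putting the range types together, the following sketch defines one named mute time interval and references it from a child route. The interval name offhours is hypothetical, and the weekday range syntax ('monday:friday') is taken from the upstream Alertmanager documentation rather than from the text above. The times are quoted so that YAML parsers treat them as strings.

```yaml
mute_time_intervals:
  - name: offhours            # hypothetical interval name
    time_intervals:
      # Weekday evenings, from 17:00 up to but not including 24:00 UTC.
      - times:
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
      # All of Saturday and Sunday.
      - weekdays: ['saturday', 'sunday']
      # The last three days of December in 2020 through 2022.
      - days_of_month: ['-3:-1']
        months: ['december']
        years: ['2020:2022']

route:
  receiver: default-receiver
  routes:
    - receiver: frontend-pager
      matchers:
        - team="frontend"
      # Must match the name defined under mute_time_intervals above.
      mute_time_intervals:
        - offhours
```

Within a single time_interval all specified fields must match, while the route is muted as soon as any one of the listed time intervals matches.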
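3.5 <inhibit_rule>
An inhibition rule mutes an alert (target) matching a set of matchers when an alert (source) exists that matches another set of matchers.
Both target and source alerts must have the same label values for the label names in the equal list.
Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and target alerts, the inhibition rule applies.
To prevent an alert from inhibiting itself, an alert that matches both the target and the source side of a rule cannot be inhibited by alerts for which the same is true (including itself).
It is recommended to choose target and source matchers so that alerts never match both sides. That is much easier to reason about and does not trigger this special case.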
# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]
# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]
# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]
# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]
# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
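A minimal sketch of the cluster-unreachable scenario could look like the rule below. The alert name ClusterUnreachable and the severity values are assumptions chosen for illustration; they are not names defined by Alertmanager.

```yaml
inhibit_rules:
  # Source: a critical alert saying that a whole cluster is unreachable.
  - source_matchers:
      - alertname="ClusterUnreachable"
      - severity="critical"
    # Target: warning-level alerts are muted while the source is firing...
    target_matchers:
      - severity="warning"
    # ...but only if they carry the same value for the cluster label.
    equal: ['cluster']
```

Since the source side requires severity="critical" and the target side requires severity="warning", no alert can match both sides, which keeps the rule clear of the self-inhibition special case.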
3.6 <http_config>
An http_config allows configuring the HTTP client that the receiver uses to communicate with HTTP-based API services.
# Note that `basic_auth` and `authorization` options are mutually exclusive.
# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]
# Optional the `Authorization` header configuration.
authorization:
  # Sets the authentication type.
  [ type: <string> | default: Bearer ]
  # Sets the credentials. It is mutually exclusive with
  # `credentials_file`.
  [ credentials: <secret> ]
  # Sets the credentials with the credentials read from the configured file.
  # It is mutually exclusive with `credentials`.
  [ credentials_file: <filename> ]
# Optional OAuth 2.0 configuration.
# Cannot be used at the same time as basic_auth or authorization.
oauth2:
  [ <oauth2> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]
# Configures the TLS settings.
tls_config:
  [ <tls_config> ]
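3.6.1 <oauth2>
OAuth 2.0 authentication using the client credentials grant type.
Alertmanager fetches an access token from the specified endpoint with the given client ID and secret.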
client_id: <string>
[ client_secret: <secret> ]
# Read the client secret from a file.
# It is mutually exclusive with `client_secret`.
[ client_secret_file: <filename> ]
# Scopes for the token request.
scopes:
  [ - <string> ... ]
# The URL to fetch the token from.
token_url: <string>
# Optional parameters to append to the token URL.
endpoint_params:
  [ <string>: <string> ... ]
3.6.2 <tls_config>
A tls_config allows configuring TLS connections.
# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]
# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]
# ServerName extension to indicate the name of the server.
# /html/rfc4366#section-3.1
[ server_name: <string> ]
# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]
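As an illustration of how http_config, oauth2 and tls_config nest, here is a sketch of a webhook receiver that authenticates with OAuth 2.0 and validates the server against a private CA. Every URL, file path, scope and name in it is hypothetical. Only oauth2 is set, because it cannot be combined with basic_auth or authorization.

```yaml
receivers:
  - name: ops-webhook                       # hypothetical receiver name
    webhook_configs:
      - url: 'https://hooks.example.org/alerts'
        http_config:
          oauth2:
            client_id: alertmanager
            # Read the secret from a file instead of inlining it.
            client_secret_file: /etc/alertmanager/oauth2-secret
            scopes:
              - alerts.write
            token_url: 'https://auth.example.org/oauth2/token'
            endpoint_params:
              audience: hooks.example.org
          follow_redirects: false
          tls_config:
            ca_file: /etc/alertmanager/ca.pem
            server_name: hooks.example.org
```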
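3.7 <receiver>
A receiver is a named configuration of one or more notification integrations.
Note: As part of lifting the past moratorium on new receivers, it was agreed that, in addition to the existing requirements, new notification integrations are required to have a committed maintainer with push access.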
# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
3.7.1 <email_config>
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]
# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]
# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]
# TLS configuration.
tls_config:
  [ <tls_config> ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
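A short sketch of an email receiver built from these fields follows. The receiver name and address are hypothetical, and the example assumes that the SMTP smarthost and sender address are already set in the global section so that they do not need to be repeated here.

```yaml
receivers:
  - name: team-email                        # hypothetical receiver name
    email_configs:
      - to: 'oncall@example.org'
        # Also send a mail when the alerts are resolved.
        send_resolved: true
        headers:
          # Overrides the subject set by the notification implementation.
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
```

The Subject value is a <tmpl_string>, so it is template-expanded for each notification before the mail is sent.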
3.8 Other
For everything else, see the official documentation:
/docs/alerting/latest/configuration/