Prometheus (5): Alertmanager Configuration and Prometheus Configuration Notes

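1 Overview

Alerting with Prometheus is split into two independent parts.

Alerting rules in the Prometheus server send alerts to an Alertmanager. The Alertmanager then manages those alerts.
This includes:

silencing,
inhibition,
aggregation,
and sending out notifications through channels such as email, PagerDuty, and HipChat.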

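2 Alertmanager

2.1 Grouping

Grouping collects alerts of a similar nature into a single notification.

This is especially useful during larger outages, when many systems fail at once and hundreds to thousands of alerts may fire at the same time.

For example:

Dozens or hundreds of instances of a service are running in a cluster when a network partition occurs.
More than half of the service instances temporarily cannot reach the database. The alerting rules in Prometheus are configured to send one alert for each service instance that cannot communicate with the database, so hundreds of alerts are sent to Alertmanager.
As a user, you want to receive only a single page while still being able to see exactly which service instances are affected. Without grouping, these alerts would arrive as many scattered notifications.
Alertmanager can therefore be configured to group alerts by their cluster and alertname labels, so that it sends a single compact notification.

How to configure:

Alert grouping, the timing of grouped notifications, and the receivers of those notifications are configured by a routing tree in the configuration file.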
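
As a rough illustration of the grouping described above, the following fragment groups alerts by their cluster and alertname labels. It is a minimal sketch, not a complete configuration: the receiver name 'team-pager' is hypothetical and would have to be defined under receivers, and the timing values are simply the documented defaults written out.

```yaml
# Fragment of an Alertmanager configuration file (sketch).
route:
  # Hypothetical receiver name; it must exist in the receivers section.
  receiver: 'team-pager'
  # Alerts sharing the same cluster and alertname end up in one notification.
  group_by: ['cluster', 'alertname']
  # Wait before sending the first notification for a new group.
  group_wait: 30s
  # Wait before notifying about alerts newly added to an existing group.
  group_interval: 5m
  # Wait before re-sending a notification that was already sent.
  repeat_interval: 4h
```

With a route like this, the hundreds of per-instance database alerts from the example would be batched into one notification per cluster and alert name.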

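2.2 Inhibition

Inhibition is the concept of suppressing notifications for certain alerts when certain other alerts are already firing.

For example:

An alert is firing that reports that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts concerning this cluster while that alert is firing.
This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual problem.

How to configure:

Inhibitions are configured through the Alertmanager configuration file.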

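2.3 Silencing

A silence is a simple way to mute alerts for a given period of time.

A silence is configured based on matchers, just like the routing tree.

Incoming alerts are checked against the equality and regular-expression matchers of every active silence.
If an alert matches, no notifications are sent to receivers for that alert.

How to configure:

Silences are configured in the Alertmanager web interface.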

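2.4 Client behavior

Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases in which Prometheus is not used to send alerts.

2.5 High Availability

Alertmanager supports configuration to create a cluster for high availability. This is configured with the --cluster-* flags.

It is important not to load balance traffic between Prometheus and its Alertmanagers, but instead to point Prometheus to a list of all Alertmanagers.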
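
One way to follow this advice is to list every Alertmanager instance in the Prometheus configuration file rather than a single load-balanced address. The fragment below is a sketch under assumptions: the three host names are hypothetical, and 9093 is assumed to be the port the Alertmanagers listen on.

```yaml
# Fragment of prometheus.yml (sketch).
alerting:
  alertmanagers:
    - static_configs:
        # Hypothetical host names: list each Alertmanager in the cluster
        # individually instead of one load balancer address.
        - targets:
            - 'alertmanager-1.example.org:9093'
            - 'alertmanager-2.example.org:9093'
            - 'alertmanager-3.example.org:9093'
```

Prometheus then sends each alert to all listed instances, and the Alertmanager cluster takes care of deduplicating notifications.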

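3 Configuration

Alertmanager is configured through command-line flags and a configuration file.

Command-line flags configure immutable system parameters.

To view all available command-line flags, run alertmanager -h.
The configuration file defines inhibition rules, notification routing, and notification receivers.

The visual editor can help with building routing trees.

Alertmanager can reload its configuration file at runtime.

If the new configuration is not well-formed, the changes are not applied and an error is logged and written to the terminal.
A configuration reload is triggered by sending a SIGHUP signal to the process or by sending an HTTP POST request to the /-/reload endpoint.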

3.1 Configuration file

To specify which configuration file to load, use the - flag:

./alertmanager -=

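The configuration file is written in YAML format. Brackets indicate that a parameter is optional. For non-list parameters, the value is set to the specified default.

The generic placeholders are defined as follows:

<duration>: a duration matching the regular expression ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0), for example 1d, 1h30m, 5m, 10s
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of Unicode characters
<filepath>: a valid path in the current working directory
<boolean>: a boolean that can take the values true or false
<string>: a regular string
<secret>: a regular string that is a secret, such as a password
<tmpl_string>: a string that is template-expanded before usage
<tmpl_secret>: a string that is template-expanded before usage and that is a secret
<int>: an integer value

The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.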

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: :587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ slack_api_url_file: <filepath> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https:///integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https:///v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https:///" ]
  [ wechat_api_url: <string> | default = "https:///cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

# A list of mute time intervals for muting routes.
mute_time_intervals:
  [ - <mute_time_interval> ... ]
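
To show how the top-level sections of the schema above fit together, here is a small complete configuration file. It is only a sketch: the SMTP host, the addresses, and the receiver name are hypothetical placeholders, and a real setup would typically also need SMTP authentication settings.

```yaml
# Minimal alertmanager.yml (sketch with hypothetical values).
global:
  # Default shown in the schema above, written out explicitly.
  resolve_timeout: 5m
  # Hypothetical SMTP settings used as defaults by all email receivers.
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'

# Root of the routing tree: every alert ends up here unless a child matches.
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']

# Each receiver referenced by a route must be defined here.
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'oncall@example.org'
```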

3.2 `<route>`

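A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.

Every alert enters the routing tree at the configured top-level node, which must match all alerts. The alert then traverses the child nodes.

If continue is set to false, traversal stops after the first matching child.
If continue is set to true, the alert continues matching against subsequent sibling nodes.
If an alert does not match any children of a node, it is handled based on the configuration parameters of the current node.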

[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# DEPRECATED: Use matchers below.
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use matchers below.
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that an alert has to fulfill to match the node. 
matchers:
  [ - <matcher> ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section. 
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  [ - <string> ...]

# Zero or more child routes.
routes:
  [ - <route> ... ]

Example:

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    matchers:
    - service=~"mysql|cassandra"
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    matchers:
    - team="frontend"

3.3 `<mute_time_interval>`

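A mute_time_interval specifies a named interval of time that can be referenced in the routing tree to mute particular routes at particular times of the day.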

name: <string>
time_intervals:
  [ - <time_interval> ... ]

3.4 `<time_interval>`

A time_interval contains the actual definition of an interval of time. The syntax supports the following fields:

- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]

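All fields are lists.
Within each non-empty list, at least one element must be satisfied for the field to match.
If a field is left unspecified, any value matches it. For an instant in time to match a complete time interval, all fields must match.
All definitions are in UTC; other time zones are not currently supported.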

3.4.1 time_range

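A range includes its start time and excludes its end time, which makes it easy to represent times that start or end on hour boundaries.

For example, start_time: '17:00' and end_time: '24:00' begins at 17:00 and finishes immediately before 24:00.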

    times:
    - start_time: HH:MM
      end_time: HH:MM

3.4.2 days_of_month_range

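A list of numerical days in the month. Days begin at 1. Negative values that count back from the end of the month are also accepted.

For example:

-1 during January represents January 31.
['1:5', '-3:-1']. A range that extends past the start or end of the month is clamped.
['1:31'], specified during February, clamps the actual end date to 28 or 29 depending on leap years. Ranges are inclusive on both ends.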

3.4.3 month_range

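A list of calendar months identified by a case-insensitive name (for example 'January') or by number, where January = 1.

Ranges are also accepted.

For example:

['1:3', 'may:august', 'december']. Ranges are inclusive on both ends.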

3.4.4 year_range

A numerical list of years. Ranges are accepted.

For example:

['2020:2022', '2030']. Ranges are inclusive on both ends.
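
Putting the fields above together, a mute time interval and a route that references it might look like the following sketch. The interval name 'offhours', the receiver names, and the team label are hypothetical, and the weekday values assume the weekday_range syntax of lowercase day names with ':' ranges.

```yaml
# Sketch: mute a child route outside working hours (all times in UTC).
mute_time_intervals:
  - name: 'offhours'
    time_intervals:
      # Weekday evenings: 17:00 inclusive up to, but not including, 24:00.
      - times:
          - start_time: '17:00'
            end_time: '24:00'
        weekdays: ['monday:friday']
      # The whole weekend: no times field, so any time of day matches.
      - weekdays: ['saturday', 'sunday']

route:
  receiver: 'default-receiver'
  routes:
    - receiver: 'dev-pager'
      matchers:
        - team="dev"
      # Reference the interval by name. The root route itself cannot be muted.
      mute_time_intervals:
        - 'offhours'
```

Within one time_interval all specified fields must match, while the two list entries under time_intervals are alternatives.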

3.5 `<inhibit_rule>`

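An inhibition rule mutes an alert (the target) that matches one set of matchers when an alert (the source) exists that matches another set of matchers.

Both the target and the source alerts must have the same label values for the label names listed in equal.

Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and the target alerts, the inhibition rule applies.

To prevent an alert from inhibiting itself, an alert that matches both the target and the source side of a rule cannot be inhibited by alerts for which the same is true (including itself).

It is recommended to choose target and source matchers so that alerts never match both sides. This is much easier to reason about and does not trigger this special case.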

# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target 
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have 
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
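
The cluster-unreachable scenario from section 2.2 could be expressed roughly as follows. This is a sketch under assumptions: the alert name 'ClusterUnreachable' and the severity values are hypothetical, and the alerts are assumed to carry a cluster label.

```yaml
# Sketch: while a cluster is reported unreachable, mute its warning alerts.
inhibit_rules:
  - source_matchers:
      # Hypothetical alert that signals the whole cluster is down.
      - alertname="ClusterUnreachable"
      - severity="critical"
    target_matchers:
      # Alerts to be muted while a matching source alert is firing.
      - severity="warning"
    # Only inhibit when source and target refer to the same cluster.
    equal: ['cluster']
```

Because the source side requires severity="critical" and the target side requires severity="warning", no alert can match both sides, which follows the recommendation above.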

3.6 `<http_config>`

An http_config allows configuring the HTTP client that the receiver uses to communicate with HTTP-based API services.

# Note that `basic_auth` and `authorization` options are mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Optional the `Authorization` header configuration.
authorization:
  # Sets the authentication type.
  [ type: <string> | default: Bearer ]
  # Sets the credentials. It is mutually exclusive with
  # `credentials_file`.
  [ credentials: <secret> ]
  # Sets the credentials with the credentials read from the configured file.
  # It is mutually exclusive with `credentials`.
  [ credentials_file: <filename> ]

# Optional OAuth 2.0 configuration.
# Cannot be used at the same time as basic_auth or authorization.
oauth2:
  [ <oauth2> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <bool> | default = true ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]

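3.6.1 `<oauth2>`

OAuth 2.0 authentication using the client credentials grant type.

Alertmanager fetches an access token from the specified endpoint with the given client ID and client secret.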

client_id: <string>
[ client_secret: <secret> ]

# Read the client secret from a file.
# It is mutually exclusive with `client_secret`.
[ client_secret_file: <filename> ]

# Scopes for the token request.
scopes:
  [ - <string> ... ]

# The URL to fetch the token from.
token_url: <string>

# Optional parameters to append to the token URL.
endpoint_params:
  [ <string>: <string> ... ]
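
As an illustration of where an oauth2 block sits, the sketch below attaches it to the http_config of a webhook receiver. Every value is a hypothetical placeholder: the URLs, the client ID, the scope, the file path, and the extra token parameter.

```yaml
# Sketch: webhook receiver whose HTTP client authenticates with OAuth 2.0.
receivers:
  - name: 'ticket-webhook'
    webhook_configs:
      - url: 'https://tickets.example.org/hooks/alertmanager'
        http_config:
          oauth2:
            client_id: 'alertmanager'
            # Read the secret from a file instead of inlining client_secret.
            client_secret_file: '/etc/alertmanager/secrets/oauth2-client-secret'
            token_url: 'https://auth.example.org/oauth2/token'
            scopes:
              - 'alerts:write'
            # Extra parameters appended to the token request.
            endpoint_params:
              audience: 'tickets'
```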

3.6.2 `<tls_config>`

A tls_config allows configuring TLS connections.

# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# /html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]
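
The following sketch combines the authorization and tls_config options from this section in one http_config. The receiver name, URL, and all file paths are hypothetical.

```yaml
# Sketch: webhook receiver using a bearer token and mutual TLS.
receivers:
  - name: 'internal-webhook'
    webhook_configs:
      - url: 'https://hooks.example.org/alerts'
        http_config:
          # authorization cannot be combined with basic_auth or oauth2.
          authorization:
            type: Bearer
            credentials_file: '/etc/alertmanager/secrets/webhook-token'
          follow_redirects: false
          tls_config:
            # CA used to validate the server certificate.
            ca_file: '/etc/alertmanager/tls/internal-ca.pem'
            # Client certificate and key presented to the server.
            cert_file: '/etc/alertmanager/tls/client.pem'
            key_file: '/etc/alertmanager/tls/client-key.pem'
            server_name: 'hooks.example.org'
```

Leaving insecure_skip_verify at its default of false keeps server certificate validation enabled.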

3.7 `<receiver>`

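A receiver is a named configuration of one or more notification integrations.

Note: As part of lifting the past moratorium on new receivers, it was agreed that, in addition to the existing requirements, new notification integrations will be required to have a committed maintainer with push access.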

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

3.7.1 `<email_config>`

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
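
A receiver that uses these email options might look like the sketch below. The addresses, SMTP host, user name, and password are hypothetical placeholders, and the Subject template is just one possible choice.

```yaml
# Sketch: email receiver overriding the global SMTP defaults.
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'oncall@example.org'
        from: 'alertmanager@example.org'
        smarthost: 'smtp.example.org:587'
        auth_username: 'alertmanager'
        # Placeholder only; keep real passwords out of version control.
        auth_password: 'replace-me'
        require_tls: true
        # Also notify when the alerts in the group are resolved.
        send_resolved: true
        headers:
          # Header values are template-expanded before sending.
          Subject: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
```

Any option omitted here falls back to the corresponding global.smtp_* value.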

3.8 Other

For everything else, see the official documentation:
/docs/alerting/latest/configuration/
