Kubernetes: collecting cluster events to monitor cluster behavior

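I. Overview

Our production k8s deployment has survived the test of Double 11 (Singles' Day). Thanks to the network and monitoring optimizations we made beforehand, it got through the peak smoothly and performed well. First, a brief overview of how we use Kubernetes: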

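Host OS: Ubuntu 16.04 (kernel upgraded to 4.17)

Kubernetes version: 1.13.2

Network plugin: calico (BGP mode + BGP route reflector)

kube-proxy: ipvs mode

Monitoring: prometheus + grafana

Logging: fluentd + ES

Metrics: metrics-server

HPA: cpu + memory

Alerting: DingTalk

CI/CD: gitlab-ci/gitlab-runner

Application management: helm, chartmuseum (I don't recommend using helm directly; helm charts are hard to read and the learning curve is steep)

Because k8s and bare-metal environments coexist, the two networks have to be bridged to provide access: kube-gateway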

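Some parts involve company-internal details that I can't write about, but most of it is covered in my earlier blog posts; have a look if you are interested.

My own reflections:

After the k8s cluster had been running in production for a while, I realized I had no clear picture of what was changing inside it: a pod being rescheduled, image GC failing on a node, an HPA being triggered, and so on. All of this can be obtained from events, which record the state changes of every kind of resource in the cluster, but events are not stored permanently. Collecting and analyzing events therefore lets us understand what is changing inside the whole cluster.

After some exploration I found an open-source project, eventrouter, that collects events. I modified it to fit our scenario and renamed it eventrouter-kafka (https://github.com/cuishuaigit/eventrouter-kafka). With the modified configuration it ships events straight to Kafka instead of requiring assorted settings; I found the original configuration somewhat cumbersome. Our logs also go through a Kafka queue, which reduces the write pressure on ES. The current events collection pipeline: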

eventrouter ----> kafka ----> logstash (filter, parse) ----> ES ----> elastalert ----> DingTalk

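Adding the events collection above brought our k8s cluster one step closer to complete.

II. The process in brief

1. Deploy eventrouter

eventrouter is written in Go, so you can extend it to suit your own needs. Deployment is simple; see https://github.com/cuishuaigit/eventrouter-kafka. I won't go into detail here.

2. Kafka cluster

See: https://github.com/cuishuaigit/k8s-kafka

3. Logstash

Download the matching version of Logstash; installation page: https://www.elastic.co/guide/en/logstash/6.5/installing-logstash.html

Then configure it. Here is my test configuration: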

input {
  kafka {
    bootstrap_servers => ["kafka-0.kafka-svc.kafka.svc.cluster.local:9092,kafka-1.kafka-svc.kafka.svc.cluster.local:9092,kafka-2.kafka-svc.kafka.svc.cluster.local:9092"]
    client_id => "eventrouter-prod"
    #auto_offset_reset => "latest"
    group_id => "eventrouter"
    consumer_threads =>
    #decorate_events => true
    id => "eventrouter"
    topics => ["eventrouter"]
  }
}

filter {
  if [message] =~ 'DNSConfigForming' {
    drop { }
  }
  json {
    source => "message"
  }
  mutate {
    remove_field => [ "message", "old_event" ]
  }
}

output {
  elasticsearch {
    hosts => "10.4.9.28:9200"
    index => "eventrouter-%{+YYYY-MM-dd}"
  }
}
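To make the filter stage easier to reason about, here is a minimal Python sketch of what it does to one Kafka message: drop anything mentioning DNSConfigForming, parse the JSON payload, and remove the message and old_event fields. It also shows how the daily index name is derived. The sample message is hypothetical; its shape (verb, event, old_event) is an assumption inferred from the fields used in the Logstash and elastalert configuration in this post. The function names are mine, not part of Logstash.

```python
import json
from datetime import datetime, timezone


def filter_event(raw_message):
    """Mirror the Logstash filter block for a single raw Kafka message."""
    # if [message] =~ 'DNSConfigForming' { drop { } }
    if "DNSConfigForming" in raw_message:
        return None
    # json { source => "message" }: parsed keys become top-level fields
    doc = json.loads(raw_message)
    # mutate { remove_field => ["message", "old_event"] }
    # Only top-level fields are removed; event.message is nested and stays.
    doc.pop("message", None)
    doc.pop("old_event", None)
    return doc


def index_name(timestamp):
    """Daily index name, as in index => "eventrouter-%{+YYYY-MM-dd}" (UTC)."""
    return "eventrouter-" + timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")


# Hypothetical sample message, for illustration only.
sample = json.dumps({
    "verb": "UPDATED",
    "event": {
        "type": "Warning",
        "reason": "BackOff",
        "message": "Back-off restarting failed container",
        "involvedObject": {"kind": "Pod", "name": "demo-pod", "namespace": "default"},
    },
    "old_event": {"type": "Warning", "reason": "BackOff"},
})

if __name__ == "__main__":
    print(filter_event(sample))
    print(index_name(datetime(2019, 1, 2, 23, 30, tzinfo=timezone.utc)))
```

Dropping old_event keeps each document to the current state of the event only, so every update is indexed without a second copy of the previous state.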

4. ES

version: ''
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.5.
    container_name: elasticsearch
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4096m -Xmx4096m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /data/es1:/usr/share/elasticsearch/data
      - /data/backups:/usr/share/elasticsearch/backups
      - /data/longterm_backups:/usr/share/elasticsearch/longterm_backups
      - ./config/jvm.options:/usr/share/elasticsearch/config/jvm.options
    ports:
      - "9200:9200"
    networks:
      - esnet
#  elasticsearch2:
#    image: docker.elastic.co/elasticsearch/elasticsearch:6.5.
#    container_name: elasticsearch2
#    environment:
#      - cluster.name=docker-cluster
#      - bootstrap.memory_lock=true
#      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
#      - "discovery.zen.ping.unicast.hosts=elasticsearch"
#    ulimits:
#      memlock:
#        soft: -1
#        hard: -1
#    volumes:
#      - /data/es2:/usr/share/elasticsearch/data
#    networks:
#      - esnet
  kibana:
    image: docker.elastic.co/kibana/kibana:6.5.
    container_name: kibana
    environment:
      SERVER_NAME: kibana
      SERVER_HOST: "0.0.0.0"
      ELASTICSEARCH_URL: http://elasticsearch:9200
      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: "true"
    volumes:
      - /data/plugin:/usr/share/kibana/plugin
      - /tmp/:/etc/archives
    ports:
      - "5601:5601"
    networks:
      - esnet
    depends_on:
      - elasticsearch
networks:
  esnet:
    driver: bridge

cat config/jvm.options

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms2g
-Xmx2g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

# temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
10-:-XX:UseAVX=2

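5. elastalert

For deployment, see https://github.com/Yelp/elastalert.git

Usage:

mkdir /etc/elastalert

Copy config.yaml.example from the cloned elastalert directory into the directory created above:

cp elastalert/config.yaml.example /etc/elastalert/config.yaml

Only the following need to be changed:

rules_folder, es_host, and es_port; if Elasticsearch has a username and password set, those need to be changed as well.

Create the rules directory:

mkdir /etc/elastalert/rules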

6. DingTalk

To create the robot and obtain its token, see my other blog posts. Then download the DingTalk plugin: https://github.com/xuyaoqiang/elastalert-dingtalk-plugin

Copy elastalert_modules into the /etc/elastalert directory:

cp -r elastalert-dingtalk-plugin/elastalert_modules /etc/elastalert/
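For orientation, this is roughly what a "text" message to a DingTalk custom robot looks like on the wire: a JSON body with msgtype and text.content, POSTed to the webhook URL. The sketch below is not the plugin's code, only a minimal illustration using the Python standard library; the function names and the timeout value are my own choices.

```python
import json
import urllib.request


def build_text_message(content):
    """Build the JSON body of a DingTalk robot "text" message."""
    return {"msgtype": "text", "text": {"content": content}}


def send_text_message(webhook_url, content, timeout=5):
    """POST a text message to the robot webhook and return the decoded reply."""
    body = json.dumps(build_text_message(content)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the payload; call send_text_message() with your own
    # webhook URL (including its access_token) to actually send it.
    print(json.dumps(build_text_message("elastalert test message")))
```

Sending one test message by hand like this is a quick way to confirm that the token works before wiring the robot into elastalert.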

Example rule:

# Alert when the rate of events exceeds a threshold

# (Optional)
# Elasticsearch host
es_host: 10.2.9.28

# (Optional)
# Elasticsearch port
es_port: 9200

# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True

# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# (Required)
# Rule name, must be unique
name: Other event frequency rule

# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur within timeframe time
type: frequency

# (Required)
# Index to search, wildcard supported
index: eventrouter-*

# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 

# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  #hours:
  minutes:

# (Required)
# A list of Elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
#- term:
#    some_field: "some_value"
- query:
    query_string:
      query: "event.type: Warning NOT event.involvedObject.kind: Node"

# (Required)
# The alert to use when a match is found
#smtp_host: smtp.exmail.qq.com
#smtp_port:
#smtp_auth_file: /etc/elastalert/smtp_auth_file.yaml
#email_reply_to: ci@qq.com
#from_addr: ci@qq.com

realert:
  minutes:

exponential_realert:
  hours: 

alert:
#- "email"
- "elastalert_modules.dingtalk_alert.DingTalkAlerter"

dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token=47194e6904c6e3133a9080980984444c8e5d7745e1f76c12cefa99c8c8ac718dd88d4c"
dingtalk_msgtype: "text"

alert_text_type: alert_text_only
alert_text: "
   ====elastalert message====\n
   EventTime>>:  {}\n
   Event_involvedObject_name>>:  {}\n
   Event_involvedObject_kind>>:  {}\n
   Event_involvedObject_namespace>>:  {}\n
   Message>>:  {}\n
   Event_reason>>: {}\n
   verb>>: {}
"
alert_text_args:
- "@timestamp"
- event.involvedObject.name
- event.source.component
- event.involvedObject.namespace
- event.message
- event.reason
- verb

# (required, email specific)
# a list of email addresses to send alerts to
#email:
#- "ci@qq.com"
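The rule above combines two ideas: a filter (Warning events whose involved object is not a Node) and a frequency threshold (num_events matches within timeframe). The Python sketch below illustrates both in simplified form. It is not elastalert's implementation: the class and function names are mine, the filter uses exact string comparison where query_string matching is analyzer-dependent, timestamps are assumed to arrive in order, and the values 3 and 5 minutes are hypothetical because the rule above does not show its num_events and timeframe values.

```python
from collections import deque
from datetime import datetime, timedelta


def matches_filter(doc):
    """Approximate: event.type: Warning NOT event.involvedObject.kind: Node"""
    event = doc.get("event", {})
    kind = event.get("involvedObject", {}).get("kind")
    return event.get("type") == "Warning" and kind != "Node"


class FrequencyRule:
    """Fire when num_events matches fall within a sliding timeframe."""

    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe
        self.window = deque()  # timestamps of recent matches, oldest first

    def add(self, timestamp):
        """Record one match; return True if the threshold is reached."""
        self.window.append(timestamp)
        # Discard matches that are older than the timeframe.
        while timestamp - self.window[0] > self.timeframe:
            self.window.popleft()
        if len(self.window) >= self.num_events:
            self.window.clear()  # start counting again after an alert
            return True
        return False


if __name__ == "__main__":
    # Hypothetical threshold: 3 matching events within 5 minutes.
    rule = FrequencyRule(num_events=3, timeframe=timedelta(minutes=5))
    start = datetime(2019, 1, 1, 12, 0, 0)
    doc = {"event": {"type": "Warning", "involvedObject": {"kind": "Pod"}}}
    for offset in (0, 1, 2):
        if matches_filter(doc):
            print(offset, rule.add(start + timedelta(minutes=offset)))
```

Excluding Node objects keeps node-level warnings out of this rule, which is why the rule is named "Other event frequency rule"; node events can then be handled by a separate rule with its own threshold.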

My customized alert message format:

alert:
#- "email"
- "elastalert_modules.dingtalk_alert.DingTalkAlerter"

dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token=47194e6904c6e3133a9080980984444c8e5d7745e1f76c12cefa99c8c8ac718dd88d4c"
dingtalk_msgtype: "text"

alert_text_type: alert_text_only
alert_text: "
   ====elastalert message====\n
   EventTime>>:  {}\n
   Event_involvedObject_name>>:  {}\n
   Event_involvedObject_kind>>:  {}\n
   Event_involvedObject_namespace>>:  {}\n
   Message>>:  {}\n
   Event_reason>>: {}\n
   verb>>: {}
"
alert_text_args:
- "@timestamp"
- event.involvedObject.name
- event.source.component
- event.involvedObject.namespace
- event.message
- event.reason
- verb
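The {} placeholders in alert_text are filled positionally from alert_text_args, where each argument is a (possibly dotted) field name looked up in the matching document. A minimal Python sketch of that substitution follows. It is an illustration rather than elastalert's code: the helper names, the shortened template, the sample document, and the placeholder used for missing fields are all assumptions of mine.

```python
def lookup(doc, dotted_key, missing="<MISSING VALUE>"):
    """Resolve a field such as event.involvedObject.name in a nested dict."""
    if dotted_key in doc:  # plain top-level keys such as "@timestamp" or "verb"
        return doc[dotted_key]
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return missing
        current = current[part]
    return current


def render_alert(alert_text, alert_text_args, doc):
    """Fill the {} placeholders in order with the looked-up field values."""
    values = [lookup(doc, key) for key in alert_text_args]
    return alert_text.format(*values)


# Shortened template and hypothetical document, for illustration only.
TEMPLATE = "EventTime>>: {}\nMessage>>: {}\nverb>>: {}"
ARGS = ["@timestamp", "event.message", "verb"]
DOC = {
    "@timestamp": "2019-01-02T03:04:05Z",
    "verb": "ADDED",
    "event": {"message": "Back-off restarting failed container"},
}

if __name__ == "__main__":
    print(render_alert(TEMPLATE, ARGS, DOC))
```

Because substitution is positional, the order of alert_text_args has to follow the order of the placeholders. In the format above, the third placeholder, labelled Event_involvedObject_kind, is filled from event.source.component as listed.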

For details, see the official documentation: https://elastalert.readthedocs.io/en/latest/recipes/writing_filters.html#writingfilters
