Notes on Troubleshooting Common Problems in a Kubernetes Cluster

Date: 2022-12-12 18:03:57


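Preface


  • I am learning K8s and wrote these notes to help me remember the material.
  • The theory in this article comes from:
  • 《Kubernetes权威指南:从Docker到Kubernetes实践全接触》 (The Definitive Guide to Kubernetes), 4th edition, Chapter 11
  • This post collects my study notes on that chapter.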

The art of every age strives to give a language to the sacred, silent longings within us. (Hermann Hesse, Peter Camenzind)


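There is no concrete demo here, so the article is somewhat thin: it reads like a set of guiding ideas and can be dry. For that reason the useful part comes first: a list of sites for looking up problems. I hope to add related cases later. If you are short on time and need to solve a problem, search the platforms below using your problem description.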

Sites for looking up problems

Monitoring, logging, and debugging topics on the official Kubernetes site: https://kubernetes.io/docs/tasks/debug-application-cluster/

Official Kubernetes forum: https://discuss.kubernetes.io/ (this one requires *)

Kubernetes issue list on GitHub: https://github.com/kubernetes/kubernetes/issues

Kubernetes discussions on the * site: https://*.com/questions/tagged/kubernetes

Kubernetes Slack chat group: https://kubernetes.slack.com/ (requires a Google account)

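Troubleshooting common problems in a Kubernetes cluster

To track down and discover problems in containerized applications running in a Kubernetes cluster, we commonly use the following methods.

Inspect the current runtime information of Kubernetes objects, especially the Events associated with them. These Events record the related subject, the time of first occurrence, the time of the most recent occurrence, the occurrence count, and the reason, all of which are valuable for troubleshooting. By examining an object's runtime data we can also spot obvious problems such as parameter errors, association errors, and abnormal states. Because many kinds of objects in Kubernetes are related to one another, this step may involve checking several related objects.

For problems with services and containers, we may need to go inside the container to diagnose the fault; in that case the container's runtime logs help pinpoint the problem.

For some complex problems, such as global issues like Pod scheduling, we may need to combine the Kubernetes service logs from every node in the cluster. For example, collect the kube-apiserver, kube-scheduler, and kube-controller-manager service logs on the Master, and the kubelet and kube-proxy service logs on each Node.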

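Viewing system Events

After creating a Pod in a Kubernetes cluster, we can list Pods with the kubectl get pods command, but that command shows only limited information. Kubernetes provides the kubectl describe pod command to view the details of a single Pod.

The kubectl describe pod command shows the configuration and status recorded when the Pod was created, and also the most recent Events related to that Pod. The Event information is very useful for troubleshooting.

If a Pod stays in the Pending state, kubectl describe can reveal the specific cause:

  • No Node is available for scheduling; possible causes include a Pod port conflict or Taints on the nodes.
  • Resource quota management is enabled, but the target node currently being scheduled to does not have enough resources.
  • The image failed to download, and so on.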
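Before reading Events, it helps to know which Pods are stuck. The following is a minimal sketch, not something from the book: pending_pods is a helper name I made up, and it assumes the default single-namespace column layout of kubectl get pods (NAME READY STATUS RESTARTS AGE).

```shell
# pending_pods: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods` output on stdin and prints the names of Pods
# whose STATUS column (3rd field in the default layout) is Pending.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n <namespace> | pending_pods
# Then read the Events of one of the Pods it prints:
#   kubectl describe pod <pod-name> -n <namespace>
```

With --all-namespaces a NAMESPACE column is added in front, so the field numbers in the awk program would have to shift by one.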

View Pod details

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl describe pods etcd-vms81.liruilongs.github.io -n kube-system
# Basic information recorded when the Pod was created
Name: etcd-vms81.liruilongs.github.io
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: vms81.liruilongs.github.io/192.168.26.81
Start Time: Tue, 25 Jan 2022 21:54:20 +0800
Labels: component=etcd
tier=control-plane
Annotations: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.81:2379
kubernetes.io/config.hash: 1502584f9ab841720212d4341d723ba2
kubernetes.io/config.mirror: 1502584f9ab841720212d4341d723ba2
kubernetes.io/config.seen: 2021-12-13T00:01:04.834825537+08:00
kubernetes.io/config.source: file
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running # current running status of the Pod
IP: 192.168.26.81
IPs:
IP: 192.168.26.81
Controlled By: Node/vms81.liruilongs.github.io
Containers:
etcd: # basic information about the container
Container ID: docker://20d99a98a4c2590e8726916932790200ba1cf93c48f3c84ca1298ffdcaa4f28a
Image: registry.aliyuncs.com/google_containers/etcd:3.5.0-0
Image ID: docker-pullable://registry.aliyuncs.com/google_containers/etcd@sha256:9ce33ba33d8e738a5b85ed50b5080ac746deceed4a7496c550927a7a19ca3b6d
Port: <none>
Host Port: <none>
Command: # startup arguments of the container
etcd
--advertise-client-urls=https://192.168.26.81:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--initial-advertise-peer-urls=https://192.168.26.81:2380
--initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
--listen-metrics-urls=http://127.0.0.1:2381
--listen-peer-urls=https://192.168.26.81:2380
--name=vms81.liruilongs.github.io
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
State: Running
Started: Tue, 25 Jan 2022 21:54:20 +0800
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 24 Jan 2022 08:35:16 +0800
Finished: Tue, 25 Jan 2022 21:53:56 +0800
Ready: True
Restart Count: 128
Requests: # resources requested by the container
cpu: 100m
memory: 100Mi
Liveness: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=8
Startup: http-get http://127.0.0.1:2381/health delay=10s timeout=15s period=10s #success=1 #failure=24
Environment: <none>
Mounts:
/etc/kubernetes/pki/etcd from etcd-certs (rw)
/var/lib/etcd from etcd-data (rw)
Conditions: # the series of checks the Pod goes through after startup
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes: # volumes mapped from the host; here they are defined as hostPath volumes shared with the host
etcd-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki/etcd
HostPathType: DirectoryOrCreate
etcd-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/etcd
HostPathType: DirectoryOrCreate
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
Events: <none>
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
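The describe output is long, and when troubleshooting the Events at the end are usually what I want first. A small sketch for cutting the output down to that part follows; events_section is a hypothetical helper name, and it assumes the Events: line starts at the beginning of a line, as it does in kubectl describe output.

```shell
# events_section: hypothetical helper, not part of kubectl.
# Prints everything from the line starting with "Events:" to the end of stdin.
events_section() {
  sed -n '/^Events:/,$p'
}

# Usage against a live cluster:
#   kubectl describe pod <pod-name> -n <namespace> | events_section
```

For the etcd Pod above this would print only the single line Events: <none>.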

View the Nodes in the cluster and the details of a Node

[root@liruilong k8s]# kubectl  get nodes
NAME STATUS AGE
127.0.0.1 Ready 2d
[root@liruilong k8s]# kubectl describe node 127.0.0.1
# Basic Node information: name, labels, creation time, and so on.
Name: 127.0.0.1
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=127.0.0.1
Taints: <none>
CreationTimestamp: Fri, 27 Aug 2021 00:07:09 +0800
Phase:
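# Current running status of the Node. After a Node starts, it runs a series of self-checks:
# for example, whether the disk is full; if so, it is marked OutOfDisk=True.
# Otherwise it goes on to check whether memory is insufficient (if so, it is marked MemoryPressure=True).
# Finally, if everything is normal, it is set to the Ready state (Ready=True).
# This state means the Node is healthy and the Master can schedule new work onto it (such as starting Pods).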
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Sun, 29 Aug 2021 23:05:53 +0800 Sat, 28 Aug 2021 00:30:35 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Sun, 29 Aug 2021 23:05:53 +0800 Fri, 27 Aug 2021 00:07:09 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 29 Aug 2021 23:05:53 +0800 Fri, 27 Aug 2021 00:07:09 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Sun, 29 Aug 2021 23:05:53 +0800 Sat, 28 Aug 2021 00:30:35 +0800 KubeletReady kubelet is posting ready status
# Host addresses and hostname of the Node.
Addresses: 127.0.0.1,127.0.0.1,127.0.0.1
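# Total resources on the Node: the system resources the Node has, including CPU, memory, and the maximum number of schedulable Pods. Note that this version of Kubernetes already supports GPU resource allocation experimentally (alpha.kubernetes.io/nvidia-gpu=0).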
Capacity:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 1
memory: 1882012Ki
pods: 110
# Allocatable resources: the amount of resources on the Node currently available for allocation.
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 1
memory: 1882012Ki
pods: 110
# Host system information: the host's unique UUID, the Linux kernel version, the operating system type and version, the Kubernetes version, the kubelet and kube-proxy versions, and so on.
System Info:
Machine ID: 963c2c41b08343f7b063dddac6b2e486
System UUID: EB90EDC4-404C-410B-800F-3C65816C0E2D
Boot ID: 4a9349b0-ce4b-4b4a-8766-c5c4256bb80b
Kernel Version: 3.10.0-1160.15.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.5.2
Kube-Proxy Version: v1.5.2
ExternalID: 127.0.0.1
# Summary of the Pods currently running on the Node
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default mysql-2cpt9 0 (0%) 0 (0%) 0 (0%) 0 (0%)
default myweb-53r32 0 (0%) 0 (0%) 0 (0%) 0 (0%)
default myweb-609w4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
# Summary of allocated resources, for example the requested (minimum) and limit (maximum) amounts as a percentage of the Node's total.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
0 (0%) 0 (0%) 0 (0%) 0 (0%)
# Events related to the Node.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4h 27m 3 {kubelet 127.0.0.1} Warning MissingClusterDNS kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. pod: "myweb-609w4_default(01d719dd-08b1-11ec-9d6a-00163e1220cb)". Falling back to DNSDefault policy.
25m 25m 1 {kubelet 127.0.0.1} Warning MissingClusterDNS kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. pod: "mysql-2cpt9_default(1c9353ba-08d7-11ec-9d6a-00163e1220cb)". Falling back to DNSDefault policy.
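The Conditions above only matter for Nodes that are not healthy, so I find it useful to pick those out first. The following is a rough sketch under stated assumptions: not_ready_nodes is a name I invented, and it assumes STATUS is the second column of kubectl get nodes, as in the output above.

```shell
# not_ready_nodes: hypothetical helper, not part of kubectl.
# Reads `kubectl get nodes` output on stdin and prints the names of Nodes
# whose STATUS does not begin with "Ready" (for example NotReady).
# A value such as "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes | not_ready_nodes
# Then look at the Conditions and Events of a Node it prints:
#   kubectl describe node <node-name>
```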

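Viewing container logs

When we need to examine the logs produced by the application inside a container, we can use the kubectl logs <pod_name> command.

Here we print the logs of the etcd database and look for anything abnormal, filtering on the keyword error: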

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl logs etcd-vms81.liruilongs.github.io -n kube-system | grep -i error | head -5
{"level":"info","ts":"2022-01-25T13:54:33.191Z","caller":"wal/repair.go:96","msg":"repaired","path":"/var/lib/etcd/member/wal/0000000000000014-0000000000185aba.wal","error":"unexpected EOF"}
{"level":"info","ts":"2022-01-25T13:54:33.192Z","caller":"etcdserver/storage.go:109","msg":"repaired WAL","error":"unexpected EOF"}
{"level":"warn","ts":"2022-01-25T13:54:33.884Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:53950","server-name":"","error":"EOF"}
{"level":"warn","ts":"2022-01-25T13:54:33.885Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"127.0.0.1:53948","server-name":"","error":"EOF"}
{"level":"warn","ts":"2022-01-28T03:00:37.549Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"628.230855ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/runtimeclasses/\" range_end:\"/registry/runtimeclasses0\" count_only:true ","response":"","error":"context canceled"}
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
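In the output above, grep -i error also matches info-level lines, because etcd writes structured JSON logs in which "error" is just one of the fields. Filtering on the level field instead can be sketched as follows. warn_or_worse is a hypothetical helper name, and the pattern assumes compact JSON with no spaces around the colon, as in the lines shown above.

```shell
# warn_or_worse: hypothetical helper, not part of kubectl or etcd.
# Keeps only JSON log lines whose "level" field is warn or more severe.
warn_or_worse() {
  grep -E '"level":"(warn|error|dpanic|panic|fatal)"'
}

# Usage against a live cluster:
#   kubectl logs etcd-vms81.liruilongs.github.io -n kube-system | warn_or_worse | head -5
# For a container that has restarted, the log of the previous instance is often
# the interesting one:
#   kubectl logs <pod-name> -n <namespace> --previous
```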

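Viewing Kubernetes service logs

If Kubernetes is installed on a Linux system and its services are managed by systemd, then systemd's journal takes over the output logs of the service programs. In that environment, you can use the systemctl status or journalctl tools to view the logs of system services. For example:

View the service's startup information. From it you can locate the configuration files the service loads and how its startup parameters are set.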

┌──[root@vms81.liruilongs.github.io]-[~]
└─$systemctl status kubelet.service -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since 二 2022-01-25 21:53:35 CST; 6 days ago
Docs: https://kubernetes.io/docs/
Main PID: 1014 (kubelet)
Memory: 208.2M
CGroup: /system.slice/kubelet.service
└─1014 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.5

2月 01 17:47:14 vms81.liruilongs.github.io kubelet[1014]: W0201 17:47:14.258523 1014 container.go:586] Failed to update stats for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1b874bfdef201d69db10b200b8f47d5.slice/docker-c20fa960cfebd38172e123a5d87ecd499518bf22381f7aaa62d57131e7eb1aae.scope": unable to determine device info for dir: /var/lib/docker/overlay2/07d7695f2c479fbd0b654016345fcbacd0838276fb57f8291f993ed6799fae8d/diff: stat failed on /var/lib/docker/overlay2/07d7695f2c479fbd0b654016345fcbacd0838276fb57f8291f993ed6799fae8d/diff with error: no such file or directory, continuing to push stats
...

Use journalctl to view the service's logs. Here -u kubelet.service selects the kubelet unit, and grep keeps the entries that contain the keyword error:

┌──[root@vms81.liruilongs.github.io]-[~]
└─$journalctl -u kubelet.service | grep -i error | head -2
1月 25 21:53:55 vms81.liruilongs.github.io kubelet[1014]: I0125 21:53:55.865441 1014 docker_service.go:264] "Docker Info" dockerInfo=&{ID:HN3K:C6LG:QGV7:N2CG:VELF:CJ6T:HFR5:EEKH:HLPO:CDEU:GN3E:QAJJ Containers:32 ContainersRunning:11 ContainersPaused:0 ContainersStopped:21 Images:32 Driver:overlay2 DriverStatus:[[Backing Filesystem xfs] [Supports d_type true] [Native Overlay Diff true] [userxattr false]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:true KernelMemoryTCP:true CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:true IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:26 OomKillDisable:true NGoroutines:39 SystemTime:2022-01-25T21:53:55.833509372+08:00 LoggingDriver:json-file CgroupDriver:systemd CgroupVersion:1 NEventsListener:0 KernelVersion:3.10.0-693.el7.x86_64 OperatingSystem:CentOS Linux 7 (Core) OSVersion:7 OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc000a8f960 NCPU:2 MemTotal:4126896128 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:vms81.liruilongs.github.io Labels:[] ExperimentalBuild:false ServerVersion:20.10.9 ClusterStore: ClusterAdvertise: Runtimes:map[io.containerd.runc.v2:{Path:runc Args:[] Shim:<nil>} io.containerd.runtime.v1.linux:{Path:runc Args:[] Shim:<nil>} runc:{Path:runc Args:[] Shim:<nil>}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID:5b46e404f6b9f661a205e28d59c982d3634148f8 Expected:5b46e404f6b9f661a205e28d59c982d3634148f8} RuncCommit:{ID:v1.0.2-0-g52b36a2 Expected:v1.0.2-0-g52b36a2} InitCommit:{ID:de40ad0 
Expected:de40ad0} SecurityOptions:[name=seccomp,profile=default] ProductLicense: DefaultAddressPools:[]
1月 25 21:53:56 vms81.liruilongs.github.io kubelet[1014]: E0125 21:53:56.293100 1014 controller.go:144] failed to ensure lease exists, will retry in 200ms, error: Get "https://192.168.26.81:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vms81.liruilongs.github.io?timeout=10s": dial tcp 192.168.26.81:6443: connect: connection refused
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
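The first match above is an info-level line (it begins with I0125) that merely contains the word Error inside the Docker info dump. Kubernetes components log in the klog format, where each message starts with a severity letter (I, W, E, or F) followed by the month and day, so filtering on that prefix is more precise. A minimal sketch, with klog_errors as a hypothetical helper name:

```shell
# klog_errors: hypothetical helper, not part of journalctl or kubelet.
# Keeps only lines that carry a klog header with severity E (error) or
# F (fatal), i.e. something like "E0125 21:53:56.293100 ".
klog_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+ '
}

# Usage on a node:
#   journalctl -u kubelet.service --no-pager | klog_errors | head -5
# Limiting the time range keeps the output manageable:
#   journalctl -u kubelet.service --no-pager --since "1 hour ago" | klog_errors
```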

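If systemd does not take over the standard output of the Kubernetes services, you can also specify the log directory with the logging-related startup parameters. On a cluster like this one, where the control-plane components run as Pods, those startup parameters can be found by inspecting the Pod definition.

View the startup parameters of kube-controller-manager and the configuration files related to authentication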

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl describe pod kube-controller-manager-vms81.liruilongs.github.io -n kube-system | grep -i -A 20 command
Command:
kube-controller-manager
--allocate-node-cidrs=true
--authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
--authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
--bind-address=127.0.0.1
--client-ca-file=/etc/kubernetes/pki/ca.crt
--cluster-cidr=10.244.0.0/16
--cluster-name=kubernetes
--cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
--cluster-signing-key-file=/etc/kubernetes/pki/ca.key
--controllers=*,bootstrapsigner,tokencleaner
--kubeconfig=/etc/kubernetes/controller-manager.conf
--leader-elect=true
--port=0
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
--root-ca-file=/etc/kubernetes/pki/ca.crt
--service-account-private-key-file=/etc/kubernetes/pki/sa.key
--service-cluster-ip-range=10.96.0.0/12
--use-service-account-credentials=true
State: Running
┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl describe pod kube-controller-manager-vms81.liruilongs.github.io -n kube-system | grep kubeconfig
--authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
--authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
--kubeconfig=/etc/kubernetes/controller-manager.conf
/etc/kubernetes/controller-manager.conf from kubeconfig (ro)
kubeconfig:
┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl describe pod kube-controller-manager-vms81.liruilongs.github.io -n kube-system | grep -i -A 20 Volumes
Volumes:
ca-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType: DirectoryOrCreate
etc-pki:
Type: HostPath (bare host directory volume)
Path: /etc/pki
HostPathType: DirectoryOrCreate
flexvolume-dir:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
HostPathType: DirectoryOrCreate
k8s-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki
HostPathType: DirectoryOrCreate
kubeconfig:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/controller-manager.conf
HostPathType: FileOrCreate
┌──[root@vms81.liruilongs.github.io]-[~]
└─$

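For problems related to Pod resource objects, such as a Pod that cannot be created, a Pod that stops right after starting, or Pod replicas that cannot be scaled up, first determine which node the Pod is on, then log in to that node and search the kubelet logs for the Pod's complete log entries to troubleshoot.

For problems related to Pod scaling or to an RC (ReplicationController), the key clue is likely to be in the logs of kube-controller-manager and kube-scheduler.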

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl logs kube-scheduler-vms81.liruilongs.github.io
┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl logs kube-controller-manager-vms81.liruilongs.github.io

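kube-proxy is often overlooked, because even if it stops unexpectedly the Pods still show a normal status, yet access to some Services fails. Such errors are usually closely tied to the kube-proxy service on each node. When you run into them, check the kube-proxy logs first and also check the firewall service, paying particular attention to any suspicious rules that were added by hand.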

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl logs kube-proxy-tbwz5

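Common problems

A Pod stays in the Pending state because the pause image cannot be downloaded

A Pod is created successfully, but its RESTARTS count keeps increasing: the container's startup command does not stay running in the foreground.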
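When the container's main process exits, for example because the command puts itself into the background, the kubelet restarts the container and the RESTARTS count climbs. A rough sketch for spotting such Pods follows. restarting_pods is a hypothetical helper name, the default threshold of 5 is arbitrary, and it assumes RESTARTS is the fourth column of kubectl get pods.

```shell
# restarting_pods: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods` output on stdin and prints "<name> <restarts>"
# for every Pod whose RESTARTS value is at least the given threshold
# (first argument, defaulting to an arbitrary 5).
restarting_pods() {
  awk -v min="${1:-5}" 'NR > 1 && ($4 + 0) >= (min + 0) { print $1, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -n <namespace> | restarting_pods 10
# The output of the last terminated container usually shows why it exited:
#   kubectl logs <pod-name> -n <namespace> --previous
```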

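Unable to access a Service by its name

In a Kubernetes cluster, running microservices should be accessed by Service name wherever possible, but sometimes the access fails. Because a Service involves DNS resolution of its name, load distribution by the kube-proxy component, and the state of the backend Pod list, the problem can be investigated from the following angles.

1. Check whether the Service's backend Endpoints are healthy

You can view a Service's backend Endpoint list with the kubectl get endpoints <service name> command. An empty list (shown as <none>) has several possible causes, which are listed after the sample output below.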

┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 50d
liruilong-kube-prometheus-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 16d
metrics-server ClusterIP 10.111.104.173 <none> 443/TCP 50d
┌──[root@vms81.liruilongs.github.io]-[~]
└─$kubectl get endpoints
NAME ENDPOINTS AGE
kube-dns 10.244.88.66:53,10.244.88.67:53,10.244.88.66:53 + 3 more... 50d
liruilong-kube-prometheus-kubelet 192.168.26.81:10250,192.168.26.82:10250,192.168.26.83:10250 + 6 more... 16d
metrics-server <none> 50d
┌──[root@vms81.liruilongs.github.io]-[~]
└─$
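An empty Endpoint list may be caused by:

  • The Service's Label Selector does not match the Pods' Labels, so no Pod is available to serve the requests.
  • The backend Pods never reach the Ready state (use kubectl get pods to check the Pod status further).
  • The Service's targetPort does not match the Pod's containerPort, and so on. That is, when the port the container exposes differs from the port the Service exposes, targetPort must point to the container's port so that traffic is forwarded correctly.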
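The checks in this list can be sketched in shell. This is only an illustration: empty_endpoints is a helper name I made up, and it assumes that an empty Endpoint list is printed as <none> in the second column, as for metrics-server in the output above.

```shell
# empty_endpoints: hypothetical helper, not part of kubectl.
# Reads `kubectl get endpoints` output on stdin and prints the names of
# Services whose ENDPOINTS column is <none>.
empty_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get endpoints -n <namespace> | empty_endpoints
# Compare the Service's selector with the labels on the Pods:
#   kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
#   kubectl get pods -n <namespace> -l <key>=<value> --show-labels
```

If the last command returns no Pods, the selector and the Pod labels do not match; if it returns Pods that are not Ready, the second cause in the list applies.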

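2. Check whether the Service name resolves correctly to the ClusterIP address

You can check by running ping <service_name>.<namespace>.svc in a client container. If it returns the Service's ClusterIP address, the DNS service resolves the Service name correctly; if it does not, the cluster's DNS service may be malfunctioning.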
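If no client container is at hand, a throwaway Pod can run the lookup. The sketch below only builds the name to resolve; svc_fqdn is a hypothetical helper, cluster.local is the default cluster domain and may differ on your cluster, and the Pod name dns-test and the busybox:1.28 image in the usage comment are my own assumptions.

```shell
# svc_fqdn: hypothetical helper, not part of kubectl.
# Builds the DNS name of a Service from its name, namespace (default:
# "default"), and cluster domain (default: "cluster.local").
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

# Usage against a live cluster (starts a temporary Pod and removes it afterwards):
#   kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
#     -- nslookup "$(svc_fqdn kube-dns kube-system)"
```

nslookup only exercises name resolution, which is the part this step is about; whether the address answers ping is a separate question.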

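3. Check whether kube-proxy's forwarding rules are correct

kube-proxy can be set to either the IPVS or the iptables load-distribution mode.

  • In IPVS mode, use the ipvsadm tool to view the IPVS rules on the Node and check whether the rules for the Service's ClusterIP are set correctly.
  • In iptables mode, view the iptables rules on the Node and check whether the rules for the Service's ClusterIP are set correctly.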
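For the iptables case, the rules for one ClusterIP can be picked out of the saved rule set. This is a minimal sketch under stated assumptions: rules_for_cluster_ip is a hypothetical helper name, and it assumes IPv4 rules written in the usual "-d <ip>/32" form.

```shell
# rules_for_cluster_ip: hypothetical helper, not part of iptables or kube-proxy.
# Reads `iptables-save` output on stdin and keeps the rules whose destination
# is exactly the given ClusterIP (first argument).
rules_for_cluster_ip() {
  grep -F -- "-d $1/32"
}

# Usage on a Node, as root:
#   iptables-save -t nat | rules_for_cluster_ip 10.96.0.10
# In IPVS mode, look for the virtual server instead:
#   ipvsadm -Ln | grep -F -A 3 '10.96.0.10:'
```

Matching on the trailing /32 keeps 10.96.0.1 from also matching 10.96.0.10.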

Getting help

Sites and communities

Monitoring, logging, and debugging topics on the official Kubernetes site: https://kubernetes.io/docs/tasks/debug-application-cluster/

Official Kubernetes forum: https://discuss.kubernetes.io/ (this one requires *)

Kubernetes issue list on GitHub: https://github.com/kubernetes/kubernetes/issues

Kubernetes discussions on the * site: https://*.com/questions/tagged/kubernetes

Kubernetes Slack chat group: https://kubernetes.slack.com/ (requires a Google account)