Problems adding a k8s node with kubeadm


From the output below you can see that three things are broken: etcd-master1, kube-apiserver-master1, and kube-flannel-ds-42z5p (a quick way to filter for just the unhealthy pods is sketched after the listing).

    [root@master3 ~]# kubectl get pods -n kube-system -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    coredns-546565776c-m96fb 1/1 Running 0 46d 10.244.1.3 master2 <none> <none>
    coredns-546565776c-thczd 1/1 Running 0 44d 10.244.2.2 master3 <none> <none>
    etcd-master1 0/1 CrashLoopBackOff 21345 124d 10.128.4.164 master1 <none> <none>
    etcd-master2 1/1 Running 1 124d 10.128.4.251 master2 <none> <none>
    etcd-master3 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kube-apiserver-master1 0/1 CrashLoopBackOff 21349 124d 10.128.4.164 master1 <none> <none>
    kube-apiserver-master2 1/1 Running 1 124d 10.128.4.251 master2 <none> <none>
    kube-apiserver-master3 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kube-controller-manager-master1 1/1 Running 11 124d 10.128.4.164 master1 <none> <none>
    kube-controller-manager-master2 1/1 Running 2 124d 10.128.4.251 master2 <none> <none>
    kube-controller-manager-master3 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kube-flannel-ds-42z5p 0/1 Error 1568 6d2h 10.128.2.173 bg7.test.com.cn <none> <none>
    kube-flannel-ds-6g59q 1/1 Running 7 43d 10.128.4.8 wd8.test.com.cn <none> <none>
    kube-flannel-ds-85hxd 1/1 Running 3 123d 10.128.4.107 wd6.test.com.cn <none> <none>
    kube-flannel-ds-brd8d 1/1 Running 1 33d 10.128.4.160 wd9.test.com.cn <none> <none>
    kube-flannel-ds-gmmhx 1/1 Running 3 124d 10.128.4.82 wd5.test.com.cn <none> <none>
    kube-flannel-ds-lj4g2 1/1 Running 1 124d 10.128.4.251 master2 <none> <none>
    kube-flannel-ds-n68dn 1/1 Running 11 124d 10.128.4.164 master1 <none> <none>
    kube-flannel-ds-ppnd7 1/1 Running 4 124d 10.128.4.191 wd4.test.com.cn <none> <none>
    kube-flannel-ds-tf9lk 1/1 Running 0 33d 10.128.4.170 wd7.test.com.cn <none> <none>
    kube-flannel-ds-vt5nh 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kube-proxy-622c7 1/1 Running 11 124d 10.128.4.164 master1 <none> <none>
    kube-proxy-7bp72 1/1 Running 0 7d4h 10.128.2.173 bg7.test.com.cn <none> <none>
    kube-proxy-8cx5q 1/1 Running 4 123d 10.128.4.107 wd6.test.com.cn <none> <none>
    kube-proxy-h2qh5 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kube-proxy-kpkm4 1/1 Running 7 43d 10.128.4.8 wd8.test.com.cn <none> <none>
    kube-proxy-lp74p 1/1 Running 1 33d 10.128.4.160 wd9.test.com.cn <none> <none>
    kube-proxy-nwsnm 1/1 Running 1 124d 10.128.4.251 master2 <none> <none>
    kube-proxy-psjll 1/1 Running 4 124d 10.128.4.82 wd5.test.com.cn <none> <none>
    kube-proxy-v6x42 1/1 Running 0 33d 10.128.4.170 wd7.test.com.cn <none> <none>
    kube-proxy-vdfmz 1/1 Running 4 124d 10.128.4.191 wd4.test.com.cn <none> <none>
    kube-scheduler-master1 1/1 Running 11 124d 10.128.4.164 master1 <none> <none>
    kube-scheduler-master2 1/1 Running 1 124d 10.128.4.251 master2 <none> <none>
    kube-scheduler-master3 1/1 Running 1 124d 10.128.4.211 master3 <none> <none>
    kuboard-7986796cf8-2g6bs 1/1 Running 0 44d 10.244.1.4 master2 <none> <none>
    metrics-server-677dcb8b4d-pshqw 1/1 Running 0 44d 10.128.4.191 wd4.test.com.cn <none> <none>
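Not from the original log, but a convenient filter for surfacing only the unhealthy pods in output like this (a plain grep, nothing cluster-specific assumed):

    # show everything in kube-system that is not Running/Completed
    kubectl get pods -n kube-system | grep -vE 'Running|Completed'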

1 The flannel problem
Reference: "How to fix the flannel component stuck in CrashLoopBackOff in a K8s cluster".
That article is about loading the ipvs kernel modules; you can check whether they loaded successfully with lsmod | grep ip_vs (a reload-and-verify sketch follows the module list below).

    [root@master3 net.d]# cat /etc/sysconfig/modules/ipvs.modules
    #!/bin/sh
    modprobe -- ip_vs
    modprobe -- ip_vs_rr
    modprobe -- ip_vs_wrr
    modprobe -- ip_vs_sh
    modprobe -- nf_conntrack_ipv4
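For reference, a minimal sketch for (re)loading and verifying the modules on a node (on kernels 4.19+ nf_conntrack_ipv4 is merged into nf_conntrack, so that line may fail harmlessly):

    # run the module script above, then confirm the ip_vs modules are present
    bash /etc/sysconfig/modules/ipvs.modules
    lsmod | grep -e ip_vs -e nf_conntrack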

But the error I'm seeing here is not that one:

    [root@master3 ~]# kubectl logs kube-flannel-ds-42z5p -n kube-system
    I0714 08:58:00.590712 1 main.go:519] Determining IP address of default interface
    I0714 08:58:00.687885 1 main.go:532] Using interface with name eth0 and address 10.128.2.173
    I0714 08:58:00.687920 1 main.go:549] Defaulting external address to interface address (10.128.2.173)
    W0714 08:58:00.687965 1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
    E0714 08:58:30.689584 1 main.go:250] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-42z5p': Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-42z5p": dial tcp 10.96.0.1:443: i/o timeout

Digging through references: "Troubleshooting flannel installation in k8s: Failed to create SubnetManager: error retrieving pod spec for : the server doe", "Installing kubernetes 1.6.1 with flannel on ubuntu 16.04 using kubeadm", "Quickly deploying a K8S cluster with kubeadm".
Checking the cluster, on a worker node that does not have the problem you can see:

    [root@wd5 ~]# ps -ef|grep flannel
    root 8359 28328 0 17:13 pts/0 00:00:00 grep --color=auto flannel
    root 22735 22714 0 May31 ? 00:26:16 /opt/bin/flanneld --ip-masq --kube-subnet-mgr

The problematic worker node, however, does not have this process, and kubectl does not even work there yet:

    [root@bg7 ~]# kubectl create -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel-rbac.yml
    The connection to the server localhost:8080 was refused - did you specify the right host or port?

Check the k8s cluster status:

    [root@master3 ~]# kubectl get cs
    NAME STATUS MESSAGE ERROR
    controller-manager Unhealthy Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
    scheduler Unhealthy Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
    etcd-0 Healthy {"health":"true"}

Reference: "Fixing k8s Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused".
Edit /etc/kubernetes/manifests/kube-scheduler.yaml and /etc/kubernetes/manifests/kube-controller-manager.yaml, comment out the - --port=0 line in each, then run systemctl restart kubelet.service; only then does the status come back healthy (a sketch of the edit follows the output below):

    NAME STATUS MESSAGE ERROR
    scheduler Healthy ok
    controller-manager Healthy ok
    etcd-0 Healthy {"health":"true"}
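As a sketch of that edit (paths are the default kubeadm manifest locations; back the files up outside /etc/kubernetes/manifests so kubelet does not try to run the copies as static pods):

    cp /etc/kubernetes/manifests/kube-scheduler.yaml /root/kube-scheduler.yaml.bak
    cp /etc/kubernetes/manifests/kube-controller-manager.yaml /root/kube-controller-manager.yaml.bak
    # comment out the "- --port=0" argument in both manifests
    sed -i 's/^\( *\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-scheduler.yaml
    sed -i 's/^\( *\)- --port=0/\1# - --port=0/' /etc/kubernetes/manifests/kube-controller-manager.yaml
    systemctl restart kubelet.service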

The configuration change above did not fix the abnormal pod status, though.
See: "k8s flannel network problem dial tcp 10.0.0.1:443: i/o timeout".
Every healthy node has the cni0 virtual NIC, while the problematic node does not:

    [root@wd4 ~]# ifconfig
    cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
    inet 10.244.3.1 netmask 255.255.255.0 broadcast 10.244.3.255
    inet6 fe80::44d6:8ff:fe10:9c7e prefixlen 64 scopeid 0x20<link>
    ether 46:d6:08:10:9c:7e txqueuelen 1000 (Ethernet)
    RX packets 322756760 bytes 105007395106 (97.7 GiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 328180837 bytes 158487160202 (147.6 GiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

That interface exists because kube-controller-manager.yaml sets the cluster pod network to 10.244.0.0/16.
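A quick way to confirm that setting (a sketch; this is the standard kube-controller-manager --cluster-cidr argument in the kubeadm manifest):

    grep cluster-cidr /etc/kubernetes/manifests/kube-controller-manager.yaml
    # expected: - --cluster-cidr=10.244.0.0/16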
Checking the node status again, there are error messages at the bottom that I had not noticed before:

    [root@bg7 net.d]# service kubelet status
    Redirecting to /bin/systemctl status kubelet.service
    kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
    └─10-kubeadm.conf
    Active: active (running) since Thu 2021-07-08 13:40:44 CST; 6 days ago
    Docs: https://kubernetes.io/docs/
    Main PID: 5290 (kubelet)
    Tasks: 45
    Memory: 483.8M
    CGroup: /system.slice/kubelet.service
    └─5290 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --co...
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.322908 5290 cni.go:364] Error adding longhorn-system_longhorn-csi-plugi...rectory
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.355433 5290 cni.go:364] Error adding longhorn-system_engine-image-ei-e1...rectory
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.372196 5290 cni.go:364] Error adding longhorn-system_longhorn-manager-2...rectory
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.378600 5290 pod_container_deletor.go:77] Container "5ae13a0a2be56237a3f...tainers
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.395855 5290 pod_container_deletor.go:77] Container "ea0b2a805f720628172...tainers
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: W0714 18:30:58.411259 5290 pod_container_deletor.go:77] Container "63776660a9ee92b50ee...tainers
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700878 5290 remote_runtime.go:105] RunPodSandbox from runtime service failed: ...
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700942 5290 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "longhorn-csi-...
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.700958 5290 kuberuntime_manager.go:733] createPodSandbox for pod "longhorn-csi...
    Jul 14 18:30:58 bg7.test.com.cn kubelet[5290]: E0714 18:30:58.701009 5290 pod_workers.go:191] Error syncing pod 3b0799d3-9446-4f51-94...446-4f5
    Hint: Some lines were ellipsized, use -l to show in full.
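The ellipsized "no such file or directory" errors above point at missing CNI state; a quick check on the broken node (these are the files that flannel's init container and flanneld are supposed to write):

    ls -l /etc/cni/net.d/        # should contain 10-flannel.conflist
    cat /run/flannel/subnet.env  # missing here, present on healthy nodes (see further below)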

Install the network plugin on the worker node:

    [root@bg7 ~]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    The connection to the server localhost:8080 was refused - did you specify the right host or port?

This error appears because admin.conf has to be copied from a master node onto the worker node (the copy command itself is sketched after the output below):

    echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
    source ~/.bash_profile
    [root@bg7 kubernetes]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    podsecuritypolicy.policy/psp.flannel.unprivileged configured
    clusterrole.rbac.authorization.k8s.io/flannel unchanged
    clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
    serviceaccount/flannel unchanged
    configmap/kube-flannel-cfg unchanged
    daemonset.apps/kube-flannel-ds configured
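For completeness, the copy itself would look something like this, run on a master node (10.128.2.173 is the bg7 worker from the listing at the top; the same command appears again later in this post):

    scp /etc/kubernetes/admin.conf root@10.128.2.173:/etc/kubernetes/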

Reference: "Installing kubernetes (flannel) with kubeadm".
Having run out of ideas, I reset the worker node:

    systemctl stop kubelet
    kubeadm reset
    rm -rf /etc/cni/net.d
    # run this if the firewall is enabled
    iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
    # join the cluster
    kubeadm join 10.128.4.18:16443 --token xfp80m.xx --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da9334xxxxxxxx
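If the saved join token has expired by the time the node is re-added, a fresh join command can be generated on a healthy master (a standard kubeadm command, not taken from the original log):

    kubeadm token create --print-join-command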

The root cause turned out to be that port 6443 was being blocked.

    [root@master2 ~]# netstat -ntlp | grep 6443
    tcp 0 0 0.0.0.0:16443 0.0.0.0:* LISTEN 886/haproxy
    tcp6 0 0 :::6443 :::* LISTEN 3006/kube-apiserver
    [root@bg7 net.d]# kubectl describe pod kube-flannel-ds-5jhm6 -n kube-system
    Name: kube-flannel-ds-5jhm6
    Namespace: kube-system
    Priority: 2000001000
    Priority Class Name: system-node-critical
    Node: bg7.test.com.cn/10.128.2.173
    Start Time: Thu, 15 Jul 2021 14:17:39 +0800
    Labels: app=flannel
    controller-revision-hash=68c5dd74df
    pod-template-generation=2
    tier=node
    Annotations: <none>
    Status: Running
    IP: 10.128.2.173
    IPs:
    IP: 10.128.2.173
    Controlled By: DaemonSet/kube-flannel-ds
    Init Containers:
    install-cni:
    Container ID: docker://f04fdac1c8d9d0f98bd11159aebb42f9870709fd6fa2bb96739f8d255967033a
    Image: quay.io/coreos/flannel:v0.14.0
    Image ID: docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port: <none>
    Host Port: <none>
    Command:
    cp
    Args:
    -f
    /etc/kube-flannel/cni-conf.json
    /etc/cni/net.d/10-flannel.conflist
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Thu, 15 Jul 2021 14:45:18 +0800
    Finished: Thu, 15 Jul 2021 14:45:18 +0800
    Ready: True
    Restart Count: 0
    Environment: <none>
    Mounts:
    /etc/cni/net.d from cni (rw)
    /etc/kube-flannel/ from flannel-cfg (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
    Containers:
    kube-flannel:
    Container ID: docker://8ab52d4dc3c29d13d7453a33293a8696391f31826afdc1981a1df9c7eafd6994
    Image: quay.io/coreos/flannel:v0.14.0
    Image ID: docker-pullable://quay.io/coreos/flannel@sha256:4a330b2f2e74046e493b2edc30d61fdebbdddaaedcb32d62736f25be8d3c64d5
    Port: <none>
    Host Port: <none>
    Command:
    /opt/bin/flanneld
    Args:
    --ip-masq
    --kube-subnet-mgr
    State: Waiting
    Reason: CrashLoopBackOff
    Last State: Terminated
    Reason: Error
    Exit Code: 1
    Started: Thu, 15 Jul 2021 15:27:58 +0800
    Finished: Thu, 15 Jul 2021 15:28:29 +0800
    Ready: False
    Restart Count: 12
    Limits:
    cpu: 100m
    memory: 50Mi
    Requests:
    cpu: 100m
    memory: 50Mi
    Environment:
    POD_NAME: kube-flannel-ds-5jhm6 (v1:metadata.name)
    POD_NAMESPACE: kube-system (v1:metadata.namespace)
    Mounts:
    /etc/kube-flannel/ from flannel-cfg (rw)
    /run/flannel from run (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from flannel-token-wc2lq (ro)
    Conditions:
    Type Status
    Initialized True
    Ready False
    ContainersReady False
    PodScheduled True
    Volumes:
    run:
    Type: HostPath (bare host directory volume)
    Path: /run/flannel
    HostPathType:
    cni:
    Type: HostPath (bare host directory volume)
    Path: /etc/cni/net.d
    HostPathType:
    flannel-cfg:
    Type: ConfigMap (a volume populated by a ConfigMap)
    Name: kube-flannel-cfg
    Optional: false
    flannel-token-wc2lq:
    Type: Secret (a volume populated by a Secret)
    SecretName: flannel-token-wc2lq
    Optional: false
    QoS Class: Burstable
    Node-Selectors: <none>
    Tolerations: :NoSchedule
    node.kubernetes.io/disk-pressure:NoSchedule
    node.kubernetes.io/memory-pressure:NoSchedule
    node.kubernetes.io/network-unavailable:NoSchedule
    node.kubernetes.io/not-ready:NoExecute
    node.kubernetes.io/pid-pressure:NoSchedule
    node.kubernetes.io/unreachable:NoExecute
    node.kubernetes.io/unschedulable:NoSchedule
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Normal Pulled 48m kubelet Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
    Normal Created 48m kubelet Created container install-cni
    Normal Started 48m kubelet Started container install-cni
    Normal Created 44m (x5 over 48m) kubelet Created container kube-flannel
    Normal Started 44m (x5 over 48m) kubelet Started container kube-flannel
    Normal Pulled 28m (x9 over 48m) kubelet Container image "quay.io/coreos/flannel:v0.14.0" already present on machine
    Warning BackOff 3m8s (x177 over 47m) kubelet Back-off restarting failed container
    journalctl -xeu kubelet
    "longhorn-csi-plugin-fw2ck_longhorn-system" network: open /run/flannel/subnet.env: no such file or directory
    [root@wd4 flannel]# cat subnet.env
    FLANNEL_NETWORK=10.244.0.0/16
    FLANNEL_SUBNET=10.244.3.1/24
    FLANNEL_MTU=1450
    FLANNEL_IPMASQ=true

From the output below you can see that 10.96.0.1 answers ping, but port 443 cannot be reached:

    [root@bg7 ~]# ping 10.96.0.1
    PING 10.96.0.1 (10.96.0.1) 56(84) bytes of data.
    64 bytes from 10.96.0.1: icmp_seq=1 ttl=64 time=0.034 ms
    [root@bg7 ~]# telnet 10.96.0.1 443
    Trying 10.96.0.1...
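A sketch for pinning down where the block is (addresses taken from the listings above; the bash /dev/tcp trick avoids depending on telnet or nc being installed):

    # TCP-level reachability of the service VIP and of a healthy apiserver
    timeout 3 bash -c '</dev/tcp/10.96.0.1/443'     && echo "443 reachable"  || echo "443 blocked"
    timeout 3 bash -c '</dev/tcp/10.128.4.251/6443' && echo "6443 reachable" || echo "6443 blocked"
    # look for host firewall rules mentioning the apiserver port
    iptables -S | grep 6443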

2 etcd-master1

    [root@master1 ~]# kubectl logs etcd-master1 -n kube-system
    [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
    2021-07-14 09:56:08.703026 I | etcdmain: etcd Version: 3.4.3
    2021-07-14 09:56:08.703052 I | etcdmain: Git SHA: 3cf2f69b5
    2021-07-14 09:56:08.703055 I | etcdmain: Go Version: go1.12.12
    2021-07-14 09:56:08.703058 I | etcdmain: Go OS/Arch: linux/amd64
    2021-07-14 09:56:08.703062 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
    2021-07-14 09:56:08.703101 N | etcdmain: the server is already initialized as member before, starting as etcd member...
    [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
    2021-07-14 09:56:08.703131 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
    2021-07-14 09:56:08.703235 C | etcdmain: open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory

This problem is relatively simple: copy the etcd certificates over from another master node. Since the master nodes of a k8s cluster are peers, my guess was that copying them across directly would work.

    [root@master2 ~]# cd /etc/kubernetes/pki/etcd
    [root@master2 etcd]# ll
    total 32
    -rw-r--r-- 1 root root 1017 Mar 12 11:59 ca.crt
    -rw------- 1 root root 1675 Mar 12 11:59 ca.key
    -rw-r--r-- 1 root root 1094 Mar 12 13:47 healthcheck-client.crt
    -rw------- 1 root root 1675 Mar 12 13:47 healthcheck-client.key
    -rw-r--r-- 1 root root 1127 Mar 12 13:47 peer.crt
    -rw------- 1 root root 1675 Mar 12 13:47 peer.key
    -rw-r--r-- 1 root root 1127 Mar 12 13:47 server.crt
    -rw------- 1 root root 1675 Mar 12 13:47 server.key
    cd /etc/kubernetes/pki/etcd
    scp healthcheck-client.crt root@10.128.4.164:/etc/kubernetes/pki/etcd
    scp healthcheck-client.key peer.crt peer.key server.crt server.key root@10.128.4.164:/etc/kubernetes/pki/etcd
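A quick sanity check on the copied files (openssl assumed available). Note that this only verifies the CA chain and expiry; certificates generated on master2 embed master2's IPs/hostnames as SANs, which may be why copying them was ultimately not enough:

    openssl verify -CAfile /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/peer.crt
    openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -subject -enddate
    openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -text | grep -A1 'Subject Alternative Name'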

To inspect etcd, install the etcdctl command-line client with the commands below; this is the etcd access tool installed directly on the host:

    wget https://github.com/etcd-io/etcd/releases/download/v3.4.14/etcd-v3.4.14-linux-amd64.tar.gz
    tar -zxf etcd-v3.4.14-linux-amd64.tar.gz
    mv etcd-v3.4.14-linux-amd64/etcdctl /usr/local/bin
    chmod +x /usr/local/bin/etcdctl

Besides the approach above, you can also exec directly into the docker container:

    docker exec -it $(docker ps -f name=etcd_etcd -q) /bin/sh
    # list the members of the etcd cluster
    # etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
    63009835561e0671, started, master1, https://10.128.4.164:2380, https://10.128.4.164:2379, false
    b245d1beab861d15, started, master2, https://10.128.4.251:2380, https://10.128.4.251:2379, false
    f3f56f36d83eef49, started, master3, https://10.128.4.211:2380, https://10.128.4.211:2379, false

Check the health of the HA etcd cluster:

    [root@master3 application]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint health
    {"level":"warn","ts":"2021-07-14T19:37:51.455+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2684301f-38ba-4150-beab-ed052321a6d9/10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    +-------------------+--------+------------+---------------------------+
    | ENDPOINT | HEALTH | TOOK | ERROR |
    +-------------------+--------+------------+---------------------------+
    | 10.128.4.211:2379 | true | 8.541405ms | |
    | 10.128.4.251:2379 | true | 8.922941ms | |
    | 10.128.4.164:2379 | false | 5.0002425s | context deadline exceeded |
    +-------------------+--------+------------+---------------------------+

List the members of the HA etcd cluster:

    [root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 member list
    +------------------+---------+---------+---------------------------+---------------------------+------------+
    | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
    +------------------+---------+---------+---------------------------+---------------------------+------------+
    | 63009835561e0671 | started | master1 | https://10.128.4.164:2380 | https://10.128.4.164:2379 | false |
    | b245d1beab861d15 | started | master2 | https://10.128.4.251:2380 | https://10.128.4.251:2379 | false |
    | f3f56f36d83eef49 | started | master3 | https://10.128.4.211:2380 | https://10.128.4.211:2379 | false |
    +------------------+---------+---------+---------------------------+---------------------------+------------+

Check the leader of the HA etcd cluster:

    [root@master3 ~]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=10.128.4.164:2379,10.128.4.251:2379,10.128.4.211:2379 endpoint status
    {"level":"warn","ts":"2021-07-15T10:24:33.494+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///10.128.4.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.128.4.164:2379: connect: connection refused\""}
    Failed to get the status of endpoint 10.128.4.164:2379 (context deadline exceeded)
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | 10.128.4.251:2379 | b245d1beab861d15 | 3.4.3 | 25 MB | false | false | 16 | 46888364 | 46888364 | |
    | 10.128.4.211:2379 | f3f56f36d83eef49 | 3.4.3 | 25 MB | true | false | 16 | 46888364 | 46888364 | |
    +-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Following the commands below, I copied the valid certificates over to master1, but it still failed afterwards.

    scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
    scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/

The next idea was to remove the master node from the cluster and then re-add it.
Reference: "Removing a master node server from a k8s cluster and re-adding it".

    # remove the problematic master node from k8s
    kubectl drain master1
    kubectl delete node master1
    # remove the corresponding member from etcd; the ID 12637f5ec2bd02b8 is obtained from the etcd member list
    etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 12637f5ec2bd02b8
    # note: run this on a healthy master node
    mkdir -p /etc/kubernetes/pki/etcd/
    scp /etc/kubernetes/pki/ca.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/sa.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/front-proxy-ca.* root@10.128.4.164:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/etcd/ca.* root@10.128.4.164:/etc/kubernetes/pki/etcd/
    scp /etc/kubernetes/admin.conf root@10.128.4.164:/etc/kubernetes/
    # note: run the following on the problematic master node
    kubeadm reset
    # note: run this on the problematic node
    kubeadm join 10.128.4.18:16443 --token xfp80m.tzbnqxoyv1p21687 --discovery-token-ca-cert-hash sha256:dee39c2f7c7484af5872018d786626c9a6264da93346acc9114ffacd0a2782d7 --control-plane
    kubectl cordon master1
    # at this point everything is OK, and the kube-apiserver-master1 problem was resolved along with it
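An alternative to copying the PKI by hand is kubeadm's own certificate distribution (the standard kubeadm workflow since v1.15, not what was done here; token, hash and key are placeholders):

    # on a healthy control-plane node: re-upload the cluster certificates and print the decryption key
    kubeadm init phase upload-certs --upload-certs
    # on the reset master: join as a control-plane node using that key
    kubeadm join 10.128.4.18:16443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> \
        --control-plane --certificate-key <key-printed-above>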

If you accidentally run kubeadm reset on a machine that was previously healthy, you end up in the situation below, where master3 has gone NotReady:

    [root@master1 pki]# kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    bg7.test.com.cn Ready <none> 7d22h v1.18.9
    master1 Ready master 6m6s v1.18.9
    master2 Ready,SchedulingDisabled master 124d v1.18.9
    master3 NotReady,SchedulingDisabled master 124d v1.18.9
    wd4.test.com.cn Ready <none> 124d v1.18.9

For the fix, see: "How to handle a K8S master or NODE node where kubeadm reset was run by mistake".
The procedure below did not work for me; I succeeded by removing the node first and then re-adding it, as described above.

    scp /etc/kubernetes/admin.conf root@10.128.2.173:/etc/kubernetes/
    mkdir -p $HOME/.kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
    kubeadm init --kubernetes-version=v1.18.9 --pod-network-cidr=10.244.0.0/16
    echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile
    source ~/.bash_profile

3 Node scheduling problems
SchedulingDisabled means the node cannot be scheduled, which is clearly wrong for worker nodes. Running kubectl uncordon wd9.test.com.cn makes a previously unschedulable node schedulable again, and kubectl cordon master1 marks the master node as unschedulable again (a combined example follows the listing below).

    [root@master1 pki]# kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    bg7.test.com.cn Ready,SchedulingDisabled <none> 7d6h v1.18.9
    master1 Ready master 124d v1.18.9
    master2 Ready master 124d v1.18.9
    master3 Ready master 124d v1.18.9
    wd4.test.com.cn Ready <none> 124d v1.18.9
    wd5.test.com.cn Ready <none> 124d v1.18.9
    wd6.test.com.cn Ready,SchedulingDisabled <none> 124d v1.18.9
    wd7.test.com.cn Ready <none> 34d v1.18.9
    wd8.test.com.cn Ready,SchedulingDisabled <none> 43d v1.18.9
    wd9.test.com.cn Ready
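Applied to this listing, re-enabling scheduling on the still-disabled workers would look like this (node names taken from the output above):

    kubectl uncordon bg7.test.com.cn wd6.test.com.cn wd8.test.com.cn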
