[K8S] Pod Scheduling


- How a Pod gets created
- Resource limits (cpu, memory)
- nodeSelector
- nodeAffinity
- Taints and tolerations
- Scheduling to a specific node

@How a Pod gets created
1. kubectl run creates a Pod; the request goes to the apiserver, which stores the object in etcd.
2. The scheduler picks a suitable node for the new Pod according to its scheduling algorithm, binds the Pod to that node, and the result is written back through the apiserver into etcd.
3. The kubelet on that node notices the newly assigned Pod, calls the container runtime (e.g. the Docker API) to create the containers, and reports the container status back through the apiserver into etcd.
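This flow shows up in the Pod's events; a hedged way to inspect it (assuming a pod named testpod was just created):

    # The Events section lists Scheduled / Pulling / Pulled / Created / Started in order
    kubectl describe pod testpod
    kubectl get events --sort-by='.lastTimestamp' | grep testpod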

Kubernetes is built as a controller architecture on top of the list-watch mechanism.
Each component watches the resources it is responsible for; when those resources change, the kube-apiserver notifies the component.
etcd stores the cluster's data, and the apiserver is the single entry point: every operation on that data must go through the apiserver.
Clients (kubelet / scheduler / controller-manager) use list-watch to listen for create, update and delete events on resources (pod / rs / rc, etc.) in the apiserver, and call the appropriate handler for each event type.

More on list-watch -> 理解 K8S 的设计精髓之 List-Watch机制和Informer模块 (Understanding the List-Watch mechanism and the Informer module) https://zhuanlan.zhihu.com/p/59660536
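To get a feel for the watch half of list-watch from the client side, here is a hedged sketch with plain kubectl (the built-in controllers use informers rather than kubectl; this is only for observation):

    # Open a watch against the apiserver and stream pod changes as they happen
    kubectl get pods --watch -o wide
    # Or stream the raw event objects for the namespace
    kubectl get events --watch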

@Resource limits (cpu, memory)

limitpod.yaml looks like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: testpod
    spec:
      containers:
      - image: nginx
        name: testpod
        resources:
          requests:
            memory: 100Mi
            cpu: 400m
          limits:
            memory: 128Mi
            cpu: 500m

Notes:
containers.resources.limits is the upper bound of resources the container may use.
containers.resources.requests is the amount reserved for the container, not its actual usage -> the scheduler uses it to decide whether a node can still fit the Pod.
If no node can satisfy the requested values, the Pod stays Pending.
For CPU, 1 core = 1000m (millicores).
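As a rough worked example (numbers assumed, not from the cluster above): with cpu: 400m requested per Pod, a node with 2 allocatable cores (2000m) can fit at most 5 such Pods, since 5 × 400m = 2000m, ignoring whatever else is already running there.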

kubectl describe pod shows the configured limits and requests.
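A couple of hedged one-liners for checking this (the pod name testpod comes from limitpod.yaml above):

    # Print the resources block of the first container
    kubectl get pod testpod -o jsonpath='{.spec.containers[0].resources}'
    # Or read the Limits / Requests section from the describe output
    kubectl describe pod testpod | grep -A 6 'Limits'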


Check which node the Pod landed on; here it is running on k8s-node1:

    [root@k8s-master ~]# kubectl get pod testpod -o wide
    NAME      READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
    testpod   1/1     Running   0          19m   10.244.36.66   k8s-node1   <none>           <none>
    [root@k8s-master ~]#

Then inspect the node's resources with kubectl describe node.
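A hedged example of the relevant part of that output (the section name is standard; the numbers depend on your cluster):

    # "Allocated resources" sums the requests/limits of all pods on the node
    kubectl describe node k8s-node1 | grep -A 8 'Allocated resources'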


Create a few more Pods until the nodes fill up; once no node can satisfy resources.requests, the new Pod stays Pending.
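A minimal sketch of a Pod that should reproduce this (the 8-core request is an assumption; use any value larger than your nodes' free CPU):

    apiVersion: v1
    kind: Pod
    metadata:
      name: bigpod
    spec:
      containers:
      - image: nginx
        name: bigpod
        resources:
          requests:
            cpu: "8"        # assumed to exceed any node's allocatable CPU
            memory: 100Mi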


kubectl describe pod shows that the cause is insufficient resources.

@nodeSelector
nodeSelector: schedules the Pod onto nodes whose labels match; if no node carries the matching labels, scheduling fails.
Example use cases:
Dedicated nodes: group nodes by business line.
Special hardware: some nodes have SSDs (much better read/write performance than spinning disks).
Commands to add, list and remove node labels:
Add a label to a node:      kubectl label node <node-name> <key>=<value>   (the value may be empty)
List node labels:           kubectl get node --show-labels
Remove a label from a node: kubectl label node <node-name> <key>-

Nodes already carry some default labels:

    [root@k8s-master ~]# kubectl get nodes --show-labels
    NAME         STATUS   ROLES                  AGE   VERSION   LABELS
    k8s-master   Ready    control-plane,master   23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=
    k8s-node1    Ready    <none>                 23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux
    k8s-node2    Ready    <none>                 23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux
    [root@k8s-master ~]#

Add a disktype=ssd label to k8s-node1:

    [root@k8s-master ~]# kubectl label node k8s-node1 disktype=ssd
    node/k8s-node1 labeled
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl get nodes --show-labels
    NAME         STATUS   ROLES                  AGE   VERSION   LABELS
    k8s-master   Ready    control-plane,master   23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=
    k8s-node1    Ready    <none>                 23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux
    k8s-node2    Ready    <none>                 23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux
    [root@k8s-master ~]#

Generate a Pod YAML with kubectl run testpod --image=nginx --dry-run=client -o yaml > testpod.yaml, then edit testpod.yaml so it looks like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: testpod
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - image: nginx
        name: testpod

Create the Pod with kubectl apply -f testpod.yaml and check whether it landed on k8s-node1, the node labelled ssd:

    [root@k8s-master ~]# kubectl get pod -o wide
    NAME      READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
    testpod   1/1     Running   0          24s   10.244.36.112   k8s-node1   <none>           <none>
    [root@k8s-master ~]#

Remove the label: kubectl label node k8s-node1 disktype-

    [root@k8s-master ~]# kubectl get node --show-labels | grep ssd
    k8s-node1   Ready   <none>   23d   v1.20.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl label node k8s-node1 disktype-
    node/k8s-node1 labeled
    [root@k8s-master ~]# kubectl get node --show-labels | grep ssd
    [root@k8s-master ~]#

@nodeAffinity

Node affinity (nodeAffinity) constrains which nodes a Pod can be scheduled onto, based on node labels.
Compared with nodeSelector:
1) More operators are supported: In, NotIn, Exists, DoesNotExist, Gt, Lt.

You can use NotIn and DoesNotExist for node anti-affinity behaviour, or use node taints to repel Pods from specific nodes.

2) Scheduling rules can be soft or hard, rather than only a hard requirement:
- hard policy (required): must be satisfied
- soft policy (preferred): satisfied on a best-effort basis

[Example] Create a Pod whose node affinity has one hard rule and one soft rule.
The nodes currently carry no extra labels, so the Pod is expected not to be scheduled. node-affinity.yaml looks like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: node-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: project
                operator: In
                values:
                - makePeopleHappy
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: group
                operator: In
                values:
                - test1
      containers:
      - name: web
        image: nginx

Notes:
Hard policy: spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
Soft policy: spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution; the weight field takes values from 1 to 100.

Caveats:

- If you specify both nodeSelector and nodeAffinity, both must be satisfied for the Pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms under nodeAffinity, the Pod can be scheduled onto a node as long as any one of the nodeSelectorTerms is satisfied.
- If you specify multiple matchExpressions within one nodeSelectorTerm, the Pod can be scheduled onto a node only if all of the matchExpressions are satisfied (see the sketch after this list).
- If you change or remove the labels of the node the Pod was scheduled onto, the Pod is not deleted. In other words, affinity only takes effect at scheduling time.
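A minimal sketch of those OR/AND semantics (the labels here are made up, not from this article): the two nodeSelectorTerms below are ORed, while the matchExpressions inside the first term are ANDed.

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:            # term 1: project=web AND group=test1
            - key: project
              operator: In
              values: ["web"]
            - key: group
              operator: In
              values: ["test1"]
          - matchExpressions:            # OR term 2: disktype=ssd
            - key: disktype
              operator: In
              values: ["ssd"]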

Hard policy in action:

After creation the Pod stays Pending:

    [root@k8s-master ~]# kubectl apply -f node-affinity.yaml
    pod/node-affinity created
    [root@k8s-master ~]# kubectl get pod
    NAME            READY   STATUS    RESTARTS   AGE
    node-affinity   0/1     Pending   0          6s
    [root@k8s-master ~]#

kubectl describe pod shows that of the three nodes, one has a taint and the other two do not match the Pod's node affinity:

    [root@k8s-master ~]# kubectl describe pod node-affinity
    ...
    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  80s   default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
      Warning  FailedScheduling  80s   default-scheduler  0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
    [root@k8s-master ~]#

Now give k8s-node2 a label that satisfies the hard rule of the Pod we just created (the Pod is expected to be scheduled onto k8s-node2).

The Pod's status then changes from Pending to ContainerCreating:

    [root@k8s-master ~]# kubectl label node k8s-node2 project=makePeopleHappy
    node/k8s-node2 labeled
    [root@k8s-master ~]# kubectl get pod
    NAME            READY   STATUS              RESTARTS   AGE
    node-affinity   0/1     ContainerCreating   0          5m3s
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl get pod
    NAME            READY   STATUS    RESTARTS   AGE
    node-affinity   1/1     Running   0          6m
    [root@k8s-master ~]#

No node carries the label wanted by the soft rule, but that rule does not have to be satisfied, only satisfied if possible.

Soft policy in action:

Now delete the Pod, label both k8s-node1 and k8s-node2 with the label required by the hard rule, and additionally give k8s-node1 the label wanted by the soft rule, as in the commands below.

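The labelling step as commands (a hedged reconstruction; the keys and values are assumed from node-affinity.yaml above and the cleanup commands further down):

    kubectl label node k8s-node1 project=makePeopleHappy
    kubectl label node k8s-node2 project=makePeopleHappy
    kubectl label node k8s-node1 group=test1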

Then create the Pod again (it is expected to land on k8s-node1):

    [root@k8s-master ~]# kubectl apply -f node-affinity.yaml
    pod/node-affinity created
    [root@k8s-master ~]# kubectl get pod node-affinity -o wide
    NAME            READY   STATUS              RESTARTS   AGE   IP       NODE        NOMINATED NODE   READINESS GATES
    node-affinity   0/1     ContainerCreating   0          16s   <none>   k8s-node1   <none>           <none>
    [root@k8s-master ~]#

Remove the labels to restore the environment:

    [root@k8s-master ~]# kubectl label node k8s-node1 project-
    node/k8s-node1 labeled
    [root@k8s-master ~]# kubectl label node k8s-node2 project-
    node/k8s-node2 labeled
    [root@k8s-master ~]# kubectl label node k8s-node1 group-
    node/k8s-node1 labeled
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl get nodes --show-labels | egrep 'project|group'
    [root@k8s-master ~]#

@Taints and tolerations

Taints: keep Pods away from particular nodes.
Tolerations: allow Pods to be scheduled onto nodes that carry matching taints.
Use cases:
- Dedicated nodes
If you want to reserve some nodes for a specific group of users, add a taint to those nodes and add the matching toleration to that group's Pods.
- Nodes with special hardware
In a cluster where some nodes have special hardware (e.g. GPUs), we want Pods that do not need the hardware to stay off those nodes, keeping the resources free for Pods that do need it later.
- Taint-based eviction
Per-Pod configuration of the eviction behaviour when a node runs into problems.

1. Taints

Add a taint to a node: kubectl taint node <node-name> <key>=<value>:<effect>
For example: kubectl taint node k8s-node1 gpu=yes:NoSchedule
View a node's taints:  kubectl describe node <node-name> | grep Taint
Possible values for effect:
NoSchedule: Pods will not be scheduled onto the node.
PreferNoSchedule: the scheduler tries to avoid the node, but this is not enforced.
NoExecute: Pods will not be scheduled onto the node, and Pods already running on it are evicted (illustrated below).
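A hedged illustration of NoExecute (node and key names are made up; do not run this on a busy cluster):

    # Pods already on k8s-node2 that lack a matching toleration get evicted
    kubectl taint node k8s-node2 maintenance=true:NoExecute
    kubectl get pod -o wide --watch
    # Clean up afterwards
    kubectl taint node k8s-node2 maintenance=true:NoExecute-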

Remove a taint from a node: kubectl taint node <node-name> <key>:<effect>-

    [root@k8s-master ~]# kubectl taint node k8s-node1 gpu=yes:NoSchedule
    node/k8s-node1 tainted
    [root@k8s-master ~]# kubectl describe node k8s-node1 | grep Taint
    Taints:             gpu=yes:NoSchedule
    [root@k8s-master ~]#

Then try creating Pods through a Deployment. test-taint-deploy.yaml looks like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: web
      name: web
      namespace: default
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - image: nginx
            name: nginx

All four replicas avoid the tainted k8s-node1:

    [root@k8s-master ~]# kubectl apply -f test-taint-deploy.yaml
    deployment.apps/web created
    [root@k8s-master ~]# kubectl get pod -o wide
    NAME                  READY   STATUS              RESTARTS   AGE   IP       NODE        NOMINATED NODE   READINESS GATES
    web-96d5df5c8-gmspn   0/1     ContainerCreating   0          10s   <none>   k8s-node2   <none>           <none>
    web-96d5df5c8-hh86q   0/1     ContainerCreating   0          10s   <none>   k8s-node2   <none>           <none>
    web-96d5df5c8-n4587   0/1     ContainerCreating   0          10s   <none>   k8s-node2   <none>           <none>
    web-96d5df5c8-wrmxv   0/1     ContainerCreating   0          10s   <none>   k8s-node2   <none>           <none>
    [root@k8s-master ~]#

2. Tolerations

If you want a Pod to be schedulable onto nodes that carry taints, add a tolerations field to the Pod spec.

A toleration "matches" a taint if they have the same key and effect, and:
if operator is Exists, the toleration must not specify a value;
if operator is Equal, their values must be equal;
if key is empty and operator is Exists, the toleration matches any taint;
if effect is empty, the toleration matches all effects for the given key (see the sketch after this list).
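Minimal sketches of those rules (keys and values here are made up):

    tolerations:
    - key: gpu              # same key, operator Exists: tolerate any value of the gpu taint
      operator: Exists
      effect: NoSchedule
    - operator: Exists      # empty key + Exists: tolerate every taint
    - key: gpu              # empty effect: matches NoSchedule, PreferNoSchedule and NoExecute for key gpu
      operator: Equal
      value: "yes"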

Add tolerations to test-taint-deploy.yaml; the updated file looks like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: web
      name: web
      namespace: default
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - image: nginx
            name: nginx
          tolerations:
          - key: gpu
            operator: Equal
            value: "yes"
            effect: NoSchedule

Note: the value of tolerations.value here is yes, and it must be quoted ("yes"), otherwise the apply fails: unquoted yes/no are parsed as YAML booleans, while value must be a string. The earlier YAML files in this post were not careful about this; it is best to follow the official docs and quote every value that needs it.


Recreate the Deployment; this time the Pods no longer avoid the tainted k8s-node1:

    [root@k8s-master ~]# kubectl delete -f test-taint-deploy.yaml
    deployment.apps "web" deleted
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl apply -f test-taint-deploy.yaml
    deployment.apps/web created
    [root@k8s-master ~]# kubectl get pod -o wide
    NAME                  READY   STATUS    RESTARTS   AGE   IP               NODE        NOMINATED NODE   READINESS GATES
    web-6854887cb-68sgr   1/1     Running   0          41s   10.244.36.74     k8s-node1   <none>           <none>
    web-6854887cb-8627z   1/1     Running   0          41s   10.244.36.75     k8s-node1   <none>           <none>
    web-6854887cb-c74f4   1/1     Running   0          41s   10.244.36.80     k8s-node1   <none>           <none>
    web-6854887cb-hkbm2   1/1     Running   0          41s   10.244.169.145   k8s-node2   <none>           <none>
    [root@k8s-master ~]#

@Scheduling to a specific node

nodeName bypasses the scheduler entirely, so all of the constraints above stop applying.

test-fixednode-deploy.yaml looks like this. It pins the Pods with nodeName: k8s-node1, even though k8s-node1 is tainted and the file has no tolerations:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: web
      name: web
      namespace: default
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          nodeName: k8s-node1
          containers:
          - image: nginx
            name: nginx

All replicas land on the specified node, even though the node is tainted and the Pods have no toleration:

    [root@k8s-master ~]# kubectl apply -f test-fixednode-deploy.yaml
    deployment.apps/web created
    [root@k8s-master ~]#
    [root@k8s-master ~]# kubectl get pod -o wide
    NAME                   READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
    web-57495dbb4b-fsn2j   1/1     Running   0          65s   10.244.36.86   k8s-node1   <none>           <none>
    web-57495dbb4b-kjkx6   1/1     Running   0          65s   10.244.36.83   k8s-node1   <none>           <none>
    web-57495dbb4b-lmntn   1/1     Running   0          65s   10.244.36.85   k8s-node1   <none>           <none>
    web-57495dbb4b-tx96l   1/1     Running   0          65s   10.244.36.87   k8s-node1   <none>           <none>
    [root@k8s-master ~]#

Official documentation:

Assigning Pods to Nodes (nodeSelector, node affinity): https://kubernetes.io/zh/docs/concepts/scheduling-eviction/assign-pod-node/
Taints and Tolerations: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/taint-and-toleration/
