Kubernetes: source-code details of how the default k8s scheduler filters nodes by pod resource requests

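If asked whether you understand how k8s scheduling works, most people can talk at length.
Whether they truly understand the details is another matter.
Below is my complete walkthrough of reading the k8s scheduler source code.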

Official scheduling framework documentation: https://kubernetes.io/zh-cn/d…

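The default scheduler's source is at D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\scheduler.go
scheduleOne is the entry point for scheduling a single pod; inside it, sched.Algorithm.Schedule runs the scheduling algorithm.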

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // ... excerpt: the rest of the function body is omitted
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, sched.Extenders, fwk, state, pod)
}

The default Schedule method is at D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\generic_scheduler.go

Its doc comment says:

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.

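In other words, Schedule tries to pick one node from the given node list to run this pod.
On success, it returns the node's name.
On failure, it returns a FitError carrying the reasons.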

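Now the return value. The fields of the ScheduleResult struct are clearly named, so their purpose is obvious at a glance:

(result ScheduleResult, err error)
type ScheduleResult struct {
    // Name of the scheduler suggest host
    SuggestedHost string // the node chosen for the pod
    // Number of nodes scheduler evaluated on one pod scheduled
    EvaluatedNodes int // how many nodes took part in the evaluation
    // Number of feasible nodes on one pod scheduled
    FeasibleNodes int // how many nodes were found suitable
}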

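Next, the parameters:

(ctx context.Context, extenders []framework.Extender, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod)
ctx: the context
extenders: presumably the scheduler extenders, i.e. external scheduling extensions
fwk: the built-in scheduling framework object
state: the CycleState, which appears to cache data and results for the current scheduling cycle
pod: the target pod waiting to be scheduled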

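The core of Schedule is findNodesThatFitPod, called as feasibleNodes, diagnosis, err := g.findNodesThatFitPod(ctx, extenders, fwk, state, pod)
findNodesThatFitPod runs the prefilter plugins first and then the plugins in the filter plugin list

step01: run the prefilter plugins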

    // Run "prefilter" plugins.
    s := fwk.RunPreFilterPlugins(ctx, state, pod)
    allNodes, err := g.nodeInfoSnapshot.NodeInfos().List()
    if err != nil {
        return nil, diagnosis, err
    }

The loop that runs them looks like this:

func (f *frameworkImpl) RunPreFilterPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (status *framework.Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(preFilter, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
    }()
    for _, pl := range f.preFilterPlugins {
        status = f.runPreFilterPlugin(ctx, pl, state, pod)
        if !status.IsSuccess() {
            status.SetFailedPlugin(pl.Name())
            if status.IsUnschedulable() {
                return status
            }
            return framework.AsStatus(fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), status.AsError())).WithFailedPlugin(pl.Name())
        }
    }

    return nil
}
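
To make the control flow of that loop easier to see, here is a minimal, self-contained sketch. It is not the framework's code: Code, Status, PreFilterPlugin and fakePlugin are simplified stand-ins invented for this illustration, and the metrics and CycleState handling are left out. It only shows the behaviour visible in the excerpt above: plugins run in order, the first non-success status stops the loop, an unschedulable status is returned as-is, and any other failure is wrapped as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a simplified stand-in for the framework's status code (hypothetical).
type Code int

const (
	Success Code = iota
	Unschedulable
	Error
)

// Status is a simplified stand-in for framework.Status (hypothetical).
type Status struct {
	Code         Code
	FailedPlugin string
	Err          error
}

// PreFilterPlugin is a reduced interface: only Name and PreFilter.
type PreFilterPlugin interface {
	Name() string
	PreFilter(pod string) *Status
}

// runPreFilterPlugins runs the plugins in order and stops at the first failure.
// A nil status is treated as success.
func runPreFilterPlugins(plugins []PreFilterPlugin, pod string) *Status {
	for _, pl := range plugins {
		st := pl.PreFilter(pod)
		if st == nil || st.Code == Success {
			continue
		}
		st.FailedPlugin = pl.Name()
		if st.Code == Unschedulable {
			return st // the pod is rejected, but this is not an internal error
		}
		return &Status{
			Code:         Error,
			FailedPlugin: pl.Name(),
			Err:          fmt.Errorf("running PreFilter plugin %q: %w", pl.Name(), st.Err),
		}
	}
	return nil
}

// fakePlugin returns a fixed status; it exists only for the demo.
type fakePlugin struct {
	name   string
	status *Status
}

func (p fakePlugin) Name() string             { return p.name }
func (p fakePlugin) PreFilter(string) *Status { return p.status }

func main() {
	plugins := []PreFilterPlugin{
		fakePlugin{name: "A"},
		fakePlugin{name: "B", status: &Status{Code: Unschedulable}},
		fakePlugin{name: "C", status: &Status{Code: Error, Err: errors.New("boom")}},
	}
	st := runPreFilterPlugins(plugins, "demo-pod")
	// Plugin C never runs because B already rejected the pod.
	fmt.Println(st.FailedPlugin, st.Code == Unschedulable)
}
```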

At its core, this calls the PreFilter method of each PreFilterPlugin:

type PreFilterPlugin interface {
    Plugin
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) *Status
    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}

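Which PreFilter plugins are there by default? Searching the official documentation for prefilter
turns up 8 of them, for example NodePorts, NodeResourcesFit and VolumeBinding.
This roughly matches the PreFilter implementers listed in the IDE.

Take the PreFilter of NodeResourcesFit as an example, located at D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go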

func (f *Fit) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod) *framework.Status {
    cycleState.Write(preFilterStateKey, computePodResourceRequest(pod, f.enablePodOverhead))
    return nil
}

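As the method above shows, PreFilter only computes the pod's resource request and writes it to the cycle state in preparation for the later filtering.
The numbers come from computePodResourceRequest. There is no need to read its code; the comment makes its meaning clear:
it aggregates over the pod's init and app containers and takes the largest resource requirement.
Init and app containers are handled differently.
In the example from the comment, init containers run one after another, so the per-resource maximum among them is enough: 2c 3G.
App containers run at the same time, so their requests are summed: 3c 2G.
Finally, the per-resource max of the two is taken: 3c 3G.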

// computePodResourceRequest returns a framework.Resource that covers the largest
// width in each resource dimension. Because init-containers run sequentially, we collect
// the max in each dimension iteratively. In contrast, we sum the resource vectors for
// regular containers since they run simultaneously.
//
// If Pod Overhead is specified and the feature gate is set, the resources defined for Overhead
// are added to the calculated Resource request sum
//
// Example:
//
// Pod:
//   InitContainers
//     IC1:
//       CPU: 2
//       Memory: 1G
//     IC2:
//       CPU: 2
//       Memory: 3G
//   Containers
//     C1:
//       CPU: 2
//       Memory: 1G
//     C2:
//       CPU: 1
//       Memory: 1G
//
// Result: CPU: 3, Memory: 3G
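
The arithmetic in that comment can be reproduced with a short sketch. This is not the scheduler's implementation: Resource, Pod and podResourceRequest are simplified, hypothetical types and names that model only CPU and memory, and Pod Overhead is deliberately left out. The sketch sums the regular containers and then raises each dimension to the largest single init container.

```go
package main

import "fmt"

// Resource is a simplified stand-in for framework.Resource (hypothetical:
// only CPU and memory are modelled).
type Resource struct {
	MilliCPU int64 // CPU in millicores
	Memory   int64 // memory in bytes
}

// Pod carries just the per-container requests needed for the illustration.
type Pod struct {
	InitContainers []Resource
	Containers     []Resource
}

// G is one gigabyte, matching the "1G" notation in the comment's example.
const G int64 = 1000 * 1000 * 1000

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// podResourceRequest sums the regular containers (they run simultaneously),
// then takes the per-dimension max against each init container (they run
// sequentially). Pod overhead is not handled in this sketch.
func podResourceRequest(p Pod) Resource {
	var result Resource
	for _, c := range p.Containers {
		result.MilliCPU += c.MilliCPU
		result.Memory += c.Memory
	}
	for _, ic := range p.InitContainers {
		result.MilliCPU = maxInt64(result.MilliCPU, ic.MilliCPU)
		result.Memory = maxInt64(result.Memory, ic.Memory)
	}
	return result
}

func main() {
	// Same pod as in the comment: IC1 2c/1G, IC2 2c/3G, C1 2c/1G, C2 1c/1G.
	pod := Pod{
		InitContainers: []Resource{{2000, 1 * G}, {2000, 3 * G}},
		Containers:     []Resource{{2000, 1 * G}, {1000, 1 * G}},
	}
	r := podResourceRequest(pod)
	fmt.Printf("CPU: %dm, Memory: %dG\n", r.MilliCPU, r.Memory/G) // CPU: 3000m, Memory: 3G
}
```

Note that the result takes CPU from the regular containers (2c + 1c) and memory from the largest init container (3G), which is why the max is taken per dimension rather than per container group.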

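At this point you may be puzzled: the Fit PreFilter contains no code that filters nodes by resources.
The relevant logic is in fact in the filter plugin:
once findNodesThatFitPod has run all the prefilter plugins, it runs the filter plugins,
which for NodeResourcesFit means its Filter function.
Location: D:\go_path\src\github.com\kubernetes\kubernetes\pkg\scheduler\framework\plugins\noderesources\fit.go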

// Filter invoked at the filter extension point.
// Checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// It returns a list of insufficient resources, if empty, then the node has all the resources requested by the pod.
func (f *Fit) Filter(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    s, err := getPreFilterState(cycleState)
    if err != nil {
        return framework.AsStatus(err)
    }

    insufficientResources := fitsRequest(s, nodeInfo, f.ignoredResources, f.ignoredResourceGroups)

    if len(insufficientResources) != 0 {
        // We will keep all failure reasons.
        failureReasons := make([]string, 0, len(insufficientResources))
        for _, r := range insufficientResources {
            failureReasons = append(failureReasons, r.Reason)
        }
        return framework.NewStatus(framework.Unschedulable, failureReasons...)
    }
    return nil
}
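
The hand-off between PreFilter and Filter goes through the cycle state: PreFilter writes the computed request under a key, and Filter reads it back before checking a node. The sketch below illustrates that pattern only. cycleState, podRequest and the helper functions are toy stand-ins written for this example, and the key string is an assumption for illustration, not a value taken from the source above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// cycleState is a toy stand-in for framework.CycleState: a key/value store
// that lives for one scheduling cycle of one pod.
type cycleState struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newCycleState() *cycleState {
	return &cycleState{data: map[string]interface{}{}}
}

func (c *cycleState) Write(key string, v interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = v
}

func (c *cycleState) Read(key string) (interface{}, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

// preFilterStateKey is an illustrative key name (assumption).
const preFilterStateKey = "PreFilterNodeResourcesFit"

// podRequest is the value PreFilter stores (hypothetical, CPU and memory only).
type podRequest struct{ MilliCPU, Memory int64 }

// preFilter stores the already computed request for later use.
func preFilter(cs *cycleState, req podRequest) {
	cs.Write(preFilterStateKey, req)
}

// getPreFilterState is what a Filter step would call before checking a node.
func getPreFilterState(cs *cycleState) (podRequest, error) {
	v, err := cs.Read(preFilterStateKey)
	if err != nil {
		return podRequest{}, fmt.Errorf("reading %q from cycle state: %w", preFilterStateKey, err)
	}
	req, ok := v.(podRequest)
	if !ok {
		return podRequest{}, fmt.Errorf("%+v cannot be converted to podRequest", v)
	}
	return req, nil
}

func main() {
	cs := newCycleState()
	if _, err := getPreFilterState(cs); err != nil {
		fmt.Println("before PreFilter:", err)
	}
	preFilter(cs, podRequest{MilliCPU: 3000, Memory: 3000000000})
	req, _ := getPreFilterState(cs)
	fmt.Printf("after PreFilter: %+v\n", req)
}
```

Computing the request once and caching it means the per-node Filter calls do not have to walk the pod's containers again for every candidate node.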

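As the comment above shows, Filter checks whether a node has enough resources to satisfy the target pod's request.
The actual resource calculation is in fitsRequest.

Taking CPU as an example: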

  if podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU) {
      insufficientResources = append(insufficientResources, InsufficientResource{
          v1.ResourceCPU,
          "Insufficient cpu",
          podRequest.MilliCPU,
          nodeInfo.Requested.MilliCPU,
          nodeInfo.Allocatable.MilliCPU,
      })
  }
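
The comparison above can be tried out in isolation. The following is a reduced sketch, assuming only CPU and memory and using hypothetical Resource, NodeInfo and InsufficientResource types that mirror the field names in the excerpt; the real fitsRequest also handles other resources and the ignored-resource settings. The key point is that the check compares the pod's request against allocatable minus the requests already placed on the node, not against the node's measured usage.

```go
package main

import "fmt"

// Resource is a simplified resource vector (hypothetical).
type Resource struct{ MilliCPU, Memory int64 }

// NodeInfo keeps only the two fields used by the check.
type NodeInfo struct {
	Allocatable Resource // what the node can hand out to pods
	Requested   Resource // sum of the requests of pods already on the node
}

// InsufficientResource mirrors the fields filled in by the excerpt above.
type InsufficientResource struct {
	ResourceName string
	Reason       string
	Requested    int64
	Used         int64
	Capacity     int64
}

// fitsRequest returns one entry per resource the node cannot satisfy;
// an empty result means the pod fits.
func fitsRequest(podRequest Resource, node NodeInfo) []InsufficientResource {
	var insufficient []InsufficientResource
	if podRequest.MilliCPU > node.Allocatable.MilliCPU-node.Requested.MilliCPU {
		insufficient = append(insufficient, InsufficientResource{
			"cpu", "Insufficient cpu",
			podRequest.MilliCPU, node.Requested.MilliCPU, node.Allocatable.MilliCPU,
		})
	}
	if podRequest.Memory > node.Allocatable.Memory-node.Requested.Memory {
		insufficient = append(insufficient, InsufficientResource{
			"memory", "Insufficient memory",
			podRequest.Memory, node.Requested.Memory, node.Allocatable.Memory,
		})
	}
	return insufficient
}

func main() {
	node := NodeInfo{
		Allocatable: Resource{MilliCPU: 4000, Memory: 8},
		Requested:   Resource{MilliCPU: 2000, Memory: 2},
	}
	// 3000m does not fit into the 2000m still unrequested; 3 units of memory fit into 6.
	for _, r := range fitsRequest(Resource{MilliCPU: 3000, Memory: 3}, node) {
		fmt.Println(r.Reason) // Insufficient cpu
	}
}
```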

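What if several nodes satisfy the pod's resource request? It is simple: findNodesThatPassFilters can return multiple nodes,
which are then handed to the score plugins to rank and pick from.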

  feasibleNodes, err := g.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, allNodes)
  if err != nil {
      return nil, diagnosis, err
  }
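
The filter-then-score hand-off can be illustrated with a toy example. Everything here is hypothetical: the Node type, the CPU-only filter, and especially the score function, which simply prefers the node with the largest share of unrequested CPU left after placing the pod. It is not the default scheduler's scoring logic; it only shows that filtering may leave several candidates and that scoring then picks among them.

```go
package main

import "fmt"

// Node is a hypothetical, CPU-only view of a node.
type Node struct {
	Name           string
	AllocatableCPU int64 // millicores the node can hand out
	RequestedCPU   int64 // millicores already requested by placed pods
}

// filterNodes keeps the nodes whose unrequested CPU covers the pod's request.
func filterNodes(nodes []Node, podCPU int64) []Node {
	var feasible []Node
	for _, n := range nodes {
		if podCPU <= n.AllocatableCPU-n.RequestedCPU {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score is an illustrative scoring rule (assumption): percentage of CPU
// still unrequested after the pod is placed.
func score(n Node, podCPU int64) int64 {
	if n.AllocatableCPU == 0 {
		return 0
	}
	return (n.AllocatableCPU - n.RequestedCPU - podCPU) * 100 / n.AllocatableCPU
}

// pick filters first, then returns the highest-scoring feasible node.
func pick(nodes []Node, podCPU int64) (Node, bool) {
	feasible := filterNodes(nodes, podCPU)
	if len(feasible) == 0 {
		return Node{}, false
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(n, podCPU) > score(best, podCPU) {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []Node{
		{"node-a", 4000, 3500}, // filtered out: only 500m unrequested
		{"node-b", 8000, 2000}, // score 62
		{"node-c", 4000, 1000}, // score 50
	}
	best, ok := pick(nodes, 1000)
	fmt.Println(best.Name, ok) // node-b true
}
```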

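To summarize:
The PreFilter of NodeResourcesFit computes the pod's resource request, handling init and app containers differently.
At which stage does the default k8s scheduler filter for nodes that satisfy the pod's resources? The answer is the Filter function of NodeResourcesFit.
If filter returns several nodes, the score plugins rank them and one is selected.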

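A thought experiment: if you use the k8s scheduling framework to write a custom scheduler that only implements Filter and filters nodes by their real load, what goes wrong?
The answer: because the default NodeResourcesFit check is skipped, the pod may be rejected by the kubelet's admit step, with errors such as OutOfcpu or OutOfmemory.
The kubelet still validates the new pod's request against the resources already allocated on its node.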
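
A small numeric sketch shows how the two checks can disagree. It is an illustration under stated assumptions, not kubelet or scheduler code: the Node type, the UsedCPU field (standing in for some measured load) and both check functions are hypothetical. The node below is lightly loaded but almost fully requested, so a load-only filter accepts the pod while a request-based check of the kind described above rejects it.

```go
package main

import "fmt"

// Node is a hypothetical, CPU-only view of a node.
type Node struct {
	AllocatableCPU int64 // millicores the node can hand out
	RequestedCPU   int64 // millicores requested by pods already on the node
	UsedCPU        int64 // millicores actually in use (assumed to come from monitoring)
}

// passesLoadFilter is what a "real load" Filter might check (assumption).
func passesLoadFilter(n Node, podCPU int64) bool {
	return n.UsedCPU+podCPU <= n.AllocatableCPU
}

// passesRequestCheck is the request-based check: the pod's request must fit
// into what has not yet been requested on the node.
func passesRequestCheck(n Node, podCPU int64) bool {
	return podCPU <= n.AllocatableCPU-n.RequestedCPU
}

func main() {
	// Pods on this node requested 3500m but are mostly idle (500m in use).
	node := Node{AllocatableCPU: 4000, RequestedCPU: 3500, UsedCPU: 500}
	podCPU := int64(1000)

	fmt.Println("load filter passes:  ", passesLoadFilter(node, podCPU))   // true
	fmt.Println("request check passes:", passesRequestCheck(node, podCPU)) // false
}
```

One way to avoid the mismatch would be to keep the request-based check in place and apply the load-based condition as an additional filter rather than a replacement.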

Question Q1: at which stage does the default k8s scheduler filter for nodes that satisfy the pod's resources?

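Official scheduling framework documentation

01 When does the default scheduler filter nodes by the resource requests of the pod's containers

Analyzing the Schedule method

The method's return value

The method's parameters

The core of it is findNodesThatFitPod

step01: run the prefilter plugins

Which PreFilterPlugins are there by default

Picking one, the PreFilterPlugin of NodeResourcesFit

A puzzle: the fit prefilter contains no code that filters nodes by resources

From the comment above, Filter checks whether a node can satisfy the target pod's requested resources

What if several nodes satisfy the pod's resource request

Summary

Thought experiment

So how should a scheduler based on real load be written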
