Open Source: Demystifying the OpenCloudOS Kernel Scheduler Features

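The scheduler has to cope with extreme scenarios and many different workload models, and no single policy can cover them all. The kernel therefore exposes a set of scheduler features that can be switched per workload, so that the best policy can be chosen for each business model and the scheduler stays adaptable.

This article walks through what each feature does and when to use it, so that you can tune performance for a specific workload by adjusting these switches.
Scheduler feature analysis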

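Running cat /sys/kernel/debug/sched_features shows which scheduler features the current kernel supports and whether each one is enabled.
(Figure: sample output of cat /sys/kernel/debug/sched_features)
A name with the NO_ prefix means the feature is disabled; a name without the prefix means it is enabled.
The kernel code is located in:
kernel/sched/features.h
This header defines every feature the kernel supports.
Usage:
Enable a scheduler feature:
echo WAKEUP_PREEMPTION > /sys/kernel/debug/sched_features
Disable a scheduler feature:
echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched_features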
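The NO_ prefix convention can also be checked from a program. The following is a minimal sketch, assuming the content of the sched_features file has already been read into a string of whitespace-separated names; the helper name sched_feat_state is hypothetical and is not a kernel or libc API.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: look up `name` in a whitespace-separated feature
 * list such as the content of /sys/kernel/debug/sched_features.
 * Returns 1 if the feature is listed as enabled, 0 if it is listed with
 * the NO_ prefix, and -1 if it is not listed at all.
 */
int sched_feat_state(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p) {
        /* Skip separators between tokens. */
        while (*p == ' ' || *p == '\n' || *p == '\t')
            p++;
        if (!*p)
            break;

        /* Find the end of the current token. */
        const char *end = p;
        while (*end && *end != ' ' && *end != '\n' && *end != '\t')
            end++;
        size_t len = (size_t)(end - p);

        /* Exact match: the feature is enabled. */
        if (len == name_len && strncmp(p, name, len) == 0)
            return 1;
        /* "NO_" followed by the name: the feature is disabled. */
        if (len == name_len + 3 && strncmp(p, "NO_", 3) == 0 &&
            strncmp(p + 3, name, name_len) == 0)
            return 0;

        p = end;
    }
    return -1;
}
```

Matching whole tokens rather than substrings matters here, because several feature names share suffixes (for example NEXT_BUDDY and LAST_BUDDY).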
GENTLE_FAIR_SLEEPERS
This feature caps the vruntime credit given to a waking sleeper at 50% of sysctl_sched_latency, which reduces scheduling latency for the other tasks. It is enabled by default.
With the feature disabled, a woken task gets more run time, but the other tasks on the runqueue see larger scheduling latency.

/*
 * Only give sleepers 50% of their service deficit. This allows
 * them to run sooner, but does not allow tons of sleepers to
 * rip the spread apart.
 */
SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     *
     * Add one extra virtual slice to the vruntime of a newly created
     * task, so that it is expected to wait until the next scheduling
     * period before it runs.
     */
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

    /* sleeps up to a single latency don't count. */
    if (!initial) {
        /*
         * The default scheduling latency (also called the scheduling
         * period) used when few tasks are on the rq.
         */
        unsigned long thresh = sysctl_sched_latency;

        /*
         * Halve their sleep time's effect, to allow
         * for a gentler effect of sleepers:
         *
         * Cap the sleeper's credit at half the scheduling latency.
         * This keeps a task that slept for a long time from getting
         * an overly long run after wakeup, which would show up as
         * latency spikes for the other tasks on the runqueue.
         */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;

        vruntime -= thresh;
    }

    /* ensure we never gain time by being placed backwards. */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}
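To make the effect of the halved threshold concrete, here is a minimal user-space sketch of the wakeup branch of place_entity(). It is a simplified model, not kernel code: the function name place_woken_vruntime is hypothetical, all values are plain nanoseconds, and the wraparound handling that the kernel's max_vruntime() provides is replaced by an ordinary comparison.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the !initial path of place_entity().
 * se_vruntime:   vruntime the task had when it went to sleep
 * min_vruntime:  current min_vruntime of the runqueue
 * sched_latency: stand-in for sysctl_sched_latency
 */
uint64_t place_woken_vruntime(uint64_t se_vruntime, uint64_t min_vruntime,
                              uint64_t sched_latency, int gentle_fair_sleepers)
{
    uint64_t thresh = sched_latency;
    uint64_t vruntime;

    if (gentle_fair_sleepers)
        thresh >>= 1;   /* cap the credit at half the latency */

    /* Guard against underflow in this unsigned model. */
    vruntime = min_vruntime > thresh ? min_vruntime - thresh : 0;

    /* Never move a task backwards: keep the larger of the two values. */
    return se_vruntime > vruntime ? se_vruntime : vruntime;
}
```

With min_vruntime at 100 ms and a 6 ms latency, a long sleeper is placed 3 ms behind min_vruntime when the feature is on and 6 ms behind when it is off, so the other tasks wait less in the first case.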

START_DEBIT
START_DEBIT increases the vruntime of a newly created task so that it is expected to get its first run only in the next scheduling period, sharing CPU time fairly with the existing tasks. This prevents a process from grabbing extra CPU time by repeatedly calling fork + exec (in effect an attack) and starving other tasks. It is enabled by default.

/*
 * Place new tasks ahead so that they do not starve already running
 * tasks
 */
SCHED_FEAT(START_DEBIT, true)

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     *
     * Add one extra virtual slice to the vruntime of a newly created
     * task, so that it is expected to wait until the next scheduling
     * period before it runs.
     */
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

    /* The !initial (sleeper) branch is the same as in the GENTLE_FAIR_SLEEPERS listing. */
    …………

    /* ensure we never gain time by being placed backwards. */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}
NEXT_BUDDY
next and last are two "back doors" the scheduler keeps so that certain tasks get a chance to be scheduled first.
NEXT_BUDDY controls whether the wakeup preemption check unconditionally marks the woken task as the next buddy. It is disabled by default, in which case the task is marked only after it passes the preemption granularity check.
Enabling it gives the woken task a chance to be considered first (only a chance: whether it actually runs still depends on its vruntime), at the cost of extra work when picking the next task.

/*
 * Prefer to schedule the task we woke last (assuming it failed
 * wakeup-preemption), since its likely going to consume data we
 * touched, increases cache locality.
 */
SCHED_FEAT(NEXT_BUDDY, false)

/*
 * Preempt the current task with a newly woken task if needed:
 *
 * next/last are chosen whenever running them would not be too unfair.
 * They have the same basic priority: the kernel compares next, last and
 * the first se picked from the red-black tree, so both can run ahead of
 * the tree order (which one actually runs depends on each se->vruntime).
 * All else being equal, next ranks slightly above last.
 *
 * last: set mainly in check_preempt_wakeup(). If curr is preempted by
 *       pse, the kernel sets cfs_rq->last = curr so that the preempted
 *       task is preferred at the next pick. Because the task was
 *       preempted, picking it again soon preserves locality.
 *
 * next: set mainly on task wakeup, on se dequeue and on a voluntary
 *       yield (cfs_rq->next = se), so that pick_next_task_fair()
 *       considers next together with last. next therefore marks a task
 *       that the scheduling policy wants to run first.
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se;
    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
    int scale = cfs_rq->nr_running >= sched_nr_latency;
    int next_buddy_marked = 0;

    if (unlikely(se == pse))
        return;

    /*
     * This is possible from callers such as attach_tasks(), in which we
     * unconditionally check_prempt_curr() after an enqueue (which may have
     * lead to a throttle). This both saves work and prevents false
     * next-buddy nomination below.
     */
    if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
        return;

    /*
     * With NEXT_BUDDY enabled, if the runqueue is busy (at least 8
     * tasks) and the task is not newly forked, pse becomes the next
     * buddy regardless of whether the checks that follow succeed.
     */
    if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
        set_next_buddy(pse);
        next_buddy_marked = 1;
    }
    …………
}
LAST_BUDDY
LAST_BUDDY controls whether a preempted task is marked as the last buddy, so that the next pick_next_task gives it preferential consideration (at a lower priority than next).
It is enabled by default, which lets a preempted task get back on the CPU as soon as is appropriate.

/*
 * Prefer to schedule the task that ran last (when we did
 * wake-preempt) as that likely will touch the same data, increases
 * cache locality.
 */
SCHED_FEAT(LAST_BUDDY, true)

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    ………………
    /*
     * Reaching this point means rq->curr is about to be preempted. With
     * LAST_BUDDY enabled, se is stored in cfs_rq->last so that later
     * picks prefer it (at a lower priority than cfs_rq->next).
     */
    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
        set_last_buddy(se);
    ………………
}
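How next and last interact at pick time can be sketched as follows. This is a simplified user-space model under stated assumptions: a buddy is accepted when its vruntime is at most one granularity ahead of the leftmost entity, the granularity is a fixed number rather than the weight-scaled value the kernel computes, and the skip buddy is ignored. The names pick_with_buddies and buddy_ok are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum pick { PICK_LEFT, PICK_LAST, PICK_NEXT };

/* A buddy may run if it is not more than `gran` ahead of the leftmost entity. */
static int buddy_ok(int64_t buddy_vruntime, int64_t left_vruntime, int64_t gran)
{
    return buddy_vruntime - left_vruntime <= gran;
}

/*
 * Choose between the leftmost entity of the tree and the two buddies.
 * has_last / has_next say whether the corresponding buddy is set.
 */
enum pick pick_with_buddies(int64_t left, int has_last, int64_t last,
                            int has_next, int64_t next, int64_t gran)
{
    enum pick choice = PICK_LEFT;

    /* Prefer the preempted task if that is not too unfair. */
    if (has_last && buddy_ok(last, left, gran))
        choice = PICK_LAST;

    /* next is checked afterwards, so it overrides last when both qualify. */
    if (has_next && buddy_ok(next, left, gran))
        choice = PICK_NEXT;

    return choice;
}
```

Checking next after last is what gives next its slightly higher priority, while the granularity test is why a buddy only gets a chance rather than a guarantee.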

CACHE_HOT_BUDDY
CACHE_HOT_BUDDY makes load balancing take the cache affinity of a migration candidate into account. A task that is the next or last buddy on its runqueue probably still has a warm local cache, so the balancer tries not to move it to another CPU.
It is enabled by default.

/*
 * Consider buddies to be cache hot, decreases the likelyness of a
 * cache buddy being migrated away, increases cache locality.
 */
SCHED_FEAT(CACHE_HOT_BUDDY, true)

static int task_hot(struct task_struct *p, struct lb_env *env)
{
    s64 delta;

    lockdep_assert_held(&env->src_rq->lock);

    if (p->sched_class != &fair_sched_class)
        return 0;

    if (unlikely(task_has_idle_policy(p)))
        return 0;

    /*
     * Buddy candidates are cache hot:
     *
     * If dst_rq (the runqueue the task would be moved to) is not
     * empty, cache hotness matters: when p is the next or last buddy
     * it is reported as hot, which steers the balancer away from
     * migrating it.
     */
    if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
        (&p->se == cfs_rq_of(&p->se)->next ||
         &p->se == cfs_rq_of(&p->se)->last))
        return 1;

    if (sysctl_sched_migration_cost == -1)
        return 1;

    if (sysctl_sched_migration_cost == 0)
        return 0;

    delta = rq_clock_task(env->src_rq) - p->se.exec_start;

    return delta < (s64)sysctl_sched_migration_cost;
}

WAKEUP_PREEMPTION
WAKEUP_PREEMPTION means that when a task is woken and enqueued, it is checked against cfs_rq->curr for preemption; if the conditions are met, it preempts the task currently running on that runqueue. This gives woken tasks scheduling priority and reduces their scheduling latency. It is enabled by default.

/*
 * Allow wakeup-time preemption of the current task:
 */
SCHED_FEAT(WAKEUP_PREEMPTION, true)

static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    …………
    /*
     * Batch and idle tasks do not preempt non-idle tasks (their preemption
     * is driven by the tick):
     *
     * 1. Only SCHED_NORMAL tasks go through the wakeup preemption check
     *    (SCHED_IDLE and SCHED_BATCH tasks cannot preempt on wakeup).
     * 2. Wakeup preemption is allowed only when WAKEUP_PREEMPTION is on.
     */
    if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
        return;

    ………..
}

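HRTICK
The O(1) scheduler (used by Linux before 2.6.23, when CFS was merged) had a number of problems, one of which was scheduling precision. Leaving wakeup preemption aside, it used the system tick as its preemption checkpoint: in every tick interrupt handler the kernel checked whether the running task had used up its time slice and, if so, switched to the next task. Seen this way, the tick resolution determines the scheduling latency. Early kernels used HZ=100, that is 100 ticks per second, so the theoretical scheduling latency was 10 ms. That was perfectly acceptable when machines were slow. As hardware got faster and workloads demanded better responsiveness, HZ=100 was no longer enough, and the default was raised to 250 (1000 on some architectures).
This change has a cost: a larger HZ means more frequent ticks and more system overhead, because besides the scheduling check each tick also updates wall time, checks the timer list, updates per-task cputime and so on. CFS therefore introduced the HRTICK mechanism. The kernel keeps one hrtimer per CPU, and at pick_next_task it programs that hrtimer to fire when the chosen task's slice ends. Switch precision then no longer depends on the tick and approaches nanosecond resolution (bounded by the precision of the hardware timer).

HRTICK adds interrupt overhead of its own (plus frequent timer operations on enqueue/dequeue), which can be significant when many tasks are runnable. For throughput reasons the kernel disables it by default.

If you need better scheduling responsiveness, consider enabling it, at a possible cost in throughput (responsiveness and throughput tend to pull in opposite directions).
SCHED_FEAT(HRTICK, false)

/*
 * Use hrtick when:
 *  - enabled by features
 *  - hrtimer is actually high res
 */
static inline int hrtick_enabled(struct rq *rq)
{
    if (!sched_feat(HRTICK))
        return 0;

    if (!cpu_active(cpu_of(rq)))
        return 0;

    return hrtimer_is_hres_active(&rq->hrtick_timer);
}

static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
    struct sched_entity *se = &p->se;
    struct cfs_rq *cfs_rq = cfs_rq_of(se);

    SCHED_WARN_ON(task_rq(p) != rq);

    /*
     * When more than one task is runnable on this CFS runqueue, take
     * the slice computed for the chosen task (derived from its weight)
     * and the time it has already consumed, and work out when the
     * slice ends. This gives near-nanosecond switch precision
     * (bounded by the precision of the hardware timer).
     */
    if (rq->cfs.h_nr_running > 1) {
        u64 slice = sched_slice(cfs_rq, se);
        u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
        s64 delta = slice - ran;

        if (delta < 0) {
            if (rq->curr == p)
                resched_curr(rq);
            return;
        }
        hrtick_start(rq, delta);
    }
}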
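The arithmetic behind the tick period and the hrtick delay can be sketched in a few lines. This is an illustrative user-space model, not kernel code; the function names tick_period_ns and hrtick_delta_ns are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Length of one periodic tick in nanoseconds for a given HZ. */
int64_t tick_period_ns(int hz)
{
    return NSEC_PER_SEC / hz;
}

/*
 * Time left in the current slice, mirroring "delta = slice - ran" in
 * hrtick_start_fair(). A negative result corresponds to the delta < 0
 * case, where the task is rescheduled instead of arming the timer.
 */
int64_t hrtick_delta_ns(uint64_t slice_ns, uint64_t sum_exec_ns,
                        uint64_t prev_sum_exec_ns)
{
    uint64_t ran = sum_exec_ns - prev_sum_exec_ns;

    return (int64_t)slice_ns - (int64_t)ran;
}
```

For example, with HZ=250 the periodic tick fires every 4 ms, whereas a task with a 3 ms slice that has already run 1.5 ms gets a timer armed 1.5 ms ahead, independent of where the next tick falls.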

DOUBLE_TICK
DOUBLE_TICK works together with HRTICK. When HRTICK is in use, entity_tick has no need to run check_preempt_tick. The kernel nevertheless provides the DOUBLE_TICK switch: when it is true, the preemption check runs both in the hrtick and in the regular tick (hence the name DOUBLE); when it is false, the check runs only in the hrtick, provided HRTICK is enabled. DOUBLE_TICK defaults to false.
NONTASK_CAPACITY
NONTASK_CAPACITY means that the CPU time consumed by IRQs is subtracted when computing CPU capacity. CPU capacity here is what remains of a CPU after DL, RT and IRQ usage. When CFS balances load it has to account for the CPU consumed by higher-priority scheduling classes and by interrupts; the capacity is what is left for CFS once those are removed, and a more accurate figure yields better balancing decisions.

Concretely, the feature makes CFS account for the PELT utilization of IRQs (this takes effect only with CONFIG_HAVE_SCHED_AVG_IRQ enabled). It is enabled by default.

/*
 * Decrement CPU capacity based on time not spent running tasks
 */
SCHED_FEAT(NONTASK_CAPACITY, true)

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
    ………..

    if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
        update_irq_load_avg(rq, irq_delta + steal);

    ……….
}

static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
{
    struct rq *rq = cpu_rq(cpu);
    unsigned long max = arch_scale_cpu_capacity(cpu);
    unsigned long used, free;
    unsigned long irq;

    irq = cpu_util_irq(rq);

    if (unlikely(irq >= max))
        return 1;

    used = READ_ONCE(rq->avg_rt.util_avg);
    used += READ_ONCE(rq->avg_dl.util_avg);

    if (unlikely(used >= max))
        return 1;

    free = max - used;

    /* Remove the PELT utilization of IRQs when computing CPU capacity. */
    return scale_irq_capacity(free, irq, max);
}
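A worked example helps to see what is left for CFS. The sketch below is a user-space model of the computation in scale_rt_capacity(), with all inputs passed as plain numbers. It assumes that scale_irq_capacity() scales the free capacity by the fraction of time not spent in IRQ, that is free * (max - irq) / max; that formula is stated here as an assumption rather than quoted from the listing, and the function name cfs_capacity is hypothetical.

```c
#include <assert.h>

/*
 * Capacity left for CFS on a CPU whose full capacity is `max`,
 * given the utilization of RT tasks, DL tasks and IRQs.
 */
unsigned long cfs_capacity(unsigned long max, unsigned long rt_util,
                           unsigned long dl_util, unsigned long irq_util)
{
    unsigned long used, free_cap;

    /* IRQs alone saturate the CPU: report the minimum capacity. */
    if (irq_util >= max)
        return 1;

    used = rt_util + dl_util;
    if (used >= max)
        return 1;

    free_cap = max - used;

    /* Assumed behavior of scale_irq_capacity(): scale by non-IRQ time. */
    return free_cap * (max - irq_util) / max;
}
```

With max = 1024, RT = 100, DL = 24 and IRQ = 256, the free capacity is 900 and the result is 900 * 768 / 1024 = 675, so an interrupt-heavy CPU looks noticeably smaller to the load balancer.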
TTWU_QUEUE
TTWU_QUEUE makes the kernel queue a task being woken on the remote CPU: the task is added to the remote CPU's wake_list and an IPI tells that CPU to carry out the wakeup. The goal is to reduce the cacheline ping-pong caused by lock contention between cores, which helps performance. The kernel developers also found that too many IPIs hurt system performance, so a later patch restricted TTWU_QUEUE to CPUs that do not share an LLC. The feature is enabled by default.

/*
 * Queue remote wakeups on the target CPU and process them
 * using the scheduler IPI. Reduces rq->lock contention/bounces.
 */
SCHED_FEAT(TTWU_QUEUE, true)

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
    struct rq *rq = cpu_rq(cpu);
    struct rq_flags rf;

    /*
     * Because too many interrupts degrade system performance, the
     * kernel uses !cpus_share_cache(smp_processor_id(), cpu) to limit
     * how often TTWU_QUEUE IPIs are sent. Seen another way, tasks
     * under the same LLC lose less to cache bounces caused by lock
     * contention, so applying TTWU_QUEUE only across LLCs is
     * reasonable.
     */
    if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
        sched_clock_cpu(cpu); /* Sync clocks across CPUs */
        ttwu_queue_remote(p, cpu, wake_flags);
        return;
    }

    rq_lock(rq, &rf);
    update_rq_clock(rq);
    ttwu_do_activate(rq, p, wake_flags, &rf);
    rq_unlock(rq, &rf);
}
SIS_AVG_CPU
SIS_AVG_CPU was meant to cut the idle-CPU search cost based on avg_idle, but the mechanism has a flaw: it can skip the search for an idle CPU entirely, which concentrates load. The 5.12 kernel removed it and left cost balancing to SIS_PROP.
It is disabled by default in the 5.4 kernel and should not be enabled.

/*
 * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
 */
SCHED_FEAT(SIS_AVG_CPU, false)

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
    struct sched_domain *this_sd;
    u64 avg_cost, avg_idle;
    u64 time, cost;
    s64 delta;
    int this = smp_processor_id();
    int cpu, nr = INT_MAX, si_cpu = -1;

    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
    if (!this_sd)
        return -1;

    /*
     * Due to large variance we need a large fuzz factor; hackbench in
     * particularly is sensitive here.
     */
    avg_idle = this_rq()->avg_idle / 512;
    avg_cost = this_sd->avg_scan_cost + 1;

    /*
     * SIS_AVG_CPU compares the rq's average idle time (avg_idle) with
     * the cost of the for_each_cpu_wrap scan that follows. If the scan
     * is too expensive relative to the idle time, the rest of the
     * function is skipped to save search cost.
     *
     * Once enabled, this check is rather crude: select_idle_cpu may
     * not search at all, which is why 5.12 removed SIS_AVG_CPU.
     */
    if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
        return -1;

    ………………….
}
SIS_PROP
SIS_PROP limits the search cost of select_idle_cpu by capping the number of CPUs it scans. It is enabled by default.
When system CPU utilization is low (at most 50%) and the workload is sensitive to scheduling latency, consider disabling SIS_PROP: the wider search makes it more likely that a woken task finds an idle CPU, which reduces scheduling latency. This may cost some cache locality, so whether to change the setting depends on the workload itself.
SCHED_FEAT(SIS_PROP, true)

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
    …………….
    this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
    if (!this_sd)
        return -1;

    /*
     * Due to large variance we need a large fuzz factor; hackbench in
     * particularly is sensitive here.
     */
    avg_idle = this_rq()->avg_idle / 512;
    avg_cost = this_sd->avg_scan_cost + 1;

    /* See the SIS_AVG_CPU section for this check. */
    if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
        return -1;

    /*
     * SIS_PROP sets the maximum number of CPUs to scan from the ratio
     * between avg_idle (scaled by the domain span) and the average
     * scan cost. The minimum is 4.
     */
    if (sched_feat(SIS_PROP)) {
        u64 span_avg = sd->span_weight * avg_idle;
        if (span_avg > 4*avg_cost)
            nr = div_u64(span_avg, avg_cost);
        else
            nr = 4;
    }
    …………….
}
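The scan limit is easier to follow with numbers. The sketch below reproduces the arithmetic of the SIS_PROP branch in user space; the function name sis_prop_scan_limit is hypothetical, and the inputs stand in for sd->span_weight, this_rq()->avg_idle and this_sd->avg_scan_cost.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of CPUs select_idle_cpu() would be allowed to scan under
 * SIS_PROP, following the listing above.
 */
int sis_prop_scan_limit(unsigned int span_weight, uint64_t rq_avg_idle,
                        uint64_t avg_scan_cost)
{
    uint64_t avg_idle = rq_avg_idle / 512;   /* large fuzz factor */
    uint64_t avg_cost = avg_scan_cost + 1;   /* avoid dividing by zero */
    uint64_t span_avg = (uint64_t)span_weight * avg_idle;

    if (span_avg > 4 * avg_cost)
        return (int)(span_avg / avg_cost);

    return 4;   /* never scan fewer than 4 CPUs */
}
```

For a 16-CPU LLC domain, an avg_idle of 512000 ns and a scan cost of 999 ns give a limit of 16, so the whole domain may be scanned; with avg_idle ten times smaller the limit drops to the floor of 4.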
WARN_DOUBLE_CLOCK
Emits a warning when update_rq_clock is called more than once in the same place (a useless update). It is disabled by default.
RT_PUSH_IPI
An optimization for lock contention when pulling RT tasks. It is enabled by default.

/*
 * In order to avoid a thundering herd attack of CPUs that are
 * lowering their priorities at the same time, and there being
 * a single CPU that has an RT task that can migrate and is waiting
 * to run, where the other CPUs will try to take that CPUs
 * rq lock and possibly create a large contention, sending an
 * IPI to that CPU and let that CPU push the RT task to where
 * it should go may be a better scenario.
 */
SCHED_FEAT(RT_PUSH_IPI, true)

static void pull_rt_task(struct rq *this_rq)
{
    int this_cpu = this_rq->cpu, cpu;
    bool resched = false;
    struct task_struct *p;
    struct rq *src_rq;
    int rt_overload_count = rt_overloaded(this_rq);

    if (likely(!rt_overload_count))
        return;

    /*
     * Match the barrier from rt_set_overloaded; this guarantees that if we
     * see overloaded we must also see the rto_mask bit.
     */
    smp_rmb();

    /* If we are the only overloaded CPU do nothing */
    if (rt_overload_count == 1 &&
        cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
        return;

    /*
     * When many CPUs lower their priority at the same moment (their RT
     * tasks are switched to CFS in a batch) and only one CPU in the
     * system has an RT task that can migrate, pull_rt_task() runs
     * concurrently on all of them, much like a thundering herd, and
     * causes heavy lock contention. RT_PUSH_IPI instead sends an IPI
     * to the target CPU and lets that CPU push the RT task to a
     * suitable CPU, which reduces contention between cores.
     */
    if (sched_feat(RT_PUSH_IPI)) {
        tell_cpu_to_push(this_rq);
        return;
    }

    ………….
}

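RT_RUNTIME_SHARE
RT bandwidth, whether set on an RT task group or globally, limits RT utilization so that real-time tasks cannot take too much of a CPU. RT_RUNTIME_SHARE lets a CPU that has used up its quota borrow time from other CPUs so that its RT tasks can run longer; as a result, RT utilization on a single CPU can reach 100%.
It is enabled by default.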
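LB_MIN
During load balancing, tasks with load < 16 are skipped, that is, they are not migrated.
It is disabled by default, so every task is a candidate for load balancing. If the system runs mostly very light tasks, consider enabling it to avoid excessive migration.
SCHED_FEAT(LB_MIN, false)

static int detach_tasks(struct lb_env *env)
{
    ………….
    if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
        goto next;

    ………….
}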
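ATTACH_AGE_LOAD
When a task migrates between CPUs or between cgroups, its PELT values become inaccurate, because the PELT update timestamp on the new CPU differs from the one on the old CPU (although the two clock_pelt values normally differ by less than one tick). With ATTACH_AGE_LOAD, the migration path first decays the task's PELT against the previous cfs_rq, which keeps the values accurate. It is enabled by default and should not be disabled.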
WA_IDLE
WA_IDLE applies to the wake affine check: if the CPU that issues the wakeup is idle, the kernel considers moving the woken task to that CPU. It is enabled by default. If you do not want woken tasks to be migrated frequently by wake affine, consider disabling it, though in general it should stay on because it helps tasks make better use of CPU resources.
SCHED_FEAT(WA_IDLE, true)

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                       int this_cpu, int prev_cpu, int sync)
{
    int target = nr_cpumask_bits;

    if (sched_feat(WA_IDLE))
        target = wake_affine_idle(this_cpu, prev_cpu, sync);

    if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
        target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

    schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
    if (target == nr_cpumask_bits)
        return prev_cpu;

    schedstat_inc(sd->ttwu_move_affine);
    schedstat_inc(p->se.statistics.nr_wakeups_affine);
    return target;
}
WA_WEIGHT
WA_WEIGHT controls whether wake affine compares the CPU load of the waker CPU and the prev CPU when choosing between them.
It is enabled by default. To stop load-based wake affine decisions, disable it; wake affine then relies on the idle-CPU check alone.
SCHED_FEAT(WA_WEIGHT, true)

The relevant code is the sched_feat(WA_WEIGHT) branch of wake_affine(), listed in full in the WA_IDLE section.
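WA_BIAS
WA_BIAS builds on WA_WEIGHT: during the weight comparison it inflates the load of the prev CPU, which makes the kernel lean towards the waker CPU.
It is enabled by default; disable it if you do not want the kernel to favor the waker CPU.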
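The bias can be illustrated with a simplified comparison. The sketch below is a user-space model under several assumptions: both CPUs have equal capacity, the wakeup is not a sync wakeup, the bias is half of the domain's imbalance percentage above 100, and the value 117 used in the usage note is only an illustrative imbalance_pct. The name wake_affine_weight_sketch is hypothetical.

```c
#include <assert.h>

#define NO_TARGET (-1)   /* stands in for "no preference, stay on prev CPU" */

/*
 * Decide whether the waker CPU (this_cpu) is the better target.
 * this_load / prev_load: load of the waker CPU and of the prev CPU
 * task_load:             load of the task being woken
 */
int wake_affine_weight_sketch(long this_load, long prev_load, long task_load,
                              int imbalance_pct, int wa_bias, int this_cpu)
{
    long this_eff = this_load + task_load;   /* waker CPU after the move */
    long prev_eff = prev_load - task_load;   /* prev CPU after the move */

    if (wa_bias) {
        /* Scale both sides, inflating the prev CPU's load slightly. */
        this_eff *= 100;
        prev_eff *= 100 + (imbalance_pct - 100) / 2;
    }

    return this_eff < prev_eff ? this_cpu : NO_TARGET;
}
```

With equal loads of 1000 on both CPUs and a negligible task load, the biased comparison (100000 against 108000) picks the waker CPU, while the unbiased one (1000 against 1000) leaves the task where it was.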
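UTIL_EST
With plain PELT, a task's load swings widely as decay proceeds, which causes problems for load balancing. For example, a task that has run on a CPU for a long time has a large PELT load; after it sleeps for a while, that load decays to a small value. When the task runs again it needs some time to rebuild its PELT load, and during that time it is treated as a small task. To address this, the kernel added util_est to sched_avg. util_est tracks an exponentially smoothed utilization that is not decayed during sleep, so periodic load balancing can use it to compute a CPU's remaining capacity. A big task is then no longer misjudged because its load decayed while it slept, which would make load balancing inaccurate. UTIL_EST means that CPU capacity estimation uses the estimated utilization rather than the raw PELT value.

/*
 * UtilEstimation. Use estimated CPU utilization.
 */
SCHED_FEAT(UTIL_EST, true)

It is enabled by default.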
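The smoothing idea can be shown with a small model. The sketch below assumes that each new utilization sample is blended into the running estimate with a weight of 1/4 and that the larger of the PELT value and the estimate is used. Both rules are stated as assumptions for illustration, and the names util_est_ewma_update and estimated_util are hypothetical.

```c
#include <assert.h>

#define UTIL_EST_WEIGHT_SHIFT 2   /* assumed: a new sample weighs 1/4 */

/* Blend a new utilization sample into the exponentially weighted average. */
unsigned long util_est_ewma_update(unsigned long ewma, unsigned long sample)
{
    long diff = (long)sample - (long)ewma;
    long scaled = ((long)ewma << UTIL_EST_WEIGHT_SHIFT) + diff;

    /* scaled equals 3 * ewma + sample, so it is never negative. */
    return (unsigned long)(scaled >> UTIL_EST_WEIGHT_SHIFT);
}

/* Utilization used for capacity decisions: never below the estimate. */
unsigned long estimated_util(unsigned long util_avg, unsigned long util_est)
{
    return util_avg > util_est ? util_avg : util_est;
}
```

A task whose PELT utilization has decayed to 120 during sleep but whose estimate is still 700 is treated as a 700-unit task when it wakes, which is the behavior the paragraph above describes.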
