Why does Linux's scheduler put two threads onto the same physical core on a processor with Hyper-Threading?

Time: 2023-02-01 02:13:38

I've read in multiple places that Linux's default scheduler is hyperthreading-aware on multi-core machines, meaning that if you have a machine with 2 real cores (4 HT), it won't schedule two busy threads onto logical cores in such a way that they both run on the same physical core (which would lead to a 2x performance cost in many cases).

But when I run stress -c 2 (which spawns two threads spinning at 100% CPU) on my Intel i5-2520M, it often schedules (and keeps) the two threads on HT cores 1 and 2, which map to the same physical core, even when the system is otherwise idle.

This also happens with real programs (I'm using stress here because it makes the issue easy to reproduce), and when it happens, my program understandably takes twice as long to run. Setting the affinity manually with taskset fixes that for my program, but I'd expect an HT-aware scheduler to do that correctly by itself.

You can find the HT->physical core assignment with egrep "processor|physical id|core id" /proc/cpuinfo | sed 's/^processor/\nprocessor/g'.
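If lscpu from util-linux is available (an alternative reading of the same topology, not something from the original post), the mapping can also be shown as a table:

$ lscpu --extended=CPU,CORE,SOCKET   # one row per logical CPU; rows with equal CORE share a physical core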

So my question is: Why does the scheduler put my threads onto the same physical core here?



Notes:


  • This question is very similar to this other question, the answers to which say that Linux has quite a sophisticated, HT-aware thread scheduler. As described above, I cannot observe this behaviour (check for yourself with stress -c), and I would like to know why.
  • I know that I can set processor affinity manually for my programs, e.g. with the taskset tool or with the sched_setaffinity function (see the sketch after this list). This is not what I'm looking for; I would expect the scheduler to know by itself that mapping two busy threads to one physical core while leaving the other physical core completely empty is not a good idea.
  • I'm aware that there are some situations in which you would prefer threads to be scheduled onto the same physical core and leave the other core free, but it seems nonsensical for the scheduler to do that in roughly 1/4 of the cases. It seems to me that the HT cores it picks are completely random, or perhaps the HT cores that had the least activity at the time of scheduling, but that wouldn't be very hyperthreading-aware, given how clearly programs with the characteristics of stress benefit from running on separate physical cores.
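For completeness, this is the kind of manual pinning meant above (a sketch only; logical CPUs 0 and 2 are an assumption and only make sense if they really sit on different physical cores in your /proc/cpuinfo mapping):

$ taskset -c 0,2 stress -c 2     # launch with affinity restricted to logical CPUs 0 and 2
$ taskset -cp 0,2 <pid>          # or change the affinity of an already running process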

3 solutions

#1


6  

I think it's time to summarize some knowledge from the comments.

The Linux scheduler is aware of HyperThreading -- information about the topology should be read from the ACPI SRAT/SLIT tables, which are provided by the BIOS/UEFI -- and Linux then builds scheduler domains from that.

Domains form a hierarchy -- e.g. on a 2-CPU server you will get three layers of domains: all-cpus, per-cpu-package, and per-cpu-core. You can check this in /proc/schedstat:

$ awk '/^domain/ { print $1, $2; } /^cpu/ { print $1; }' /proc/schedstat
cpu0
domain0 0000,00001001     <-- all cpus from core 0
domain1 0000,00555555     <-- all cpus from package 0
domain2 0000,00ffffff     <-- all cpus in the system
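One way to verify that the lowest domain level really is the SMT/sibling level is to compare it with the kernel's topology files (the sched_domain sysctl tree below only exists on kernels built with CONFIG_SCHED_DEBUG and its location varies by version -- treat this as a sketch):

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list      # hyperthread siblings of CPU 0
$ grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name              # domain level names, if available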

Part of the CFS scheduler is the load balancer -- the beast that is supposed to steal tasks from your busy core and move them to another core. Here is its description from the kernel documentation:

While doing that, it checks to see if the current domain has exhausted its rebalance interval. If so, it runs load_balance() on that domain. It then checks the parent sched_domain (if it exists), and the parent of the parent and so forth.


Initially, load_balance() finds the busiest group in the current sched domain. If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in that group. If it manages to find such a runqueue, it locks both our initial CPU's runqueue and the newly found busiest one and starts moving tasks from it to our runqueue. The exact number of tasks amounts to an imbalance previously computed while iterating over this sched domain's groups.


From: https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt


You can monitor the load balancer's activity by comparing the numbers in /proc/schedstat. I wrote a script for doing that: schedstat.py

The alb_pushed counter shows that the load balancer successfully moved a task out:

Sun Apr 12 14:15:52 2015              cpu0    cpu1    ...    cpu6    cpu7    cpu8    cpu9    cpu10   ...
.domain1.alb_count                                    ...      1       1                       1  
.domain1.alb_pushed                                   ...      1       1                       1  
.domain2.alb_count                              1     ...                                         
.domain2.alb_pushed                             1     ...
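Without the script, a crude way to watch the same counters move (a sketch; the field meanings are documented in Documentation/scheduler/sched-stats.txt) is to take two snapshots and diff them:

$ cat /proc/schedstat > /tmp/schedstat.before       # snapshot the counters
$ sleep 10                                          # let the workload and the load balancer run
$ cat /proc/schedstat > /tmp/schedstat.after
$ diff /tmp/schedstat.before /tmp/schedstat.after   # changed lines show per-domain balancer activity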

However, the load balancer's logic is complex, so it is hard to determine what might stop it from doing its job well and how that relates to the schedstat counters. Neither I nor @thatotherguy can reproduce your issue.

I see two possibilities for that behavior:


  • You have some aggressive power-saving policy that tries to keep one core idle to reduce the CPU's power consumption (a quick check for this is sketched below).
  • You really did hit a bug in the scheduling subsystem, in which case you should go to LKML and carefully share your findings (including mpstat and schedstat data).
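For the first possibility, a quick thing to check (a sketch, assuming the cpufreq subsystem is in use; it does not cover every power-management knob) is which governor each logical CPU runs under and whether any logical CPUs have been taken offline:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # e.g. ondemand, powersave, performance
$ cat /sys/devices/system/cpu/cpu*/online                     # 1 = online, 0 = offlined (cpu0 may have no such file)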

#2


4  

I'm unable to reproduce this on 3.13.0-48 with my Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.


I have 6 cores with hyperthreading, where logical core N maps to physical core N mod 6.

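A quick way to confirm that kind of logical-to-physical mapping (not part of the original answer, just the standard sysfs topology files) is:

$ grep . /sys/devices/system/cpu/cpu*/topology/core_id                 # physical core id of each logical CPU
$ grep . /sys/devices/system/cpu/cpu*/topology/physical_package_id     # socket of each logical CPU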

Here's a typical output of top with stress -c 4 in two columns, so that each row is one physical core (I left out a few cores because my system is not idle):


%Cpu0  :100.0 us,   %Cpu6  :  0.0 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  5.9 us,   %Cpu8  :  2.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  5.7 us, 
%Cpu4  :  3.9 us,   %Cpu10 :  3.8 us, 
%Cpu5  :  0.0 us,   %Cpu11 :100.0 us, 

Here it is after killing and restarting stress:


%Cpu0  :100.0 us,   %Cpu6  :  2.6 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  0.0 us,   %Cpu8  :  0.0 us, 
%Cpu3  :  2.6 us,   %Cpu9  :  0.0 us, 
%Cpu4  :  0.0 us,   %Cpu10 :100.0 us, 
%Cpu5  :  2.6 us,   %Cpu11 :100.0 us, 

I did this several times, and did not see any instances where 4 threads across 12 logical cores would schedule on the same physical core.

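For anyone who wants to repeat the experiment, a rough repro loop might look like this (a sketch, assuming stress and procps ps are installed; psr is the logical CPU a thread last ran on):

# restart stress a number of times and record where the workers land
for i in $(seq 1 20); do
    stress -c 2 -t 5 &                 # two busy threads for 5 seconds
    sleep 2                            # give the scheduler time to settle
    ps -C stress -L -o psr=,comm=      # one line per thread: logical CPU and command
    wait                               # wait for this stress run to finish
done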

With -c 6 I tend to get results like this, where Linux appears to be helpfully scheduling other processes on their own physical cores. Even so, they're distributed way better than chance:


%Cpu0  : 18.2 us,   %Cpu6  :  4.5 us, 
%Cpu1  :  0.0 us,   %Cpu7  :100.0 us, 
%Cpu2  :100.0 us,   %Cpu8  :100.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  0.0 us, 
%Cpu4  :100.0 us,   %Cpu10 :  0.0 us, 
%Cpu5  :100.0 us,   %Cpu11 :  0.0 us, 

#3


-2  

Given your experience with two other processors that seemed to work correctly, the i7-2600 and the Xeon E5-1620: this could be a long shot, but how about a CPU microcode update? If the problem is internal CPU behaviour, an update might include something that fixes it.

Intel CPU Microcode Downloads: http://intel.ly/1aku6ak


Also see here: https://wiki.archlinux.org/index.php/Microcode

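Before and after such an update it is worth noting which microcode revision is actually loaded (a sketch using standard interfaces, not part of the original answer):

$ grep -m1 microcode /proc/cpuinfo   # microcode revision as reported by the kernel
$ dmesg | grep -i microcode          # kernel log lines about microcode loading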
