How to configure the NUMA topology and CPU affinity of OpenStack instances


CPU topology

OpenStack's NUMA topology and CPU pinning features provide fine-grained control over how instances run on hypervisor CPUs and over the virtual CPU topology exposed to the instance. These features help minimize latency and maximize performance.

SMP, NUMA and SMT

Symmetric multiprocessing (SMP)

SMP is a design found in many modern multi-core systems. In an SMP system there are two or more CPUs connected by some interconnect, which gives every CPU equal access to system resources such as memory and input/output ports.

Non-uniform memory access (NUMA)

NUMA is a derivative of the SMP design found in many multi-socket systems. In a NUMA system, system memory is divided into cells or nodes associated with particular CPUs. Memory on other nodes can be requested over an interconnect bus, but bandwidth across this shared bus is limited, so contention for it can degrade performance.

Simultaneous multithreading (SMT)

SMT is a design complementary to SMP. Whereas CPUs in an SMP system share a bus and some memory, CPUs in an SMT system share many more components. CPUs that share components are known as thread siblings. All CPUs appear as usable CPUs on the system and can execute workloads in parallel, but, as with NUMA, the threads compete for shared resources.

Non-uniform I/O access (NUMA I/O)

In a NUMA system, I/O to a device mapped to a local memory region is more efficient than I/O to a remote device. A device connected to the same socket that provides the CPU and memory offers lower latency for I/O operations because of its physical proximity. This is typically seen with devices connected to the PCIe bus, such as NICs or vGPUs, but applies to any device that supports memory-mapped I/O.

In OpenStack, SMP CPUs are known as cores, NUMA cells or nodes are known as sockets, and SMT CPUs are known as threads. For example, a quad-socket, eight-core system with hyperthreading has four sockets, eight cores per socket and two threads per core, for a total of 64 CPUs.

How to enable CPU pinning

1. Configure vcpu_pin_set

Defines which physical CPUs (pCPUs) can be used by instance virtual CPUs (vCPUs).

Possible values:

  • A comma-separated list of physical CPU numbers that virtual CPUs can be allocated to by default. Each element should be either a single CPU number, a range of CPU numbers, or a caret followed by a CPU number to be excluded from a previous range. For example:

vcpu_pin_set = "4-12,^8,15"

Carving out pCPUs dedicated to instances guarantees their performance; it also prevents guests from over-contending with host processes for CPU time, leaving some CPUs to the host so that it keeps running normally.
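For example, in each compute node's nova.conf (a minimal sketch reusing the value above; the CPU range is illustrative and must match the host's actual CPU numbering):

[DEFAULT]

vcpu_pin_set = 4-12,^8,15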

2. Enable NUMATopologyFilter

In the nova-scheduler service configuration, add NUMATopologyFilter to enabled_filters under the [filter_scheduler] section.
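For example (a sketch only; keep whatever filters your deployment already enables and append NUMATopologyFilter to them):

[filter_scheduler]

enabled_filters = AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter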

NUMA topology policy

To restrict an instance's vCPUs to a single host NUMA node, run:

--property hw:numa_nodes=1

Some workloads have very demanding memory access latency or bandwidth requirements that exceed the memory bandwidth available from a single NUMA node. For such workloads it is beneficial to spread the instance across multiple host NUMA nodes, even if the instance's RAM/vCPUs could theoretically fit on a single NUMA node.

To force an instance's vCPUs to spread across two host NUMA nodes, run:

--property hw:numa_nodes=2

To configure an asymmetric CPU/memory split across the nodes, specify it explicitly in the flavor.

To spread the 6 vCPUs and 6 GB of memory of an instance across two NUMA nodes and create an asymmetric 1:2 vCPU and memory mapping between the two nodes, run:

--property hw:numa_nodes=2

--property hw:numa_cpus.0=0,1 --property hw:numa_mem.0=2048

--property hw:numa_cpus.1=2,3,4,5 --property hw:numa_mem.1=4096
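Putting these properties together, a complete command might look like the following (m1.numa is a hypothetical flavor name):

$ openstack flavor set m1.numa \
  --property hw:numa_nodes=2 \
  --property hw:numa_cpus.0=0,1 --property hw:numa_mem.0=2048 \
  --property hw:numa_cpus.1=2,3,4,5 --property hw:numa_mem.1=4096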

CPU pinning policy

To configure a flavor to use pinned vCPUs, use a dedicated CPU policy, optionally combined with one of the three CPU thread policies:

--property hw:cpu_policy=dedicated \

--property hw:cpu_thread_policy=<prefer|isolate|require>
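For example (m1.pinned is a hypothetical flavor name; isolate is just one of the three possible thread policies):

$ openstack flavor set m1.pinned \
  --property hw:cpu_policy=dedicated \
  --property hw:cpu_thread_policy=isolate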

By default, when instance NUMA placement is not specified, a topology of N sockets, each with one core and one thread, is used for an instance, where N corresponds to the number of instance vCPUs requested.

If no NUMA placement is specified, the default is one NUMA cell with N sockets, one core per socket and one thread per core, i.e. each of the N vCPUs maps to its own socket.

When instance NUMA placement is specified, the number of sockets is fixed to the number of host NUMA nodes to use and the total number of instance CPUs is split over these sockets.

If a NUMA policy is set, the number of sockets is tied to the number of host NUMA nodes used. In actual testing, however, the topology usually comes out as cores=1, threads=2, sockets=Nvcpu/2 (see the validation section below). This can be tuned further with the cpu_sockets, cpu_cores and cpu_threads parameters.

CPU-POLICY:

    • shared (default): the vCPUs do not get dedicated pCPUs and are allowed to float across pCPUs, although they are still confined to their NUMA node.
    • dedicated: the guest's vCPUs are strictly pinned to a set of pCPUs. The effective CPU overcommit ratio is 1.0 (CPU oversubscription is not supported), which avoids the thread context-switch overhead caused by having more vCPUs than cores.

CPU-THREAD-POLICY:

    • prefer (default): if the host has hyperthreading enabled, vCPUs are preferentially placed on sibling threads, i.e. only siblings are considered. For example, with 4 logical CPUs in the same NUMA node where CPU1 and CPU2 share a physical core and CPU3 and CPU4 sit on different physical cores, creating an instance from a 4-vCPU flavor fails, because the only sibling set is [set([1, 2])]. If hyperthreading is not enabled, vCPUs are simply placed on cores.
    • isolate (best vCPU performance): vCPUs must be bound to whole cores. If the host has no hyperthreading, vCPUs are naturally bound to cores; if it does, each vCPU is bound to one thread of a sibling pair and no other vCPU is placed on that core, so the vCPU effectively owns the core and sibling-thread contention is avoided.
    • require (highest vCPU density): vCPUs must be bound to threads. The host must have hyperthreading enabled; each vCPU is bound to a thread until the threads run out. Hosts without hyperthreading are excluded from the Nova scheduler's candidate list.

NOTE 1: hw:cpu_thread_policy only takes effect when hw:cpu_policy=dedicated is set; it defines how vCPUs are pinned to pCPUs.

NOTE 2: if pinned (isolate) and unpinned instances run on the same compute node, CPU contention occurs, because the unpinned instances do not take the resource needs of the pinned ones into account. Due to cache effects this can seriously hurt the performance of instances with CPU pinning, especially when the two share a NUMA node. Pinned and unpinned instances should therefore be separated with host aggregates (a sketch follows NOTE 3) or, at the very least, placed on different NUMA nodes. Also, a compute node that runs only pinned instances should not be configured with a CPU overcommit ratio.

NOTE 3: with cpu_thread_policy=prefer | require, Nova's thread allocation strategy is to fill one core's threads first before moving on to the next core, minimizing thread/core fragmentation that could affect instances created later.
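As mentioned in NOTE 2, host aggregates can be used to keep pinned and unpinned instances apart. A minimal sketch, assuming the AggregateInstanceExtraSpecsFilter is enabled in the scheduler and using hypothetical aggregate, host and flavor names:

$ openstack aggregate create --property pinned=true agg-pinned

$ openstack aggregate add host agg-pinned compute01

$ openstack flavor set m1.pinned --property aggregate_instance_extra_specs:pinned=true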

CPU pinning can also be set through image properties.

The image's hw_cpu_policy and the flavor's hw:cpu_policy can each be set on its own, or both set to the same value; if the two do not match, an exception is raised.

$ openstack image set [IMAGE_ID] \

--property hw_cpu_policy=dedicated \

--property hw_cpu_thread_policy=isolate

Configuring the instance's own CPU topology

To configure a flavor to use a maximum of two sockets, run:

--property hw:cpu_sockets=2

Similarly, to configure a flavor to use one core and one thread, run:

--property hw:cpu_cores=1 \

--property hw:cpu_threads=1

You can also set upper limits on the number of sockets, cores and threads used. Unlike the hard values above, the exact number does not have to be reached, since this only sets a ceiling. It provides some flexibility in scheduling while ensuring certain limits are not exceeded. For example, to ensure no more than two sockets are defined in the instance topology:

--property hw:cpu_max_sockets=2

--property hw:cpu_max_cores=1

--property hw:cpu_max_threads=1
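A complete command might look like this (m1.limited is a hypothetical flavor name):

$ openstack flavor set m1.limited \
  --property hw:cpu_max_sockets=2 \
  --property hw:cpu_max_cores=1 \
  --property hw:cpu_max_threads=1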

Validation

Set different properties on a flavor and verify the resulting topology, for example:

openstack flavor set ecs.4large --property hw:numa_nodes=2 --property hw:cpu_sockets=2 --property hw:cpu_policy=dedicated

1. A 4-vCPU flavor with no CPU topology properties: the instance topology is 1 NUMA node, 4 sockets, 1 core, 1 thread, and the vCPUs are not pinned.

$ lscpu
Architecture:          x86_64
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

2. A 4-vCPU flavor with hw:numa_nodes=2 and hw:cpu_sockets=1: the topology is 2 NUMA nodes, 1 socket, 4 cores, 1 thread. The vCPUs float within their NUMA node, and the single socket is split across the two NUMA cells.

  <vcpu placement='static'>4</vcpu>
  <cputune>
    <shares>4096</shares>
    <vcpupin vcpu='0' cpuset='0-9,20-29'/>
    <vcpupin vcpu='1' cpuset='0-9,20-29'/>
    <vcpupin vcpu='2' cpuset='10-19,30-39'/>
    <vcpupin vcpu='3' cpuset='10-19,30-39'/>
    <emulatorpin cpuset='0-39'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='1'/>
    <numa>
      <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>

To see which physical CPUs an instance is running on, you can use pidstat:
pidstat -p 2220118 -t 1
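Alternatively (assuming access to libvirt on the compute node; the domain name instance-00000001 is illustrative), virsh can show the pinning directly:

$ virsh vcpupin instance-00000001

$ virsh vcpuinfo instance-00000001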


3. A 16-vCPU flavor with hw:cpu_policy='dedicated': the instance topology is 1 NUMA node, 8 sockets, 1 core, 2 threads.

4. A 16-vCPU flavor with hw:cpu_policy='dedicated' and hw:numa_nodes='1': the instance topology is 1 NUMA node, 8 sockets, 1 core, 2 threads.

  <vcpu placement='static'>16</vcpu>
  <cputune>
    <shares>16384</shares>
    <vcpupin vcpu='0' cpuset='35'/>
    <vcpupin vcpu='1' cpuset='15'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='30'/>
    <vcpupin vcpu='4' cpuset='16'/>
    <vcpupin vcpu='5' cpuset='36'/>
    <vcpupin vcpu='6' cpuset='11'/>
    <vcpupin vcpu='7' cpuset='31'/>
    <vcpupin vcpu='8' cpuset='32'/>
    <vcpupin vcpu='9' cpuset='12'/>
    <vcpupin vcpu='10' cpuset='17'/>
    <vcpupin vcpu='11' cpuset='37'/>
    <vcpupin vcpu='12' cpuset='18'/>
    <vcpupin vcpu='13' cpuset='38'/>
    <vcpupin vcpu='14' cpuset='19'/>
    <vcpupin vcpu='15' cpuset='39'/>
    <emulatorpin cpuset='10-12,15-19,30-32,35-39'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='1'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
  </numatune>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Broadwell-IBRS</model>
    <vendor>Intel</vendor>
    <topology sockets='8' cores='1' threads='2'/>
    <feature policy='require' name='vme'/>
    <numa>
      <cell id='0' cpus='0-15' memory='33554432' unit='KiB'/>
    </numa>
  </cpu>

5. A 16-vCPU flavor with hw:cpu_policy='dedicated' and hw:numa_nodes=2: the topology is 8 sockets, 1 core, 2 threads across 2 NUMA cells; 4 sockets belong to numa0 and the other 4 to numa1.

  <vcpu placement='static'>16</vcpu>
  <cputune>
    <shares>16384</shares>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='21'/>
    <vcpupin vcpu='2' cpuset='0'/>
    <vcpupin vcpu='3' cpuset='20'/>
    <vcpupin vcpu='4' cpuset='25'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='8'/>
    <vcpupin vcpu='7' cpuset='28'/>
    <vcpupin vcpu='8' cpuset='35'/>
    <vcpupin vcpu='9' cpuset='15'/>
    <vcpupin vcpu='10' cpuset='10'/>
    <vcpupin vcpu='11' cpuset='30'/>
    <vcpupin vcpu='12' cpuset='16'/>
    <vcpupin vcpu='13' cpuset='36'/>
    <vcpupin vcpu='14' cpuset='11'/>
    <vcpupin vcpu='15' cpuset='31'/>
    <emulatorpin cpuset='0-1,5,8,10-11,15-16,20-21,25,28,30-31,35-36'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-1'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
  </numatune>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Broadwell-IBRS</model>
    <vendor>Intel</vendor>
    <topology sockets='8' cores='1' threads='2'/>
    <feature policy='require' name='vme'/>
    <numa>
      <cell id='0' cpus='0-7' memory='16777216' unit='KiB'/>
      <cell id='1' cpus='8-15' memory='16777216' unit='KiB'/>
    </numa>
  </cpu>

6. A 16-vCPU flavor with hw:cpu_policy=dedicated, hw:numa_nodes=1 and hw:cpu_sockets=1: the topology is 1 NUMA node, 1 socket, 8 cores, 2 threads, pinned to physical NUMA node0 (8 cores x 2 threads).

7. Creating a second 16-vCPU instance from the same flavor succeeds and is pinned to physical NUMA node1.

8. Creating a third 16-vCPU instance from the same flavor fails: scheduling fails on both numa node0 and numa node1, since each NUMA node has only 4 logical CPUs left, which cannot satisfy the 16 requested vCPUs.

The log shows:
Attempting to fit instance cell InstanceNUMACell(cpu_pinning_raw=None,cpu_policy='dedicated',cpu_thread_policy=None,cpu_topology=<?>,cpuset=set([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]),cpuset_reserved=None,id=0,memory=32768,pagesize=None) on host_cell NUMACell(cpu_usage=16,cpuset=set([0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29]),id=0,memory=130958,memory_usage=32768,mempages=[NUMAPagesTopology,NUMAPagesTopology,NUMAPagesTopology],network_metadata=NetworkMetadata,pinned_cpus=set([0,1,2,4,5,7,8,9,20,21,22,24,25,27,28,29]),siblings=[set([1,21]),set([0,20]),set([25,5]),set([8,28]),set([9,29]),set([24,4]),set([27,7]),set([2,22]),set([3,23]),set([26,6])]) _numa_fit_instance_cell nova/virt/:1040
No specific pagesize requested for instance, selected pagesize: 4 _numa_fit_instance_cell 
Pinning has been requested _numa_fit_instance_cell 
Not enough available CPUs to schedule instance. Oversubscription is not possible with pinned instances. Required: 16 (16 + 0), actual: 4 _numa_fit_instance_cell_with_pinning 


The second pass tries cpuset=set([10,11,12,13,14,15,16,17,18,19,30,31,32,33,34,35,36,37,38,39]) and scheduling ultimately fails:
fails NUMA topology requirements. The instance does not fit on this host. host_passes 
Filter NUMATopologyFilter returned 0 hosts

9. An 8u_16g (8 vCPU, 16 GB) flavor with hw:cpu_policy=dedicated, hw:numa_nodes=1 and hw:cpu_sockets=1 also fails to schedule:

Not enough available CPUs to schedule instance. Oversubscription is not possible with pinned instances. Required: 8 (8 + 0), actual: 4

10. Setting only hw:cpu_policy=dedicated and hw:cpu_sockets=1 on the 8u_16g flavor still fails to schedule, which shows that when numa_nodes is not configured, the instance is by default scheduled into a single NUMA node.

11. Setting hw:cpu_policy=dedicated, hw:cpu_sockets=2 and hw:numa_nodes=2 on the 8-vCPU flavor schedules successfully: 4 CPUs are first placed on numa0 and then 4 on numa1, and the final topology is 2 NUMA nodes, 2 sockets, 2 cores, 2 threads.
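The properties for this last test can be applied with a command like the following (assuming the flavor is the 8u_16g flavor from steps 9 and 10):

openstack flavor set 8u_16g --property hw:cpu_policy=dedicated --property hw:cpu_sockets=2 --property hw:numa_nodes=2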

Live migration issues:

When live-migrating an instance with pinned CPUs, nova-conductor reports that the migration cannot proceed:

: Failed to compute_task_migrate_server: Migration pre-check error: Instance has an associated NUMA topology. Instance NUMA topologies, including related attributes such as CPU pinning, huge page and emulator thread pinning information, are not currently recalculated on live migration. See bug #1289064 for more information.: MigrationPreCheckError: Migration pre-check error: Instance has an associated NUMA topology. Instance NUMA topologies, including related attributes such as CPU pinning, huge page and emulator thread pinning information, are not currently recalculated on live migration. See bug #1289064 for more information.

The enable_numa_live_migration workaround must be enabled.

In the nova-conductor service configuration:

[workarounds]

enable_numa_live_migration=true

Drawbacks of enabling it:

Enable live migration of instances with NUMA topologies.

Live migration of instances with NUMA topologies is disabled by default when using the libvirt driver. This includes live migration of instances with CPU pinning or hugepages. CPU pinning and huge page information for such instances is not currently re-calculated, as noted in bug #1289064. This means that if instances were already present on the destination host, the migrated instance could be placed on the same dedicated cores as these instances or use hugepages allocated for another instance. Alternately, if the host platforms were not homogeneous, the instance could be assigned to non-existent cores or be inadvertently split across host NUMA nodes.

Despite these known issues, there may be cases where live migration is necessary. By enabling this option, operators that are aware of the issues and are willing to manually work around them can enable live migration support for these instances.

Related options:

  • compute_driver: Only the libvirt driver is affected.

During migration, the scheduler first checks whether the destination host's NUMA and CPU topology can satisfy the instance; if not, the migration does not proceed.

If the check passes, the migration proceeds, but the pre-migration vcpupin configuration is still used. If instances already on the destination host happen to be pinned to the same pCPUs as the migrated instance, the two instances will end up pinned to the same pCPUs. According to the community, this is a situation the administrator is expected to handle.

References

/nova/stein/admin/

/nova/stein/user/#extra-specs-numa-topology

/jmilkfan-fanguiju/p/