使用AVX CPU指令:没有“/arch:AVX”的性能很差

时间:2021-08-22 06:48:34

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.

我的c++代码使用SSE,现在我想改进它,以便在AVX可用时支持它。我检测AVX何时可用并调用一个使用AVX命令的函数。我使用Win7 SP1 + VS2010 SP1和一个带有AVX的CPU。

To use AVX, it is necessary to include this:

要使用AVX,必须包括以下内容:

#include "immintrin.h"

and then you can use intrinsics AVX functions like _mm256_mul_ps, _mm256_add_ps etc. The problem is that by default, VS2010 produces code that works very slowly and shows the warning:

然后可以使用intrinsic AVX函数,比如_mm256_mul_ps, _mm256_add_ps等等。

warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX

警告C4752:发现Intel(R)先进的矢量扩展;考虑使用/拱:AVX

It seems VS2010 actually does not use AVX instructions, but instead, emulates them. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX commands everywhere when possible. So my code may crash on CPU that does not support AVX!

看起来VS2010实际上并不使用AVX指令,而是模仿它们。我在编译器选项中添加了/arch:AVX,得到了很好的结果。但是这个选项告诉编译器尽可能在任何地方使用AVX命令。因此,我的代码可能会在不支持AVX的CPU上崩溃!

So the question is how to make VS2010 compiler to produce AVX code but only when I specify AVX intrinsics directly. For SSE it works, I just use SSE intrinsics functions and it produce SSE code without any compiler options like /arch:SSE. But for AVX it does not work for some reason.

因此,问题是如何让VS2010编译器生成AVX代码,但必须直接指定AVX intrinsic。对于SSE,我只使用SSE intrinsic函数,它生成SSE代码,没有任何编译器选项,比如/arch:SSE。但是对于AVX来说,由于某些原因,它并不起作用。

2 个解决方案

#1


75  

The behavior that you are seeing is the result of expensive state-switching.

您看到的行为是昂贵的状态切换的结果。

See page 102 of Agner Fog's manual:

参见阿格纳·福格手册102页:

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.

每次在SSE和AVX指令之间进行不正确的切换时,您将付出极高的周期代价(~70)。

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

当不使用/arch:AVX进行编译时,VS2010将生成SSE指令,但是在有AVX intrinsic的地方仍然会使用AVX。因此,您将得到同时具有SSE和AVX指令的代码——这将具有那些状态切换的惩罚。(VS2010知道这一点,所以它会发出你正在看到的警告。)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

因此,您应该使用所有的SSE,或者所有的AVX。指定/arch:AVX告诉编译器使用所有的AVX。

It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units. (one compiled with /arch:AVX and one without) Then link them together and make a dispatcher to choose based on the what hardware it's running on.

这听起来像是你在尝试创建多个代码路径:一个用于SSE,一个用于AVX。为此,我建议您将SSE和AVX代码分成两个不同的编译单元。(一个用/arch:AVX编译,另一个没有)然后将它们链接在一起,并根据运行的硬件设置一个分派器进行选择。

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.

如果需要混合使用SSE和AVX,请确保适当地使用_mm256_zeroupper()或_mmm256_zeroall(),以避免状态切换的惩罚。

#2


14  

tl;dr

博士tl;

Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.

使用_mm256_zeroupper();或_mm256_zeroall();使用AVX围绕代码段(取决于函数参数)。只对带有AVX的源文件使用选项/arch:AVX,而不是对整个项目使用AVX,以避免破坏对只支持合法编码的sse代码路径的支持。

Cause

导致

I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:

我认为最好的解释是在英特尔的文章中,“避免AVX-SSE过渡惩罚”(PDF)。抽象状态:

Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.

过渡到256位英特尔®AVX指令和遗留英特尔®SSE指令在一个程序可能会导致性能损失,因为硬件必须保存和恢复的上128位YMM寄存器。

Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):

将AVX和SSE代码分离到不同的编译单元中,如果切换到启用了SSE的和启用了AVX的对象文件之间的调用代码,可能不会有帮助,因为当AVX指令或程序集与任何(来自Intel的文件)混合时,可能会发生转换:

  • 128-bit intrinsic instructions
  • 128位内在指令
  • SSE inline assembly
  • 上交所内联汇编
  • C/C++ floating point code that is compiled to Intel® SSE
  • C / c++浮点代码编译为英特尔®SSE
  • Calls to functions or libraries that include any of the above
  • 调用包含上述任何一个函数或库的函数。

This means there may even be penalties when linking with external code using SSE.

这意味着在使用SSE链接外部代码时可能会受到惩罚。

Details

细节

There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:

AVX指令定义了3个处理器状态,其中一个状态是所有的YMM寄存器都是分开的,允许SSE指令使用下半部分。英特尔文档“英特尔®AVX状态转换:SSE代码迁移到AVX”提供了这些状态图:

使用AVX CPU指令:没有“/arch:AVX”的性能很差

When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.

在状态B (AVX-256模式)中,所有的YMM寄存器都在使用中。当一个SSE指令被调用时,必须进行到状态C的转换,这就是惩罚。在SSE启动之前,所有YMM寄存器的上半部分必须保存到一个内部缓冲区中,即使它们恰好是零。转换的成本是“在砂桥硬件上50-80个时钟周期”。C -> a也有一个惩罚,如图2所示。

You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (of version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.

您还可以在Agner Fog的优化指南(版本更新为2014-08-07)中“VEX和非VEX模式之间的转换”第9.12节中找到导致这种减速的状态切换惩罚的详细信息。根据他的指导,任何到这个状态的转换都需要“在桑迪桥上大约70个时钟周期”。正如英特尔文件所说,这是一个可以避免的过渡惩罚。

Resolution

决议

To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:

为了避免过渡惩罚,您可以删除所有遗留的SSE代码,指示编译器将所有SSE指令转换为128位指令(如果编译器有能力的话)的VEX编码形式,或者在AVX和SSE代码之间转换之前将YMM寄存器置于已知的零状态。本质上,要维护独立的SSE代码路径,您必须在使用AVX指令的任何代码之后,将所有16个YMM寄存器(发出VZEROUPPER指令)的128位的上位归零。将这些位归零将强制转换为状态a,并且避免了昂贵的代价,因为YMM值不需要由硬件存储在内部缓冲区中。执行此指令的本质是_m256_zeroupper。这种内在的描述是非常有用的:

This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.

内在清晰有用的上部分YMM寄存器之间过渡时英特尔®高级向量扩展(Intel®AVX)遗留英特尔®补充SIMD指令和扩展(Intel®SSE)指令。没有过渡处罚如果一个应用程序清除上面的所有YMM寄存器通过VZEROUPPER(集“0”),这个内在的相应的指令,在英特尔®高级向量扩展之间的过渡(Intel®AVX)遗留英特尔®补充SIMD指令和扩展(Intel®SSE)指令。

In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.

在Visual Studio 2010+(可能更早)中,你会发现这是一种即时的特性。

Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.

注意,用其他方法调零位不会消除惩罚——必须使用VZEROUPPER或VZEROALL指令。

One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.

一个自动的解决方案由英特尔实现编译器插入VZEROUPPER每个函数的开头包含英特尔AVX代码如果没有一个参数是一个YMM注册或__m256 / __m256d / __m256i数据类型,函数结束时如果返回值不是YMM注册或__m256 / __m256d __m256i数据类型。

In the wild

在野外

This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:

这个VZEROUPPER解决方案被FFTW用于生成一个具有SSE和AVX支持的库。看到simd-avx.h:

/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
   See Intel Optimization Manual (April 2011, version 248966), Section
   11.3 */
#define VLEAVE _mm256_zeroupper

Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.

然后VLEAVE();在每个函数的末尾调用AVX指令使用内部函数。

#1


75  

The behavior that you are seeing is the result of expensive state-switching.

您看到的行为是昂贵的状态切换的结果。

See page 102 of Agner Fog's manual:

参见阿格纳·福格手册102页:

http://www.agner.org/optimize/microarchitecture.pdf

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.

每次在SSE和AVX指令之间进行不正确的切换时,您将付出极高的周期代价(~70)。

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

当不使用/arch:AVX进行编译时,VS2010将生成SSE指令,但是在有AVX intrinsic的地方仍然会使用AVX。因此,您将得到同时具有SSE和AVX指令的代码——这将具有那些状态切换的惩罚。(VS2010知道这一点,所以它会发出你正在看到的警告。)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

因此,您应该使用所有的SSE,或者所有的AVX。指定/arch:AVX告诉编译器使用所有的AVX。

It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units. (one compiled with /arch:AVX and one without) Then link them together and make a dispatcher to choose based on the what hardware it's running on.

这听起来像是你在尝试创建多个代码路径:一个用于SSE,一个用于AVX。为此,我建议您将SSE和AVX代码分成两个不同的编译单元。(一个用/arch:AVX编译,另一个没有)然后将它们链接在一起,并根据运行的硬件设置一个分派器进行选择。

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.

如果需要混合使用SSE和AVX,请确保适当地使用_mm256_zeroupper()或_mmm256_zeroall(),以避免状态切换的惩罚。

#2


14  

tl;dr

博士tl;

Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.

使用_mm256_zeroupper();或_mm256_zeroall();使用AVX围绕代码段(取决于函数参数)。只对带有AVX的源文件使用选项/arch:AVX,而不是对整个项目使用AVX,以避免破坏对只支持合法编码的sse代码路径的支持。

Cause

导致

I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:

我认为最好的解释是在英特尔的文章中,“避免AVX-SSE过渡惩罚”(PDF)。抽象状态:

Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.

过渡到256位英特尔®AVX指令和遗留英特尔®SSE指令在一个程序可能会导致性能损失,因为硬件必须保存和恢复的上128位YMM寄存器。

Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):

将AVX和SSE代码分离到不同的编译单元中,如果切换到启用了SSE的和启用了AVX的对象文件之间的调用代码,可能不会有帮助,因为当AVX指令或程序集与任何(来自Intel的文件)混合时,可能会发生转换:

  • 128-bit intrinsic instructions
  • 128位内在指令
  • SSE inline assembly
  • 上交所内联汇编
  • C/C++ floating point code that is compiled to Intel® SSE
  • C / c++浮点代码编译为英特尔®SSE
  • Calls to functions or libraries that include any of the above
  • 调用包含上述任何一个函数或库的函数。

This means there may even be penalties when linking with external code using SSE.

这意味着在使用SSE链接外部代码时可能会受到惩罚。

Details

细节

There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:

AVX指令定义了3个处理器状态,其中一个状态是所有的YMM寄存器都是分开的,允许SSE指令使用下半部分。英特尔文档“英特尔®AVX状态转换:SSE代码迁移到AVX”提供了这些状态图:

使用AVX CPU指令:没有“/arch:AVX”的性能很差

When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.

在状态B (AVX-256模式)中,所有的YMM寄存器都在使用中。当一个SSE指令被调用时,必须进行到状态C的转换,这就是惩罚。在SSE启动之前,所有YMM寄存器的上半部分必须保存到一个内部缓冲区中,即使它们恰好是零。转换的成本是“在砂桥硬件上50-80个时钟周期”。C -> a也有一个惩罚,如图2所示。

You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (of version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.

您还可以在Agner Fog的优化指南(版本更新为2014-08-07)中“VEX和非VEX模式之间的转换”第9.12节中找到导致这种减速的状态切换惩罚的详细信息。根据他的指导,任何到这个状态的转换都需要“在桑迪桥上大约70个时钟周期”。正如英特尔文件所说,这是一个可以避免的过渡惩罚。

Resolution

决议

To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:

为了避免过渡惩罚,您可以删除所有遗留的SSE代码,指示编译器将所有SSE指令转换为128位指令(如果编译器有能力的话)的VEX编码形式,或者在AVX和SSE代码之间转换之前将YMM寄存器置于已知的零状态。本质上,要维护独立的SSE代码路径,您必须在使用AVX指令的任何代码之后,将所有16个YMM寄存器(发出VZEROUPPER指令)的128位的上位归零。将这些位归零将强制转换为状态a,并且避免了昂贵的代价,因为YMM值不需要由硬件存储在内部缓冲区中。执行此指令的本质是_m256_zeroupper。这种内在的描述是非常有用的:

This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.

内在清晰有用的上部分YMM寄存器之间过渡时英特尔®高级向量扩展(Intel®AVX)遗留英特尔®补充SIMD指令和扩展(Intel®SSE)指令。没有过渡处罚如果一个应用程序清除上面的所有YMM寄存器通过VZEROUPPER(集“0”),这个内在的相应的指令,在英特尔®高级向量扩展之间的过渡(Intel®AVX)遗留英特尔®补充SIMD指令和扩展(Intel®SSE)指令。

In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.

在Visual Studio 2010+(可能更早)中,你会发现这是一种即时的特性。

Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.

注意,用其他方法调零位不会消除惩罚——必须使用VZEROUPPER或VZEROALL指令。

One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.

一个自动的解决方案由英特尔实现编译器插入VZEROUPPER每个函数的开头包含英特尔AVX代码如果没有一个参数是一个YMM注册或__m256 / __m256d / __m256i数据类型,函数结束时如果返回值不是YMM注册或__m256 / __m256d __m256i数据类型。

In the wild

在野外

This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:

这个VZEROUPPER解决方案被FFTW用于生成一个具有SSE和AVX支持的库。看到simd-avx.h:

/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
   See Intel Optimization Manual (April 2011, version 248966), Section
   11.3 */
#define VLEAVE _mm256_zeroupper

Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.

然后VLEAVE();在每个函数的末尾调用AVX指令使用内部函数。