Why is compiler-inlined code slower than manually inlined code?

Date: 2023-01-08 17:05:21

Background

The following critical loop of a piece of numerical software, written in C++, basically compares two objects by one of their members:

for(int j=n;--j>0;)
    asd[j%16]=a.e<b.e;

a and b are of class ASD:

struct ASD  {
    float e;
    ...
};

I was investigating the effect of putting this comparison in a lightweight member function:

bool test(const ASD& y)const {
    return e<y.e;
}

and using it like this:

for(int j=n;--j>0;)
    asd[j%16]=a.test(b);

The compiler inlines this function, but the resulting assembly code is different and causes more than 10% runtime overhead. I have two questions:

Questions

  1. Why is the compiler producing different assembly code?

  2. Why is the produced assembly slower?

EDIT: The second question has been answered by implementing @KamyarSouri's suggestion (j%16). The assembly code now looks almost identical (see http://pastebin.com/diff.php?i=yqXedtPm). The only differences are the lines 18, 33, 48:

000646F9  movzx       edx,dl 
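
For readers who cannot open the pastebin links, the two loop variants can be put side by side in a small harness. This is only a sketch under assumptions: the function names `run_manual`, `run_member`, and `seconds` are hypothetical, the struct is reduced to the single member shown above, and `std::chrono` requires a C++11 compiler (newer than the MSVC10 used in the question). It is not the original test code.

```cpp
#include <chrono>

struct ASD {
    float e;
    // Lightweight member function from the question.
    bool test(const ASD& y) const { return e < y.e; }
};

// Variant 1: the comparison written directly in the loop (manual inlining).
inline void run_manual(int* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = a.e < b.e;
}

// Variant 2: the same loop going through the member function.
inline void run_member(int* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = a.test(b);
}

// Hypothetical timing helper: runs f once and returns the elapsed seconds.
template <class F>
double seconds(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A call such as `seconds([&] { run_member(asd, a, b, n); })` gives one timing sample. An optimizer may hoist the loop-invariant comparison out of such a simplified loop, so the numbers from this sketch need not match the chart from the original test code.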

Material

  • The test code: http://pastebin.com/03s3Kvry
  • The assembly output on MSVC10 with /Ox /Ob2 /Ot /arch:SSE2:
      • Compiler-inlined version: http://pastebin.com/yqXedtPm
      • Manually inlined version: http://pastebin.com/pYSXL77f
      • Difference: http://pastebin.com/diff.php?

This chart shows the FLOP/s (up to a scaling factor) for 50 test runs of my code.

The gnuplot script to generate the plot: http://pastebin.com/8amNqya7

Compiler Options:

/Zi /W3 /WX- /MP /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /MT /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Gd /analyze-

Linker Options: /INCREMENTAL:NO "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE

2 Answers

#1


31  

Short Answer:

Your asd array is declared as this:

int *asd=new int[16];

Therefore, use int as the return type rather than bool.
Alternatively, change the array type to bool.

In any case, make the return type of the test function match the type of the array.
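
Both variants of this fix might look like the following minimal sketch. The names `test_int`, `test_bool`, `fill_int`, and `fill_bool` are hypothetical and only serve to show the two options next to each other; the struct is reduced to the one member shown in the question. Whether a given compiler then drops the extra instruction still has to be checked in its assembly output.

```cpp
struct ASD {
    float e;

    // Option 1: return int, so the result already has the element type
    // of an int array.
    int test_int(const ASD& y) const { return e < y.e; }

    // Option 2: keep the bool return type and store into a bool array.
    bool test_bool(const ASD& y) const { return e < y.e; }
};

// int result stored into an int array: the types match at the call site.
inline void fill_int(int* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = a.test_int(b);
}

// bool result stored into a bool array: the types match at the call site.
inline void fill_bool(bool* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = a.test_bool(b);
}
```

In either case the assignment inside the loop no longer mixes a bool value with an int destination.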

Skip to bottom for more details.

Long Answer:

In the manually inlined version, the "core" of one iteration looks like this:

xor         eax,eax  

mov         edx,ecx  
and         edx,0Fh  
mov         dword ptr [ebp+edx*4],eax  
mov         eax,dword ptr [esp+1Ch]  
movss       xmm0,dword ptr [eax]  
movss       xmm1,dword ptr [edi]  
cvtps2pd    xmm0,xmm0  
cvtps2pd    xmm1,xmm1  
comisd      xmm1,xmm0  

The compiler inlined version is completely identical except for the first instruction.

Where instead of:

xor         eax,eax

it has:

xor         eax,eax  
movzx       edx,al

Okay, so it's one extra instruction. They both do the same - zeroing a register. This is the only difference that I see...

The movzx instruction has a single-cycle latency and 0.33 cycle reciprocal throughput on all the newer architectures. So I can't imagine how this could make a 10% difference.

In both cases, the result of the zeroing is used only 3 instructions later. So it's very possible that this could be on the critical path of execution.


While I'm not an Intel engineer, here's my guess:

Most modern processors deal with zeroing operations (such as xor eax,eax) via register renaming to a bank of zero registers. It completely bypasses the execution units. However, it's possible that this special handling could cause a pipeline bubble when the (partial) register is accessed via movzx edx,al.

Furthermore, there's also a false dependency on eax in the compiler inlined version:

movzx       edx,al  
mov         eax,ecx  //  False dependency on "eax".

Whether or not the out-of-order execution is able to resolve this is beyond me.


Okay, this is basically turning into a question of reverse-engineering the MSVC compiler...

Here I'll try to explain why that extra movzx is generated as well as why it stays.

The key here is the bool return value. Apparently, bool datatypes are probably stored as 8-bit values inside MSVC's internal representation. Therefore, when you implicitly convert from bool to int here:

asd[j%16] = a.test(b);
^^^^^^^^^   ^^^^^^^^^
 type int   type bool

there is an 8-bit -> 32-bit integer promotion. This is the reason why MSVC generates the movzx instruction.
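
The conversion itself is ordinary C++ and can be written out explicitly. The helper name `widen` below is hypothetical; the sketch only shows what the assignment does at the language level, namely that a bool becomes an int with the value 0 or 1. The size of `bool` is implementation-defined; it is one byte under MSVC, which is where the 8-bit source operand comes from.

```cpp
// Hypothetical helper: the implicit bool -> int conversion performed by
// `asd[j%16] = a.test(b);`. The result is always exactly 0 or 1.
inline int widen(bool flag) {
    return flag;  // equivalent to static_cast<int>(flag)
}
```

On x86, widening an 8-bit value to 32 bits without changing its value is exactly what movzx does.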

When the inlining is done manually, the compiler has enough information to optimize out this conversion and keep everything as a 32-bit datatype in its IR.

However, when the code is put into its own function with a bool return value, the compiler is not able to optimize out the 8-bit intermediate datatype. Therefore, the movzx stays.

When you make both datatypes the same (either int or bool), no conversion is needed. Hence the problem is avoided altogether.

#2


1  

lea esp,[esp] occupies 7 bytes of i-cache and it's inside the loop. A few other clues make it look like the compiler isn't sure if this is a release build or a debug build.

Edit:

The lea esp,[esp] isn't in the loop. Its position among the surrounding instructions misled me. It now looks like the compiler intentionally spent 7 bytes, followed by another 2 bytes, as padding so that the actual loop starts at a 16-byte boundary. This means that it actually speeds things up, as observed by Johannes Gerer.

The compiler still seems to be uncertain whether this is a debug or release build though.

Another edit:

The pastebin diff is different from the pastebin diff that I saw earlier. This answer could be deleted now, but it already has comments so I'll leave it.
