Is there any difference between asm and __asm?

Date: 2022-03-30 14:46:09

As far as I can tell, the only difference between __asm { ... }; and __asm__("..."); is that the first uses mov eax, var and the second uses movl %0, %%eax with :"=r" (var) at the end. What other differences are there? And what about just asm?

4 answers

#1


16  

Which one you use depends on your compiler. This isn't standard like the C language.

#2


17  

There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for wrapping a single instruction or something. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

If you're using inline asm for performance reasons, this makes MSVC inline asm only viable if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.

MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

  • looks at your asm to figure out which registers your code steps on.
  • can only transfer data via memory. Data that was live in registers is stored by the compiler to prepare for your mov ecx, shift_count, for example. So using a single asm instruction that the compiler won't generate for you involves a round-trip through memory on the way in and on the way out.
  • more beginner-friendly, but often impossible to avoid overhead getting data in/out. Even besides the syntax limitations, the optimizer in current versions of MSVC isn't good at optimizing around inline asm blocks, either.

GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code. And you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The x86 tag wiki has lots of good stuff for asm in general, but just links to that for GNU inline asm. (The stuff in that answer is applicable to GNU inline asm on non-x86 platforms, too.)


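To make "telling the compiler about your code" concrete, here is a minimal sketch of a wrapper around the one-operand mul instruction. The helper name mul_lo32 is made up for illustration and is not from the answer. The instruction writes edx:eax, but only eax is declared as an output, so edx has to be listed as a clobber; leaving it out is the kind of mistake that silently breaks surrounding code. On non-x86 targets the sketch falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper (not from the answer): low 32 bits of a * b. */
static inline uint32_t mul_lo32(uint32_t a, uint32_t b) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo;
    /* One-operand mul multiplies eax by the source and writes edx:eax.
       Only eax is an output here, so edx goes in the clobber list,
       along with the flags ("cc"). */
    __asm__ ("mull %[src]"
             : "=a" (lo)                 /* output: eax */
             : "a" (a), [src] "rm" (b)   /* inputs: eax, and a register or memory */
             : "edx", "cc");             /* what the instruction steps on */
    return lo;
#else
    return a * b;   /* portable fallback for other targets */
#endif
}
```

In real code you would just write a * b; the point is only to show where the clobber information goes.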
GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers which implement GNU C:

  • You have to tell the compiler what you clobber. Failure to do this will lead to breakage of surrounding code in non-obvious hard-to-debug ways.
  • Powerful but hard to read, learn, and use syntax for telling the compiler how to supply inputs, and where to find outputs. e.g. "c" (shift_count) will get the compiler to put the shift_count variable into ecx before your inline asm runs.
  • extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need

    "insn   %[inputvar], %%reg\n\t"       // comment
    "insn2  %%reg, %[outputvar]\n\t"
    
  • very unforgiving / harder, but allows lower overhead esp. for wrapping single instructions. (wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and output if that's a problem.)


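The points in the list above can be sketched in one small function: a "c" constraint to get shift_count into ecx, a multi-instruction template with \n\t separators and named operands, and an early-clobber output ("=&r") so the compiler does not reuse an input register for the result. The function name shl_add and its operand names are hypothetical, chosen only for this illustration; non-x86 targets take the plain C path.

```c
#include <stdint.h>

/* Hypothetical helper: (value << shift_count) + addend with 32-bit wraparound. */
static inline uint32_t shl_add(uint32_t value, uint32_t shift_count, uint32_t addend) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t out;
    __asm__ ("movl  %[val], %[out]\n\t"   /* out is written here, before add is read */
             "shll  %%cl, %[out]\n\t"     /* shift count comes from cl */
             "addl  %[add], %[out]"
             : [out] "=&r" (out)          /* & = early clobber: never shares an input's register */
             : [val] "r" (value),
               "c" (shift_count),         /* compiler loads shift_count into ecx */
               [add] "r" (addend)
             : "cc");
    return out;
#else
    /* x86 masks the count to 5 bits for 32-bit shifts; mirror that here. */
    return (value << (shift_count & 31)) + addend;
#endif
}
```

Without the & on the output, the compiler would be free to put out in the same register as addend or in ecx, and the first mov would destroy an input before it is used.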

Example: full-width integer division (div)

On a 32bit CPU, dividing a 64bit integer by a 32bit integer, or doing a full-multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use-case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)


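For reference, this is roughly what the pure C version of the operation looks like, taking the dividend as hi and lo halves like the examples in this answer. The name div64_ref is an assumption for illustration. The caller is assumed to guarantee that divisor is nonzero and that the quotient fits in 32 bits, which is exactly the guarantee the compiler cannot see.

```c
#include <stdint.h>

/* Hypothetical reference version: divide the 64-bit value hi:lo by a 32-bit
   divisor. Assumes divisor != 0 and that the quotient fits in int32_t. */
static inline int32_t div64_ref(int32_t hi, int32_t lo, int32_t divisor,
                                int32_t *premainder) {
    /* Reassemble the 64-bit dividend from its two halves. */
    int64_t dividend = (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
    *premainder = (int32_t)(dividend % divisor);
    return (int32_t)(dividend / divisor);   /* a full 64-bit division */
}
```

On a 32bit target this kind of division usually ends up as a call to a helper routine rather than a single idiv, which is the overhead the inline-asm versions are trying to avoid.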
We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation that's closer to what you'd see when inlining a tiny function like this.


MSVC

Be careful with register-arg calling conventions when using inline-asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks @RossRidge for pointing this out.

// MSVC.  Be careful with _vectorcall & inline-asm: see above
// we could return a struct, but that would complicate things
int _vectorcall div64(int hi, int lo, int divisor, int *premainder) {
    int quotient, tmp;
    __asm {
        mov   edx, hi;
        mov   eax, lo;
        idiv   divisor
        mov   quotient, eax
        mov   tmp, edx;
        // mov ecx, premainder   // Or this I guess?
        // mov   [ecx], edx
    }
    *premainder = tmp;
    return quotient;     // or omit the return with a value in eax
}

Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. This avoids the store/reloads for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will be in memory already, but in this use-case we're writing a tiny function that could usefully inline.


Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).

## My added comments are marked with ##
; ... define some symbolic constants for stack offsets of parameters
; 48   : int ABI div64(int hi, int lo, int divisor, int *premainder) {
    sub esp, 16                 ; 00000010H
    mov DWORD PTR _lo$[esp+16], edx      ## these symbolic constants match up with the names of the stack args and locals
    mov DWORD PTR _hi$[esp+16], ecx

    ## start of __asm {
    mov edx, DWORD PTR _hi$[esp+16]
    mov eax, DWORD PTR _lo$[esp+16]
    idiv    DWORD PTR _divisor$[esp+12]
    mov DWORD PTR _quotient$[esp+16], eax  ## store to a local temporary, not *premainder
    mov DWORD PTR _tmp$[esp+16], edx
    ## end of __asm block

    mov ecx, DWORD PTR _premainder$[esp+12]
    mov eax, DWORD PTR _tmp$[esp+16]
    mov DWORD PTR [ecx], eax               ## I guess we should have done this inside the inline asm so this would suck slightly less
    mov eax, DWORD PTR _quotient$[esp+16]  ## but this one is unavoidable
    add esp, 16                 ; 00000010H
    ret 8

There's a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of it away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm, and make that a store to premainder. But that would require loading premainder from the stack into a register before the inline asm block, I guess.

This function is actually worse with _vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could potentially be in the regs, and it would have to store them all, so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means more code written in asm, but does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2 total, down to 11 (not including the ret).

I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that allows the compiler to generate good code for this particular case, so the entire premise of using inline asm for this on MSVC could be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.


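On the intrinsic route: as far as I know, recent MSVC (Visual Studio 2019 and later) provides a _div64 intrinsic declared in <immintrin.h> that takes a 64-bit dividend, a 32-bit divisor, and a pointer for the remainder. Treat that availability and signature as an assumption to check against the MSVC documentation. The wrapper name div64_by_32 is made up, and on other compilers this sketch simply falls back to the plain C expression.

```c
#include <stdint.h>

#if defined(_MSC_VER) && !defined(__clang__) && _MSC_VER >= 1920
#include <immintrin.h>   /* assumed to declare _div64 on VS2019+ */
#define HAVE_MSVC_DIV64 1
#endif

/* Hypothetical wrapper: 64/32 -> 32 signed division with remainder.
   Assumes divisor != 0 and that the quotient fits in 32 bits. */
static inline int32_t div64_by_32(int64_t dividend, int32_t divisor,
                                  int32_t *premainder) {
#ifdef HAVE_MSVC_DIV64
    int rem;
    int quotient = _div64(dividend, divisor, &rem);   /* assumption: see lead-in */
    *premainder = rem;
    return quotient;
#else
    /* Portable fallback: let the compiler pick the division sequence. */
    *premainder = (int32_t)(dividend % divisor);
    return (int32_t)(dividend / divisor);
#endif
}
```

Unlike an __asm block, an intrinsic lets the optimizer keep the inputs and outputs in registers, which is the whole argument of this section.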

GNU C (gcc/clang/icc)

Gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64bit integer in edx:eax in the first place.

I can't get gcc to compile for the 32bit vectorcall ABI. Clang can, but it sucks at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64bit MS calling convention is close to the 32bit vectorcall, with the first two params in ecx, edx. The difference is that 2 more params go in regs before using the stack (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output.)

// GNU C
// change everything to int64_t to do 128b/64b -> 64b division
// MSVC doesn't do x86-64 inline asm, so we'll use 32bit to be comparable
int div64(int lo, int hi, int *premainder, int divisor) {
    int quotient, rem;
    asm ("idivl  %[divsrc]"
          : "=a" (quotient), "=d" (rem)    // a means eax,  d means edx
          : "d" (hi), "a" (lo),
            [divsrc] "rm" (divisor)        // Could have just used %4 instead of naming divsrc
            // note the "rm" to allow the src to be in a register or not, whatever gcc chooses.
            // "rmi" would also allow an immediate, but unlike adc, idiv doesn't have an immediate form
          : // no clobbers
        );
    *premainder = rem;
    return quotient;
}

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see from changing stuff in that godbolt link.

mov     eax, ecx  # lo, lo
idivl  r9d      # divisor
mov     DWORD PTR [r8], edx       # *premainder_7(D), rem
ret

For 32bit vectorcall, gcc would do something like

## Not real compiler output, but probably similar to what you'd get
mov     eax, ecx               # lo, lo
mov     ecx, [esp+12]          # premainder
idivl   [esp+16]               # divisor
mov     DWORD PTR [ecx], edx   # *premainder_7(D), rem
ret   8

MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, it potentially compiles to just one, while MSVC would still use probably 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)


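The comment in the GNU C version mentions changing everything to int64_t to get 128b/64b -> 64b division. A sketch of that variant might look like the following. The name div128 is hypothetical, the asm path assumes x86-64 GNU C, and the fallback assumes a compiler that provides __int128. As with idivl, the caller has to guarantee a nonzero divisor and a quotient that fits in 64 bits, or the instruction faults.

```c
#include <stdint.h>

/* Hypothetical 64-bit variant of div64: divides the 128-bit value hi:lo.
   Assumes divisor != 0 and that the quotient fits in int64_t. */
static inline int64_t div128(int64_t lo, int64_t hi, int64_t *premainder,
                             int64_t divisor) {
#if defined(__GNUC__) && defined(__x86_64__)
    int64_t quotient, rem;
    __asm__ ("idivq  %[divsrc]"
             : "=a" (quotient), "=d" (rem)     /* rax = quotient, rdx = remainder */
             : "d" (hi), "a" (lo),             /* dividend in rdx:rax */
               [divsrc] "rm" (divisor)
             : "cc");
    *premainder = rem;
    return quotient;
#elif defined(__SIZEOF_INT128__)
    /* Fallback for other 64-bit GNU C targets: use the __int128 extension. */
    __int128 dividend =
        (__int128)(((unsigned __int128)(uint64_t)hi << 64) | (uint64_t)lo);
    *premainder = (int64_t)(dividend % divisor);
    return (int64_t)(dividend / divisor);
#else
#error "this sketch needs x86-64 GNU C or a compiler with __int128"
#endif
}
```

For example, div128(100, 0, &rem, 7) returns 14 and stores 2 in rem.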
#3


5  

With the gcc compiler, there's no big difference: asm, __asm, and __asm__ are the same. The alternate spellings just exist to avoid namespace conflicts (for example, a user-defined function named asm).

#4


0  

asm vs __asm__ in GCC

asm does not work with -std=c99; you have two alternatives:

  • use __asm__
  • use -std=gnu99

More details: error: ‘asm’ undeclared (first use in this function)

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords ), but from the GCC 8.1 source they are exactly the same:

  { "__asm",        RID_ASM,    0 },
  { "__asm__",      RID_ASM,    0 },

so I would just use __asm__ which is documented.

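A small sketch of the three spellings side by side, assuming GCC or a compatible compiler such as clang. The function names are made up for illustration. Each function uses an empty asm template with a "+r" operand, which emits no instructions but makes the compiler treat x as possibly modified. The plain asm spelling is guarded by __STRICT_ANSI__, which GCC defines under -std=c99 and similar strict modes, so the file compiles either way.

```c
/* Documented alternate keyword: works with -std=c99 and -std=gnu99. */
static inline int launder_alt(int x) {
    __asm__ ("" : "+r" (x));
    return x;
}

/* Same keyword ID in the GCC source table quoted above. */
static inline int launder_short(int x) {
    __asm ("" : "+r" (x));
    return x;
}

#if !defined(__STRICT_ANSI__)
/* Plain asm is only a keyword in GNU mode (e.g. -std=gnu99). */
static inline int launder_plain(int x) {
    asm ("" : "+r" (x));
    return x;
}
#endif
```

Compiling this with gcc -std=c99 and again with gcc -std=gnu99 shows the difference: only the guarded function depends on the mode.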