Why is the x86 INC instruction not atomic? (duplicate)

Time: 2022-01-27 12:37:21

This question already has an answer here:

I've read that the INC instruction of x86 is not atomic. My question is: how come? Suppose we are incrementing a 64-bit integer on x86-64; we can do it with one instruction, since INC works with both memory operands and registers. So how come it's not atomic?
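
A minimal C++ sketch of the scenario in the question; the variable name and the exact instruction mentioned in the comment are illustrative assumptions, since the instruction actually chosen depends on the compiler and options:

long counter = 0;  // some shared 64-bit integer

void bump() {
    // On x86-64, an optimizing compiler typically turns this into a single
    // memory-destination read-modify-write, e.g. "addq $1, counter(%rip)" or
    // "incq counter(%rip)". It is one instruction, yet, as the answers below
    // explain, it is still not atomic with respect to other cores or DMA.
    ++counter;
}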

3 Answers

#1

19 votes

Why would it be? The processor core still needs to read the value stored at the memory location, compute the incremented value, and then store it back. There's a latency between the read and the store, and in the meantime another operation could have affected that memory location.

Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions, so the core itself won't modify that memory in the gap. However, another core could have issued an instruction that modifies that location, a DMA transfer could have affected it, or some other piece of hardware could have touched that memory location.
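
A minimal C++ sketch of the lost-update problem this answer describes; the thread and iteration counts are arbitrary, and the data race is deliberate for illustration (formally it is undefined behaviour in C++):

#include <iostream>
#include <thread>

long counter = 0;  // plain, non-atomic shared counter

void bump(int n) {
    for (int i = 0; i < n; ++i)
        ++counter;  // read, increment, write back: another core can slip in between
}

int main() {
    std::thread a(bump, 1000000);
    std::thread b(bump, 1000000);
    a.join();
    b.join();
    // Usually prints something well below 2000000 because increments are lost.
    std::cout << "expected 2000000, got " << counter << '\n';
}

Replacing the counter with std::atomic<long> (or using fetch_add) removes the lost updates, which is essentially what the LOCK prefix discussed in the next answer provides at the instruction level.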

#2

19 votes

Modern x86 processors, as part of their execution pipeline, "compile" x86 instructions into a lower-level set of operations; Intel calls these uOps, AMD rOps, but what it boils down to is that certain types of single x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:

INC EAX

gets executed as a single "mini-op" like uOp.inc eax (let me call it that - they're not exposed).
For other operands, things will look different, for example:

INC DWORD PTR [ EAX ]

the low-level decomposition though would look more like:

uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg

and therefore is not executed atomically. If, on the other hand, you write LOCK INC [ EAX ], the LOCK prefix tells the "compile" stage of the pipeline to decompose the instruction in a different way in order to ensure the atomicity requirement is met.
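
A hedged sketch of how the locked form is usually requested from C++ rather than writing the prefix by hand; std::atomic is the portable route, and the inline-assembly variant is shown only to make the LOCK INC connection explicit (GCC/Clang extended asm, not portable):

#include <atomic>

std::atomic<long> counter{0};

void bump_atomic() {
    // On x86-64 this typically compiles to a LOCK-prefixed read-modify-write
    // (e.g. lock add / lock xadd), which the core executes atomically on the
    // owning cache line.
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Illustrative only: the locked increment spelled out by hand.
void bump_lock_inc(long* p) {
    asm volatile("lock incq %0" : "+m"(*p) : : "memory", "cc");
}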

The reason for this is, of course, as mentioned by others: speed. Why make something atomic, and necessarily slower, when it isn't always required?

#3

1 vote

You really don't want a guaranteed atomic operation unless you need it. From Agner Fog's Software optimization resources, instruction_tables.pdf (1996–2017):

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.
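
A rough, uncontended timing sketch in C++ to illustrate the point; the absolute numbers are not Agner Fog's and vary a lot between CPUs, so only the relative gap between the plain and the LOCKed increment is of interest:

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    constexpr long N = 100000000;
    volatile long plain = 0;          // volatile so the loop is not optimized away
    std::atomic<long> locked{0};

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) plain = plain + 1;  // ordinary ADD/INC on memory
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) locked.fetch_add(1, std::memory_order_relaxed);  // LOCK-prefixed RMW
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("plain : %lld ms\n", (long long)ms(t0, t1));
    std::printf("locked: %lld ms\n", (long long)ms(t1, t2));
}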
