Why is memmove faster than memcpy?

Time: 2022-09-06 12:56:33

I am investigating performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make space for the inserted value.

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it's moving overlapping regions, which must be implemented in a tight loop, instead of copying large pages of memory. I wrote a small microbenchmark to find out whether there was a performance difference between memcpy and memmove, expecting memcpy to win hands down.

I ran my benchmark on two machines (core i5, core i7) and saw that memmove is actually faster than memcpy, on the older core i7 even nearly twice as fast! Now I am looking for explanations.

Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination overlap. Various "distances" between source and destination are tried. Each test is run 10 times, and the average time is printed.

https://gist.github.com/cruppstahl/78a57cdf937bca3d062c

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux; gcc is 4.6.3, Ubuntu/Linaro 4.6.3-1ubuntu5). The number in brackets is the distance (gap size) between source and destination:

memcpy        0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633

Memmove is implemented as SSE-optimized assembly code, copying from back to front. It uses hardware prefetch to load the data into the cache, copies 128 bytes at a time into XMM registers, and then stores them at the destination.

(memcpy-ssse3-back.S, lines 1650 ff)

L(gobble_ll_loop):
    prefetchnta -0x1c0(%rsi)
    prefetchnta -0x280(%rsi)
    prefetchnta -0x1c0(%rdi)
    prefetchnta -0x280(%rdi)
    sub $0x80, %rdx
    movdqu  -0x10(%rsi), %xmm1
    movdqu  -0x20(%rsi), %xmm2
    movdqu  -0x30(%rsi), %xmm3
    movdqu  -0x40(%rsi), %xmm4
    movdqu  -0x50(%rsi), %xmm5
    movdqu  -0x60(%rsi), %xmm6
    movdqu  -0x70(%rsi), %xmm7
    movdqu  -0x80(%rsi), %xmm8
    movdqa  %xmm1, -0x10(%rdi)
    movdqa  %xmm2, -0x20(%rdi)
    movdqa  %xmm3, -0x30(%rdi)
    movdqa  %xmm4, -0x40(%rdi)
    movdqa  %xmm5, -0x50(%rdi)
    movdqa  %xmm6, -0x60(%rdi)
    movdqa  %xmm7, -0x70(%rdi)
    movdqa  %xmm8, -0x80(%rdi)
    lea -0x80(%rsi), %rsi
    lea -0x80(%rdi), %rdi
    jae L(gobble_ll_loop)

Why is memmove faster than memcpy? I would expect memcpy to copy memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.

PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.

UPDATE 1

I ran some variations of the tests, based on the various answers.

  1. When running memcpy twice, the second run is faster than the first one.
  2. When "touching" the destination buffer of memcpy (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster.
  3. memcpy is still a little bit slower than memmove.

Here are the results:

memcpy        0.0118526
memcpy        0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to "prove" this, please add an answer!). In addition, as @Mats Petersson said, memmove is cache-friendlier than memcpy.

Thanks for all the great answers and comments!

4 Answers

#1


51  

Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different. Somehow that's accounting for the performance difference: if you copy to the same place, you'll see memcpy ends up possibly a smidge faster, e.g. on ideone.com:

memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919

Hardly anything in it though - no evidence that writing back to an already-faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that memcpy isn't needlessly slower when compared apples-for-apples.

#2


17  

When you are using memcpy, the writes need to go into the cache. When you use memmove to copy a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes away (> 4 × cache size), and I suspect (but can't be bothered to test) that you'll get similar results.

I guarantee that it is ALL about cache maintenance when you do large memory operations.

#3


15  

Historically, memmove and memcpy were the same function. They worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping areas in any particular way.

The end result is that memmove was defined to handle overlapping regions in a particular way even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.

The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance something as simple as having a different 'stride' in the memory layout can cause vastly different cache performance.

You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.

Edit: Oh, and one last thing; shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-Tree implementation to handle your integers. (Oh you are, okay)

Edit 2: To summarise my expansion in the comments: the microbenchmark is the issue here; it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times with memmove or memcpy, the end results will not depend on which memory-shifting function you use, UNLESS the regions overlap.

#4


1  

"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.

In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are overlapping.

Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g Dr. Jerry Cain, (Stanford Intro Systems Lecture - 7) Time: 36:00
