Why are memcpy() and memmove() faster than pointer increments?

Date: 2022-09-06 12:56:27

I am copying N bytes from pSrc to pDest. This can be done in a single loop:


for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;

Why is this slower than memcpy or memmove? What tricks do they use to speed it up?


9 Answers

#1


107  

Because memcpy uses word-sized pointers instead of byte pointers. memcpy implementations are also often written with SIMD instructions, which makes it possible to shuffle 128 bits at a time.


SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions.

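As a rough illustration, a 16-bytes-at-a-time copy loop using SSE2 intrinsics might look like the sketch below (`sse2_copy` is a made-up name; real memcpy implementations also align pointers and dispatch to different code paths per CPU):

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86) */
#include <stddef.h>

/* Sketch: copy 16 bytes per iteration with unaligned 128-bit loads/stores,
 * then finish any remaining bytes one at a time. */
static void sse2_copy(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i)); /* 128-bit load */
        _mm_storeu_si128((__m128i *)(d + i), v);               /* 128-bit store */
    }
    for (; i < n; i++)  /* byte-wise tail for lengths not a multiple of 16 */
        d[i] = s[i];
}
```
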

#2


77  

Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:


void simple_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;
  for (unsigned int i = 0; i < bytes; ++i)
    *b_dst++ = *b_src++;
}

Improvements


The first improvement one can make is to align one of the pointers on a word boundary (by word I mean native integer size, usually 32 bits/4 bytes, but can be 64 bits/8 bytes on newer architectures) and use word sized move/copy instructions. This requires using a byte to byte copy until a pointer is aligned.


#include <stdint.h>  /* for uintptr_t */

void aligned_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;

  // Copy bytes until the source pointer is word (4-byte) aligned
  while (((uintptr_t)b_src & 0x3) != 0 && bytes > 0)
  {
    *b_dst++ = *b_src++;
    bytes--;
  }

  unsigned int* w_dst = (unsigned int*)b_dst;
  unsigned int* w_src = (unsigned int*)b_src;
  while (bytes >= 4)
  {
    *w_dst++ = *w_src++;
    bytes -= 4;
  }

  // Copy trailing bytes
  b_dst = (unsigned char*)w_dst;
  b_src = (unsigned char*)w_src;
  while (bytes > 0)
  {
    *b_dst++ = *b_src++;
    bytes--;
  }
}

Different architectures will perform differently based on whether the source or the destination pointer is appropriately aligned. For instance on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.


To further improve performance some loop unrolling can be done, so that more of the processor's registers are loaded with data and that means the load/store instructions can be interleaved and have their latency hidden by additional instructions (such as loop counting etc). The benefit this brings varies quite a bit by the processor, since load/store instruction latencies can be quite different.

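Unrolling can be sketched as below (assuming, for illustration, 4-byte words, aligned pointers, and a length that is a multiple of 32; `unrolled_copy` is a hypothetical name):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy eight 32-bit words per iteration. The eight independent load/store
 * pairs give the processor room to overlap their latencies, and the loop
 * branch is taken once per 32 bytes instead of once per word. */
static void unrolled_copy(uint32_t *dst, const uint32_t *src, size_t n_bytes)
{
    size_t blocks = n_bytes / 32;  /* 8 words x 4 bytes per block */
    while (blocks--) {
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5];
        dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}
```
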

At this stage the code ends up being written in Assembly rather than C (or C++) since you need to manually place the load and store instructions to get maximum benefit of latency hiding and throughput.


Generally a whole cache line of data should be copied in one iteration of the unrolled loop.


Which brings me to the next improvement: adding pre-fetching. These are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data is available just as it is about to be copied, no sooner and no later.


This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the prefetch instructions inside the loop fetching data that will be copied several iterations later.

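On GCC/Clang this can be sketched with the `__builtin_prefetch` intrinsic (the 256-byte prefetch distance below is an arbitrary placeholder; the right distance has to be tuned against the target's memory latency):

```c
#include <stddef.h>

/* Byte-wise copy with a software prefetch issued once per 64-byte cache
 * line, a few lines ahead of the data currently being copied. Prefetching
 * past the end of the buffer is harmless: prefetch hints never fault. */
static void prefetch_copy(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((i & 63) == 0)
            __builtin_prefetch(src + i + 256, /*rw=*/0, /*locality=*/0);
        dst[i] = src[i];
    }
}
```
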

I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.


Factors


The main factors that affect how fast memory can be copied are:


  • The latency between the processor, its caches, and main memory.
  • The size and structure of the processor's cache lines.
  • The processor's memory move/copy instructions (latency, throughput, register size, etc).

So if you want to write an efficient and fast memory copy routine you'll need to know quite a lot about the processor and architecture you are writing for. Suffice to say, unless you're writing for some embedded platform it would be much easier to just use the built-in memory copy routines.


#3


18  

memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.


From one example implementation:


     * For speedy copying, optimize the common case where both pointers
     * and the length are word-aligned, and copy word-at-a-time instead
     * of byte-at-a-time. Otherwise, copy by bytes.

#4


7  

You can implement memcpy() using any of the following techniques, some dependent on your architecture for performance gains, and they will all be much faster than your code:


  1. Use larger units, such as 32-bit words instead of bytes. You can also (or may have to) deal with alignment here as well. On some platforms you can't read/write a 32-bit word at an odd memory address at all, and on others you pay a massive performance penalty. To fix this, the address has to be divisible by 4. You can take this up to 64 bits on 64-bit CPUs, or even higher using SIMD (single instruction, multiple data) instructions (MMX, SSE, etc.)


  2. You can use special CPU instructions that your compiler may not be able to generate from C. For example, on an 80386, you can use the "rep" prefix + "movsb" instruction to move N bytes, with N placed in the count register. Good compilers will just do this for you, but you may be on a platform that lacks a good compiler. Note, that example tends to be a bad demonstration of speed, but combined with alignment + larger-unit instructions, it can be faster than almost everything else on certain CPUs.


  3. Loop unrolling -- branches can be quite expensive on some CPUs, so unrolling the loops can lower the number of branches. This is also a good technique for combining with SIMD instructions and very large sized units.

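Technique 2 can be sketched with GCC-style inline assembly on x86-64 (`rep_movsb_copy` is a made-up name, and this is illustrative rather than portable):

```c
#include <stddef.h>

/* "rep movsb" copies RCX bytes from [RSI] to [RDI]. On recent x86 CPUs with
 * enhanced string moves, the microcode uses wide transfers internally, so
 * this single instruction can rival hand-written copy loops. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)  /* RDI, RSI, RCX */
                      :
                      : "memory");
}
```
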

For example, http://www.agner.org/optimize/#asmlib has a memcpy implementation that beats most out there (by a very tiny amount). If you read the source code, it will be full of tons of inlined assembly code that pulls off all of the above three techniques, choosing which of those techniques based on what CPU you are running on.


Note, there are similar optimizations that can be made for finding bytes in a buffer too. strchr() and friends will often be faster than your hand-rolled equivalent. This is especially true for .NET and Java. For example, in .NET, the built-in String.IndexOf() is much faster than even a Boyer–Moore string search, because it uses the above optimization techniques.


#5


5  

Short answer:


  • cache fill
  • wordsize transfers instead of byte ones where possible
  • SIMD magic

#6


3  

As others have said, memcpy copies in chunks larger than one byte. Copying in word-sized chunks is much faster. However, most implementations take it a step further and run several MOV (word) instructions before looping. The advantage of copying in, say, 8-word blocks per loop is that the loop itself is costly. This technique reduces the number of conditional branches by a factor of 8, optimizing the copy for giant blocks.


#7


3  

I don't know whether it is actually used in any real-world implementations of memcpy, but I think Duff's Device deserves a mention here.


From Wikipedia:


send(to, from, count)
register short *to, *from;
register count;
{
        register n = (count + 7) / 8;
        switch(count % 8) {
        case 0:      do {     *to = *from++;
        case 7:              *to = *from++;
        case 6:              *to = *from++;
        case 5:              *to = *from++;
        case 4:              *to = *from++;
        case 3:              *to = *from++;
        case 2:              *to = *from++;
        case 1:              *to = *from++;
                } while(--n > 0);
        }
}

Note that the above isn't a memcpy since it deliberately doesn't increment the to pointer. It implements a slightly different operation: the writing into a memory-mapped register. See the Wikipedia article for details.

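For comparison, a memcpy-style variant of the device that does increment both pointers could be sketched like this (illustrative only; modern compilers usually unroll a plain loop just as well):

```c
#include <stddef.h>

/* Duff's-device-style copy: the switch jumps into the middle of the
 * unrolled do/while so that counts that aren't a multiple of 8 are
 * handled by the first, partial pass through the loop body. */
static void duff_copy(char *to, const char *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;   /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```
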

#8


2  

The answers are great, but if you still want to implement a fast memcpy yourself, there is an interesting blog post about it, Fast memcpy in C.


void *memcpy(void* dest, const void* src, size_t count)
{
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;

    if (count & 1) {
        dst8[0] = src8[0];
        dst8 += 1;
        src8 += 1;
    }

    count /= 2;
    while (count--) {
        dst8[0] = src8[0];
        dst8[1] = src8[1];

        dst8 += 2;
        src8 += 2;
    }
    return dest;
}

It can be made even better by optimizing memory accesses.


#9


1  

Because like many library routines it has been optimized for the architecture you are running on. Others have posted various techniques which can be used.


Given the choice, use library routines rather than roll your own. This is a variation on DRY that I call DRO (Don't Repeat Others). Also, library routines are less likely to be wrong than your own implementation.

考虑到选择,使用库例程而不是自己滚。这是一种干的变化,我叫它DRO(不要重复别人)。另外,与您自己的实现相比,库例程不太可能出错。

I have seen memory access checkers complain about out-of-bounds reads on memory or string buffers that were not a multiple of the word size. This is a result of the optimization being used.
