I am copying N bytes from pSrc to pDest. This can be done in a single loop:


for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;

Why is this slower than memcpy or memmove? What tricks do they use to speed it up?


Because memcpy copies word-sized units instead of single bytes. memcpy implementations are also often written with SIMD instructions, which make it possible to move 128 bits at a time.


SIMD instructions are assembly instructions that perform the same operation on every element of a vector up to 16 bytes long (wider on newer instruction sets such as AVX). That includes load and store instructions.
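
As a rough illustration of the 16-byte loads and stores described above, here is a minimal sketch using the SSE2 intrinsics from `<emmintrin.h>`. The function name `simd_copy_sketch` is made up for this example, and the code assumes an x86 compiler that defines `__SSE2__`; on anything else it simply falls back to the byte loop. A real memcpy would also handle alignment and pick instructions per CPU.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#endif

/* Hypothetical helper: copies n bytes, 16 at a time where SSE2 is available. */
void simd_copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

#if defined(__SSE2__)
    while (n >= 16) {
        /* One unaligned 128-bit load followed by one unaligned 128-bit store */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif

    /* Remaining tail bytes (or the whole buffer when SSE2 is unavailable) */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

The unaligned load/store variants are used so the sketch works for any pointers; aligned variants exist too but require 16-byte aligned addresses.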




Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:


void simple_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;
  for (unsigned int i = 0; i < bytes; ++i)
    *b_dst++ = *b_src++;
}



The first improvement one can make is to align one of the pointers on a word boundary (by word I mean native integer size, usually 32 bits/4 bytes, but can be 64 bits/8 bytes on newer architectures) and use word sized move/copy instructions. This requires using a byte to byte copy until a pointer is aligned.


#include <stdint.h>  // uintptr_t

void aligned_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;

  // Copy bytes until the source pointer is 4-byte aligned
  while (bytes > 0 && ((uintptr_t)b_src & 0x3) != 0)
  {
    *b_dst++ = *b_src++;
    bytes--;
  }

  // Copy a word at a time (assumes a 4-byte unsigned int and a
  // platform that tolerates a possibly unaligned destination)
  unsigned int* w_dst = (unsigned int*)b_dst;
  unsigned int* w_src = (unsigned int*)b_src;
  while (bytes >= 4)
  {
    *w_dst++ = *w_src++;
    bytes -= 4;
  }

  // Copy trailing bytes
  b_dst = (unsigned char*)w_dst;
  b_src = (unsigned char*)w_src;
  while (bytes > 0)
  {
    *b_dst++ = *b_src++;
    bytes--;
  }
}

Different architectures will perform differently depending on whether the source or the destination pointer is the one that is aligned. For instance, on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.


To further improve performance some loop unrolling can be done, so that more of the processor's registers are loaded with data and that means the load/store instructions can be interleaved and have their latency hidden by additional instructions (such as loop counting etc). The benefit this brings varies quite a bit by the processor, since load/store instruction latencies can be quite different.


At this stage the code ends up being written in Assembly rather than C (or C++) since you need to manually place the load and store instructions to get maximum benefit of latency hiding and throughput.


Generally a whole cache line of data should be copied in one iteration of the unrolled loop.
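
To make the unrolling idea concrete, here is a small sketch in plain C. The name `unrolled_word_copy` is invented for this example, and the 64-byte cache line (eight 64-bit words per iteration) is an assumption that does not hold on every processor. As noted above, a production routine would normally be hand-scheduled assembly, so treat this only as an outline of the loop shape.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `words` 64-bit words, eight per iteration.
   Eight 8-byte words are 64 bytes, a common but assumed cache-line size. */
void unrolled_word_copy(uint64_t *dst, const uint64_t *src, size_t words)
{
    while (words >= 8) {
        /* Issue all eight loads first, then the eight stores, so the
           loads are independent of each other. The compiler is still
           free to reorder them. */
        uint64_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint64_t e = src[4], f = src[5], g = src[6], h = src[7];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;
        src += 8;
        dst += 8;
        words -= 8;
    }

    /* Fewer than eight words left: copy them one at a time */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

The loop condition is now evaluated once per 64 bytes instead of once per word, which is where the saving in branch overhead comes from.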


Which brings me to the next improvement: adding prefetching. Prefetch instructions are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data becomes available just as it is about to be copied, neither sooner nor later.


This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the prefetch instructions inside the loop fetching data that will be copied several iterations later.


I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.
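
For what it's worth, the placement described above (prefetches at the start, then one per loop iteration aimed several iterations ahead) could look roughly like this. It uses `__builtin_prefetch`, which is a GCC/Clang extension; on other compilers the macro below expands to nothing. The function name, the 64-byte line size and the distance of 4 lines are all assumptions made for the sketch, and the right distance has to be found by measuring on the target processor.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)  /* read, low locality */
#else
#define PREFETCH_READ(p) ((void)0)                      /* no-op elsewhere */
#endif

#define LINE_BYTES     ((size_t)64)  /* assumed cache-line size */
#define PREFETCH_AHEAD ((size_t)4)   /* assumed distance in lines; needs tuning */

/* Hypothetical helper: byte copy with source prefetching only. */
void prefetching_copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t off, i;

    /* Start of function: request the first few lines up front */
    for (off = 0; off < PREFETCH_AHEAD * LINE_BYTES && off < n; off += LINE_BYTES)
        PREFETCH_READ(s + off);

    while (n >= LINE_BYTES) {
        /* Inside the loop: request the line that will be copied
           PREFETCH_AHEAD iterations from now, if it is still in range */
        if (n > PREFETCH_AHEAD * LINE_BYTES)
            PREFETCH_READ(s + PREFETCH_AHEAD * LINE_BYTES);

        for (i = 0; i < LINE_BYTES; i++)
            d[i] = s[i];
        d += LINE_BYTES;
        s += LINE_BYTES;
        n -= LINE_BYTES;
    }

    /* Tail shorter than one line */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

The inner copy is left as a plain byte loop to keep the focus on where the prefetches go; in practice it would be the unrolled word or SIMD loop.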




The main factors that affect how fast memory can be copied are:


  • The latency between the processor, its caches, and main memory.
  • The size and structure of the processor's cache lines.
  • The processor's memory move/copy instructions (latency, throughput, register size, etc).

So if you want to write an efficient and fast memory copy routine, you'll need to know quite a lot about the processor and architecture you are writing for. Suffice it to say, unless you're writing for some embedded platform, it would be much easier to just use the built-in memory copy routines.




memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.


From one example implementation:


     * For speedy copying, optimize the common case where both pointers
     * and the length are word-aligned, and copy word-at-a-time instead
     * of byte-at-a-time. Otherwise, copy by bytes.



You can implement memcpy() using any of the following techniques, some dependent on your architecture for performance gains, and they will all be much faster than your code:


  1. Use larger units, such as 32-bit words instead of bytes. You can also (or may have to) deal with alignment here. On some platforms you can't read or write a 32-bit word at an odd memory address, for example, and on other platforms you pay a massive performance penalty for doing so. To fix this, the address has to be a multiple of 4. You can take this up to 64 bits on 64-bit CPUs, or even higher using SIMD (single instruction, multiple data) instructions (MMX, SSE, etc.).


  2. You can use special CPU instructions that your compiler may not be able to generate from C. For example, on an 80386 you can use the "rep" prefix with the "movsb" instruction to move N bytes, where N is placed in the count register. Good compilers will just do this for you, but you may be on a platform that lacks a good compiler. Note that this example tends to be a bad demonstration of speed, but combined with alignment and larger-unit instructions, it can be faster than almost everything else on certain CPUs.


  3. Loop unrolling -- branches can be quite expensive on some CPUs, so unrolling the loops can lower the number of branches. This is also a good technique for combining with SIMD instructions and very large sized units.


For example, http://www.agner.org/optimize/#asmlib has a memcpy implementation that beats most out there (by a very tiny amount). If you read the source code, it will be full of tons of inlined assembly code that pulls off all of the above three techniques, choosing which of those techniques based on what CPU you are running on.
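
To show what technique 2 looks like in practice, here is a sketch of "rep movsb" written with GCC/Clang extended inline assembly. The function name is made up, and the code assumes an x86 or x86-64 target with the direction flag clear (which the usual calling conventions guarantee on function entry); on any other compiler or CPU it falls back to a byte loop. It is not taken from asmlib.

```c
#include <stddef.h>

/* Hypothetical helper: copies n bytes with "rep movsb" where available. */
void rep_movsb_copy_sketch(void *dst, const void *src, size_t n)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* "rep movsb" copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI].
       All three registers are updated by the instruction, so they are
       declared as read-write operands. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback for other targets */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
#endif
}
```

Whether this beats a SIMD loop depends heavily on the CPU generation, so it should be benchmarked rather than assumed to be faster.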


Note that similar optimizations can be made for finding bytes in a buffer too. strchr() and friends will often be faster than your hand-rolled equivalent. This is especially true for .NET and Java. For example, in .NET, the built-in String.IndexOf() is much faster than even a Boyer–Moore string search, because it uses the above optimization techniques.




Short answer:


  • cache fill
  • wordsize transfers instead of byte ones where possible
  • SIMD magic



As others have said, memcpy copies in chunks larger than one byte. Copying in word-sized chunks is much faster. However, most implementations take it a step further and run several MOV (word) instructions per loop iteration. The advantage of copying, say, 8 words per iteration is that the loop overhead itself is costly. This technique reduces the number of conditional branches by a factor of 8, optimizing the copy for large blocks.




I don't know whether it is actually used in any real-world implementations of memcpy, but I think Duff's Device deserves a mention here.


From Wikipedia:


send(to, from, count)
register short *to, *from;
register count;
{
        register n = (count + 7) / 8;
        switch(count % 8) {
        case 0:      do {     *to = *from++;
        case 7:              *to = *from++;
        case 6:              *to = *from++;
        case 5:              *to = *from++;
        case 4:              *to = *from++;
        case 3:              *to = *from++;
        case 2:              *to = *from++;
        case 1:              *to = *from++;
                } while(--n > 0);
        }
}

Note that the above isn't a memcpy since it deliberately doesn't increment the to pointer. It implements a slightly different operation: the writing into a memory-mapped register. See the Wikipedia article for details.
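
For comparison, here is a sketch of how the same trick could be turned into an actual memory copy by incrementing both pointers. This is an adaptation written for illustration, not the version from Wikipedia: the name `duff_copy_sketch`, the byte-sized element type, and the explicit zero-length check (the original assumes count > 0) are all choices made here.

```c
#include <stddef.h>

/* Hypothetical adaptation of Duff's Device that copies `count` bytes. */
void duff_copy_sketch(unsigned char *to, const unsigned char *from, size_t count)
{
    size_t n;

    if (count == 0)          /* the original device assumes count > 0 */
        return;

    n = (count + 7) / 8;     /* number of passes through the do-while */
    switch (count % 8) {     /* jump into the middle of the unrolled body */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The first pass handles the count % 8 leftover bytes by entering the loop body partway through, and every later pass copies a full 8 bytes. Modern compilers may warn about the implicit fallthrough, which is intentional here.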




The answers are great, but if you still want to implement a fast memcpy yourself, there is an interesting blog post about it, Fast memcpy in C.


void *memcpy(void* dest, const void* src, size_t count)
{
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;

    // Copy one byte first if the count is odd
    if (count & 1) {
        dst8[0] = src8[0];
        dst8 += 1;
        src8 += 1;
    }

    // Copy the rest two bytes per iteration
    count /= 2;
    while (count--) {
        dst8[0] = src8[0];
        dst8[1] = src8[1];

        dst8 += 2;
        src8 += 2;
    }
    return dest;
}

It can be improved even further by optimizing the memory accesses.




Because, like many library routines, it has been optimized for the architecture you are running on. Others have posted various techniques which can be used.


Given the choice, use library routines rather than rolling your own. This is a variation on DRY that I call DRO (Don't Repeat Others). Also, library routines are less likely to be wrong than your own implementation.


I have seen memory access checkers complain about out of bounds reads on memory or string buffers which were not a multiple of the word size. This is a result of the optimization being used.




