Speed problem with a huge C array using 64-bit Visual C

Date: 2022-09-01 10:18:08

I need to read a massive amount of data into a buffer (about 20 GB). I have 192 GB of very fast DDR RAM available, so memory size is no issue. However, I am finding that the following code runs slower and slower the further it gets into the buffer. The Visual C profiler tells me that 68% of the 12-minute execution time is spent in the 2 statements inside the loop in myFunc(). I am running 64-bit Windows 7 on a very fast Dell with 2 CPUs, 6 physical cores each (24 logical cores), and all 24 cores are completely maxed out while this runs.

#include <stdlib.h>   /* calloc */

#define ONE_BILLION  1000000000LL   /* assumed value; this macro is not defined in the original post */
#define TREAM_COUNT  9000
#define ARRAY_SIZE   ONE_BILLION

#define offSet(a,b,c,d) ( ((size_t) ARRAY_SIZE * (a)) + ((size_t) TREAM_COUNT * 800 * (b)) + ((size_t) 800 * (c)) + (d) )

void doWork(short *ptx, int dogex, int ptxIndex, int carIndex);

void myFunc(int dogex, int ptxIndex, int xtreamIndex, int carIndex)
{
    /* 20 billion shorts = ~40 GB */
    short *ptx = (short *) calloc(ARRAY_SIZE * 20, sizeof(short));

    #pragma omp parallel for
    for (int bIndex = 0; bIndex < 800; ++bIndex)
        doWork(ptx, dogex, ptxIndex, carIndex);
}

void doWork(short *ptx, int dogex, int ptxIndex, int carIndex)   /* ptx passed explicitly so the sample compiles */
{
    for (int treamIndex = 0; treamIndex < ONE_BILLION; ++treamIndex)
    {
         short ptxValue     =  ptx[ offSet(dogex, ptxIndex,   treamIndex, carIndex) ];
         short lastPtxValue =  ptx[ offSet(dogex, ptxIndex-1, treamIndex, carIndex) ];

         // ....
    }
}

4 Answers

#1


6  

The code allocated 20 blocks of one billion short ints. On a 64-bit Windows box, a short int is 2 bytes. So the allocation is ~40 gigabytes.

You say there are 24 cores and they're all maxed out. The code as it is doesn't appear to show any parallelism. The way in which the code is parallelised could have a profound effect upon performance. You may need to provide more information.

--

Your basic problem, I suspect, revolves around cache behaviour and memory access limits.

First, with two physical CPUs of six cores each, you will utterly saturate your memory bus. Probably you have a NUMA architecture anyway, but there's no control in the code about where your calloc() allocates (e.g. you could have a lot of code stored in memory which requires multiple hops to reach).
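One thing you can do about placement, sketched below under the assumption that pages are physically committed on first access (the usual first-touch behaviour): zero the buffer from the same threads that will later read it, so each chunk tends to land on the touching thread's NUMA node. The firstTouch() helper and its chunking are illustrative, not part of the original code.

    #include <string.h>

    /* First-touch placement sketch: touching a page commits it, so each
       thread zeroes (and thereby places) the chunk it will later process.
       The remainder after dividing into chunks is ignored in this sketch. */
    void firstTouch(short *ptx, size_t totalShorts, int nChunks)
    {
        size_t chunkLen = totalShorts / nChunks;

        #pragma omp parallel for schedule(static)
        for (int c = 0; c < nChunks; ++c)
            memset(ptx + (size_t)c * chunkLen, 0, chunkLen * sizeof(short));
    }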

Hyperthreading is turned on. This effectively halves cache sizes. Given the code is memory bus bound, rather than compute bound, hyperthreading is harmful. (Having said that, if computation is constantly outside of cache bounds anyway, this won't change much).

It's not clear, since some (much?) of the code has been removed, exactly how the array is being accessed; the access pattern, and the optimization of that pattern to honour the cache, is the key to performance.

What I see in how offSet() is calculated is that the code constantly requires new virtual-to-physical address lookups, each of which takes something like four or five memory accesses. This is killing performance by itself.
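If those page-table walks really are the bottleneck, large pages cut the number of TLB entries needed dramatically: one 2 MB page covers what would otherwise take 512 entries for 4 KB pages. A hedged sketch for Windows follows; it assumes the account holds the SeLockMemoryPrivilege that MEM_LARGE_PAGES requires, and falls back to a normal allocation otherwise.

    #include <windows.h>

    /* Allocate with large pages to relieve TLB pressure; fall back to
       regular pages if the privilege or contiguous memory is missing. */
    short *allocBuffer(size_t bytes)
    {
        SIZE_T page = GetLargePageMinimum();   /* 0 if large pages unsupported */
        short *p = NULL;

        if (page != 0)
        {
            SIZE_T rounded = (bytes + page - 1) & ~(page - 1);
            p = (short *) VirtualAlloc(NULL, rounded,
                                       MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                       PAGE_READWRITE);
        }
        if (p == NULL)                         /* no privilege, or fragmentation */
            p = (short *) VirtualAlloc(NULL, bytes,
                                       MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        return p;
    }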

My basic advice would be to break the array up into level-2-cache-sized blocks, give one block to each CPU, and let it process that block. You can do that in parallel. Actually, you might be able to use hyperthreading to pre-load the cache, but that's a more advanced technique.
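A minimal sketch of that blocking scheme, with an assumed 256 KB per-core L2 and a placeholder processBlock() standing in for the real per-element work (both are illustrations, not the original code):

    #include <stddef.h>

    #define L2_BYTES     (256 * 1024)                  /* assumed L2 size        */
    #define BLOCK_SHORTS (L2_BYTES / sizeof(short))    /* shorts per cache block */

    /* Placeholder for the real work done on one block. */
    static void processBlock(short *block, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            block[i] = (short)(block[i] + 1);
    }

    /* Each thread takes whole L2-sized blocks and finishes one before
       touching the next, so its working set stays cache-resident. */
    void processArray(short *ptx, size_t totalShorts)
    {
        int nBlocks = (int)(totalShorts / BLOCK_SHORTS);

        #pragma omp parallel for schedule(static)
        for (int b = 0; b < nBlocks; ++b)
            processBlock(ptx + (size_t)b * BLOCK_SHORTS, BLOCK_SHORTS);
    }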

#2


2  

You should try to access the array in a more linear fashion if possible. The current pattern probably causes an excessive number of cache misses.
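To make the point concrete, here is an illustrative contrast (not code from the question): the strided walk below touches a new cache line on almost every iteration, while the linear walk touches each 64-byte line once, i.e. one miss per 32 shorts.

    #include <stddef.h>

    /* Strided traversal: consecutive reads are `stride` shorts apart. */
    long long sumStrided(const short *a, size_t n, size_t stride)
    {
        long long s = 0;
        for (size_t i = 0; i < stride; ++i)
            for (size_t j = i; j < n; j += stride)
                s += a[j];
        return s;
    }

    /* Linear traversal of the same data: one cache miss per line. */
    long long sumLinear(const short *a, size_t n)
    {
        long long s = 0;
        for (size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }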

#3


2  

This optimization will get rid of the slow multiplications:

    ...
    size_t idx1 = offSet(dogex, ptxIndex,   0, carIndex);   /* size_t, not int: these offsets exceed INT_MAX */
    size_t idx2 = offSet(dogex, ptxIndex-1, 0, carIndex);

    for (int treamIndex = 0; treamIndex < ONE_BILLION; ++treamIndex)
    {
         short ptxValue     = ptx[ idx1 ];
         short lastPtxValue = ptx[ idx2 ];

         idx1 += 800;    /* stepping treamIndex by 1 advances the offset by 800 */
         idx2 += 800;
         ...
    }

#4


2  

I think the problem with this code is its memory access pattern. The fact that each thread accesses memory in large (2*800-byte) increments has 2 negative consequences:

  1. At the start, all threads access the same piece of memory, which is preloaded into the L2/L3 cache and used efficiently by every thread. Later on, the threads proceed at slightly different speeds and access different pieces of memory, which results in cache thrashing: one thread loads data into the cache and evicts data that other threads have not yet read but still need. As a result, the same piece of memory is read into the cache several times (in the worst case 12 times, the number of threads on one CPU). Since the memory bus is relatively slow, this slows down the whole program.

  2. The L1 cache is also used rather inefficiently: only a small part of the data in each cache line is used by the CPU cores.

The solution is to let each thread access memory sequentially (for example, by exchanging the c and d arguments of the offSet(a,b,c,d) macro).
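A sketch of that exchange, reusing the question's macros; the other strides would still have to be adjusted to the real dimensions of the data, so treat this as the shape of the fix rather than a drop-in replacement:

    /* Swapped layout: the index the inner loop increments now sits in the
       stride-1 slot, so consecutive treamIndex values read adjacent shorts
       (2 bytes apart) instead of jumping 800 elements (1600 bytes). */
    #define offSetSwapped(a,b,c,d) ( ((size_t) ARRAY_SIZE * (a))         \
                                   + ((size_t) TREAM_COUNT * 800 * (b))  \
                                   + ((size_t) 800 * (d))                \
                                   + (c) )

    /* Inside the loop from the question: */
    short ptxValue = ptx[ offSetSwapped(dogex, ptxIndex, treamIndex, carIndex) ];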
