Can register overflow cause a CUDA_EXCEPTION_5, Warp Out-of-range Address error?

I'm getting a CUDA_EXCEPTION_5, Warp Out-of-range Address error and I'm trying to figure out the various scenarios that can cause that.

I'm working on porting a C project (written by somebody else) to CUDA. The C code is very register-heavy, instantiating many arrays on the stack. I assume register overflow is very likely occurring and may be triggering the warp out-of-range error.

Note that I want to get it running first; then I will begin optimizing the code.

I'm using compute capability 3.0 hardware, which according to Wikipedia has 512KB of "local memory per thread". I read elsewhere that it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

I'm currently executing my kernel as follows (yes I know it's ultra-slow):

dim3 grid(28800,1);
cuPlotLRMap<<<grid,1>>>(...)

Some details (I don't know how helpful this will be):

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

1 Answer

#1

512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

See the Compute Capabilities table in the CUDA C Programming Guide. Compute capability 2.x and above devices support a maximum of 512KB of local memory per thread. The function cudaDeviceSetLimit(cudaLimitStackSize, bytesPerThread) can be used to set the per-thread stack size. I believe the default is 2 KB per thread.
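
As a minimal sketch of the call mentioned above, the helper below requests a per-thread stack size and reads back what the runtime reports. The helper name `setStackBytesPerThread` and the 4096-byte figure in the usage note are illustrative assumptions, not values taken from the question.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): request a per-thread
// stack size, then read back the limit the runtime reports afterwards.
cudaError_t setStackBytesPerThread(size_t requested, size_t* granted)
{
    // cudaLimitStackSize controls the stack portion of per-thread local memory.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, requested);
    if (err != cudaSuccess) {
        return err;
    }
    // Query the limit so the caller can see the value actually in effect.
    return cudaDeviceGetLimit(granted, cudaLimitStackSize);
}
```

Calling `setStackBytesPerThread(4096, &granted)` before the kernel launch and checking the returned `cudaError_t` is one way to rule out a too-small stack while debugging.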

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

Compute capability 3.x devices can have at most 16 resident blocks per multiprocessor. This assumes that your registers/thread, threads/block, or shared memory/block do not limit the kernel to fewer than the device maximum. The Visual Profiler and the Nsight VSE CUDA Profiler show the configuration used by your kernel.

Currently, you are launching only 1 thread per block. You should be launching a multiple of WARP_SIZE (32) threads per block.
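
A rough sketch of such a launch follows. The real arguments of cuPlotLRMap are not shown in the question, so the kernel `plotItems`, its parameters, and the choice of 128 threads per block are all hypothetical stand-ins. The bounds check matters because the last block may contain threads past the end of the data.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n work items (ceiling division).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical stand-in for the real kernel: one work item per thread.
__global__ void plotItems(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;  // threads past the end of the data do nothing
    }
    out[i] = static_cast<float>(i);
}

// Launch with a block size that is a multiple of the warp size (32).
cudaError_t launchPlotItems(float* d_out, int n)
{
    const int threadsPerBlock = 128;  // assumed value, 4 warps per block
    plotItems<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_out, n);
    cudaError_t err = cudaGetLastError();  // catches launch failures
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();        // catches errors during execution
}
```

With the 28800 work items from the question, `blocksFor(28800, 128)` gives 225 blocks instead of 28800 single-thread blocks.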

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

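At compile or JIT time the compiler will perform register allocation. If there are insufficient registers per thread then the compiler will spill to local memory. This operation is deterministic and is decided at compile time, not at runtime. Local memory is private to each thread, so concurrent threads do not spill into the same addresses.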

Compute capability 3.0 devices are limited to 63 registers/thread. Compute capability 3.5 devices are limited to 255 registers per thread.
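
One way to see what the compiler decided for a given kernel is to query its attributes at runtime. The sketch below is an illustration under assumptions: `scratchKernel` is a hypothetical kernel with a per-thread array, standing in for the array-heavy ported code. Note that `localSizeBytes` counts all per-thread local memory, including local arrays, not only spilled registers.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel with a per-thread array, similar in spirit to
// C code that instantiates arrays on the stack.
__global__ void scratchKernel(float* out, int n)
{
    float scratch[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    for (int k = 0; k < 256; ++k) {
        scratch[k] = static_cast<float>(k + i);
    }
    // Data-dependent index, so the array cannot be kept purely in registers.
    out[i] = scratch[(i * 7) % 256];
}

// Fill attr with the compiled kernel's resource usage.
// attr->numRegs is registers per thread; attr->localSizeBytes is
// local memory per thread (arrays plus any spills).
cudaError_t queryScratchKernel(cudaFuncAttributes* attr)
{
    return cudaFuncGetAttributes(attr, scratchKernel);
}
```

Compiling with `nvcc -Xptxas -v` reports similar per-kernel figures at build time, including spill stores and loads.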
