Finding the minimum value in an array (while skipping some elements) using reduction in CUDA

Time: 2022-08-22 13:03:55

I have a large array of floating point numbers and I want to find out the minimum value of the array (ignoring -1s wherever present) as well as its index, using reduction in CUDA. I have written the following code to do this, which in my opinion should work:

 __global__ void get_min_cost(float *d_Cost,int n,int *last_block_number,int *number_in_last_block,int *d_index){
     int tid = threadIdx.x;
     int myid = blockDim.x * blockIdx.x + threadIdx.x;
     int s;

     if(blockIdx.x == (*last_block_number)-1){
         s = (*number_in_last_block)/2;
     }else{
         s = 1024/2;
     }

     for(;s>0;s/=2){
         if(myid+s>=n)
             continue;
         if(tid<s){
             if(d_Cost[myid+s] == -1){
                 continue;
             }else if(d_Cost[myid] == -1 && d_Cost[myid+s] != -1){
                 d_Cost[myid] = d_Cost[myid+s];
                 d_index[myid] = d_index[myid+s];
             }else{
                 // both not -1
                 if(d_Cost[myid]<=d_Cost[myid+s])
                     continue;
                 else{
                     d_Cost[myid] = d_Cost[myid+s];
                     d_index[myid] = d_index[myid+s];
                 }
             }
         }
         else
             continue;
         __syncthreads();
     }
     if(tid==0){
         d_Cost[blockIdx.x] = d_Cost[myid];
         d_index[blockIdx.x] = d_index[myid];
     }
     return;
 }

The last_block_number argument is the id of the last block, and number_in_last_block is the number of elements in the last block (which is a power of 2). Thus, every block is launched with 1024 threads, but the last block only uses number_in_last_block of them, while the others use all 1024.

After this function runs, I expect the minimum values for each block to be in d_Cost[blockIdx.x] and their indices in d_index[blockIdx.x].

I call this function multiple times, each time updating the number of threads and blocks. The second time I call it, the number of elements to reduce becomes equal to the number of blocks from the previous call, and so on.
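
The pass-by-pass bookkeeping described above can be sketched on the host as follows. This is only an illustration under stated assumptions: each block reduces up to `block_size` elements to one partial result, and `PassShape` and `plan_passes` are hypothetical names that do not appear in the original code. The kernel above additionally assumes the last block's element count is a power of 2, which this sketch does not enforce.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type (not from the question): launch shape of one pass.
struct PassShape {
    int blocks;          // number of blocks launched in this pass
    int last_block_size; // number of elements handled by the final block
};

// Lists the passes needed to reduce n elements to a single result, assuming
// each block reduces up to block_size elements to one partial result.
std::vector<PassShape> plan_passes(int n, int block_size) {
    assert(n >= 0 && block_size > 1);
    std::vector<PassShape> passes;
    while (n > 1) {
        int blocks = (n + block_size - 1) / block_size; // ceiling division
        int last = n - (blocks - 1) * block_size;       // between 1 and block_size
        passes.push_back({blocks, last});
        n = blocks; // the next pass sees one partial result per block
    }
    return passes;
}
```

For example, `plan_passes(5000, 1024)` yields two passes: 5 blocks with 904 elements in the last block, then 1 block with 5 elements.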

However, the above function isn't giving me the desired output. In fact, it gives a different output every time I run the program, i.e., it returns an incorrect value as the minimum during some intermediate iteration (though that incorrect value is quite close to the minimum every time).

What am I doing wrong here?

1 Solution

#1


As I mentioned in my comment above, I would recommend against writing reductions of your own and suggest using CUDA Thrust whenever possible. This holds even when you need to customize those operations, since customization is possible by properly overloading, e.g., relational operations.

Below I'm providing a simple code to find the minimum of an array along with its index. It is based on a classical example contained in the An Introduction to Thrust presentation. The only addition is skipping the -1's, as you requested. This can reasonably be done by replacing all the -1's in the array with INT_MAX, i.e., the largest value representable by an int. The example uses int data; for a float array such as yours, FLT_MAX from <cfloat> would play the same role.

#include <thrust/device_vector.h>
#include <thrust/replace.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

#include <climits>  // INT_MAX
#include <cstdio>   // printf, getchar

// --- Struct returning the smallest of two tuples
struct smaller_tuple
{
    // Tuples compare lexicographically: by value first, then by index
    __host__ __device__ thrust::tuple<int,int> operator()(thrust::tuple<int,int> a, thrust::tuple<int,int> b) const
    {
        if (a < b)
            return a;
        else
            return b;
    }
};


int main() {

    const int N = 20;
    const int large_value = INT_MAX;

    // --- Setting the data vector
    thrust::device_vector<int> d_vec(N,10);
    d_vec[3] = -1; d_vec[5] = -2;

    // --- Copying the data vector to a new vector where the -1's are changed to INT_MAX
    thrust::device_vector<int> d_vec_temp(d_vec);
    thrust::replace(d_vec_temp.begin(), d_vec_temp.end(), -1, large_value);

    // --- Creating the index sequence [0, 1, 2, ... )
    thrust::device_vector<int> indices(d_vec_temp.size());
    thrust::sequence(indices.begin(), indices.end());

    // --- Setting the initial value of the search
    thrust::tuple<int,int> init(d_vec_temp[0],0);

    thrust::tuple<int,int> smallest;
    smallest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.begin(), indices.begin())),
                          thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.end(), indices.end())),
                          init, smaller_tuple());

    printf("Smallest %i %i\n",thrust::get<0>(smallest),thrust::get<1>(smallest));
    getchar();
}
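
The code above replaces the -1's with a sentinel before reducing. An alternative that fits the float data in the question is to let the combining operator itself ignore -1 entries. The following is a minimal host-only sketch of that idea in plain C++, so it can be checked without a GPU. All names (`Candidate`, `kSkip`, `pick_min`, `min_ignoring_skips`) are hypothetical and not part of Thrust or of the code above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not code from the answer above.
struct Candidate {
    float value; // cost, or kSkip when no valid element has been seen
    int index;   // position in the input, or -1 when there is none
};

const float kSkip = -1.0f; // marker value that must be ignored

// Combines two partial results; a skipped entry never beats a valid one.
// Ties on value go to the lower index, so the result does not depend on order.
Candidate pick_min(const Candidate& a, const Candidate& b) {
    if (a.value == kSkip) return b;
    if (b.value == kSkip) return a;
    if (a.value < b.value) return a;
    if (b.value < a.value) return b;
    return (a.index <= b.index) ? a : b;
}

// Sequential reference: returns {kSkip, -1} if every element is skipped.
Candidate min_ignoring_skips(const std::vector<float>& cost) {
    const Candidate none = {kSkip, -1};
    Candidate best = none;
    for (std::size_t i = 0; i < cost.size(); ++i) {
        Candidate c = (cost[i] == kSkip) ? none
                                         : Candidate{cost[i], static_cast<int>(i)};
        best = pick_min(best, c);
    }
    return best;
}
```

Because `pick_min` treats `{kSkip, -1}` as a neutral element and is order independent, the same comparison should also be usable as the binary operator of a parallel reduction, for instance inside a functor marked `__host__ __device__` and passed to `thrust::reduce`. That device-side variant is an assumption and is not shown or tested here.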
