Is it possible to micro-optimize "x = max(a,b); y = min(a,b);"?

Time: 2022-09-28 09:56:47

I had an algorithm that started out like

int sumLargest2 ( int * arr, size_t n )
{
    int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1])); 
    // ... 

and I realized that the first is probably not optimal because calling max and then min is repetitious when you consider that the information required to know the minimum is already there once you've found the maximum. So I figured out that I could do

   int largest = max(arr[0], arr[1]);
   int secondLargest = arr[0] == largest ? arr[1] : arr[0];

to shave off the useless invocation of min, but I'm not sure that actually saves any number of operations. Are there any fancy bit-shifting algorithms that can do the equivalent of

int largest(max(arr[0], arr[1])), secondLargest(min(arr[0],arr[1]));

?

8 answers

#1


10  

In C++, you can use std::minmax to produce a std::pair of the minimum and the maximum. This is particularly easy in combination with std::tie:

#include <algorithm>  // std::minmax
#include <tuple>      // std::tie
#include <utility>    // std::pair

int largest, secondLargest;
std::tie(secondLargest, largest) = std::minmax(arr[0], arr[1]);

GCC, at least, is capable of optimizing the call to minmax into a single comparison, identical to the result of the C code below.

In C, you could write the test out yourself:

int largest, secondLargest;
if (arr[0] < arr[1]) {
  largest = arr[1];
  secondLargest = arr[0];
} else {
  largest = arr[0];
  secondLargest = arr[1];
}
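
For reuse, the single-comparison test can be wrapped in a small helper. This is only a sketch: the name order2 is hypothetical and not part of any library. Whether a compiler turns it into a branch or into conditional moves depends on the target and flags, so it is worth inspecting the generated assembly before assuming either.

```cpp
// Hypothetical helper (not a standard function): orders two ints
// with exactly one comparison in the source.
inline void order2(int a, int b, int& largest, int& secondLargest) {
    const bool aIsLarger = a > b;          // the only comparison
    largest       = aIsLarger ? a : b;
    secondLargest = aIsLarger ? b : a;
}
```

It would be called as order2(arr[0], arr[1], largest, secondLargest).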

#2


4  

How about:

int largestIndex = arr[1] > arr[0];
int largest = arr[largestIndex];
int secondLargest = arr[1 - largestIndex];

The first line relies on the implicit conversion of the comparison result to an integer: 1 if it is true and 0 if it is false.

#3


4  

I'm going to assume that you'd rather solve the larger problem... That is, getting the sum of the largest two numbers in an array.

What you are trying to do is a std::partial_sort(). Let's use it.

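#include <algorithm>   // std::partial_sort
#include <cstddef>     // size_t
#include <functional>  // std::greater

// Requires n >= 2; reorders arr in place.
int sumLargest2(int * arr, size_t n) {
    int * first  = arr;
    int * middle = arr + 2;
    int * last   = arr + n;

    // Moves the two largest elements to the front, in descending order.
    std::partial_sort(first, middle, last, std::greater<int>());

    return arr[0] + arr[1];
}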

And if you're unable to modify arr, then I'd recommend looking into std::partial_sort_copy().

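A minimal sketch of the std::partial_sort_copy() variant, assuming n >= 2. The function name sumLargest2Const and the two-element scratch buffer are illustrative choices, not something from the answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Sketch: sum of the two largest values without modifying arr.
// Assumes n >= 2; sumLargest2Const is a hypothetical name.
int sumLargest2Const(const int* arr, std::size_t n) {
    int top[2];  // receives the two largest elements, in descending order
    std::partial_sort_copy(arr, arr + n, top, top + 2, std::greater<int>());
    return top[0] + top[1];
}
```

Only the two-element buffer is written, so the input can be const.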

#4


4  

x = max(a, b);
y = a + b - x;

It won't necessarily be faster, but it will be different.

Also beware of overflow: a + b can exceed the range of the type, and signed integer overflow is undefined behavior.

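If overflow is a concern, the addition and subtraction can be swapped for XOR, which cannot overflow. This is a different trick from the one above, shown only as a sketch; the name maxThenMin is hypothetical, and there is no reason to expect it to beat a plain comparison.

```cpp
#include <algorithm>

// Sketch: recover the smaller value from the larger one with XOR.
// x is either a or b, so a ^ b ^ x cancels x and leaves the other operand.
inline void maxThenMin(int a, int b, int& x, int& y) {
    x = std::max(a, b);
    y = a ^ b ^ x;   // bitwise only, so nothing can overflow
}
```

This works for any pair of ints, including the extremes of the range.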

#5


4  

If your intention is to reduce the number of function calls needed to find the min and max, you can try std::minmax_element. It has been available since C++11.

auto result = std::minmax_element(arr, arr+n);
std::cout<< "min:"<< *result.first<<"\n";
std::cout<< "max :" <<*result.second << "\n";

#6


3  

If you just want to find the bigger of two values, go with:

if (a > b)
{
    largest = a;
    second = b;
}
else
{
    largest = b;
    second = a;
}

No function calls, one comparison, two assignments.

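The same comparison extends to a single pass over a whole array. The question elides the rest of its loop, so the following is only one possible sketch, assuming n >= 2; the name sumTwoLargest is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Sketch: sum of the two largest elements in one pass. Assumes n >= 2.
int sumTwoLargest(const int* arr, std::size_t n) {
    assert(n >= 2);
    int largest = arr[0], second = arr[1];
    if (second > largest) std::swap(largest, second);  // order the first pair
    for (std::size_t i = 2; i < n; ++i) {
        if (arr[i] > largest) {          // new maximum: old one becomes second
            second = largest;
            largest = arr[i];
        } else if (arr[i] > second) {    // fits between second and largest
            second = arr[i];
        }
    }
    return largest + second;
}
```

Each later element costs at most two comparisons, and nothing is written back to the array.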

#7


2  

I'm assuming C++...

Short answer: use std::minmax and compile with optimizations enabled and the right instruction-set flags.

Long ugly answer: the compiler cannot make all the assumptions necessary to make it really, really fast. You can. In this case, you can change the algorithm to process all the data first, and you can force alignment on the data. Having done all this, you can use intrinsics to make it faster.

Although I haven't tested it in this particular case, I've seen enormous performance improvements using these guidelines.

Since you're not passing 2 integers to the function, I'm assuming you're using an array and want to iterate over it somehow. You now have a choice to make: make 2 arrays and use min/max, or use 1 array holding both a and b. This decision alone can already influence the performance.

If you have 2 arrays, they can be allocated on 32-byte boundaries with an aligned allocation function and then processed using intrinsics. If you are going for real, raw performance, this is the way to go.

For example, let's assume you have AVX2. (NOTE: I'm not sure that you do, and you SHOULD check this using CPUID!) Go to the cheat sheet here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and pick your poison.

In this case, the intrinsics you're looking for are probably:

  • _mm256_min_epi32
  • _mm256_max_epi32
  • _mm256_stream_load_si256

If you have to do this for the entire array, you probably want to keep all the stuff in a single __m256i register before merging the individual items. E.g.: do a min/max per 256-bit vector, and when the loop is done, extract the 32-bit items and do a min/max on those.

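A rough sketch of that "reduce per vector, merge at the end" idea for the maximum only. It assumes a 32-bit int and n >= 1, and the name maxOfArray is hypothetical. It uses the unaligned load _mm256_loadu_si256 rather than the streaming load listed above, so that no alignment is required, and it falls back to a scalar loop when the translation unit is not compiled for AVX2.

```cpp
#include <algorithm>
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Sketch: maximum of an int array, keeping eight running maxima in one
// 256-bit register and merging them only after the loop.
// Assumes n >= 1 and 32-bit int; maxOfArray is a hypothetical name.
int maxOfArray(const int* arr, std::size_t n) {
    std::size_t i = 0;
    int best = arr[0];
#if defined(__AVX2__)
    if (n >= 8) {
        // Unaligned loads, so no alignment requirement is placed on arr.
        __m256i acc = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(arr));
        for (i = 8; i + 8 <= n; i += 8) {
            __m256i v = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(arr + i));
            acc = _mm256_max_epi32(acc, v);  // eight signed 32-bit maxima at once
        }
        // Horizontal step: spill the eight lanes and reduce them in scalar code.
        alignas(32) int lanes[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
        best = *std::max_element(lanes, lanes + 8);
    }
#endif
    for (; i < n; ++i) best = std::max(best, arr[i]);  // tail, or scalar fallback
    return best;
}
```

Tracking the second-largest value as well needs more bookkeeping per lane, which this sketch deliberately leaves out.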

Long nicer answer: as for the compiler, compilers do attempt to optimize these kinds of things, but they run into problems.

If you have 2 different arrays that you process, the compiler has to know that they do not overlap in order to be able to optimize the loop. This is the reason why stuff like restrict exists (a C99 keyword; most C++ compilers offer it as the __restrict extension), which tells the compiler exactly this little thing you probably already knew while writing the code.

Also, the compiler doesn't know your memory is aligned, so it has to check this and branch... for each call. We don't want this, which means we want it to inline its stuff. So, add inline, put it in a header file and that's that. You can also use aligned to give it a hint.

Your compiler also didn't get the hint that the data behind the int* won't change over time. If it cannot change, it's a good idea to tell the compiler so with the const keyword.

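Put together, those hints might look like the sketch below for the two-array layout. The function name minmaxArrays and its parameters are hypothetical. __restrict is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C++, and whether the loop actually gets vectorized depends on the compiler and flags.

```cpp
#include <cstddef>

// Sketch: element-wise min/max of two arrays. __restrict promises the
// compiler that the four arrays do not overlap; const marks the inputs
// as read-only. minmaxArrays is a hypothetical name.
inline void minmaxArrays(const int* __restrict a, const int* __restrict b,
                         int* __restrict lo, int* __restrict hi,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const bool aIsLarger = a[i] > b[i];   // one comparison per pair
        hi[i] = aIsLarger ? a[i] : b[i];
        lo[i] = aIsLarger ? b[i] : a[i];
    }
}
```

Passing overlapping arrays to a function declared this way breaks the promise and is undefined behavior, so the qualifier belongs only where the caller can guarantee it.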

A compiler targets a particular instruction set. Normally it already uses SSE, but AVX2 can help a lot (as the intrinsics listed above suggest). If you can compile with those flags, make sure to use them; they help a lot.

Run in release mode, compile with optimizations set to 'fast' and see what happens under the hood. If you do all this, you should see vpmax... instructions appearing in the inner loops, which means the compiler is emitting the vector min/max instructions on its own.

I don't know what else you want to do in the loop... if you use all these instructions, you should become limited by memory speed on big arrays.

#8


2  

How about a time-space trade-off?

#include <utility>

template<typename T>
    std::pair<T, T>
        minmax(T const& a, T const& b)
        { return b < a ? std::make_pair(b, a) : std::make_pair(a, b); }

//main
std::pair<int, int> values = minmax(a[0], a[1]);
int largest       = values.second;
int secondLargest = values.first;
