模板化无分支int max / min函数

时间:2022-09-28 17:18:41

I'm trying to write a branchless function to return the MAX or MIN of two integers without resorting to if (or ?:). Using the usual technique I can do this easily enough for a given word size:

我正在尝试编写一个无分支函数来返回两个整数的MAX或MIN而不求助于if(或?:)。使用通常的技术,我可以很容易地为给定的字大小做到这一点:

inline int32 imax( int32 a, int32 b )
{
    // signed for arithmetic shift
    int32 mask = a - b;
    // mask < 0 means MSB is 1.
    return a + ( ( b - a ) & ( mask >> 31 ) );
}

Now, assuming arguendo that I really am writing the kind of application on the kind of in-order processor where this is necessary, my question is whether there is a way to use C++ templates to generalize this to all sizes of int.

现在,假设arguendo我真的在那种必要的有序处理器上编写那种应用程序,我的问题是是否有办法使用C ++模板将其推广到所有大小的int。

The >>31 step only works for int32s, of course, and while I could copy out overloads on the function for int8, int16, and int64, it seems like I should use a template function instead. But how do I get the size of a template argument in bits?

>> 31步只适用于int32s,当然,虽然我可以复制int8,int16和int64函数的重载,但似乎我应该使用模板函数。但是如何以位为单位获取模板参数的大小?

Is there a better way to do it than this? Can I force the mask T to be signed? If T is unsigned the mask-shift step won't work (because it'll be a logical rather than arithmetic shift).

有没有比这更好的方法呢?我可以强制让面具T签名吗?如果T是无符号的,则掩码移位步骤将不起作用(因为它将是逻辑而不是算术移位)。

template< typename T > 
inline T imax( T a, T b )
{
    // how can I force this T to be signed?
    T mask = a - b;
    // I hope the compiler turns the math below into an immediate constant!
    mask = mask >> ( (sizeof(T) * 8) - 1 );
    return a + ( ( b - a ) & mask );
}

And, having done the above, can I prevent it from being used for anything but an integer type (eg, no floats or classes)?

并且,完成上述操作后,我可以阻止它被用于除整数类型之外的任何东西(例如,没有浮点数或类)吗?

4 个解决方案

#1


Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT (or numeric_limits::max()) since it isn't guaranteed that characters are 8-bit.

通常,看起来不错,但为了100%可移植性,用CHAR_BIT(或numeric_limits :: max())替换8,因为不能保证字符是8位。

Any good compiler will be smart enough to merge all of the math constants at compile time.

任何好的编译器都足够聪明,可以在编译时合并所有数学常量。

You can force it to be signed by using a type traits library. which would usually look something like (assuming your numeric_traits library is called numeric_traits):

您可以使用类型特征库强制对其进行签名。通常看起来像(假设您的numeric_traits库名为numeric_traits):

typename numeric_traits<T>::signed_type x;

An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).

手动滚动的numeric_traits标头的示例可能如下所示:http://rafb.net/p/Re7kq478.html(有足够的空间可以添加,但您明白了)。

or better yet, use boost:

或者更好的是,使用boost:

typename boost::make_signed<T>::type x;

EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used. But I believe that the standard leaves it up the compiler whether right shifts are arithmetic or not on signed type. In my copy of the draft standard the following is written:

编辑:IIRC,签署右移不必算术。这是常见的,我使用的每个编译器都是如此。但是我相信标准会让编译器在签名类型上是右移也不算算。在我的标准草案副本中,写了以下内容:

The value of E1 >> E2 is E1 rightshifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation defined.

E1 >> E2的值是E1右移E2位位置。如果E1具有无符号类型或者E1具有有符号类型和非负值,则结果的值是E1的商除以提升到功率E2的数量2的积分部分。如果E1具有带符号类型和负值,则结果值是实现定义的。

But like I said, it will work on every compiler I've seen :-p.

但正如我所说,它将适用于我见过的每个编译器:-p。

#2


You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.

您可能想要查看Boost.TypeTraits库。要检测类型是否已签名,可以使用is_signed特征。您还可以查看enable_if / disable_if以删除某些类型的重载。

#3


Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.

这是无分支最大和最小的另一种方法。它的好处是它不使用任何技巧,你不必知道任何类型的东西。

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

template <typename T> 
inline T imin (T a, T b)
{
    return (a > b) * b + (a <= b) * a;
}

#4


tl;dr

To achieve your goals, you're best off just writing this:

为了实现你的目标,你最好写下这个:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

Long version

I implemented both the "naive" implementation of max() as well as your branchless implementation. Both of them were not templated, and I instead used int32 just to keep things simple, and as far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.

我实现了max()的“天真”实现以及无分支实现。它们都不是模板化的,我只是使用int32来保持简单,据我所知,不仅Visual Studio 2017使得天真的实现无分支,它也产生了更少的指令。

Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.

这是相关的Godbolt(并请,请检查实施,以确保我做对了)。请注意,我正在使用/ O2优化进行编译。

Admittedly, my assembly-fu isn't all that great, so while NaiveMax() had 5 fewer instructions and no apparent branching (and inlining I'm honestly not sure what's happening) I wanted to run a test case to definitively show whether the naive implementation was faster or not.

不可否认,我的程序集并不是那么好,所以虽然NaiveMax()有5个指令没有明显的分支(并且内联我真的不确定发生了什么)我想运行一个测试用例来明确地显示是否天真的实施更快或更快。

So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.

所以我建立了一个测试。这是我运行的代码。 Visual Studio 2017(15.8.7),带有“默认”版本编译器选项。

#include <iostream>
#include <chrono>

using int32 = long;
using uint32 = unsigned long;

constexpr int32 NaiveMax(int32 a, int32 b)
{
    return (a > b) ? a : b;
}

constexpr int32 FastMax(int32 a, int32 b)
{
    int32 mask = a - b;
    mask = mask >> ((sizeof(int32) * 8) - 1);
    return a + ((b - a) & mask);
}

int main()
{
    int32 resInts[1000] = {};

    int32 lotsOfInts[1'000];
    for (uint32 i = 0; i < 1000; i++)
    {
        lotsOfInts[i] = rand();
    }

    auto naiveTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    auto fastTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    std::cout << "Naive Time: " << naiveTime << std::endl;
    std::cout << "Fast Time:  " << fastTime << std::endl;

    getchar();

    return 0;
}

And here's the output I get on my machine:

这是我在我的机器上得到的输出:

Naive Time: 2330174
Fast Time:  2492246

I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.

我已经运行了几次得到类似的结果。为了安全起见,我还改变了我进行测试的顺序,以防万一这是因为核心速度加快,结果出现偏差。在所有情况下,我得到与上述类似的结果。

Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.

当然,根据您的编译器或平台,这些数字可能都不同。值得自己测试一下。

The Answer

In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:

简而言之,编写无分支模板化max()函数的最佳方法似乎是保持简单:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

There are additional upsides to the naive method:

天真的方法还有其他好处:

  1. It works for unsigned types.
  2. 它适用于无符号类型。

  3. It even works for floating types.
  4. 它甚至适用于浮动类型。

  5. It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
  6. 它完全表达了您的意图,而不是需要评论您的代码来描述比特正在做什么。

  7. It is a well known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, only backed up by personal experience of compilers surprising me a lot. I'll be willing to admit I'm wrong here.)
  8. 这是一个众所周知且可识别的模式,因此大多数编译器将确切地知道如何优化它,使其更具可移植性。 (这是我的直觉预感,只有编译器的个人经验支持我很多。我会愿意承认我在这里错了。)

#1


Generally, looks good, but for 100% portability, replace that 8 with CHAR_BIT (or numeric_limits::max()) since it isn't guaranteed that characters are 8-bit.

通常,看起来不错,但为了100%可移植性,用CHAR_BIT(或numeric_limits :: max())替换8,因为不能保证字符是8位。

Any good compiler will be smart enough to merge all of the math constants at compile time.

任何好的编译器都足够聪明,可以在编译时合并所有数学常量。

You can force it to be signed by using a type traits library. which would usually look something like (assuming your numeric_traits library is called numeric_traits):

您可以使用类型特征库强制对其进行签名。通常看起来像(假设您的numeric_traits库名为numeric_traits):

typename numeric_traits<T>::signed_type x;

An example of a manually rolled numeric_traits header could look like this: http://rafb.net/p/Re7kq478.html (there is plenty of room for additions, but you get the idea).

手动滚动的numeric_traits标头的示例可能如下所示:http://rafb.net/p/Re7kq478.html(有足够的空间可以添加,但您明白了)。

or better yet, use boost:

或者更好的是,使用boost:

typename boost::make_signed<T>::type x;

EDIT: IIRC, signed right shifts don't have to be arithmetic. It is common, and certainly the case with every compiler I've used. But I believe that the standard leaves it up the compiler whether right shifts are arithmetic or not on signed type. In my copy of the draft standard the following is written:

编辑:IIRC,签署右移不必算术。这是常见的,我使用的每个编译器都是如此。但是我相信标准会让编译器在签名类型上是右移也不算算。在我的标准草案副本中,写了以下内容:

The value of E1 >> E2 is E1 rightshifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation defined.

E1 >> E2的值是E1右移E2位位置。如果E1具有无符号类型或者E1具有有符号类型和非负值,则结果的值是E1的商除以提升到功率E2的数量2的积分部分。如果E1具有带符号类型和负值,则结果值是实现定义的。

But like I said, it will work on every compiler I've seen :-p.

但正如我所说,它将适用于我见过的每个编译器:-p。

#2


You may want to look at the Boost.TypeTraits library. For detecting whether a type is signed you can use the is_signed trait. You can also look into enable_if/disable_if for removing overloads for certain types.

您可能想要查看Boost.TypeTraits库。要检测类型是否已签名,可以使用is_signed特征。您还可以查看enable_if / disable_if以删除某些类型的重载。

#3


Here's another approach for branchless max and min. What's nice about it is that it doesn't use any bit tricks and you don't have to know anything about the type.

这是无分支最大和最小的另一种方法。它的好处是它不使用任何技巧,你不必知道任何类型的东西。

template <typename T> 
inline T imax (T a, T b)
{
    return (a > b) * a + (a <= b) * b;
}

template <typename T> 
inline T imin (T a, T b)
{
    return (a > b) * b + (a <= b) * a;
}

#4


tl;dr

To achieve your goals, you're best off just writing this:

为了实现你的目标,你最好写下这个:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

Long version

I implemented both the "naive" implementation of max() as well as your branchless implementation. Both of them were not templated, and I instead used int32 just to keep things simple, and as far as I can tell, not only did Visual Studio 2017 make the naive implementation branchless, it also produced fewer instructions.

我实现了max()的“天真”实现以及无分支实现。它们都不是模板化的,我只是使用int32来保持简单,据我所知,不仅Visual Studio 2017使得天真的实现无分支,它也产生了更少的指令。

Here is the relevant Godbolt (and please, check the implementation to make sure I did it right). Note that I'm compiling with /O2 optimizations.

这是相关的Godbolt(并请,请检查实施,以确保我做对了)。请注意,我正在使用/ O2优化进行编译。

Admittedly, my assembly-fu isn't all that great, so while NaiveMax() had 5 fewer instructions and no apparent branching (and inlining I'm honestly not sure what's happening) I wanted to run a test case to definitively show whether the naive implementation was faster or not.

不可否认,我的程序集并不是那么好,所以虽然NaiveMax()有5个指令没有明显的分支(并且内联我真的不确定发生了什么)我想运行一个测试用例来明确地显示是否天真的实施更快或更快。

So I built a test. Here's the code I ran. Visual Studio 2017 (15.8.7) with "default" Release compiler options.

所以我建立了一个测试。这是我运行的代码。 Visual Studio 2017(15.8.7),带有“默认”版本编译器选项。

#include <iostream>
#include <chrono>

using int32 = long;
using uint32 = unsigned long;

constexpr int32 NaiveMax(int32 a, int32 b)
{
    return (a > b) ? a : b;
}

constexpr int32 FastMax(int32 a, int32 b)
{
    int32 mask = a - b;
    mask = mask >> ((sizeof(int32) * 8) - 1);
    return a + ((b - a) & mask);
}

int main()
{
    int32 resInts[1000] = {};

    int32 lotsOfInts[1'000];
    for (uint32 i = 0; i < 1000; i++)
    {
        lotsOfInts[i] = rand();
    }

    auto naiveTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = NaiveMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    auto fastTime = [&]() -> auto
    {
        auto start = std::chrono::high_resolution_clock::now();

        for (uint32 i = 1; i < 1'000'000; i++)
        {
            const auto index = i % 1000;
            const auto lastIndex = (i - 1) % 1000;
            resInts[lastIndex] = FastMax(lotsOfInts[lastIndex], lotsOfInts[index]);
        }

        auto finish = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(finish - start).count();
    }();

    std::cout << "Naive Time: " << naiveTime << std::endl;
    std::cout << "Fast Time:  " << fastTime << std::endl;

    getchar();

    return 0;
}

And here's the output I get on my machine:

这是我在我的机器上得到的输出:

Naive Time: 2330174
Fast Time:  2492246

I've run it several times getting similar results. Just to be safe, I also changed the order in which I conduct the tests, just in case it's the result of a core ramping up in speed, skewing the results. In all cases, I get similar results to the above.

我已经运行了几次得到类似的结果。为了安全起见,我还改变了我进行测试的顺序,以防万一这是因为核心速度加快,结果出现偏差。在所有情况下,我得到与上述类似的结果。

Of course, depending on your compiler or platform, these numbers may all be different. It's worth testing yourself.

当然,根据您的编译器或平台,这些数字可能都不同。值得自己测试一下。

The Answer

In brief, it would seem that the best way to write a branchless templated max() function is probably to keep it simple:

简而言之,编写无分支模板化max()函数的最佳方法似乎是保持简单:

template<typename T> T max(T a, T b) { return (a > b) ? a : b; }

There are additional upsides to the naive method:

天真的方法还有其他好处:

  1. It works for unsigned types.
  2. 它适用于无符号类型。

  3. It even works for floating types.
  4. 它甚至适用于浮动类型。

  5. It expresses exactly what you intend, rather than needing to comment up your code describing what the bit-twiddling is doing.
  6. 它完全表达了您的意图,而不是需要评论您的代码来描述比特正在做什么。

  7. It is a well known and recognizable pattern, so most compilers will know exactly how to optimize it, making it more portable. (This is a gut hunch of mine, only backed up by personal experience of compilers surprising me a lot. I'll be willing to admit I'm wrong here.)
  8. 这是一个众所周知且可识别的模式,因此大多数编译器将确切地知道如何优化它,使其更具可移植性。 (这是我的直觉预感,只有编译器的个人经验支持我很多。我会愿意承认我在这里错了。)