SSE multiplication of 4 32-bit integers

Time: 2022-09-01 08:27:12

How to multiply four 32-bit integers by another 4 integers? I didn't find any instruction which can do it.

2 solutions

#1


23  

If you need signed 32x32 bit integer multiplication then the following example at software.intel.com looks like it should do what you want:

#include <emmintrin.h>  // SSE2 intrinsics

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a,b); /* multiply lanes 2,0 */
    __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* multiply lanes 3,1 */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0))); /* move the low 32 bits of each product to [63..0] and interleave */
}
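
To make the lane bookkeeping easier to follow, here is a minimal sketch that restates the same SSE2 technique one step at a time and checks it against plain scalar multiplication. The names `mul32_sse2` and `matches_scalar` are hypothetical and not part of the Intel example. The scalar reference multiplies in unsigned arithmetic, on the assumption that wrapped (low 32 bit) results are what you want.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical name: the same technique as muly, restated step by step.
static inline __m128i mul32_sse2(__m128i a, __m128i b)
{
    // _mm_mul_epu32 multiplies 32-bit lanes 0 and 2 and yields two 64-bit products.
    __m128i even = _mm_mul_epu32(a, b);
    // Shifting right by 4 bytes moves lanes 1 and 3 into positions 0 and 2.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Gather the low 32 bits of each 64-bit product into lanes 0 and 1.
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    // Interleave to even0, odd0, even1, odd1, i.e. products 0, 1, 2, 3.
    return _mm_unpacklo_epi32(even, odd);
}

// Hypothetical helper: true if every lane equals the wrapped scalar product.
static bool matches_scalar(const int32_t (&x)[4], const int32_t (&y)[4])
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(y));
    int32_t out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), mul32_sse2(a, b));
    for (int i = 0; i < 4; ++i) {
        // Unsigned multiplication wraps modulo 2^32, so overflow is well defined.
        uint32_t expected = static_cast<uint32_t>(x[i]) * static_cast<uint32_t>(y[i]);
        if (static_cast<uint32_t>(out[i]) != expected) return false;
    }
    return true;
}
```

Calling `matches_scalar` with a mix of negative and overflowing inputs is a quick way to convince yourself that the unsigned `_mm_mul_epu32` is safe to use for signed operands when only the low 32 bits are kept.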

You might want to have two builds - one for old CPUs and one for recent CPUs, in which case you could do the following:

#ifdef __SSE4_1__
#include <smmintrin.h>  // SSE4.1 intrinsics
#else
#include <emmintrin.h>  // SSE2 intrinsics
#endif

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
#ifdef __SSE4_1__  // modern CPU - use SSE 4.1
    return _mm_mullo_epi32(a, b);
#else               // old CPU - use SSE 2
    __m128i tmp1 = _mm_mul_epu32(a,b); /* multiply lanes 2,0 */
    __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* multiply lanes 3,1 */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0))); /* move the low 32 bits of each product to [63..0] and interleave */
#endif
}

#2


7  

PMULLD, from SSE 4.1, does that.

The description is slightly misleading: it talks about signed multiplication, but since it only stores the lower 32 bits, it's really a sign-oblivious instruction that you can use for both signed and unsigned operands, just like IMUL.
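
The sign-oblivious claim can be checked with ordinary scalar arithmetic. The sketch below widens the same 32-bit patterns once as signed and once as unsigned values, multiplies them in 64 bits, and keeps only the low 32 bits. The function names `low32_signed` and `low32_unsigned` are made up for this illustration.

```cpp
#include <cstdint>

// Low 32 bits of the product when both operands are treated as signed.
static uint32_t low32_signed(int32_t a, int32_t b)
{
    int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    return static_cast<uint32_t>(wide);  // conversion keeps the low 32 bits
}

// Low 32 bits of the product when the same bit patterns are treated as unsigned.
static uint32_t low32_unsigned(int32_t a, int32_t b)
{
    uint64_t wide = static_cast<uint64_t>(static_cast<uint32_t>(a)) *
                    static_cast<uint64_t>(static_cast<uint32_t>(b));
    return static_cast<uint32_t>(wide);  // conversion keeps the low 32 bits
}
```

For example, `-3 * 7` gives `0xFFFFFFEB` (the bit pattern of -21) from both functions. Only the upper 32 bits of the 64-bit products differ, which is why separate signed and unsigned variants matter only for widening multiplies.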
