在c++中转置矩阵的最快方法是什么?

时间:2023-01-19 16:54:41

I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is

我有一个矩阵(相对较大)我需要转置。例如,假设我的矩阵是。

a b c d e f
g h i j k l
m n o p q r 

I want the result be as follows:

我想要的结果如下:

a g m
b h n
c I o
d j p
e k q
f l r

What is the fastest way to do this?

最快的方法是什么?

8 个解决方案

#1


105  

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

这是个好问题。有很多原因,你想要把矩阵在内存中转置,而不只是交换坐标,比如矩阵乘法和高斯模糊。

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

首先让我列出我用于转置的一个函数(编辑:请参见我的答案的末尾,在那里我找到了一个更快的解决方案)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

现在我们来看看为什么转置是有用的。考虑矩阵乘法C = A*B。我们可以这样做。

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

但是,这种方式会有很多缓存遗漏。一个快得多的解是先取B的转置。

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.

矩阵乘法是O(n ^ 3)和转置是O(n ^ 2),所以把转置应该对计算时间的影响可以忽略不计(大n)。在矩阵乘法循环瓷砖甚至比采取更有效的转置但这复杂得多。

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

我希望我知道一个更快的方法来做转置(编辑:我找到了一个更快的解决方案,看我的答案的结尾)。当Haswell/AVX2在几周内出现时,它将有一个收集功能。我不知道这对这个例子是否有用,但我可以像收集一个列并写出一行。也许它会使转置没有必要。

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

对于高斯涂抹,你要做的是水平涂抹,然后垂直涂抹。但是,垂直的涂抹有缓存问题,所以你要做的是。

Smear image horizontally
transpose output 
Smear output horizontally
transpose output

Here is a paper by Intel explaining that http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

这是英特尔的一篇文章,解释了http://software.intel.com/en-us/articles/iir-gaussi -模糊-过滤器-实现-基于- Intel -advanced-向量-扩展。

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

最后,我在矩阵乘法(和高斯模糊)中所做的,并不是取完全的转置,而是取某个向量的宽度的转置(例如,SSE/AVX的4或8)。这是我使用的函数。

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

编辑:

I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

我尝试了几个函数来找到对大矩阵的最快转置。最后,最快的结果是使用block_size=16(编辑:我找到了一个更快的解决方案,使用SSE和循环阻塞-见下)。该代码适用于任何NxM矩阵(即矩阵不必为正方形)。

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this

lda和ldb是矩阵的宽度。这些需要是块大小的倍数。为了找到值并分配内存,例如,一个3000x1001矩阵,我这样做。

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008

对于3000x1001,返回ldb = 3008和lda = 1008。

Edit:

编辑:

I found an even faster solution using SSE intrinsics:

我发现了一个更快的解决方案,使用SSE的内在特性:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

#2


36  

This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.

这取决于你的应用程序但一般来说,转置矩阵最快的方法是当你向上看的时候,把你的坐标倒过来,这样你就不用移动任何数据了。

#3


4  

Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.

关于transposing 4x4方形浮子(我将在后面讨论32位整数)的一些细节,使用x86硬件。从这里开始,为了转置更大的方阵,比如8x8或16x16,这很有帮助。

_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.

_MM_TRANSPOSE4_PS(r0, r1, r2, r3)由不同的编译器实现不同的实现。GCC和ICC(我没有检查Clang)使用unpcklps, unpckhps, unpcklpd, unpckhpd,而MSVC只使用shufps。我们可以把这两种方法结合起来。

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.

一个有趣的观察是,两个洗牌可以转换成一个洗牌和两个混合(SSE4.1)这样。

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

v  = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v  = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);

This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances. Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.

这有效地将4个洗牌变成2个洗牌和4个混合。这比GCC、ICC和MSVC的实现需要更多的指令。它的优点是它可以减少在某些情况下可能有好处的端口压力。目前所有的洗牌和解包只能到一个特定的港口,而混合可以去两个不同的港口。

I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.

我试着用8个像MSVC一样的洗牌,把它转换成4个洗牌+ 8混合,但是没有用。我仍然需要使用4个解包。

I used this same technique for a 8x8 float transpose (see towards the end of that answer). https://*.com/a/25627536/2542702. In that answer I still had to use 8 unpacks but I manged to convert the 8 shuffles into 4 shuffles and 8 blends.

我在8x8浮点转置中使用了同样的技术(参见后面的答案)。https://*.com/a/25627536/2542702。在那个答案中,我仍然需要使用8个解包,但我必须把8个洗牌换成4个洗牌和8个混合。

For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512) so it can only be implemented with unpacks which I don't think can be convert to blends (efficiently). With AVX512 vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats so this same technique might be possibly with vshufi32x4 in some cases. With Knights Landing shuffles are four times slower (throughput) than blends.

对于32位整数来说,没有什么像shufps(除了128位的shuffles和AVX512),所以只能用unpack来实现,我认为这不能转化为混合(有效地)。使用AVX512 vshufi32x4可以像shufps一样有效,除了4个整数的128位行,而不是32位的浮点数,因此在某些情况下,这一技术可能与vshufi32x4相同。随着骑士的登陆,洗牌的速度比混合的要慢4倍。

#4


1  

template <class T>
void transpose( std::vector< std::vector<T> > a,
std::vector< std::vector<T> > b,
int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
} 

#5


1  

Consider each row as a column, and each column as a row .. use j,i instead of i,j

将每一行视为一列,每一列为一行。用j,而不是i j。

demo: http://ideone.com/lvsxKZ

演示:http://ideone.com/lvsxKZ

#include <iostream> 
using namespace std;

int main ()
{
    char A [3][3] =
    {
        { 'a', 'b', 'c' },
        { 'd', 'e', 'f' },
        { 'g', 'h', 'i' }
    };

    cout << "A = " << endl << endl;

    // print matrix A
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[i][j];
        cout << endl;
    }

    cout << endl << "A transpose = " << endl << endl;

    // print A transpose
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[j][i];
        cout << endl;
    }

    return 0;
}

#6


1  

transposing without any overhead (class not complete):

没有任何开销的转置(未完成的类):

class Matrix{
   double *data; //suppose this will point to data
   double _get1(int i, int j){return data[i*M+j];} //used to access normally
   double _get2(int i, int j){return data[j*N+i];} //used when transposed

   public:
   int M, N; //dimensions
   double (*get_p)(int, int); //functor to access elements  
   Matrix(int _M,int _N):M(_M), N(_N){
     //allocate data
     get_p=&Matrix::_get1; // initialised with normal access 
     }

   double get(int i, int j){
     //there should be a way to directly use get_p to call. but i think even this
     //doesnt incur overhead because it is inline and the compiler should be intelligent
     //enough to remove the extra call
     return (this->*get_p)(i,j);
    }
   void transpose(){ //twice transpose gives the original
     if(get_p==&Matrix::get1) get_p=&Matrix::_get2;
     else get_p==&Matrix::_get1; 
     swap(M,N);
     }
}

can be used like this:

可以这样使用:

Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)

of course I didn't bother with the memory management here, which is crucial but different topic.

当然,我并不关心这里的内存管理,这是一个非常重要但不同的主题。

#7


-1  

I think that most fast way should not taking higher than O(n^2) also in this way you can use just O(1) space :
the way to do that is to swap in pairs because when you transpose a matrix then what you do is: M[i][j]=M[j][i] , so store M[i][j] in temp, then M[i][j]=M[j][i],and the last step : M[j][i]=temp. this could be done by one pass so it should take O(n^2)

我认为最快速的方式不应该高于O(n ^ 2)也通过这种方式你可以使用O(1)空间:成对交换的方法,因为当你转置一个矩阵然后你要做的是:M[我][j]= M[j][我],所以在临时存储M[我][j],然后M[我][j]= M[j][我],和最后一步:M[j][我]= temp。这可能是由一个通过它应该O(n ^ 2)

#8


-5  

my answer is transposed of 3x3 matrix

我的答案是转置的3x3矩阵。

 #include<iostream.h>

#include<math.h>


main()
{
int a[3][3];
int b[3];
cout<<"You must give us an array 3x3 and then we will give you Transposed it "<<endl;
for(int i=0;i<3;i++)
{
    for(int j=0;j<3;j++)
{
cout<<"Enter a["<<i<<"]["<<j<<"]: ";

cin>>a[i][j];

}

}
cout<<"Matrix you entered is :"<<endl;

 for (int e = 0 ; e < 3 ; e++ )

{
    for ( int f = 0 ; f < 3 ; f++ )

        cout << a[e][f] << "\t";


    cout << endl;

    }

 cout<<"\nTransposed of matrix you entered is :"<<endl;
 for (int c = 0 ; c < 3 ; c++ )
{
    for ( int d = 0 ; d < 3 ; d++ )
        cout << a[d][c] << "\t";

    cout << endl;
    }

return 0;
}

#1


105  

This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.

这是个好问题。有很多原因,你想要把矩阵在内存中转置,而不只是交换坐标,比如矩阵乘法和高斯模糊。

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)

首先让我列出我用于转置的一个函数(编辑:请参见我的答案的末尾,在那里我找到了一个更快的解决方案)

void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
        dst[n] = src[M*j + i];
    }
}

Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.

现在我们来看看为什么转置是有用的。考虑矩阵乘法C = A*B。我们可以这样做。

for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}

That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first

但是,这种方式会有很多缓存遗漏。一个快得多的解是先取B的转置。

transpose(B);
for(int i=0; i<N; i++) {
    for(int j=0; j<K; j++) {
        float tmp = 0;
        for(int l=0; l<M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);

Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.

矩阵乘法是O(n ^ 3)和转置是O(n ^ 2),所以把转置应该对计算时间的影响可以忽略不计(大n)。在矩阵乘法循环瓷砖甚至比采取更有效的转置但这复杂得多。

I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.

我希望我知道一个更快的方法来做转置(编辑:我找到了一个更快的解决方案,看我的答案的结尾)。当Haswell/AVX2在几周内出现时,它将有一个收集功能。我不知道这对这个例子是否有用,但我可以像收集一个列并写出一行。也许它会使转置没有必要。

For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is

对于高斯涂抹,你要做的是水平涂抹,然后垂直涂抹。但是,垂直的涂抹有缓存问题,所以你要做的是。

Smear image horizontally
transpose output 
Smear output horizontally
transpose output

Here is a paper by Intel explaining that http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

这是英特尔的一篇文章,解释了http://software.intel.com/en-us/articles/iir-gaussi -模糊-过滤器-实现-基于- Intel -advanced-向量-扩展。

Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use

最后,我在矩阵乘法(和高斯模糊)中所做的,并不是取完全的转置,而是取某个向量的宽度的转置(例如,SSE/AVX的4或8)。这是我使用的函数。

void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n=0; n<M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}

EDIT:

编辑:

I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).

我尝试了几个函数来找到对大矩阵的最快转置。最后,最快的结果是使用block_size=16(编辑:我找到了一个更快的解决方案,使用SSE和循环阻塞-见下)。该代码适用于任何NxM矩阵(即矩阵不必为正方形)。

inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<block_size; i++) {
        for(int j=0; j<block_size; j++) {
            B[j*ldb + i] = A[i*lda +j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}

The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this

lda和ldb是矩阵的宽度。这些需要是块大小的倍数。为了找到值并分配内存,例如,一个3000x1001矩阵,我这样做。

#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);

float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008

对于3000x1001,返回ldb = 3008和lda = 1008。

Edit:

编辑:

I found an even faster solution using SSE intrinsics:

我发现了一个更快的解决方案,使用SSE的内在特性:

inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
     _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
     _mm_store_ps(&B[0*ldb], row1);
     _mm_store_ps(&B[1*ldb], row2);
     _mm_store_ps(&B[2*ldb], row3);
     _mm_store_ps(&B[3*ldb], row4);
}

inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
    #pragma omp parallel for
    for(int i=0; i<n; i+=block_size) {
        for(int j=0; j<m; j+=block_size) {
            int max_i2 = i+block_size < n ? i + block_size : n;
            int max_j2 = j+block_size < m ? j + block_size : m;
            for(int i2=i; i2<max_i2; i2+=4) {
                for(int j2=j; j2<max_j2; j2+=4) {
                    transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}

#2


36  

This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.

这取决于你的应用程序但一般来说,转置矩阵最快的方法是当你向上看的时候,把你的坐标倒过来,这样你就不用移动任何数据了。

#3


4  

Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.

关于transposing 4x4方形浮子(我将在后面讨论32位整数)的一些细节,使用x86硬件。从这里开始,为了转置更大的方阵,比如8x8或16x16,这很有帮助。

_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.

_MM_TRANSPOSE4_PS(r0, r1, r2, r3)由不同的编译器实现不同的实现。GCC和ICC(我没有检查Clang)使用unpcklps, unpckhps, unpcklpd, unpckhpd,而MSVC只使用shufps。我们可以把这两种方法结合起来。

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.

一个有趣的观察是,两个洗牌可以转换成一个洗牌和两个混合(SSE4.1)这样。

t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);

v  = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v  = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);

This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances. Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.

这有效地将4个洗牌变成2个洗牌和4个混合。这比GCC、ICC和MSVC的实现需要更多的指令。它的优点是它可以减少在某些情况下可能有好处的端口压力。目前所有的洗牌和解包只能到一个特定的港口,而混合可以去两个不同的港口。

I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.

我试着用8个像MSVC一样的洗牌,把它转换成4个洗牌+ 8混合,但是没有用。我仍然需要使用4个解包。

I used this same technique for a 8x8 float transpose (see towards the end of that answer). https://*.com/a/25627536/2542702. In that answer I still had to use 8 unpacks but I manged to convert the 8 shuffles into 4 shuffles and 8 blends.

我在8x8浮点转置中使用了同样的技术(参见后面的答案)。https://*.com/a/25627536/2542702。在那个答案中,我仍然需要使用8个解包,但我必须把8个洗牌换成4个洗牌和8个混合。

For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512) so it can only be implemented with unpacks which I don't think can be convert to blends (efficiently). With AVX512 vshufi32x4 acts effectively like shufps except for 128-bit lanes of 4 integers instead of 32-bit floats so this same technique might be possibly with vshufi32x4 in some cases. With Knights Landing shuffles are four times slower (throughput) than blends.

对于32位整数来说,没有什么像shufps(除了128位的shuffles和AVX512),所以只能用unpack来实现,我认为这不能转化为混合(有效地)。使用AVX512 vshufi32x4可以像shufps一样有效,除了4个整数的128位行,而不是32位的浮点数,因此在某些情况下,这一技术可能与vshufi32x4相同。随着骑士的登陆,洗牌的速度比混合的要慢4倍。

#4


1  

template <class T>
void transpose( std::vector< std::vector<T> > a,
std::vector< std::vector<T> > b,
int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
} 

#5


1  

Consider each row as a column, and each column as a row .. use j,i instead of i,j

将每一行视为一列,每一列为一行。用j,而不是i j。

demo: http://ideone.com/lvsxKZ

演示:http://ideone.com/lvsxKZ

#include <iostream> 
using namespace std;

int main ()
{
    char A [3][3] =
    {
        { 'a', 'b', 'c' },
        { 'd', 'e', 'f' },
        { 'g', 'h', 'i' }
    };

    cout << "A = " << endl << endl;

    // print matrix A
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[i][j];
        cout << endl;
    }

    cout << endl << "A transpose = " << endl << endl;

    // print A transpose
    for (int i=0; i<3; i++)
    {
        for (int j=0; j<3; j++) cout << A[j][i];
        cout << endl;
    }

    return 0;
}

#6


1  

transposing without any overhead (class not complete):

没有任何开销的转置(未完成的类):

class Matrix{
   double *data; //suppose this will point to data
   double _get1(int i, int j){return data[i*M+j];} //used to access normally
   double _get2(int i, int j){return data[j*N+i];} //used when transposed

   public:
   int M, N; //dimensions
   double (*get_p)(int, int); //functor to access elements  
   Matrix(int _M,int _N):M(_M), N(_N){
     //allocate data
     get_p=&Matrix::_get1; // initialised with normal access 
     }

   double get(int i, int j){
     //there should be a way to directly use get_p to call. but i think even this
     //doesnt incur overhead because it is inline and the compiler should be intelligent
     //enough to remove the extra call
     return (this->*get_p)(i,j);
    }
   void transpose(){ //twice transpose gives the original
     if(get_p==&Matrix::get1) get_p=&Matrix::_get2;
     else get_p==&Matrix::_get1; 
     swap(M,N);
     }
}

can be used like this:

可以这样使用:

Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)

of course I didn't bother with the memory management here, which is crucial but different topic.

当然,我并不关心这里的内存管理,这是一个非常重要但不同的主题。

#7


-1  

I think that most fast way should not taking higher than O(n^2) also in this way you can use just O(1) space :
the way to do that is to swap in pairs because when you transpose a matrix then what you do is: M[i][j]=M[j][i] , so store M[i][j] in temp, then M[i][j]=M[j][i],and the last step : M[j][i]=temp. this could be done by one pass so it should take O(n^2)

我认为最快速的方式不应该高于O(n ^ 2)也通过这种方式你可以使用O(1)空间:成对交换的方法,因为当你转置一个矩阵然后你要做的是:M[我][j]= M[j][我],所以在临时存储M[我][j],然后M[我][j]= M[j][我],和最后一步:M[j][我]= temp。这可能是由一个通过它应该O(n ^ 2)

#8


-5  

my answer is transposed of 3x3 matrix

我的答案是转置的3x3矩阵。

 #include<iostream.h>

#include<math.h>


main()
{
int a[3][3];
int b[3];
cout<<"You must give us an array 3x3 and then we will give you Transposed it "<<endl;
for(int i=0;i<3;i++)
{
    for(int j=0;j<3;j++)
{
cout<<"Enter a["<<i<<"]["<<j<<"]: ";

cin>>a[i][j];

}

}
cout<<"Matrix you entered is :"<<endl;

 for (int e = 0 ; e < 3 ; e++ )

{
    for ( int f = 0 ; f < 3 ; f++ )

        cout << a[e][f] << "\t";


    cout << endl;

    }

 cout<<"\nTransposed of matrix you entered is :"<<endl;
 for (int c = 0 ; c < 3 ; c++ )
{
    for ( int d = 0 ; d < 3 ; d++ )
        cout << a[d][c] << "\t";

    cout << endl;
    }

return 0;
}