Pandas: find duplicates in crossed values

Time: 2022-04-01 22:55:22

I have a DataFrame and want to eliminate duplicate rows that have the same values, but in different columns:

df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t

Rows [1] and [2] both contain the values {x,y,e,f}, but arranged crosswise: if you swapped columns c,d with a,b in row [2], you would get a duplicate of row [1]. I want to drop such rows and keep only one of them, giving the final output:

df_new
Out[20]:
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

How can I achieve that efficiently?

3 solutions

#1


4  

I think you need to filter by boolean indexing, with a mask created by numpy.sort together with duplicated; use ~ to invert it:

df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t

Details:

print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
 ['e' 'f' 'x' 'y']
 ['s' 't' 'v' 'w']]

print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w

print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool

print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
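For reference, the steps above can be put together as one self-contained script. This is only a sketch: the helper name drop_cross_duplicates is made up for illustration and is not part of pandas, and it assumes the values within a row can be compared with each other (for example, all strings). Note that sorting each row treats any reordering of the same values as a duplicate, not only the a,b / c,d swap from the question.

```python
import numpy as np
import pandas as pd


def drop_cross_duplicates(frame):
    """Keep the first row of each group of rows that hold the same values in any column order.

    Hypothetical helper, not a pandas function.
    """
    # Sort the values inside each row so that column order no longer matters.
    sorted_rows = pd.DataFrame(np.sort(frame.to_numpy(), axis=1), index=frame.index)
    # duplicated() marks every repeat after the first occurrence; ~ inverts the mask.
    return frame[~sorted_rows.duplicated()]


df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

result = drop_cross_duplicates(df)
print(result)
```

The original column order and index labels of the kept rows are preserved, because the sorted copy is used only to build the mask.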

#2


1  

Here's another solution, with a for loop:

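data = df.to_numpy()  # as_matrix() was removed in pandas 1.0
new = []
for row in data:
    # keep the row only if no already-kept row holds exactly the same values
    if not any(sorted(row) == sorted(nrow) for nrow in new):
        new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)  # note: the original index is not kept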
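If the loop style is preferred but the frame is large, comparing every row against every kept row gets slow. One possible variation, sketched below, remembers a hashable key per row in a set so each row is checked once. This is a different technique from the loop above, not the answerer's code; the names seen and keep_mask are illustrative, and the sketch assumes the values in a row are hashable and mutually sortable.

```python
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

seen = set()      # keys of the rows kept so far
keep_mask = []    # one boolean per row, True if the row is kept

for row in df.itertuples(index=False):
    # The sorted tuple is the same for any column order of the same values.
    key = tuple(sorted(row))
    keep_mask.append(key not in seen)
    seen.add(key)

new_df = df.loc[keep_mask]
print(new_df)
```

Unlike rebuilding the frame from a list of rows, selecting with a boolean mask keeps the original index labels.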

#3


1  

Sort each row (np.sort), then flag the duplicates (.duplicated()). Then use those flags to drop (df.drop) the corresponding index labels.

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

# True for every row whose sorted values repeat an earlier row
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
# index labels of the rows to drop
index_to_drop = df.index[df_duplicated]
df.drop(index_to_drop)  # returns a new DataFrame without those rows
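Which row of each duplicate group survives is controlled by the keep parameter of duplicated(). A small sketch of the three options, using the same sample data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

# Row-wise sorted copy, used only to detect duplicates.
sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# keep="first" (the default): keep the first row of each duplicate group.
keep_first = df[~sorted_rows.duplicated(keep="first")]
# keep="last": keep the last row of each duplicate group instead.
keep_last = df[~sorted_rows.duplicated(keep="last")]
# keep=False: mark every member of a duplicate group, so none of them is kept.
keep_none = df[~sorted_rows.duplicated(keep=False)]

print(list(keep_first.index))
print(list(keep_last.index))
print(list(keep_none.index))
```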
