How to "select distinct" across multiple DataFrame columns in pandas?

Time: 2022-11-19 01:39:01

I'm looking for a way to do the equivalent of the SQL query

"SELECT DISTINCT col1, col2 FROM dataframe_table"


The pandas "Comparison with SQL" page doesn't have anything about "distinct".

.unique() only works on a single column, so I suppose I could concatenate the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

3 answers

#1


75  

You can use the drop_duplicates method to get the unique rows in a DataFrame:


In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also pass the subset keyword argument if you only want certain columns to determine uniqueness. See the docstring.
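To connect this back to the question's query, here is a minimal sketch of two ways to get distinct (col1, col2) combinations. The column names come from the question; the sample values and the extra column `other` are made up for illustration.

```python
import pandas as pd

# Hypothetical table: col1/col2 are the question's column names,
# "other" is an extra column invented for this example.
df = pd.DataFrame({
    "col1": [1, 2, 1, 2],
    "col2": [3, 4, 3, 5],
    "other": ["w", "x", "y", "z"],
})

# Closest match to SELECT DISTINCT col1, col2:
# select the two columns first, then drop duplicate rows.
distinct_pairs = df[["col1", "col2"]].drop_duplicates()

# subset= keeps every column but compares rows on col1/col2 only;
# by default the first row of each duplicate group is kept.
first_rows = df.drop_duplicates(subset=["col1", "col2"])

print(distinct_pairs)
print(first_rows)
```

The first form returns only the two columns, like the SQL query. The second is more useful when you want one full row per distinct pair.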

#2


3  

There is no unique method for a DataFrame. If every column had the same number of unique values, then df.apply(pd.Series.unique) would work, but if not you will get an error (the exact behavior may differ between pandas versions). Another approach is to store the unique values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
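The same loop can be written as a dict comprehension. This is only a sketch of an equivalent form, using the answer's own data; `nunique()` is added to show the per-column counts. Note that this gives the unique values of each column separately, which is not the same as the distinct row combinations that SELECT DISTINCT col1, col2 returns.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# Iterating over a DataFrame yields its column labels.
unique_by_column = {col: df[col].unique() for col in df}

# Number of distinct values in each column.
counts = df.nunique()

print(unique_by_column)
print(counts)
```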

#3


0  

You can convert the columns to sets and subtract one from the other, which gives the values that appear in column a but not in column b:

distinct_values = set(df['a'])-set(df['b'])
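To make clear what that subtraction produces, here is a small sketch reusing the data from answer #2 (an assumption, since this answer does not define df). Set difference is not symmetric, so the order of the operands matters.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# Values present in column a but absent from column b.
only_in_a = set(df["a"]) - set(df["b"])

# Reversing the operands answers the opposite question.
only_in_b = set(df["b"]) - set(df["a"])

print(only_in_a)
print(only_in_b)
```

This compares values across two columns. It does not return distinct (a, b) row pairs the way drop_duplicates does.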
