pandas - filter a dataframe by another dataframe, by row elements

Time: 2022-01-23 04:47:23

I have a dataframe df1 which looks like:

   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d

and another called df2 like:

   c  l
0  A  b
1  C  a

I would like to filter df1, keeping only the rows whose values ARE NOT in df2. The values to filter out are the (A, b) and (C, a) tuples. So far I have tried to apply the isin method:

d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]

Apart from the fact that this seems too complicated to me, it returns:

   c  k  l
2  B  2  a
4  C  2  d

but I'm expecting:

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

4 Answers

#1 (25 votes)

You can do this efficiently using isin on a multiindex constructed from the desired columns:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
keys = list(df2.columns.values)
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
df1[~i1.isin(i2)]

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
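
As a quick sanity check of that claim (my addition, using made-up frames a and b with purely numeric key columns), the same pattern works unchanged and involves no string concatenation:

import pandas as pd

# Hypothetical frames whose key columns x and y are numbers, not strings.
a = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30], 'v': ['p', 'q', 'r']})
b = pd.DataFrame({'x': [2], 'y': [20]})

keys = list(b.columns.values)   # ['x', 'y']
ia = a.set_index(keys).index
ib = b.set_index(keys).index
a[~ia.isin(ib)]                 # keeps the (1, 10) and (3, 30) rows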

(The answer above is an edit; what follows was my initial answer.)

Interesting! This is something I haven't come across before... I would probably solve it by merging the two dataframes, then dropping the rows where df2 is defined. Here is an example, which makes use of a temporary marker column:

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# create a column marking df2 values
df2['marker'] = 1

# join the two, keeping all of df1's indices
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined

   c  k  l  marker
0  A  1  a     NaN
1  A  2  b     1.0
2  B  2  a     NaN
3  C  2  a     1.0
4  C  2  d     NaN

# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]

   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d

There may be a way to do this without using the temporary column, but I can't think of one. As long as your data isn't huge, the above method should be fast and sufficient.
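
One note from me, not part of the original answer: pandas' merge can build that marker itself via its indicator parameter, which avoids modifying df2. A minimal sketch, reusing the df1 and df2 defined above:

# Sketch: indicator=True adds a '_merge' column telling where each row came from,
# so rows present only in df1 can be kept without a manual marker column.
joined = pd.merge(df1, df2, on=['c', 'l'], how='left', indicator=True)
joined[joined['_merge'] == 'left_only'][df1.columns]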

#2 (8 votes)

This is pretty succinct:

df1 = df1[~df1.index.isin(df2.index)]
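
As written, this compares row labels rather than (c, l) values, so for the question's data it only gives the desired result if both frames are indexed by the key columns first. A minimal sketch of that variant (my addition, not part of the original answer):

# Index both frames by the key columns, then the same index-based one-liner applies.
a = df1.set_index(['c', 'l'])
b = df2.set_index(['c', 'l'])
a[~a.index.isin(b.index)].reset_index()[df1.columns]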

#3 (1 vote)

How about:

df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
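
One caveat worth adding (mine, not the answerer's): bare concatenation can collide when different pairs run together to the same string (e.g. 'AB' + 'c' and 'A' + 'Bc'). Inserting a separator that never occurs in the data avoids that:

sep = '|'  # assumption: '|' does not occur in columns c or l
d = df1[~(df1['c'] + sep + df1['l']).isin(df2['c'] + sep + df2['l'])]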

#4 (0 votes)

Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.

gb = df2.groupby(['c', 'l']).groups
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]
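
The .groups dict is only being used for membership tests here, so a plain Python set of (c, l) tuples does the same job without the groupby (my variation, not part of the original answer):

# Build a set of key pairs from df2 and keep the df1 rows not in it.
pairs = set(zip(df2['c'], df2['l']))
df1[[p not in pairs for p in zip(df1['c'], df1['l'])]]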

For this small example, it actually seems to run a bit faster than the pandas-based approach (666 µs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
