How to compare values between two different pandas data frames

I have two different pandas data frames. One is called 'price' and it has the schema

SKU, price

The second data frame is called sales_tracking, which contains information about the number of sales for a SKU at a given price. Its schema is

SKU, price, total_orders, total_visits

But when we add a new price point for a SKU to the 'price' data frame, there won't be a matching record in the 'sales_tracking' data frame. At that point I have to add a new entry to 'sales_tracking' in which total_orders and total_visits are estimated from another data set (we're doing this to estimate conversion rates).

The problem I'm having is checking whether a price value in the pricing data frame also exists in the sales_tracking data frame. Before doing the comparison, I first create temporary data frames for both the pricing data and the sales data as follows:

sku_specific_sales_records = sales_tracking[sales_tracking['product']==product]

sku_specific_price = price[price['product'] == product]

To be clear, both sku_specific_sales_records and sku_specific_price may contain multiple records. I'm trying to identify the case where sku_specific_price has a row whose sku_specific_price['price'] value is not in sku_specific_sales_records['price'].

I have tried various different things. Something as simple as

if sku_specific_sales_records['price'] == sku_specific_price['price']:

doesn't work; I get a ValueError: "Can only compare identically-labeled Series objects". So I also tried

if sku_specific_price['price'].isin(sku_specific_sales_records['price']):
   doTheThingIfTheyMatch
else:
   doTheOtherThing

And that generates a different ValueError: "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." So I tried using a.bool()

if sku_specific_price['price'].isin(sku_specific_sales_records['price']).bool():
   doTheThingIfTheyMatch
else:
   doTheOtherThing

but that brought me back full circle to the "ValueError: Can only compare identically-labeled Series objects".

Here is a small example illustrating the problem.

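import pandas as pd
# Use lists rather than sets for columns so the column order is well defined
sales = pd.DataFrame(columns=['product', 'price', 'sales', 'orders'])
pricing = pd.DataFrame(columns=['product', 'price'])
sales.loc[0] = [123, 10, 5, 5]
sales.loc[1] = [123, 15, 2, 10]
pricing.loc[0] = [123, 8]
# The next line raises "ValueError: The truth value of a Series is ambiguous"
if sales['price'].isin(pricing['price']):
    print("true")
else:
    print("false")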

2 Answers

#1

In Python, you need to use == instead of = when evaluating comparisons.

This is because = is the assignment operator, so it cannot be used for comparisons.

Try this:

if sku_specific_sales_records['price'] == sku_specific_price['price']:

Note: It's also recommended to use short(er) variable names as there's less chance for typos and they're quicker to type.

#2

So the solution seems to be to replace .bool() with .any() as follows

import pandas as pd
# Use lists rather than sets for columns so the column order is well defined
sales = pd.DataFrame(columns=['product', 'price', 'sales', 'orders'])
pricing = pd.DataFrame(columns=['product', 'price'])
sales.loc[0] = [123, 10, 5, 5]
sales.loc[1] = [123, 15, 2, 10]
pricing.loc[0] = [123, 8]
# .any() reduces the boolean Series to a single True/False value
if sales['price'].isin(pricing['price']).any():
    print("true")
else:
    print("false")

.bool() didn't work because it only works on a Series that holds a single boolean element. I was trying to use it to check whether the values in one series were in another series, and .isin() returns one boolean per row of the series it is called on, so the result can contain multiple values even when the other data frame has only a single row. Using .any() or .all() defines what condition must be met for the comparison as a whole to be true.
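To make the difference concrete, here is a minimal sketch. The two Series and the variable names (sales_prices, pricing_prices, mask) are made up for illustration and are not from the original data. It shows that .isin() yields one boolean per element, that using that result directly in an if statement raises the ValueError quoted in the question, and that .any() and .all() each collapse it to a single value. The sketch deliberately avoids calling .bool(), which is deprecated in recent pandas releases.

```python
import pandas as pd

# Hypothetical illustration data: prices seen in sales vs. prices in the price list
sales_prices = pd.Series([10, 15], name="price")
pricing_prices = pd.Series([8, 10, 15], name="price")

# isin() returns one boolean per element of the Series it is called on
mask = pricing_prices.isin(sales_prices)
print(mask.tolist())  # [False, True, True]

# A multi-element boolean Series has no single truth value
error_message = None
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# Reduce the Series to one boolean before branching on it
any_match = bool(mask.any())  # at least one price already has a sales record
all_match = bool(mask.all())  # every price already has a sales record
print(any_match, all_match)   # True False
```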

For example, suppose I want to find out whether pricing has any rows for a particular SKU that don't exist in sales (e.g. a new price point). In that case I would use:

import pandas as pd
sales = pd.DataFrame(columns=['product', 'price', 'sales', 'orders'])
pricing = pd.DataFrame(columns=['product', 'price'])
sales.loc[0] = [123, 10, 5, 5]
sales.loc[1] = [123, 15, 2, 10]
pricing.loc[0] = [123, 8]
pricing.loc[1] = [123, 10]
pricing.loc[2] = [123, 15]
print(sales)
print(pricing)
print(sales['price'])
print(pricing['price'])
# Prints "false" here, because the price 8 has no matching sales record
if pricing['price'].isin(sales['price']).all():
    print("true")
else:
    print("false")

because I need all of the values in pricing['price'] to be matched to sales['price']. If I only required a single matching value, then I'd use .any().
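The .all() check only says whether something is missing. If you also need to know which price points are missing (so that new sales_tracking entries can be created for them), one possible approach is to filter with the negated .isin() mask, or to do a left merge with indicator=True to handle every SKU at once. The following is a sketch under assumptions: the data is invented, the column names follow the small example above, and the names new_points and missing are hypothetical.

```python
import pandas as pd

# Invented sample data using the column names from the example above
sales = pd.DataFrame({
    "product": [123, 123],
    "price": [10, 15],
    "sales": [5, 2],
    "orders": [5, 10],
})
pricing = pd.DataFrame({
    "product": [123, 123, 123, 456],
    "price": [8, 10, 15, 10],
})

# Option 1: one SKU at a time, keep pricing rows whose price is NOT in sales
product = 123
sku_sales = sales[sales["product"] == product]
sku_pricing = pricing[pricing["product"] == product]
new_points = sku_pricing[~sku_pricing["price"].isin(sku_sales["price"])]
print(new_points)

# Option 2: all SKUs at once, matching on (product, price) pairs.
# indicator=True adds a "_merge" column; "left_only" marks pricing rows
# that have no counterpart in sales.
merged = pricing.merge(
    sales[["product", "price"]].drop_duplicates(),
    on=["product", "price"],
    how="left",
    indicator=True,
)
missing = merged.loc[merged["_merge"] == "left_only", ["product", "price"]]
print(missing)
```

Option 2 compares the product and price together, so a price that exists for one SKU is not mistaken for a match on a different SKU.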
