How to apply a function to two columns of a Pandas DataFrame

Date: 2022-04-26 15:47:44

Suppose I have a df with the columns 'ID', 'col_1', and 'col_2', and I define a function:

f = lambda x, y : my_function_expression

Now I want to apply f to df's two columns 'col_1' and 'col_2' to compute a new column 'col_3' element-wise, something like:

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How can I do this?

*** Added a detailed sample below ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

10 answers

#1


189  

Here's an example using apply on the DataFrame, which I am calling with axis=1.

Note the difference: instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas groupby object and then use apply on the group.
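As a minimal sketch of the row-wise approach described in this answer, here it is applied to the question's sample data. The function name add_cols and the choice of addition are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def add_cols(row):
    # row is a Series holding one DataFrame row; index it by column label
    return row['col_1'] + row['col_2']

# axis=1 makes apply pass one row at a time instead of one column at a time
df['col_3'] = df[['col_1', 'col_2']].apply(add_cols, axis=1)
print(df)
```

Indexing the row by label rather than by position keeps the function independent of the column order.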

#2


36  

An interesting question! My answer is below:

import pandas as pd

def sublst(row):
    # row is a Series holding one DataFrame row; index it by column label
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print(df)

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so that the columns display in the right order. Note that the slice lst[J1:J2] excludes the element at index J2; use row['J2'] + 1 as the end to match the question's expected output.

A more concise version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print(df)

#3


29  

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
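To make the one-liner above concrete, here is a small runnable sketch. The body of f (multiplication) is a hypothetical stand-in, since the question does not show the real expression.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

# hypothetical two-argument function standing in for the asker's f
f = lambda x, y: x * y

# each row arrives as a Series; *row unpacks its values in column order,
# so f receives col_1 as x and col_2 as y
df['col_3'] = df[['col_1', 'col_2']].apply(lambda row: f(*row), axis=1)
print(df)
```

Because the unpacking is positional, the order of the columns in df[['col_1', 'col_2']] has to match the order of f's parameters.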

#4


13  

The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that combine expects the result to have the same type as the Series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
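The following sketch puts the pieces of this answer together in one runnable script. It first shows Series.combine with a scalar-returning function (an added illustration, not from the answer), then the list-returning get_sublist with the astype(object) cast that the answer recommends. Whether the cast is strictly required may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# combine calls the function once per pair of aligned elements
sums = df['col_1'].combine(df['col_2'], lambda a, b: a + b)

# for list results, cast the calling Series to object first, as the answer suggests
sublists = df['col_1'].astype(object).combine(df['col_2'], get_sublist)

df['col_3'] = sublists
print(df)
```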

#5


11  

As written, f needs two inputs. The error message says that f received only one, and the error message is correct. The mismatch arises because df[['col_1','col_2']] returns a single DataFrame with two columns, not two separate columns.

You need to change f so that it takes a single input, keep the above DataFrame as the input, and then break it up into x and y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f), so f must take the single thing it is handed (one row or column of the DataFrame, passed as a Series) rather than the two things your current f expects.

Since you haven't provided the body of f, I can't help in any more detail, but this should provide a way out without fundamentally changing your code or switching from apply to some other method.
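The sketch below reproduces the mismatch described in this answer and then shows one possible single-argument wrapper. The body of f and the wrapper name f_row are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def f(x, y):
    # hypothetical body; the question does not show the real one
    return x + y

try:
    # without axis=1, apply calls f once per column with a single Series argument
    df[['col_1', 'col_2']].apply(f)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

def f_row(row):
    # single-argument wrapper: split the row into x and y, then delegate to f
    x, y = row['col_1'], row['col_2']
    return f(x, y)

df['col_3'] = df[['col_1', 'col_2']].apply(f_row, axis=1)
print(df)
```

Keeping f unchanged and adding a thin wrapper means the two-argument function can still be called directly elsewhere.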

#6


8  

I'm going to put in a vote for np.vectorize. It lets you pass in any number of columns without dealing with the DataFrame inside the function, so it's great for functions you don't control, or for something like sending two columns and a constant into a function (e.g. col_1, col_2, 'foo').

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
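The answer mentions sending two columns and a constant into a function. A minimal sketch of that case follows; the function label_range and its output format are hypothetical, chosen only to show the scalar being broadcast to every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def label_range(sta, end, tag):
    # hypothetical three-argument function: two column values plus a constant
    return '{}:{}-{}'.format(tag, sta, end)

# otypes=[object] keeps each result as a Python object;
# the scalar 'foo' is broadcast against the two columns
vec = np.vectorize(label_range, otypes=[object])
df['col_3'] = vec(df['col_1'], df['col_2'], 'foo')
print(df)
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it is about ergonomics rather than speed.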

#7


2  

I'm sure this isn't as fast as the solutions using pandas or NumPy operations, but if you don't want to rewrite your function you can use map. Using the original example data:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
# In Python 2, map already returns a list, so the list() call is unnecessary

We could pass as many arguments as we wanted into the function this way. The output is what we wanted:

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
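To illustrate the point about passing more arguments, here is a sketch with three columns fed to map. The function tagged_sublist is a hypothetical variant of get_sublist invented for this example.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def tagged_sublist(sta, end, row_id):
    # hypothetical three-argument variant: prefix the joined sublist with the row's ID
    return row_id + ':' + ''.join(mylist[sta:end + 1])

# map walks the three columns in lockstep and passes one value from each per call
df['col_3'] = list(map(tagged_sublist, df['col_1'], df['col_2'], df['ID']))
print(df)
```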

#8


2  

Returning a list from apply is a dangerous operation, as the resulting object is not guaranteed to be either a Series or a DataFrame, and exceptions might be raised in certain cases. (The behavior shown here depends on the pandas version; pandas 0.23 changed how apply handles list-like return values.) Let's walk through a simple example:

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

There are three possible outcomes when returning a list from apply:

1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

2) When the length of the returned list is equal to the number of columns, a DataFrame is returned and each column gets the corresponding value in the list.

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

3) If the length of the returned list equals the number of columns for the first row, but at least one row returns a list whose length differs from the number of columns, a ValueError is raised.

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)

Answering the problem without apply

Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create a larger DataFrame (df here is the question's sample frame with col_1 and col_2):

df1 = df.sample(100000, replace=True).reset_index(drop=True)

Timings

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# map - @Thomas's answer

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
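For readers who want to try the non-apply variants outside IPython, the sketch below builds a sampled frame from the question's data and checks that the zip and map approaches produce identical results. The sample size of 1000 and the random_state argument are assumptions added for a quick, reproducible run; no timings are claimed here.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# resample the three rows with replacement to get a larger frame
df1 = df.sample(1000, replace=True, random_state=0).reset_index(drop=True)

# zip: iterate over the two columns in lockstep
via_zip = [mylist[v1:v2 + 1] for v1, v2 in zip(df1['col_1'], df1['col_2'])]

# map: same iteration, delegating to the existing two-argument function
via_map = list(map(get_sublist, df1['col_1'], df1['col_2']))

df1['col_3'] = via_zip
print(via_zip == via_map)
```

Both variants avoid constructing a Series object per row, which is the main per-row cost of apply with axis=1.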

#9


1  

My example for your question:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

#10


0  

I suppose you don't want to change the get_sublist function and just want to use DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the names suggest, the first wraps the sublist in a list, and the second extracts the sublist from that list. Finally, we call apply twice to run those two functions in turn on the df[['col_1','col_2']] DataFrame.

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    # look the values up by column label; wrap the result in an extra list
    return [get_sublist(cols['col_1'],cols['col_2'])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

If you don't use [] to wrap the get_sublist call, get_sublist_list returns a plain list and apply raises ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.
