Merging DataFrames while keeping some columns and repeating others (pandas)

Time: 2022-11-11 22:57:16

I have two pandas DataFrames, df1 and df2. Both have at least two columns with the same names (c1 and c2), plus other columns that differ, for example:

df1
-----------------
c1  c2  c4   c5
-----------------
10  1   hh  2231
11  1   fgf 2142
12  1   fg  1232

df2
-----------------
c1  c2   c6  c7
-----------------
10  2   110  231
10  3   111  332
11  2   112  123
11  3   113  432
12  2   114  432
12  3   115  432
13  2   116  432
13  3   117  432
14  2   118  432
14  3   119  432
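For anyone who wants to reproduce this, the two tables above can be built roughly as follows. This is only a sketch transcribed from the tables; the integer and string dtypes are an assumption, since the real data may be typed differently.

```python
import pandas as pd

# df1: one row per c1 value, carrying c4 and c5 (dtypes are assumed)
df1 = pd.DataFrame({
    "c1": [10, 11, 12],
    "c2": [1, 1, 1],
    "c4": ["hh", "fgf", "fg"],
    "c5": [2231, 2142, 1232],
})

# df2: two rows (c2 = 2 and c2 = 3) per c1 value, carrying c6 and c7
df2 = pd.DataFrame({
    "c1": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
    "c2": [2, 3] * 5,
    "c6": list(range(110, 120)),
    "c7": [231, 332, 123, 432, 432, 432, 432, 432, 432, 432],
})
```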

I want to merge the two DataFrames so that:

  • The result contains the union of the c1 and c2 values from both DataFrames.

  • c4 and c5 from df1 are repeated on every row that has the corresponding c1 value.

  • The remaining columns, c6 and c7, are matched on c1 and c2 in df2, lined up with the c1 value in df1, and then added to the resulting DataFrame.

In a previous question, I was advised to use an outer join. I performed an outer join on c1 and c2, which returns, for example:

c1  c2  c4  c5      c6      c7
--------------------------------
10  1   hh  2231        
10  2               110     231
10  3               111     332
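The outer join described here can be sketched as below, using only the c1 = 10 rows from the tables. The variable names are illustrative. It shows why c4 and c5 come back empty on the rows that exist only in df2: no row in df1 has the same (c1, c2) pair.

```python
import pandas as pd

# Only the c1 = 10 rows, to keep the example short
df1 = pd.DataFrame({"c1": [10], "c2": [1], "c4": ["hh"], "c5": [2231]})
df2 = pd.DataFrame({"c1": [10, 10], "c2": [2, 3],
                    "c6": [110, 111], "c7": [231, 332]})

# Outer join on both keys: every (c1, c2) pair from either frame is kept
outer = (df1.merge(df2, on=["c1", "c2"], how="outer")
            .sort_values(["c1", "c2"])
            .reset_index(drop=True))

# Rows that came only from df2 have no c4/c5, and vice versa for c6/c7
print(outer)
```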

However, I would like the values of c4 and c5 to be repeated, so that the resulting DataFrame looks like this:

resulting dataframe:

c1  c2  c4  c5           c6     c7
-----------------------------------
10  1   hh  2231        
10  2   hh  2231        110     231
10  3   hh  2231        111     332
11  1   fgf 2142        
11  2   fgf 2142        112     123
11  3   fgf 2142        113     432
12  1   fg  1232        
12  2   fg  1232        114     432
12  3   fg  1232        115     432
13  1   dd  4532        
13  2   dd  4532        116     432
13  3   dd  4532        117     432
14  2                   118     432
14  3                   119     432

Can anyone help me with this? Thanks in advance!

2 answers

#1



You can:

  • first concatenate the c1 and c2 columns of both DataFrames, sort them and, if necessary, remove duplicates

  • merge both DataFrames onto the result with left joins

  • finally, repeat the values with ffill and bfill within each c1 group


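import pandas as pd

# Build the union of (c1, c2) pairs, then left-join both frames onto it
df = (pd.concat([df1[['c1','c2']], df2[['c1','c2']]])
        .sort_values(['c1','c2'])
        .drop_duplicates()
        .merge(df1, on=['c1','c2'], how='left')
        .merge(df2, on=['c1','c2'], how='left')
)

# Repeat c4 and c5 within each c1 group. The columns are selected with a list,
# because tuple-style ['c4','c5'] selection is rejected by recent pandas versions.
df[['c4','c5']] = df.groupby('c1')[['c4','c5']].transform(lambda x: x.ffill().bfill())
print (df)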
    c1  c2   c4      c5     c6     c7
0   10   1   hh  2231.0    NaN    NaN
1   10   2   hh  2231.0  110.0  231.0
2   10   3   hh  2231.0  111.0  332.0
3   11   1  fgf  2142.0    NaN    NaN
4   11   2  fgf  2142.0  112.0  123.0
5   11   3  fgf  2142.0  113.0  432.0
6   12   1   fg  1232.0    NaN    NaN
7   12   2   fg  1232.0  114.0  432.0
8   12   3   fg  1232.0  115.0  432.0
9   13   2  NaN     NaN  116.0  432.0
10  13   3  NaN     NaN  117.0  432.0
11  14   2  NaN     NaN  118.0  432.0
12  14   3  NaN     NaN  119.0  432.0
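As a possible alternative to the ffill/bfill step, c4 and c5 can be looked up by c1 alone. This is a sketch, not part of the answer above, and it assumes c4 and c5 depend only on c1 (one df1 row per c1 value). The names keys, per_c1 and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"c1": [10, 11, 12], "c2": [1, 1, 1],
                    "c4": ["hh", "fgf", "fg"], "c5": [2231, 2142, 1232]})
df2 = pd.DataFrame({"c1": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
                    "c2": [2, 3] * 5,
                    "c6": list(range(110, 120)),
                    "c7": [231, 332, 123, 432, 432, 432, 432, 432, 432, 432]})

# Union of (c1, c2) key pairs from both frames
keys = (pd.concat([df1[["c1", "c2"]], df2[["c1", "c2"]]])
          .drop_duplicates()
          .sort_values(["c1", "c2"])
          .reset_index(drop=True))

# Assumption: c4/c5 are a property of c1 only, so drop c2 before the lookup
per_c1 = df1.drop(columns="c2").drop_duplicates("c1")

result = (keys
          .merge(per_c1, on="c1", how="left")        # repeats c4/c5 per c1
          .merge(df2, on=["c1", "c2"], how="left"))  # exact match for c6/c7
print(result)
```

Joining on c1 alone makes the repetition explicit and does not depend on row order, whereas ffill/bfill relies on the rows of each group being adjacent after grouping. Rows for c1 = 13 and 14 still have no c4/c5, because df1 has no entries for them.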

#2



You just need to use pd.concat instead of merge or join. Here is an example:

import pandas as pd
import numpy as np

a = np.arange(1,4)
b = np.arange(5,8)
c = np.random.randint(0,10,size=3)
d = np.random.randint(0,10,size=3)
df_1 = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})

out:

    a   b   c   d
0   1   5   5   1
1   2   6   7   5
2   3   7   6   9

a = np.arange(4,7)
b = np.arange(7,10)
e = np.random.randint(0,10,size=3)
f = np.random.randint(0,10,size=3)
df_2 = pd.DataFrame({'a':a,'b':b,'e':e,'f':f})
df_2

out:

    a   b   e   f
0   4   7   9   9
1   5   8   9   3
2   6   9   2   1

pd.concat([df_1,df_2])

out:

    a   b    c      d       e       f
0   1   5   5.0     1.0     NaN     NaN
1   2   6   7.0     5.0     NaN     NaN
2   3   7   6.0     9.0     NaN     NaN
0   4   7   NaN     NaN     9.0     9.0
1   5   8   NaN     NaN     9.0     3.0
2   6   9   NaN     NaN     2.0     1.0
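Applied to a cut-down version of the question's data, the same idea might look like the sketch below. Note that pd.concat by itself only stacks the rows, so c4 and c5 stay empty on the rows that came from df2; the group-wise fill at the end is an extra step added here for illustration and is not part of the example above.

```python
import pandas as pd

df1 = pd.DataFrame({"c1": [10, 11], "c2": [1, 1],
                    "c4": ["hh", "fgf"], "c5": [2231, 2142]})
df2 = pd.DataFrame({"c1": [10, 10, 11, 11], "c2": [2, 3, 2, 3],
                    "c6": [110, 111, 112, 113], "c7": [231, 332, 123, 432]})

# Stack the rows of both frames; columns missing on one side become NaN
stacked = (pd.concat([df1, df2], ignore_index=True)
             .sort_values(["c1", "c2"])
             .reset_index(drop=True))

# After concat alone, the df2 rows still have no c4/c5
n_missing_before = stacked["c4"].isna().sum()

# Extra step (not in the example above): repeat c4/c5 within each c1 group
filled = stacked.copy()
filled[["c4", "c5"]] = (filled.groupby("c1")[["c4", "c5"]]
                              .transform(lambda s: s.ffill().bfill()))
print(filled)
```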
