如何合并具有不同列名的多个数据框

时间:2023-01-13 23:48:28

I have a two data frame, say 'df1' and 'df2'. df1 has the following column:

我有一个两个数据框,比如'df1'和'df2'。 df1有以下列:

Date

and df2 has the following columns:

和df2有以下列:

Date.1, USD.Price, Date.2, EUR.Price, Date.3, JPY.Price, Date.4, INR.Price

where Date, Date.1, Date.2, Date.3, Date.4 ... is in date format.

其中Date,Date.1,Date.2,Date.3,Date.4 ...采用日期格式。

Now I want to merge Date.1, USD.Price with df1 based on df1$Date and df2$Date.2 as:

现在我想将Date.1,USD.Price与基于df1 $ Date和df2 $ Date.2的df1合并为:

df3 = merge(df1, df2[,1:2],  by.x = "Date", by.y = "Date.1", all = TRUE)

Then,

df4 = merge(df3, df2[,3:4],  by.x = "Date", by.y = "Date.2", all = TRUE)

Then again,

df5 = merge(df4, df2[,5:6],  by.x = "Date", by.y = "Date.3", all = TRUE)

Furthermore,

df6 = merge(df5, df2[,7:8],  by.x = "Date", by.y = "Date.4", all = TRUE)

and so on for all 1000 such columns.

等等所有1000个这样的列。

For example, lets say, I have a following dataframe:

例如,假设我有以下数据帧:

df1:

Date
2009-10-13
2009-10-14
2009-10-16
2009-10-18
2009-10-19
2009-10-20
2009-10-21
2009-10-22

and df2:

 Date.1      USD.Price   Date.2       EUR.Price     Date.3       JPY.Price      Date.4           INR.Price     
 2009-10-13  21.6        NA           NA            NA            NA         NA                   NA 
 2009-10-14  21.9        2009-10-14   78.2          NA            NA         NA                   NA 
 2009-10-16  22.0        2009-10-16   78.5          NA             NA        2009-10-16           12.2
 NA          NA          2009-10-18   78.9          2009-10-18  32.1       2009-10-18             12.4
NA           NA           NA          NA            2009-10-19  32.6      2009-10-19             12.2  

Then the output needs to be:

然后输出需要是:

Date           USD.Price    EUR.Price    JPY.Price    INR.Price
2009-10-13     21.6         NA           NA           NA
2009-10-14     21.9         78.2         NA           NA
2009-10-16     22.0         78.5         NA           NA
2009-10-18     NA           78.9         32.1         12.4
2009-10-19     NA           NA           32.6         12.2 

I have got some reference: How can I merge multiple dataframes with the same column names?

我有一些参考:如何合并多个具有相同列名的数据帧?

But in my case column names are different as Date.1, Date.2, Date.3 etc...

但在我的情况下,列名称与Date.1,Date.2,Date.3等不同......

Can anyone please help me out how to do this for around 1000 columns aa doing as above is not scalable for many columns?

任何人都可以请帮助我如何为大约1000列做这个,如上所述,对于许多列不可扩展?

Thanks

5 个解决方案

#1


0  

Maybe this loop could help you out:

也许这个循环可以帮助你:

for(n in 1:999){
  assign(paste('df',n+2,sep = ''),
         merge(get(paste('df',n,sep = '')), get(paste('df',n+1,sep = ''))[,n:n+1],  
               by.x = 'Date', by.y = paste('Date',n,sep = '.'), all = TRUE),
         envir = .GlobalEnv)
}

#2


1  

You can try a recursive function (a function that calls itself).

您可以尝试递归函数(一个自我调用的函数)。

It takes two data.frames and a column index. It merges the data.frames based on the first column of df1 and the first column of df2 that is subsetted using the idx. Then it calls itself using the new data.frame dfx and df2 while idx is less then the number of columns in df2 - 1.

它需要两个data.frames和一个列索引。它合并了基于df1第一列的data.frames和使用idx进行子集化的第一列df2。然后它使用新的data.frame dfx和df2调用自身,而idx小于df2 - 1中的列数。

merge_df <- function(df1, df2, idx) {

  dfx <- merge(df1, df2[, idx:(idx + 1)], by.x = names(df1)[1], 
               by.y = names(df2)[idx])

  if (idx < ncol(df2) - 1) {
    return(merge_df(dfx, df2, idx + 2))
  } else {
    return(dfx)
  }
}

You can use it like this:

你可以像这样使用它:

df1 <- data.frame(id = 1:10)
df2 <- data.frame(id1 = 1:10,
                  test1 = letters[1:10],
                  id2 = 1:10,
                  test2 = LETTERS[1:10])


df <- merge_df(df1, df2, 1)

This would result in this:

这将导致:

head(df, 10)
   id test1 test2
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
5   5     e     E
6   6     f     F
7   7     g     G
8   8     h     H
9   9     i     I
10 10     j     J

#3


1  

You could do this...

你可以这样做......

datecols <- grep("Date", names(df)) #get date columns

dfDates <- apply(df[,datecols], 1, function(x) x[!is.na(x)][1]) #vector of dates

df2 <- cbind(Date=dfDates, df[,-datecols]) #bind dates to non-date columns

df2
        Date USD.Price EUR.Price JPY.Price INR.Price
1 2009-10-13      21.6        NA        NA        NA
2 2009-10-14      21.9      78.2        NA        NA
3 2009-10-16      22.0      78.5        NA      12.2
4 2009-10-18        NA      78.9      32.1      12.4
5 2009-10-19        NA        NA      32.6      12.2

#4


0  

Here's a tidyverse way using your example df1 and df2 with the date columns processed with lubridate:

使用示例df1和df2以及使用lubridate处理的日期列时,这是一种整齐的方式:

library(tidyr)
library(dplyr)
library(lubridate)

# reformat df2
df2bis <- 
  df2 %>%
  gather(key = "tmp_key",
         value = "Date",
         starts_with("Date"),
         na.rm = TRUE) %>%
  select(-tmp_key) %>%
  distinct()

 # and merge with df1
 df <- inner_join(df1, df2bis)

#5


0  

An efficient way of doing this using sqldf I think.

我认为使用sqldf这样做的有效方法。

# Changing column names in df2 for convenience
names(df2) <- c("Date1", "USD_Price", "Date2", "EUR_Price", "Date3", "JPY_Price", "Date4", "INR_Price")

library(sqldf) 
sqldf({"
    SELECT D1.Date, D2.USD_Price, D2.EUR_Price, D2.JPY_Price, D2.INR_Price FROM df1 AS D1
    INNER JOIN df2 AS D2
    ON D1.Date IN (D2.Date1, D2.Date2, D2.Date3, D2.Date4)
"})

#        Date USD_Price EUR_Price JPY_Price INR_Price
#1 2009-10-13      21.6        NA        NA        NA
#2 2009-10-14      21.9      78.2        NA        NA
#3 2009-10-16      22.0      78.5        NA      12.2
#4 2009-10-18        NA      78.9      32.1      12.4
#5 2009-10-19        NA        NA      32.6      12.2

#1


0  

Maybe this loop could help you out:

也许这个循环可以帮助你:

for(n in 1:999){
  assign(paste('df',n+2,sep = ''),
         merge(get(paste('df',n,sep = '')), get(paste('df',n+1,sep = ''))[,n:n+1],  
               by.x = 'Date', by.y = paste('Date',n,sep = '.'), all = TRUE),
         envir = .GlobalEnv)
}

#2


1  

You can try a recursive function (a function that calls itself).

您可以尝试递归函数(一个自我调用的函数)。

It takes two data.frames and a column index. It merges the data.frames based on the first column of df1 and the first column of df2 that is subsetted using the idx. Then it calls itself using the new data.frame dfx and df2 while idx is less then the number of columns in df2 - 1.

它需要两个data.frames和一个列索引。它合并了基于df1第一列的data.frames和使用idx进行子集化的第一列df2。然后它使用新的data.frame dfx和df2调用自身,而idx小于df2 - 1中的列数。

merge_df <- function(df1, df2, idx) {

  dfx <- merge(df1, df2[, idx:(idx + 1)], by.x = names(df1)[1], 
               by.y = names(df2)[idx])

  if (idx < ncol(df2) - 1) {
    return(merge_df(dfx, df2, idx + 2))
  } else {
    return(dfx)
  }
}

You can use it like this:

你可以像这样使用它:

df1 <- data.frame(id = 1:10)
df2 <- data.frame(id1 = 1:10,
                  test1 = letters[1:10],
                  id2 = 1:10,
                  test2 = LETTERS[1:10])


df <- merge_df(df1, df2, 1)

This would result in this:

这将导致:

head(df, 10)
   id test1 test2
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
5   5     e     E
6   6     f     F
7   7     g     G
8   8     h     H
9   9     i     I
10 10     j     J

#3


1  

You could do this...

你可以这样做......

datecols <- grep("Date", names(df)) #get date columns

dfDates <- apply(df[,datecols], 1, function(x) x[!is.na(x)][1]) #vector of dates

df2 <- cbind(Date=dfDates, df[,-datecols]) #bind dates to non-date columns

df2
        Date USD.Price EUR.Price JPY.Price INR.Price
1 2009-10-13      21.6        NA        NA        NA
2 2009-10-14      21.9      78.2        NA        NA
3 2009-10-16      22.0      78.5        NA      12.2
4 2009-10-18        NA      78.9      32.1      12.4
5 2009-10-19        NA        NA      32.6      12.2

#4


0  

Here's a tidyverse way using your example df1 and df2 with the date columns processed with lubridate:

使用示例df1和df2以及使用lubridate处理的日期列时,这是一种整齐的方式:

library(tidyr)
library(dplyr)
library(lubridate)

# reformat df2
df2bis <- 
  df2 %>%
  gather(key = "tmp_key",
         value = "Date",
         starts_with("Date"),
         na.rm = TRUE) %>%
  select(-tmp_key) %>%
  distinct()

 # and merge with df1
 df <- inner_join(df1, df2bis)

#5


0  

An efficient way of doing this using sqldf I think.

我认为使用sqldf这样做的有效方法。

# Changing column names in df2 for convenience
names(df2) <- c("Date1", "USD_Price", "Date2", "EUR_Price", "Date3", "JPY_Price", "Date4", "INR_Price")

library(sqldf) 
sqldf({"
    SELECT D1.Date, D2.USD_Price, D2.EUR_Price, D2.JPY_Price, D2.INR_Price FROM df1 AS D1
    INNER JOIN df2 AS D2
    ON D1.Date IN (D2.Date1, D2.Date2, D2.Date3, D2.Date4)
"})

#        Date USD_Price EUR_Price JPY_Price INR_Price
#1 2009-10-13      21.6        NA        NA        NA
#2 2009-10-14      21.9      78.2        NA        NA
#3 2009-10-16      22.0      78.5        NA      12.2
#4 2009-10-18        NA      78.9      32.1      12.4
#5 2009-10-19        NA        NA      32.6      12.2