Combining multiple data frames with the same variables and observations

Time: 2022-09-05 15:33:49

I have several CSV files, one for each year. Each file contains the same variables and observations.

df14 <- data.frame(name = c("one", "two", "three"), A = c(1,2,3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3,1,1), C = c(0, 0, 1), B = c(8, 5, 5))

Suppose df14 & df15 represent years 2014 & 2015 respectively.

Note: the variables are not recorded in the same order.

What I'd like to do is see how each variable (A, B, C) changes by year for each name.

Is there a way to combine these in one data frame? Should I simply rbind them?

Update:

One thing I could do is add the year as a new variable and then rbind, but is that good practice?

df14$year <- 2014; df15$year <- 2015
df <- rbind(df14, df15)

which gives:

   name A B C year
   one 1 4 0 2014
   two 2 2 1 2014
   three 3 1 1 2014
   one 3 8 0 2015
   two 1 5 0 2015
   three 1 5 1 2015

3 solutions

#1


Try:

library(data.table)
library(ggplot2)

years_2_digit <- 14:15

# Look up df14, df15 by name, copy each to a data.table, and tag it with its year.
DT <- rbindlist(lapply(years_2_digit, function(y) {
  dt <- as.data.table(get(paste0("df", y)))
  dt[, year := 2000 + y]
  dt
}), use.names = TRUE)  # match columns by name; df15 stores C before B
setkeyv(DT, "name")

# Long format: one row per name, year and variable.
DT.molt <- melt(DT, id.vars = c("name", "year"))

ggplot(data = DT.molt, aes(x = year, color = variable, y = value)) +
    geom_line() + geom_point() +
    facet_grid(name ~ .) +
    ggtitle("Change by year and name")
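
To read the change off as numbers rather than from the plot, the long table can be cast back to one column per year. This is only a sketch: it rebuilds the question's sample data so that it runs on its own, and the names long, wide and change are made up for illustration rather than taken from the answer above.

```r
library(data.table)

# Sample data from the question (df15 stores C before B)
df14 <- data.frame(name = c("one", "two", "three"), A = c(1, 2, 3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3, 1, 1), C = c(0, 0, 1), B = c(8, 5, 5))

# Stack by column name; the list names become the "year" column
DT <- rbindlist(list(`2014` = df14, `2015` = df15), use.names = TRUE, idcol = "year")

# Long format, then one column per year
long <- melt(DT, id.vars = c("name", "year"))
wide <- dcast(long, name + variable ~ year, value.var = "value")

# Difference between the two years for every name/variable pair
wide[, change := `2015` - `2014`]
print(wide)
```

With more than two years, the same wide table would simply gain one column per year, and the difference would have to be taken between whichever pair of columns is of interest.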

#2


You can programmatically add the year column to each data frame and then rbind them. Here's an example that relies on being able to get the year for each data frame from its file name. Here, I've stored your sample data frames in a list. In your real use case, you'd read the CSV files into a named list using something like df.list = sapply(vector_of_file_names, read.csv, simplify = FALSE).

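df.list = list(df14=df14, df15=df15)

# Pull the two-digit year out of each element name ("df14" or "df14.csv") and add it as a column
df.list = lapply(seq_along(df.list), function(i) {
  yy = gsub(".*(\\d{2})(\\.csv)?$", "\\1", names(df.list)[[i]])
  data.frame(df.list[[i]], year = 2000 + as.numeric(yy))
})

# rbind matches data frame columns by name, so the differing column order is harmless
df = do.call(rbind, df.list)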
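
For the real use case with files on disk, the same idea might look roughly like the sketch below. It assumes the files are called df14.csv and df15.csv and that the two digits are a year after 2000; the scratch folder csv_dir is hypothetical and only exists so the example can run on its own.

```r
# Write the question's sample data to two CSV files in a scratch folder
csv_dir <- file.path(tempdir(), "yearly_csv")  # hypothetical location
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(name = c("one", "two", "three"), A = c(1, 2, 3), B = c(4, 2, 1), C = c(0, 1, 1)),
          file.path(csv_dir, "df14.csv"), row.names = FALSE)
write.csv(data.frame(name = c("one", "two", "three"), A = c(3, 1, 1), C = c(0, 0, 1), B = c(8, 5, 5)),
          file.path(csv_dir, "df15.csv"), row.names = FALSE)

# Read every matching file into a list named after the files
files <- list.files(csv_dir, pattern = "^df\\d{2}\\.csv$", full.names = TRUE)
df.list <- lapply(files, read.csv)
names(df.list) <- basename(files)

# Add the year parsed from each file name, then stack
df.list <- Map(function(d, f) {
  d$year <- 2000 + as.numeric(sub("^df(\\d{2})\\.csv$", "\\1", f))
  d
}, df.list, names(df.list))
df <- do.call(rbind, df.list)
rownames(df) <- NULL
print(df)
```

Because rbind lines up data frame columns by name, the B and C values from df15.csv end up in the right columns even though the file stores them in a different order.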

#3


Here is a working example within one lapply:

Make some dummy CSV files:

df14 <- data.frame(name = c("one", "two", "three"), A = c(1,2,3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3,1,1), C = c(0, 0, 1), B = c(8, 5, 5))
df16 <- data.frame(name = c("one", "two", "three"), C = c(1,2,3), B = c(4, 2, 1), A = c(0, 1, 1))
df17 <- data.frame(name = c("one", "two", "three"), C = c(3,1,1), A = c(0, 0, 1), B = c(8, 5, 5))
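# Get the names of the data frames (df14, df15, ...)
myNames <- ls(pattern = "^df\\d+$")
# Write each one to <name>.csv in the working directory
invisible(lapply(myNames, function(i) write.csv(get(i), paste0(i, ".csv"), row.names = FALSE)))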

Solution: read CSV files, fix columns using sort, then rbind them into one dataframe:

#Solution - read CSV, fix columns, rbind
do.call(rbind,
        lapply(list.files(".", "^df\\d+\\.csv$"),
               function(i){
                 d <- read.csv(i)
                 res <- d[,sort(colnames(d))]
                 cbind(res,FileName=i)
               }))
# output
#    A B C  name FileName
# 1  1 4 0   one df14.csv
# 2  2 2 1   two df14.csv
# 3  3 1 1 three df14.csv
# 4  3 8 0   one df15.csv
# 5  1 5 0   two df15.csv
# 6  1 5 1 three df15.csv
# 7  0 4 1   one df16.csv
# 8  1 2 2   two df16.csv
# 9  1 1 3 three df16.csv
# 10 0 8 3   one df17.csv
# 11 0 5 1   two df17.csv
# 12 1 5 1 three df17.csv
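
Since the goal is to see how A, B and C change by year for each name, one possible follow-up is to turn FileName into a year and difference each variable within a name. The sketch below is an assumption-laden illustration, not part of the answer: it rebuilds the combined table in memory instead of reading files, assumes two-digit years after 2000, and the dA, dB and dC column names are invented here.

```r
# The four sample data frames from above
df14 <- data.frame(name = c("one", "two", "three"), A = c(1, 2, 3), B = c(4, 2, 1), C = c(0, 1, 1))
df15 <- data.frame(name = c("one", "two", "three"), A = c(3, 1, 1), C = c(0, 0, 1), B = c(8, 5, 5))
df16 <- data.frame(name = c("one", "two", "three"), C = c(1, 2, 3), B = c(4, 2, 1), A = c(0, 1, 1))
df17 <- data.frame(name = c("one", "two", "three"), C = c(3, 1, 1), A = c(0, 0, 1), B = c(8, 5, 5))

# Same shape as the combined output: sorted columns plus a FileName column
combined <- do.call(rbind, Map(function(d, f) cbind(d[, sort(colnames(d))], FileName = f),
                               list(df14, df15, df16, df17),
                               paste0("df", 14:17, ".csv")))

# Turn "df14.csv" into 2014 (assumes two-digit years after 2000)
combined$year <- 2000 + as.numeric(sub("^df(\\d{2})\\.csv$", "\\1", combined$FileName))

# Order by name and year, then difference each variable within a name
combined <- combined[order(combined$name, combined$year), ]
for (v in c("A", "B", "C")) {
  combined[[paste0("d", v)]] <- ave(combined[[v]], combined$name,
                                    FUN = function(x) c(NA, diff(x)))
}
print(combined)
```

The first year of each name has no previous value to compare against, so its difference columns are NA.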
