Transposing / reshaping a data frame from long to wide format without a "timevar"

Time: 2022-09-16 11:04:50

I have a data frame in the following long format:

   Name          MedName
  Name1    atenolol 25mg
  Name1     aspirin 81mg
  Name1 sildenafil 100mg
  Name2    atenolol 50mg
  Name2   enalapril 20mg

And I would like to get the following (I don't care whether the columns are named this way; I just want the data in this format):

   Name   medication1    medication2      medication3
  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
  Name2 atenolol 50mg enalapril 20mg             NA

Through this very site I have become somewhat familiar with the reshape/reshape2 packages, and have made several attempts to get this to work, but have so far failed.

When I try dcast(dataframe, Name ~ MedName, value.var='MedName') I just get a set of columns that are flags for the medication names (the transposed values are 1 or 0). For example:

 Name  atenolol 25mg  aspirin 81mg
Name1              1             1
Name2              0             0 

I also tried dcast(dataset, Name ~ variable) after melting the dataset; however, this just returns the following (a count of how many meds each person has):

 Name  MedName
Name1        3
name2        2

Finally, I tried to melt the data and then reshape using idvar="Name" and timevar="variable" (whose values are all just MedName). However, this does not seem suited to my problem: when there are multiple matches for the idvar, reshape just takes the first MedName and ignores the rest.

Does anyone know how to do this using reshape or another R function? I realize there is probably a messier way to do this with for loops and conditionals that split and re-paste the data, but I was hoping for a simpler solution. Thank you so much!

6 solutions

#1


14  

Assuming your data is in the object dataset:

library(plyr)
library(reshape2)  # provides dcast()
## Add a medication index within each Name
data_with_index <- ddply(dataset, .(Name), mutate,
                         index = paste0('medication', seq_along(Name)))
dcast(data_with_index, Name ~ index, value.var = 'MedName')

##    Name   medication1    medication2      medication3
## 1 Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
## 2 Name2 atenolol 50mg enalapril 20mg             <NA>

#2


12  

You could always generate a unique timevar before using reshape. Here I use ave to apply the function seq_along 'along' each "Name".

test <- data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
          "atenolol 50mg","enalapril 20mg")
)

# generate the 'timevar'
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))

# reshape!
reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")

Result:

   Name     MedName.1      MedName.2        MedName.3
1 name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
4 name2 atenolol 50mg enalapril 20mg             <NA>
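
To make the intermediate step visible, here is a minimal base-R sketch of the same idea. It rebuilds the example data, computes the per-group counter with ave, and then reshapes. The variable name idx is only for illustration, and an integer vector is passed to ave here (rather than the character vector used above) so the counter stays numeric.

```r
# Example data, as in the answer above
test <- data.frame(
  Name = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

# ave() splits the first argument by Name, applies seq_along to each piece,
# and puts the results back in the original row order
idx <- ave(seq_along(test$Name), test$Name, FUN = seq_along)
print(idx)

# Use the counter as the 'timevar' that the data was missing
test$uniqid <- idx
wide <- reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")
print(wide)
```

The counter restarts at 1 for every new Name, which is what lets reshape tell the rows within a group apart.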

#3


11  

With the data.table package, this can easily be solved with the new rowid function:

library(data.table)
dcast(setDT(d1), 
      Name ~ rowid(Name, prefix = "medication"), 
      value.var = "MedName")

which gives:

   Name    medication1     medication2       medication3
1 Name1  atenolol 25mg    aspirin 81mg  sildenafil 100mg
2 Name2  atenolol 50mg  enalapril 20mg              <NA>

Another method (commonly used before version 1.9.7):

dcast(setDT(d1)[, rn := 1:.N, by = Name], 
      Name ~ paste0("medication",rn), 
      value.var = "MedName")

giving the same result.


A similar approach, but now using the dplyr and tidyr packages:

library(dplyr)
library(tidyr)
d1 %>%
  group_by(Name) %>%
  mutate(rn = paste0("medication",row_number())) %>%
  spread(rn, MedName)

which gives:

Source: local data frame [2 x 4]
Groups: Name [2]

    Name   medication1    medication2      medication3
  (fctr)         (chr)          (chr)            (chr)
1  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2  Name2 atenolol 50mg enalapril 20mg               NA
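
In more recent tidyr releases (1.0.0 and later), spread has been superseded by pivot_wider. The following is a sketch of the same pipeline written that way, not part of the original answer. It assumes d1 holds the question's data as character columns; the data frame is rebuilt here so the example runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data from the question (the object name d1 follows the answer above)
d1 <- data.frame(
  Name = c(rep("Name1", 3), rep("Name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

wide <- d1 %>%
  group_by(Name) %>%
  mutate(rn = row_number()) %>%   # per-Name counter: 1, 2, 3, 1, 2
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = MedName,
              names_prefix = "medication")

print(wide)
```

Here names_prefix builds the medication1, medication2, ... column names, so the paste0 step is no longer needed.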

#4


9  

This seems to actually be a fairly common problem, so I have included a function called getanID in my "splitstackshape" package.

Here's what it does:

library(splitstackshape)
getanID(test, "Name")
#     Name          MedName .id
# 1: name1    atenolol 25mg   1
# 2: name1     aspirin 81mg   2
# 3: name1 sildenafil 100mg   3
# 4: name2    atenolol 50mg   1
# 5: name2   enalapril 20mg   2

Since "data.table" is loaded along with "splitstackshape", you have access to dcast.data.table, so you can proceed as with @mnel's example.

dcast.data.table(getanID(test, "Name"), Name ~ .id, value.var = "MedName")
#     Name             1              2                3
# 1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
# 2: name2 atenolol 50mg enalapril 20mg               NA

The function essentially implements a sequence(.N) by the groups identified to create the "time" column.
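
For readers without splitstackshape, the sequence(.N)-by-group idea can be sketched directly in data.table. This is an illustration of the technique described above, not the actual source of getanID; the column name idx is a placeholder chosen for this example.

```r
library(data.table)

# Example data, as in the earlier answers
test <- data.table(
  Name = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg")
)

# .N is the number of rows in the current group, so sequence(.N)
# yields 1, 2, ..., .N within each Name
test[, idx := sequence(.N), by = Name]

# Cast to wide format using the new counter as the column key
wide <- dcast(test, Name ~ idx, value.var = "MedName")
print(wide)
```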

#5


3  

@thelatemail's solution is similar to this one. When I generate the time variable, I use rle in case I'm not working interactively and the Name variable needs to be dynamic.

# start with your example data
x <- 
    data.frame(
        Name=c(rep("name1",3),rep("name2",2)),
        MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
            "atenolol 50mg","enalapril 20mg")
    )

# pick the id variable
id <- 'Name'

# sort the data.frame by that variable
x <- x[ order( x[ , id ] ) , ]

# construct a `time` variable on the fly
x$time <- unlist( lapply( rle( as.character( x[ , id ] ) )$lengths , seq_len ) )

# `reshape` uses that new `time` column by default
y <- reshape( x , idvar = id , direction = 'wide' )

# done
y
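
Because the id column is held in a variable, the steps above can be wrapped in a small function. The sketch below assumes a plain data.frame as input; the function name to_wide is hypothetical and not part of any package. It uses sequence() on the run lengths, which is equivalent to the unlist(lapply(..., seq_len)) call above.

```r
# Hypothetical helper: reshape df to wide format, one row per value of `id`
to_wide <- function(df, id) {
  # sort so that rows with the same id are adjacent
  df <- df[order(df[[id]]), , drop = FALSE]
  # run lengths of the sorted id column give the group sizes;
  # sequence() turns c(3, 2) into 1 2 3 1 2
  df$time <- sequence(rle(as.character(df[[id]]))$lengths)
  # reshape() looks for a column called `time` by default
  reshape(df, idvar = id, direction = "wide")
}

x <- data.frame(
  Name = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

y <- to_wide(x, "Name")
print(y)
```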

#6


0  

Here's a shorter way, taking advantage of the way unlist deals with names:

library(dplyr)
df1 %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2]))))
# # A tibble: 2 x 4
# # Groups:   Name [2]
#      Name      MedName1       MedName2         MedName3
#     <chr>         <chr>          <chr>            <chr>
#   1 name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
#   2 name2 atenolol 50mg enalapril 20mg             <NA>
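
The naming behaviour this relies on can be seen in base R alone. The sketch below uses one group's worth of data (the object name meds is just for illustration): unlist appends a running number to the column name, and t turns the named vector into a one-row matrix whose column names become the wide column names.

```r
# One group's medications, as a one-column data frame
meds <- data.frame(
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg"),
  stringsAsFactors = FALSE
)

# unlist() flattens the column and numbers the names: MedName1, MedName2, ...
flat <- unlist(meds)
print(names(flat))

# t() of a named vector is a 1-row matrix that keeps those names as colnames
one_row <- t(flat)
print(one_row)
```

Groups with fewer medications produce fewer columns, which is why the combined result is padded with NA.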
