重塑长到宽并保持重复的行

时间:2022-09-16 11:14:07

For the given data set, i want to convert my data set from long format to wide format. I have used reshape function to do so.

对于给定的数据集,我想将我的数据集从长格式转换为宽格式。我使用了reshape函数来做到这一点。

id  status      timestamp   
1   assigned   2017-01-02  
1   done       2017-01-03  
1   locked     2017-01-04   
2   assigned   2017-01-02   
2   done       2017-01-03  
2   assigned   2017-01-03  
2   done       2017-01-04 
2   locked     2017-01-05  
3   assigned   2017-01-02  
3   done       2017-01-03 
3   locked     2017-01-04 
...

# reshape function to convert long format to Wide.
temp <- reshape(temp, idvar = "id", timevar = "status", direction = "wide")

Results:

id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

id timestamp.assigned timestamp.done timestamp.locked 1 2017-01-02 2017-01-03 2017-01-04 2 2017-01-02 2017-01-03 2017-01-05 3 2017-01-02 2017- 01-03 2017-01-04

when I do this it removes some of rows, for e.g: for id 2, there are multiple rows match for status=assigned, it takes the first row.

当我这样做时它删除了一些行,例如:对于id 2,有多个行匹配status = assign,它占用第一行。

How can I convert to wide without removing rows. Basically, I don't want to lose any data.

如何在不删除行的情况下转换为宽。基本上,我不想丢失任何数据。

Expected results:
id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 2017-01-05
2 2017-01-03 2017-01-04 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

预期结果:id timestamp.assigned timestamp.done timestamp.locked 1 2017-01-02 2017-01-03 2017-01-04 2 2017-01-02 2017-01-03 2017-01-05 2 2017-01- 03 2017-01-04 2017-01-05 3 2017-01-02 2017-01-03 2017-01-04

or

id timestamp.assigned timestamp.done timestamp.locked
1 2017-01-02 2017-01-03 2017-01-04
2 2017-01-02 2017-01-03 NA
2 2017-01-03 2017-01-04 2017-01-05
3 2017-01-02 2017-01-03 2017-01-04

id timestamp.assigned timestamp.done timestamp.locked 1 2017-01-02 2017-01-03 2017-01-04 2 2017-01-02 2017-01-03 2 2 2017-01-03 2017-01-04 2017 -01-05 3 2017-01-02 2017-01-03 2017-01-04

2 个解决方案

#1


0  

One thing you could do would be to add a variable that gave a unique value for each new assignment. Then you could use that to shape your data

您可以做的一件事是添加一个变量,为每个新赋值赋予唯一值。然后你可以用它来塑造你的数据

i <- 0

temp$key <- sapply(temp$status, function(x) {
  if(x == "assigned") {i <<- i+1; i}
  else {i}
})

temp

   id   status  timestamp key
1   1 assigned 2017-01-02   1
2   1     done 2017-01-03   1
3   1   locked 2017-01-04   1
4   2 assigned 2017-01-02   2
5   2     done 2017-01-03   2
6   2 assigned 2017-01-03   3
7   2     done 2017-01-04   3
8   2   locked 2017-01-05   3
9   3 assigned 2017-01-02   4
10  3     done 2017-01-03   4
11  3   locked 2017-01-04   4

temp2 <- reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")

temp2

  id key timestamp.assigned timestamp.done timestamp.locked
1  1   1         2017-01-02     2017-01-03       2017-01-04
4  2   2         2017-01-02     2017-01-03             <NA>
6  2   3         2017-01-03     2017-01-04       2017-01-05
9  3   4         2017-01-02     2017-01-03       2017-01-04

#2


0  

1. cumsum()

Esther's approach to number each new assignment is the way to go.

以斯帖为每个新任务编号的方法是要走的路。

However, R already has the cumsum() function which can be used for this purpose:

但是,R已经具有可用于此目的的cumsum()函数:

temp$key <- cumsum(temp$status == "assigned")
reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")
   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   2         2017-01-02     2017-01-03             <NA>
3:  2   3         2017-01-03     2017-01-04       2017-01-05
4:  3   4         2017-01-02     2017-01-03       2017-01-04

2. Grouped cumsum()

Although this solves the OP's original problem, key just numbers all assignments across all ids. In case the OP prefers to have the assignments numbered individually for each id we need to apply cumsum() grouped by id.

虽然这解决了OP的原始问题,但关键只是对所有ID中的所有分配进行编号。如果OP更喜欢为每个id分别编号,我们需要应用按id分组的cumsum()。

One way to accomplish this is using data.table syntax:

实现此目的的一种方法是使用data.table语法:

library(data.table)
setDT(temp)[, key := cumsum(status == "assigned"), by = id]
dcast(temp, id + key ~ status, value.var = "timestamp")
   id key   assigned       done     locked
1:  1   1 2017-01-02 2017-01-03 2017-01-04
2:  2   1 2017-01-02 2017-01-03       <NA>
3:  2   2 2017-01-03 2017-01-04 2017-01-05
4:  3   1 2017-01-02 2017-01-03 2017-01-04

dcast() is a replacement of base R's reshape(..., direction = "wide") function which is available from the reshape2 and data.table packages.

dcast()取代了基本R的重塑(...,direction =“wide”)函数,该函数可从reshape2和data.table包中获得。

3. Grouped cumsum() on-the-fly

The formula interface of data.table's dcast() accepts also expressions. With this, it is not necessary to modify temp by appending a key column before reshaping. Instead, this can be accomplished on-the-fly while reshaping:

data.table的dcast()的公式接口也接受表达式。有了这个,没有必要通过在重新整形之前附加一个键列来修改temp。相反,这可以在重塑时即时完成:

dcast(temp, id + ave(key <- status == "assigned", id, FUN = cumsum) ~ 
        paste0("timestamp.", status))
   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   1         2017-01-02     2017-01-03             <NA>
3:  2   2         2017-01-03     2017-01-04       2017-01-05
4:  3   1         2017-01-02     2017-01-03       2017-01-04

Data

library(data.table)
temp <- fread(
  "id  status      timestamp   
1   assigned   2017-01-02  
1   done       2017-01-03  
1   locked     2017-01-04   
2   assigned   2017-01-02   
2   done       2017-01-03  
2   assigned   2017-01-03  
2   done       2017-01-04 
2   locked     2017-01-05  
3   assigned   2017-01-02  
3   done       2017-01-03 
3   locked     2017-01-04 ")

#1


0  

One thing you could do would be to add a variable that gave a unique value for each new assignment. Then you could use that to shape your data

您可以做的一件事是添加一个变量,为每个新赋值赋予唯一值。然后你可以用它来塑造你的数据

i <- 0

temp$key <- sapply(temp$status, function(x) {
  if(x == "assigned") {i <<- i+1; i}
  else {i}
})

temp

   id   status  timestamp key
1   1 assigned 2017-01-02   1
2   1     done 2017-01-03   1
3   1   locked 2017-01-04   1
4   2 assigned 2017-01-02   2
5   2     done 2017-01-03   2
6   2 assigned 2017-01-03   3
7   2     done 2017-01-04   3
8   2   locked 2017-01-05   3
9   3 assigned 2017-01-02   4
10  3     done 2017-01-03   4
11  3   locked 2017-01-04   4

temp2 <- reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")

temp2

  id key timestamp.assigned timestamp.done timestamp.locked
1  1   1         2017-01-02     2017-01-03       2017-01-04
4  2   2         2017-01-02     2017-01-03             <NA>
6  2   3         2017-01-03     2017-01-04       2017-01-05
9  3   4         2017-01-02     2017-01-03       2017-01-04

#2


0  

1. cumsum()

Esther's approach to number each new assignment is the way to go.

以斯帖为每个新任务编号的方法是要走的路。

However, R already has the cumsum() function which can be used for this purpose:

但是,R已经具有可用于此目的的cumsum()函数:

temp$key <- cumsum(temp$status == "assigned")
reshape(temp, idvar = c("key", "id"), timevar = "status", direction = "wide")
   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   2         2017-01-02     2017-01-03             <NA>
3:  2   3         2017-01-03     2017-01-04       2017-01-05
4:  3   4         2017-01-02     2017-01-03       2017-01-04

2. Grouped cumsum()

Although this solves the OP's original problem, key just numbers all assignments across all ids. In case the OP prefers to have the assignments numbered individually for each id we need to apply cumsum() grouped by id.

虽然这解决了OP的原始问题,但关键只是对所有ID中的所有分配进行编号。如果OP更喜欢为每个id分别编号,我们需要应用按id分组的cumsum()。

One way to accomplish this is using data.table syntax:

实现此目的的一种方法是使用data.table语法:

library(data.table)
setDT(temp)[, key := cumsum(status == "assigned"), by = id]
dcast(temp, id + key ~ status, value.var = "timestamp")
   id key   assigned       done     locked
1:  1   1 2017-01-02 2017-01-03 2017-01-04
2:  2   1 2017-01-02 2017-01-03       <NA>
3:  2   2 2017-01-03 2017-01-04 2017-01-05
4:  3   1 2017-01-02 2017-01-03 2017-01-04

dcast() is a replacement of base R's reshape(..., direction = "wide") function which is available from the reshape2 and data.table packages.

dcast()取代了基本R的重塑(...,direction =“wide”)函数,该函数可从reshape2和data.table包中获得。

3. Grouped cumsum() on-the-fly

The formula interface of data.table's dcast() accepts also expressions. With this, it is not necessary to modify temp by appending a key column before reshaping. Instead, this can be accomplished on-the-fly while reshaping:

data.table的dcast()的公式接口也接受表达式。有了这个,没有必要通过在重新整形之前附加一个键列来修改temp。相反,这可以在重塑时即时完成:

dcast(temp, id + ave(key <- status == "assigned", id, FUN = cumsum) ~ 
        paste0("timestamp.", status))
   id key timestamp.assigned timestamp.done timestamp.locked
1:  1   1         2017-01-02     2017-01-03       2017-01-04
2:  2   1         2017-01-02     2017-01-03             <NA>
3:  2   2         2017-01-03     2017-01-04       2017-01-05
4:  3   1         2017-01-02     2017-01-03       2017-01-04

Data

library(data.table)
temp <- fread(
  "id  status      timestamp   
1   assigned   2017-01-02  
1   done       2017-01-03  
1   locked     2017-01-04   
2   assigned   2017-01-02   
2   done       2017-01-03  
2   assigned   2017-01-03  
2   done       2017-01-04 
2   locked     2017-01-05  
3   assigned   2017-01-02  
3   done       2017-01-03 
3   locked     2017-01-04 ")