Create a sequential number (counter) for rows within each group of a dataframe [duplicate]

Time: 2021-07-23 09:12:11

How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":

personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23

I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:

personid date measurement id
1         x     23         1
1         x     32         2
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

I appreciate any help.

6 Answers

#1


24  

The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.

df <- read.table(text = "personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23", header=TRUE)

## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3

## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
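
To keep the counter alongside the data, the result can be assigned to a new column. The following is a minimal sketch that rebuilds the sample data from the question; passing a plain row index as the first argument (rather than personid itself) is a choice made here so that the result stays an integer even if personid were a character column.

```r
# Sample data from the question
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header = TRUE)

# ave() applies FUN to the first argument within each group defined by the
# second argument and returns the results in the original row positions
df$id <- ave(seq_len(nrow(df)), df$personid, FUN = seq_along)

df$id
# [1] 1 2 1 1 2 3
```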

#2


18  

Some dplyr alternatives, using convenience functions row_number and n.

library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))

You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.

library(splitstackshape)
getanID(data = df, id.vars = "personid")
#    personid date measurement .id
# 1:        1    x          23   1
# 2:        1    x          32   2
# 3:        2    y          21   1
# 4:        3    x          23   1
# 5:        3    z          23   2
# 6:        3    y          23   3

#3


12  

Using data.table, and assuming you wish to number the rows by date within each personid subset:

library(data.table)
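# Data is assumed to be the data.frame from the question
DT <- data.table(Data)

# rank() gives each row's position by date within its group;
# order() would return sort indices, which match the ranks only by coincidence
DT[, id := rank(date, ties.method = "first"), by = personid]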

##    personid date measurement id
## 1:        1    x          23  1
## 2:        1    x          32  2
## 3:        2    y          21  1
## 4:        3    x          23  1
## 5:        3    z          23  3
## 6:        3    y          23  2
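
One caveat when numbering rows by a sort key: order() returns the indices that would sort a vector, while rank() returns each element's position in the sorted result. The two agree on the sample data shown here but can differ in general. A small base R sketch, where d is a hypothetical vector of dates for a single group:

```r
# Hypothetical dates for one group, deliberately not in sorted order
d <- c("z", "x", "y")

# Indices that would sort d
order(d)
# [1] 2 3 1

# Position of each element once sorted (the per-row number wanted here)
rank(d, ties.method = "first")
# [1] 3 1 2

# Applying order() twice also yields the ranks
order(order(d))
# [1] 3 1 2
```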

If you do not wish to order by date:

DT[, id := 1:.N, by = personid]

##    personid date measurement id
## 1:        1    x          23  1
## 2:        1    x          32  2
## 3:        2    y          21  1
## 4:        3    x          23  1
## 5:        3    z          23  2
## 6:        3    y          23  3

Any of the following would also work

DT[, id := seq_along(measurement), by =  personid]
DT[, id := seq_along(date), by =  personid]

The equivalent commands using plyr

library(plyr)
# numbering by date (rank() rather than order(), which returns sort indices)
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))

#4


7  

I think there's a canned command for this, but I can't remember it. So here's one way:

> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
 [1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
 [1] 1 1 2 2 3 4 5 6 7 8

This works because duplicated returns a logical vector. cumsum operates on numeric vectors, so the logical values are coerced to numeric (TRUE becomes 1, FALSE becomes 0).

You can store the result to your data.frame as a new column if you want:

dat$id <- cumsum(duplicated(test))+1
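
The coercion is easier to follow on a fixed vector than on a random sample. Note that this running total counts duplicates across the whole vector; for a counter that restarts for each distinct value, ave() with seq_along is one option. A minimal sketch, where x is a hypothetical input:

```r
# Hypothetical input with repeated values
x <- c("a", "b", "a", "c", "b", "a")

# TRUE marks every repeat of an earlier value
duplicated(x)
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# TRUE counts as 1 and FALSE as 0 in the running sum
cumsum(duplicated(x))
# [1] 0 0 1 1 2 3

# Counter that restarts for each distinct value of x
ave(seq_along(x), x, FUN = seq_along)
# [1] 1 1 2 1 2 3
```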

#5


5  

Assuming your data are in a data.frame named Data, this will do the trick:

# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() counts the number of rows for each personid
# sequence() creates the vector 1..n for each count n in its input,
# and concatenates the results
Data$id <- sequence(tabulate(Data$personid))
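
To see what the two helpers do on their own, here is a sketch on a hypothetical vector ids. One caveat: tabulate() only counts positive integers, so this approach assumes personid is a positive integer code. For sorted IDs of another type, rle() is shown as one possible alternative for getting the run lengths.

```r
# Hypothetical, already sorted integer IDs
ids <- c(1, 1, 2, 3, 3, 3)

# Number of occurrences of 1, 2, 3
tabulate(ids)
# [1] 2 1 3

# 1..2, 1..1, 1..3 concatenated
sequence(tabulate(ids))
# [1] 1 2 1 1 2 3

# Sorted character IDs: run lengths from rle() play the role of tabulate()
chr <- c("a", "a", "b", "c", "c", "c")
sequence(rle(chr)$lengths)
# [1] 1 2 1 1 2 3
```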

#6


2  

You can use sqldf

df<-read.table(header=T,text="personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23")

library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
       FROM df a, df b 
       WHERE a.personid = b.personid AND b.ROWID <= a.ROWID 
       GROUP BY a.ROWID"
)

#  personid date measurement count
#1        1    x          23     1
#2        1    x          32     2
#3        2    y          21     1
#4        3    x          23     1
#5        3    z          23     2
#6        3    y          23     3
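
The query pairs each row a with every row b that has the same personid and appears at or before it, then counts those pairs per row. The same counting logic can be sketched in base R without sqldf; the sample data is rebuilt here so the snippet runs on its own.

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")

# For row i, count the rows 1..i that share its personid
df$count <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$personid[seq_len(i)] == df$personid[i])
})

df$count
# [1] 1 2 1 1 2 3
```

Like the self-join, this compares every row with all earlier rows, so the work grows quadratically with the number of rows.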
