Select only the first row for each unique value of a column in R

Date: 2022-11-05 13:09:00

From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first appearance of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

7 answers

#1


82  

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]
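
To see why this works: duplicated() returns a logical vector that is TRUE for every element that has already appeared earlier in the vector, so negating it keeps exactly the first occurrence of each id. A minimal sketch on the question's sample data (the names dup and first_rows are only illustrative):

```r
# Rebuild the sample data from the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10],
                   stringsAsFactors = FALSE)
test <- test[order(test$id), ]
rownames(test) <- 1:10

# TRUE wherever an id has already been seen in an earlier row
dup <- duplicated(test$id)

# Keep only the rows where the id appears for the first time
first_rows <- test[!dup, ]
print(first_rows)
```

Because the original row names are preserved, the result also shows which rows of test were selected.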

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15

#2


11  

What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

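There is also a unique method for data.tables, which returns the first row for each key value. (Since data.table 1.9.8 the default is to de-duplicate on all columns, so on current versions pass by = key(DT) explicitly.)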

jdtu <- function() unique(DT)

I think that if you are ordering test outside the benchmark, you can also remove the setkey and data.table conversion from the benchmark (setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']


library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

And with more data (edited to include the unique method):

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
benchmark(ju(), jdt(), jdtu(), replications = 5)
##     test replications elapsed relative user.self sys.self
## 2  jdt()            5    0.09     2.25      0.09     0.00
## 3 jdtu()            5    0.04     1.00      0.05     0.00
## 1   ju()            5    0.22     5.50      0.19     0.03

The unique method is fastest here.

#3


9  

A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
setkey(testd,id)
testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]

#4


7  

(1) SQLite has a built-in rowid pseudo-column, so this works:

sqldf("select min(rowid) rowid, id, string 
               from test 
               group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) Also sqldf itself has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string 
              from test 
              group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative which mixes the elements of the above two might be even better:

sqldf("select min(rowid) row_names, id, string 
               from test 
               group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)

#5


5  

Now, for dplyr, adding a distinct counter:

dfx <- df %>% group_by(aa, bb) %>%
        summarise(first=head(value,1), count=n_distinct(value))

You create groups, then summarise within groups. If the data is numeric, you can use first(value) (there is also last(value)) in place of head(value, 1).

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

library(dplyr)
dfx <- df %>% group_by(aa, bb) %>%
    summarise(first=head(value,1), count=n_distinct(value))

> dfx
Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2
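
Applied to the question itself (first row per id), more recent dplyr releases offer two direct routes. The following is only a sketch, assuming dplyr 1.0.0 or later (for slice_head()) and using the question's sample data; the names by_slice and by_distinct are illustrative:

```r
library(dplyr)

# Sample data from the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10],
                   stringsAsFactors = FALSE)
test <- test[order(test$id), ]

# Option A: group by id and keep the first row of every group
by_slice <- test %>%
  group_by(id) %>%
  slice_head(n = 1) %>%
  ungroup()

# Option B: keep the first row for each distinct id, retaining all columns
by_distinct <- distinct(test, id, .keep_all = TRUE)

print(by_distinct)
```

Both keep whichever row comes first in the current row order, so sort the data beforehand if "first" should mean something other than the existing order.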

#6


4  

A base R option is the split()-lapply()-do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The trailing "1, )" at the end of the lapply() call is essential: the empty argument after the comma makes the call equivalent to [1, ], which selects the first row and all columns.
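
The effect of that empty trailing argument can be checked directly by calling the `[` function on a single group. This is a small sketch on the question's sample data; the variable names are illustrative:

```r
# Sample data from the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10],
                   stringsAsFactors = FALSE)
test <- test[order(test$id), ]
rownames(test) <- 1:10

groups <- split(test, test$id)
g1 <- groups[[1]]               # the two rows with id == 1

with_comma    <- `[`(g1, 1, )   # same as g1[1, ]: first row, all columns
without_comma <- `[`(g1, 1)     # same as g1[1]: first column, all rows

same_as_bracket <- identical(with_comma, g1[1, ])

# The full idiom: first row of every group, bound back together
res <- do.call(rbind, lapply(groups, `[`, 1, ))
print(res)
```

Without the trailing comma, each group would be reduced to its id column instead of its first row.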

#7


-1  

test_subset <- test[unique(test$id),]

Just this line will generate the subset you want.
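
One caveat worth checking: unique(test$id) returns id values, and using them inside [ , ] treats them as row positions. That only picks the intended rows when every id happens to equal the position of its first occurrence. A sketch on the question's sample data, with a match()-based variant shown as one possible alternative (the names by_value and by_position are illustrative):

```r
# Sample data from the question
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10],
                   stringsAsFactors = FALSE)
test <- test[order(test$id), ]
rownames(test) <- 1:10

# Indexing by the id values selects rows 1 to 5, not the first row per id
by_value <- test[unique(test$id), ]

# match() returns the position of the first occurrence of each unique id
by_position <- test[match(unique(test$id), test$id), ]

print(by_value)
print(by_position)
```

Here by_value contains ids 1, 1, 2, 2, 3, while by_position contains one row for each of the five ids.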
