将列中的逗号分隔的字符串分割为单独的行

时间:2022-01-12 00:15:37

I have a data frame, like so:

我有一个数据框架,像这样:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director column and 'A' in the AB column.

如您所见,director列中的一些条目是由逗号分隔的多个名称。我想将这些条目分割成不同的行,同时保持另一列的值。例如,上面数据框中的第一行应该分为两行,在director列中各有一个名称,在AB列中各有一个名称。

5 个解决方案

#1


42  

This old question frequently is being used as dupe target (tagged with r-faq). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmark as guidance which of the approaches is the fastest1.

这个老问题经常被用作dupe目标(用r-faq标记)。到今天为止,已经有三次有人回答了这个问题,他们提供了6种不同的方法,但没有一个基准来作为指导,即哪一种方法是最快的。

The benchmarked solutions include

基准测试的解决方案包括

Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

总共有8种不同的方法通过使用microbenchmark包对6种不同大小的数据帧进行了基准测试(参见下面的代码)。

The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.

OP给出的示例数据仅包含20行。要创建更大的数据帧,这20行只需重复1、10、100、1000、10000和100000次,从而产生高达200万行的问题大小。

Benchmark results

将列中的逗号分隔的字符串分割为单独的行

The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, magnitudes faster than the slowest methods.

基准测试结果表明,对于足够大的数据帧,所有数据都是如此。表方法比任何其他方法都快。对于超过5000行的数据帧,Jaap的数据。表方法2和变型DT3是最快的,比最慢的方法快得多。

Remarkably, the timings of the two tidyverse methods and the splistackshape solution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

值得注意的是,这两种贴诗法和splistackshape解法的时间安排是如此相似,以至于很难提取图表中的曲线。它们是所有数据帧大小的基准测试方法中最慢的。

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.

对于较小的数据帧,Matt的基本R解决方案和数据。表方法4的开销似乎比其他方法少。

Code

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

Define function for benchmark runs of problem size n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

Run benchmark for different problem sizes

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

Prepare data for plotting

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

Session info & package versions (excerpt)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1My curiosity was piqued by this exuberant comment Brilliant! Orders of magnitude faster! to a tidyverse answer of a question which was closed as a duplicate of this question.

我的好奇心被这辉煌的评论激怒了!数量级的速度!对于一个问题的提反式答案,这个问题是这个问题的复本。

#2


53  

Several alternatives:

几个选择:

1) two ways with data.table:

1)数据的两种方式。

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) a dplyr/tidyr combination: Alternatively, you can also use the dplyr / tidyr combination:

2)dplyr/tidyr组合:或者,也可以使用dplyr/tidyr组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) with tidyr only: With tidyr 0.5.0 (and later), you can also just use separate_rows:

3)仅使用tidyr:使用tidyr 0.5.0(及更高版本),也可以使用separation ate_rows:

separate_rows(v, director, sep = ",")

You can use the convert = TRUE parameter to automatically convert numbers into numeric columns.

您可以使用convert = TRUE参数自动将数字转换为数字列。

4) with base R:

4)基地R:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

#3


41  

Naming your original data.frame v, we have this:

给你的原始数据命名。

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

Note the use of rep to build the new AB column. Here, sapply returns the number of names in each of the original rows.

注意使用rep构建新的AB列。在这里,sapply返回每个原始行的名称数量。

#4


20  

Late to the party, but another generalized alternative is to use cSplit from my "splitstackshape" package that has a direction argument. Set this to "long" to get the result you specify:

晚到党,但另一个普遍的选择是使用cSplit from我的“splitstackshape”包,它有一个方向参数。将其设置为“long”以获得您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

#5


-1  

Try this approach,

试试这个方法,

library(dplyr)
df$director <- as.character(df$director) 
df %>% mutate(firstName=unlist(strsplit(director, split=" "))[[1]],
              lastName=unlist(strsplit(director, split=" "))[[2]])

Mutate will create a new column based on some specified transformation of another. In this case, it is a broken version of a single column, done twice (I.e., split=" ").

Mutate将基于另一个特定的转换创建一个新的列。在这种情况下,它是单个列的一个破碎版本,执行两次(即。,分= " ")。

#1


42  

This old question frequently is being used as dupe target (tagged with r-faq). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmark as guidance which of the approaches is the fastest1.

这个老问题经常被用作dupe目标(用r-faq标记)。到今天为止,已经有三次有人回答了这个问题,他们提供了6种不同的方法,但没有一个基准来作为指导,即哪一种方法是最快的。

The benchmarked solutions include

基准测试的解决方案包括

Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

总共有8种不同的方法通过使用microbenchmark包对6种不同大小的数据帧进行了基准测试(参见下面的代码)。

The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.

OP给出的示例数据仅包含20行。要创建更大的数据帧,这20行只需重复1、10、100、1000、10000和100000次,从而产生高达200万行的问题大小。

Benchmark results

将列中的逗号分隔的字符串分割为单独的行

The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, magnitudes faster than the slowest methods.

基准测试结果表明,对于足够大的数据帧,所有数据都是如此。表方法比任何其他方法都快。对于超过5000行的数据帧,Jaap的数据。表方法2和变型DT3是最快的,比最慢的方法快得多。

Remarkably, the timings of the two tidyverse methods and the splistackshape solution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

值得注意的是,这两种贴诗法和splistackshape解法的时间安排是如此相似,以至于很难提取图表中的曲线。它们是所有数据帧大小的基准测试方法中最慢的。

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.

对于较小的数据帧,Matt的基本R解决方案和数据。表方法4的开销似乎比其他方法少。

Code

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

Define function for benchmark runs of problem size n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

Run benchmark for different problem sizes

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

Prepare data for plotting

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

Session info & package versions (excerpt)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1My curiosity was piqued by this exuberant comment Brilliant! Orders of magnitude faster! to a tidyverse answer of a question which was closed as a duplicate of this question.

我的好奇心被这辉煌的评论激怒了!数量级的速度!对于一个问题的提反式答案,这个问题是这个问题的复本。

#2


53  

Several alternatives:

几个选择:

1) two ways with data.table:

1)数据的两种方式。

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) a dplyr/tidyr combination: Alternatively, you can also use the dplyr / tidyr combination:

2)dplyr/tidyr组合:或者,也可以使用dplyr/tidyr组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) with tidyr only: With tidyr 0.5.0 (and later), you can also just use separate_rows:

3)仅使用tidyr:使用tidyr 0.5.0(及更高版本),也可以使用separation ate_rows:

separate_rows(v, director, sep = ",")

You can use the convert = TRUE parameter to automatically convert numbers into numeric columns.

您可以使用convert = TRUE参数自动将数字转换为数字列。

4) with base R:

4)基地R:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

#3


41  

Naming your original data.frame v, we have this:

给你的原始数据命名。

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

Note the use of rep to build the new AB column. Here, sapply returns the number of names in each of the original rows.

注意使用rep构建新的AB列。在这里,sapply返回每个原始行的名称数量。

#4


20  

Late to the party, but another generalized alternative is to use cSplit from my "splitstackshape" package that has a direction argument. Set this to "long" to get the result you specify:

晚到党,但另一个普遍的选择是使用cSplit from我的“splitstackshape”包,它有一个方向参数。将其设置为“long”以获得您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

#5


-1  

Try this approach,

试试这个方法,

library(dplyr)
df$director <- as.character(df$director) 
df %>% mutate(firstName=unlist(strsplit(director, split=" "))[[1]],
              lastName=unlist(strsplit(director, split=" "))[[2]])

Mutate will create a new column based on some specified transformation of another. In this case, it is a broken version of a single column, done twice (I.e., split=" ").

Mutate将基于另一个特定的转换创建一个新的列。在这种情况下,它是单个列的一个破碎版本,执行两次(即。,分= " ")。