Extracting substrings and numbers from strings in R

Time: 2022-09-13 16:35:37

I have several strings; the following are some examples.

rfoutputtablep7q10000t20000c100
rfoutputtablep7q1000t20000c100
svmLinear2outputtablep7q20000t20000c100
svmLinear2outputtablep7q5000t20000c100

I want to make a data frame with the columns algorithm, p, q, t, and c, and extract the values from these strings. The part before "outputtable" is the algorithm, the number after "p" is the value of p, the number after "q" is the value of q, and so on.

How can this data frame be created?

4 answers

#1


6  

Using base R only.

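# Split on the literal "outputtable" and on the letters p, q, t, c.
# Assumes the algorithm name itself contains none of the letters p, q, t, c.
res <- do.call(rbind, strsplit(y, 'outputtable|p|q|t|c'))
# Column 2 is the empty string between "outputtable" and "p"; drop it.
res <- as.data.frame(res[, -2])
# Convert every column except the algorithm name to numeric.
res[-1] <- lapply(res[-1], function(x) as.numeric(as.character(x)))
names(res) <- c("algorithm", "p", "q", "t", "c")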
res
#   algorithm p     q     t   c
#1         rf 7 10000 20000 100
#2         rf 7  1000 20000 100
#3 svmLinear2 7 20000 20000 100
#4 svmLinear2 7  5000 20000 100

DATA.

y <- scan(text = '"rfoutputtablep7q10000t20000c100"
"rfoutputtablep7q1000t20000c100"
"svmLinear2outputtablep7q20000t20000c100"
"svmLinear2outputtablep7q5000t20000c100"',
what = character())
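
The split above relies on the algorithm name containing none of the letters p, q, t, or c. As a hedged alternative that is not part of the original answer, the whole layout can be matched with one anchored pattern and `utils::strcapture()` (available since R 3.4.0), which also converts the columns to the types given in a prototype data frame. The names `pattern` and `proto` are illustrative only.

```r
# Same four strings as in the question.
y <- c("rfoutputtablep7q10000t20000c100",
       "rfoutputtablep7q1000t20000c100",
       "svmLinear2outputtablep7q20000t20000c100",
       "svmLinear2outputtablep7q5000t20000c100")

# One capture group per target column; the pattern is anchored at both ends.
pattern <- "^(.*)outputtablep([0-9]+)q([0-9]+)t([0-9]+)c([0-9]+)$"

# Zero-row prototype: its column names and classes define the result.
proto <- data.frame(algorithm = character(),
                    p = numeric(), q = numeric(),
                    t = numeric(), c = numeric(),
                    stringsAsFactors = FALSE)

res <- strcapture(pattern, y, proto)
res
```

Strings that do not match the pattern come back as rows of NA instead of shifting the columns, which makes malformed inputs easier to spot.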

#2


4  

Use a positive look-ahead to get the algorithm (x is the character vector of input strings):

gsub("^(\\w+)(?=outputtable).*", "\\1", x, perl=TRUE)

Live example: https://regex101.com/r/7vDK1x/2

A positive look-behind for p, q, t, and c (replace q in (?<=q) with the other letters):

gsub(".*?(?<=q)(\\d+).*", "\\1", x, perl=TRUE)
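
The two patterns above each return one column. A minimal sketch of how they might be combined into the requested data frame follows; the helper name `get_num` and the variable `x` are assumptions for illustration, not part of the original answer.

```r
x <- c("rfoutputtablep7q10000t20000c100",
       "rfoutputtablep7q1000t20000c100",
       "svmLinear2outputtablep7q20000t20000c100",
       "svmLinear2outputtablep7q5000t20000c100")

# Algorithm: everything before "outputtable" (positive look-ahead).
algorithm <- sub("^(\\w+)(?=outputtable).*", "\\1", x, perl = TRUE)

# Hypothetical helper: digits that directly follow `letter` (positive look-behind).
get_num <- function(letter, s) {
  pattern <- paste0(".*?(?<=", letter, ")(\\d+).*")
  as.numeric(sub(pattern, "\\1", s, perl = TRUE))
}

# sapply() over a character vector names the result columns p, q, t, c.
res <- data.frame(algorithm = algorithm,
                  sapply(c("p", "q", "t", "c"), get_num, s = x),
                  stringsAsFactors = FALSE)
res
```

`sub()` is enough here because each pattern consumes the whole string, so there is only one match to replace.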

#3


4  

library(stringr)
myd = c("p", "q", "t", "c")
data.frame(sapply(myd, function(a) str_extract(str_extract(x, paste0(a, "\\d+")), "\\d+")))
#  p     q     t   c
#1 7 10000 20000 100
#2 7  1000 20000 100
#3 7 20000 20000 100
#4 7  5000 20000 100

#For first column
substr(x, 1, regexpr("outputtable", x) - 1)
#[1] "rf"         "rf"         "svmLinear2" "svmLinear2"

DATA

x = c("rfoutputtablep7q10000t20000c100", "rfoutputtablep7q1000t20000c100", 
"svmLinear2outputtablep7q20000t20000c100", "svmLinear2outputtablep7q5000t20000c100")
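
The nested `str_extract()` calls first grab the letter plus digits and then strip the letter. As a sketch of a possible simplification (not taken from the original answer), a look-behind lets a single `str_extract()` call return only the digits, and `as.numeric()` gives numeric columns directly. The helper name `num_after` is hypothetical.

```r
library(stringr)

x <- c("rfoutputtablep7q10000t20000c100",
       "rfoutputtablep7q1000t20000c100",
       "svmLinear2outputtablep7q20000t20000c100",
       "svmLinear2outputtablep7q5000t20000c100")

# Hypothetical helper: first run of digits directly preceded by `letter`.
num_after <- function(letter) {
  as.numeric(str_extract(x, paste0("(?<=", letter, ")\\d+")))
}

res <- data.frame(
  # Lazy match from the start up to, but not including, "outputtable".
  algorithm = str_extract(x, "^.*?(?=outputtable)"),
  p = num_after("p"),
  q = num_after("q"),
  t = num_after("t"),
  c = num_after("c"),
  stringsAsFactors = FALSE
)
res
```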

#4


2  

Here is another solution using the stringi package. Check the benchmarks comparing all solutions proposed so far. stringi is roughly on par with base R in speed, but is, of course, a bit more complicated if you seek a simple solution. Hence, depending on your preference for speed or simplicity, either is good. However, stringi offers more flexibility for more complex cases. (Note that the benchmarks are not perfectly comparable, since we have all used slightly different approaches for setting up the data.frame and converting types.)

UPDATE: In response to the comment by Rui Barradas, I have updated the code in my answer. (i) I have added a function using the stringi approach that includes conversion of the columns to numeric, that is, the full task as I would do it. (ii) Furthermore, I have added benchmarks so that all approaches proposed so far (including those in comments) are covered. To achieve a reasonably fair comparison, I have modified the proposed approaches so that the output is the same. In particular, I have skipped the conversion of columns to numeric for the comparison and made the commands similarly concise by avoiding interim assignments, etc.

In the benchmark below, the stringi comparison variant has the lowest mean time, although base_approach1 has a slightly lower median.

Please correct me if I have overlooked anything concerning a fair comparison (in particular, the stringr solution could probably be improved code-wise, but I am not very familiar with the package, so I kept the proposed solution).

library(stringi)
library(stringr)
library(microbenchmark)

strings <- c("rfoutputtablep7q10000t20000c100",
              "rfoutputtablep7q1000t20000c100",
             "svmLinear2outputtablep7q20000t20000c100",
             "svmLinear2outputtablep7q5000t20000c100")


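split_to_df <- function(string, splititems, colidschar, firstcolname, replsplit_tonames) {

   # Split each input string at any of the split patterns and bind the pieces row-wise.
   # Uses the `string` argument (the original referred to the global `strings`).
   data <- as.data.frame(do.call(rbind
                                ,stri_split_regex(string, paste(splititems, collapse = "|")))
                        ,stringsAsFactors = FALSE)
   # Derive column names by stripping the look-ahead and "outputtable" from the patterns.
   names(data) <- c(firstcolname, stri_replace_all_regex(splititems, replsplit_tonames, ""))
   # Convert all columns except the character ones to numeric.
   numericcols <- setdiff(seq_len(ncol(data)), colidschar)
   data[,numericcols] <- lapply(data[,numericcols], as.numeric)
   return(data)

}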

stringi_approach_complete <- function() {

  df <- split_to_df(string = strings
                    ,splititems = c("outputtablep(?=\\d)", "q(?=\\d)", "t(?=\\d)", "c(?=\\d)")
                    ,colidschar = 1
                    ,firstcolname = "A"
                    ,replsplit_tonames = "\\(.*\\)|outputtable")
  # class(df$p)
  # [1] "numeric"
  # A p     q     t   c
  # 1         rf 7 10000 20000 100
  # 2         rf 7  1000 20000 100
  # 3 svmLinear2 7 20000 20000 100
  # 4 svmLinear2 7  5000 20000 100

}


stringi_approach_compare <- function() {

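  # "outputtablep" is split off as one unit; splitting on "outputtable" and "p" separately
  # would leave an empty piece between them and yield six columns instead of five.
  data <- as.data.frame(do.call(rbind, stri_split_regex(strings, "outputtablep(?=\\d)|q(?=\\d)|t(?=\\d)|c(?=\\d)")))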
  names(data) <- c("A", "p", "q", "t", "c")
  #class(data$p)
  #[1] "factor"
  #data
  # A p     q     t   c
  # 1         rf 7 10000 20000 100
  # 2         rf 7  1000 20000 100
  # 3 svmLinear2 7 20000 20000 100
  # 4 svmLinear2 7  5000 20000 100

}


stringr_approach <- function() {

  res <- data.frame(p = str_extract(str_extract(strings, "p\\d+"), "\\d+"),
                    q = str_extract(str_extract(strings, "q\\d+"), "\\d+"),
                    t = str_extract(str_extract(strings, "t\\d+"), "\\d+"),
                    c = str_extract(str_extract(strings, "c\\d+"), "\\d+"))
  #class(res$p)
  #[1] "factor"
  #res
  # p     q     t   c
  # 1 7 10000 20000 100
  # 2 7  1000 20000 100
  # 3 7 20000 20000 100
  # 4 7  5000 20000 100

}

base_approach1 <- function() {

  res <- do.call(rbind, strsplit(strings, 'outputtable|p|q|t|c'))
  res <- as.data.frame(res[, -2])
  names(res) <- c("A", "p", "q", "t", "c")
  #class(res$p)
  #[1] "factor"
  #res[-1] <- lapply(res[-1], function(x) as.numeric(as.character(x)))
  #res
  #           A p     q     t   c
  #1         rf 7 10000 20000 100
  #2         rf 7  1000 20000 100
  #3 svmLinear2 7 20000 20000 100
  #4 svmLinear2 7  5000 20000 100


}

base_approach2 <- function() {

  df <- setNames(data.frame(do.call(rbind, strsplit(strings, 'outputtable\\D|p|q|t|c'))), c("A", "p", "q", "t", "c"))
  #class(df$p)
  #[1] "factor"
  #df
  # A p     q     t   c
  # 1         rf 7 10000 20000 100
  # 2         rf 7  1000 20000 100
  # 3 svmLinear2 7 20000 20000 100
  # 4 svmLinear2 7  5000 20000 100

}



microbenchmark(
  base_approach1(),
  base_approach2(),
  stringi_approach_compare(),
  stringr_approach(),
  stringi_approach_complete()

)

# Unit: microseconds
#         expr                 min       lq     mean   median       uq       max neval
# base_approach1()            260.139 273.3635 337.1985 285.6005 298.2330  5280.152   100
# base_approach2()            352.906 362.1820 461.8205 374.8140 391.9850  4645.791   100
# stringi_approach_compare()  280.667 297.8380 312.8426 307.3125 319.1545   654.098   100
# stringr_approach()          849.499 867.6570 956.7596 886.2100 923.7115  5651.609   100
# stringi_approach_complete() 319.747 333.9580 461.5521 346.7870 369.0900 10985.052   100
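
The look-ahead `(?=\\d)` used in the stringi patterns is what makes the split safe when an algorithm name contains one of the letters p, q, t, or c. A minimal sketch of the same idea in base R, using `strsplit()` with `perl = TRUE`, is shown below. The second input string ("rpart...") is a made-up example, not from the question, added only to show a name that contains those letters.

```r
strings <- c("rfoutputtablep7q10000t20000c100",
             "rpartoutputtablep7q1000t20000c100")  # hypothetical second input

# Split only where the marker is directly followed by a digit.
parts <- strsplit(strings,
                  "outputtablep(?=\\d)|q(?=\\d)|t(?=\\d)|c(?=\\d)",
                  perl = TRUE)
mat <- do.call(rbind, parts)

# First piece is the algorithm name; the remaining four pieces are numbers.
res <- data.frame(A = mat[, 1], stringsAsFactors = FALSE)
res[c("p", "q", "t", "c")] <- lapply(2:5, function(j) as.numeric(mat[, j]))
res
```

With the plain pattern 'outputtable|p|q|t|c', the name "rpart" would itself be split at its p and t, so the pieces would no longer line up with the columns.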
