R - Concatenate strings from a dataframe and remove HTML tags

Date: 2022-06-01 20:31:35

I'm looking to use R to clean up some text strings from a database. The database stores the text complete with HTML tags. Unfortunately, due to database limitations, each string is broken into multiple fragments. I think I can figure out how to remove the HTML tags with regular expressions and the help of other posts, but I don't expect those solutions to work unless I first concatenate the fragments back together, because opening and closing HTML tags can be spread across records in the dataframe. Here is some sample data:

Existing dataframe

Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."

Desired output dataframe

Record_nbr  fragment    Comments
1   3   "The quick brown fox jumped over the lazy dog."
2   2   "New Record."

Data:

dat <- read.table(text='Record_nbr  fragment    Comments
1   1   "The quick brown"
1   2   "fox jumped over"
1   3   "the lazy dog."
2   1   "New Record."', header=TRUE)

5 answers

#1


0  

It seems the fragment column is no longer meaningful once the fragments are combined? Maybe:

aggregate(dat[3], dat[1], paste, collapse = " ")
#   Record_nbr                                      Comments
# 1          1 The quick brown fox jumped over the lazy dog.
# 2          2                                   New Record.

which is equivalent to

aggregate(Comments ~ Record_nbr, data = dat, paste, collapse = " ")
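As a small illustration of why the collapse argument matters here: paste on a character vector returns one element per input unless collapse is given. This is only a sketch using the three fragments of record 1 from the question; the variable names are mine.

```r
# The three fragments of record 1 from the sample data
fragments <- c("The quick brown", "fox jumped over", "the lazy dog.")

# Without collapse, paste() returns the vector unchanged: still three strings
separate <- paste(fragments)

# With collapse, the elements are joined into a single string
joined <- paste(fragments, collapse = " ")

print(length(separate))
print(joined)
```

So when paste is handed to aggregate as the summary function, collapse has to be passed along as well to get one string per record.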

#2


1  

I am assuming that you don't actually want to keep the fragment column. In that case you can use this quick one-liner:

aggregate(Comments ~ Record_nbr, data=dat, function(x) paste(x, collapse=" "))
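One thing worth checking with this approach: the pieces are pasted in the order the rows appear within each record, so if the rows may come out of the database unordered, it is probably safer to sort by fragment first. A minimal sketch, assuming the question's column names and using a deliberately shuffled copy of the sample data (the names shuffled and res are mine):

```r
# Sample data from the question, with the rows deliberately out of order
shuffled <- read.table(text = 'Record_nbr fragment Comments
1 2 "fox jumped over"
2 1 "New Record."
1 3 "the lazy dog."
1 1 "The quick brown"', header = TRUE, stringsAsFactors = FALSE)

# Sort by record, then by fragment, so the pieces are joined in sequence
shuffled <- shuffled[order(shuffled$Record_nbr, shuffled$fragment), ]

# One row per record, with the fragments joined by a single space
res <- aggregate(Comments ~ Record_nbr, data = shuffled,
                 FUN = function(x) paste(x, collapse = " "))
print(res)
```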

#3


0  

Here's one of many approaches:

## ensure order
dat <- with(dat, dat[order(Record_nbr, fragment), ])

do.call(rbind, lapply(split(dat, dat$Record_nbr), function(x) {
    data.frame(
        x[1, 1, drop=FALSE], 
        fragment = max(x[, 2]), 
        Comments = paste(x$Comments, collapse=" ")
    )
}))

##   Record_nbr fragment                                      Comments
## 1          1        3 The quick brown fox jumped over the lazy dog.
## 2          2        1                                   New Record.

#4


0  

Using dplyr:

library(dplyr)
dat %>%
  group_by(Record_nbr) %>%
  summarize(fragment = n(), Comments = paste(Comments, collapse = " "))

#  Record_nbr fragment                                      Comments
#1          1        3 The quick brown fox jumped over the lazy dog.
#2          2        1                                   New Record.

#5


0  

You could also apply aggregate to the whole data frame, which is quick to write:

aggregate(dat,  by=list(dat$Record_nbr), paste, collapse=" ")

##   Group.1 Record_nbr fragment                                      Comments
## 1       1      1 1 1    1 2 3 The quick brown fox jumped over the lazy dog.
## 2       2          2        1                                   New Record.

Edit: As the output shows, this pastes every column, including Record_nbr and fragment, so you may have to adjust the function inputs (for example, pass only the Comments column) to get the exact outcome you want.
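None of the answers covers the tag-removal step from the question, so here is a rough sketch of the whole pipeline: join the fragments first, then strip the tags. The data frame html_dat is made up for illustration (the question's sample data contains no tags), and the pattern <[^>]+> is a simple regular expression, not a real HTML parser; it does not decode entities such as &amp; and will misfire if a > appears inside an attribute value.

```r
# Hypothetical data: the opening <b> tag is split across fragments 1 and 2
html_dat <- data.frame(
  Record_nbr = c(1, 1, 1, 2),
  fragment   = c(1, 2, 3, 1),
  Comments   = c("The <b", ">quick</b> brown", "fox.", "<p>New Record.</p>"),
  stringsAsFactors = FALSE
)

# Make sure the fragments are in sequence within each record
html_dat <- html_dat[order(html_dat$Record_nbr, html_dat$fragment), ]

# Join the fragments of each record into one string
joined <- aggregate(Comments ~ Record_nbr, data = html_dat,
                    FUN = paste, collapse = " ")

# Only now remove the tags: anything between "<" and the next ">"
joined$Comments <- gsub("<[^>]+>", "", joined$Comments)
print(joined)
```

Joining with a space turns the split tag into "<b >", which the pattern still matches. If your database cuts strings mid-word rather than at word boundaries, collapse = "" may be the better choice.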
