How to remove [1]s, [[1]] and double quotes from CSV data in R?

Time: 2022-06-01 12:42:55

I have a CSV file. It contains the output of some previous R operations, so it is filled with index markers (such as [1] and [[1]]). When it is read into R, it looks like this, for example:

        V1
1                                                                                                           [1] 789
2                                                                                                             [[1]]
3                                                           [1] "PNG"        "D115"    "DX06"    "Slz"
4                                                                                                           [1] 787
5                                                                                                             [[1]]
6                                                                       [1] "D010"           "HC"
7                                                                                                           [1] 949
8                                                                                                             [[1]]
9                                                                       [1] "HC" "DX06"          

(I don't know why there is so much wasted space between the row number and the output data.)

I need the above data to appear as follows (without [1], [[1]], or double quotes, and with each set of values placed beside its corresponding number):

789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06

(Ideally, 789 and its corresponding data PNG,D115,DX06,Slz would be separated by a tab, and likewise for each row.)

How can I achieve this in R?

2 answers

#1


We could create a grouping variable ('indx') and split the 'V1' column by it, after removing the bracketed index at the start of each string as well as the double quotes within it. Assuming the first column should hold the numeric element and the second column the non-numeric part, we can use a regex to replace the whitespace with ',' (as shown in the expected result) and then rbind the list elements.

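# Each record spans three rows (number, "[[1]]", values); number the groups.
indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
# Drop quotes and everything up to the last "]", then trim whitespace.
clean <- gsub('^\\s+|\\s+$', '', gsub('"|^.*\\]', '', df1$V1))
# First element is the number; the non-empty remainder holds the values.
do.call(rbind, lapply(split(clean, indx),
        function(x) data.frame(ind = x[1],
            val = gsub('\\s+', ',', x[-1][x[-1] != '']))))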

 #   ind               val
 #1  789 PNG,D115,DX06,Slz
 #2  787           D010,HC
 #3  949           HC,DX06

data

 df1 <- structure(list(V1 = c("[1] 789", "[[1]]", 
 "[1] \"PNG\"        \"D115\"    \"DX06\"    \"Slz\"", 
 "[1] 787", "[[1]]", "[1] \"D010\"           \"HC\"", "[1] 949", 
 "[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1", 
 class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", 
 "7", "8", "9"))
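
Since the question asks for the number and its values to be separated by a tab, here is a minimal sketch of one way to get plain tab-separated lines instead of a data frame. It is not the answerer's code: it assumes the rows strictly alternate between a number and a set of values once the "[[1]]" rows are dropped, and the names `clean`, `ind`, `val`, and `out` are made up for illustration.

```r
# Sample data mirroring the 'df1' structure above
df1 <- data.frame(V1 = c('[1] 789', '[[1]]',
                         '[1] "PNG"        "D115"    "DX06"    "Slz"',
                         '[1] 787', '[[1]]', '[1] "D010"           "HC"',
                         '[1] 949', '[[1]]', '[1] "HC" "DX06"'),
                  stringsAsFactors = FALSE)

# Drop quotes and the leading "[1]" / "[[1]]" index, then trim whitespace
clean <- trimws(gsub('"|^.*\\]', '', df1$V1))
clean <- clean[clean != '']          # discard the now-empty "[[1]]" rows

# Assumption: what is left strictly alternates number, values, number, ...
ind <- clean[c(TRUE, FALSE)]                     # odd positions: numbers
val <- gsub('\\s+', ',', clean[c(FALSE, TRUE)])  # even positions: values

out <- paste(ind, val, sep = '\t')   # one tab-separated line per record
writeLines(out)
```

Passing a file name as the second argument, for example `writeLines(out, "clean.tsv")` (a hypothetical name), would write the lines to disk instead of printing them. If the input can contain records without a values row, the alternation assumption breaks and the grouping approach above is the safer choice.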

#2


Honestly, a command-line fix using sed, perl, or egrep -o is less painful:

sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv 
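
To see what that substitution does to each line, the same pattern can be tried from within R. This is only a rough sketch: the `lines` vector is hypothetical sample input standing in for the contents of dirty.csv, and `sub()` is used here as an R stand-in for the sed `s///` command.

```r
# Hypothetical raw lines, standing in for the contents of dirty.csv
lines <- c('[1] 789', '[[1]]', '[1] "PNG"        "D115"', '[1] 787')

# Same pattern as the sed command: delete everything up to the last "]"
# plus any spaces or tabs that follow it
stripped <- sub('.*\\][ \t]*', '', lines)
print(stripped)
```

On this sample, the pattern removes only the index prefix: the "[[1]]" line becomes an empty string and the double quotes are left in place, so the quotes and the pairing of each number with its values would still need a further step.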
