How to remove [1]s, [[1]] and double quotes from CSV data in R?

Time: 2022-06-01 12:42:55

I have a CSV file that contains the output of some previous R operations, so it is filled with index markers (such as [1] and [[1]]). When it is read into R, it looks like this, for example:


1                                                                                                           [1] 789
2                                                                                                             [[1]]
3                                                           [1] "PNG"        "D115"    "DX06"    "Slz"
4                                                                                                           [1] 787
5                                                                                                             [[1]]
6                                                                       [1] "D010"           "HC"
7                                                                                                           [1] 949
8                                                                                                             [[1]]
9                                                                       [1] "HC" "DX06"          

(I don't know why there is so much wasted space between the row number and the output data.)


I need the above data to appear as follows (without [1], [[1]], or the double quotes, and with each set of values placed beside its corresponding number):


789 PNG,D115,DX06,Slz
787 D010,HC
949 HC,DX06

(Ideally, 789 and its corresponding data PNG,D115,DX06,Slz would be separated by a tab, and likewise for each row.)


How can I achieve this in R?


2 solutions


We could create a grouping variable ('indx') and split the 'V1' column by it, after removing the leading bracketed index (such as [1] or [[1]]) and the double quotes within each string. Assuming the first column should hold the numeric element and the second column the non-numeric part, we can use a regex to replace the spaces with commas (as shown in the expected result) and then rbind the list elements.


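 # Group id: increments on the row just before each "[[1]]" (1,1,1,2,2,2,3,3,3 for the sample data)
 indx <- cumsum(c(grepl('\\[\\[', df1$V1)[-1], FALSE))
 # Strip the quotes and the leading "[1]"/"[[1]]", then build one row per group
 trim <- function(s) gsub('^\\s+|\\s+$', '', s)
 do.call(rbind, lapply(split(gsub('"|^.*\\]', '', df1$V1), indx),
          function(x) data.frame(ind = trim(x[1]),
     val = gsub('\\s+', ',', trim(x[-1][x[-1] != ''])))))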

 #   ind               val
 #1  789 PNG,D115,DX06,Slz
 #2  787           D010,HC
 #3  949           HC,DX06


 # Sample data used above (define df1 before running the code)
 df1 <- structure(list(V1 = c("[1] 789", "[[1]]", 
 "[1] \"PNG\"        \"D115\"    \"DX06\"    \"Slz\"", 
 "[1] 787", "[[1]]", "[1] \"D010\"           \"HC\"", "[1] 949", 
 "[[1]]", "[1] \"HC\" \"DX06\"")), .Names = "V1", 
 class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", 
 "7", "8", "9"))


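Honestly, a command-line fix using sed, perl, or egrep -o is less painful. The sed command below strips everything up to and including the last ] on each line, plus any spaces or tabs that follow (the \t inside the bracket expression is a GNU sed extension):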

sed -e 's/.*\][ \t]*//' dirty.csv > clean.csv 
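The sed command above only removes the index prefix; the quotes remain and each number is still on a separate line from its values. A minimal sketch of one way to finish the job on the command line follows. It assumes that dirty.csv holds the raw printed lines exactly as in the sample (recreated here for illustration), that values contain neither spaces nor ], and that every number is followed by exactly one line of values. The function name clean is hypothetical.

```shell
# Hypothetical sample input, mirroring the rows shown in the question.
cat > dirty.csv <<'EOF'
[1] 789
[[1]]
[1] "PNG"        "D115"    "DX06"    "Slz"
[1] 787
[[1]]
[1] "D010"           "HC"
[1] 949
[[1]]
[1] "HC" "DX06"
EOF

clean() {
  # sed: drop everything through the last "]" plus following whitespace,
  #      then drop double quotes and trailing whitespace.
  # awk: skip blank lines (the former "[[1]]" rows), turn runs of blanks
  #      into commas, and pair each number with the values line after it,
  #      separated by a tab.
  sed -e 's/.*\][[:space:]]*//' -e 's/"//g' -e 's/[[:space:]]*$//' "$1" |
    awk 'NF { gsub(/[ \t]+/, ",")
              if (key == "") { key = $0 }
              else { print key "\t" $0; key = "" } }'
}

clean dirty.csv > clean.csv
cat clean.csv
```

The result can then be read back into R as a tab-separated file. If the real file also carries row numbers or CSV quoting from write.csv, the sed expressions would need adjusting first.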

