How to skip invalid rows when reading a data frame from a file in R?

Time: 2023-01-23 09:14:05

I have a large file which contains lots of data, and I'd like to read it into a data frame, but I found some invalid rows. These invalid rows cause read.table to break. I tried the following method to skip the invalid lines, but the performance seems very bad.

counts <- count.fields(textConnection(lines), sep = "\001")
raw_data <- read.table(textConnection(lines[counts == 34]), sep = "\001")
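
For reference, a self-contained version of what I am doing now (the file name below is a placeholder, and 34 is the number of fields a valid row should have):

lines <- readLines("data.txt")                                   # placeholder file name
counts <- count.fields(textConnection(lines), sep = "\001")      # fields per line
cat(sum(counts != 34), "invalid lines will be skipped\n")        # report the dropped rows
raw_data <- read.table(textConnection(lines[counts == 34]), sep = "\001")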

Is there any better way to achieve this? Thanks

2 Answers

#1 (18 votes)

Using @PaulHiemstra's sample data:

read.table("test.csv", sep = ";", fill=TRUE)

Then you can take care of the NAs as you wish.

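A minimal sketch of that cleanup, assuming the short rows are the only source of NAs after fill = TRUE:

dat <- read.table("test.csv", sep = ";", fill = TRUE)
dat <- dat[complete.cases(dat), ]   # keep only rows where every field was present
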
#2 (5 votes)

What you could do is iterate over the lines in the file, and only add the lines that have the correct length.

I defined the following test csv file:

1;2;3;4
1;2;3;4
1;2;3
1;2;3;4

Using read.table fails:

> read.table("test.csv", sep = ";")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :                                                                       
  line 3 did not have 4 elements   

Now an iterative approach:

require(plyr)          # for ldply
no_lines = 4           # number of lines in the test file
correct_length = 4     # expected number of fields per line
file_con = file("test.csv", "r")
result = ldply(1:no_lines, function(line) {
   # read one line and split it on the separator
   dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
   if(length(dum) == correct_length) {
     return(dum)
   } else {
     cat(sprintf("Skipped line %s\n", line))
     return(NULL)   # NULL results are dropped by ldply
   }
  })
close(file_con)

> result
  V1 V2 V3 V4
1  1  2  3  4
2  1  2  3  4
3  1  2  3  4

Of course this is a trivial example, as the file is really small. Let us create a more challenging example to act as a benchmark.

# First file, with invalid rows: each row has either 3 or 4 fields
norow = 10e5    # number of rows (1 million)
no_lines = round(runif(norow, min = 3, max = 4))   # fields per row
no_lines[1] = correct_length                       # make sure the first row has all 4 fields
file_content = ldply(no_lines, function(line) paste(1:line, collapse = ";"))
writeLines(file_content[[1]], "big_test.csv")      # writeLines puts one element per line

# Second file of the same size, with only valid rows
file_content = ldply(rep(4, norow), function(line) paste(1:line, collapse = ";"))
writeLines(file_content[[1]], "big_normal.csv")

Now for the benchmark:

# Iterative approach
system.time({file_con <- file("big_test.csv", "r")
    result_test <- ldply(1:norow, function(line) {
       dum = strsplit(readLines(file_con, n = 1), split = ";")[[1]]
       if(length(dum) == correct_length) {
         return(dum)
       } else {
         # Commenting this speeds up by 30%
         #cat(sprintf("Skipped line %s\n", line))
         return(NULL)
       }
      })
    close(file_con)})
   user  system elapsed 
 20.559   0.047  20.775

# Normal read.table
system.time(result_normal <- read.table("big_normal.csv", sep = ";"))
   user  system elapsed 
  1.060   0.015   1.079 

# read.table with fill = TRUE
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill = TRUE)
             complete_rows <- complete.cases(result_fill)    # TRUE for rows without NAs
             result_fill <- result_fill[complete_rows, ]})   # keep only the complete rows
   user  system elapsed 
  1.161   0.033   1.203 

# Specifying which type the columns are (e.g. character or numeric)
# using the colClasses argument.
system.time({result_fill <- read.table("big_test.csv", sep = ";", fill = TRUE, 
                                        colClasses = rep("numeric", 4))
             complete_rows <- complete.cases(result_fill)
             result_fill <- result_fill[complete_rows, ]})   # keep only the complete rows
   user  system elapsed 
  0.933   0.064   1.001

So the iterative approach is quite a bit slower, but 20 seconds for 1 million rows might be acceptable (although this depends on your definition of acceptable), especially when you only have to do this once and then save the result with save for later retrieval. The solution suggested by @Paolo is almost as fast as the normal call to read.table: rows that contain the wrong number of columns (and thus NAs) are eliminated using complete.cases. Specifying the classes of the columns improves the performance further, and I think this effect will be bigger as the number of columns and rows grows.

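A quick sketch of doing the expensive filtering once and saving the result (the file name is a placeholder):

save(result_test, file = "cleaned_data.RData")   # placeholder file name

# in a later session, reload the cleaned data frame instead of re-parsing the file
load("cleaned_data.RData")
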
So in conclusion, the best option is to use read.table with fill = TRUE while specifying the classes of the columns. The iterative approach using ldply is only a good option if you want more flexibility in how the lines are read, e.g. only reading a line if a certain value is above a threshold. But that could probably be done more quickly by reading all the data into R and then creating a subset. Only when the data is bigger than your RAM could I imagine the iterative approach having its merits.

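A minimal sketch of that read-everything-then-subset idea (the column used and the threshold are placeholders):

# Read the whole file, padding short rows with NAs, then drop the padded rows
dat <- read.table("big_test.csv", sep = ";", fill = TRUE,
                  colClasses = rep("numeric", 4))
dat <- dat[complete.cases(dat), ]

# Keep only the rows of interest, e.g. rows whose first column exceeds a threshold
dat_subset <- dat[dat$V1 > 2, ]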