Quickly reading very large tables as dataframes in R.

Time: 2023-01-18 18:07:35

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.


I know that reading in a table as a list using scan() can be quite fast, e.g.:


datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:


df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?


8 Answers

#1


340  

An update, several years later


This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:


  1. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer. (A minimal usage sketch also appears after this list.)


  2. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).


  3. read.csv.raw from iotools provides a third option for quickly reading CSV files.


  4. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.


  5. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.

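As a concrete illustration of option 1, here is a minimal fread sketch for the kind of file described in the question (tab-delimited, no header, known column types). The file name, column names, and classes below are assumptions for illustration only:

library(data.table)
# Hypothetical file: tab-delimited, no header, four columns as in the question
dt <- fread("myfile.tsv", sep = "\t", header = FALSE,
            col.names  = c("url", "popularity", "mintime", "maxtime"),
            colClasses = c("character", "numeric", "integer", "integer"))
setDF(dt)   # optional: convert to a plain data.frame in place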


The original answer


There are a couple of simple things to try, whether you use read.table or scan.


  1. Set nrows=the number of records in your data (nmax in scan).


  2. Make sure that comment.char="" to turn off interpretation of comments.


  3. Explicitly define the classes of each column using colClasses in read.table.


  4. Setting multi.line=FALSE may also improve performance in scan. (A short sketch of these settings follows this list.)

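As a rough sketch of these settings applied to the question's file (the file name and 30-million-row count are taken from the question; the column classes are assumptions):

# scan: known types, no header, known record count, no comments, single-line records
datalist <- scan("myfile", sep = "\t",
                 what = list(url = "", popularity = 0, mintime = 0, maxtime = 0),
                 nmax = 3e7, comment.char = "", multi.line = FALSE)

# read.table with the equivalent hints
df <- read.table("myfile", sep = "\t", header = FALSE,
                 colClasses = c("character", "numeric", "integer", "integer"),
                 nrows = 3e7, comment.char = "")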

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.


The other alternative is filtering your data before you read it into R.


Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS, so that next time you can retrieve it faster with readRDS.

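A small sketch of that caching pattern (the file names are placeholders):

# First run: parse the text file once, then cache it in R's native binary format
df <- read.table("myfile", sep = "\t", header = FALSE,
                 colClasses = c("character", "numeric", "integer", "integer"))
saveRDS(df, "myfile.rds")

# Later runs: restore the cached object, which is much faster than re-parsing
df <- readRDS("myfile.rds")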

#2


247  

Here is an example that utilizes fread from data.table 1.8.7


The examples come from the help page for fread, with the timings on my Windows XP Core 2 Duo E8400.


library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
                 b=sample(1:1000,n,replace=TRUE),
                 c=rnorm(n),
                 d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                 e=rnorm(n),
                 f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]

standard read.table

write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
## File size (MB): 51 

system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   24.71    0.15   25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   17.85    0.07   17.98

optimized read.table

system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                          stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                          colClasses=c("integer","integer","numeric",                        
                                       "character","numeric","integer")))


##    user  system elapsed 
##   10.20    0.03   10.32

fread

require(data.table)
system.time(DT <- fread("test.csv"))                                  
##    user  system elapsed 
##    3.12    0.01    3.22

sqldf

require(sqldf)

system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             

##    user  system elapsed 
##   12.49    0.09   12.69

# sqldf as on SO

f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

##    user  system elapsed 
##   10.21    0.47   10.73

ff / ffdf

require(ff)

system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
##    user  system elapsed 
##   10.85    0.10   10.99

In summary:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

#3


240  

I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.


There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.


Here's my test code:


Set up the test data:


bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:


library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night but it never completed:


system.time(big.df <- read.csv('bigdf.csv'))

#4


67  

Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:


attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"

This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.

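Combining this with the scan() call from the question gives a rough end-to-end sketch (the file and column names are the question's; this is only an illustration of the trick, not a packaged function):

# scan() returns a named list, so names() are already set
df <- scan("myfile", sep = "\t",
           what = list(url = "", popularity = 0, mintime = 0, maxtime = 0))
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"          # now a data.frame, with no copy of the data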

[As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]

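A minimal illustration of that column-wise binary approach, using a made-up file name and a single numeric column:

# Write one column as raw doubles (done once, by whatever produces the data)
x <- rnorm(1e6)
con <- file("popularity.bin", "wb")
writeBin(x, con)
close(con)

# Read it back; n must be known (or be an upper bound) and the type must match
con <- file("popularity.bin", "rb")
popularity <- readBin(con, what = "double", n = 1e6)
close(con)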

#5


30  

This was previously asked on R-Help, so that's worth reviewing.


One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.

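A rough sketch of that idea for the question's tab-delimited file (the file name and column handling are assumptions; readChar needs the number of characters to read, here taken from the file size):

# Slurp the whole file as one string, then split into rows and fields
raw    <- readChar("myfile", file.info("myfile")$size)
rows   <- strsplit(raw, "\n", fixed = TRUE)[[1]]
fields <- strsplit(rows, "\t", fixed = TRUE)
url        <- sapply(fields, `[`, 1)
popularity <- as.numeric(sapply(fields, `[`, 2))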

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):


str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\n"
cat(str)
cols = list(key='',val=0)
con <- textConnection(str, open = "r")
hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.

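For instance, a plain chunked-import sketch with read.table on a connection (the chunk size and file name are arbitrary; each chunk would be processed or aggregated rather than accumulated in memory):

con <- file("myfile", "r")
repeat {
  chunk <- tryCatch(
    read.table(con, sep = "\t", nrows = 100000,
               colClasses = c("character", "numeric", "integer", "integer")),
    error = function(e) NULL)   # read.table errors once the connection is exhausted
  if (is.null(chunk)) break
  # ... process or aggregate the chunk here ...
}
close(con)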

#6


5  

A minor additional point worth mentioning: if you have a very large file, you can calculate the number of rows on the fly (if the file has no header) using the following (where bedGraph is the name of your file in your working directory):


>numRow=as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern=T))

You can then use that in read.csv, read.table, etc.:


>system.time((BG=read.table(bedGraph, nrows=numRow, col.names=c('chr', 'start', 'end', 'score'),colClasses=c('character', rep('integer',3)))))
   user  system elapsed 
 25.877   0.887  26.752 
>object.size(BG)
203949432 bytes

#7


4  

Oftentimes I think it is just good practice to keep larger datasets inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find that I want R to create and hold memory-intensive graphs only while I query from multiple databases. With 32 GB laptops on the horizon, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs may still be useful. Some advantages are:


(1) The data stays loaded in your database. You simply reconnect in pgadmin to the databases you want when you turn your laptop back on.


(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.


# Looking at Voter/Registrant Age by Decade

library(RPostgreSQL);library(lattice)

con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                 port="2345", host="localhost", dbname="WC2014_08_01_2014")

Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)

with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

#8


0  

Instead of the conventional read.table, I feel fread is a faster function. Specifying additional attributes, such as selecting only the required columns and specifying colClasses and stringsAsFactors, will reduce the time taken to import the file.


data_frame <- fread("filename.csv", sep = ",", header = FALSE,
                    stringsAsFactors = FALSE, select = c(1, 4, 5, 6, 7),
                    colClasses = c("numeric", "character", "numeric", "Date", "factor"))
