R could not allocate memory on ff procedure. How come?

I'm working on a 64-bit Windows Server 2008 machine with Intel Xeon processor and 24 GB of RAM. I'm having trouble trying to read a particular TSV (tab-delimited) file of 11 GB (>24 million rows, 20 columns). My usual companion, read.table, has failed me. I'm currently trying the package ff, through this procedure:

> df <- read.delim.ffdf(file       = "data.tsv",
+                       header     = TRUE,
+                       VERBOSE    = TRUE,
+                       first.rows = 1e3,
+                       next.rows  = 1e6,
+                       na.strings = c("", NA),
+                       colClasses = c("NUMERO_PROCESSO" = "factor"))

This works fine for about 6 million records, but then I get an error:

read.table.ffdf 1..1000 (1000) csv-read=0.14sec ffdf-write=0.2sec
read.table.ffdf 1001..1001000 (1000000) csv-read=240.92sec ffdf-write=67.32sec
read.table.ffdf 1001001..2001000 (1000000) csv-read=179.15sec ffdf-write=94.13sec
read.table.ffdf 2001001..3001000 (1000000) csv-read=792.36sec ffdf-write=68.89sec
read.table.ffdf 3001001..4001000 (1000000) csv-read=192.57sec ffdf-write=83.26sec
read.table.ffdf 4001001..5001000 (1000000) csv-read=187.23sec ffdf-write=78.45sec
read.table.ffdf 5001001..6001000 (1000000) csv-read=193.91sec ffdf-write=94.01sec
read.table.ffdf 6001001..
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

If I'm not mistaken, R is complaining of lack of memory to read the data, but wasn't the read...ffdf procedure supposed to circumvent heavy memory usage when reading data? What could I be doing wrong here?

1 Answer

#1

(I realize this is an old question, but I had the same problem and spent two days looking for the solution. This seems as good a place as any to document what I eventually figured out for posterity.)

The problem isn't that you are running out of available memory. The problem is that you've hit the memory limit for a single string. From help('Memory-limits'):

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins cannot allocate vector of length. The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.
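
The 2048 Mb figure in the error message lines up with this per-string limit. A quick check, assuming "Mb" in R's message means 2^20 bytes and that the reported figure is exact rather than rounded (both are assumptions, not something the error output states):

```r
# Per-string byte limit quoted from help('Memory-limits'): 2^31 - 1.
string_limit_bytes <- 2^31 - 1

# The failed allocation was reported as 2048 Mb.
# Assumption: 1 Mb = 1024^2 bytes and the figure is not rounded.
failed_alloc_bytes <- 2048 * 1024^2

# How far past the limit the request was, in bytes.
overshoot <- failed_alloc_bytes - string_limit_bytes
overshoot

# The same limit is R's largest representable integer.
string_limit_bytes == .Machine$integer.max
```

Under those assumptions the request is one byte past the limit, which is what you would expect from a string buffer that kept growing until it could not grow any further.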

In my case (and, it appears, in yours as well) I didn't bother to set the quote character, since I was dealing with tab-separated data and assumed it didn't matter. However, somewhere in the middle of the data set I had a string with an unmatched quote, and read.table happily ran right past the end of the line and on to the next, and the next, and the next... until it hit the limit for the size of a string and blew up.
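
If you want to confirm that an unmatched quote is the culprit before changing the read call, one way is to scan the file for lines with an odd number of double quotes. This is only a sketch: `find_odd_quote_lines` and its `chunk_size` argument are names made up here, and it checks only `"` because that is the default quote character for read.delim (read.table also treats `'` as a quote by default). It reads in chunks so an 11 GB file never has to fit in memory at once.

```r
# Hypothetical helper: return the line numbers that contain an odd number
# of double-quote characters, reading the file chunk_size lines at a time.
find_odd_quote_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  offset <- 0L          # number of lines consumed so far
  hits <- integer(0)    # line numbers with an odd quote count
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    # Count literal '"' characters on each line.
    n_quotes <- lengths(regmatches(lines, gregexpr('"', lines, fixed = TRUE)))
    hits <- c(hits, offset + which(n_quotes %% 2L == 1L))
    offset <- offset + length(lines)
  }
  hits
}

# Usage on a small made-up file: line 3 has a stray opening quote,
# line 4 has a balanced pair.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tname",
             "1\talpha",
             "2\t\"beta",
             "3\tsay \"hi\"",
             "4\tgamma"), tmp)
hits <- find_odd_quote_lines(tmp, chunk_size = 2L)
hits
```

A quoted field that legitimately spans several lines would also be flagged, so treat the result as a list of places to look rather than a list of errors.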

The solution was to explicitly set quote = "" in the argument list.
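
Here is a small base-R illustration of the effect, using a made-up five-line file rather than the real data. With quoting disabled, every physical line becomes one record and the stray quote is kept as ordinary text; with the default quote setting, the reader treats everything after the stray quote as part of a single field.

```r
# A tiny tab-separated file whose third line has an unmatched double quote.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tname",
             "1\talpha",
             "2\t\"beta",    # stray opening quote, never closed
             "3\tgamma",
             "4\tdelta"), tmp)

# quote = "" turns quote handling off: one record per line.
ok <- read.delim(tmp, quote = "", stringsAsFactors = FALSE)
nrow(ok)
ok$name[2]

# Default quoting for comparison; the warning about the unterminated
# quoted string is suppressed so the two row counts can be compared.
bad <- suppressWarnings(read.delim(tmp, stringsAsFactors = FALSE))
nrow(bad)
```

In the call from the question, the same fix would mean adding quote = "" next to header and na.strings; read.delim.ffdf passes arguments it does not use itself on to the underlying read.delim call.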
