Computing n-grams on a large corpus with R and Quanteda

Time: 2023-01-31 16:57:26

I am trying to build n-grams from a large text corpus (object size about 1 GB in R) using the great Quanteda package. I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) to do the computation.


If I sample the data down into pieces, the code works and I get a (partial) dfm of n-grams of various sizes, but when I try to run the code on the whole corpus, I unfortunately hit memory limits at this corpus size and get the following error (example code for unigrams, single words):


> dfm(corpus, verbose = TRUE, stem = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features: 
Error: cannot allocate vector of size 1024.0 Mb

In addition: Warning messages:
1: In unique.default(allFeatures) :
  Reached total allocation of 11984Mb: see help(memory.size)

It gets even worse if I try to build n-grams with n > 1:


> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
     ignoredFeatures = stopwords("english"),
     removePunct = TRUE, removeNumbers = TRUE)

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage  19925140 is too close to the limit

I found this related post, but it looks like it was an issue with dense matrix coercion that was later solved, and it doesn't help in my case.


Are there better ways to handle this with a limited amount of memory, without having to break the corpus data into pieces?


[EDIT] As requested, sessionInfo() data:


> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3      quanteda_0.9.4  

loaded via a namespace (and not attached):
 [1] magrittr_1.5    R6_2.1.2        assertthat_0.1  Matrix_1.2-3    rsconnect_0.4.2 DBI_0.3.1      
 [7] parallel_3.2.3  tools_3.2.3     Rcpp_0.12.3     stringi_1.0-1   grid_3.2.3      chron_2.3-47   
[13] lattice_0.20-33 ca_0.64

2 Answers

#1


3  

Yes, there is, and it is exactly by breaking it into pieces, but hear me out. Instead of importing the whole corpus, import a piece of it (if it is in multiple files, import file by file; if it is in one giant txt file, fine, use readLines). Compute your n-grams, store them in another file, read the next file/line, and store the n-grams again. This is more flexible and will not run into RAM issues (it will take quite a bit more space than the original corpus, of course, depending on the value of n). Later, you can access the n-grams from the files as usual.

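A minimal sketch of this chunk-wise approach, assuming the raw text sits in a single plain-text file ("corpus.txt" is a placeholder name) and using the current quanteda API (tokens()/tokens_ngrams()), which differs from the 0.9.x calls shown in the question:

library(quanteda)

chunk_size <- 100000                     # lines per chunk; tune to your RAM
con <- file("corpus.txt", open = "r")    # placeholder file name
i <- 0

repeat {
  txt <- readLines(con, n = chunk_size)
  if (length(txt) == 0) break
  i <- i + 1

  toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_remove(toks, stopwords("english"))
  ngrms <- tokens_ngrams(toks, n = 2, concatenator = " ")

  # write this chunk's counts to disk instead of keeping everything in memory
  saveRDS(dfm(ngrms), sprintf("ngrams_chunk_%03d.rds", i))
}
close(con)

Later you can read the chunk files back with readRDS() and aggregate the counts, for example by summing the named vectors returned by colSums() on each chunk's dfm.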

Update as per comment.


As for loading, sparse matrices/arrays sound like a good idea, and come to think of it, they might be a good idea for storage too (particularly if you happen to be dealing with bigrams only). If your data is that big, you'll probably have to look into indexing anyway (that should help with storage: instead of storing the words in the bigrams, index all words and store the index tuples). But it also depends on what your "full n-gram model" is supposed to be for. If it's to look up the conditional probability of (a relatively small number of) words in a text, then you could just do a search (grep) over the stored n-gram files; I'm not sure the indexing overhead would be justified for such a simple task. If you actually need all 12 GB worth of n-grams in a model, and the model has to calculate something that cannot be done piece by piece, then you still need a cluster/cloud.

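To make the indexing idea concrete, here is a hedged sketch (data.table is already attached in the question's sessionInfo; bigram_counts and its column names are made-up placeholders standing in for counts read back from the chunk files):

library(data.table)

# toy bigram counts; in practice these would come from the stored chunk files
bigram_counts <- data.table(
  w1 = c("the", "the", "of"),
  w2 = c("cat", "dog", "course"),
  n  = c(10L, 7L, 25L)
)

# index every distinct word once
vocab <- data.table(word = sort(unique(c(bigram_counts$w1, bigram_counts$w2))))
vocab[, id := .I]

# store (id1, id2, n) integer tuples instead of the word strings
indexed <- merge(bigram_counts, vocab, by.x = "w1", by.y = "word")
setnames(indexed, "id", "id1")
indexed <- merge(indexed, vocab, by.x = "w2", by.y = "word")
setnames(indexed, "id", "id2")
indexed <- indexed[, .(id1, id2, n)]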

But one more piece of general advice, one that I frequently give to students as well: start small. Instead of the full 12 GB, train and test on small subsets of the data. It saves you a ton of time while you are figuring out the exact implementation and ironing out bugs - particularly if you happen to be unsure about how these things work.

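For example, with corpus_sample() from recent quanteda releases (the question's corpus object is called corpus), prototyping on a random subset could look like this:

set.seed(123)
small <- corpus_sample(corpus, size = 50000)   # e.g. prototype on 50k documents
# develop and debug the full pipeline on `small`, then scale up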

#2


1  

Probably too late now, but I had a very similar problem recently (n-grams, R, Quanteda and a large text source). I searched for two days and could not find a satisfactory solution, posted on this forum and others, and didn't get an answer. I knew I had to chunk the data and combine the results at the end, but couldn't work out how to do the chunking. In the end I found a somewhat inelegant solution that worked, and answered my own question in the following post here


I sliced up the corpus using the 'tm' package's VCorpus, then fed the chunks to quanteda using the corpus() function.

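A hedged sketch of that approach (names like all_texts and n_chunks are placeholders I am introducing here; it assumes quanteda::corpus() accepts a tm VCorpus and that rbind() on dfm objects matches features by name):

library(tm)
library(quanteda)

n_chunks <- 10
grp <- cut(seq_along(all_texts), breaks = n_chunks, labels = FALSE)

dfm_list <- lapply(split(all_texts, grp), function(txt) {
  vc <- VCorpus(VectorSource(txt))   # tm corpus for this chunk
  qc <- corpus(vc)                   # convert to a quanteda corpus
  dfm(tokens_ngrams(tokens(qc, remove_punct = TRUE), n = 2))
})

# combine the per-chunk results at the end
combined <- do.call(rbind, dfm_list)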

I thought I would post it here since it provides a code solution. Hopefully, it will save others from spending two days searching.

