
Time: 2023-01-31 16:57:26

I am trying to build n-grams from a large text corpus (object size about 1 GB in R) using the great quanteda package. I don't have a cloud resource available, so I am doing the computation on my own laptop (Windows and/or Mac, 12 GB RAM).


If I sample the data down into pieces, the code works and I get a (partial) dfm of n-grams of various sizes. When I run the code on the whole corpus, however, I hit memory limits and get the following error (example code for unigrams, i.e. single words):


> dfm(corpus, verbose = TRUE, stem = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features: 
Error: cannot allocate vector of size 1024.0 Mb

In addition: Warning messages:
1: In unique.default(allFeatures) :
  Reached total allocation of 11984Mb: see help(memory.size)

It gets even worse if I try to build n-grams with n > 1:


> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
     ignoredFeatures = stopwords("english"),
     removePunct = TRUE, removeNumbers = TRUE)

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage  19925140 is too close to the limit

I found this related post, but it looks like it was an issue with dense matrix coercion that was later solved, and it doesn't help in my case.


Are there better ways to handle this with a limited amount of memory, without having to break the corpus data into pieces?


[EDIT] As requested, sessionInfo() data:


> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3      quanteda_0.9.4  

loaded via a namespace (and not attached):
 [1] magrittr_1.5    R6_2.1.2        assertthat_0.1  Matrix_1.2-3    rsconnect_0.4.2 DBI_0.3.1      
 [7] parallel_3.2.3  tools_3.2.3     Rcpp_0.12.3     stringi_1.0-1   grid_3.2.3      chron_2.3-47   
[13] lattice_0.20-33 ca_0.64

2 Answers



Yes, there is: precisely by breaking it into pieces, but hear me out. Instead of importing the whole corpus, import one piece at a time (if it is spread over several files, import file by file; if it is one giant txt file, read it in batches of lines with readLines). Compute the n-grams for that piece, store them in another file, read the next file or batch of lines, and store its n-grams again. This is more flexible and will not run into RAM issues (the stored n-grams will of course take up quite a bit more space than the original corpus, depending on the value of n). Later, you can access the n-grams from the files as usual.
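
A minimal base-R sketch of that loop, under stated assumptions: the corpus is one plain-text file with one document per line, and the tokenizer is a naive stand-in (lowercase, letters only, split on whitespace) rather than quanteda's own. The names ngrams_from_line(), count_ngrams_chunked() and merge_counts(), the chunk file names and the chunk size are all hypothetical, chosen only for illustration.

```r
# Build the n-grams of one line with a naive tokenizer (an assumption,
# not quanteda's tokenizer): lowercase, keep letters, split on spaces.
ngrams_from_line <- function(line, n = 2) {
  words <- strsplit(gsub("[^a-z ]+", " ", tolower(line)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Read the input in batches of `chunk_size` lines so that only one batch
# is in memory at a time; write each batch's n-gram counts to its own CSV.
count_ngrams_chunked <- function(in_file, out_dir, n = 2, chunk_size = 10000) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  con <- file(in_file, open = "r")
  on.exit(close(con))
  chunk_id <- 0
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break            # end of file
    chunk_id <- chunk_id + 1
    grams <- unlist(lapply(lines, ngrams_from_line, n = n), use.names = FALSE)
    if (length(grams) == 0) next             # nothing to store for this batch
    counts <- as.data.frame(table(ngram = grams), responseName = "count",
                            stringsAsFactors = FALSE)
    write.csv(counts, file.path(out_dir, sprintf("chunk_%05d.csv", chunk_id)),
              row.names = FALSE)
  }
  invisible(chunk_id)                        # number of batches read
}

# Combine the per-chunk files, summing the counts of identical n-grams
# after each file so the running table stays as small as possible.
merge_counts <- function(out_dir) {
  files <- list.files(out_dir, pattern = "^chunk_.*\\.csv$", full.names = TRUE)
  total <- NULL
  for (f in files) {
    part <- read.csv(f, colClasses = c("character", "integer"))
    total <- aggregate(count ~ ngram, data = rbind(total, part), FUN = sum)
  }
  total
}
```

Calling count_ngrams_chunked("corpus.txt", "chunks", n = 2) and then merge_counts("chunks") would give one table of bigram counts; "corpus.txt" and "chunks" are placeholder paths. The merged table still has to fit in memory, so for a very large vocabulary it may be better to query the chunk files directly.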


Update as per comment.


As for loading, sparse matrices/arrays sound like a good idea. Come to think of it, they might be a good idea for storage too (particularly if you are dealing with bigrams only). If your data is that big, you will probably have to look into indexing anyway, which should help with storage: instead of storing the words of each bigram, index all words and store the index tuples. But it also depends on what your "full n-gram model" is supposed to be for. If it is for looking up the conditional probability of a relatively small number of words in a text, then you could just do a search (grep) over the stored n-gram files; I'm not sure the indexing overhead would be justified for such a simple task. If you actually need all 12GB worth of n-grams in a model, and the model has to calculate something that cannot be done piece by piece, then you still need a cluster/cloud.
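
To make the indexing idea concrete, here is a rough sketch using the Matrix package (already loaded in the asker's session). It assumes the text has already been tokenized into a list of character vectors, one per document; bigram_matrix() and cond_prob() are hypothetical helper names. Each bigram is stored as a (first word index, second word index) pair in a sparse count matrix, and a conditional probability is a cell divided by its row total.

```r
library(Matrix)

# Build a sparse bigram count matrix: rows index the first word,
# columns the second word. Assumes at least one document has two tokens.
bigram_matrix <- function(token_list) {
  vocab <- sort(unique(unlist(token_list, use.names = FALSE)))
  pairs <- lapply(token_list, function(tokens) {
    if (length(tokens) < 2) return(NULL)
    idx <- match(tokens, vocab)              # words -> integer indices
    cbind(idx[-length(idx)], idx[-1])        # consecutive index pairs
  })
  pairs <- do.call(rbind, pairs)
  stopifnot(!is.null(pairs))
  # sparseMatrix() adds up the x values of repeated (i, j) pairs,
  # which turns the list of pairs into counts.
  sparseMatrix(i = pairs[, 1], j = pairs[, 2], x = rep(1, nrow(pairs)),
               dims = c(length(vocab), length(vocab)),
               dimnames = list(vocab, vocab))
}

# Estimate P(w2 | w1) as count(w1 w2) / count(w1 followed by anything).
cond_prob <- function(m, w1, w2) {
  row_total <- sum(m[w1, ])
  if (row_total == 0) return(NA_real_)       # w1 never starts a bigram
  m[w1, w2] / row_total
}
```

For example, with m <- bigram_matrix(list(c("the", "cat", "sat"), c("the", "dog"))), cond_prob(m, "the", "cat") looks up one cell and one row sum without ever materializing a dense vocabulary-by-vocabulary matrix.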


One more piece of general advice, which I frequently give to students as well: start small. Instead of 12GB, train and test on small subsets of the data. This saves you a ton of time while you figure out the exact implementation and iron out bugs, particularly if you are unsure about how these things work.




Probably too late now, but I had a very similar problem recently (n-grams, R, quanteda and a large text source). I searched for two days and could not find a satisfactory solution; I posted on this forum and others and didn't get an answer. I knew I had to chunk the data and combine the results at the end, but couldn't work out how to do the chunking. In the end I found a somewhat inelegant solution that worked, and answered my own question in the following post here


I sliced up the corpus using VCorpus from the 'tm' package, then fed the chunks to quanteda using the corpus() function.


I thought I would post it because that answer includes the code. Hopefully, it will save others from spending two days searching.




