Out of memory error when processing some XML with Scala

Date: 2021-08-16 22:00:57

I have broken a wiki XML dump into many small parts of 1 MB each and tried to clean them (after they were cleaned with another program by somebody else).

I get an out-of-memory error which I don't know how to solve. Can anyone enlighten me?

I get the following error message:


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:212)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:235)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:48)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:252)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:292)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:645)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1541)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1237)
    at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:234)
    at qa.main.ja.Indexing$$anonfun$5$$anonfun$apply$4.apply(SearchDocument.scala:224)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:224)
    at qa.main.ja.Indexing$$anonfun$5.apply(SearchDocument.scala:220)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) 

Where line 234 is as follows:


writer.addDocument(document)

It is adding some documents to Lucene.

and where line 224 is as follows:


for (doc <- target_xml \\ "doc") yield {

It is the first line of a for loop that adds various elements as fields to the index.

Is it a code problem, a settings problem, or a hardware problem?

EDIT


Hi, this is my for loop:


(for (knowledgeFile <- knowledgeFiles) yield {
  System.err.println(s"processing file: ${knowledgeFile}")
  val target_xml = XML.loadString("    <file>" + cleanFile(knowledgeFile).mkString + "</file>")
  for (doc <- target_xml \\ "doc") yield {
    val id = (doc \ "@id").text
    val title = (doc \ "@title").text
    val text = doc.text
    val document = new Document()
    document.add(new StringField("id", id, Store.YES))
    document.add(new TextField("text", new StringReader(title + text)))
    writer.addDocument(document)
    val xml_doc = <page><title>{ title }</title><text>{ text }</text></page>
    id -> xml_doc
  }
}).flatten.toArray

The inner loop just iterates over every doc element, and the outer loop iterates over every file. Is the nesting the source of the problem?
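One thing worth noting about the nested for/yield: because the result goes through `.flatten.toArray`, every `id -> xml_doc` pair for every file is kept on the heap at once, even though indexing only needs one document at a time. A minimal sketch (hypothetical names, not the original code) of the difference between a strict `map` over an `Array` and a lazy `Iterator`:

```scala
// Strict vs. lazy transformation: Array#map materialises the whole result,
// while Iterator#map defers work until the iterator is consumed, so only the
// current element stays live on the heap.
object StrictVsLazy {
  // Strict: the function runs for every element immediately; the full result
  // collection exists before anything else happens.
  def evaluationsStrict(files: Array[String]): Int = {
    var n = 0
    val results = files.map { f => n += 1; f.toUpperCase }
    n
  }

  // Lazy: Iterator#map only records the transformation; the counter shows
  // that nothing has been evaluated yet.
  def evaluationsBeforeConsuming(files: Array[String]): Int = {
    var n = 0
    val pending = files.iterator.map { f => n += 1; f.toUpperCase }
    n
  }
}
```

If the `id -> xml_doc` pairs are not actually used after indexing, replacing the outer `yield` with a plain `foreach` (or iterating lazily as above) would let each file's documents become garbage-collectable as soon as they are written, instead of accumulating until the final `.toArray`.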

Below is the cleanFile function for reference:


def cleanFile(fileName: String): Array[String] = {
  val tagRe = """<\/?doc.*?>""".r
  val lines = Source.fromFile(fileName).getLines.toArray
  val outLines = new Array[String](lines.length)
  for ((line, lineNo) <- lines.zipWithIndex) yield {
    if (tagRe.findFirstIn(line) != None) {
      outLines(lineNo) = line
    } else {
      outLines(lineNo) = StringEscapeUtils.escapeXml11(line)
    }
  }
  outLines
}
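As a side note, `cleanFile` never closes the `Source`, so each processed file leaks a file handle. A minimal sketch of a leak-free variant, assuming Scala 2.13's `scala.util.Using`; the escaping here is a simplified stand-in for `StringEscapeUtils.escapeXml11`, and the names are illustrative:

```scala
import scala.io.Source
import scala.util.Using

object CleanFileSketch {
  private val tagRe = """</?doc.*?>""".r

  // Pass <doc ...> / </doc> tag lines through untouched; escape everything
  // else so it becomes well-formed XML text. Simplified escaping: only the
  // three characters that must always be escaped in XML content.
  def cleanLine(line: String): String =
    if (tagRe.findFirstIn(line).isDefined) line
    else line.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

  // Using.resource guarantees the Source is closed even if reading throws.
  def cleanFile(fileName: String): Array[String] =
    Using.resource(Source.fromFile(fileName)) { src =>
      src.getLines().map(cleanLine).toArray
    }
}
```

The intermediate `outLines` array and the discarded `yield` result of the original are also gone here; `getLines().map(...)` transforms each line as it is read.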

Thanks again


1 Answer

#1



It looks like you should try increasing the heap size with the -Xmx JVM argument.
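For example (a hypothetical invocation: the class name is taken from the stack trace, and the heap size and classpath are placeholder values to tune for your machine):

```shell
# Raise the maximum heap to 4 GB for a direct JVM launch.
java -Xmx4g -cp "classes:lib/*" qa.main.ja.Indexing

# If the program is run through sbt, forward the flag to the JVM:
sbt -J-Xmx4g run
```

Note the capitalization: the flag is `-Xmx`, not `-xmx`.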
