Too many open files error in Lucene

Date: 2021-07-27 03:39:13

The project I'm working on indexes a certain amount of data (with long texts) and compares it against a list of words at each interval (about every 15 to 30 minutes).

After some time, say the 35th round, while starting to index a new set of data on the 36th round, this error occurred:

    [ERROR] (2011-06-01 10:08:59,169) org.demo.service.LuceneService.countDocsInIndex(?:?) : Exception on countDocsInIndex: 
    java.io.FileNotFoundException: /usr/share/demo/index/tag/data/_z.tvd (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:69)
        at org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:90)
        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.<init>(NIOFSDirectory.java:91)
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
        at org.apache.lucene.index.TermVectorsReader.<init>(TermVectorsReader.java:81)
        at org.apache.lucene.index.SegmentReader$CoreReaders.openDocStores(SegmentReader.java:299)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:580)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:556)
        at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:736)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:428)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:274)
        at org.demo.service.LuceneService.countDocsInIndex(Unknown Source)
        at org.demo.processing.worker.DataFilterWorker.indexTweets(Unknown Source)
        at org.demo.processing.worker.DataFilterWorker.processTweets(Unknown Source)
        at org.demo.processing.worker.DataFilterWorker.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:636)

I've already tried setting the maximum number of open files with:

        ulimit -n <number>

But after some time, when an interval contains about 1050 rows of long texts, the same error occurs. It has only happened once so far, though.

Should I follow the advice from "(Too many open files) - SOLR" and modify Lucene IndexWriter's mergeFactor, or is this an issue with the amount of data being indexed?

I've also read that it's a choice between batch indexing and interactive indexing. How does one determine whether indexing is interactive, just by the frequency of updates? Should I categorize this project as interactive indexing then?

UPDATE: Here is a snippet of my IndexWriter setup:

        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);

It seems like maxMerge (or is it the field length?) is already set to unlimited.

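For reference, this is roughly what adjusting the merge factor would look like on my writer if I go that route. This is only a sketch, assuming the Lucene 3.0 convenience setter on IndexWriter; the class and method names are made up, and 10 is just the default value, shown for illustration:

        import java.io.File;
        import java.io.IOException;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class WriterSetup {
            // Hypothetical helper: builds the same writer as in the snippet above,
            // plus an explicit merge factor.
            public static IndexWriter openWriter(File indexDir) throws IOException {
                Directory dir = FSDirectory.open(indexDir);
                IndexWriter writer = new IndexWriter(dir,
                        new StandardAnalyzer(Version.LUCENE_30),
                        IndexWriter.MaxFieldLength.UNLIMITED);
                // A lower mergeFactor keeps fewer segments (and thus fewer files)
                // around at once, at the cost of more frequent merges; 10 is the default.
                writer.setMergeFactor(10);
                return writer;
            }
        }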

3 Answers

#1 (score: 2)

I had already used ulimit, but the error still showed up. Then I inspected the customized core adapters for the Lucene functions. It turns out there were too many directories opened via IndexWriter that were LEFT OPEN.

I should note that, after processing, the code now always closes the directory it opened.

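Roughly, the pattern per interval ends up like the sketch below (the method and variable names are made up, not from the actual adapters):

        // Hypothetical per-interval indexing method: whatever happens in the middle,
        // the writer is closed in finally so its file handles are released.
        void indexInterval(Directory dir, List<Document> docs) throws IOException {
            IndexWriter writer = null;
            try {
                writer = new IndexWriter(dir,
                        new StandardAnalyzer(Version.LUCENE_30),
                        IndexWriter.MaxFieldLength.UNLIMITED);
                for (Document doc : docs) {
                    writer.addDocument(doc);
                }
            } finally {
                // Writers (and directories) left open are what exhaust the open-file limit.
                if (writer != null) {
                    writer.close();
                }
            }
        }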

#2 (score: 1)

You need to double-check whether the ulimit value has actually been persisted and set to a proper value (whatever the maximum is).

It is very likely that your app is not closing index readers/writers properly. I've seen many stories like this on the Lucene mailing list, and it was almost always the user's app that was to blame, not Lucene itself.

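For example, judging by the stack trace, countDocsInIndex opens a reader every interval; only the method name comes from the trace, but the body needs to look something like this sketch:

        public int countDocsInIndex(Directory dir) throws IOException {
            IndexReader reader = IndexReader.open(dir, true); // read-only reader
            try {
                return reader.numDocs();
            } finally {
                // Without this close, every call leaks the segment files the reader opened.
                reader.close();
            }
        }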

#3 (score: 0)

Use the compound file format to reduce the file count. When this flag is set, Lucene writes each segment as a single .cfs file instead of multiple files. This reduces the number of files significantly.

        IndexWriter.setUseCompoundFile(true)
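
In the context of the writer from the question, that would look roughly like this (assuming Lucene 3.0.x, where the setter still lives directly on IndexWriter):

        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Pack each new segment into a single .cfs file instead of separate
        // .fdt/.fdx/.tis/.tvd/... files, which sharply reduces the open-file count.
        writer.setUseCompoundFile(true);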
