Lucene doesn't index some words?

I use Lucene.Net for my site. It indexes some words correctly, but it doesn't index others, such as "الله"!

I inspected the index with Luke, and it shows that "الله" is not indexed.

I have used ArabicAnalyzer for indexing.

You can see my site at www.qoranic.com. Searching for "مریم" works, but searching for "الله" returns nothing.

Any ideas are appreciated. Thanks in advance.

1 solution

#1

The ArabicAnalyzer does some transformation to that input; it will transform the input الله to له. This is due to the usage of the ArabicStemFilter (and ArabicStemmer) which is documented with ...

Stemming is defined as:

Removal of attached definite article, conjunction, and prepositions.

Stemming of common suffixes.

This shouldn't be an issue since you should be parsing the user provided query through the same analyzer when searching, producing the same tokens.
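
As a rough sketch of that point, the snippet below parses a search string with the same ArabicAnalyzer that was used at index time, so the query term goes through the same stemming as the indexed text. It assumes Lucene.Net 3.0.x and its QueryParser class; the field name "content" is hypothetical and should be replaced with whatever field your index actually uses.

```csharp
using System;
using Lucene.Net.Analysis.AR;
using Lucene.Net.QueryParsers;

public static class QueryDemo {
    public static void Main() {
        var luceneVersion = Lucene.Net.Util.Version.LUCENE_30;

        // Use the same analyzer type (and version) that built the index.
        var analyzer = new ArabicAnalyzer(luceneVersion);

        // "content" is a hypothetical field name, not taken from the question.
        var parser = new QueryParser(luceneVersion, "content", analyzer);

        // The parser runs the query text through the analyzer,
        // so the resulting term is stemmed the same way as at index time.
        var query = parser.Parse("الله");

        // Print the parsed query to see which term will actually be searched.
        Console.WriteLine("Parsed query: {0}", query.ToString("content"));
    }
}
```

If the search code builds a TermQuery directly from the raw user input instead, the unstemmed term would not match anything in the index, which could explain the empty result.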

Here's the sample code I used to see what terms an analyzer produced from a given input.

using System;
using Lucene.Net.Analysis.AR;
using Lucene.Net.Analysis.Tokenattributes;
using System.IO;

namespace ConsoleApplication {
    public static class Program {
        public static void Main() {
            var luceneVersion = Lucene.Net.Util.Version.LUCENE_30;

            var input = "الله";
            var analyzer = new ArabicAnalyzer(luceneVersion);

            var inputReader = new StringReader(input);
            var stream = analyzer.TokenStream("fieldName", inputReader);

            var termAttribute = stream.GetAttribute<ITermAttribute>();
            while(stream.IncrementToken()) {
                Console.WriteLine("Term: {0}", termAttribute.Term);
            }

            Console.WriteLine("Done.");
            Console.ReadLine();
        }
    }
}

You can overcome this behavior (remove the stemming) by writing a custom Analyzer which uses the ArabicNormalizationFilter, just as ArabicAnalyzer does, but without the call to ArabicStemFilter.

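using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.AR;

// Tokenizes, lowercases, and normalizes Arabic text, but skips ArabicStemFilter.
public class CustomAnalyzer : Analyzer {
    public override TokenStream TokenStream(String fieldName, TextReader reader) {
        TokenStream result = new ArabicLetterTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter(result);
        return result; // no stemming applied, so terms keep their prefixes and suffixes
    }
}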
