Lucene doesn't index some words?

Time: 2020-12-22 19:15:44

I use Lucene.Net for my site. It indexes some words correctly, but it doesn't index others, such as "الله"!

I inspected the index with Luke, and it shows that "الله" is not indexed.

I used ArabicAnalyzer for indexing.

You can see my site at www.qoranic.com. Searching for "مریم" works, but searching for "الله" returns nothing.

Any ideas are appreciated. Thanks in advance.

1 solution

#1



ArabicAnalyzer transforms that input: it turns الله into له. This is due to its use of ArabicStemFilter (and ArabicStemmer), which is documented with ...

Stemming is defined as:

  • Removal of attached definite article, conjunction, and prepositions.
  • Stemming of common suffixes.

This shouldn't be a problem as long as you parse the user-provided query with the same analyzer at search time, so that the query produces the same tokens as the index.
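
As a rough illustration of that point, here is a minimal sketch of parsing a query with the same analyzer, assuming Lucene.Net 3.0.x with the contrib analyzers. The field name "content" is hypothetical; substitute whichever field you actually indexed.

```csharp
using System;
using Lucene.Net.Analysis.AR;
using Lucene.Net.QueryParsers;

public static class QueryExample {
    public static void Main() {
        var luceneVersion = Lucene.Net.Util.Version.LUCENE_30;

        // Use the same analyzer type that was used when building the index.
        var analyzer = new ArabicAnalyzer(luceneVersion);

        // "content" is a hypothetical field name (an assumption for this sketch).
        var parser = new QueryParser(luceneVersion, "content", analyzer);

        // The parser runs the analyzer over the query text, so the query is
        // built from the analyzed terms rather than from the raw input.
        var query = parser.Parse("الله");

        // Print the parsed query to see which term will actually be searched.
        Console.WriteLine(query.ToString("content"));
    }
}
```

If the printed term matches what Luke shows in the index, the search should find it. If the query is built without an analyzer (for example with a raw TermQuery), the unstemmed input will not match the stemmed term.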

Here's the sample code I used to see which terms an analyzer produces from a given input.

using System;
using Lucene.Net.Analysis.AR;
using Lucene.Net.Analysis.Tokenattributes;
using System.IO;

namespace ConsoleApplication {
    public static class Program {
        public static void Main() {
            var luceneVersion = Lucene.Net.Util.Version.LUCENE_30;

            var input = "الله";
            var analyzer = new ArabicAnalyzer(luceneVersion);

            var inputReader = new StringReader(input);
            var stream = analyzer.TokenStream("fieldName", inputReader);

            var termAttribute = stream.GetAttribute<ITermAttribute>();
            while(stream.IncrementToken()) {
                Console.WriteLine("Term: {0}", termAttribute.Term);
            }

            Console.WriteLine("Done.");
            Console.ReadLine();
        }
    }
}

You can avoid this behavior (that is, skip the stemming) by writing a custom Analyzer that uses ArabicNormalizationFilter, just as ArabicAnalyzer does, but omits the call to ArabicStemFilter.

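using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.AR;

public class CustomAnalyzer : Analyzer {
    public override TokenStream TokenStream(String fieldName, TextReader reader) {
        // Tokenize and normalize as ArabicAnalyzer does...
        TokenStream result = new ArabicLetterTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter(result);
        // ...but return before any ArabicStemFilter is applied.
        return result;
    }
}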
