Tips for Using the ES IK Analyzer

Date: 2023-12-06 15:29:02

A match query first analyzes the query text into tokens, then runs a term query for each token.

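By default, the results of the per-token term queries are combined as a union (operator or), so a document is returned as soon as any one token matches. The analyzer itself is specified in the mapping when the index is created, and only text fields can be given an analyzer.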

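Create the index and specify the analyzers (the mapping below uses a mapping type named default and the _all field, so it targets Elasticsearch 6.x or earlier):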

PUT mail_test3
{
  "settings": {
    "index": {
      "refresh_interval": "30s",
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "default": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": true
      },
      "properties": {
        "addressTude": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart",
          "copy_to": [
            "commonText"
          ],
          "fielddata": true
        },
        "captureTime": {
          "type": "long"
        },
        "commonText": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart",
          "fielddata": true
        },
        "commonNum": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart",
          "fielddata": true
        },
        "imsi": {
          "type": "keyword",
          "copy_to": ["commonNum"]
        },
        "uuid": {
          "type": "keyword"
        }
      }
    }
  }
}
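
To try the mapping out, a document can be indexed and then searched through the copy_to target. The requests below are a minimal sketch: the document ID and every field value are made-up sample data, and the default type in the URL assumes the typed mapping above (Elasticsearch 6.x or earlier). refresh=true is used so the document becomes searchable right away despite the 30s refresh_interval.

PUT mail_test3/default/1?refresh=true
{
  "addressTude": "湖北省宜昌市天台东二街",
  "captureTime": 1701847742000,
  "imsi": "460001234567890",
  "uuid": "sample-uuid-0001"
}

GET mail_test3/_search
{
  "query": {
    "match": {
      "commonText": "宜昌市"
    }
  }
}

Because addressTude is copied into commonText, the search should be able to find the document through commonText, provided ik_max_word emits 宜昌市 as a token for this address. Note that commonText will not appear in the returned _source: copy_to only affects what is indexed, not the stored document.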

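analyzer is the analyzer applied at index time; search_analyzer is the one applied to the query text at search time. Besides ik_max_word, the IK plugin provides a second analyzer, ik_smart. The difference between the two can be seen with the _analyze API: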

ik_smart analyzer:

GET mail_test3/_analyze
{
  "analyzer": "ik_smart",
  "text": "湖南省湘潭市*路96号-11-8"
}

Result:

{
"tokens": [
{
"token": "湖南省",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "湘潭市",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": "江",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 2
},
{
"token": "山路",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 3
},
{
"token": "96号",
"start_offset": 9,
"end_offset": 12,
"type": "TYPE_CQUAN",
"position": 4
},
{
"token": "11-8",
"start_offset": 13,
"end_offset": 17,
"type": "LETTER",
"position": 5
}
]
}

ik_max_word analyzer:

GET mail_test3/_analyze
{
  "analyzer": "ik_max_word",
  "text": "湖南省湘潭市*路96号-11-8"
}

Result:

{
"tokens": [
{
"token": "湖南省",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "湖南",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "省",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 2
},
{
"token": "湘潭市",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "湘潭",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "市",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 5
},
{
"token": "*",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 6
},
{
"token": "山路",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 7
},
{
"token": "96",
"start_offset": 9,
"end_offset": 11,
"type": "ARABIC",
"position": 8
},
{
"token": "号",
"start_offset": 11,
"end_offset": 12,
"type": "COUNT",
"position": 9
},
{
"token": "11-8",
"start_offset": 13,
"end_offset": 17,
"type": "LETTER",
"position": 10
},
{
"token": "11",
"start_offset": 13,
"end_offset": 15,
"type": "ARABIC",
"position": 11
},
{
"token": "8",
"start_offset": 16,
"end_offset": 17,
"type": "ARABIC",
"position": 12
}
]
}

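ik_max_word produces more tokens at a finer granularity, while ik_smart produces fewer, coarser tokens but segments the text more intelligently. A common strategy is to index with ik_max_word and search with ik_smart, which retrieves as many results as possible. And because, as noted above, a match query is ultimately executed as term queries, a document appears in the results as soon as any one query token matches.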
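
One way to see how a match query is rewritten is the validate API with explain=true, which returns a description of the low-level query instead of running the search. The request below is a sketch against the mail_test3 index from above; the query text is only an illustration, and the exact explanation string depends on the Elasticsearch version and on the tokens ik_smart produces.

GET mail_test3/_validate/query?explain=true
{
  "query": {
    "match": {
      "commonText": "湖南省湘潭市"
    }
  }
}

In the explanation, the query tokens should appear as separate clauses on commonText, one per token, which is the term-level form the match query is executed as.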

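If higher precision is required, add the operator parameter with the value and to the query. The results of the per-token term queries are then intersected, so a document must contain every query token to match, which improves precision as much as possible.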

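The results differ considerably between operator set to or and operator set to and; running the query both ways against the same data makes the difference clear: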

GET mail_test3/_search
{
  "query": {
    "match": {
      "commonText": {
        "query": "湖北省宜昌市天台东二街",
        "operator": "and"
      }
    }
  }
}
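
For comparison, the same query with the default behavior made explicit is sketched below. It is followed by a middle ground that the text above does not cover: minimum_should_match, a standard match parameter that requires only a share of the query tokens to match. The 75% value is an arbitrary illustration, not a recommendation.

GET mail_test3/_search
{
  "query": {
    "match": {
      "commonText": {
        "query": "湖北省宜昌市天台东二街",
        "operator": "or"
      }
    }
  }
}

GET mail_test3/_search
{
  "query": {
    "match": {
      "commonText": {
        "query": "湖北省宜昌市天台东二街",
        "minimum_should_match": "75%"
      }
    }
  }
}

Comparing the hit counts of the three variants on the same data gives a rough picture of the trade-off: or favors recall, and favors precision, and minimum_should_match sits somewhere in between.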