一、概要:
1.es默认的分词器对中文支持不好,会分割成一个个的汉字。ik分词器对中文的支持要好一些,主要由两种模式:ik_smart和ik_max_word
2.环境
操作系统:centos
es版本:6.0.0
二、安装插件
1.插件地址:https://github.com/medcl/elasticsearch-analysis-ik
2.运行命令行:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip
运行完成后会发现多了以下文件:esroot 下的plugins和config文件夹多了analysis-ik目录。
三、重启es
1.查找es进程
ps -ef | grep elastic
2.终止进程
从上面的结果可以看到es进程号是12776.
执行命令:
kill 12776
3.启动es后台运行
./bin/sh elastic search –d
提醒:重启es会重新分片,线上环境要注意了。
四、测试
1.使用ik_max_word分词
GET _analyze
{
"analyzer":"ik_max_word",
"text":"*国歌"
}
分词结果:
{
"tokens": [
{
"token": "*",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民*",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "*",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
}
]
}
2.使用ik_smart分词
GET _analyze
{
"analyzer":"ik_smart",
"text":"*国歌"
}
分词结果:
{
"tokens": [
{
"token": "*",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}
]
}
五、java api分词测试
1.调用ik_max_word分词
@Test
public void analyzer_ik_max_word() throws Exception {
java.lang.String text = "提前祝大家春节快乐!"; TransportClient client = EsClient.get();
AnalyzeRequest request = (new AnalyzeRequest()).analyzer("ik_max_word").text(text);
List<AnalyzeResponse.AnalyzeToken> tokens = client.admin().indices().analyze(request).actionGet().getTokens();
System.out.println(tokens.size());//
for (AnalyzeResponse.AnalyzeToken token : tokens) {
System.out.println(token.getTerm() + " ");
}
}
结果:
6
提前
祝
大家
春节快乐
春节
快乐
2.调用ik_smart分词
@Test
public void analyzer_ik_smart() throws Exception {
java.lang.String text = "提前祝大家春节快乐!"; TransportClient client = EsClient.get();
AnalyzeRequest request = (new AnalyzeRequest()).analyzer("ik_smart").text(text);
List<AnalyzeResponse.AnalyzeToken> tokens = client.admin().indices().analyze(request).actionGet().getTokens();
System.out.println(tokens.size());
for (AnalyzeResponse.AnalyzeToken token : tokens) {
System.out.println(token.getTerm() + " ");
}
}
结果:
4
提前
祝
大家
春节快乐