Key technical points of using Elasticsearch

Date: 2024-01-11 16:55:44

Preface

A recent project of mine uses a search engine; this post records some of the problems I ran into along the way and how I solved them.

0. Preparation

1) Install Elasticsearch

2) Install Marvel

3) Install the head plugin

Tip: the ES config file (/config/elasticsearch.yml) contains the HTTP port the node exposes; the default is 9200:

http.port: 9200

However, our server does not open port 9200, so the config file has to be changed; here it is changed to:

http.port: 8080

After that, both head and sense can be accessed at the following URLs:

http://IP_ADDRESS:8080/_plugin/head/

http://IP_ADDRESS:8080/_plugin/marvel/sense/
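A quick way to confirm the node is actually listening on the new port is to hit the root endpoint, which returns the cluster name and version info (a minimal sketch; IP_ADDRESS is a placeholder):

curl http://IP_ADDRESS:8080/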

1. Basic APIs: indexing, deleting, viewing settings, etc.

Indexing a document:

PUT /index_name/type_name/id
{
  "field1": "value1",
  "field2": "value2"
}
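As a concrete example against the index used later in this post, a document could be indexed like this (a sketch; the id and field values are made up, only the index, type, and field names come from this post):

PUT /user/youku/1
{
  "req_app_content_title": "起小点 TOP10 S3集锦",
  "req_app_content_keywords": "英雄联盟 起小点 TOP10 S3集锦"
}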

Deleting an index:

DELETE /index_name

Viewing an index's settings:

GET /user/_settings

This returns a result like the following:

{
  "user": {
    "settings": {
      "index": {
        "creation_date": "1437553188027",
        "uuid": "Ui-wLKGSS2y_bJwb71gLtA",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "version": {
          "created": "1060099"
        }
      }
    }
  }
}

Updating settings

To define new analyzers or filters, first _close the index, then _open it again after the update (note that an analyzer defined with its own tokenizer and filters must have type "custom"):

POST /user/_close

PUT /user/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase"]
      }
    }
  }
}

POST /user/_open
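Once the index is reopened, the new analyzer can be verified with the _analyze API (a sketch, shown as a curl call against the 1.x endpoint; the sample text is made up). With the standard tokenizer plus the lowercase filter it should produce the tokens quick, brown, and fox:

curl -XGET 'http://IP_ADDRESS:8080/user/_analyze?analyzer=my_analyzer' -d 'Quick Brown FOX'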

Search APIs

Searching for a single term:

GET /user/youku/_search?explain
{
  "query": {
    "term": {
      "req_app_content_keywords": {
        "value": "睫毛膏 裸妆"
      }
    }
  }
}

Searching for multiple terms:

GET /user/_search?explain
{
  "query": {
    "terms": {
      "req_app_content_keywords": [
        "睫毛膏",
        "裸妆"
      ]
    }
  }
}

2. Solving a problem: searching for Chinese words inside the keyword field

The problem is as follows:

Keyword to search for: 英雄联盟

req_app_content_keywords: 英雄联盟 起小点 TOP10 S3集锦

The default analyzer in Elasticsearch is standard, which handles English well; for Chinese, however, calling the following:

GET /user/_analyze?analyzer=standard
{
  "text": "明天会更好"
}

returns the following result (the "text" key itself also shows up as a token, since the 1.x _analyze API treats the request body as raw text to analyze):

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "明",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "天",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "会",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "更",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "好",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    }
  ]
}

So a search for a whole Chinese word finds nothing; only a single character produces a match:

GET /local/hello/_search
{
  "query": {
    "term": {
      "tags": {
        "value": "角"
      }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "local",
        "_type": "hello",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "tags": "角色",
          "name": "1234"
        }
      }
    ]
  }
}
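By contrast, a term query for the whole word would be expected to return no hits here, because the standard analyzer indexed the field as single characters and a term query is not analyzed (a sketch):

GET /local/hello/_search
{
  "query": {
    "term": {
      "tags": {
        "value": "角色"
      }
    }
  }
}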

Solution:

Define a mapping, so that each field can use a different index option (a small example follows the three options below):

analyzed

First analyze the string and then index it.

not_analyzed

Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.

no

Don't index this field at all. This field will not be searchable.
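As an illustration of these options, a field that should only ever be matched exactly can be mapped as not_analyzed (a minimal sketch; the index and field names here are hypothetical):

PUT /example_index
{
  "mappings": {
    "doc": {
      "properties": {
        "raw_tag": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}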

Since only the simplest tokenization is needed here (splitting on spaces), Elasticsearch's built-in whitespace analyzer can be used:

whitespace splits the input on whitespace.

The API for assigning analyzers to individual fields in the mapping is shown below.

The analyzer has to be set when the index is created; data added later is converted to the format defined at creation time, and an error is raised if the conversion fails.

PUT /user
{
  "mappings": {
    "youku": {
      "properties": {
        "tags": {
          "type": "string",
          "analyzer": "whitespace"
        },
        "req_app_content_keywords": {
          "type": "string",
          "analyzer": "whitespace"
        },
        "req_app_content_title": {
          "type": "string",
          "analyzer": "whitespace"
        }
      }
    }
  }
}
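With this mapping in place (and the documents reindexed), a term query for a whole space-separated keyword such as 英雄联盟 should now match, because the whitespace analyzer keeps each space-delimited word as a single term (a sketch):

GET /user/youku/_search
{
  "query": {
    "term": {
      "req_app_content_keywords": {
        "value": "英雄联盟"
      }
    }
  }
}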

3. Exploring how ES computes similarity

Term frequency (TF):

How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

tf(t in d) = √frequency

The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.

By this definition, since the terms in each of our fields are all unique, their TF is always 1.
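For example, plugging numbers into the formula: a term that appears four times in a field gets tf = √4 = 2, while a term that appears once gets tf = √1 = 1.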

Inverse document frequency (IDF):

How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms; in other words, IDF measures how common a term is across the whole index.

idf(t) = 1 + ln ( numDocs / (docFreq + 1))
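With made-up numbers: in an index of 1000 documents, a term that appears in 9 of them gets idf(t) = 1 + ln(1000 / (9 + 1)) = 1 + ln(100) ≈ 5.6, while a term that appears in 999 documents gets idf(t) = 1 + ln(1000 / 1000) = 1.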

Field-length norm

norm(d) = 1 / √numTerms
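For example, a field containing four terms gets norm(d) = 1 / √4 = 0.5, while a field with a single term gets norm(d) = 1, so shorter fields are boosted.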

In our case, norms are not useful, and disabling them can save a significant amount of memory.

So I think the norm should not be part of our scoring; for example, for video keywords, the number of keywords a video has should not affect its similarity score.

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "norms": { "enabled": false }
        }
      }
    }
  }
}