Elasticsearch Casual Notes (6): Mapping, Defining an Index

Date: 2024-01-03 20:59:56

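So far, ES has felt like a NoSQL database that supports a free schema.

Readers who have worked with Lucene or Solr may wonder at this point: how are the fields of a document defined? How are properties such as store, index, and analyzer configured?

This is where Mapping in ES comes in.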

[reference]

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html#mapping


Mapping is the process of defining how a document should be mapped to the Search Engine, including its searchable characteristics such as which fields are searchable and if/how they are tokenized. In ElasticSearch, an index may store documents of different "mapping types". ElasticSearch allows one to associate multiple mapping definitions for each mapping type.

Explicit mapping is defined on an index/type level. By default, there isn’t a need to define an explicit mapping, since one is automatically created and registered when a new type or new field is introduced (with no performance overhead) and has sensible defaults. Only when the defaults need to be overridden must a mapping definition be provided.

mapping types

Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database. Though there is separation between types, it’s not a full separation (all end up as a document within the same Lucene index).

Field names with the same name across types are highly recommended to have the same type and same mapping characteristics (analysis settings for example). There is an effort to allow to explicitly "choose" which field to use by using type prefix (my_type.my_field), but it’s not complete, and there are places where it will never work (like faceting on the field).

In practice though, this restriction is almost never an issue. The field name usually ends up being a good indication to its "typeness" (e.g. "first_name" will always be a string). Note also, that this does not apply to the cross index case.

global settings

The index.mapping.ignore_malformed global setting can be set on the index level to allow to ignore malformed content globally across all mapping types (malformed content example is trying to index a text string value as a numeric type).

The index.mapping.coerce global setting can be set on the index level to coerce numeric content globally across all mapping types (The default setting is true and coercions attempted are to convert strings with numbers into numeric types and also numeric values with fractions to any integer/short/long values minus the fraction part). When the permitted conversions fail in their attempts, the value is considered malformed and the ignore_malformed setting dictates what will happen next.
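The interplay between coerce and ignore_malformed can be sketched in a few lines of Python. This is only a toy model of the behaviour described above for an integer field, not Elasticsearch's implementation; the function name coerce_to_integer and its return convention (None for a skipped value) are assumptions made for illustration.

```python
def coerce_to_integer(value, coerce=True, ignore_malformed=False):
    """Toy model of coerce / ignore_malformed for an integer field.

    Returns the integer that would be indexed, or None when the
    malformed value is silently skipped.
    """
    # A genuine JSON integer is accepted as-is (bool is excluded on purpose).
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    try:
        if not coerce or isinstance(value, bool):
            raise ValueError("no permitted conversion")
        # Numeric strings and fractional numbers are converted,
        # and the fractional part is dropped.
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        if ignore_malformed:
            return None  # the malformed field is ignored
        raise ValueError(f"malformed value for integer field: {value!r}")


print(coerce_to_integer("12"))                          # numeric string
print(coerce_to_integer(7.9))                           # fraction is truncated
print(coerce_to_integer("abc", ignore_malformed=True))  # skipped
```

With ignore_malformed left at false, the last call would raise instead of returning None, which mirrors a rejected document.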


Fields

1)_uid

Each document indexed is associated with an id and a type, the internal _uid field is the unique identifier of a document within an index and is composed of the type and the id (meaning that different types can have the same id and still maintain uniqueness).

The _uid field is automatically used when _type is not indexed to perform type based filtering, and does not require the _id to be indexed.

【_uid = type + id, so different types can contain documents with the same id.】
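A minimal sketch of why composing the type and the id keeps documents unique. The '#' delimiter and the make_uid helper are assumptions used for illustration; the text above only states that _uid is composed of the type and the id.

```python
def make_uid(doc_type: str, doc_id: str) -> str:
    """Compose a unique key from type and id (delimiter is an assumption)."""
    return f"{doc_type}#{doc_id}"


# A dict stands in for one index: keys are uids, values are documents.
index = {}
index[make_uid("tweet", "1")] = {"message": "You know, for Search"}
index[make_uid("user", "1")] = {"name": "kimchy"}

# Both documents share id "1" but remain distinct entries.
print(sorted(index))
```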

2)_id

Each document indexed is associated with an id and a type. The _id field can be used to index just the id, and possibly also store it. By default it is not indexed and not stored (thus, not created).

Note, even though the _id is not indexed, all the APIs still work (since they work with the _uid field), as well as fetching by ids using term, terms or prefix queries/filters (including the specific ids query/filter).

【By default _id is neither indexed nor stored, so queries against it are served by _uid.】

The _id field can be enabled to be indexed, and possibly stored, using:

{
    "tweet":{
        "_id":{"index":"not_analyzed","store":false}
    }
}

The _id mapping can also be associated with a path that will be used to extract the id from a different location in the source document. For example, having the following mapping:

{
    "tweet":{
        "_id":{
            "path":"post_id"
        }
    }
}

Will cause 1 to be used as the id for:

{
    "message":"You know, for Search",
    "post_id":"1"
}

This does require an additional lightweight parsing step while indexing, in order to extract the id to decide which shard the index operation will be executed on.

3)_type

Each document indexed is associated with an id and a type. The type, when indexing, is automatically indexed into a _type field. By default, the _type field is indexed (but not analyzed) and not stored. This means that the _type field can be queried.

【Since there is a _type field that indexes the type, does every type-scoped search need to add a condition on the _type field?】

The _type field can be stored as well, for example:

{
    "tweet":{
        "_type":{"store":true}
    }
}

The _type field can also not be indexed, and all the APIs will still work except for specific queries (term queries / filters) or faceting done on the _type field.

{
    "tweet":{
        "_type":{"index":"no"}
    }
}

4)_source

The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing "fetch" requests, like get or search, the _source field is returned by default.

【Think of _source as the original (forward) text; when this field is enabled it takes up a considerable amount of extra space.】

Though very handy to have around, the source field does incur storage overhead within the index. For this reason, it can be disabled. For example:

{
    "tweet":{
        "_source":{"enabled":false}
    }
}

includes / excludes

Allow to specify paths in the source that would be included / excluded when it’s stored, supporting * as wildcard annotation. For example:

{
    "my_type":{
        "_source":{
            "includes":["path1.*","path2.*"],
            "excludes":["path3.*"]
        }
    }
}
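The effect of includes / excludes can be approximated in Python with shell-style wildcards. This is a rough sketch of the idea, not how Elasticsearch filters _source internally; the helpers flatten and filter_source are hypothetical, and the result is returned as a flat dict keyed by dotted path to keep the example short.

```python
from fnmatch import fnmatchcase


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def filter_source(doc, includes=(), excludes=()):
    """Keep leaves that match an include pattern (all leaves when no
    includes are given) and that match no exclude pattern."""
    kept = {}
    for path, value in flatten(doc):
        if includes and not any(fnmatchcase(path, p) for p in includes):
            continue
        if any(fnmatchcase(path, p) for p in excludes):
            continue
        kept[path] = value
    return kept


doc = {"path1": {"a": 1}, "path2": {"b": 2}, "path3": {"c": 3}, "other": 4}
print(filter_source(doc, ["path1.*", "path2.*"], ["path3.*"]))
```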

5)_all

The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size.

The _all fields can be completely disabled. Explicit field mappings and object mappings can be excluded / included in the _all field. By default, it is enabled and all fields are included in it for ease of use.

When disabling the _all field, it is a good practice to set index.query.default_field to a different value (for example, if you have a main "message" field in your data, set it to message).

【When the _all field is disabled, best practice is to specify a default search field via index.query.default_field.】

One of the nice features of the _all field is that it takes into account specific fields boost levels. Meaning that if a title field is boosted more than content, the title (part) in the _all field will mean more than the content (part) in the _all field.

Here is a sample mapping:

{
    "person":{
        "_all":{"enabled":true},
        "properties":{
            "name":{
                "type":"object",
                "dynamic":false,
                "properties":{
                    "first":{"type":"string","store":true,"include_in_all":false},
                    "last":{"type":"string","index":"not_analyzed"}
                }
            },
            "address":{
                "type":"object",
                "include_in_all":false,
                "properties":{
                    "first":{
                        "properties":{
                            "location":{"type":"string","store":true,"index_name":"firstLocation"}
                        }
                    },
                    "last":{
                        "properties":{
                            "location":{"type":"string"}
                        }
                    }
                }
            },
            "simple1":{"type":"long","include_in_all":true},
            "simple2":{"type":"long","include_in_all":false}
        }
    }
}

The _all field allows store, term_vector and analyzer (with specific index_analyzer and search_analyzer) to be set.

highlighting

For any field to allow highlighting it has to be either stored or part of the _source field. By default the _all field does not qualify for either, so highlighting for it does not yield any data.

Although it is possible to store the _all field, it is basically an aggregation of all fields, which means more data will be stored, and highlighting it might produce strange results.

6)_analyzer

The _analyzer mapping allows a document field property to be used as the name of the analyzer that will index the document. The analyzer will be used for any field that does not explicitly define an analyzer or index_analyzer when indexing.

Here is a simple mapping:

{
    "type1":{
        "_analyzer":{
            "path":"my_field"
        }
    }
}

The above will use the value of the my_field to lookup an analyzer registered under it. For example, indexing the following doc:

{
    "my_field":"whitespace"
}

Will cause the whitespace analyzer to be used as the index analyzer for all fields without explicit analyzer setting.

The default path value is _analyzer, so the analyzer can be driven for a specific document by setting the _analyzer field in it. If a custom json field name is needed, an explicit mapping with a different path should be set.

By default, the _analyzer field is indexed; it can be disabled by setting index to no in the mapping.

7)_boost

Boosting is the process of enhancing the relevancy of a document or field. Field level mapping allows to define an explicit boost level on a specific field. The boost field mapping (applied on the root object) allows to define a boost field mapping where its content will control the boost level of the document. For example, consider the following mapping:

{
    "tweet":{
        "_boost":{"name":"my_boost","null_value":1.0}
    }
}

The above mapping defines a mapping for a field named my_boost. If the my_boost field exists within the JSON document indexed, its value will control the boost level of the document indexed. For example, the following JSON document will be indexed with a boost value of 2.2:

{
    "my_boost":2.2,
    "message":"This is a tweet!"
}

function score instead of boost

Support for document boosting via the _boost field has been removed from Lucene and is deprecated in Elasticsearch as of v1.0.0.RC1. The implementation in Lucene resulted in unpredictable result when used with multiple fields or multi-value fields.

Instead, the Function Score Query can be used to achieve the desired functionality by boosting each document by the value in any field of the document:

{
    "query":{
        "function_score":{
            "query":{  
                "match":{
                    "title":"your main query"
                }
            },
            "functions":[{
                "script_score":{
                    "script":"doc['my_boost_field'].value"
                }
            }],
            "score_mode":"multiply"
        }
    }
}
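The arithmetic behind the query above can be sketched as follows, assuming a single script_score function whose result is multiplied into the score of the main query. The function boosted_score and the sample scores are made up for illustration; real scores come from Lucene.

```python
def boosted_score(query_score: float, doc: dict,
                  boost_field: str = "my_boost_field") -> float:
    """Multiply the main query score by the value stored in boost_field."""
    return query_score * float(doc[boost_field])


# Hypothetical hits: (score from the main query, document).
hits = [
    (1.5, {"title": "a", "my_boost_field": 1.0}),
    (0.5, {"title": "b", "my_boost_field": 4.0}),
]

# Re-rank by the boosted score, highest first.
ranked = sorted(hits, key=lambda h: boosted_score(h[0], h[1]), reverse=True)
print([doc["title"] for _, doc in ranked])
```

Document "b" matches the query less well, yet its larger boost value moves it ahead of "a".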

8)_parent

The parent field mapping is defined on a child mapping, and points to the parent type this child relates to. For example, in case of a blog type and a blog_tag type child document, the mapping for blog_tag should be:

{
    "blog_tag":{
        "_parent":{
            "type":"blog"
        }
    }
}

The mapping is automatically stored and indexed (meaning it can be searched on using the _parent field notation).

9)_routing

The routing field allows to control the _routing aspect when indexing data and explicit routing control is required.

store / index

The first thing the _routing mapping does is to store the routing value provided (store set to true) and index it (index set to not_analyzed). The reason why the routing is stored by default is so that reindexing data will be possible if the routing value is completely external and not part of the docs.

required

Another aspect of the _routing mapping is the ability to define it as required by setting required to true. This is very important to set when using routing features, as it allows different APIs to make use of it. For example, an index operation will be rejected if no routing value has been provided (or derived from the doc). A delete operation will be broadcast to all shards if no routing value is provided and _routing is required.

path

The routing value can be provided as an external value when indexing (and still stored as part of the document, in much the same way _source is stored). But, it can also be automatically extracted from the index doc based on a path. For example, having the following mapping:

{
    "comment":{
        "_routing":{
            "required":true,
            "path":"blog.post_id"
        }
    }
}

Will cause the following doc to be routed based on the 111222 value:

{
    "text":"the comment text",
    "blog":{
        "post_id":"111222"
    }
}

Note, using path without an explicit routing value provided requires an additional (though quite fast) parsing phase.
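The extra parsing phase amounts to walking the configured path inside the document before the operation is dispatched. A minimal sketch under that reading; extract_by_path and resolve_routing are hypothetical helpers, and raising ValueError stands in for a rejected index operation.

```python
def extract_by_path(doc, path):
    """Walk a dotted path such as 'blog.post_id'; return None if absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def resolve_routing(doc, path, explicit=None, required=False):
    """Prefer an explicit routing value, else derive it from the document."""
    routing = explicit if explicit is not None else extract_by_path(doc, path)
    if routing is None and required:
        raise ValueError("routing is required but was neither provided nor found")
    return routing


comment = {"text": "the comment text", "blog": {"post_id": "111222"}}
print(resolve_routing(comment, "blog.post_id", required=True))
```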

id uniqueness

When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed throughout all the shards that the index is composed of. In fact, documents with the same _id might end up in different shards if indexed with different _routing values.
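To see how the same _id can end up on two shards, here is a small simulation. The djb2-style hash and the modulo step are assumptions chosen so the example is deterministic; they are not presented as the exact hashing Elasticsearch applies.

```python
def djb2(text: str) -> int:
    """Simple deterministic string hash (illustrative stand-in)."""
    h = 5381
    for ch in text:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h


def shard_for(routing: str, num_shards: int) -> int:
    """Pick a shard from the routing value."""
    return djb2(routing) % num_shards


NUM_SHARDS = 5
shards = {n: {} for n in range(NUM_SHARDS)}  # shard number -> {id: document}


def index_doc(doc_id, body, routing=None):
    """Route by the custom value when given, otherwise by the id itself."""
    key = routing if routing is not None else doc_id
    shards[shard_for(key, NUM_SHARDS)][doc_id] = body


index_doc("1", {"v": "first"}, routing="a")
index_doc("1", {"v": "second"}, routing="b")

# The same id now exists on two different shards.
print([n for n, docs in shards.items() if "1" in docs])
```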

10)_index

The ability to store in a document the index it belongs to. By default it is disabled, in order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_index":{"enabled":true}
    }
}

11)_size

The _size field allows to automatically index the size of the original _source indexed. By default, it’s disabled. In order to enable it, set the mapping to:

【Records (indexes) the size of the _source field; it does not limit that size.】

{
    "tweet":{
        "_size":{"enabled":true}
    }
}

In order to also store it, use:

{
    "tweet":{
        "_size":{"enabled":true,"store":true}
    }
}

12)_timestamp

The _timestamp field allows to automatically index the timestamp of a document. It can be provided externally via the index request or in the _source. If it is not provided externally it will be automatically set to the date the document was processed by the indexing chain.

【Timestamp: if none is provided, one is generated automatically.】

enabled

By default it is disabled. In order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_timestamp":{"enabled":true}
    }
}

store / index

By default the _timestamp field has store set to false and index set to not_analyzed. It can be queried as a standard date field.

path

The _timestamp value can be provided as an external value when indexing. But, it can also be automatically extracted from the document to index based on a path. For example, having the following mapping:

{
    "tweet":{
        "_timestamp":{
            "enabled":true,
            "path":"post_date"
        }
    }
}

Will cause 2009-11-15T14:12:12 to be used as the timestamp value for:

{
    "message":"You know, for Search",
    "post_date":"2009-11-15T14:12:12"
}

Note, using path without an explicit timestamp value provided requires an additional (though quite fast) parsing phase.

format

You can define the date format used to parse the provided timestamp value. For example:

{
    "tweet":{
        "_timestamp":{
            "enabled":true,
            "path":"post_date",
            "format":"YYYY-MM-dd"
        }
    }
}

Note, the default format is dateOptionalTime. The timestamp value will first be parsed as a number and if it fails the format will be tried.
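The "number first, then format" order can be sketched like this. The function parse_timestamp is hypothetical, and the format string uses Python's strptime directives ("%Y-%m-%d") as a stand-in for the date pattern shown in the mapping; the result is expressed in epoch milliseconds, UTC.

```python
from datetime import datetime, timezone


def parse_timestamp(value, fmt="%Y-%m-%d"):
    """Return epoch milliseconds: try a plain number, then the date format."""
    try:
        return int(value)
    except (TypeError, ValueError):
        parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        return int(parsed.timestamp() * 1000)


print(parse_timestamp("86400000"))    # already a number of milliseconds
print(parse_timestamp("1970-01-02"))  # parsed with the date format
```

Both calls describe the same instant, one day after the epoch.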

13)_ttl

A lot of documents naturally come with an expiration date. Documents can therefore have a _ttl (time to live), which will cause the expired documents to be deleted automatically.

【ttl: time to live! It can be used to set the expiration time of a document.】

enabled

By default it is disabled, in order to enable it, the following mapping should be defined:

{
    "tweet":{
        "_ttl":{"enabled":true}
    }
}

store / index

By default the _ttl field has store set to true and index set to not_analyzed. Note that the index property has to be set to not_analyzed in order for the purge process to work.

default

You can provide a per index/type default _ttl value as follows:

{
    "tweet":{
        "_ttl":{"enabled":true,"default":"1d"}
    }
}

In this case, if you don’t provide a _ttl value in your query or in the _source, all tweets will have a _ttl of one day.

If you do not specify a time unit such as d (days), m (minutes), h (hours), ms (milliseconds) or w (weeks), milliseconds is used as the default unit.

If no default is set and no _ttl value is given then the document has an infinite _ttl and will not expire.
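The unit rules above can be written down as a short parser. This sketch handles only the units listed in the text; parse_ttl and its use of None for "never expires" are assumptions for illustration.

```python
UNIT_MILLIS = {
    "ms": 1,
    "m": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "w": 604_800_000,
}


def parse_ttl(value, default=None):
    """Return the ttl in milliseconds, or None for an infinite ttl."""
    if value is None:
        value = default      # fall back to the per index/type default
    if value is None:
        return None          # no value and no default: never expires
    text = str(value).strip()
    # "ms" must be checked before "m" so that "10ms" is not read as minutes.
    for unit in ("ms", "m", "h", "d", "w"):
        if text.endswith(unit):
            return int(text[: -len(unit)]) * UNIT_MILLIS[unit]
    return int(text)         # bare number: milliseconds


print(parse_ttl(None, default="1d"))
print(parse_ttl("500"))
```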

You can dynamically update the default value using the put mapping API. It won’t change the _ttl of already indexed documents but will be used for future documents.

note on documents expiration

Expired documents will be automatically deleted regularly. You can dynamically set the indices.ttl.interval to fit your needs. The default value is 60s.

The deletion orders are processed by bulk. You can set indices.ttl.bulk_size to fit your needs. The default value is 10000.

Note that the expiration procedure handles versioning properly, so if a document is updated between the collection of documents to expire and the delete order, the document won’t be deleted.


Types

1)core types

Each JSON field can be mapped to a specific core type. JSON itself already provides us with some typing, with its support for string, integer/long, float/double, boolean, and null.

The following sample tweet JSON document will be used to explain the core types:

{
    "tweet":{
        "user":"kimchy",
        "message":"This is a tweet!",
        "postDate":"2009-11-15T14:12:12",
        "priority":4,
        "rank":12.3
    }
}

Explicit mapping for the above JSON tweet can be:

{
    "tweet":{
        "properties":{
            "user":{"type":"string","index":"not_analyzed"},
            "message":{"type":"string","null_value":"na"},
            "postDate":{"type":"date"},
            "priority":{"type":"integer"},
            "rank":{"type":"float"}
        }
    }
}
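For contrast with the explicit mapping, here is a rough sketch of how a default type might be guessed from the JSON values alone when no mapping is supplied. The rules in guess_core_type (whole numbers become long, fractions become double, ISO-like strings become date) are a simplified assumption about the defaults, not a specification.

```python
from datetime import datetime


def guess_core_type(value):
    """Guess a core type name from a decoded JSON value (simplified)."""
    if isinstance(value, bool):      # must come before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


tweet = {
    "user": "kimchy",
    "message": "This is a tweet!",
    "postDate": "2009-11-15T14:12:12",
    "priority": 4,
    "rank": 12.3,
}
print({field: guess_core_type(value) for field, value in tweet.items()})
```

An explicit mapping is what lets priority be narrowed to integer and rank to float instead of the wider guesses.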

string

The text based string type is the most basic type, and contains one or more characters. An example mapping can be:

{
    "tweet":{
        "properties":{
            "message":{
                "type":"string",
                "store":true,
                "index":"analyzed",
                "null_value":"na"
            },
            "user":{
                "type":"string",
                "index":"not_analyzed",
                "norms":{
                    "enabled":false
                }
            }
        }
    }
}

The above mapping defines a string message property/field within the tweet type. The field is stored in the index (so it can later be retrieved using selective loading when searching), and it gets analyzed (broken down into searchable terms). If the message has a null value, then the value that will be stored is na. There is also a string user which is indexed as-is (not broken down into tokens) and has norms disabled (so that matching this field is a binary decision, no match is better than another one).

The following table lists all the attributes that can be used with the string type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to actually store the field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to analyzed for the field to be indexed and searchable after being broken down into tokens using an analyzer. not_analyzed means that it’s still searchable, but does not go through any analysis process and is not broken down into tokens. no means that it won’t be searchable at all (as an individual field; it may still be included in _all). Setting to no disables include_in_all. Defaults to analyzed.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

term_vector

Possible values are no, yes, with_offsets, with_positions, with_positions_offsets. Defaults to no.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

norms: {enabled: <value>}

Boolean value if norms should be enabled or not. Defaults to true for analyzed fields, and to false for not_analyzed fields. See the section about norms.

norms: {loading: <value>}

Describes how norms should be loaded, possible values are eager and lazy (default). It is possible to change the default value to eager for all fields by configuring the index setting index.norms.loading to eager.

index_options

Allows to set the indexing options, possible values are docs (only doc numbers are indexed), freqs (doc numbers and term frequencies), and positions (doc numbers, term frequencies and positions). Defaults to positions for analyzed fields, and to docs for not_analyzed fields. It is also possible to set it to offsets (doc numbers, term frequencies, positions and offsets).

analyzer

The analyzer used to analyze the text contents when analyzed during indexing and when searching using a query string. Defaults to the globally configured analyzer.

index_analyzer

The analyzer used to analyze the text contents when analyzed during indexing.

search_analyzer

The analyzer used to analyze the field when part of a query string. Can be updated on an existing field.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_above

The analyzer will ignore strings larger than this size. Useful for generic not_analyzed fields that should ignore long text.

position_offset_gap

Position increment gap between field instances with the same field name. Defaults to 0.

The string type also supports custom indexing parameters associated with the indexed value. For example:

{
    "message":{
        "_value":  "boosted value",
        "_boost":  2.0
    }
}

The mapping is required to disambiguate the meaning of the document. Otherwise, the structure would interpret "message" as a value of type "object". The key _value (or value) in the inner document specifies the real string content that should eventually be indexed. The _boost (or boost) key specifies the per field document boost (here 2.0).

norms

Norms store various normalization factors that are later used (at query time) in order to compute the score of a document relatively to a query.

Although useful for scoring, norms also require quite a lot of memory (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, it is highly recommended to disable norms on it. In particular, this is the case for fields that are used solely for filtering or aggregations.

Coming in 1.2.0.

In case you would like to disable norms after the fact, it is possible to do so by using the PUT mapping API. Please however note that norms won’t be removed instantly, but as your index will receive new insertions or updates, and segments get merged. Any score computation on a field that got norms removed might return inconsistent results since some documents won’t have norms anymore while other documents might still have norms.

number

A number based type supporting float, double, byte, short, integer, and long. It uses specific constructs within Lucene in order to support numeric values. The number types have the same ranges as corresponding Java types. An example mapping can be:

{
    "tweet":{
        "properties":{
            "rank":{
                "type":"float",
                "null_value":1.0
            }
        }
    }
}

The following table lists all the attributes that can be used with a numbered type:

Attribute Description

type

The type of the number. Can be float, double, integer, long, short, byte. Required.

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_malformed

Ignores a malformed number. Defaults to false.

coerce

Try to convert strings to numbers and truncate fractions for integers. Defaults to true.

token count

The token_count type maps to the JSON string type but indexes and stores the number of tokens in the string rather than the string itself. For example:

{
    "tweet":{
        "properties":{
            "name":{
                "type":"string",
                "fields":{
                    "word_count":{
                        "type":"token_count",
                        "store":"yes",
                        "analyzer":"standard"
                    }
                }
            }
        }
    }
}

All the configuration that can be specified for a number can be specified for a token_count. The only extra configuration is the required analyzer field which specifies which analyzer to use to break the string into tokens. For best performance, use an analyzer with no token filters.

Technically the token_count type sums position increments rather than counting tokens. This means that even if the analyzer filters out stop words they are included in the count.
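The difference between counting emitted tokens and summing position increments shows up as soon as stop words are removed. The following is a toy model with whitespace tokenization; count_tokens is a hypothetical helper, and counting a trailing gap is an assumption of this sketch.

```python
def count_tokens(text, stop_words=frozenset()):
    """Return (emitted_tokens, position_increment_sum) for a toy analyzer."""
    emitted = 0
    position_sum = 0
    gap = 0  # positions skipped because a stop word was removed
    for word in text.lower().split():
        if word in stop_words:
            gap += 1
            continue
        emitted += 1
        position_sum += 1 + gap  # the increment carries the skipped positions
        gap = 0
    position_sum += gap  # assumption: a trailing gap is counted as well
    return emitted, position_sum


print(count_tokens("the quick fox", stop_words={"the"}))
```

Only two tokens survive the stop filter, but the position sum still reflects all three words.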

date

The date type is a special type which maps to JSON string type. It follows a specific format that can be explicitly set. All dates are UTC. Internally, a date maps to a number type long, with the added parsing stage from string to long and from long to string. An example mapping:

{
    "tweet":{
        "properties":{
            "postDate":{
                "type":"date",
                "format":"YYYY-MM-dd"
            }
        }
    }
}

The date type will also accept a long number representing UTC milliseconds since the epoch, regardless of the format it can handle.

The following table lists all the attributes that can be used with a date type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

format

The date format. Defaults to dateOptionalTime.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

doc_values

Set to true to store field values in a column-stride fashion. Automatically set to true when the fielddata format is doc_values.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). If index is set to no this defaults to false, otherwise, defaults to true or to the parent object type setting.

ignore_malformed

Ignores a malformed number. Defaults to false.

boolean

The boolean type maps to the JSON boolean type. It ends up storing within the index either T or F, with automatic translation to true and false respectively.

{
    "tweet":{
        "properties":{
            "hes_my_special_tweet":{
                "type":"boolean",
                "type":"boolean"
            }
        }
    }
}

The boolean type also supports passing the value as a number or a string (in this case 0, an empty string, F, false, off and no are false, all other values are true).
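The truthiness rule reads naturally as a lookup. A minimal sketch, assuming the string "0" is treated like the number 0; parse_boolean is a hypothetical helper, not an Elasticsearch API.

```python
# Strings that are interpreted as false; every other string is true.
FALSE_STRINGS = {"", "0", "F", "false", "off", "no"}


def parse_boolean(value):
    """Interpret a JSON boolean, number or string as a boolean."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    raise TypeError(f"unsupported value: {value!r}")


for sample in (0, 1, "", "off", "no", "yes", "anything else"):
    print(repr(sample), "->", parse_boolean(sample))
```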

The following table lists all the attributes that can be used with the boolean type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. Setting to no disables include_in_all. If set to no the field should be either stored in _source, have include_in_all enabled, or store be set to true for this to be useful.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

binary

The binary type is a base64 representation of binary data that can be stored in the index. The field is not stored by default and not indexed at all.

{
    "tweet":{
        "properties":{
            "image":{
                "type":"binary",
                "type":"binary"
            }
        }
    }
}
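On the client side, a binary value has to be base64-encoded before it is placed in the JSON document. A minimal sketch using only the Python standard library; build_binary_doc is a hypothetical helper and the field name image follows the mapping above.

```python
import base64
import json


def build_binary_doc(raw: bytes, field: str = "image") -> str:
    """Return a JSON document whose field holds the base64 text of raw."""
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})


body = build_binary_doc(b"hi")
print(body)

# Decoding the stored text gives back the original bytes.
restored = base64.b64decode(json.loads(body)["image"])
print(restored)
```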

The following table lists all the attributes that can be used with the binary type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

fielddata filters

It is possible to control which field values are loaded into memory, which is particularly useful for faceting on string fields, using fielddata filters, which are explained in detail in the Fielddata section.

Fielddata filters can exclude terms which do not match a regex, or which don’t fall between a min and max frequency range:

{
    "tweet":{
        "type":      "string",
        "analyzer":  "whitespace",
        "fielddata":{
            "filter":{
                "regex":{
                    "pattern":        "^#.*"
                },
                "frequency":{
                    "min":              0.001,
                    "max":              0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}

These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.
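What such a filter keeps can be illustrated with a small simulation over one segment. This is a sketch under assumptions: frequency is taken as the share of the segment's documents that contain the term, the frequency check is skipped for segments smaller than min_segment_size, and filter_terms is a hypothetical helper rather than Elasticsearch code.

```python
import re


def filter_terms(term_doc_counts, segment_docs, pattern=None,
                 min_freq=None, max_freq=None, min_segment_size=0):
    """Return the set of terms that would be loaded for one segment.

    term_doc_counts maps each term to the number of documents in the
    segment that contain it; segment_docs is the segment's document count.
    """
    regex = re.compile(pattern) if pattern else None
    apply_frequency = segment_docs >= min_segment_size
    kept = set()
    for term, count in term_doc_counts.items():
        if regex and not regex.search(term):
            continue  # term does not match the regex
        if apply_frequency:
            freq = count / segment_docs
            if min_freq is not None and freq < min_freq:
                continue  # too rare
            if max_freq is not None and freq > max_freq:
                continue  # too common
        kept.add(term)
    return kept


counts = {"#es": 500, "#common": 5000, "#tiny": 5, "plain": 500}
print(filter_terms(counts, 10_000, pattern="^#.*",
                   min_freq=0.001, max_freq=0.1, min_segment_size=500))
```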

postings format

Postings formats define how fields are written into the index and how fields are represented in memory. Postings formats can be defined per field via the postings_format option. Postings formats are configurable. Elasticsearch has several builtin formats:

direct
A postings format that uses disk-based storage but loads its terms and postings directly into memory. Note this postings format is very memory intensive and has certain limitations that don’t allow segments to grow beyond 2.1GB; see DirectPostingsFormat for details.
memory
A postings format that stores its entire terms, postings, positions and payloads in a finite state transducer. This format should only be used for primary keys or with fields where each term is contained in a very low number of documents.
pulsing
A postings format that in-lines the posting lists for very low frequent terms in the term dictionary. This is useful to improve lookup performance for low-frequent terms.
bloom_default
A postings format that uses a bloom filter to improve term lookup performance. This is useful for primary keys or fields that are used as a delete key.
bloom_pulsing
A postings format that combines the advantages of bloom and pulsing to further improve lookup performance.
default
The default Elasticsearch postings format offering best general purpose performance. This format is used if no postings format is specified in the field mapping.

postings format example

On all field types it is possible to configure a postings_format attribute:

{
  "person":{
     "properties":{
         "second_person_id":{"type":"string","postings_format":"pulsing"}
     }
  }
}

On top of using the built-in postings formats it is possible to define custom postings formats. See codec module for more information.

doc values format

Doc values formats define how fields are written into column-stride storage in the index for the purpose of sorting or faceting. Fields that have doc values enabled will have special field data instances, which will not be uninverted from the inverted index, but directly read from disk. This makes _refresh faster and ultimately allows for having field data stored on disk depending on the configured doc values format.

Doc values formats are configurable. Elasticsearch has several builtin formats:

memory
A doc values format which stores data in memory. Compared to the default field data implementations, using doc values with this format will have similar performance but will be faster to load, making _refresh less time-consuming.
disk
A doc values format which stores all data on disk, requiring almost no memory from the JVM at the cost of a slight performance degradation.
default
The default Elasticsearch doc values format, offering good performance with low memory usage. This format is used if no format is specified in the field mapping.

doc values format example

On all field types, it is possible to configure a doc_values_format attribute:

{
  "product":{
     "properties":{
         "price":{"type":"integer","doc_values_format":"memory"}
     }
  }
}

On top of using the built-in doc values formats it is possible to define custom doc values formats. See codec module for more information.

similarity

Elasticsearch allows you to configure a similarity (scoring algorithm) per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default TF/IDF, such as BM25.

You can configure similarities via the similarity module.

configuring similarity per field

Defining the Similarity for a field is done via the similarity mapping property, as this example shows:

{
  "book":{
    "properties":{
      "title":{"type":"string","similarity":"BM25"}
    }
  }
}

The following Similarities are configured out-of-box:

default
The Default TF/IDF algorithm used by Elasticsearch and Lucene in previous versions.
BM25
The BM25 algorithm. See Okapi_BM25 for more details.

copy to field

Added in 1.0.0.RC2.

Adding copy_to parameter to any field mapping will cause all values of this field to be copied to fields specified in the parameter. In the following example all values from fields title and abstract will be copied to the field meta_data.

{
  "book":{
    "properties":{
      "title":{"type":"string","copy_to":"meta_data"},
      "abstract":{"type":"string","copy_to":"meta_data"},
      "meta_data":{"type":"string"}
    }
  }
}

Multiple fields are also supported:

{
  "book":{
    "properties":{
      "title":{"type":"string","copy_to":["meta_data","article_info"]}
    }
  }
}
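
What copy_to does at index time can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: apply_copy_to is a made-up helper that returns the values each field would be indexed with, and it accepts copy_to as either a single field name or a list, as in the two mappings above. The sample document is invented for illustration.

```python
def as_list(value):
    """Treat a single value and a list of values uniformly."""
    return value if isinstance(value, list) else [value]

def apply_copy_to(doc, properties):
    """Return the values each field would be indexed with after copy_to."""
    indexed = {name: list(as_list(value)) for name, value in doc.items()}
    for name, value in doc.items():
        targets = properties.get(name, {}).get("copy_to", [])
        for target in as_list(targets):
            # Values are appended to the target; the source document is untouched.
            indexed.setdefault(target, []).extend(as_list(value))
    return indexed

book_properties = {
    "title": {"type": "string", "copy_to": ["meta_data", "article_info"]},
    "abstract": {"type": "string", "copy_to": "meta_data"},
    "meta_data": {"type": "string"},
}
book = {"title": "Mapping", "abstract": "How fields are indexed"}
indexed_values = apply_copy_to(book, book_properties)
```

Here meta_data ends up with both the title and the abstract, while article_info only receives the title.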

multi fields

Added in 1.0.0.RC1.

The fields option allows several core type fields to be mapped onto a single JSON source field. This can be useful if a single field needs to be used in different ways, for example for both free text search and sorting.

{
  "tweet":{
    "properties":{
      "name":{
        "type":"string",
        "index":"analyzed",
        "fields":{
          "raw":{"type":"string","index":"not_analyzed"}
        }
      }
    }
  }
}

In the above example the field name gets processed twice. The first time it gets processed as an analyzed string, and this version is accessible under the field name name; this is the main field and is in fact just like any other field. The second time it gets processed as a not analyzed string and is accessible under the name name.raw.
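
The double processing can be modelled roughly as follows. The sketch is only illustrative: index_multi_field is a made-up helper, and lowercasing plus whitespace splitting stands in for whatever analyzer is really configured.

```python
def index_terms(value, index):
    """Produce the terms for one value, depending on the index option."""
    if index == "analyzed":
        # Stand-in for a real analyzer: lowercase and split on whitespace.
        return value.lower().split()
    if index == "not_analyzed":
        # The whole value becomes a single term.
        return [value]
    raise ValueError("unsupported index option: %r" % index)

def index_multi_field(field_name, value, mapping):
    """Return the terms indexed for the main field and each multi field."""
    terms = {field_name: index_terms(value, mapping.get("index", "analyzed"))}
    for sub_name, sub_mapping in mapping.get("fields", {}).items():
        full_name = field_name + "." + sub_name
        terms[full_name] = index_terms(value, sub_mapping.get("index", "analyzed"))
    return terms

name_mapping = {
    "type": "string",
    "index": "analyzed",
    "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
}
name_terms = index_multi_field("name", "Shay Banon", name_mapping)
```

The tokens under name suit free text search, while the single untouched term under name.raw suits sorting.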

include in all

The include_in_all setting is ignored on any field that is defined in the fields options. Setting include_in_all only makes sense on the main field, since it is the raw field value that is copied to the _all field; the tokens aren’t copied.

updating a field

In essence a field can’t be updated. However, multi fields can be added to existing fields. This allows, for example, a different index_analyzer configuration to be added alongside the index_analyzer configuration already specified in the main field and other multi fields.

Also, the new multi field will only be applied to documents that are added after the multi field has been added; the new multi field doesn’t exist in existing documents.

Another important note is that new multi fields are merged into the list of existing multi fields, so when adding new multi fields to a field, previously added multi fields don’t need to be specified.

accessing fields

deprecated in 1.0.0.

Use copy_to instead.

The multi fields defined in the fields are prefixed with the name of the main field and can be accessed by their full path using the navigation notation: name.raw, or using the typed navigation notation tweet.name.raw. The path option allows to control how fields are accessed. If the path option is set to full, then the full path of the main field is prefixed, but if the path option is set to just_name the actual multi field name without any prefix is used. The default value for the path option is full.

The just_name setting, among other things, allows indexing content of multiple fields under the same name. In the example below the content of both fields first_name and last_name can be accessed by using any_name or tweet.any_name.

{
  "tweet":{
    "properties":{
      "first_name":{
        "type":"string",
        "index":"analyzed",
        "path":"just_name",
        "fields":{
          "any_name":{"type":"string","index":"analyzed"}
        }
      },
      "last_name":{
        "type":"string",
        "index":"analyzed",
        "path":"just_name",
        "fields":{
          "any_name":{"type":"string","index":"analyzed"}
        }
      }
    }
  }
}

2)array type

JSON documents allow an array (list) of fields or objects to be defined. Mapping array types could not be simpler, since arrays get detected automatically and can be mapped with either Core Types or Object Type mappings. For example, the following JSON defines several arrays:

{
    "tweet":{
        "message":"some arrays in this tweet...",
        "tags":["elasticsearch","wow"],
        "lists":[
            {
                "name":"prog_list",
                "description":"programming list"
            },
            {
                "name":"cool_list",
                "description":"cool stuff list"
            }
        ]
    }
}

The above JSON has the tags property defining a list of a simple string type, and the lists property is an object type array. Here is a sample explicit mapping:

{
    "tweet":{
        "properties":{
            "message":{"type":"string"},
            "tags":{"type":"string","index_name":"tag"},
            "lists":{
                "properties":{
                    "name":{"type":"string"},
                    "description":{"type":"string"}
                }
            }
        }
    }
}

The fact that array types are automatically supported can be shown by the fact that the following JSON document is perfectly fine:

{
    "tweet":{
        "message":"some arrays in this tweet...",
        "tags":"elasticsearch",
        "lists":{
            "name":"prog_list",
            "description":"programming list"
        }
    }
}

Note also that, because we used the index_name to give the field its non-plural form (tag instead of tags), we can refer to the field using the index_name as well. For example, we can execute a query using tweet.tags:wow or tweet.tag:wow. (We could, of course, name the field tag and skip the index_name altogether.)

3)object type

JSON documents are hierarchical in nature, allowing them to define inner "objects" within the actual JSON. Elasticsearch completely understands the nature of these inner objects and can map them easily, providing query support for their inner fields. Because each document can have objects with different fields each time, objects mapped this way are known as "dynamic". Dynamic mapping is enabled by default. Let’s take the following JSON as an example:

{
    "tweet":{
        "person":{
            "name":{
                "first_name":"Shay",
                "last_name":"Banon"
            },
            "sid":"12345"
        },
        "message":"This is a tweet!"
    }
}

The above shows an example where a tweet includes the actual person details. A person is an object, with a sid, and a name object which has first_name and last_name. It’s important to note that tweet is also an object, although it is a special root object type which allows for additional mapping definitions.

The following is an example of explicit mapping for the above JSON:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "properties":{
                            "first_name":{"type":"string"},
                            "last_name":{"type":"string"}
                        }
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In order to mark a mapping of type object, set the type to object. This is an optional step, since if there are properties defined for it, it will automatically be identified as an object mapping.

properties

An object mapping can optionally define one or more properties using the properties tag for a field. Each property can be either another object, or one of the core_types.

dynamic

One of the most important features of Elasticsearch is its ability to be schema-less. This means that, in our example above, the person object can be indexed later with a new property (age, for example) and it will automatically be added to the mapping definitions. The same goes for the tweet root object.

This feature is by default turned on, and it’s the dynamic nature of each object mapped. Each object mapped is automatically dynamic, though it can be explicitly turned off:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "dynamic":false,
                        "properties":{
                            "first_name":{"type":"string"},
                            "last_name":{"type":"string"}
                        }
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In the above example, the name object mapped is not dynamic, meaning that if, in the future, we try to index JSON with a middle_name within the name object, it will get discarded and not added.

There is no performance overhead if an object is dynamic. The ability to turn it off is provided as a safety mechanism, so that "malformed" objects won’t, by mistake, index data that we do not wish to be indexed.

If a dynamic object contains yet another inner object, it will be automatically added to the index and mapped as well.

When processing dynamic new fields, their type is automatically derived. For example, if it is a number, it will automatically be treated as number core_type. Dynamic fields default to their default attributes, for example, they are not stored and they are always indexed.

Date fields are special since they are represented as a string. Date fields are detected if they can be parsed as a date when they are first introduced into the system. The set of date formats that are tested against can be configured using the dynamic_date_formats on the root object, which is explained later.

Note, once a field has been added, its type can not change. For example, if we added age and its value is a number, then it can’t be treated as a string.

The dynamic parameter can also be set to strict, meaning that not only will new fields not be introduced into the mapping, but also that parsing (indexing) docs with such new fields will fail.
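
The three dynamic settings (true, false and strict) can be compared with a small Python model. It is a sketch under simplifying assumptions, not Elasticsearch's parser: index_object, guess_type and StrictDynamicMappingError are made-up names, type detection is reduced to a few JSON types, and each object only looks at its own dynamic flag, defaulting to true. The sample values are invented.

```python
class StrictDynamicMappingError(Exception):
    """Hypothetical error for an unknown field under dynamic: strict."""

def guess_type(value):
    """Very rough stand-in for automatic type detection."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def index_object(mapping, obj, prefix=""):
    """Return the indexed fields of obj, updating mapping for new fields."""
    dynamic = mapping.get("dynamic", True)
    props = mapping.setdefault("properties", {})
    indexed = {}
    for key, value in obj.items():
        if key not in props:
            if dynamic == "strict":
                raise StrictDynamicMappingError(prefix + key)
            if dynamic is False:
                continue  # discarded: neither indexed nor added to the mapping
            if isinstance(value, dict):
                props[key] = {"properties": {}}
            else:
                props[key] = {"type": guess_type(value)}
        if isinstance(value, dict):
            indexed.update(index_object(props[key], value, prefix + key + "."))
        else:
            indexed[prefix + key] = value
    return indexed

person_mapping = {"properties": {
    "name": {"dynamic": False, "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"}}},
    "sid": {"type": "string", "index": "not_analyzed"}}}
person = {"name": {"first_name": "Shay", "middle_name": "N.", "last_name": "Banon"},
          "sid": "12345", "age": 30}
indexed_fields = index_object(person_mapping, person)
```

In this run age is added to the person mapping, middle_name is silently dropped because name is not dynamic, and the same document would be rejected if name used strict.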

enabled

The enabled flag allows to disable parsing and indexing a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON which should not be indexed, nor added to the mapping. For example:

{
    "tweet":{
        "properties":{
            "person":{
                "type":"object",
                "properties":{
                    "name":{
                        "type":"object",
                        "enabled":false
                    },
                    "sid":{"type":"string","index":"not_analyzed"}
                }
            },
            "message":{"type":"string"}
        }
    }
}

In the above, name and its content will not be indexed at all.

include_in_all

include_in_all can be set on the object type level. When set, it propagates down to all the inner mappings defined within the object that do not explicitly set it.

path

deprecated in 1.0.0.

Use copy_to instead.

In the core_types section, a field can have an index_name associated with it in order to control the name of the field that will be stored within the index. When that field exists within an object(s) that are not the root object, the name of the field of the index can either include the full "path" to the field with its index_name, or just the index_name. For example (under mapping of type person, removed the tweet type for clarity):

{
    "person":{
        "properties":{
            "name1":{
                "type":"object",
                "path":"just_name",
                "properties":{
                    "first1":{"type":"string"},
                    "last1":{"type":"string","index_name":"i_last_1"}
                }
            },
            "name2":{
                "type":"object",
                "path":"full",
                "properties":{
                    "first2":{"type":"string"},
                    "last2":{"type":"string","index_name":"i_last_2"}
                }
            }
        }
    }
}

In the above example, the name1 and name2 objects within the person object have different combination of path and index_name. The document fields that will be stored in the index as a result of that are:

JSON Name        Document Field Name
name1/first1     first1
name1/last1      i_last_1
name2/first2     name2.first2
name2/last2      name2.i_last_2

Note, when querying or using a field name in any of the APIs provided (search, query, selective loading, …), there is an automatic detection from logical full path and into the index_name and vice versa. For example, even though name1/last1 defines that it is stored with just_name and a different index_name, it can either be referred to using name1.last1 (logical name), or its actual indexed name of i_last_1.

Moreover, where applicable (for example, in queries), the full path including the type can be used, such as person.name1.last1. In this case, the actual indexed name will be resolved to match against the index, and an automatic query filter will be added to only match person types.
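
The naming rule behind the table above can be written down as a short Python function. It is a sketch of the rule as described in this section, limited to one level of nesting; indexed_field_names is a made-up helper and does not cover every case Elasticsearch handles.

```python
def indexed_field_names(properties):
    """Map each JSON name (object/field) to the field name stored in the index."""
    names = {}
    for obj_name, obj_mapping in properties.items():
        path = obj_mapping.get("path", "full")  # "full" is the default
        for field, field_mapping in obj_mapping["properties"].items():
            index_name = field_mapping.get("index_name", field)
            if path == "just_name":
                stored = index_name                   # no object prefix
            else:
                stored = obj_name + "." + index_name  # full path prefix
            names[obj_name + "/" + field] = stored
    return names

person_properties = {
    "name1": {"type": "object", "path": "just_name", "properties": {
        "first1": {"type": "string"},
        "last1": {"type": "string", "index_name": "i_last_1"}}},
    "name2": {"type": "object", "path": "full", "properties": {
        "first2": {"type": "string"},
        "last2": {"type": "string", "index_name": "i_last_2"}}},
}
stored_names = indexed_field_names(person_properties)
```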

4)root object type

The root object mapping is an object type mapping that maps the root object (the type itself). On top of all the different mappings that can be set using the object type mapping, it allows for additional, type level mapping definitions.

The root object mapping allows to index a JSON document that either starts with the actual mapping type, or only contains its fields. For example, the following tweet JSON can be indexed:

{
    "message":"This is a tweet!"
}

But, also the following JSON can be indexed:

{
    "tweet":{
        "message":"This is a tweet!"
    }
}

Out of the two, it is preferable to use the document without the type explicitly set.

index / search analyzers

The root object allows to define type mapping level analyzers for index and search that will be used with all different fields that do not explicitly set analyzers on their own. Here is an example:

{
    "tweet":{
        "index_analyzer":"standard",
        "search_analyzer":"standard"
    }
}

The above simply explicitly defines both the index_analyzer and search_analyzer that will be used. There is also an option to use the analyzer attribute to set both the search_analyzer and index_analyzer.

dynamic_date_formats

dynamic_date_formats (old setting called date_formats still works) is the ability to set one or more date formats that will be used to detect date fields. For example:

{
    "tweet":{
        "dynamic_date_formats":["yyyy-MM-dd","dd-MM-yyyy"],
        "properties":{
            "message":{"type":"string"}
        }
    }
}

In the above mapping, if a new JSON field of type string is detected, the date formats specified will be used in order to check if it’s a date. If it passes parsing, then the field will be declared with date type, and will use the matching format as its format attribute. The date format itself is explained here.

The default formats are: dateOptionalTime (ISO) and yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z.

Note: dynamic_date_formats are used only for dynamically added date fields, not for date fields that you specify in your mapping.
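
Date detection for a new string field can be approximated in Python like this. It is a sketch only: Elasticsearch parses dates with Joda-style patterns, so the table JODA_TO_STRPTIME (a made-up name) hand-translates just the two patterns from the mapping above, and the default formats are not covered.

```python
from datetime import datetime

# Hand-written translation of the two Joda-style patterns used above.
JODA_TO_STRPTIME = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd-MM-yyyy": "%d-%m-%Y",
}

def detect_date_format(value, dynamic_date_formats):
    """Return the first configured format that parses value, or None."""
    for joda_format in dynamic_date_formats:
        try:
            datetime.strptime(value, JODA_TO_STRPTIME[joda_format])
        except ValueError:
            continue  # this format does not match, try the next one
        return joda_format
    return None

formats = ["yyyy-MM-dd", "dd-MM-yyyy"]
first = detect_date_format("2014-01-03", formats)
second = detect_date_format("03-01-2014", formats)
not_a_date = detect_date_format("This is a tweet!", formats)
```

A string that matches none of the formats stays a plain string field.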

date_detection

Allows to disable automatic date type detection (if a new field is introduced and matches the provided format), for example:

{
    "tweet":{
        "date_detection":false,
        "properties":{
            "message":{"type":"string"}
        }
    }
}

numeric_detection

Sometimes, even though json has support for native numeric types, numeric values are still provided as strings. In order to try and automatically detect numeric values from string, the numeric_detection can be set to true. For example:

{
    "tweet":{
        "numeric_detection":true,
        "properties":{
            "message":{"type":"string"}
        }
    }
}

dynamic_templates

Dynamic templates allow to define mapping templates that will be applied when dynamic introduction of fields / objects happens.

For example, we might want to have all fields to be stored by default, or all string fields to be stored, or have string fields to always be indexed with multi fields syntax, once analyzed and once not_analyzed. Here is a simple example:

{
    "person":{
        "dynamic_templates":[
            {
                "template_1":{
                    "match":"multi*",
                    "mapping":{
                        "type":"{dynamic_type}",
                        "index":"analyzed",
                        "fields":{
                            "org":{"type":"{dynamic_type}","index":"not_analyzed"}
                        }
                    }
                }
            },
            {
                "template_2":{
                    "match":"*",
                    "match_mapping_type":"string",
                    "mapping":{
                        "type":"string",
                        "index":"not_analyzed"
                    }
                }
            }
        ]
    }
}

The above mapping will create a field with multi fields for all field names starting with multi, and will map all string types to be not_analyzed.

Dynamic templates are named to allow for simple merge behavior. A new mapping, just with a new template can be "put" and that template will be added, or if it has the same name, the template will be replaced.

The match option allows matching on the field name. An unmatch option is also available to exclude fields that do match on match. The match_mapping_type option controls whether this template is applied only to dynamic fields of the specified type (as guessed from the JSON format).

Another option is to use path_match, which allows to match the dynamic template against the "full" dot notation name of the field (for example obj1.*.value or obj1.obj2.*), with the respective path_unmatch.

The format of all the matching is simple format, allowing to use * as a matching element supporting simple patterns such as xxx*, *xxx, xxx*yyy (with arbitrary number of pattern types), as well as direct equality. The match_pattern can be set to regex to allow for regular expression based matching.

The mapping element provides the actual mapping definition. The {name} keyword can be used and will be replaced with the actual dynamic field name being introduced. The {dynamic_type} (or {dynamicType}) keyword can be used and will be replaced with the mapping derived based on the field type (or the derived type, like date).

Complete generic settings can also be applied, for example, to have all mappings be stored, just set:

{
    "person":{
        "dynamic_templates":[
            {
                "store_generic":{
                    "match":"*",
                    "mapping":{
                        "store":true
                    }
                }
            }
        ]
    }
}

Such generic templates should be placed at the end of the dynamic_templates list because when two or more dynamic templates match a field, only the first matching one from the list is used.
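
The first-match rule and the {name} / {dynamic_type} placeholders can be sketched in Python as below. This is an approximation rather than the real matcher: pick_template and render are made-up names, fnmatchcase stands in for the simple * patterns (it also treats ? and [] as special, which the simple format may not), and path_match and regex patterns are left out.

```python
from fnmatch import fnmatchcase

def render(mapping, field_name, detected_type):
    """Replace {name} and {dynamic_type} placeholders inside a mapping."""
    if isinstance(mapping, dict):
        return {key: render(value, field_name, detected_type)
                for key, value in mapping.items()}
    if isinstance(mapping, str):
        return (mapping.replace("{name}", field_name)
                       .replace("{dynamic_type}", detected_type))
    return mapping

def pick_template(dynamic_templates, field_name, detected_type):
    """Return (template name, rendered mapping) for the first match, or None."""
    for entry in dynamic_templates:
        (name, template), = entry.items()  # each entry holds one named template
        if "match" in template and not fnmatchcase(field_name, template["match"]):
            continue
        if "unmatch" in template and fnmatchcase(field_name, template["unmatch"]):
            continue
        if ("match_mapping_type" in template
                and template["match_mapping_type"] != detected_type):
            continue
        return name, render(template["mapping"], field_name, detected_type)
    return None

templates = [
    {"template_1": {"match": "multi*", "mapping": {
        "type": "{dynamic_type}", "index": "analyzed",
        "fields": {"org": {"type": "{dynamic_type}", "index": "not_analyzed"}}}}},
    {"template_2": {"match": "*", "match_mapping_type": "string", "mapping": {
        "type": "string", "index": "not_analyzed"}}},
]
multi_choice = pick_template(templates, "multi_title", "string")
plain_choice = pick_template(templates, "city", "string")
no_choice = pick_template(templates, "age", "long")
```

Because the list is scanned in order, multi_title is handled by template_1 even though template_2 would match it as well.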

5)nested type

Nested objects/documents allow certain sections of an indexed document to be mapped as nested, so that they can be queried as if they were separate docs joined with the parent doc that owns them.

One of the problems when indexing inner objects that occur several times in a doc is that "cross object" search match will occur, for example:

{
    "obj1":[
        {
            "name":"blue",
            "count":4
        },
        {
            "name":"green",
            "count":6
        }
    ]
}

Searching for name set to blue and count higher than 5 will match the doc, because in the first element the name matches blue, and in the second element, count matches "higher than 5".

Nested mapping allows mapping certain inner objects (usually multi instance ones), for example:

{
    "type1":{
        "properties":{
            "obj1":{
                "type":"nested",
                "properties":{
                    "name":{"type":"string","index":"not_analyzed"},
                    "count":{"type":"integer"}
                }
            }
        }
    }
}

The above will cause all obj1 to be indexed as a nested doc. The mapping is similar in nature to setting type to object, except that it’s nested. Nested object fields can be defined explicitly as in the example above or added dynamically in the same way as for the root object.
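
The difference between the two mappings can be reproduced with plain Python. The sketch below only models the matching semantics described here; matches_flattened and matches_nested are made-up names, and no Elasticsearch query is involved.

```python
def matches_flattened(doc, name, min_count):
    """Object mapping: values of all elements are pooled per field."""
    names = [element["name"] for element in doc["obj1"]]
    counts = [element["count"] for element in doc["obj1"]]
    return name in names and any(count > min_count for count in counts)

def matches_nested(doc, name, min_count):
    """Nested mapping: both conditions must hold within the same element."""
    return any(element["name"] == name and element["count"] > min_count
               for element in doc["obj1"])

doc = {"obj1": [{"name": "blue", "count": 4},
                {"name": "green", "count": 6}]}
cross_object_hit = matches_flattened(doc, "blue", 5)
nested_hit = matches_nested(doc, "blue", 5)
```

The pooled check matches the document for blue with a count above 5, while the per-element check does not.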

Note: changing an object type to nested type requires reindexing.

The nested object fields can also be automatically added to the immediate parent by setting include_in_parent to true, and also included in the root object by setting include_in_root to true.

Nested docs will also automatically use the root doc _all field.

Searching on nested docs can be done using either the nested query or nested filter.

internal implementation

Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.

Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.

Because nested docs are always masked to the parent doc, the nested docs can never be accessed outside the scope of the nested query. For example stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.

The _source field is always associated with the parent document, and because of that, field values for nested objects can be fetched via the source.

6)ip type

An ip mapping type allows to store ipv4 addresses in a numeric form allowing to easily sort, and range query it (using ip values).
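
Why a numeric form helps with sorting and range queries can be seen with Python's standard ipaddress module. This only illustrates the idea; it does not show how Elasticsearch encodes the value internally, and ip_to_long and in_range are made-up helper names.

```python
import ipaddress

def ip_to_long(address):
    """Convert a dotted IPv4 string to its 32-bit numeric value."""
    return int(ipaddress.IPv4Address(address))

def in_range(address, low, high):
    """Range check on the numeric values, bounds included."""
    return ip_to_long(low) <= ip_to_long(address) <= ip_to_long(high)

addresses = ["10.0.0.10", "192.168.0.1", "10.0.0.9"]
sorted_as_strings = sorted(addresses)
sorted_as_numbers = sorted(addresses, key=ip_to_long)
```

Sorted as strings, 10.0.0.10 comes before 10.0.0.9; sorted by numeric value, the order is the expected one.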

The following table lists all the attributes that can be used with an ip type:

Attribute Description

index_name

The name of the field that will be stored in the index. Defaults to the property/field name.

store

Set to true to store actual field in the index, false to not store it. Defaults to false (note, the JSON document itself is stored, and it can be retrieved from it).

index

Set to no if the value should not be indexed. In this case, store should be set to true, since if it’s not indexed and not stored, there is nothing to do with it.

precision_step

The precision step (number of terms generated for each number value). Defaults to 4.

boost

The boost value. Defaults to 1.0.

null_value

When there is a (JSON) null value for the field, use the null_value as the field value. Defaults to not adding the field at all.

include_in_all

Should the field be included in the _all field (if enabled). Defaults to true or to the parent object type setting.

7)geo point type

The geo_point mapper type supports geo-based points. The declaration looks as follows:

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point"
            }
        }
    }
}

indexed fields

The geo_point mapping will index a single field with the format of lat,lon. The lat_lon option can be set to also index the .lat and .lon as numeric fields, and geohash can be set to true to also index .geohash value.

A good practice is to enable indexing lat_lon as well, since both the geo distance and bounding box filters can either be executed using in memory checks, or using the indexed lat lon values, and it really depends on the data set which one performs better. Note though, that indexed lat lon only make sense when there is a single geo point value for the field, and not multi values.

geohashes

Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.

Because geohashes are just strings, they can be stored in an inverted index like any other string, which makes querying them very efficient.

If you enable the geohash option, a geohash “sub-field” will be indexed as well, e.g. pin.geohash. The length of the geohash is controlled by the geohash_precision parameter, which can either be set to an absolute length (e.g. 12, the default) or to a distance (e.g. 1km).

More usefully, set the geohash_prefix option to true to not only index the geohash value, but all the enclosing cells as well. For instance, a geohash of u30 will be indexed as [u,u3,u30]. This option can be used by the Geohash Cell Filter to find geopoints within a particular cell very efficiently.
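
To make the grid idea and the prefix indexing concrete, here is a small Python sketch of the standard geohash encoding together with the list of enclosing cells. It follows the usual public algorithm (alternating longitude and latitude bits, 5 bits per base32 character); geohash_encode and geohash_prefixes are illustrative names, not Elasticsearch functions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=12):
    """Encode a point as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

def geohash_prefixes(geohash):
    """Return the geohash and all of its enclosing (shorter) cells."""
    return [geohash[:length] for length in range(1, len(geohash) + 1)]

cell = geohash_encode(41.12, -71.34, precision=4)
enclosing = geohash_prefixes("u30")
```

Each extra character narrows the cell, so every prefix of a geohash names a larger cell that contains it.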

input structure

The above mapping defines a geo_point, which accepts different formats. The following formats are supported:

lat lon as properties

{
    "pin":{
        "location":{
            "lat":41.12,
            "lon":-71.34
        }
    }
}

lat lon as string

Format in lat,lon.

{
    "pin":{
        "location":"41.12,-71.34"
    }
}

geohash

{
    "pin":{
        "location":"drm3btev3e86"
    }
}

lat lon as array

Format in [lon, lat]. Note the order of lon/lat here, which conforms with GeoJSON.

{
    "pin":{
        "location":[-71.34,41.12]
    }
}
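
The main pitfall among these formats is the coordinate order, which a small Python normalizer makes explicit. This is a sketch with a made-up helper name, parse_geo_point; it covers the properties, string and array forms and deliberately leaves geohash strings undecoded.

```python
def parse_geo_point(value):
    """Normalize a geo_point input value to a (lat, lon) tuple."""
    if isinstance(value, dict):
        # lat lon as properties
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, (list, tuple)):
        # lat lon as array: the order is [lon, lat], as in GeoJSON
        lon, lat = value
        return float(lat), float(lon)
    if isinstance(value, str) and "," in value:
        # lat lon as string: the order is "lat,lon"
        lat, lon = value.split(",")
        return float(lat), float(lon)
    # Geohash strings are out of scope for this sketch.
    raise ValueError("unsupported geo_point value: %r" % (value,))

from_properties = parse_geo_point({"lat": 41.12, "lon": -71.34})
from_string = parse_geo_point("41.12,-71.34")
from_array = parse_geo_point([-71.34, 41.12])
```

All three inputs describe the same point, even though the string puts latitude first and the array puts longitude first.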

mapping options

Option Description

lat_lon

Set to true to also index the .lat and .lon as fields. Defaults to false.

geohash

Set to true to also index the .geohash as a field. Defaults to false.

geohash_precision

Sets the geohash precision. It can be set to an absolute geohash length or a distance value (eg 1km, 1m, 1ml) defining the size of the smallest cell. Defaults to an absolute length of 12.

geohash_prefix

If this option is set to true, not only the geohash but also all its parent cells (true prefixes) will be indexed as well. The number of terms that will be indexed depends on the geohash_precision. Defaults to false. Note: This option implicitly enables geohash.

validate

Set to true to reject geo points with invalid latitude or longitude (default is false). Note: Validation only works when normalization has been disabled.

validate_lat

Set to true to reject geo points with an invalid latitude.

validate_lon

Set to true to reject geo points with an invalid longitude.

normalize

Set to true to normalize latitude and longitude (default is true).

normalize_lat

Set to true to normalize latitude.

normalize_lon

Set to true to normalize longitude.

precision_step

The precision step (number of terms generated for each number value) for .lat and .lon fields if lat_lon is set to true. Defaults to 4.

field data

By default, geo points use the array format which loads geo points into two parallel double arrays, making sure there is no precision loss. However, this can require a non-negligible amount of memory (16 bytes per document) which is why Elasticsearch also provides a field data implementation with lossy compression called compressed:

{
    "pin":{
        "properties":{
            "location":{
                "type":"geo_point",
                "fielddata":{
                    "format":"compressed",
                    "precision":"1cm"
                }
            }
        }
    }
}

This field data format comes with a precision option which allows to configure how much precision can be traded for memory. The default value is 1cm. The following table presents values of the memory savings given various precisions:

Precision    Bytes per point    Size reduction
1km          4                  75%
3m           6                  62.5%
1cm          8                  50%
1mm          10                 37.5%

Precision can be changed on a live index by using the update mapping API.
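
The size reduction column follows directly from the 16 bytes per point of the default array format mentioned above. A minimal Python check of that arithmetic (size_reduction_pct is just an illustrative helper):

```python
UNCOMPRESSED_BYTES_PER_POINT = 16  # two doubles: latitude and longitude

def size_reduction_pct(bytes_per_point):
    """Percentage saved compared with the uncompressed array format."""
    return 100.0 * (1 - bytes_per_point / UNCOMPRESSED_BYTES_PER_POINT)

# Bytes per point for each precision, taken from the table above.
bytes_by_precision = {"1km": 4, "3m": 6, "1cm": 8, "1mm": 10}
reductions = {precision: size_reduction_pct(size)
              for precision, size in bytes_by_precision.items()}
```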

usage in scripts

When using doc[geo_field_name] (in the above mapping, doc['location']), the doc[...].value returns a GeoPoint, which then allows access to lat and lon (for example, doc[...].value.lat). For performance, it is better to access the lat and lon directly using doc[...].lat and doc[...].lon.

8)geo shape type

The geo_shape mapping type facilitates the indexing of and searching with arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points.

You can query documents using this type using geo_shape Filter or geo_shape Query.

Note, the geo_shape type uses Spatial4J and JTS, both of which are optional dependencies. Consequently you must add Spatial4J v0.3 and JTS v1.12 to your classpath in order to use this type.

mapping options

The geo_shape mapping maps geo_json geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.

Option Description

tree

Name of the PrefixTree implementation to be used: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree. Defaults to geohash.

precision

This parameter may be used instead of tree_levels to set an appropriate value for the tree_levels parameter. The value specifies the desired precision and Elasticsearch will calculate the best tree_levels value to honor this precision. The value should be a number followed by an optional distance unit. Valid distance units include: in, inch, yd, yard, mi, miles, km, kilometers, m, meters (default), cm, centimeters, mm, millimeters.

tree_levels

Maximum number of layers to be used by the PrefixTree. This can be used to control the precision of shape representations and therefore how many terms are indexed. Defaults to the default value of the chosen PrefixTree implementation. Since this parameter requires a certain level of understanding of the underlying implementation, users may use the precision parameter instead. However, Elasticsearch only uses the tree_levels parameter internally and this is what is returned via the mapping API even if you use the precision parameter.

distance_error_pct

Used as a hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum supported value.

prefix trees

To efficiently represent shapes in the index, Shapes are converted into a series of hashes representing grid squares using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth.

Multiple PrefixTree implementations are provided:

  • GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the interleaved bits of the latitude and longitude, so the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum number of levels in Elasticsearch is 24.
  • QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohash, quad trees interleave the bits of the latitude and longitude; the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum number of levels for the quad trees in Elasticsearch is 50.
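
To make the geohash levels concrete, here is a minimal Python sketch of the standard geohash encoding. It is not Elasticsearch's own code, and the function name geohash_encode is made up for illustration. Each level consumes 5 interleaved bits, starting with longitude, and emits one base32 character:

```python
# The geohash base32 alphabet (digits plus lowercase letters without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, levels):
    """Encode a point as a geohash with one character per tree level."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value = 0        # bits collected for the current character
    bit_count = 0    # number of bits collected so far (0..5)
    use_lon = True   # bits alternate between longitude and latitude
    while len(chars) < levels:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        value <<= 1
        if coord >= mid:
            value |= 1      # upper half of the current interval
            rng[0] = mid
        else:
            rng[1] = mid    # lower half of the current interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # 5 bits make one base32 character (one level)
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    for levels in (1, 2, 5, 9):
        print(levels, geohash_encode(57.64911, 10.40744, levels))
```

Because every extra character only refines the previous cell, a longer hash always starts with the shorter hash of the same point, which is what makes the hashes usable as paths in a prefix tree.
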
accuracy

Geo_shape does not provide 100% accuracy and depending on how it is configured it may return some false positives or false negatives for certain queries. To mitigate this, it is important to select an appropriate value for the tree_levels parameter and to adjust expectations accordingly. For example, a point may be near the border of a particular grid cell and may thus not match a query that only matches the cell right next to it — even though the shape is very close to the point.

example
{
    "properties":{
        "location":{
            "type":"geo_shape",
            "tree":"quadtree",
            "precision":"1m"
        }
    }
}

This mapping maps the location field to the geo_shape type using the quadtree implementation and a precision of 1m. Elasticsearch translates this into a tree_levels setting of 26.
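
As a rough plausibility check on that number, assume each quadtree level halves the cell width, starting from the full equatorial circumference (about 40,075 km). This is a simplified model with a hypothetical function name, not necessarily the formula Elasticsearch uses internally:

```python
import math

# Approximate equatorial circumference in meters (WGS84 equatorial radius).
EARTH_EQUATOR_M = 2 * math.pi * 6378137.0


def approx_quadtree_levels(precision_m):
    """Count how many halvings bring the cell width down to precision_m."""
    if precision_m <= 0:
        raise ValueError("precision must be positive")
    levels = 0
    width = EARTH_EQUATOR_M
    while width > precision_m:
        width /= 2.0   # one more tree level halves the cell width
        levels += 1
    return levels


if __name__ == "__main__":
    print(approx_quadtree_levels(1.0))
```

For a precision of 1m this model yields 26, which agrees with the value quoted above. Agreement for other precisions is not guaranteed, since the real calculation may account for the cell shape differently.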

performance considerations

Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the number of levels (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without bloating the resulting index too much relative to the input size.

input structure

The GeoJSON format is used to represent Shapes as input as follows:

{
    "location":{
        "type":"point",
        "coordinates":[45.0,-45.0]
    }
}

Note, both the type and coordinates fields are required.

The supported types are point, linestring, polygon, multipoint and multipolygon.

Note, in geojson the correct order for coordinate arrays is longitude, latitude. This differs from some APIs, such as Google Maps, that generally use latitude, longitude.
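
Since the coordinate order is easy to get wrong, a small helper that takes named lat and lon arguments can keep it in one place. The following Python sketch is only an illustration; the helper name geojson_point is hypothetical, and the range checks are an added safety net rather than something the text requires:

```python
import json


def geojson_point(lat, lon):
    """Build a GeoJSON point; note that longitude comes first."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude out of range: %r" % lat)
    if not -180.0 <= lon <= 180.0:
        raise ValueError("longitude out of range: %r" % lon)
    return {"type": "point", "coordinates": [lon, lat]}


if __name__ == "__main__":
    # A document for the "location" field used in the examples above.
    doc = {"location": geojson_point(lat=-45.0, lon=45.0)}
    print(json.dumps(doc))
```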

envelope

Elasticsearch supports an envelope type which consists of coordinates for upper left and lower right points of the shape:

{
    "location":{
        "type":"envelope",
        "coordinates":[[-45.0,45.0],[45.0,-45.0]]
    }
}

polygon

A polygon is defined by a list of a list of points. The first and last points in each (outer) list must be the same (the polygon must be closed).

{
    "location":{
        "type":"polygon",
        "coordinates":[
            [[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]
        ]
    }
}
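
The closure rule can be checked before a document is sent for indexing. Below is a minimal Python sketch with hypothetical helper names; the minimum of four points per ring comes from the GeoJSON specification rather than from the text above:

```python
def is_closed(ring):
    """A ring is closed when it has at least 4 points and first == last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def close_ring(ring):
    """Return a copy of the ring with the first point appended if needed."""
    ring = [list(point) for point in ring]
    if ring and ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return ring


def validate_polygon(coordinates):
    """Raise ValueError if any ring of the polygon is not closed."""
    for index, ring in enumerate(coordinates):
        if not is_closed(ring):
            raise ValueError("ring %d is not closed" % index)


if __name__ == "__main__":
    open_ring = [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0]]
    polygon = {"type": "polygon", "coordinates": [close_ring(open_ring)]}
    validate_polygon(polygon["coordinates"])
    print(polygon)
```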

The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"):

{
    "location":{
        "type":"polygon",
        "coordinates":[
            [[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
            [[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]
        ]
    }
}

multipolygon

A list of geojson polygons.

{
    "location":{
        "type":"multipolygon",
        "coordinates":[
            [[[102.0,2.0],[103.0,2.0],[103.0,3.0],[102.0,3.0],[102.0,2.0]]],
            [[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]],
            [[100.2,0.2],[100.8,0.2],[100.8,0.8],[100.2,0.8],[100.2,0.2]]]
        ]
    }
}

sorting and retrieving index shapes

Due to the complex input structure and index representation of shapes, it is not currently possible to sort shapes or retrieve their fields directly. The geo_shape value is only retrievable through the _source field.

9)attachment type

The attachment type allows indexing of "attachment" fields (encoded as base64) in various formats, for example Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list is given in the Apache Tika documentation).

The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed in the $ES_HOME/plugins directory. It will be automatically detected and the attachment type will be added.

Note, the attachment type is experimental.

Using the attachment type is simple: in your mapping JSON, set a JSON element to the attachment type, for example:

{
    "person":{
        "properties":{
            "my_attachment":{"type":"attachment"}
        }
    }
}

In this case, the JSON to index can be:

{
    "my_attachment":"... base64 encoded attachment ..."
}

Alternatively, a more elaborate JSON can be used if the content type or resource name needs to be set explicitly:

{
    "my_attachment":{
        "_content_type":"application/pdf",
        "_name":"resource/name/of/my.pdf",
        "content":"... base64 encoded attachment ..."
    }
}
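
Both document forms can be produced from raw bytes with the standard library alone. The following Python sketch is a hedged illustration: the helper name attachment_doc is hypothetical, and only the field names my_attachment, _content_type, _name and content are taken from the examples above.

```python
import base64
import json


def attachment_doc(data, content_type=None, name=None, field="my_attachment"):
    """Build the JSON body for indexing raw bytes as an attachment field."""
    encoded = base64.b64encode(data).decode("ascii")
    if content_type is None and name is None:
        # Simple form: the field value is just the base64 string.
        return {field: encoded}
    # Elaborate form: explicit content type and/or resource name.
    value = {"content": encoded}
    if content_type is not None:
        value["_content_type"] = content_type
    if name is not None:
        value["_name"] = name
    return {field: value}


if __name__ == "__main__":
    # Placeholder bytes; in practice, read the file in binary mode.
    body = attachment_doc(b"example bytes", "application/pdf",
                          "resource/name/of/my.pdf")
    print(json.dumps(body, indent=4))
```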

The attachment type not only indexes the content of the doc, but also automatically adds metadata on the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using the "dot notation", for example: my_attachment.author.

Both the metadata and the actual content are simple core type mappers (string, date, …), so they can be controlled in the mappings. For example:

{
    "person":{
        "properties":{
            "file":{
                "type":"attachment",
                "fields":{
                    "file":{"index":"no"},
                    "date":{"store":true},
                    "author":{"analyzer":"myAnalyzer"}
                }
            }
        }
    }
}

In the above example, the actual content indexed is mapped under the fields entry named file, and we decide not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, but there is no need to specify the type (like string or date) since it is already known.

The plugin uses Apache Tika to parse attachments, so many formats are supported; they are listed in the Apache Tika documentation.