Elasticsearch Study Notes 1: Basic Concepts

Date: 2023-02-22 15:06:45

Basic Concepts

For the original text of this translation, see Basic Concepts.

A few concepts are core to Elasticsearch. Understanding them from the outset makes the rest of the learning process considerably easier.

1. Near Realtime (NRT)

Elasticsearch is a near real-time search platform. This means there is a slight latency (normally one second) between the time you index a document and the time it becomes searchable.
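The delay comes from the fact that newly indexed documents only become visible to search after a periodic refresh. The following is a toy in-memory model of that behaviour, not Elasticsearch code: the class name `ToyNearRealTimeIndex` and its methods are hypothetical and exist only for illustration.

```python
class ToyNearRealTimeIndex:
    """Toy model: documents wait in a buffer until the next refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet searchable
        self._searchable = []  # visible to search

    def index(self, doc):
        # Indexing only adds the document to the buffer.
        self._buffer.append(doc)

    def refresh(self):
        # A refresh makes everything indexed so far visible to search.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        return [d for d in self._searchable if d.get(field) == value]


toy = ToyNearRealTimeIndex()
toy.index({"name": "John Doe"})
before = toy.search("name", "John Doe")  # refresh has not happened yet
toy.refresh()                            # the real system does this periodically
after = toy.search("name", "John Doe")
print(len(before), len(after))
```

In this model the document is invisible to the first search and visible to the second, which is the "slight latency" the paragraph describes.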

2. Cluster

A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch". This name matters because a node can only be part of a cluster if it is set up to join the cluster by that name.

Make sure you do not reuse the same cluster name in different environments; otherwise nodes may end up joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.

Note that a cluster with only a single node in it is valid and perfectly fine. You may also run multiple independent clusters, each with its own unique cluster name.

3. Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster's indexing and search capabilities. Like a cluster, a node is identified by a name, which by default is a random Universally Unique IDentifier (UUID) assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration, where you need to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch. This means that if you start several nodes on your network, and they can discover each other, they will all automatically form and join a single cluster named elasticsearch.

In a single cluster, you can have as many nodes as you want. Furthermore, if no other Elasticsearch nodes are currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
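As a rough illustration of naming clusters and nodes explicitly, the sketch below renders the two relevant lines of a node's configuration file. `cluster.name` and `node.name` are the Elasticsearch settings involved; the helper `render_settings` and the `<cluster>-node-<n>` naming scheme are assumptions made up for this example.

```python
def render_settings(environment, node_number):
    """Return elasticsearch.yml-style lines for one node.

    The naming scheme is hypothetical; only the setting keys are real.
    """
    cluster_name = f"logging-{environment}"
    node_name = f"{cluster_name}-node-{node_number}"
    return [f"cluster.name: {cluster_name}", f"node.name: {node_name}"]


for line in render_settings("dev", 1):
    print(line)
```

Deriving the node name from the cluster name is only one possible convention; the point is that nodes in different environments never share a cluster name.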

4. Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have one index for customer data, another for a product catalog, and yet another for order data. An index is identified by a name (which must be all lowercase), and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
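A minimal sketch of the lowercase rule stated above. The function `is_lowercase_index_name` is hypothetical and checks only that one rule; Elasticsearch enforces further naming restrictions that are not covered here.

```python
def is_lowercase_index_name(name):
    """Check only the rule from the text: a non-empty, all-lowercase name."""
    return bool(name) and name == name.lower()


for candidate in ["customer", "product-catalog", "Orders"]:
    print(candidate, is_lowercase_index_name(candidate))
```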

In a single cluster, you can define as many indexes as you want.

5. Type

A type used to be a logical category/partition of your index that let you store different kinds of documents in the same index, e.g. one type for users and another for blog posts. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types for more.

6. Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another for a single product, and yet another for a single order. A document is expressed in JSON (JavaScript Object Notation), a ubiquitous internet data interchange format.
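To make "a document expressed in JSON" concrete, here is a small sketch that serializes one customer record. The field names and values are invented for illustration and do not come from any real index.

```python
import json

# A hypothetical customer document; the fields are made up for this example.
customer = {"name": "John Doe", "email": "john@example.com", "orders": 3}

# Serialize to the JSON text that would be sent as the document body.
body = json.dumps(customer, indent=2, sort_keys=True)
print(body)

# Parsing the text back yields the same structure.
restored = json.loads(body)
```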

Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, it must actually be indexed/assigned to a type inside that index.

7. Shards and Replicas

An index can potentially store a large amount of data that exceeds the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node, or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:

  • It allows you to horizontally split/scale your content volume
  • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput

The mechanics of how a shard is distributed, and of how its documents are aggregated back into search requests, are completely managed by Elasticsearch and are transparent to you as the user.
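Although this routing is transparent, it can help to picture it as hash-modulo routing: hash a routing key (typically the document ID) and take the remainder modulo the number of primary shards. Elasticsearch's documentation describes routing with a formula of this general shape. The sketch below is an assumption-laden illustration, not the real implementation: `pick_shard` is a hypothetical helper and `zlib.crc32` is a stand-in for whatever hash Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Map a routing key to a shard number with hash-modulo routing.

    zlib.crc32 is a stand-in hash used for illustration only.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs spread over 5 primary shards.
doc_ids = [f"customer-{i}" for i in range(1000)]
counts = [0] * 5
for doc_id in doc_ids:
    counts[pick_shard(doc_id, 5)] += 1
print(counts)

# With a different shard count, many documents would map to another shard.
moved = sum(pick_shard(d, 5) != pick_shard(d, 6) for d in doc_ids)
print(moved)
```

Because the shard number depends on the shard count, a scheme like this also suggests why the number of primary shards is fixed when the index is created.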

In a network/cloud environment where failures can be expected at any time, it is very useful and highly recommended to have a failover mechanism in case a shard/node goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index's shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:

  • It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  • It allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.
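The rule that a replica never shares a node with its primary can be sketched with a toy allocator. This is not Elasticsearch's allocation algorithm: the function `allocate`, the least-loaded placement strategy, and the node names are all assumptions made for illustration.

```python
def allocate(number_of_primary_shards, number_of_replicas, nodes):
    """Toy allocator: place each copy of a shard on a different node.

    Returns ({node: [(shard_number, "p" or "r"), ...]}, unassigned), where
    unassigned lists the copies for which no eligible node remained.
    """
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(number_of_primary_shards):
        used = set()  # nodes already holding a copy of this shard
        for copy in range(1 + number_of_replicas):
            role = "p" if copy == 0 else "r"
            eligible = [n for n in nodes if n not in used]
            if not eligible:
                unassigned.append((shard, role))
                continue
            # Prefer the least-loaded eligible node to spread copies out.
            target = min(eligible, key=lambda n: len(placement[n]))
            placement[target].append((shard, role))
            used.add(target)
    return placement, unassigned


two_nodes, left_over = allocate(5, 1, ["node-a", "node-b"])
one_node, replicas_waiting = allocate(5, 1, ["node-a"])
print(len(left_over), len(replicas_waiting))
```

With two nodes every copy finds a home; with a single node the primaries are placed but the replicas stay unassigned, because the only node already holds the primary of each shard.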

To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index has primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the fact.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica (this is the default in the releases this text describes; Elasticsearch 7.0 and later default to 1 primary shard). This means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
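The arithmetic behind "10 shards per index" is primaries times one plus the number of replicas. The sketch below also shows how those two numbers might appear in an index-creation request body: `number_of_shards` and `number_of_replicas` are the index settings involved, while the helper `total_shards` is a hypothetical name used only here.

```python
import json


def total_shards(number_of_primary_shards, number_of_replicas):
    """Primaries plus one full set of copies per replica."""
    return number_of_primary_shards * (1 + number_of_replicas)


# Settings body with the defaults described in the text above.
settings_body = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
print(json.dumps(settings_body))
print(total_shards(5, 1))  # 10
```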

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
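The limit can be checked with a line of arithmetic, since Java's Integer.MAX_VALUE is 2^31 - 1. The helper `headroom` below is a hypothetical convenience for estimating how far a shard is from the limit; the document count passed to it is an arbitrary example, not measured data.

```python
JAVA_INTEGER_MAX_VALUE = 2**31 - 1  # 2,147,483,647
LUCENE_MAX_DOCS = JAVA_INTEGER_MAX_VALUE - 128


def headroom(doc_count):
    """Fraction of the per-shard document limit that is still unused."""
    return 1 - doc_count / LUCENE_MAX_DOCS


print(LUCENE_MAX_DOCS)  # 2147483519
print(round(headroom(1_000_000_000), 3))
```

A single shard holding the billion documents from the earlier example would already use close to half of the limit, which is one more reason to split such an index across several shards.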