ES Part 10: Elasticsearch for Apache Hadoop

Date: 2022-01-25 21:54:41

Elasticsearch for Apache Hadoop (elasticsearch-hadoop.jar) allows Hadoop jobs (MapReduce, Hive, Pig, Cascading, Spark) to interact with Elasticsearch.

At the core, elasticsearch-hadoop integrates two distributed systems: Hadoop, a distributed computing platform, and Elasticsearch, a real-time search and analytics engine. From a high-level view, both provide a computational component: Hadoop through Map/Reduce or recent libraries like Apache Spark on one hand, and Elasticsearch through its search and aggregations on the other. elasticsearch-hadoop's goal is to connect these two entities so that they can transparently benefit from each other.

Map/Reduce and Shards

A critical component for scalability is parallelism, or splitting a task into multiple smaller ones that execute at the same time on different nodes in the cluster. The concept is present in both Hadoop, through its splits (the number of parts into which a source or input can be divided), and Elasticsearch, through its shards (the number of parts into which an index is divided). In short, roughly speaking, more input splits means more tasks that can read different parts of the source at the same time, and more shards means more buckets from which to read an index's content (at the same time). As such, elasticsearch-hadoop uses splits and shards as the main drivers behind the number of tasks executed within the Hadoop and Elasticsearch clusters, as they have a direct impact on parallelism. Hadoop splits as well as Elasticsearch shards play an important role in a system's behavior; we recommend familiarizing yourself with the two concepts to better understand your system's runtime semantics.

Apache Spark and Shards

While Apache Spark is not built on top of Map/Reduce, it shares similar concepts: it features the notion of a partition, which is the rough equivalent of the Elasticsearch shard or the Map/Reduce split. Thus, the analogy above applies here as well: more shards and/or more partitions increase the degree of parallelism and thus allow both systems to scale better. Due to the similarity in concepts, throughout the docs one can think interchangeably of the Hadoop InputSplit and the Spark Partition.

Reading from Elasticsearch

Shards play a critical role when reading information from Elasticsearch. Since Elasticsearch acts as the source, elasticsearch-hadoop creates one Hadoop InputSplit (or, in the case of Apache Spark, one Partition) per Elasticsearch shard. That is, given a query against index I, elasticsearch-hadoop dynamically discovers the number of shards backing I and then, for each shard, creates an input split in the case of Hadoop (which determines the maximum number of Hadoop tasks to be executed) or a partition in the case of Spark (which determines the RDD's maximum parallelism).

With the default settings, Elasticsearch uses 5 primary shards per index, which results in the same number of tasks on the Hadoop side for each query.
elasticsearch-hadoop does not query the same shards - it iterates through all of them (primaries and replicas) using a round-robin approach. To avoid data duplication, only one shard is used from each shard group (primary and replicas).

A common concern (read: optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.

Writing to Elasticsearch

Writing to Elasticsearch is driven by the number of Hadoop input splits (or tasks) or Spark partitions available. elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.

Whenever possible, elasticsearch-hadoop shares the Elasticsearch cluster information with Hadoop and Spark to facilitate data co-location. In practice, this means whenever data is read from Elasticsearch, the source nodes' IPs are passed on to Hadoop and Spark to optimize task execution. If co-location is desired and possible, hosting the Elasticsearch, Hadoop, and Spark clusters within the same rack will provide significant network savings.

Common settings:

Required:

es.resource.read

Which index to read data from. The value format is <index>/<type>, e.g. artists/_doc.

Multiple indices are supported: artists,bank/_doc reads from the _doc type of the artists and bank indices; artists,bank/ reads from the artists and bank indices with any type; _all/_doc reads from the _doc type of all indices.
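As a sketch, a Hive external table that reads from both indices might be declared like this (the table name and columns are hypothetical; es.nodes and es.port are assumed to point at a local cluster):

CREATE EXTERNAL TABLE es_artists_bank (
  id BIGINT,
  name STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource.read' = 'artists,bank/_doc',
  'es.nodes' = 'localhost',
  'es.port' = '9200'
);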

es.resource.write

Which index to write data to. The value format is <index>/<type>, e.g. artists/_doc.

Multiple indices are not supported, but dynamic indices are. The index name can be derived from one or more fields of the document. For example, if a document has the fields id, name, password, age, created_date, and updated_date, then es.resource.write can be set to {name}/_doc; formatting is even supported, e.g. {updated_date|yyyy-MM-dd}/_doc. There appears to be a bug here, though. In my tests, es.index.auto.create must be set to true, otherwise the job fails with "Target index [{name}/_doc] does not exist and auto-creation is disabled [setting 'es.index.auto.create' is 'false']", even when the corresponding index exists. In real production environments, however, indices are never auto-created; they are always created explicitly through deployment scripts.
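For illustration, a write-side table using a date-formatted dynamic index could look like the following (hypothetical names; es.index.auto.create is set to true because of the behavior described above):

CREATE EXTERNAL TABLE es_events_write (
  id BIGINT,
  name STRING,
  updated_date TIMESTAMP
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource.write' = '{updated_date|yyyy-MM-dd}/_doc',
  'es.index.auto.create' = 'true'
);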

es.resource

Which index to read from and write to. The value format is <index>/<type>, e.g. artists/_doc. When the read and write index are the same, this setting simplifies the configuration.

es.nodes

The Elasticsearch cluster address. Defaults to localhost.

es.port

The Elasticsearch cluster port. Defaults to 9200.

es.write.operation

The Elasticsearch operation used when inserting documents. One of index, create, update, or upsert; defaults to index.

index: based on the document id, inserts the document if it does not exist and replaces it if it does. This is Elasticsearch's native index operation.

create: based on the document id, inserts the document if it does not exist, otherwise throws an exception.

update: based on the document id, updates the document if it exists, otherwise throws an exception.

Update example:

Suppose the original document is {"id" : 1, "name" : "zhangsan", "password" : "abc123"} and the new document is {"id" : 1, "name" : "lisi", "age" : 20};

then after the update the document is {"id" : 1, "name" : "lisi", "password" : "abc123", "age" : 20}.

upsert: based on the document id, inserts the document if it does not exist, otherwise updates it. The update semantics are the same as above.
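For example, an upsert-style write table might be configured as follows (a sketch with hypothetical names; es.mapping.id is explained further below):

CREATE EXTERNAL TABLE es_users_write (
  id BIGINT,
  name STRING,
  password STRING,
  age INT
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource.write' = 'users/_doc',
  'es.write.operation' = 'upsert',
  'es.mapping.id' = 'id'
);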

es.input.json / es.output.json

es.input.json: true or false. When the data written to Elasticsearch is a JSON string, this controls whether the string is parsed into the index's individual fields or simply stored as a plain string. Defaults to false, i.e. not parsed.

Example:

When integrating Elasticsearch with Hive, suppose the Hive external table es_test maps to the Elasticsearch index test/_doc, and the test index has the fields id, name, password, age, created_date, and updated_date.

Case 1: es_test has the fields id, name, password, age, created_date, and updated_date

In this case, es.input.json must not be set to true in the CREATE TABLE statement; only false works. Otherwise, inserting data fails with "org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: When using JSON input, only one field is expected".

Case 2: es_test has only one field, data

In this case, with es.input.json set to false, inserting a JSON string throws a NullPointerException. With it set to true, the values of the individual fields in the JSON string are parsed into the corresponding index fields, just like a normal insert.
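A minimal sketch of the single-field variant described in case 2 (the table name is hypothetical):

CREATE EXTERNAL TABLE es_test_json (
  data STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource.write' = 'test/_doc',
  'es.input.json' = 'true'
);

-- each row's data column holds one complete JSON document, e.g.
-- {"id" : 1, "name" : "zhangsan", "password" : "abc123"}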

es.output.json: true or false; defaults to false. When true, data read from Elasticsearch through elasticsearch-hadoop.jar is returned directly as JSON strings.

Best practice: use es.input.json and es.output.json with caution. A one-to-one mapping between Hive fields and Elasticsearch fields is much cleaner and saves all kinds of hassle.

es.mapping.id

When writing data to Elasticsearch, this specifies which field of the data supplies the document id. If it is not specified, Elasticsearch generates document ids automatically, so every inserted row creates a new document and updates become impossible. In production, therefore, es.mapping.id must be configured; its value should be the name of a field whose values are unique, such as a primary key id, article id, product id, customer id, or order id.
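As a usage sketch building on the upsert table above (users_staging is a hypothetical native Hive table), repeated runs of the same insert update the documents in place by id instead of piling up auto-generated duplicates:

-- document _id is taken from the id column via es.mapping.id
INSERT INTO TABLE es_users_write
SELECT id, name, password, age
FROM users_staging;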

es.mapping.date.rich

Whether to create a rich Date-like object for Date fields in Elasticsearch or return them as primitives (String or long). By default this is true. The actual object type is based on the library used; a notable exception is Map/Reduce, which provides no built-in Date object, and as such LongWritable and Text are returned regardless of this setting.
