MongoDB Connector for Hadoop

Date: 2022-08-28 12:28:05

https://github.com/mongodb/mongo-hadoop

Purpose

The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce tasks. It is designed to provide greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.

Current stable release: 1.2.0

Features

  • Can create data splits to read from standalone, replica set, or sharded configurations
  • Source data can be filtered with queries using the MongoDB query language
  • Supports Hadoop Streaming, so that job code can be written in any language (Python, Ruby, and Node.js are currently supported)
  • Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
  • Can write data out in .bson format, which can then be imported to any MongoDB database with mongorestore
  • Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive
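
As an illustration of the query-filtering feature, the following minimal Python sketch serializes a MongoDB query document and assembles matching `-D` job options. The property names `mongo.input.uri` and `mongo.input.query` are assumptions here and should be checked against the Configuration documentation for your release; the URI, database, collection, and field names are hypothetical.

```python
import json


def build_input_options(uri, query):
    """Return Hadoop -D options that point a job at a filtered MongoDB input."""
    # The filter is an ordinary MongoDB query document, serialized as JSON.
    query_json = json.dumps(query, sort_keys=True, separators=(",", ":"))
    return [
        "-D", "mongo.input.uri=" + uri,            # assumed property name
        "-D", "mongo.input.query=" + query_json,   # assumed property name
    ]


# Hypothetical collection and fields, for illustration only.
options = build_input_options(
    "mongodb://localhost:27017/mydb.events",
    {"status": "active", "score": {"$gte": 10}},
)
print(" ".join(options))
```

Only documents matching the query would be handed to the mappers, so filtering here avoids reading the whole collection into the job.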

Download

See the release page.

Building

To build, first edit the value of hadoopRelease in ThisBuild in the build.sbt file to select the distribution of Hadoop that you want to build against. For example, to build for CDH4:

hadoopRelease in ThisBuild := "cdh4"

or for Hadoop 1.0.x:

hadoopRelease in ThisBuild := "1.0"

To determine which value you need to set in this file, refer to the list of distributions below. Then run ./sbt package to build the jars, which will be generated in the core/target/ directory.

After a successful build, copy the jars to the lib directory on each node in your Hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

  • $HADOOP_HOME/lib/
  • $HADOOP_HOME/share/hadoop/mapreduce/
  • $HADOOP_HOME/share/hadoop/lib/
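
The copy step for a single node can be sketched in Python as follows. This is only an illustration: the helper names `find_lib_dirs` and `install_jars` are hypothetical, and the sketch does not distribute anything across the cluster, so it would have to be run on every node (or wrapped in whatever deployment tooling you already use).

```python
import glob
import os
import shutil

# Candidate lib directories, relative to $HADOOP_HOME, as listed above.
CANDIDATE_LIB_DIRS = ["lib", "share/hadoop/mapreduce", "share/hadoop/lib"]


def find_lib_dirs(hadoop_home):
    """Return the candidate lib directories that exist under hadoop_home."""
    paths = [os.path.join(hadoop_home, rel) for rel in CANDIDATE_LIB_DIRS]
    return [p for p in paths if os.path.isdir(p)]


def install_jars(jar_dir, lib_dir):
    """Copy every .jar file found in jar_dir into lib_dir.

    Returns the list of destination paths that were written.
    """
    copied = []
    for jar in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        copied.append(shutil.copy(jar, lib_dir))
    return copied
```

A typical call would pass `os.environ["HADOOP_HOME"]` to `find_lib_dirs`, pick the directory that matches your Hadoop release, and then call `install_jars("core/target", chosen_dir)`.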

Supported Distributions of Hadoop

  • Apache Hadoop 1.0

    Does not support Hadoop Streaming.

    Build using "1.0" or "1.0.x"

  • Apache Hadoop 1.1

    Includes support for Hadoop Streaming.

    Build using "1.1" or "1.1.x"

  • Apache Hadoop 0.20.*

    Does not support Hadoop Streaming.

    Includes Pig 0.9.2.

    Build using "0.20" or "0.20.x"

  • Apache Hadoop 0.23

    Includes Pig 0.9.2.

    Includes support for Streaming.

    Build using "0.23" or "0.23.x"

  • Cloudera Distribution for Hadoop Release 4

    This is the newest release from Cloudera, and it is based on Apache Hadoop 2.0. The newer MR2/YARN APIs are not yet supported, but MR1 is still fully compatible.

    Includes support for Streaming and Pig 0.11.1.

    Build with "cdh4"

  • Apache Hadoop 2.2

    Includes Pig 0.9.2.

    Includes support for Streaming.

    Build using "2.2" or "2.2.x"
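
The list above can be condensed into a small lookup table. The Python sketch below transcribes the build values and Streaming support from that list and renders the matching build.sbt line; the names `DISTRIBUTIONS`, `sbt_setting`, and `streaming_capable` are hypothetical helpers, not part of the connector.

```python
# Build values and Streaming support, transcribed from the list above.
# Where two build values are accepted (e.g. "1.0" or "1.0.x"), the
# shorter one is used.
DISTRIBUTIONS = {
    "Apache Hadoop 1.0": {"build": "1.0", "streaming": False},
    "Apache Hadoop 1.1": {"build": "1.1", "streaming": True},
    "Apache Hadoop 0.20.*": {"build": "0.20", "streaming": False},
    "Apache Hadoop 0.23": {"build": "0.23", "streaming": True},
    "Cloudera Distribution for Hadoop Release 4": {"build": "cdh4", "streaming": True},
    "Apache Hadoop 2.2": {"build": "2.2", "streaming": True},
}


def sbt_setting(distribution):
    """Return the build.sbt line that selects the given distribution."""
    build = DISTRIBUTIONS[distribution]["build"]
    return 'hadoopRelease in ThisBuild := "%s"' % build


def streaming_capable():
    """Return the distributions listed as supporting Hadoop Streaming."""
    return sorted(name for name, info in DISTRIBUTIONS.items() if info["streaming"])


print(sbt_setting("Cloudera Distribution for Hadoop Release 4"))
```

Keeping the mapping in one place makes it easy to check, before building, whether the target distribution supports Streaming at all.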

Configuration

Configuration

Streaming

Streaming

Examples

Examples

Usage with static .bson (mongo backup) files

BSON Usage
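
For orientation, a .bson backup file is a plain concatenation of BSON documents, each starting with a little-endian 32-bit length that counts the whole document, including the length field itself and the trailing zero byte. The Python sketch below walks those boundaries using only the standard library. It illustrates the file layout and is not the connector's own split logic; `iter_bson_documents` is a hypothetical helper name.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield (offset, raw_bytes) for each BSON document in a binary stream."""
    offset = 0
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix at offset %d" % offset)
        # The first four bytes hold the total document length (little-endian).
        (length,) = struct.unpack("<i", header)
        if length < 5:
            raise ValueError("invalid length %d at offset %d" % (length, offset))
        body = stream.read(length - 4)
        # Every BSON document ends with a single zero byte.
        if len(body) != length - 4 or body[-1:] != b"\x00":
            raise ValueError("malformed document at offset %d" % offset)
        yield offset, header + body
        offset += length


# {"hello": "world"} encoded by hand, following the BSON specification.
SAMPLE = b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"

offsets = [off for off, _ in iter_bson_documents(io.BytesIO(SAMPLE * 3))]
print(offsets)  # [0, 22, 44]
```

Because document boundaries can be found from the length prefixes alone, a large file can be divided at document offsets without decoding the documents themselves.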

Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires that the bootstrap actions fetch the dependencies (MongoDB Java driver, mongo-hadoop-core libs, etc.) and place them into the Hadoop distribution's lib folders.

For a full example (running the enron example on Elastic MapReduce) please see here.

Usage with Pig

Documentation on Pig with the MongoDB Connector for Hadoop.

For examples on using Pig with the MongoDB Connector for Hadoop, also refer to the examples section.

Notes for Contributors

If your code introduces new features, please add tests that cover them if possible, and make sure that the existing test suite still passes. If you're not sure how to write a test for a feature, or have trouble with a test failure, please post to the Google Group with details and we will try to help.

Maintainers

Mike O'Brien (mikeo@10gen.com)

Contributors

Support

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/
