spark第四篇:Running Spark on YARN

时间:2023-03-09 19:10:59
spark第四篇:Running Spark on YARN

确保HADOOP_CONF_DIR或者YARN_CONF_DIR指向hadoop集群配置文件目录。这些配置用来写数据到hdfs以及连接yarn ResourceManager。(在$SPARK_HOME/conf/spark-env.sh中,添加export HADOOP_CONF_DIR=/home/koushengrui/app/hadoop/etc/hadoop)。The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).

spark on yarn 有两种部署模式。In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN。cluster 模式,spark 驱动程序运行在应用主进程内。client 模式,驱动程序运行在客户端进程中,应用主进程只负责向yarn 请求资源。

不像spark standalone 和 mesos 模式,这两种模式下master 的地址由--master 参数指定,yarn 模式,ResourceManager 的地址从hadoop 配置中取。因此,--master 参数的值是yarn。

以cluster 模式启动应用:

./spark-submit --class class_name --master yarn --deploy-mode cluster [options] <app jar> [app options]

例如:

./spark-submit --class org.apache.spark.examples.SparkPi \
                        --master yarn \
                        --deploy-mode cluster \
                        --driver-memory 4g \
                        --executor-memory 2g \
                        --executor-cores 1 \
                        --queue queue_name \
                        /path/spark-example*.jar \
                        10

The above starts a YARN client program which starts the default Application Master. Then SparkPi will run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running。

以client 模式启动应用:

和上面一样,除了--deploy-mode 参数值为client

./spark-submit --class class_name --master yarn --deploy-mode client [options] <app jar> [app options]

例如可以以client 模式运行spark-shell:

./spark-shell --master yarn --deploy-mode client

添加其他jar

利用spark把hive数据导到hbase:

spark-submit                                             \
--class com.kou.spark.util.Hive2Hbase \
--master yarn \
--deploy-mode client \
--executor-memory 500m \
--driver-memory 500m \
--num-executors 2 \
--executor-cores 2 \
--queue ${spark_queuename} \
--conf spark.sql.autoBroadcastJoinThreshold=20971520 \
--conf spark.default.parallelism=40 \
--conf spark.sql.shuffle.partitions=40 \
--conf spark.speculation=false \
--conf spark.task.maxFailures=40 \
--conf spark.akka.timeout=300 \
--conf spark.network.timeout=300 \
--conf spark.yarn.max.executor.failures=40 \
--conf spark.executor.extraJavaOptions="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+HeapDumpOnOutOfMemoryError -verbose:gc " \
spark-hive2Hbase.jar "${appName}" "${sql}" "${outputTable}" "${phoenix_jdbc_url}"
hiveContext.sql("use sx_ela_safe")
hiveContext.sql("set mapred.job.queue.name=" + HDP_QUEUE_NAME)
hiveContext.sql("set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
hiveContext.sql("set hive.merge.mapredfiles=true")
hiveContext.sql("set hive.merge.smallfiles.avgsize=100000000")
hiveContext.sql("set mapred.combine.input.format.local.only=false")
// 创建上下文
val sparkConf = new SparkConf().setAppName(s"${args(4)}")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max", "300m")
.set("spark.sql.parquet.compression.codec", "snappy")
.set("spark.sql.parquet.mergeSchema", "true")
.set("spark.sql.parquet.binaryAsString", "true")
.set("spark.streaming.kafka.maxRatePerPartition", s"${args(2)}")
def InitEnvConfig(conf: SparkConf) = {
brokers = conf.get(KAFKA_METADATA_BROKER_LIST)
zkConnectString = conf.get(ZOOKEEPER_QUORUM)
phoenixZkUrl = conf.get(PHOENIX_JDBC_URL)
hdfsRootPath = conf.get(HDFS_ROOT_PATH)
spark_deploy_mode = conf.get(SPARK_MASTER_URL)
HDP_QUEUE_NAME = conf.get(HADOOP_QUEUE_NAME)
}