Using YARN with Cgroups testing in sparkml cluster

Deployment servers:

sparkml cluster

########### sparkml ##########

sparkml-node1 # yarn resource manager
sparkml-node2 # nodemanager spark-2.0.0
sparkml-node3 # nodemanager spark-2.0.0
sparkml-node4 # nodemanager spark-2.0.0
sparkml-node5 # nodemanager spark-2.0.0

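Features being rolled out:

Feature 1: Cgroups cap the total CPU that YARN containers can use on each node.
Feature 2: Each YARN container shares CPU in proportion to the number of vcores allocated to it.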

Test method:

Feature 1 test:

With no limit in place, we run a Hive SQL query:

test_hive_sql.sql

Container allocation for this run:

[Screenshot]

CPU usage on the 4 NodeManager nodes:

[Screenshot]

All of them are close to 100%.

We now try limiting CPU usage to 50%.

Set cpu.cfs_quota_us="1200000"; (calculation: 24 (logical CPU cores) * 0.5 (50% CPU usage) * 100000 (length of each scheduling period, in microseconds) = 1200000)
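The quota arithmetic above can be sketched as a small helper. This is only an illustration: the function name `cfs_quota_us` and its parameters are made up for this example, and the default period of 100000 microseconds is taken from the calculation above.

```python
def cfs_quota_us(logical_cores: int, cpu_fraction: float, period_us: int = 100_000) -> int:
    """Return a cpu.cfs_quota_us value that caps a cgroup at cpu_fraction of the node.

    logical_cores: number of logical CPU cores on the node
    cpu_fraction:  share of the whole node the cgroup may use, in (0, 1]
    period_us:     cpu.cfs_period_us, the length of one scheduling period
    """
    if logical_cores <= 0:
        raise ValueError("logical_cores must be positive")
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    return round(logical_cores * cpu_fraction * period_us)


# The node from the test: 24 logical cores, capped at 50%.
print(cfs_quota_us(24, 0.5))  # 1200000
```

The quota is larger than the period because it is summed over all cores: 1200000 microseconds of CPU time per 100000-microsecond period corresponds to 12 fully busy cores.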

Restart the cgconfig service: /etc/init.d/cgconfig restart

Run the same SQL again:

Container allocation is essentially the same:

[Screenshot]

CPU usage on the NodeManager servers:

[Screenshot]

Every node stays within 50%.
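Besides watching overall CPU usage, one way to confirm that the quota is actually being enforced is to look at the throttling counters that the cgroup v1 cpu controller exposes in cpu.stat (nr_periods, nr_throttled, throttled_time). The sketch below parses that file format. The path /cgroup/cpu/hadoop-yarn/cpu.stat is an assumption derived from the mount point and group name used in this test, and the sample numbers are invented for illustration, not measured.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: dict) -> float:
    """Fraction of scheduling periods in which the cgroup hit its quota."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Invented sample content; on a node it would be read from the assumed path
# /cgroup/cpu/hadoop-yarn/cpu.stat.
sample = "nr_periods 1000\nnr_throttled 400\nthrottled_time 123456789\n"
stats = parse_cpu_stat(sample)
print(throttled_ratio(stats))  # 0.4
```

A ratio that stays at zero while containers are busy would suggest the quota is not being hit (or not applied); a non-zero ratio shows the scheduler is holding the group back.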

Feature 2 test:

The containers launched by the Hive SQL each use only one vcore (a MapReduce trait?), so we use Spark for this test instead:

We run the following code:

from __future__ import print_function
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import sys
from random import random
from operator import add
from pyspark import SparkContext
import time

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        for i in range(1,10000):
            x = random() * random() * random() - 1
            y = random() * random() * random() - 1
        #time.sleep(60)
        x = random() * random() * random() - 1
        y = random() * random() * random() - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()

Container allocation:

[Screenshot]

On the server that runs 1 container with 4 vcores:

[Screenshot]

We then run the test Hive SQL:

[Screenshot]

On node4:

[Screenshot]

spark_sc uses only 100% CPU (one core), which is less than the other containers from hdfs that have 1 vcore each.

This is because the Python code above has no concurrency, so it can only use one core.

[Screenshot]

There are 5 containers on this server:

[Screenshot]

Only the last container has a cpu.shares value of 4096, 4 times that of the others.

[Screenshot]

These results match the vcore allocation we observed. The Python code uses less CPU than the containers generated by the Hive SQL here because Python runs as a single process and is not scheduled across multiple cores.
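To make the expected effect of these cpu.shares values concrete, the sketch below computes each container's share of CPU time when all sibling cgroups are competing for the CPU at once. The container names are hypothetical; the values 1024 and 4096 follow from the observation above that one container has 4096, 4 times the others. The proportional split applies only under full contention among CPU-bound containers, and a single-threaded process still cannot use more than one core, which is why the test above does not show it directly.

```python
def cpu_fractions(shares: dict) -> dict:
    """Split CPU time among sibling cgroups in proportion to their cpu.shares.

    Only meaningful when every cgroup is runnable and CPU-bound at the same time.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one cgroup must have positive shares")
    return {name: value / total for name, value in shares.items()}


# Hypothetical names for the 5 containers on the node: four 1-vcore containers
# and one 4-vcore container.
shares = {
    "hive_1": 1024,
    "hive_2": 1024,
    "hive_3": 1024,
    "hive_4": 1024,
    "spark_sc": 4096,
}
fractions = cpu_fractions(shares)
print(fractions["spark_sc"])  # 0.5
print(fractions["hive_1"])    # 0.125
```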

Test results:

Feature 1: works.

Feature 2: works. CPU is allocated according to vcores by controlling cpu.shares, but we lack direct test data that shows it.

Configuration parameters:

yarn.nodemanager.container-executor.class : org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor

yarn.nodemanager.linux-container-executor.resources-handler.class : org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler

yarn.nodemanager.linux-container-executor.cgroups.hierarchy : /hadoop-yarn (the cgroup hierarchy under /cgroup/cpu/, configured manually in cgconfig.conf)

yarn.nodemanager.linux-container-executor.cgroups.mount : true

yarn.nodemanager.linux-container-executor.cgroups.mount-path : /cgroup (root directory of the cgroup filesystem)

yarn.nodemanager.linux-container-executor.group : yarn

yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users : false

Parameters that do not take effect:

yarn.nodemanager.resource.percentage-physical-cpu-limit : 100 (this parameter controls the overall CPU usage of a NodeManager node; hadoop-2.5.0-cdh5.3.2 does not support it, but the same effect can be achieved by setting cpu.cfs_quota_us in cgconfig.conf)

yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage : false (hard limit on CPU use)

cgroup configuration:

#
#  Copyright IBM Corporation.
#
#  Authors:    Balbir Singh <balbir@linux.vnet.ibm.com>
#  This program is free software; you can redistribute it and/or modify it
#  under the terms of version 2.1 of the GNU Lesser General Public License
#  as published by the Free Software Foundation.
#
#  This program is distributed in the hope that it would be useful, but
#  WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See man cgconfig.conf for further details.
#
# By default, mount all controllers to /cgroup/<controller>
mount {
    cpuset    = /cgroup/cpuset;
    cpu    = /cgroup/cpu;
    cpuacct    = /cgroup/cpuacct;
    memory    = /cgroup/memory;
    devices    = /cgroup/devices;
    freezer    = /cgroup/freezer;
    net_cls    = /cgroup/net_cls;
    blkio    = /cgroup/blkio;
}

group hadoop-yarn {
    perm {
        task {
            uid = yarn;
            gid = hadoop;
        } admin {
            uid = yarn;
            gid = hadoop;
        }
    }
    cpu {
#             cpu.shares="1024";
#             cpu.cfs_period_us="100000";
#             cpu.cfs_quota_us="1200000";
    }
}

How it works, in brief:

Cgroups use a cgroup hierarchy to tie subsystems and tasks together. Every time YARN launches a container, it creates a new hierarchy for that container under the designated hadoop-yarn cgroup hierarchy.

[Screenshot]

After containers start running:

[Screenshot]

The overall per-node CPU limit is not supported in the version we run in production: YarnConfiguration.java does not read the yarn.nodemanager.resource.percentage-physical-cpu-limit parameter, and CgroupsLCEResourcesHandler has no corresponding implementation (for the implementation, see YARN-2440).

We therefore set cpu.cfs_quota_us on hadoop-yarn itself. None of the container cgroup hierarchies under hadoop-yarn can exceed the limit of the parent hierarchy.
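The parent-limits-children behaviour can be sketched as taking the tightest quota along the path from the hadoop-yarn group down to a container. This is a simplified model, not YARN or kernel code: the function names are invented for this example, it assumes every level uses the same cpu.cfs_period_us, and it relies on the cgroup v1 convention that a negative cpu.cfs_quota_us means "no limit".

```python
from typing import Optional, Sequence


def quota_to_cores(quota_us: int, period_us: int = 100_000) -> Optional[float]:
    """Convert cpu.cfs_quota_us to a number of cores; a negative quota means no limit."""
    if quota_us < 0:
        return None
    return quota_us / period_us


def effective_core_limit(quotas_root_to_leaf: Sequence[int],
                         period_us: int = 100_000) -> Optional[float]:
    """Tightest CPU cap along a cgroup path, or None if no level sets a quota."""
    limits = [quota_to_cores(q, period_us) for q in quotas_root_to_leaf]
    finite = [limit for limit in limits if limit is not None]
    return min(finite) if finite else None


# hadoop-yarn is capped at 1200000; the container cgroup below it sets no quota (-1).
print(effective_core_limit([1_200_000, -1]))  # 12.0
```

Because the cap sits on the parent, it bounds the sum of all containers under hadoop-yarn, not each container individually.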

For feature 2:

This was added to the CgroupsLCEResourcesHandler class by YARN-600:

if (isCpuWeightEnabled()) {
  createCgroup(CONTROLLER_CPU, containerName);
  int cpuShares = CPU_DEFAULT_WEIGHT * containerResource.getVirtualCores();
  // absolute minimum of 10 shares for zero CPU containers
  cpuShares = Math.max(cpuShares, 10);
  updateCgroup(CONTROLLER_CPU, containerName, "shares",
      String.valueOf(cpuShares));
}

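cpuShares has a minimum value of 10; each container is given a cpu.shares value that corresponds to its VirtualCores.

The Linux CFS scheduler applies the cpu.shares value when scheduling CPU time. For details, see: how cpu.shares works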
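As a quick check of the rule in the Java snippet above, the same calculation can be mirrored in a few lines of Python. The value 1024 for CPU_DEFAULT_WEIGHT is an assumption here; it is consistent with the 4-vcore container in the test showing a cpu.shares of 4096.

```python
CPU_DEFAULT_WEIGHT = 1024  # assumed value, consistent with 4 vcores -> 4096 in the test
MIN_CPU_SHARES = 10        # floor used for containers that request zero vcores


def cpu_shares(virtual_cores: int) -> int:
    """Mirror of the cpu.shares calculation: weight times vcores, floored at 10."""
    return max(CPU_DEFAULT_WEIGHT * virtual_cores, MIN_CPU_SHARES)


for vcores in (0, 1, 4):
    print(vcores, cpu_shares(vcores))
# 0 10
# 1 1024
# 4 4096
```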

Deployment procedure:

yarn-site.xml

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
  <value>/cgroup</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value></value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
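Before restarting YARN it can be useful to confirm that the deployed yarn-site.xml really contains these settings. The following is a minimal sketch using only the Python standard library. The helper names and the short sample document are made up for illustration; it assumes the properties sit inside the usual <configuration> root element of a Hadoop configuration file.

```python
import xml.etree.ElementTree as ET

# Settings from the list above that must match exactly (a subset, for brevity).
EXPECTED = {
    "yarn.nodemanager.container-executor.class":
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
    "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "/hadoop-yarn",
    "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/cgroup",
}


def load_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> element in a Hadoop config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def mismatches(props: dict, expected: dict) -> dict:
    """Map each wrong or missing property name to (expected value, actual value or None)."""
    return {name: (want, props.get(name))
            for name, want in expected.items() if props.get(name) != want}


# Illustrative sample; in practice xml_text would be read from the deployed yarn-site.xml.
sample = """<configuration>
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
    <value>/hadoop-yarn</value>
  </property>
</configuration>"""

problems = mismatches(load_properties(sample), EXPECTED)
print(sorted(problems))  # only the mount-path property is missing from the sample
```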

Deploy cgroup

Rebuild container-executor:

cd ${HADOOP_HOME}/hadoop-2.6.-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/
cmake src -DHADOOP_CONF_DIR=/etc/hadoop
make

The container-executor binary we need is then available under target/usr/local/bin/ (cd target/usr/local/bin/).

Configure container-executor.cfg

yarn.nodemanager.linux-container-executor.group=yarn
banned.users=bin
min.user.id=
allowed.system.users=hdfs,yarn

Start cgroup

Restart YARN

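Follow-up:

Investigate whether YARN supports a gradual (canary) rollout of cgroups.

For now, we apply cgroups from the outside by running cgclassify repeatedly: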

#!/bin/bash
# Move every JVM owned by yarn (except the NodeManager and jps itself)
# into the hadoop-yarn cpu cgroup.

echo ""
echo ""

# Take a single jps snapshot; each line is "<pid> <main class>".
containerList=$(su - yarn -c 'jps | grep -v NodeManager | grep -v -i jps')
# The first column of each line is the PID.
containerPid=$(echo "${containerList}" | awk '{print $1}')

echo " We will begin to move ${containerList} of yarn to cgroup "

for pid in ${containerPid}
do
  cgclassify -g cpu:hadoop-yarn "$pid"
done

echo " Move to cgroup per minute done "

# Show which tasks are now attached to the hadoop-yarn hierarchy.
taskID=$(cat /cgroup/cpu/hadoop-yarn/tasks)
echo " Content in hadoop-yarn hierarchy is : ${taskID} "

date
echo ""
echo ""
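The filtering step of the script above (drop the NodeManager and jps itself, keep the PID column) is easy to get wrong, so here is a small Python sketch of the same logic that can be exercised on captured jps output. The function names are invented for this example and the sample output is illustrative, not taken from the cluster. The sketch only builds the cgclassify command lines and does not run them.

```python
def container_pids(jps_output: str) -> list:
    """Return PIDs of all JVMs except the NodeManager and jps itself."""
    pids = []
    for line in jps_output.splitlines():
        # Same filters as: grep -v NodeManager | grep -v -i jps
        if "NodeManager" in line or "jps" in line.lower():
            continue
        fields = line.split()
        if fields:
            pids.append(fields[0])  # first column is the PID, like awk '{print $1}'
    return pids


def cgclassify_commands(pids: list, group: str = "cpu:hadoop-yarn") -> list:
    """Build one cgclassify argument list per PID (not executed here)."""
    return [["cgclassify", "-g", group, pid] for pid in pids]


# Illustrative jps output for the yarn user.
sample = "12345 NodeManager\n12346 Jps\n23456 YarnChild\n23457 MRAppMaster\n"
pids = container_pids(sample)
print(pids)                          # ['23456', '23457']
print(cgclassify_commands(pids)[0])  # ['cgclassify', '-g', 'cpu:hadoop-yarn', '23456']
```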

Deploy this as a crontab job that runs once a minute and observe the effect.

To be continued.