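Preface
Today our YARN cluster ran into a strange problem: even though enough resources were available, submitted jobs stayed in the ACCEPTED state and never ran.
Our cluster runs CDH-5.13.3-1.cdh5.13.3.p0.2. Jobs submitted to any of the queues we actually use could not run, while jobs submitted to one other queue ran fine. Since we do not use that queue, the YARN cluster was effectively out of service.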
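Diagnosis
The affected queue had enough resources but still could not run jobs, which ruled out the queue itself as the cause.
Checking the YARN logs (under /var/log/hadoop-yarn), we found the following entries appearing frequently over the past day: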
2019-05-20 00:51:36,885 INFO : Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, removing app application_1556100181928_29377 from state store.
2019-05-20 00:51:36,887 INFO : Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 10000, removing app application_1556100181928_29377 from memory:
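This happens because YARN keeps completed jobs in memory, and the number it keeps is limited. In our case, once that limit was reached, new jobs could no longer run and the ResourceManager kept printing the log lines above.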
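To get a feel for how often these entries occur, the log lines can be counted per hour. This is a minimal sketch: the function name `count_evictions_per_hour` is hypothetical, and it assumes each log line starts with a date and a time in the format shown above. Pass it whichever ResourceManager log files exist under /var/log/hadoop-yarn on your host.

```shell
#!/usr/bin/env bash
# count_evictions_per_hour LOG_FILE...
# Hypothetical helper: counts "state store met" eviction lines per hour.
# Assumes lines look like: "2019-05-20 00:51:36,885 INFO : Max number of ..."
count_evictions_per_hour() {
  # -h suppresses file name prefixes when several files are given
  grep -h 'Max number of completed apps kept in state store met' "$@" |
    # $1 is the date, $2 is the time; keep only the hour part of the time
    awk '{ print $1, substr($2, 1, 2) ":00" }' |
    sort | uniq -c
}
```

Each output line is a count followed by the date and hour, so a sustained run of high counts shows how long the ResourceManager has been evicting completed applications.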
The parameters related to this setting are:
Parameter | Description |
---|---|
-completed-applications | The maximum number of completed applications RM keeps. Default value: 10000 |
-completed-applications | The maximum number of completed applications RM state store keeps, less than or equal to ${-completed-applications}. By default, it equals ${-completed-applications}. This ensures that the applications kept in the state store are consistent with the applications remembered in RM memory. Any value larger than ${-completed-applications} will be reset to ${-completed-applications}. Note that this value impacts RM recovery: a smaller value indicates better performance on RM recovery. Default value: ${-completed-applications} |
 | The class to use as the persistent store. If the ZooKeeper-based store is used, the store is implicitly fenced, meaning a single ResourceManager is able to use the store at any point in time. |
-path | Full path of the ZooKeeper znode where RM state will be stored. This must be supplied when the ZooKeeper-based store is configured as the persistent store class. Default value: /rmstore |
Solution
Check the YARN configuration to see which state store is in use.
> grep -B 1 -A 2 /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop-yarn/etc/hadoop/
<property>
<name></name>
<value></value>
</property>
Since the cluster uses ZooKeeper as the state store, check in ZooKeeper how many jobs are stored there:
> echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/ | grep application_ | awk -F , '{print NF}'
100040
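Counting comma-separated fields depends on the exact layout of the `ls` output. A slightly more defensive sketch is to pull out every application ID by pattern and count those instead. The function name `extract_app_ids` is hypothetical, and the sketch assumes IDs have the form `application_<digits>_<digits>`, as in the log lines above. It reads the ZooKeeper client output on standard input.

```shell
#!/usr/bin/env bash
# extract_app_ids
# Hypothetical helper: reads text on stdin and prints each distinct
# application ID on its own line, sorted.
extract_app_ids() {
  # -o prints only the matching part, one match per output line
  grep -o 'application_[0-9]*_[0-9]*' | sort -u
}

# Example: count the IDs found in a listing such as
#   [application_1556100181928_29377, application_1556100181928_29378]
# by piping the listing into: extract_app_ids | wc -l
```

Having one ID per line also makes later steps such as filtering or generating delete commands simpler than working with the bracketed list directly.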
Generate the commands that delete the nodes under /rmstore/ZKRMStateRoot/RMAppRoot:
echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" |
/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/ |
grep application_ |
while read item; do echo ${item#*[}; done |
while read item; do echo ${item%*]}; done |
awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i}' >
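Note:
Jobs that are currently running also appear under the /rmstore/ZKRMStateRoot/RMAppRoot node in ZooKeeper, so take care not to delete them. When generating the commands, you can combine the listing with ***yarn application -list*** to filter out jobs that are still running.
Run the commands that delete the nodes under /rmstore/ZKRMStateRoot/RMAppRoot: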
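One way to do this filtering is sketched below. It assumes two input files that you prepare yourself, both hypothetical: one with every application ID found in ZooKeeper (one per line) and one with the IDs that are still active (one per line). The function name `gen_rmr_commands` is also hypothetical. It prints `rmr` commands only for IDs that are not in the active list.

```shell
#!/usr/bin/env bash
# gen_rmr_commands ALL_IDS_FILE ACTIVE_IDS_FILE
# Hypothetical helper: prints one "rmr" command for each application ID in
# ALL_IDS_FILE that does not appear in ACTIVE_IDS_FILE.
gen_rmr_commands() {
  local all_ids="$1" active_ids="$2"
  # -v invert match, -x match whole lines, -F fixed strings,
  # -f read the patterns (the active IDs) from a file
  grep -vxFf "$active_ids" "$all_ids" |
    awk '{ printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n", $0 }'
}

# The active list could be produced along these lines (by default,
# "yarn application -list" shows SUBMITTED, ACCEPTED and RUNNING apps):
#   yarn application -list | awk '$1 ~ /^application_/ { print $1 }' > active_ids.txt
#   gen_rmr_commands all_ids.txt active_ids.txt
```

Matching whole lines with fixed strings avoids accidentally treating one ID as a prefix of another. It is still worth reviewing the generated commands by hand before running them, since a job may start between taking the listing and performing the deletion.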
cat | /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/zookeeper/bin/
References
Default YARN Parameters
FileSystem Vs ZKStateStore for RM recovery
Yarn crash [max number of completed apps kept in memory met]