ClickHouse production cluster deployment: the bumps and potholes

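I had previously browsed the ClickHouse Chinese community and seen many baffling Exceptions that nobody had answered. Extracting small amounts of data on my own test cluster, I never hit any of them.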

Sure enough, when deploying the production cluster I fell into every one of those pits.

Newly added configuration, found online. These are some of the differences from my original configuration:

<receive_timeout>800</receive_timeout>
<send_timeout>800</send_timeout>
<keep_alive_timeout>300</keep_alive_timeout>
<default_session_timeout>300</default_session_timeout>
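The timeouts above are given in seconds, while the client-side socketTimeout=600000 mentioned further down looks like a value in milliseconds. As a minimal sketch of keeping the two in step, the helper below builds a JDBC URL from a timeout in seconds. The function name `jdbc_url`, the host `ck-host`, and the assumption that the driver reads socketTimeout in milliseconds are mine, not from the original setup, so check them against the driver you actually use.

```python
from urllib.parse import urlencode


def jdbc_url(host: str, port: int, database: str, socket_timeout_s: int) -> str:
    """Build a ClickHouse JDBC URL with socketTimeout set.

    Assumption: the JDBC driver expects socketTimeout in milliseconds,
    so the value in seconds is multiplied by 1000.
    """
    params = {"socketTimeout": socket_timeout_s * 1000}
    return f"jdbc:clickhouse://{host}:{port}/{database}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host name; 600 s corresponds to socketTimeout=600000.
    print(jdbc_url("ck-host", 8123, "default", 600))
```

Keeping the client timeout no larger than the server-side receive_timeout and send_timeout is a reasonable starting point; the exact numbers still need to be tested per workload.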

Configuration that I also had not added before:
<merge_tree>
   <parts_to_delay_insert>300</parts_to_delay_insert>
   <parts_to_throw_insert>600</parts_to_throw_insert>
   <max_delay_to_insert>2</max_delay_to_insert>
</merge_tree>
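To make the interaction of these thresholds concrete, here is a simplified model that classifies an insert by the number of active parts in the target partition. It is only a sketch: `classify_insert` is a hypothetical name, and the real server computes a variable delay bounded by max_delay_to_insert seconds rather than returning a label.

```python
# Simplified model of MergeTree insert throttling; not ClickHouse's real code.
PARTS_TO_DELAY_INSERT = 300  # value from the config above
PARTS_TO_THROW_INSERT = 600  # value from the config above
MAX_DELAY_TO_INSERT = 2      # seconds; upper bound on the artificial delay


def classify_insert(active_parts: int) -> str:
    """Return what happens to an INSERT for a given active part count."""
    if active_parts >= PARTS_TO_THROW_INSERT:
        # The server rejects the insert with a "Too many parts" exception.
        return "throw"
    if active_parts >= PARTS_TO_DELAY_INSERT:
        # The server slows the insert down, by at most MAX_DELAY_TO_INSERT seconds.
        return "delay"
    return "ok"


if __name__ == "__main__":
    for parts in (50, 300, 606):
        print(parts, classify_insert(parts))
```

With a throw threshold of 600, a partition holding 606 active parts falls into the "throw" branch, which matches the part count reported in the first error listed below.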

Roughly, the error messages were:

1. Too many parts (606). Merges are processing significantly slower than inserts

2. ERROR : Error while invoking RpcHandler#receive() for one-way message.

3. time out

4. Lost executor 174 on hadoop1: Container marked as failed: container_xx on host: hadoop1.
Exit status: 143. Diagnostics: Container killed on request.

5. ::Exception: Possible deadlock avoided. Client should retry. (version 19.15.3.6 (official build)) (from [::1]:38736) (in query: SELECT * FROM dwd_ms_complex_detail_di_cluster LIMIT 1), Stack trace:

Problem description:

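1. Merging cannot keep up with the insert rate. Possible cause: the data may span multiple partitions. If so, every write touches several partitions, which puts heavy pressure on merges.

2. Handle this together with 1: reduce concurrency appropriately and adjust numpartition at the same time.

3. Timeout problem: increase the timeouts appropriately and add socketTimeout=600000. Test the exact value yourself.

4. Resource problem: a single executor runs out of memory and cannot keep up. Reduce the batch size appropriately (this is the waterdrop batch setting, default 20000; because ClickHouse favors large batches written infrequently, and given the YARN resources, I had set it to 500000).

5. Possibly caused by truncate table not deleting the data completely. Fix: go to the storage location and delete the data by hand.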
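For the multi-partition cause in item 1, one common mitigation is to group rows by partition before writing, so that each batch lands in a single partition. The sketch below shows the idea in plain Python. It assumes a table partitioned by month on a date column; the row layout and the names `partition_key` and `batches_by_partition` are hypothetical and not taken from the original job.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical row layout: (date string "YYYY-MM-DD", value).
Row = Tuple[str, int]


def partition_key(row: Row) -> str:
    # Assumption: the table is partitioned by month, so "2019-10-01" -> "201910".
    return row[0][:7].replace("-", "")


def batches_by_partition(
    rows: Iterable[Row], batch_size: int
) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (partition, batch) pairs where each batch touches one partition."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        grouped[partition_key(row)].append(row)
    for key in sorted(grouped):
        part_rows = grouped[key]
        # Split each partition's rows into chunks of at most batch_size.
        for start in range(0, len(part_rows), batch_size):
            yield key, part_rows[start:start + batch_size]


if __name__ == "__main__":
    sample = [("2019-10-01", 1), ("2019-11-02", 2), ("2019-10-03", 3)]
    for key, batch in batches_by_partition(sample, batch_size=2):
        print(key, batch)
```

In a Spark-based pipeline the same effect would usually come from repartitioning on the partition column before the write, rather than from grouping in driver memory as done here.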
