Start multiple processor threads on a Spark worker with one core

Date: 2022-07-31 20:51:47

Our situation: we are using Spark Streaming with AWS Kinesis.

If we specify the Spark master as "local[32]" (local mode), Spark can consume data from Kinesis fairly quickly.

But if we switch to a cluster with 1 master and 3 workers (on 4 separate machines) and set the master to "spark://[IP]:[port]", the Spark cluster consumes data at a very slow rate. The cluster has 3 worker machines, each with 1 core.
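For concreteness, the only difference between the two setups is the master URL; a minimal sketch (the app name, batch interval, and placeholder master URL are illustrative, not from the original post):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local mode: driver, receiver, and tasks share one JVM with 32 threads,
// so there are plenty of slots for both receiving and processing.
val localConf = new SparkConf().setAppName("kinesis-consumer").setMaster("local[32]")

// Standalone cluster: the same app, but now every receiver and every task
// must be scheduled onto the workers' cores (here: 3 workers x 1 core each).
val clusterConf = new SparkConf().setAppName("kinesis-consumer").setMaster("spark://[IP]:[port]")

val ssc = new StreamingContext(clusterConf, Seconds(1))
```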

I'm trying to speed up consumption, so I added more executors on each worker machine, but it does not help much, since each executor needs at least 1 core (and each of my worker machines has only 1 core). I have also read that adding more Kinesis shards helps scale up, but first I just want to maximize the read capacity I already have.

Since local ("in memory") mode can consume fast enough, is it possible to also start multiple "Kinesis record processor threads" on each worker machine, as shown in the picture below? Or to start many threads consuming from Kinesis within 1 core?
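For context, the pattern the linked guide uses to scale consumption is not more threads per core but more receivers: one KinesisUtils.createStream call per shard, unioned into a single DStream. A sketch against the Spark 1.2-era API (stream name, endpoint, and shard count are placeholders; `ssc` is the StreamingContext from the sketch above). Note that each receiver still pins one worker core, which is exactly the constraint with 1-core workers:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

// One receiver per shard; each receiver occupies one core on a worker.
val numShards = 3 // illustrative; match the shard count of your stream
val kinesisStreams = (1 to numShards).map { _ =>
  KinesisUtils.createStream(
    ssc,                                        // existing StreamingContext
    "myKinesisStream",                          // hypothetical stream name
    "https://kinesis.us-east-1.amazonaws.com",  // endpoint for your region
    Seconds(10),                                // Kinesis checkpoint interval
    InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2)
}
// Union into one DStream so downstream processing sees a single stream.
val unioned = ssc.union(kinesisStreams)
```

With 3 workers of 1 core each, all three cores would be taken by receivers, leaving none for processing, which foreshadows the answer below.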

Thank you very much.

(Picture from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html)

1 Answer

#1


It turns out to be related to the resources of the cluster.

For AWS Kinesis, one Kinesis stream requires one receiver from the Spark cluster, and each receiver occupies one core on a Spark worker.

I increased each worker to 4 cores, and then the executors had spare cores to run jobs.
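In standalone mode the worker-side knob for this is SPARK_WORKER_CORES in conf/spark-env.sh (or simply larger machines). On the application side, the resource arithmetic then looks roughly like this (a sketch; spark.cores.max is a standard standalone property, and the master URL is the same placeholder as in the question):

```scala
import org.apache.spark.SparkConf

// 3 workers x 4 cores = 12 cores total.
// 1 core is pinned by the Kinesis receiver; the remaining 11 run batch tasks.
val conf = new SparkConf()
  .setAppName("kinesis-consumer")
  .setMaster("spark://[IP]:[port]")
  .set("spark.cores.max", "12") // let the app use all 12 cores
```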
