Spark Streaming batch interval when reprocessing a Kafka topic

Date: 2022-09-22 20:58:04

Current setup: a Spark Streaming job processes a Kafka topic of time-series data. New data from different sensors arrives roughly every second, and the batch interval is 1 second. By means of updateStateByKey(), stateful data is computed as a new stream. As soon as this stateful data crosses a threshold, an event is generated on a Kafka topic. When the value later drops below the threshold, another event is fired on that topic.

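For concreteness, a rough sketch of that kind of setup in Scala follows. The names (String sensor IDs as keys, Double readings, a threshold of 100.0, the checkpoint path, the placeholder Kafka source) are assumptions made for illustration, not details of the actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object ThresholdJobSketch {
  case class SensorState(lastValue: Double, aboveThreshold: Boolean)

  val threshold = 100.0  // illustrative value

  // Invoked once per key per batch: `newValues` holds every reading that
  // arrived for this sensor during the 1-second batch.
  def updateSensor(newValues: Seq[Double],
                   state: Option[SensorState]): Option[SensorState] = {
    val latest = newValues.lastOption.orElse(state.map(_.lastValue)).getOrElse(0.0)
    Some(SensorState(latest, latest > threshold))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("threshold-sketch")
    val ssc  = new StreamingContext(conf, Seconds(1))  // 1-second batch interval
    ssc.checkpoint("/tmp/threshold-sketch")            // required by updateStateByKey

    // Placeholder for the Kafka source: a DStream of (sensorId, reading)
    // pairs, e.g. built with KafkaUtils.createDirectStream plus a parser.
    val readings: DStream[(String, Double)] = ???

    val states = readings.updateStateByKey(updateSensor _)

    // Whenever a sensor's state is above the threshold, emit something;
    // the real job would detect the crossing direction and write to Kafka.
    states.filter { case (_, s) => s.aboveThreshold }
          .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```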

So far, so good.


Problem: when applying a new algorithm to the data by re-consuming the Kafka topic, I would like this to go fast. But this means that every batch contains (hundreds of) thousands of messages. Feeding all of these to updateStateByKey() in one batch results in just one computed value for that key on the resulting stream.

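A toy Scala illustration of that collapse (plain Scala, no cluster needed), using the same made-up names as the sketch above: updateStateByKey hands the update function all of a key's values for the batch at once and keeps only the single state it returns, so crossings inside a large replayed batch are folded away.

```scala
object CollapseDemo extends App {
  val threshold = 100.0

  // During a fast replay one batch can hold thousands of readings for a
  // sensor; a handful stands in for them here.
  val replayedBatch = Seq(10.0, 150.0, 20.0, 180.0, 30.0)

  // Deliberately simplified update function: the new state is just the last
  // reading seen. updateStateByKey calls it once per key per batch.
  def update(newValues: Seq[Double], state: Option[Double]): Option[Double] =
    newValues.lastOption.orElse(state)

  println(update(replayedBatch, None))
  // prints Some(30.0): both excursions above 100.0 are folded away, so no
  // "above threshold" or "below threshold" event would be generated
}
```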

Of course that's unacceptable, as loads of data points are reduced to a single one. Alarm events that would be generated on the real-time stream will not appear on the recomputed stream, so comparing algorithms this way is totally useless.


Question: how can I avoid this, preferably without switching frameworks? It seems to me I'm looking for a true streaming (one event at a time) framework. On the other hand, Spark Streaming is new to me, so I'm definitely missing a lot there.


1 solution

#1



In Spark 1.6, a new API for interacting with state, mapWithState, was introduced. I believe it will solve your problem.


Have a look at it here.

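A rough sketch of how mapWithState could slot in for the updateStateByKey step, reusing the illustrative names from the question sketch (String sensor IDs, Double readings, a fixed threshold); it is not the asker's actual pipeline. The relevant difference is that the mapping function is invoked once per record rather than once per key per batch, so each reading in a replayed batch can still emit its own crossing event:

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object MapWithStateSketch {
  val threshold = 100.0  // illustrative value

  // State kept per sensor: was the previous reading above the threshold?
  // Unlike updateStateByKey's update function, this runs once for every
  // (sensorId, reading) record, even when a replayed batch holds thousands
  // of readings for the same sensor.
  def trackCrossing(sensorId: String,
                    reading: Option[Double],
                    state: State[Boolean]): Option[(String, String)] = {
    val wasAbove = state.getOption().getOrElse(false)
    val isAbove  = reading.exists(_ > threshold)
    state.update(isAbove)
    if (isAbove && !wasAbove)      Some(sensorId -> "ABOVE_THRESHOLD")
    else if (!isAbove && wasAbove) Some(sensorId -> "BELOW_THRESHOLD")
    else                           None
  }

  // readings: the (sensorId, reading) DStream built from the Kafka topic.
  def crossingEvents(readings: DStream[(String, Double)]): DStream[(String, String)] =
    readings
      .mapWithState(StateSpec.function(trackCrossing _))
      .flatMap(event => event)  // drop the Nones, keep one event per crossing
}
```

Like updateStateByKey, mapWithState still needs a checkpoint directory set on the StreamingContext.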
