How to ensure idempotence with Dataflow and Cloud Pub/Sub?

Time: 2022-11-30 15:34:58

I'm curious about the best way to ensure idempotence when using Cloud Dataflow and Pub/Sub.

We currently have a system which processes and stores records in a MySQL database. I'm curious about using DataFlow for some of our reporting, but wanted to understand what I would need to do to ensure that I didn't accidentally double count (or more than double count) the same messages.

My confusion comes in two parts, firstly ensuring I only send the messages once and secondly ensuring I process them only once.

My gut would be as follows:

Whenever an event I'm interested in is recorded in our MySQL database, transform it into a PubSub message and publish it to PubSub. Assuming success, record the PubSub id that's returned alongside the MySQL record. That way, if it has a PubSub id, I know I've sent it and I don't need to send it again. If the publish to PubSub fails, then I know I need to send it again. All good.
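The publish-then-record flow above can be sketched as follows. This is a conceptual sketch, not the real client API: the Pub/Sub publisher is replaced with an in-memory stand-in, and the names (`publish_unsent_events`, `FakePublisher`, the event-dict fields) are all illustrative assumptions.

```python
# Sketch of the publish-then-record flow: publish each MySQL event that has
# no recorded Pub/Sub id yet, then store the returned id alongside the row.

def publish_unsent_events(events, publisher, topic):
    """Publish each event that has no recorded Pub/Sub id yet.

    `events` is a list of dicts standing in for MySQL rows; a row with a
    non-empty 'pubsub_id' has already been sent and is skipped.
    """
    for event in events:
        if event.get("pubsub_id"):
            continue  # already published successfully; don't send again
        message_id = publisher.publish(topic, event["payload"])
        # Record the returned id alongside the row. If this write fails,
        # the event will be re-published on the next run -- which is why
        # the consumer side still needs de-duplication.
        event["pubsub_id"] = message_id


class FakePublisher:
    """In-memory stand-in for a Pub/Sub publisher client."""

    def __init__(self):
        self.published = []

    def publish(self, topic, payload):
        self.published.append((topic, payload))
        return str(len(self.published))  # fake server-assigned message id


events = [
    {"payload": b"order-1", "pubsub_id": None},
    {"payload": b"order-2", "pubsub_id": "7"},  # already sent earlier
]
publisher = FakePublisher()
publish_unsent_events(events, publisher, "projects/p/topics/t")
print(len(publisher.published))  # only the unsent event is published
```

Note that the "record the id" step is exactly the point of failure discussed next: if it fails after the publish succeeds, the same event gets published again on retry.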

But if the write to MySQL fails after the PubSub write succeeds, I might end up publishing the same message to pub sub again, so I need something on the DataFlow side to handle both this case and the case that PubSub sends a message twice (as per https://cloud.google.com/pubsub/subscriber#guarantees).

What's the best way to handle this? In AppEngine or other systems I would have a check against the datastore to see if the new record I'm creating exists, but I'm not sure how you'd do that with DataFlow. Is there a way I can easily implement a filter to stop a message being processed twice? Or does DataFlow handle this already?
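The "check before processing" filter described above can be sketched conceptually with an in-memory set in place of a datastore lookup. In a real streaming pipeline this state would have to live somewhere durable (and be scoped or expired), which is the hard part that the question is really asking about; the function names here are illustrative.

```python
# Conceptual de-duplication filter: remember ids we've already processed
# and drop redeliveries. A plain set stands in for the datastore check.

def make_dedup_filter():
    seen = set()

    def is_first_delivery(record_id):
        """Return True the first time an id is seen, False on redelivery."""
        if record_id in seen:
            return False
        seen.add(record_id)
        return True

    return is_first_delivery


first = make_dedup_filter()
deliveries = ["evt-1", "evt-2", "evt-1", "evt-3", "evt-2"]
processed = [d for d in deliveries if first(d)]
print(processed)  # duplicates are dropped
```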

1 solution

#1


Dataflow can de-duplicate messages based on an arbitrary message attribute (selected by idLabel) on the receiver side, as outlined in Using Record IDs. On the producer side, you'll want to make sure that you deterministically and uniquely populate that attribute based on the MySQL record. If this is done correctly, Dataflow will process each logical record exactly once.
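On the producer side, "deterministically and uniquely" means the attribute value must be derived from the record itself, so that re-publishing the same row yields the same id. A minimal sketch, assuming an attribute named `record_id` and a `table:primary_key` scheme (both illustrative choices, not prescribed by the docs):

```python
# Derive the de-duplication attribute deterministically from the MySQL row.
# Re-publishing the same row always produces the same attribute value, so
# the receiver can recognize and drop the second delivery as a duplicate.

def record_id_attribute(table, primary_key):
    """Build a stable, unique message attribute for one MySQL record."""
    return {"record_id": f"mysql:{table}:{primary_key}"}


a = record_id_attribute("orders", 42)
b = record_id_attribute("orders", 42)
print(a == b)  # same row -> same attribute, every time
```

On the Dataflow side, the same attribute name is what you would pass as the record id (idLabel in the classic SDK; the Beam Python SDK's `ReadFromPubSub` exposes it as the `id_label` parameter).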
