Simple question about streaming data directly into Cloud SQL using Google Dataflow

Time: 2022-01-16 14:03:39

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running, streaming into BigQuery, but I want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and Google, and it seems the best route would be to use JdbcIO. I am a bit confused here, because everything I find on how to do this refers to writing to Cloud SQL in batches, not full-on streaming.


My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?


Cheers!


1 Answer

#1



You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.

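For reference, here is a minimal sketch (not from the original answer) of what that can look like with the Beam Java SDK: an unbounded pipeline reading from Pub/Sub and writing each element into a Cloud SQL (MySQL) table through JdbcIO. The topic, table, and connection details below are placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamToCloudSql {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromPubSub",
            // Unbounded source, so the pipeline runs as a streaming job on Dataflow.
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
     .apply("WriteToCloudSQL",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.mysql.cj.jdbc.Driver",
                            // Placeholder JDBC URL for a Cloud SQL MySQL instance.
                            "jdbc:mysql://CLOUD_SQL_IP:3306/mydb")
                        .withUsername("user")
                        .withPassword("password"))
                .withStatement("INSERT INTO events (payload) VALUES (?)")
                // One parameterized INSERT per element; JdbcIO groups these into
                // batched database calls internally for efficiency.
                .withPreparedStatementSetter(
                    (element, statement) -> statement.setString(1, element)));

    p.run();
  }
}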

In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.


In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.

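If you do need to control that cap, JdbcIO.Write exposes a withBatchSize option. A hedged illustration, where dataSourceConfiguration stands in for the configuration object built in the sketch above and 500 is just an arbitrary example value:

// Same write transform as above, but capping the number of records
// written per database call (the default cap is 1000).
JdbcIO.<String>write()
    .withDataSourceConfiguration(dataSourceConfiguration)  // as built above
    .withStatement("INSERT INTO events (payload) VALUES (?)")
    .withPreparedStatementSetter((element, statement) -> statement.setString(1, element))
    .withBatchSize(500);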
