Strategy for keeping local caches on the same "version" of data in a distributed system

I'm trying to build a distributed system to run some performance-intensive calculations. One calculation can be run in parallel on multiple worker nodes. The problem is that, as the data source keeps changing in real time, we want each worker node (during a single calculation) to operate on the same "version" of data, i.e. a point-in-time snapshot of the database. This is to avoid inconsistent results.

Another issue is, the entire set of input data per calculation can be very large, so currently we keep a local cache at each worker node, which refreshes the content periodically by asking the data source for "diffs" since the current local cache version and applies the diffs to the local cache.

What are some design strategies to achieve the requirement that each worker node sees the same "version" of data (while still having reasonably fresh data)? I have thought about the solution below, but wanted to see if this is a common pattern that has already been solved:

Build a "versioning" service that periodically queries the data source for diffs and stores each diff as a data "version". The worker nodes' caches sync with the versioning service and keep their cached data at multiple versions. For one calculation, we make sure that the worker nodes use input data at the same version to achieve consistency. This versioning service should also keep the latest copy of the entire data set, so that a worker node can load its cache initially and restore its local cache content if it goes down and comes back up.
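
A minimal sketch of the worker-side cache described above, under some loud assumptions: the names `VersionedCache`, `apply_diff` and `pick_job_version` are hypothetical, a diff is modelled as a set of upserts plus a set of deleted keys, and each snapshot is a full in-memory copy. At ~100GB of input a real cache would need to share unchanged data between versions instead of copying it.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedCache:
    """Worker-local cache that keeps several snapshots, keyed by version number."""
    max_versions: int = 3
    snapshots: dict[int, dict[str, object]] = field(default_factory=dict)

    def load_full(self, version: int, data: dict[str, object]) -> None:
        # Initial load (or recovery after a restart) from the full copy
        # held by the versioning service.
        self.snapshots = {version: dict(data)}

    def apply_diff(self, new_version: int, upserts: dict[str, object],
                   deletes: set[str]) -> None:
        # Build the new snapshot from the latest one; older snapshots stay
        # untouched so running calculations keep a stable view.
        latest = self.snapshots[max(self.snapshots)]
        snapshot = {k: v for k, v in latest.items() if k not in deletes}
        snapshot.update(upserts)
        self.snapshots[new_version] = snapshot
        # Evict the oldest snapshots beyond the retention limit.
        for old in sorted(self.snapshots)[:-self.max_versions]:
            del self.snapshots[old]

    def read(self, version: int) -> dict[str, object]:
        return self.snapshots[version]


def pick_job_version(caches: list[VersionedCache]) -> int:
    # The newest version that every worker already holds
    # (raises ValueError if the workers share no version at all).
    common = set.intersection(*(set(c.snapshots) for c in caches))
    return max(common)
```

A coordinator would call `pick_job_version` once per calculation and pass the result to every worker, so a worker that has already synced further ahead still reads the agreed version.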

Some estimated parameters of the system:

Number of workers: 10
Average job duration: obviously we want this to be as fast as possible, but let's say it should be less than 2 minutes
Input data for a job (overall for all workers): ~100GB
Size of the database: ~1TB

2 Answers

#1

If you are not tied to MySQL and could use Oracle, there is a simple solution for you:

Oracle Flashback

(I have not found a MySQL equivalent of Flashback yet; please comment if you know of an engine that offers one.) You do not have to create a manual snapshot etc. You can use this with a single database server, and all of your processes can read the data as it was at the required time. This solution is pretty clean and robust but requires licences.

If I were you, I would take a step back and try to simplify the problem a bit more. If the different workers can run in parallel, the following should hold:

None of the workers uses the output of the others
None of them alters the original data

If both of those requirements hold, you could use a single database to store the calculations etc. The only thing you have to care about is that the transactions should be carefully planned.

On the other hand, on a similar project we used a small trick to achieve the same thing as the Flashback solution: the insertion time was stored in the database too. (And updates were actually inserts with new timestamps.) All of the calculations were made on consistent records by adding this condition to the query:

give me the last version of this kind of row before x timestamp
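
The condition above could be expressed roughly as follows. This is only a sketch using Python's built-in `sqlite3` module rather than MySQL; the table `prices` and the columns `item_id`, `value` and `inserted_at` are made up for illustration, timestamps are plain integers, "before" is read as "at or before", and each `(item_id, inserted_at)` pair is assumed to be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (item_id INTEGER, value REAL, inserted_at INTEGER)"
)
# Updates are stored as new rows with a later timestamp.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, 10.0, 100), (1, 11.0, 200), (2, 5.0, 150), (1, 12.0, 300)],
)


def snapshot_at(conn, ts):
    # For each item, keep only the row with the greatest inserted_at <= ts.
    query = """
        SELECT p.item_id, p.value
        FROM prices AS p
        JOIN (
            SELECT item_id, MAX(inserted_at) AS inserted_at
            FROM prices
            WHERE inserted_at <= ?
            GROUP BY item_id
        ) AS latest
          ON latest.item_id = p.item_id
         AND latest.inserted_at = p.inserted_at
        ORDER BY p.item_id
    """
    return conn.execute(query, (ts,)).fetchall()


print(snapshot_at(conn, 250))  # [(1, 11.0), (2, 5.0)]
```

If every worker in one calculation is handed the same `ts`, they all see the same rows no matter how many inserts arrive while the job runs.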

With this solution we avoided the licence costs and the snapshot maintenance. The only problem is that, if you do not need the whole history, it will eat your database space pretty fast. To solve this we set up a cron job that clears out the unused records based on their timestamp.
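
One way such a cleanup job might look, again as a sketch with `sqlite3` and a hypothetical `prices(item_id, value, inserted_at)` table; `oldest_needed_ts` is an assumed parameter meaning the earliest timestamp any running calculation may still query. A row is removed only when a newer version of the same item is already visible at that timestamp, so the latest row at or before the cutoff always survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices (item_id INTEGER, value REAL, inserted_at INTEGER)"
)
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, 10.0, 100), (1, 11.0, 200), (2, 5.0, 150), (1, 12.0, 300)],
)


def purge_superseded(conn, oldest_needed_ts):
    # Delete a row only if a newer version of the same item exists that is
    # itself visible at oldest_needed_ts.
    cur = conn.execute(
        """
        DELETE FROM prices
        WHERE EXISTS (
            SELECT 1 FROM prices AS newer
            WHERE newer.item_id = prices.item_id
              AND newer.inserted_at > prices.inserted_at
              AND newer.inserted_at <= ?
        )
        """,
        (oldest_needed_ts,),
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


removed = purge_superseded(conn, 250)
```

With the sample rows, only the version of item 1 inserted at 100 is removed, because the version inserted at 200 supersedes it for every query at timestamp 250 or later.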

If you want to go further, there is a technique called shadow tables. There is a nice blog post on doing this in MySQL: http://arnab.org/blog/shadow-tables-using-mysql-triggers

#2

I think that you are overcomplicating this. For your task, you only need to store and distinguish between the current and the latest version of the data. So, your script should:

1. mark the latest data as the currently used dataset
2. delete all older data
3. tell the workers to use the marked dataset
4. all this time, keep adding new data to the tables (insert, never update)
5. go to step 1
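
The loop above might be sketched like this. Everything here is an assumption made for illustration: the class `DatasetPointer`, the idea of tagging rows with a batch id, and the reading that each batch is a complete dataset rather than a diff. A real system would presumably also need to wait for running calculations to finish before deleting the batch they read.

```python
class DatasetPointer:
    """In-memory stand-in for an append-only table plus a 'current' marker."""

    def __init__(self):
        self.rows = []             # append-only list of (batch_id, payload)
        self.current_batch = None  # batch the workers are told to use
        self.next_batch = 1        # batch that new data is written into

    def add(self, payload):
        # Step 4: new data is only ever added, tagged with the open batch.
        self.rows.append((self.next_batch, payload))

    def rotate(self):
        # Step 1: mark the latest data as the dataset in use.
        self.current_batch = self.next_batch
        self.next_batch += 1
        # Step 2: delete all older data.
        self.rows = [r for r in self.rows if r[0] >= self.current_batch]
        # Step 3: the returned id is what the workers are told to use.
        return self.current_batch

    def read_current(self):
        return [p for b, p in self.rows if b == self.current_batch]
```

Workers that all receive the id returned by `rotate` read the same rows, while inserts that arrive in the meantime land in the next batch and stay invisible until the following rotation.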
