MongoDB Java: insert vs. update, and comparing changes

Date: 2022-01-05 12:38:04

I have a large collection of roughly 3.2 million records. The collection is updated monthly, but the source data is fetched as-is, meaning I don't get just the updated records but everything. In terms of performance, is it better to simply drop the collection and insert everything, or to do an update for each record? Also, is there a good way to compare an existing record with the one being read from the source to check whether anything has changed?

Thanks.

1 answer

#1


Also is there a good way to compare existing record with the one being read from the source to check if there's any change?

What you're looking for is a change detection system; this is a problem commonly described for ETL systems. I suggest reading up on ETL processes (Kimball's The Data Warehouse ETL Toolkit is a good source). In general, detecting changes is a hard problem and involves keeping snapshots in order to calculate differences. If you're sure your collection will always stay in MongoDB, you can see whether it's possible to work from MongoDB's own log. Furthermore, consider that change detection is tightly coupled to the structure and meaning of your data: for example, if you have an insert-only collection, you can identify the new data by _id. The problem is too complex for answers like "do this and that and you'll get it"; you have to analyze your data and work out which method fits best. Refer to the literature to find known solutions and avoid reinventing the wheel.
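
To make the snapshot idea concrete, here is a minimal sketch in plain Java, not a definitive implementation. It assumes each record can be flattened to a map of field names to values, that every record has a stable string id, and that the "snapshot" is simply a map from id to a fingerprint of the record taken during the previous load. The names SnapshotDiff, fingerprint and diff are hypothetical, and no MongoDB driver calls are shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SnapshotDiff {

    /** Outcome of comparing a fresh source extract with the previous snapshot. */
    static final class Diff {
        final Set<String> added = new TreeSet<>();
        final Set<String> changed = new TreeSet<>();
        final Set<String> removed = new TreeSet<>();
        int unchanged = 0;
    }

    /**
     * SHA-256 over the fields in sorted key order, so field order does not matter.
     * Assumption: String.valueOf(value) is a stable representation of each value.
     */
    static String fingerprint(Map<String, Object> record) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, Object> e : new TreeMap<String, Object>(record).entrySet()) {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between key and value
                md.update(String.valueOf(e.getValue()).getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between fields
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 not available", ex);
        }
    }

    /** previous: id -> fingerprint from the last load; source: id -> record from the new extract. */
    static Diff diff(Map<String, String> previous, Map<String, Map<String, Object>> source) {
        Diff d = new Diff();
        for (Map.Entry<String, Map<String, Object>> e : source.entrySet()) {
            String before = previous.get(e.getKey());
            String now = fingerprint(e.getValue());
            if (before == null) {
                d.added.add(e.getKey());
            } else if (!before.equals(now)) {
                d.changed.add(e.getKey());
            } else {
                d.unchanged++;
            }
        }
        for (String id : previous.keySet()) {
            if (!source.containsKey(id)) {
                d.removed.add(id);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new HashMap<>();
        previous.put("1", fingerprint(Map.of("name", "Ann", "city", "Oslo")));
        previous.put("2", fingerprint(Map.of("name", "Bob", "city", "Rome")));
        previous.put("3", fingerprint(Map.of("name", "Cy", "city", "Lima")));

        Map<String, Map<String, Object>> source = new HashMap<>();
        source.put("1", Map.of("name", "Ann", "city", "Oslo"));  // same as before
        source.put("2", Map.of("name", "Bob", "city", "Paris")); // city differs
        source.put("4", Map.of("name", "Di", "city", "Kyiv"));   // not seen before

        Diff d = diff(previous, source);
        System.out.println("added=" + d.added + " changed=" + d.changed
                + " removed=" + d.removed + " unchanged=" + d.unchanged);
    }
}
```

With millions of records, an in-memory map of fingerprints may be too large; one option, again only an assumption about your setup, is to store the fingerprint as a field on each document and compare against that instead.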

In terms of performance, is it better to simply remove the collection and insert everything or do an update for each record?

Once again, you have to know how your data is structured. If more of the collection changes than stays constant, you're better off reloading the entire collection and not tracking changes at all. If the changeset is considerably smaller than the whole collection, updating the existing documents leads to better performance.
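
As a rough illustration of that rule of thumb, the sketch below picks a strategy from two counts you would have to measure yourself (for example with a comparison pass over a sample of the data). The class and method names are hypothetical, and the cut-off is only the "more changes than constant parts" rule stated above; the in-between cases are best settled by timing both approaches on your own data.

```java
class LoadStrategy {

    enum Strategy { FULL_RELOAD, INCREMENTAL_UPDATE }

    /**
     * changed:   records that were added, modified or removed since the last load
     * unchanged: records that are identical to the last load
     */
    static Strategy choose(long changed, long unchanged) {
        if (changed < 0 || unchanged < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // More changes than constant parts: reload everything and skip change tracking.
        // Otherwise: update only the documents that differ.
        return changed > unchanged ? Strategy.FULL_RELOAD : Strategy.INCREMENTAL_UPDATE;
    }

    public static void main(String[] args) {
        System.out.println(choose(2_900_000L, 300_000L));
        System.out.println(choose(40_000L, 3_160_000L));
    }
}
```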

Hope this helps.