Which NoSql should I use to store billions of integer-pair records?

时间:2023-01-23 08:47:47

Right now I have a table in MySQL with 3 columns.


DocId             Int
Match_DocId       Int
Percentage Match  Int

I am storing a document id along with the id of a near-duplicate document and a percentage that indicates how closely the two documents match.


So if one document has 100 near duplicates, we have 100 rows for that particular document.


Right now, this table has more than 1 billion rows for a total of 14 million documents. I expect the total document count to grow to 30 million, which means the table storing near-duplicate information will have more than 5 billion rows, possibly more. (Near-duplicate data grows exponentially compared to the total document set.)


Here are a few issues that I have:


  1. Getting all these records into the MySQL table takes a lot of time.
  2. Queries take a lot of time as well.

Here are a few queries that I run:


  • Check whether a particular document has any near duplicates. (This is relatively fast, but still slow.)


  • For a given set of documents, check how many near duplicates fall in each percentage range (the ranges are 86-90, 91-95, and 96-100).


    This query takes a lot of time and fails most of the time. I am doing a GROUP BY on the percentage column.

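For concreteness, the two queries can be sketched like this, using SQLite (from the standard library) as a stand-in for MySQL; the table name, snake_case column names, and sample rows are illustrative, not the actual schema.

```python
# Minimal sketch of the two queries described above, with SQLite standing
# in for MySQL. Table/column names and sample data are made up for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE near_duplicates (
        doc_id       INTEGER,
        match_doc_id INTEGER,
        percentage   INTEGER
    )
""")
# A composite index makes the per-document lookup a cheap range scan.
conn.execute("CREATE INDEX idx_doc ON near_duplicates (doc_id, percentage)")

rows = [(1, 2, 97), (1, 3, 88), (1, 4, 92), (2, 5, 99)]
conn.executemany("INSERT INTO near_duplicates VALUES (?, ?, ?)", rows)

# Query 1: does a particular document have any near duplicate?
has_dup = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM near_duplicates WHERE doc_id = ?)", (1,)
).fetchone()[0]
print(bool(has_dup))  # True

# Query 2: for a set of documents, count near duplicates per percentage range.
buckets = conn.execute("""
    SELECT CASE
             WHEN percentage BETWEEN 86 AND 90  THEN '86-90'
             WHEN percentage BETWEEN 91 AND 95  THEN '91-95'
             WHEN percentage BETWEEN 96 AND 100 THEN '96-100'
           END AS bucket,
           COUNT(*)
    FROM near_duplicates
    WHERE doc_id IN (1, 2)
    GROUP BY bucket
""").fetchall()
print(sorted(buckets))  # [('86-90', 1), ('91-95', 1), ('96-100', 2)]
```

At billions of rows the GROUP BY in query 2 is exactly the part that struggles in a single MySQL instance, which is what motivates the sharding and MapReduce suggestions in the answers below.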

Can this be managed with any available NoSql solution?


I am skeptical of the SQL-style query support in NoSql solutions, since I need GROUP BY support when querying the data.


2 solutions

#1


2  

MySQL

You can try sharding with your current MySql solution, i.e. splitting your large database into smaller, distinct databases. The catch is that this is only fast as long as you work with one shard at a time; if you plan to run queries across several shards, it will be painfully slow.

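The trade-off can be illustrated with a small pure-Python model of hash-based sharding; the shard count and the in-memory "shards" are assumptions for the demo, not a MySQL configuration.

```python
# Illustrative sketch of hash-based sharding: each row is routed to one of
# NUM_SHARDS databases by doc_id, so all rows for a document live together.
# The shard count and in-memory lists are assumptions for this demo.
NUM_SHARDS = 4

def shard_for(doc_id: int) -> int:
    """Pick the shard that owns all rows for this document."""
    return doc_id % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}

rows = [(1, 2, 97), (1, 3, 88), (5, 6, 92), (8, 9, 99)]
for doc_id, match_doc_id, pct in rows:
    shards[shard_for(doc_id)].append((doc_id, match_doc_id, pct))

# Fast path: all near duplicates of doc 1 live on a single shard.
dups_of_1 = [r for r in shards[shard_for(1)] if r[0] == 1]
print(dups_of_1)  # [(1, 2, 97), (1, 3, 88)]

# Slow path: a query that isn't keyed by doc_id must fan out to every
# shard and merge the results -- this is the "painfully slow" case.
all_high = [r for s in shards.values() for r in s if r[2] >= 96]
print(sorted(all_high))  # [(1, 2, 97), (8, 9, 99)]
```

Routing by `doc_id` suits the per-document lookup well, but the percentage-range aggregation over an arbitrary document set still has to touch every shard.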

NoSql

The Apache Hadoop stack is worth looking at. It includes several systems that let you run slightly different kinds of queries, and a good point is that they all tend to interoperate well with each other.


Check whether a particular document has any near duplicates. (This is relatively fast, but still slow.)


HBase can do this job for a big table.

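HBase lookups are driven by the row key, so this works well if the key starts with the document id: all near duplicates of one document then form a single contiguous prefix scan. The key layout below is an assumption for illustration, modeled here in plain Python rather than against a live HBase cluster.

```python
# Pure-Python model of an assumed HBase row-key design:
# key = "<doc_id>:<match_doc_id>", zero-padded so keys sort numerically.
# All near duplicates of a document then form one contiguous key range.
from bisect import bisect_left, bisect_right

def row_key(doc_id: int, match_doc_id: int) -> str:
    return f"{doc_id:010d}:{match_doc_id:010d}"

# Sorted key space, the way an HBase table keeps its rows.
table = sorted([
    (row_key(1, 2), 97),
    (row_key(1, 3), 88),
    (row_key(2, 5), 99),
])

def prefix_scan(doc_id: int):
    """Return all rows whose key starts with this document's prefix."""
    keys = [k for k, _ in table]
    lo = bisect_left(keys, f"{doc_id:010d}:")
    hi = bisect_right(keys, f"{doc_id:010d};")  # ';' sorts just after ':'
    return table[lo:hi]

# "Does document 1 have any near duplicate?" is one cheap prefix scan.
print(bool(prefix_scan(1)))  # True
print(bool(prefix_scan(3)))  # False
```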

For a given set of documents, check how many near duplicates fall in each percentage range. (The ranges are 86-90, 91-95, and 96-100.)


This should be a good fit for MapReduce.

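The shape of that job can be sketched in plain Python: the map step emits a (bucket, 1) pair per row, the reduce step sums counts per bucket. The sample rows are made up, and real Hadoop would distribute these phases across workers.

```python
# Toy MapReduce pass for the percentage-range counts, in plain Python.
# map: each row emits (bucket, 1); reduce: sum the counts per bucket.
# Assumes every stored percentage is >= 86, as in the question's ranges.
from collections import Counter
from functools import reduce

rows = [(1, 2, 97), (1, 3, 88), (1, 4, 92), (2, 5, 99)]

def map_row(row):
    _, _, pct = row
    if 86 <= pct <= 90:
        return ("86-90", 1)
    if 91 <= pct <= 95:
        return ("91-95", 1)
    return ("96-100", 1)

def reduce_counts(acc, kv):
    key, count = kv
    acc[key] += count
    return acc

buckets = reduce(reduce_counts, map(map_row, rows), Counter())
print(dict(buckets))
```

Because each row contributes to exactly one bucket, the job parallelizes cleanly: mappers never need rows from other machines, and reducers only merge small per-bucket counters.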


There are many other solutions; see this link for a list and brief descriptions of other NoSql databases.


#2


1  

We have had good experiences with Redis. It's fast and can be made as reliable as you want it to be. Other options could be CouchDB or Cassandra.

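For this particular table, one plausible Redis data model (an assumption, not part of the answer) is a sorted set per document: member = match doc id, score = match percentage. The sketch below emulates that with a plain dict so it runs without a Redis server; with redis-py the equivalent commands would be ZADD and ZCOUNT.

```python
# Sketch of an assumed Redis-style data model: one sorted set per document,
# member = match_doc_id, score = percentage. Emulated with a dict so the
# idea runs without a Redis server; the real calls would be ZADD / ZCOUNT.
sorted_sets = {}

def zadd(key, score, member):
    """Add a member with a score, like Redis ZADD."""
    sorted_sets.setdefault(key, {})[member] = score

def zcount(key, lo, hi):
    """Count members with lo <= score <= hi, like Redis ZCOUNT."""
    return sum(1 for s in sorted_sets.get(key, {}).values() if lo <= s <= hi)

zadd("near:1", 97, 2)
zadd("near:1", 88, 3)
zadd("near:1", 92, 4)

# "Any near duplicates for doc 1?" and per-range counts per document
# become cheap per-key operations instead of scans over a giant table.
print(bool(sorted_sets.get("near:1")))  # True
print(zcount("near:1", 96, 100))        # 1
```

The limitation mirrors the sharding case: per-document questions are fast, but a GROUP BY over an arbitrary set of documents still means iterating many keys client-side.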
