How to design data storage for a partitioned tagging system?

How to design data storage for a huge tagging system (like digg or delicious)?

There is already a discussion about this, but it covers a centralized database. Since the data is expected to grow, we'll need to partition it into multiple shards sooner or later. So the question becomes: how do we design data storage for a partitioned tagging system?

The tagging system basically has 3 tables:

Item (item_id, item_content)

Tag (tag_id, tag_title)

TagMapping(map_id, tag_id, item_id)

That works fine for finding all items for a given tag and all tags for a given item, as long as the tables are stored in one database instance. Once we need to partition the data across multiple database instances, it is not that easy.

For the Item table, we can partition by its key, item_id. For the Tag table, we can partition by its key, tag_id. For example, to partition the Tag table across K databases, we can simply store a given tag in database number (tag_id % K).
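
To make the modulo scheme concrete, here is a minimal sketch. The dicts stand in for real database instances, and K = 4 and the helper names are assumptions made up for illustration:

```python
K = 4  # assumed number of Tag databases

def shard_for(key_id, shard_count=K):
    """Return the index of the database that stores the row with this key."""
    return key_id % shard_count

# Each dict stands in for one database instance holding part of the Tag table.
tag_shards = [dict() for _ in range(K)]

def put_tag(tag_id, tag_title):
    # The write goes only to the shard selected by the key.
    tag_shards[shard_for(tag_id)][tag_id] = tag_title

def get_tag(tag_id):
    # A lookup by key touches exactly one shard.
    return tag_shards[shard_for(tag_id)].get(tag_id)
```

The same function would work for Item with item_id as the key. Note that changing K later moves most keys to a different shard, so K is hard to change once data is stored.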

But how do we partition the TagMapping table?

The TagMapping table represents the many-to-many relationship. The only approach I can imagine is duplication. That is, the same TagMapping content is stored in two copies: one partitioned by tag_id and the other partitioned by item_id. To find the tags for a given item, we use the copy partitioned by item_id. To find the items for a given tag, we use the copy partitioned by tag_id.
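
A rough sketch of the duplicated layout, again with dicts standing in for database instances (K and all function names are assumptions for illustration). The copy keyed by item_id answers "tags for this item", and the copy keyed by tag_id answers "items for this tag":

```python
K = 4  # assumed number of shards per copy

# Copy 1: partitioned by item_id, maps item_id -> set of tag_ids.
by_item = [dict() for _ in range(K)]
# Copy 2: partitioned by tag_id, maps tag_id -> set of item_ids.
by_tag = [dict() for _ in range(K)]

def add_mapping(tag_id, item_id):
    # Two separate writes, possibly to two different databases.
    # Nothing here makes them atomic; that is the consistency burden.
    by_item[item_id % K].setdefault(item_id, set()).add(tag_id)
    by_tag[tag_id % K].setdefault(tag_id, set()).add(item_id)

def tags_for_item(item_id):
    return by_item[item_id % K].get(item_id, set())

def items_for_tag(tag_id):
    return by_tag[tag_id % K].get(tag_id, set())
```

Each lookup touches a single shard, at the cost of writing every mapping twice.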

As a result, there is data redundancy, and the application level has to keep all the tables consistent. That looks hard.

Is there a better solution to this many-to-many partitioning problem?

3 answers

#1

I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!

Finding Tags for a Given Item

A1. You're going to display all of the tags for a given item at once

A2. You're going to ensure that all of an item's tags are unique

Finding Items for a Given Tag

B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)

B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags

B3. You're going to sort the items for a given tag (or tags) by some measure of popularity

Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
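
As a sketch of what partitioning TagMapping by item buys you (dicts stand in for partitions; K and the function names are assumptions for illustration), A1, A2 and B2 each need only the one partition that owns the item:

```python
K = 4  # assumed number of partitions
partitions = [dict() for _ in range(K)]  # each maps item_id -> set of tag_ids

def add_tag(item_id, tag_id):
    """Attach a tag to an item; return False if it was already there (A2)."""
    tags = partitions[item_id % K].setdefault(item_id, set())
    if tag_id in tags:
        return False  # uniqueness is checked inside a single partition
    tags.add(tag_id)
    return True

def tags_for_item(item_id):
    """All tags of an item come from one partition (A1)."""
    return partitions[item_id % K].get(item_id, set())

def item_matches(item_id, wanted_tags):
    """One partition can tell whether an item carries all wanted tags (B2)."""
    return set(wanted_tags) <= tags_for_item(item_id)
```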

Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records as you need to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
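
The partition-at-a-time lookup might look roughly like this. Each partition is modelled as its local tag_id index (a dict from tag_id to the item ids stored on that partition); that representation and the name fill_page are assumptions for the sketch:

```python
def fill_page(partitions, tag_id, page_size):
    """Visit partitions in the given order until the page is full.

    partitions: list of dicts, each a per-partition index tag_id -> [item_id].
    Returns the collected item ids and how many partitions were queried.
    """
    page = []
    visited = 0
    for index in partitions:
        visited += 1
        for item_id in index.get(tag_id, []):
            page.append(item_id)
            if len(page) == page_size:
                return page, visited  # stop early: later partitions untouched
    return page, visited
```

For a frequently used tag the loop stops after the first partition or two; for a rare tag it ends up visiting all of them.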

The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google PageRank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting, and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyway.
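
One way to encode the score into the item id, as a sketch only: put a small integer score in the high bits and the sequential id in the low bits. The 32-bit split, the integer score, and the one-partition-per-score rule are all assumptions for illustration:

```python
SEQ_BITS = 32  # assumed width reserved for the sequential part of the id

def make_item_id(score, seq):
    """Combine a popularity score (high bits) with a sequential id (low bits)."""
    return (score << SEQ_BITS) | seq

def score_of(item_id):
    return item_id >> SEQ_BITS

def partition_for(item_id):
    # Simplest scheme that keeps one score per partition:
    # the partition is named after the score itself.
    return score_of(item_id)

def partitions_in_query_order(partition_scores):
    # Query the most popular partitions first (B3).
    return sorted(partition_scores, reverse=True)
```

Because the score is baked into the key, changing an item's score means giving it a new id and moving it to another partition, which is why the score should be effectively constant.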

#2

The rule is that you partition by the field you are going to query by. Otherwise you'll have to look through all partitions. Are you sure you'll only need to query the Tag table by tag_id? I believe not; you'll also need to query by tag title. It's not so obvious for the Item table, but you would probably also like to query by something like URL, to find the item_id when another user assigns tags to the same item.

But note that the Tag and Item tables have an immutable title and URL, respectively. That means you can use the following technique:

Choose the partition from the title (for Tag) or the URL (for Item).
Use that partition's own sequence to generate the id.

You either use the partition-localID pair as the global identifier or use non-overlapping number sets. Either way, you can now compute the partition from both the id and the title/URL fields. Don't know the number of partitions in advance, or worried it might change in the future? Create more partitions than you need and combine them into groups, so that you can regroup them later.
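
Here is a minimal sketch of the partition-localID variant for the Tag table. The SHA-1 based partition choice, K = 8, and the in-memory dicts and counters are assumptions for illustration; any stable hash of the title would do:

```python
import hashlib
from itertools import count

K = 8  # assumed number of partitions
sequences = [count(1) for _ in range(K)]   # one id sequence per partition
tags = [dict() for _ in range(K)]          # per partition: title -> local id

def partition_for_title(title):
    # A stable hash (unlike Python's built-in hash() for str) so the same
    # title maps to the same partition in every process.
    digest = hashlib.sha1(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % K

def get_or_create_tag(title):
    """Return the global id (partition, local_id) for a tag title."""
    p = partition_for_title(title)
    local_id = tags[p].get(title)
    if local_id is None:
        local_id = next(sequences[p])
        tags[p][title] = local_id
    return (p, local_id)

def partition_for_id(tag_id):
    # The partition is recoverable from the id alone.
    return tag_id[0]
```

Lookups by title and lookups by id both land on one partition, which is the point of the technique.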

Sure, you can't do the same for the TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, and by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. The only difference is that you use a different partitioning (by a different field) for each index. I see nothing to worry about.

#3

Most likely your queries are going to be related to a user or a topic, which means you should keep all the information related to each of those in one place.

You're talking about distributing the DB; usually this is mostly an issue of synchronization. Reading, which is usually about 90% of the work, can be done on a replicated database. The issue is how to update one DB and stay consistent with all the others without killing performance. This depends on the details of your scenario.

The other possibility is to partition all the data without overlap, as you asked. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just tell which dedicated DB holds the data. You can then query the correct one. Since you partition by ID, all the info related to that topic can live on that dedicated database. For an international website, you could also partition by language or country.
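
The "one database references all topics" idea amounts to a directory lookup followed by a query on the dedicated DB. A rough sketch, with dicts standing in for the databases and with made-up names (directory, db_a, db_b):

```python
# Directory database: topic_id -> name of the dedicated DB holding that topic.
directory = {}
# Dedicated databases, each mapping topic_id -> list of records for that topic.
databases = {"db_a": {}, "db_b": {}}

def assign_topic(topic_id, db_name):
    directory[topic_id] = db_name

def store(topic_id, record):
    # Step 1: ask the directory; step 2: write to the dedicated DB only.
    db = databases[directory[topic_id]]
    db.setdefault(topic_id, []).append(record)

def fetch(topic_id):
    db = databases[directory[topic_id]]
    return db.get(topic_id, [])
```

Unlike a modulo scheme, a directory lets you move a topic to another DB by updating one entry, at the price of an extra lookup (which caches well).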

Last but not least, you'll probably end up mixing the two: some non-overlapping data and some overlapping (replicated) data. First identify the usual operations, then work out how to perform each of them on one DB in the fewest possible queries.

PS: Don't forget about caching; it'll save you more than a distributed DB.
