Help designing a schema for a lyrics database

Time: 2022-08-01 12:54:12

I'd like to work on a project, but it's a little odd. I want to create a site that shows lyrics and their translations, but they are shown simultaneously side-by-side (so this isn't just a normal i18n of the site).

I have normalized the tables like this (formatted to show hierarchy).

artists
  artistNames

  albums
    albumNames

    tracks
      trackNames
      trackLyrics
      user
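
For concreteness, a rough SQLite sketch of a couple of these tables might look like this (the column names are just my guesses for illustration, not a final design):

<?php
// Rough sketch only: table and column names are placeholders, not a final design.
$pdo = new PDO('sqlite:lyrics.db');
$pdo->exec('CREATE TABLE artists (id INTEGER PRIMARY KEY)');
$pdo->exec('CREATE TABLE artistNames (
    artist_id INTEGER NOT NULL REFERENCES artists(id),
    language  TEXT    NOT NULL,
    name      TEXT    NOT NULL,
    PRIMARY KEY (artist_id, language)
)');
// albums/albumNames, tracks/trackNames and trackLyrics would follow the same
// pattern: a core table plus a per-language table keyed on (parent_id, language).
?>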

So questions,

First, that'll be a whopping seven joins. I must have written pretty small queries in the past because I've never come across something like this. Is joining so many tables a bad thing? I'm pretty sure I'll be using SQLite for this project, but does anyone think PostgreSQL or MySQL could perform better with a pretty big join like this?
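
To make that concrete, the query I'm picturing (sketched against the guessed column names above) joins all seven tables at once:

<?php
// Sketch of the seven-table join, using the guessed column names above.
$sql = '
    SELECT tn.name AS track, tl.body AS lyrics,
           aln.name AS album, arn.name AS artist
    FROM tracks t
    JOIN trackNames  tn  ON tn.track_id   = t.id
    JOIN trackLyrics tl  ON tl.track_id   = t.id
    JOIN albums      al  ON al.id         = t.album_id
    JOIN albumNames  aln ON aln.album_id  = al.id
    JOIN artists     ar  ON ar.id         = al.artist_id
    JOIN artistNames arn ON arn.artist_id = ar.id
    WHERE t.id = :trackId
';
$stmt = $pdo->prepare($sql);   // $pdo as in the sketch above
$stmt->execute(array(':trackId' => 1));
?>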

Second, my current self-built framework uses a data mapper to create domain objects. This is the first time I will be working with so many one-to-many relationships, so my mapper really only maps one row to one object. For example,

id      name
------  ----------
1       *
2       Stephen Chow

So it's super easy to map objects. But with those one-to-many relationships...

id      language    name
------  ----------  -------
1       en          *
1       zh          *
2       en          Stephen Chow
2       zh          周星馳

...I'm not sure what to do. Is looping through the result set to create a massive array and feeding it to my domain object factory the only option when dealing with a data set like this?

<?php
    $artists = array(
        array(
            'id' => 1,
            'names' => array(
                'en' => '*',
                'zh' => '*'
            )
        ),
        array(
            'id' => 2,
            'names' => array(
                'en' => 'Stephen Chow',
                'zh' => '周星馳'
            )
        )
    );
?>
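
To spell out what I mean by looping, the kind of grouping pass I'm imagining (an untested sketch with the same guessed column names) would be:

<?php
// Untested sketch: collapse the one-row-per-name result set into one
// element per artist, matching the array shape shown above.
$stmt = $pdo->query('SELECT artist_id, language, name FROM artistNames');
$artists = array();
foreach ($stmt as $row) {
    $id = $row['artist_id'];
    if (!isset($artists[$id])) {
        $artists[$id] = array('id' => $id, 'names' => array());
    }
    $artists[$id]['names'][$row['language']] = $row['name'];
}
$artists = array_values($artists);
?>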

I have an itch to just denormalize these tables so I can get my one-row-per-object application working, but I've always read this is not the way to go.

Third, does this schema sound right for the job?

5 Answers

#1


Twelve-way joins are not unheard of in serious industrial work. You need sufficient hardware, a strong DBMS, and good database design. Seven-way joins should be a breeze in any good environment.

You separate out data, as needed, to avoid difficulties like database update anomalies. These anomalies are what you get when you don't follow the normalization rules. You join data as needed to get the data that you need in a single result.

Sometimes it's better to ignore some of the normalization rules when you build a database. In that case, you need an alternative set of design principles in order to avoid design by trial and error. The amount of joining you are doing has little to do with the disadvantages of looping through results or unfortunate mapping between tuples and objects.

Most of the mappings between tuples (table rows) and objects are done in an incorrect fashion. A tuple is an object, but it isn't an application-oriented object. This can cause performance problems, difficult programming, or both.

As far as you can, avoid looping through results one row at a time. Deal with results as a set of data. If you can't do that in PHP, then you need to learn how, or get a better programming environment.
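
For example, in PHP you can let PDO do the grouping instead of walking rows one at a time; a minimal sketch, assuming a name table keyed by an artist id column, would be:

<?php
// Sketch: PDO::FETCH_GROUP keys the result by the first selected column,
// so the application receives one entry per artist instead of one per row.
$pdo = new PDO('sqlite:lyrics.db');
$stmt = $pdo->query('SELECT artist_id, language, name FROM artistNames');
$byArtist = $stmt->fetchAll(PDO::FETCH_GROUP | PDO::FETCH_ASSOC);
// $byArtist[2] => array(
//     array('language' => 'en', 'name' => 'Stephen Chow'),
//     array('language' => 'zh', 'name' => '周星馳'),
// )
?>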

#2


Just a note. I'm not really sure that 7 tables is that big a join. I seem to remember that Postgres has a special query optimiser (based on a genetic algorithm, no less) that only kicks in once you join 12 tables or more.

#3


The general rule is to make the schema as normalized as possible, then perform stress tests with the expected amount of data. If you find performance bottlenecks, you should try to optimize in the following order:

  1. Profile and optimize queries:
    • add indices to the schema (a quick sketch follows this list)
    • add hints to the query optimizer (I don't know whether SQLite has any, but most databases do)
  2. If step 1 does not yield any performance benefit, consider denormalizing the database.
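
As a rough sketch of step 1 against SQLite (the trackLyrics table name comes from the question; the column and index names are guesses):

<?php
// Sketch: add an index on a frequently joined column, then check whether
// SQLite actually uses it.
$pdo = new PDO('sqlite:lyrics.db');
$pdo->exec('CREATE INDEX IF NOT EXISTS idx_trackLyrics_track ON trackLyrics (track_id)');
$plan = $pdo->query('EXPLAIN QUERY PLAN SELECT * FROM trackLyrics WHERE track_id = 1');
foreach ($plan as $row) {
    echo $row['detail'], "\n";  // should report a "SEARCH ... USING INDEX" step
}
?>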

Denormalizing a database is usually needed only if you work with "large" amounts of data. I checked several lyrics databases on the internet, and the largest I found has lyrics for about 400,000 songs. Let's assume you can find 1,000,000 lyrics performed by 500,000 artists. That is an amount of data that all of these databases can easily handle on an average modern computer.

#4


Doing this many joins shouldn't be an issue on any serious DB. I haven't worked with SQLite enough to know whether it's in the "serious" category. The only way to find out is to create your schema, load up a lot of data, and start looking at query plans (visual explains are very useful here). When I do these kinds of tests, I usually shoot for 10x the data I expect to have in production. If things work OK with that much data, I know I should be OK with the real data.
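
For instance, a throwaway loader along these lines (hypothetical column names) can pump in synthetic rows until you're well past the expected volume:

<?php
// Throwaway sketch: bulk-insert synthetic lyrics so query plans get tested
// at roughly 10x the expected volume. Column names are hypothetical, and it
// assumes foreign keys are not enforced during the load.
$pdo = new PDO('sqlite:lyrics.db');
$pdo->beginTransaction();
$ins = $pdo->prepare('INSERT INTO trackLyrics (track_id, language, body) VALUES (?, ?, ?)');
for ($i = 1; $i <= 4000000; $i++) {
    $ins->execute(array($i, 'en', 'placeholder lyrics ' . $i));
}
$pdo->commit();
?>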

Also, depending on how you need to retrieve the data, you may want to try subqueries instead of joins:

select a.*, (select r.name from artist r where r.id = a.artist and r.locale = 'en') as artist_name from album a where a.id = 1;

#5


I've helped a friend optimize a web storefront. Your case is much the same.

First. What is your priority, webpage speed or update speed?

Normal forms were designed to make data maintenance simple. If Prince changes his name again, voila, just one row is updated. But if you want your web pages to render as fast as possible, then third normal form isn't your best plan. Yes, everyone is correct that it will do a 7-way join no problem, but that will be dozens of I/Os... an index lookup on every table, then table access by rowid, again and again. If you denormalize for page-loading speed you may do 2 or 3 I/Os, which also allows for greater scaling: since every page hit needs fewer I/Os to complete, you'll be able to serve more simultaneous hits before maxing out your I/O.

But there's no reason not to do both. You can keep the base data, the official copy, in normal form, then write a script that generates a denormalized table for web performance. If it's not that big, you can regenerate the whole thing in a few minutes of maintenance downtime. If it is very big, you may need to be smart about the update and only change what needs to change, keeping change vectors in an intermediate driving table.
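
A sketch of that regeneration script, one flat row per track and language, could be as simple as this (all table and column names are placeholders):

<?php
// Sketch: rebuild a flat, read-optimized table from the normalized base
// tables during a maintenance window. All names are placeholders.
$pdo = new PDO('sqlite:lyrics.db');
$pdo->exec('DROP TABLE IF EXISTS web_tracks');
$pdo->exec('CREATE TABLE web_tracks AS
    SELECT t.id AS track_id, tl.language,
           tn.name AS track, tl.body AS lyrics,
           aln.name AS album, arn.name AS artist
    FROM tracks t
    JOIN trackLyrics tl  ON tl.track_id   = t.id
    JOIN trackNames  tn  ON tn.track_id   = t.id  AND tn.language  = tl.language
    JOIN albums      al  ON al.id         = t.album_id
    JOIN albumNames  aln ON aln.album_id  = al.id AND aln.language = tl.language
    JOIN artists     ar  ON ar.id         = al.artist_id
    JOIN artistNames arn ON arn.artist_id = ar.id AND arn.language = tl.language');
$pdo->exec('CREATE INDEX idx_web_tracks_track ON web_tracks (track_id)');
?>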

But at the heart of your design I have a question.

Artist names change over time. John Cougar became John Cougar Melonhead (or something), and then later he became John Mellencamp. Do you care which John did a song? Will you stamp the entries with valid-from and valid-to dates?

It looks like you have a 1-to-n relationship from artists to albums, but that really should be many-to-many.
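
If those answers do matter to you, the schema needs a couple of extra pieces; something along these lines (illustrative names only) would cover the last two points:

<?php
// Illustrative only: a junction table makes artist/album many-to-many, and
// validity dates on artistNames record when each name applied.
$pdo = new PDO('sqlite:lyrics.db');
$pdo->exec('CREATE TABLE albumArtists (
    album_id  INTEGER NOT NULL REFERENCES albums(id),
    artist_id INTEGER NOT NULL REFERENCES artists(id),
    PRIMARY KEY (album_id, artist_id)
)');
$pdo->exec('ALTER TABLE artistNames ADD COLUMN valid_from TEXT');
$pdo->exec('ALTER TABLE artistNames ADD COLUMN valid_to TEXT');
?>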

Sometimes the same album is released more than once, with different included tracks and sometimes with different names for a track. Think international releases, or bonus tracks. How will you know it's all the same album?

If you don't care about those details, then why bother with normalization? If Jon and Vangelis is one artist, then there is simply no need to normalize. You're not interested in the answers normalization will provide.
