SQL database design for statistical analysis of a many-to-many relationship

Time: 2022-09-08 16:19:58

It's my first time working with databases so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.

I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:

[Image: proposed database design]

Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.

The Statistics table is what I'm looking to get to in the end. In the image are just some random things I may want to calculate.

So my questions are:

  1. Does this design make sense? What improvements might you make?

  2. What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?

  3. Should I possibly be using python or another language to get these statistics instead of SQL? My understanding was that SQL has the potential to cut down a few hundred lines of python code to one line, so I thought I'd try to give it a shot with SQL.

Thanks!

2 solutions

#1


I think your design is fine, though Race_Results.Age is redundant - watch out if you update a runner's DOB or a race date.
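To illustrate the point about redundancy, here is a rough sketch of the three tables without a stored Age, plus a view that derives the age at race time instead. The diagram is not reproduced here, so apart from Race_ID, Runner_ID and Time, every column name and type below (Name, DOB, Race_Date) and the view name Race_Results_With_Age are my own assumptions:

```sql
-- Hypothetical schema: only Race_ID, Runner_ID, Time and Age are named in the post;
-- Name, DOB, Race_Date and all column types are assumptions for illustration.
CREATE TABLE Runners (
    Runner_ID INT PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL,
    DOB       DATE NOT NULL
);

CREATE TABLE Races (
    Race_ID   INT PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL,
    Race_Date DATE NOT NULL
);

-- Bridge table: one row per runner per race, no stored Age.
CREATE TABLE Race_Results (
    Race_ID   INT NOT NULL,
    Runner_ID INT NOT NULL,
    `Time`    TIME NOT NULL,
    PRIMARY KEY (Race_ID, Runner_ID),
    FOREIGN KEY (Race_ID)   REFERENCES Races (Race_ID),
    FOREIGN KEY (Runner_ID) REFERENCES Runners (Runner_ID)
);

-- Age in whole years on race day, recomputed whenever the view is read,
-- so a corrected DOB or race date can never leave a stale Age behind.
CREATE VIEW Race_Results_With_Age AS
SELECT rr.Race_ID,
       rr.Runner_ID,
       rr.`Time`,
       TIMESTAMPDIFF(YEAR, r.DOB, ra.Race_Date) AS Age
FROM Race_Results rr
JOIN Runners r ON r.Runner_ID = rr.Runner_ID
JOIN Races ra  ON ra.Race_ID  = rr.Race_ID;
```

The composite primary key also stops the same runner from being entered twice in one race. If you prefer to keep Age as a stored column for convenience, that works too, as long as you remember to refresh it after edits.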

It should be reasonably easy to create views for each of your statistics. For example:

CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;

CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
ON rr.Race_ID = b.Race_ID AND rr.Time <= DATE_ADD(b.Time, INTERVAL 10 MINUTE);

SELECT
    rr.Runner_ID,
    COUNT(*) AS Number_of_races,
    COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID;

#2


1) The design of your three tables, Races, Race_Results and Runners, makes perfect sense. Nothing to improve here. The statistics are a different matter. If you manage to write those probably slightly complicated queries in a form that can be used in a view, you should do that and avoid storing statistics that would need to be recalculated each day. Calculating something like this on the fly whenever it is needed is better than storing it, as long as the performance is sufficient.
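As a rough sketch of what such a view could look like, the statement below computes two per-runner statistics straight from Race_Results, so nothing is stored. The view name Runner_Statistics and the output column names are made up for illustration, and I am assuming Time is a TIME column and that your MySQL version accepts a derived table inside a view definition (older releases did not):

```sql
-- Hypothetical view; Runner_Statistics and its column names are illustrative only.
-- Assumes Race_Results.Time is of type TIME.
CREATE VIEW Runner_Statistics AS
SELECT rr.Runner_ID,
       COUNT(*) AS Number_of_races,
       -- A comparison evaluates to 1 or 0 in MySQL, so SUM counts the qualifying races.
       SUM(rr.`Time` <= ADDTIME(b.Best_Time, '00:10:00')) * 100 / COUNT(*)
           AS Pct_within_10_minutes
FROM Race_Results rr
JOIN (
    -- Winning time per race, computed with GROUP BY in a subquery.
    SELECT Race_ID, MIN(`Time`) AS Best_Time
    FROM Race_Results
    GROUP BY Race_ID
) b ON b.Race_ID = rr.Race_ID
GROUP BY rr.Runner_ID;
```

Reading it is then just `SELECT * FROM Runner_Statistics WHERE Runner_ID = 42;`, and the numbers are always current because they are recalculated on every read.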

2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL versions before 8.0, which lack common table expressions and window functions, you will have to use GROUP BY and subqueries. That makes the whole approach a bit more complicated, but totally feasible. If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
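For comparison, MySQL 8.0 and later do accept common table expressions and window functions, so on those versions the "within 10 minutes of first place" metric can be written roughly as below. This is only a sketch: it assumes Time is a TIME column, and the names With_Best, Best_Time and Pct_within_10_minutes are mine, not from the question:

```sql
-- Sketch for MySQL 8.0+ only; With_Best, Best_Time and the output
-- column names are illustrative assumptions.
WITH With_Best AS (
    SELECT Runner_ID,
           Race_ID,
           `Time`,
           -- Winning time of the same race, attached to every result row.
           MIN(`Time`) OVER (PARTITION BY Race_ID) AS Best_Time
    FROM Race_Results
)
SELECT Runner_ID,
       COUNT(*) AS Number_of_races,
       -- The comparison yields 1 or 0, so SUM counts races within 10 minutes.
       SUM(`Time` <= ADDTIME(Best_Time, '00:10:00')) * 100 / COUNT(*)
           AS Pct_within_10_minutes
FROM With_Best
GROUP BY Runner_ID;
```

The window function replaces the separate "best time per race" subquery or view, which keeps everything in a single statement with no intermediate tables.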

3) If you can, put your code in the database. That way, you avoid frequent context switches between your programming language and the database. This is usually the fastest approach in any database system.
