
Time: 2022-09-08 16:19:58

It's my first time working with databases so I spent a bunch of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to produce statistics on each runner.


I am looking for advice and suggestions on my database design as well as how I might go about producing statistics. Please see this image for my proposed design:



Basically, I'm thinking there's a many-to-many relationship between Races and Runners: there are multiple runners in a race, and a runner can have run multiple races. Therefore, I have the bridge table called Race_Results to store the time and age for a given runner in a given race.
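To make the bridge-table idea concrete, here is a minimal sketch of the three tables. The diagram is not available, so every column name other than Race_ID, Runner_ID, Time, Age and DOB is a guess, Time is assumed to be stored as whole seconds, and SQLite (through Python's built-in sqlite3 module) stands in for whatever database is actually used.

```python
import sqlite3

# Hypothetical schema: column names such as Name and Race_Date are assumptions.
SCHEMA = """
CREATE TABLE Runners (
    Runner_ID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL,
    DOB       TEXT NOT NULL            -- ISO date, e.g. '1990-05-17'
);
CREATE TABLE Races (
    Race_ID   INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL,
    Race_Date TEXT NOT NULL            -- ISO date
);
CREATE TABLE Race_Results (
    Race_ID   INTEGER NOT NULL REFERENCES Races(Race_ID),
    Runner_ID INTEGER NOT NULL REFERENCES Runners(Runner_ID),
    Time      INTEGER NOT NULL,        -- finish time in seconds (assumption)
    Age       INTEGER,                 -- runner's age on race day
    PRIMARY KEY (Race_ID, Runner_ID)   -- one result per runner per race
);
"""

def make_db():
    """Create an in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

conn = make_db()
conn.execute("INSERT INTO Runners VALUES (1, 'Alice', '1990-05-17')")
conn.execute("INSERT INTO Races VALUES (10, 'City Marathon', '2015-04-12')")
conn.execute("INSERT INTO Race_Results VALUES (10, 1, 11400, 24)")

# The composite primary key rejects a second result for the same pair.
try:
    conn.execute("INSERT INTO Race_Results VALUES (10, 1, 12000, 24)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The composite primary key on (Race_ID, Runner_ID) is what enforces "one row per runner per race" in the bridge table.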


The Statistics table is what I'm looking to get to in the end. In the image are just some random things I may want to calculate.


So my questions are:


  1. Does this design make sense? What improvements might you make?


  2. What kinds of SQL queries would be used to calculate these statistics? Would I have to make some other tables in between - for example, to find the percentage of the time a runner finished within 10 minutes of first place, would I have to first make a table of all runner data for that race and then do some queries, or is there a better way? Any links I should check out for more on calculating these sorts of statistics?


  3. Should I be using Python or another language to get these statistics instead of SQL? My understanding was that SQL has the potential to cut a few hundred lines of Python code down to one line, so I thought I'd give it a shot with SQL.




2 Answers



I think your design is fine, though Race_Results.Age is redundant: it can be derived from the runner's DOB and the race date, so the stored value goes stale if you update either of those.
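One way to avoid the stale-age problem is to leave Age out of Race_Results and derive it when querying. The following is only a sketch: the view name Race_Results_With_Age and the Race_Date column are made up for illustration, dates are assumed to be ISO strings, and SQLite stands in for the real database (in MySQL, TIMESTAMPDIFF(YEAR, DOB, Race_Date) should do the same job).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Runners (Runner_ID INTEGER PRIMARY KEY, DOB TEXT NOT NULL);
CREATE TABLE Races (Race_ID INTEGER PRIMARY KEY, Race_Date TEXT NOT NULL);
CREATE TABLE Race_Results (
    Race_ID INTEGER NOT NULL, Runner_ID INTEGER NOT NULL,
    Time INTEGER NOT NULL,             -- seconds (assumption)
    PRIMARY KEY (Race_ID, Runner_ID)
);
INSERT INTO Runners VALUES (1, '1990-05-17');
INSERT INTO Races VALUES (10, '2015-04-12'), (11, '2015-10-04');
INSERT INTO Race_Results VALUES (10, 1, 11400), (11, 1, 11250);

-- Age = difference in years, minus 1 if the birthday had not yet
-- occurred in the race year.
CREATE VIEW Race_Results_With_Age AS
SELECT rr.Race_ID, rr.Runner_ID, rr.Time,
       strftime('%Y', ra.Race_Date) - strftime('%Y', ru.DOB)
       - (strftime('%m-%d', ra.Race_Date) < strftime('%m-%d', ru.DOB)) AS Age
FROM Race_Results rr
JOIN Races ra ON ra.Race_ID = rr.Race_ID
JOIN Runners ru ON ru.Runner_ID = rr.Runner_ID;
""")

ages = conn.execute(
    "SELECT Race_ID, Age FROM Race_Results_With_Age ORDER BY Race_ID"
).fetchall()
```

Because the age is computed from DOB and Race_Date each time the view is read, correcting either value is reflected immediately, with nothing to keep in sync.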


It should be reasonably easy to create views for each of your statistics. For example:


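-- Winning time for each race
CREATE VIEW Best_Times AS
SELECT Race_ID, MIN(Time) AS Time
FROM Race_Results
GROUP BY Race_ID;

-- Results that finished within 10 minutes of the winner
CREATE VIEW Within_10_Minutes AS
SELECT rr.*
FROM Race_Results rr
JOIN Best_Times b
ON rr.Race_ID = b.Race_ID AND rr.Time <= DATE_ADD(b.Time, INTERVAL 10 MINUTE);

-- Per runner: races run and the share finished within 10 minutes of 1st place
SELECT
    rr.Runner_ID,
    COUNT(*) AS Number_of_races,
    COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place`
FROM Race_Results rr
LEFT JOIN Within_10_Minutes w
ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID
GROUP BY rr.Runner_ID;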



1) The design of your 3 tables Races, Race_Results and Runners makes perfect sense. Nothing to improve here. The statistics are a different matter. If you manage to write those (probably slightly complicated) queries in a way that lets them be used in a view, you should do that and avoid saving statistics that would need to be recalculated each day. Calculating something like this on the fly whenever it is needed is better than saving it, as long as the performance is sufficient.
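As a rough illustration of computing statistics on the fly instead of storing them, the sketch below defines a view and reads it before and after a new result arrives. The view name Runner_Stats and its columns are invented for the example, Time is assumed to be in seconds, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Race_Results (
    Race_ID INTEGER NOT NULL, Runner_ID INTEGER NOT NULL,
    Time INTEGER NOT NULL,             -- seconds (assumption)
    PRIMARY KEY (Race_ID, Runner_ID)
);
INSERT INTO Race_Results VALUES (10, 1, 11400), (11, 1, 11000), (10, 2, 12600);

-- Hypothetical statistics view: nothing is stored, only the query.
CREATE VIEW Runner_Stats AS
SELECT Runner_ID,
       COUNT(*)  AS Number_of_races,
       MIN(Time) AS Best_Time,
       AVG(Time) AS Average_Time
FROM Race_Results
GROUP BY Runner_ID;
""")

def stats_for(runner_id):
    """Return (Number_of_races, Best_Time, Average_Time) for one runner."""
    return conn.execute(
        "SELECT Number_of_races, Best_Time, Average_Time "
        "FROM Runner_Stats WHERE Runner_ID = ?", (runner_id,)
    ).fetchone()

before = stats_for(1)
conn.execute("INSERT INTO Race_Results VALUES (12, 1, 10600)")
after = stats_for(1)   # reflects the new race without any recalculation step
```

If a view like this ever becomes too slow, that would be the point to consider storing the results, not before.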


2) If you were using Oracle or MSSQL, I'd say you would be fine with some aggregate functions and common table expressions. In MySQL before 8.0, which added common table expressions and window functions, you have to use GROUP BY and subqueries instead. That makes the whole approach a bit more complicated, but it is totally feasible. If you ask for a specific metric in a comment, I might be able to suggest some code, though my expertise is more in Oracle and MSSQL.
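For the "within 10 minutes of first place" metric from the question, a common table expression could look roughly like the sketch below. It assumes Time is stored as whole seconds (so 10 minutes is 600) and runs on SQLite through Python; the same WITH syntax is accepted by Oracle, MSSQL and MySQL 8.0 or later, though the details may need adjusting. The alias Pct_within_10_min and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Race_Results (
    Race_ID INTEGER NOT NULL, Runner_ID INTEGER NOT NULL,
    Time INTEGER NOT NULL,             -- seconds (assumption)
    PRIMARY KEY (Race_ID, Runner_ID)
);
INSERT INTO Race_Results VALUES
    (10, 1, 7800), (10, 2, 8300), (10, 3, 9000),
    (11, 1, 8000), (11, 2, 7900), (11, 3, 8400);
""")

QUERY = """
WITH Best_Times AS (                   -- winning time per race
    SELECT Race_ID, MIN(Time) AS Best_Time
    FROM Race_Results
    GROUP BY Race_ID
)
SELECT rr.Runner_ID,
       COUNT(*) AS Number_of_races,
       -- the comparison yields 1 or 0, so SUM counts the qualifying races
       100.0 * SUM(rr.Time <= b.Best_Time + 600) / COUNT(*) AS Pct_within_10_min
FROM Race_Results rr
JOIN Best_Times b ON b.Race_ID = rr.Race_ID
GROUP BY rr.Runner_ID
ORDER BY rr.Runner_ID
"""

rows = conn.execute(QUERY).fetchall()
```

No intermediate table is needed: the CTE plays the role of the "table of all runner data for that race" and exists only for the duration of the query.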


3) If you can, put your code in the database. That way, you avoid frequent context switches between your programming language and the database. This approach is usually the fastest, whichever database system you use.
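To show what the context switches amount to, here is a small sketch that computes each runner's average time twice: once by looping in Python with one query per runner, and once with a single GROUP BY. The helper run and its counter are invented for the illustration, Time is assumed to be in seconds, and an in-memory SQLite database has no network latency, so the count of statements is only a stand-in for the round trips a server database would incur.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Race_Results (
    Race_ID INTEGER NOT NULL, Runner_ID INTEGER NOT NULL,
    Time INTEGER NOT NULL,             -- seconds (assumption)
    PRIMARY KEY (Race_ID, Runner_ID)
);
INSERT INTO Race_Results VALUES
    (10, 1, 7800), (10, 2, 8300), (11, 1, 8000), (11, 2, 7900), (12, 2, 8100);
""")

statements = 0

def run(sql, params=()):
    """Execute one statement and count it as one round trip."""
    global statements
    statements += 1
    return conn.execute(sql, params).fetchall()

# Client-side: one query per runner, averaging in Python.
statements = 0
client_side = {}
for (runner_id,) in run("SELECT DISTINCT Runner_ID FROM Race_Results"):
    times = [t for (t,) in run(
        "SELECT Time FROM Race_Results WHERE Runner_ID = ?", (runner_id,))]
    client_side[runner_id] = sum(times) / len(times)
client_statements = statements

# In-database: a single GROUP BY does the same work.
statements = 0
in_database = dict(run(
    "SELECT Runner_ID, AVG(Time) FROM Race_Results GROUP BY Runner_ID"))
db_statements = statements
```

Both approaches give the same averages, but the client-side loop issues one statement per runner plus one, while the in-database version issues exactly one regardless of how many runners there are.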




