How to randomly select 8 songs from the top 50 with unique user_id?

Date: 2021-04-20 12:54:57

I am trying to get the top 50 downloads and then pick 8 of them at random. The 8 results must also have unique user_ids. I came up with this so far:

Song.select('DISTINCT songs.user_id, songs.*').where(:is_downloadable => true).order('songs.downloads_count DESC').limit(50).sort_by{rand}.slice(0,8)

My only gripe with this is, the last part .sort_by{rand}.slice(0,8) is being done via Ruby. Any way I can do all this via Active Record?

2 answers

#1


3  

I wonder how the column user_id ended up in the table songs? That means you have one row for every combination of song and user? In a normalized schema, that would be an n:m relationship implemented with three tables:

song(song_id, ...)
usr(usr_id, ...)    -- "user" is a reserved word
download (song_id, user_id, ...) -- implementing the n:m relationship

The query in your question yields incorrect results. The same user_id can pop up multiple times. DISTINCT does not do what you seem to expect it to. You need DISTINCT ON or some other method like aggregate or window functions.
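
As a rough in-memory analogy (not the database's actual execution), the difference can be sketched in Ruby. The sample rows are made up for illustration; `uniq` on whole rows stands in for `SELECT DISTINCT songs.user_id, songs.*`, and `uniq` with a block stands in for `DISTINCT ON (user_id)`.

```ruby
# Hypothetical sample rows standing in for the songs table.
SONGS = [
  { id: 1, user_id: 10, downloads_count: 900 },
  { id: 2, user_id: 10, downloads_count: 800 },
  { id: 3, user_id: 20, downloads_count: 700 }
].freeze

# Plain DISTINCT compares every selected column; the rows differ by id,
# so nothing is removed.
def distinct_rows(rows)
  rows.uniq
end

# DISTINCT ON (user_id) with ORDER BY user_id, downloads_count DESC:
# keep only the first (most downloaded) row per user.
def distinct_on_user(rows)
  rows.sort_by { |r| [r[:user_id], -r[:downloads_count]] }
      .uniq { |r| r[:user_id] }
end

puts distinct_rows(SONGS).size                     # 3
p distinct_on_user(SONGS).map { |r| r[:id] }       # [1, 3]
```

User 10 still appears twice after the plain `uniq`, which is the problem with the query in the question.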

You also need to use subqueries or CTEs, because this cannot be done in one step. When you use DISTINCT you cannot at the same time ORDER BY random(), because the sort order cannot disagree with the order dictated by DISTINCT. This query is certainly not trivial.

Simple case, top 50 songs

If you are happy to just pick the top 50 songs (not knowing how many duplicate user_ids are among them), this "simple" case will do:

WITH x AS (
    SELECT *
    FROM   songs
    WHERE  is_downloadable
    ORDER  BY downloads_count DESC
    LIMIT  50
    )
    , y AS (
    SELECT DISTINCT ON (user_id) *
    FROM   x
    ORDER  BY user_id, downloads_count DESC -- pick most popular song per user
--  ORDER  BY user_id, random() -- pick random song per user
    )
SELECT *
FROM   y
ORDER  BY random()
LIMIT  8;

  1. Get the 50 songs with the highest downloads_count. Users can show up multiple times.
  2. Pick 1 song per user, either randomly or the most popular one. Your question does not define which.
  3. Randomly pick 8 songs, now with distinct user_id.
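
The three steps can also be mirrored in plain Ruby, which may help when checking what the query is expected to return. This is only an in-memory sketch: the method name `simple_case`, its keyword arguments, and the generated sample data are assumptions for illustration, and it does not replace doing the work in the database.

```ruby
# In-memory analogue of the "simple case" query; each row is a plain hash.
def simple_case(rows, top: 50, pick: 8, rng: Random.new)
  # Step 1: the `top` downloadable songs with the highest downloads_count.
  top_rows = rows.select { |r| r[:is_downloadable] }
                 .max_by(top) { |r| r[:downloads_count] }
  # Step 2: one song per user, here the most popular one.
  per_user = top_rows.group_by { |r| r[:user_id] }
                     .map { |_user, songs| songs.max_by { |r| r[:downloads_count] } }
  # Step 3: up to `pick` of those songs, chosen at random.
  per_user.sample(pick, random: rng)
end

# Hypothetical data: 200 songs spread over 40 users.
SAMPLE_ROWS = (1..200).map do |i|
  { id: i, user_id: i % 40, downloads_count: 1000 - i, is_downloadable: true }
end.freeze

picked = simple_case(SAMPLE_ROWS)
puts picked.map { |r| r[:user_id] }.uniq.size  # 8 distinct users
```

If the top 50 contain fewer than 8 distinct users, step 3 simply returns fewer rows, just as `LIMIT 8` would.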

You only need an index on songs.downloads_count for this to be fast:

CREATE INDEX songs_downloads_count_idx ON songs (downloads_count DESC);

Top 50 songs with unique user_id

WITH x AS (
    SELECT DISTINCT ON (user_id) *
    FROM   songs
    WHERE  is_downloadable
    ORDER  BY user_id, downloads_count DESC
    )
    , y AS (
    SELECT *
    FROM   x
    ORDER  BY downloads_count DESC
    LIMIT  50
    )
SELECT *
FROM   y
ORDER  BY random()
LIMIT  8;

  1. Get the song with the highest downloads_count per user. Every user can only show up once, so it has to be the one song with the highest downloads_count.
  2. Pick the 50 with the highest downloads_count from that.
  3. Pick 8 songs from that randomly.
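
The order of the first two steps is what separates this query from the simple case. A small in-memory Ruby sketch, with made-up rows and hypothetical method names, shows how the candidate pool differs when one user owns several of the top songs:

```ruby
# Most downloaded song per user (analogue of DISTINCT ON (user_id)).
def best_per_user(rows)
  rows.group_by { |r| r[:user_id] }
      .map { |_user, songs| songs.max_by { |r| r[:downloads_count] } }
end

# Simple case: take the top N first, then reduce to one song per user.
def top_then_unique(rows, top)
  best_per_user(rows.max_by(top) { |r| r[:downloads_count] })
end

# This section's query: one song per user first, then take the top N.
def unique_then_top(rows, top)
  best_per_user(rows).max_by(top) { |r| r[:downloads_count] }
end

# Hypothetical data: user 1 owns the three most downloaded songs.
SKEWED_ROWS = [
  { id: 1, user_id: 1, downloads_count: 90 },
  { id: 2, user_id: 1, downloads_count: 80 },
  { id: 3, user_id: 1, downloads_count: 70 },
  { id: 4, user_id: 2, downloads_count: 60 },
  { id: 5, user_id: 3, downloads_count: 50 }
].freeze

p top_then_unique(SKEWED_ROWS, 3).map { |r| r[:id] }  # [1]
p unique_then_top(SKEWED_ROWS, 3).map { |r| r[:id] }  # [1, 4, 5]
```

With a top of 3, the simple case leaves a single candidate, while deduplicating first leaves three.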

With a big table, performance will suck, because you have to find the best row for every user before you can proceed. A multi-column index will help, but it will still not be very fast:

CREATE INDEX songs_u_dc_idx ON songs (user_id, downloads_count DESC);

The same, faster

If duplicate user_ids among the top songs are predictably rare, you can use a trick. Pick just enough of the top downloads, so that the top 50 with unique user_id are certainly among them. After this step, proceed like above. This will be much faster with big tables, because the top n rows can be read from the top of an index quickly:

WITH x AS (
    SELECT *
    FROM   songs
    WHERE  is_downloadable
    ORDER  BY downloads_count DESC
    LIMIT  100 -- adjust to your secure estimate
    )
    , y AS (
    SELECT DISTINCT ON (user_id) *
    FROM   x
    ORDER  BY user_id, downloads_count DESC
    )
    , z AS (
    SELECT *
    FROM   y
    ORDER  BY downloads_count DESC
    LIMIT  50
    )
SELECT *
FROM   z
ORDER  BY random()
LIMIT  8;

The index from the simple case above will suffice to make it almost as fast as the simple case.

This would fall short if fewer than 50 distinct users are among the top 100 "songs".
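
One way to sanity-check the chosen limit is to count the distinct users among the top rows. The sketch below does that in memory on generated data; the method names and the sample rows are assumptions for illustration, and on a real table the same count would be taken with a query instead.

```ruby
# Number of distinct users among the `limit` most downloaded rows.
def distinct_users_in_top(rows, limit)
  rows.max_by(limit) { |r| r[:downloads_count] }
      .map { |r| r[:user_id] }
      .uniq
      .size
end

# True if the first CTE's LIMIT yields at least `wanted` distinct users.
def limit_sufficient?(rows, limit:, wanted:)
  distinct_users_in_top(rows, limit) >= wanted
end

# Hypothetical data: 300 songs spread over 60 users.
CANDIDATES = (1..300).map do |i|
  { id: i, user_id: i % 60, downloads_count: 1000 - i }
end.freeze

puts limit_sufficient?(CANDIDATES, limit: 100, wanted: 50)  # true
puts limit_sufficient?(CANDIDATES, limit: 40, wanted: 50)   # false
```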

All queries should work with PostgreSQL 8.4 or later.

If it has to be faster still, create a materialized view that holds the pre-selected top 50, and rewrite it at regular intervals or when triggered by events. If you make heavy use of this and the table is big, I would go for that. Otherwise it's not worth the overhead.

Generalized, improved solution

I later formalized and improved this approach further to be applicable to a whole class of similar problems under this related question at dba.SE.

#2


1  

You could use PostgreSQL's RANDOM() function in the order by, making it

___.order('songs.downloads_count DESC, RANDOM()').limit(8)

This doesn't work, though, because with SELECT DISTINCT PostgreSQL requires every ORDER BY expression to appear in the select list. You'll get an error like

ActiveRecord::StatementInvalid: PG::Error: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list

The only way to do everything you're asking in SQL (using PostgreSQL) is with a subquery, which may or may not be a better solution for you. If it is, your best bet is to write out the full query/subquery using find_by_sql.

I'm happy to help come up with the SQL, though now that you know about RANDOM(), it should be pretty trivial.
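
For reference, a sketch of what the find_by_sql route might look like, reusing the "simple case" CTE query from the first answer. The helper name `random_top_songs_sql` and its keyword arguments are hypothetical; only the SQL string is built here, and the call on the `Song` model is shown as a comment because it needs a Rails app and a PostgreSQL connection.

```ruby
# Builds the "top N songs, one per user, a few at random" query as a string.
# Both limits go through Integer() so only numbers are interpolated.
def random_top_songs_sql(top: 50, pick: 8)
  <<~SQL
    WITH x AS (
        SELECT *
        FROM   songs
        WHERE  is_downloadable
        ORDER  BY downloads_count DESC
        LIMIT  #{Integer(top)}
        )
        , y AS (
        SELECT DISTINCT ON (user_id) *
        FROM   x
        ORDER  BY user_id, downloads_count DESC
        )
    SELECT *
    FROM   y
    ORDER  BY random()
    LIMIT  #{Integer(pick)}
  SQL
end

puts random_top_songs_sql

# In a Rails app with the Song model from the question (not runnable here):
#   songs = Song.find_by_sql(random_top_songs_sql)
```

find_by_sql returns model instances, so the result can be used like any other array of Song records.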
