Postgres window function and group by exception

Time: 2022-04-01 22:59:27

I'm trying to put together a query that will retrieve the statistics of a user (profit/loss) as a cumulative result, over a period of time.

Here's the query I have so far:

SELECT p.name, e.date, 
    sum(sp.payout) OVER (ORDER BY e.date)
    - sum(s.buyin) OVER (ORDER BY e.date) AS "Profit/Loss" 
FROM result r 
    JOIN game g ON r.game_id = g.game_id 
    JOIN event e ON g.event_id = e.event_id 
    JOIN structure s ON g.structure_id = s.structure_id 
    JOIN structure_payout sp ON g.structure_id = sp.structure_id
                            AND r.position = sp.position 
    JOIN player p ON r.player_id = p.player_id 
WHERE p.player_id = 17 
GROUP BY p.name, e.date, e.event_id, sp.payout, s.buyin
ORDER BY p.name, e.date ASC

The query will run. However, the result is slightly incorrect. The reason is that an event can have multiple games (with different sp.payout values). Therefore, the above comes out with multiple rows if a user has 2 results in an event with different payouts (e.g. there are 4 games per event, and a user gets £20 from one and £40 from another).

The obvious solution would be to amend the GROUP BY to:

GROUP BY p.name, e.date, e.event_id

However, Postgres complains about this, as it doesn't appear to recognize that sp.payout and s.buyin are inside an aggregate function. I get the error:

column "sp.payout" must appear in the GROUP BY clause or be used in an aggregate function

I'm running PostgreSQL 9.1 on an Ubuntu Linux server.
Am I missing something, or could this be a genuine defect in Postgres?

1 solution

#1


You are not, in fact, using aggregate functions. You are using window functions. That's why PostgreSQL demands sp.payout and s.buyin to be included in the GROUP BY clause.

By appending an OVER clause, the aggregate function sum() is turned into a window function, which aggregates values per partition while keeping all rows.
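
To make the difference concrete, here is a minimal sketch. The table payout_demo and its rows are made up for illustration and are not part of the question's schema.

```sql
-- Hypothetical demo table, not part of the original schema.
CREATE TEMP TABLE payout_demo (event_date date, payout int);

INSERT INTO payout_demo VALUES
  ('2012-01-01', 20)
, ('2012-01-01', 40)
, ('2012-01-02', 10);

-- Aggregate function: rows are collapsed to one row per group.
SELECT event_date, sum(payout) AS date_total
FROM   payout_demo
GROUP  BY event_date
ORDER  BY event_date;
-- 2 rows: (2012-01-01, 60), (2012-01-02, 10)

-- Window function: the same sum() with OVER keeps every input row.
SELECT event_date, payout
     , sum(payout) OVER (PARTITION BY event_date) AS date_total
FROM   payout_demo
ORDER  BY event_date, payout;
-- 3 rows: (2012-01-01, 20, 60), (2012-01-01, 40, 60), (2012-01-02, 10, 10)
```

Because the window version returns one row per input row, it never satisfies a GROUP BY on its own. Its arguments still have to be grouped or aggregated first.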

You can combine window functions and aggregate functions. Aggregations are applied first. I did not understand from your description how you want to handle multiple payouts / buyins per event. As a guess, I calculate a sum of them per event. Now I can remove sp.payout and s.buyin from the GROUP BY clause and get one row per player and event:

SELECT p.name
     , e.event_id
     , e.date
     , sum(sum(sp.payout)) OVER w
     - sum(sum(s.buyin  )) OVER w AS "Profit/Loss" 
FROM   player            p
JOIN   result            r ON r.player_id     = p.player_id  
JOIN   game              g ON g.game_id       = r.game_id 
JOIN   event             e ON e.event_id      = g.event_id 
JOIN   structure         s ON s.structure_id  = g.structure_id 
JOIN   structure_payout sp ON sp.structure_id = g.structure_id
                          AND sp.position     = r.position
WHERE  p.player_id = 17 
GROUP  BY p.player_id, e.event_id  -- p.player_id (PK) covers p.name in the SELECT list
WINDOW w AS (ORDER BY e.date, e.event_id)
ORDER  BY e.date, e.event_id;

In this expression: sum(sum(sp.payout)) OVER w, the outer sum() is a window function, the inner sum() is an aggregate function.
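
The two levels can be seen in isolation with a small sketch. The sample rows in the CTE are invented, and the column names only loosely follow the question's schema.

```sql
-- Hypothetical sample: event 1 has two games, event 2 has one.
WITH t(event_id, event_date, payout, buyin) AS (
   VALUES (1, date '2012-01-01', 20, 10)
        , (1, date '2012-01-01', 40, 10)
        , (2, date '2012-01-02',  0, 10)
   )
SELECT event_id, event_date
     , sum(payout) - sum(buyin) AS event_result    -- aggregates only: per event
     , sum(sum(payout)) OVER w
     - sum(sum(buyin))  OVER w  AS "Profit/Loss"   -- window over the aggregated rows
FROM   t
GROUP  BY event_id, event_date
WINDOW w AS (ORDER BY event_date, event_id)
ORDER  BY event_date, event_id;
-- event 1: event_result  40, "Profit/Loss" 40
-- event 2: event_result -10, "Profit/Loss" 30
```

The inner sum() collapses the two games of event 1 into one row. The outer sum() then runs over those already aggregated rows.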

Assuming p.player_id and e.event_id are PRIMARY KEY in their respective tables.

I added e.event_id to the ORDER BY of the WINDOW clause to arrive at a deterministic sort order. (There could be multiple events on the same date.) Also included event_id in the result to distinguish multiple events per day.
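
A short sketch of why the tiebreaker matters, again with made-up rows. With the default window frame, rows that tie in the ORDER BY are peers and are summed together.

```sql
-- Hypothetical sample: events 1 and 2 share the same date.
WITH t(event_id, event_date, profit) AS (
   VALUES (1, date '2012-01-01', 10)
        , (2, date '2012-01-01', 20)
        , (3, date '2012-01-02',  5)
   )
SELECT event_id, event_date
     , sum(profit) OVER (ORDER BY event_date)           AS by_date_only   -- 30, 30, 35
     , sum(profit) OVER (ORDER BY event_date, event_id) AS with_tiebreak  -- 10, 30, 35
FROM   t
ORDER  BY event_date, event_id;
```

Ordering by date alone gives both events of 2012-01-01 the same running total. Adding event_id yields one distinct step per event.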

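While the query restricts to a single player (WHERE p.player_id = 17), we don't need p.name or p.player_id in either ORDER BY. GROUP BY is different: p.name appears in the SELECT list, and Postgres does not deduce from the WHERE clause that it is constant, so the primary key p.player_id (or p.name itself) has to be listed there. If one of the joins would multiply rows unduly, the resulting sum would be incorrect (partly or completely multiplied). Grouping by p.name could not repair the query then.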

I also removed e.date from the GROUP BY clause. Since PostgreSQL 9.1, grouping by the primary key e.event_id covers all columns of the event table, so e.date can be selected without being grouped.
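
A minimal sketch of that rule, using two hypothetical temp tables (event_demo and game_demo are not the question's tables):

```sql
-- Hypothetical demo tables; names and columns are assumptions.
CREATE TEMP TABLE event_demo (
   event_id   int PRIMARY KEY
 , event_date date NOT NULL
);

CREATE TEMP TABLE game_demo (
   game_id  int PRIMARY KEY
 , event_id int NOT NULL REFERENCES event_demo
 , buyin    int NOT NULL
);

INSERT INTO event_demo VALUES (1, '2012-01-01'), (2, '2012-01-01');
INSERT INTO game_demo  VALUES (1, 1, 10), (2, 1, 10), (3, 2, 20);

-- event_date is not listed in GROUP BY, but it is functionally dependent
-- on the grouped primary key event_id, so PostgreSQL 9.1+ accepts it.
SELECT e.event_id, e.event_date, sum(g.buyin) AS total_buyin
FROM   event_demo e
JOIN   game_demo  g ON g.event_id = e.event_id
GROUP  BY e.event_id
ORDER  BY e.event_id;
-- (1, 2012-01-01, 20), (2, 2012-01-01, 20)
```

The dependency only reaches columns of the table that owns the grouped key. A bare g.buyin outside an aggregate would still be rejected here, because no key of game_demo is grouped.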

If you change the query to return multiple players at once, adapt:

...
WHERE  p.player_id < 17  -- example - multiple players
GROUP  BY p.name, p.player_id, e.date, e.event_id  -- e.date and p.name redundant
WINDOW w AS (ORDER BY p.name, p.player_id, e.date, e.event_id)
ORDER  BY p.name, p.player_id, e.date, e.event_id;

Unless p.name is defined unique (?), group and order by player_id additionally to get correct results in a deterministic sort order.

I only kept e.date and p.name in GROUP BY to have an identical sort order in all clauses, hoping for a performance benefit. Otherwise, you can remove those columns there, just as e.date is left out of GROUP BY in the first query.
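
Note that a window that only orders by player and date carries the running total on from one player to the next. If the total should restart for each player instead, which is an assumption about the intent and not something the question states, a PARTITION BY clause is one way to get there. A sketch with invented rows:

```sql
-- Hypothetical sample: player 1 has two events, player 2 has one.
WITH t(player_id, event_id, event_date, payout, buyin) AS (
   VALUES (1, 1, date '2012-01-01', 50, 10)
        , (1, 2, date '2012-01-02',  0, 10)
        , (2, 1, date '2012-01-01', 40, 10)
   )
SELECT player_id, event_id, event_date
     , sum(sum(payout)) OVER w
     - sum(sum(buyin))  OVER w AS "Profit/Loss"
FROM   t
GROUP  BY player_id, event_id, event_date
WINDOW w AS (PARTITION BY player_id ORDER BY event_date, event_id)
ORDER  BY player_id, event_date, event_id;
-- player 1: 40, then 30
-- player 2: 30 (starts from zero again)
```

Without the PARTITION BY, the row for player 2 would show 60, because the 30 accumulated by player 1 would be carried over.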
