按查询对组进行优化,以检索每个用户的最新记录

时间:2021-03-12 12:27:41

I have the following table (simplified form) in Postgres 9.2

我在Postgres 9.2中有以下表格(简化形式)

CREATE TABLE user_msg_log (
    aggr_date DATE,
    user_id INTEGER,
    running_total INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. running_total is always increasing for each user.

它每个用户每天最多包含一条记录。每天约有500K记录,持续300天。running_total对于每个用户都是递增的。

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

我希望在特定日期之前有效地检索每个用户的最新记录。我查询的方法是:

SELECT user_id, max(aggr_date), max(running_total) 
FROM user_msg_log 
WHERE aggr_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

这是极其缓慢的。我也试过:

SELECT DISTINCT ON(user_id), aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC;

which has the same plan and is equally slow.

它有相同的计划,而且同样缓慢。

So far I have a single index on user_msg_log(aggr_date), but doesn't help much. Is there any other index I should use to speed this up, or any other way to achieve what I want?

到目前为止,我在user_msg_log(aggr_date)上只有一个索引,但是没有多大帮助。我是否应该用其他的指数来加速这一进程,或者用其他的方法来实现我想要的?

3 个解决方案

#1


73  

For best read performance you need a multicolumn index:

为了获得最佳的阅读性能,您需要一个多色索引:

CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST)

To make index only scans possible, add the otherwise not needed column running_total:

要使索引仅扫描成为可能,添加不需要的列running_total:

CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total)

Why DESC NULLS LAST?

为什么DESC null去年?

For few rows per user_id or small tables a simple DISTINCT ON is among the fastest and simplest solutions:

对于每个user_id或小表的几行,一个简单的区别是最快和最简单的解决方案:

For many rows per user_id a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:

对于每个user_id的许多行,松散的索引扫描将(非常)高效。这在Postgres中没有实现(至少在Postgres 10中没有实现),但是有一些方法可以模仿它:

1. No separate table with unique users

The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users table, solutions in 2. below are typically simpler and faster.

下面的解决方案超出了Postgres Wiki的范围。有一个单独的用户表,解决方案在2中。下面是典型的简单和快速。

1a. Recursive CTE with LATERAL join

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.

公共表表达式需要Postgres 8.4+。侧需要Postgres 9.3 +。

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, aggr_date, running_total
   FROM   user_msg_log
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id, u.aggr_date, u.running_total
   FROM   cte c
   ,      LATERAL (
      SELECT user_id, aggr_date, running_total
      FROM   user_msg_log
      WHERE  user_id > c.user_id   -- lateral reference
      AND    aggr_date <= :mydate  -- repeat condition
      ORDER  BY user_id, aggr_date DESC NULLS LAST
      LIMIT  1
      ) u
   )
SELECT user_id, aggr_date, running_total
FROM   cte
ORDER  BY user_id;

This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.

在当前版本的Postgres中,这是更可取的,并且很容易检索任意列。更多的解释在第2a章。在下面。

1b. Recursive CTE with correlated subqueries

Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

方便检索单个列或整个行。该示例使用了整个表的行类型。其他变体是可能的。

WITH RECURSIVE cte AS (
   (
   SELECT u  -- whole row
   FROM   user_msg_log u
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT u1  -- again, whole row
           FROM   user_msg_log u1
           WHERE  user_id > (c.u).user_id  -- parentheses to access row type
           AND    aggr_date <= :mydate     -- repeat predicate
           ORDER  BY user_id, aggr_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.u).user_id IS NOT NULL        -- any NOT NULL column of the row
   )
SELECT (u).*                               -- finally decompose row
FROM   cte
WHERE  (u).user_id IS NOT NULL             -- any column defined NOT NULL
ORDER  BY (u).user_id;

It could be misleading to test the row value with c.u IS NOT NULL. This only returns true if every single column of the tested row is NOT NULL and would fail if a single NULL value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL (like the primary key). More:

用c测试行值可能具有误导性。你不是零。只有当被测试行的每一列都不是NULL时才返回true,并且如果包含一个空值就会失败。(一段时间以来,我一直在回答这个问题。)相反,要断言在前一次迭代中找到了一行,请测试定义为NOT NULL的行的单个列(如主键)。更多:

More explanation for this query in chapter 2b. below.
Related answers:

关于这个查询的更多解释见第2b章。在下面。相关的答案:

2. With separate users table

Table layout hardly matters as long as we have exactly one row per relevant user_id. Example:

只要每个相关的user_id只有一行,表布局就不重要了。例子:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted. See:

理想情况下,表是物理排序的。看到的:

Or it's small enough (low cardinality) that it hardly matters.
Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.

或者它足够小(低基数),这几乎不重要。另外,在查询中对行进行排序可以帮助进一步优化性能。看到梁帮派的加法。

2a. LATERAL join

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    aggr_date <= :mydate
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL allows to reference preceding FROM items on the same query level. You get one index (-only) look-up per user.

JOIN LATERAL允许从相同查询级别的项引用前面的内容。每个用户只有一个索引查询。

Consider the possible improvement by sorting the users table suggested by Gang Liang in another answer. If the physical sort order of the users table happens to match the index on user_msg_log, you don't need this.

通过对梁刚在另一个答案中建议的用户表进行排序来考虑可能的改进。如果users表的物理排序顺序恰好与user_msg_log上的索引匹配,则不需要这个。

You don't get results for users missing in the users table, even if you have entries in user_msg_log. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.

对于用户表中丢失的用户,即使在user_msg_log中有条目,也不会得到结果。通常,您会有一个外键约束强制引用完整性以排除这种情况。

You also don't get a row for any user that has no matching entry in user_msg_log. That conforms to your original question. If you need to include those rows in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

对于user_msg_log中没有匹配条目的任何用户,也不会获得行。那符合你原来的问题。如果需要在结果中包含这些行,请使用左侧连接……正确而非横向交叉连接:

This form is also best to retrieve more than one rows (but not all) per user. Just use LIMIT n instead of LIMIT 1.

这个表单也最好为每个用户检索多个行(但不是全部)。用极限n代替极限1。

Effectively, all of these would do the same:

实际上,所有这些都将起到同样的作用:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The latter has a lower priority, though. Explicit JOIN binds before comma.

不过,后者的优先级较低。显式连接在逗号之前绑定。

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:

从单个行检索单个列的好选择。代码示例:

The same is possible for multiple columns, but you need more smarts:

对于多个列也是如此,但是您需要更多的智慧:

CREATE TEMP TABLE combo (aggr_date date, running_total int);

SELECT user_id, (my_combo).*  -- note the parentheses
FROM (
   SELECT u.user_id
        , (SELECT (aggr_date, running_total)::combo
           FROM   user_msg_log
           WHERE  user_id = u.user_id
           AND    aggr_date <= :mydate
           ORDER  BY aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub;
  • Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in user_msg_log. You get NULL for my_combo, which you can easily filter with a WHERE clause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL - same result. You would have to include a NOT NULL column in the subquery to be sure.

    与上面的左连接侧边一样,此变体包含所有用户,即使user_msg_log中没有条目。my_combo可以得到空值,如果需要的话,可以使用外部查询中的WHERE子句轻松地进行筛选。Nitpick:在外部查询中,您无法区分子查询是否没有找到一行,或者返回的所有值恰好为空——结果相同。您必须在子查询中包含一个非空列才能确定。

  • A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.

    关联子查询只能返回一个值。可以将多个列封装到复合类型中。但是要以后分解它,Postgres需要一个众所周知的复合类型。只有提供列定义列表才能分解匿名记录。

  • Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with CREATE TYPE, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type: (aggr_date, running_total)::combo

    使用已注册的类型,如现有表的行类型,或创建类型。使用CREATE类型显式地(和永久地)注册一个复合类型,或者创建一个临时表(在会话结束时自动删除),以临时提供一个行类型。转换为该类型:(aggr_date, running_total)::组合。

  • Finally, we do not want to decompose combo on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6 - improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.

    最后,我们不希望在相同的查询级别上分解组合。由于查询计划器的弱点,这将对每个列计算一次子查询(直到Postgres 9.6 -计划对Postgres 10进行改进)。相反,将它设置为子查询并在外部查询中分解。

Related:

相关:

SQL Fiddle demonstrating all four queries.
With a big test for 1k users and 100k log entries.

SQL小提琴演示了所有四个查询。对1k用户和100k日志条目进行了大型测试。

#2


4  

Perhaps a different index on the table would help. Try this one: user_msg_log(user_id, aggr_date). I am not positive that Postgres will make optimal use with distinct on.

也许桌上的一个不同的指数会有所帮助。试试这个:user_msg_log(user_id, aggr_date)。我不认为Postgres会以独特的方式发挥最佳作用。

So, I would stick with that index and try this version:

所以,我还是用那个索引试试这个版本:

select *
from user_msg_log uml
where not exists (select 1
                  from user_msg_log uml2
                  where uml2.user_id = u.user_id and
                        uml2.aggr_date <= :mydate and
                        uml2.aggr_date > uml.aggr_date
                 );

This should replace the sorting/grouping with index look ups. It might be faster.

这将用索引查找取代排序/分组。可能会更快。

#3


2  

This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users table to exploit the locality of the index on user_msg_log.

这不是一个独立的答案,而是对@Erwin的答案的评论。对于2a(横向连接示例),可以通过对users表进行排序以利用user_msg_log上索引的位置来改进查询。

SELECT u.user_id, l.aggr_date, l.running_total
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT aggr_date, running_total
                  FROM user_msg_log
                 WHERE user_id = u.user_id -- lateral reference
                   AND aggr_date <= :mydate
              ORDER BY aggr_date DESC NULLS LAST
                 LIMIT 1) l;

The rationale is that index lookup is expensive if user_id values are random. By sorting out user_id first, the subsequent lateral join would be like a simple scan on the index of user_msg_log. Even though both query plans look alike, the running time would differ much especially for large tables.

其基本原理是,如果user_id值是随机的,那么索引查找的开销就很大。通过首先对user_id进行排序,后续的横向连接将类似于对user_msg_log索引的简单扫描。尽管这两个查询计划看起来很相似,但是运行时间会有很大的不同,特别是对于大型表。

The cost of the sorting is minimal especially if there is an index on the user_id field.

排序的代价是最小的,特别是如果user_id字段上有一个索引。

#1


73  

For best read performance you need a multicolumn index:

为了获得最佳的阅读性能,您需要一个多色索引:

CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST)

To make index only scans possible, add the otherwise not needed column running_total:

要使索引仅扫描成为可能,添加不需要的列running_total:

CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total)

Why DESC NULLS LAST?

为什么DESC null去年?

For few rows per user_id or small tables a simple DISTINCT ON is among the fastest and simplest solutions:

对于每个user_id或小表的几行,一个简单的区别是最快和最简单的解决方案:

For many rows per user_id a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:

对于每个user_id的许多行,松散的索引扫描将(非常)高效。这在Postgres中没有实现(至少在Postgres 10中没有实现),但是有一些方法可以模仿它:

1. No separate table with unique users

The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users table, solutions in 2. below are typically simpler and faster.

下面的解决方案超出了Postgres Wiki的范围。有一个单独的用户表,解决方案在2中。下面是典型的简单和快速。

1a. Recursive CTE with LATERAL join

Common Table Expressions require Postgres 8.4+.
LATERAL requires Postgres 9.3+.

公共表表达式需要Postgres 8.4+。侧需要Postgres 9.3 +。

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT user_id, aggr_date, running_total
   FROM   user_msg_log
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT u.user_id, u.aggr_date, u.running_total
   FROM   cte c
   ,      LATERAL (
      SELECT user_id, aggr_date, running_total
      FROM   user_msg_log
      WHERE  user_id > c.user_id   -- lateral reference
      AND    aggr_date <= :mydate  -- repeat condition
      ORDER  BY user_id, aggr_date DESC NULLS LAST
      LIMIT  1
      ) u
   )
SELECT user_id, aggr_date, running_total
FROM   cte
ORDER  BY user_id;

This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.

在当前版本的Postgres中,这是更可取的,并且很容易检索任意列。更多的解释在第2a章。在下面。

1b. Recursive CTE with correlated subqueries

Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

方便检索单个列或整个行。该示例使用了整个表的行类型。其他变体是可能的。

WITH RECURSIVE cte AS (
   (
   SELECT u  -- whole row
   FROM   user_msg_log u
   WHERE  aggr_date <= :mydate
   ORDER  BY user_id, aggr_date DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT u1  -- again, whole row
           FROM   user_msg_log u1
           WHERE  user_id > (c.u).user_id  -- parentheses to access row type
           AND    aggr_date <= :mydate     -- repeat predicate
           ORDER  BY user_id, aggr_date DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (c.u).user_id IS NOT NULL        -- any NOT NULL column of the row
   )
SELECT (u).*                               -- finally decompose row
FROM   cte
WHERE  (u).user_id IS NOT NULL             -- any column defined NOT NULL
ORDER  BY (u).user_id;

It could be misleading to test the row value with c.u IS NOT NULL. This only returns true if every single column of the tested row is NOT NULL and would fail if a single NULL value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL (like the primary key). More:

用c测试行值可能具有误导性。你不是零。只有当被测试行的每一列都不是NULL时才返回true,并且如果包含一个空值就会失败。(一段时间以来,我一直在回答这个问题。)相反,要断言在前一次迭代中找到了一行,请测试定义为NOT NULL的行的单个列(如主键)。更多:

More explanation for this query in chapter 2b. below.
Related answers:

关于这个查询的更多解释见第2b章。在下面。相关的答案:

2. With separate users table

Table layout hardly matters as long as we have exactly one row per relevant user_id. Example:

只要每个相关的user_id只有一行,表布局就不重要了。例子:

CREATE TABLE users (
   user_id  serial PRIMARY KEY
 , username text NOT NULL
);

Ideally, the table is physically sorted. See:

理想情况下,表是物理排序的。看到的:

Or it's small enough (low cardinality) that it hardly matters.
Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.

或者它足够小(低基数),这几乎不重要。另外,在查询中对行进行排序可以帮助进一步优化性能。看到梁帮派的加法。

2a. LATERAL join

SELECT u.user_id, l.aggr_date, l.running_total
FROM   users u
CROSS  JOIN LATERAL (
   SELECT aggr_date, running_total
   FROM   user_msg_log
   WHERE  user_id = u.user_id  -- lateral reference
   AND    aggr_date <= :mydate
   ORDER  BY aggr_date DESC NULLS LAST
   LIMIT  1
   ) l;

JOIN LATERAL allows to reference preceding FROM items on the same query level. You get one index (-only) look-up per user.

JOIN LATERAL允许从相同查询级别的项引用前面的内容。每个用户只有一个索引查询。

Consider the possible improvement by sorting the users table suggested by Gang Liang in another answer. If the physical sort order of the users table happens to match the index on user_msg_log, you don't need this.

通过对梁刚在另一个答案中建议的用户表进行排序来考虑可能的改进。如果users表的物理排序顺序恰好与user_msg_log上的索引匹配,则不需要这个。

You don't get results for users missing in the users table, even if you have entries in user_msg_log. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.

对于用户表中丢失的用户,即使在user_msg_log中有条目,也不会得到结果。通常,您会有一个外键约束强制引用完整性以排除这种情况。

You also don't get a row for any user that has no matching entry in user_msg_log. That conforms to your original question. If you need to include those rows in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

对于user_msg_log中没有匹配条目的任何用户,也不会获得行。那符合你原来的问题。如果需要在结果中包含这些行,请使用左侧连接……正确而非横向交叉连接:

This form is also best to retrieve more than one rows (but not all) per user. Just use LIMIT n instead of LIMIT 1.

这个表单也最好为每个用户检索多个行(但不是全部)。用极限n代替极限1。

Effectively, all of these would do the same:

实际上,所有这些都将起到同样的作用:

JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...

The latter has a lower priority, though. Explicit JOIN binds before comma.

不过,后者的优先级较低。显式连接在逗号之前绑定。

2b. Correlated subquery

Good choice to retrieve a single column from a single row. Code example:

从单个行检索单个列的好选择。代码示例:

The same is possible for multiple columns, but you need more smarts:

对于多个列也是如此,但是您需要更多的智慧:

CREATE TEMP TABLE combo (aggr_date date, running_total int);

SELECT user_id, (my_combo).*  -- note the parentheses
FROM (
   SELECT u.user_id
        , (SELECT (aggr_date, running_total)::combo
           FROM   user_msg_log
           WHERE  user_id = u.user_id
           AND    aggr_date <= :mydate
           ORDER  BY aggr_date DESC NULLS LAST
           LIMIT  1) AS my_combo
   FROM   users u
   ) sub;
  • Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in user_msg_log. You get NULL for my_combo, which you can easily filter with a WHERE clause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL - same result. You would have to include a NOT NULL column in the subquery to be sure.

    与上面的左连接侧边一样,此变体包含所有用户,即使user_msg_log中没有条目。my_combo可以得到空值,如果需要的话,可以使用外部查询中的WHERE子句轻松地进行筛选。Nitpick:在外部查询中,您无法区分子查询是否没有找到一行,或者返回的所有值恰好为空——结果相同。您必须在子查询中包含一个非空列才能确定。

  • A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.

    关联子查询只能返回一个值。可以将多个列封装到复合类型中。但是要以后分解它,Postgres需要一个众所周知的复合类型。只有提供列定义列表才能分解匿名记录。

  • Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with CREATE TYPE, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type: (aggr_date, running_total)::combo

    使用已注册的类型,如现有表的行类型,或创建类型。使用CREATE类型显式地(和永久地)注册一个复合类型,或者创建一个临时表(在会话结束时自动删除),以临时提供一个行类型。转换为该类型:(aggr_date, running_total)::组合。

  • Finally, we do not want to decompose combo on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6 - improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.

    最后,我们不希望在相同的查询级别上分解组合。由于查询计划器的弱点,这将对每个列计算一次子查询(直到Postgres 9.6 -计划对Postgres 10进行改进)。相反,将它设置为子查询并在外部查询中分解。

Related:

相关:

SQL Fiddle demonstrating all four queries.
With a big test for 1k users and 100k log entries.

SQL小提琴演示了所有四个查询。对1k用户和100k日志条目进行了大型测试。

#2


4  

Perhaps a different index on the table would help. Try this one: user_msg_log(user_id, aggr_date). I am not positive that Postgres will make optimal use with distinct on.

也许桌上的一个不同的指数会有所帮助。试试这个:user_msg_log(user_id, aggr_date)。我不认为Postgres会以独特的方式发挥最佳作用。

So, I would stick with that index and try this version:

所以,我还是用那个索引试试这个版本:

select *
from user_msg_log uml
where not exists (select 1
                  from user_msg_log uml2
                  where uml2.user_id = u.user_id and
                        uml2.aggr_date <= :mydate and
                        uml2.aggr_date > uml.aggr_date
                 );

This should replace the sorting/grouping with index look ups. It might be faster.

这将用索引查找取代排序/分组。可能会更快。

#3


2  

This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users table to exploit the locality of the index on user_msg_log.

这不是一个独立的答案,而是对@Erwin的答案的评论。对于2a(横向连接示例),可以通过对users表进行排序以利用user_msg_log上索引的位置来改进查询。

SELECT u.user_id, l.aggr_date, l.running_total
  FROM (SELECT user_id FROM users ORDER BY user_id) u,
       LATERAL (SELECT aggr_date, running_total
                  FROM user_msg_log
                 WHERE user_id = u.user_id -- lateral reference
                   AND aggr_date <= :mydate
              ORDER BY aggr_date DESC NULLS LAST
                 LIMIT 1) l;

The rationale is that index lookup is expensive if user_id values are random. By sorting out user_id first, the subsequent lateral join would be like a simple scan on the index of user_msg_log. Even though both query plans look alike, the running time would differ much especially for large tables.

其基本原理是,如果user_id值是随机的,那么索引查找的开销就很大。通过首先对user_id进行排序,后续的横向连接将类似于对user_msg_log索引的简单扫描。尽管这两个查询计划看起来很相似,但是运行时间会有很大的不同,特别是对于大型表。

The cost of the sorting is minimal especially if there is an index on the user_id field.

排序的代价是最小的,特别是如果user_id字段上有一个索引。