Optimizing a large IN condition for a Redshift query

I have a ~2TB fully vacuumed Redshift table with a distkey phash (high cardinality, hundreds of millions of values) and compound sortkeys (phash, last_seen).

When I do a query like:

SELECT
    DISTINCT ret_field
FROM
    table
WHERE
    phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae'
    )
AND
    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It returns very quickly. However when I increase the number of hashes beyond 10, Redshift converts the IN condition from a bunch of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

The problem is when I have a couple dozen phash values, the "optimized" query goes from less than a second response time to over half an hour. In other words it stops using the sortkey and does a full table scan.

Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?

Here is the EXPLAIN difference between <10 hashes and >10 hashes:

Less than 10 (0.4 seconds):

XN Unique  (cost=0.00..157253450.20 rows=43 width=27)
    ->  XN Seq Scan on table  (cost=0.00..157253393.92 rows=22510 width=27)
                Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))

More than 10 (45-60 minutes):

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
    ->  XN Seq Scan on table  (cost=0.00..179718164.48 rows=906830708 width=27)
                Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))

5 answers

#1

It's worth trying sortkeys (last_seen, phash), with last_seen first.

The slowness might be because the leading column of the sort key is phash, which looks like a random string. As the AWS Redshift developer docs say, a timestamp column should be the leading column of the sort key if it is used in WHERE conditions.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift

With this sort key order, all rows will be sorted by last_seen, then by phash. (What does it mean to have multiple sortkey columns?)

One note is that you have to recreate your table to change the sort key. This will help you to do that.
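
A minimal sketch of that rebuild as a deep copy, assuming the table is called `my_table` (a stand-in for the question's `table`) and that phash should stay the distribution key. The names `my_table_new` and `my_table_old` are hypothetical, and a copy of a ~2TB table needs enough free disk space to hold both versions for a while.

```sql
-- Build a copy whose compound sort key leads with last_seen.
CREATE TABLE my_table_new
DISTKEY (phash)
COMPOUND SORTKEY (last_seen, phash)
AS
SELECT * FROM my_table;

-- Swap the tables once the copy has been checked.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;
```

A table built with CREATE TABLE AS may not carry over constraints, defaults, or column encodings from the original, so it is worth comparing the two definitions before dropping the old table.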

#2

You can try creating a temporary table or subquery:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (
   SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
   UNION ALL 
   SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
   UNION ALL
   SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
   -- UNION ALL
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

Alternatively, search in chunks (if the query optimizer merges them into one, use an auxiliary table to store intermediate results):

SELECT ret_field
FROM table
WHERE phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( /* more hashes */ )
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION 
-- ...

If the query optimizer merges them into one, you can try using a temp table for intermediate results.
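
A rough sketch of the temp-table variant, assuming the table is called `my_table` (standing in for the question's `table`) and using a hypothetical scratch table `chunk_results`. Each statement keeps its IN list short, so each one should still be planned as a handful of ORs. The hashes in the second chunk are taken from the EXPLAIN output in the question.

```sql
-- First chunk: create the scratch table from the query itself,
-- so ret_field keeps whatever type it has in the source table.
CREATE TEMP TABLE chunk_results AS
SELECT ret_field
FROM my_table
WHERE phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

-- Next chunk: append to the same scratch table.
INSERT INTO chunk_results
SELECT ret_field
FROM my_table
WHERE phash IN (
        '394e9a527f93377912cbdcf6789787f1',
        '4534f9f8f68cc937f66b50760790c795')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

-- After the last chunk, deduplicate once over the combined results.
SELECT DISTINCT ret_field
FROM chunk_results;
```

Because every chunk runs as a separate statement, the optimizer has no chance to merge them back into one large IN list.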

EDIT:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
      FROM ...
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

#3

Do you really need DISTINCT? This operator can be expensive.

I'd try to use LATERAL JOIN. In the query below the table Hashes has a column phash - this is your big batch of hashes. It could be a temp table, a (sub)query, anything.

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN LATERAL
    (
        SELECT table.ret_field
        FROM table
        WHERE
            table.phash = Hashes.phash
            AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
    ) AS T ON true

It is quite likely that the optimizer implements LATERAL JOIN as a nested loop. It would loop through all rows in Hashes and, for each row, run the SELECT FROM table. The inner SELECT should use the index that you have on (phash, last_seen). To play it safe, include ret_field in the index as well to make it a covering index: (phash, last_seen, ret_field).

There is a very valid point in the answer by @Diego: instead of putting constant phash values into the query, put them in a temporary or permanent table.

I'd like to extend the answer by @Diego and add that it is important for this table of hashes to have an index, specifically a unique index.

So, create a table Hashes with one column phash that has exactly the same type as in your main table.phash. It is important that types match. Make that column a primary key with unique clustered index. Dump your dozens of phash values into the Hashes table.
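
A hedged sketch of such a table for Redshift. The column type VARCHAR(32) is an assumption and should be replaced with the real type of phash in the main table. Redshift has no conventional indexes and treats PRIMARY KEY as informational rather than enforced, so this sketch declares the key as a hint and relies on DISTKEY and SORTKEY to co-locate and order the rows instead of a clustered index.

```sql
-- Assumption: phash is VARCHAR(32) in the main table; match the real type.
CREATE TEMP TABLE Hashes (
    phash VARCHAR(32) NOT NULL PRIMARY KEY  -- informational only, not enforced
)
DISTKEY (phash)
SORTKEY (phash);

-- Load the batch of hashes; duplicates must be avoided by the caller.
INSERT INTO Hashes (phash) VALUES
    ('5c8615fa967576019f846b55f11b6e41'),
    ('8719c8caa9740bec10f914fc2434ccfd'),
    ('9b657c9f6bf7c5bbd04b5baf94e61dae');
```

Distributing both tables on phash means matching rows should sit on the same slice, which avoids redistributing the large table for the join.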

Then the query becomes a simple INNER JOIN, not lateral:

SELECT DISTINCT table.ret_field
FROM
    Hashes
    INNER JOIN table ON table.phash = Hashes.phash
WHERE
    table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It is still important that table has an index on (phash, last_seen, ret_field).

The optimizer should be able to take advantage of the fact that both joined tables are sorted by the phash column and that phash is unique in the Hashes table.

#4

You can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.

Here's an example (I'm using a CTE because the tool I'm using makes it hard to capture the plan when there is more than one SQL statement, but go with a temp table if you can):

select * 
from <my_table>
where checksum in 
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)

VERSUS

with foo as
(
    select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
    select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
    select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
    select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select  * 
from <my_table> r 
     join foo f on r.checksum = f.my_key

And here's the plan. As you can see, it looks more complex, but that's because of the CTE; it wouldn't look that way with a temp table:

#5

Did you try using union for all phash values?

Like this:

SELECT ret_field 
FROM   table 
WHERE  phash = '5c8615fa967576019f846b55f11b6e41' -- 1st phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '8719c8caa9740bec10f914fc2434ccfd' -- 2nd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = '9b657c9f6bf7c5bbd04b5baf94e61dae' -- 3rd phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

-- and so on...

UNION 

SELECT ret_field 
FROM   table 
WHERE  phash = 'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn' -- Nth phash value
and    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
