Poor Amazon Redshift query performance based on VARCHAR size

Time: 2021-09-12 01:15:18

I'm building an Amazon Redshift data warehouse, and I'm experiencing unexpected performance impacts based on the defined size of a VARCHAR column. Details are as follows. Here are three of my columns as shown by pg_table_def:

 schemaname | tablename |     column      |            type             | encoding  | distkey | sortkey | notnull 
------------+-----------+-----------------+-----------------------------+-----------+---------+---------+---------
 public     | logs      | log_timestamp   | timestamp without time zone | delta32k  | f       |       1 | t
 public     | logs      | event           | character varying(256)      | lzo       | f       |       0 | f
 public     | logs      | message         | character varying(65535)    | lzo       | f       |       0 | f
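
A listing like this can be produced with a catalog query along the following lines (a sketch; note that pg_table_def only shows tables in schemas on the current search_path):

set search_path to public;
select schemaname, tablename, "column", type, encoding, distkey, sortkey, "notnull"
from pg_table_def
where tablename = 'logs';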

I've recently run VACUUM and ANALYZE; I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.
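
For reference, a maintenance pass like that might look as follows (VACUUM re-sorts rows and reclaims space; ANALYZE refreshes the planner's statistics):

vacuum logs;
analyze logs;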

Query 1: For instance, the following query takes about 3 seconds:

select log_timestamp from logs order by log_timestamp desc limit 5;

Query 2: A similar query that asks for more data runs in 8 seconds:

select log_timestamp, event from logs order by log_timestamp desc limit 5;

Query 3: However, this query, very similar to the previous one, takes 8 minutes to run!

select log_timestamp, message from logs order by log_timestamp desc limit 5;

Query 4: Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):

select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;

The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
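
Figures like these can be computed with a query along these lines (a sketch; LEN and STDDEV are built-in Redshift functions, and the cast avoids integer truncation in AVG):

select avg(len(message)::float) as avg_message_len,
       stddev(len(message)) as stddev_message_len,
       avg(len(event)::float) as avg_event_len,
       stddev(len(event)) as stddev_event_len
from logs;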

Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.

Thanks!

Dave

2 Solutions

#1 (10 votes)

Redshift is a "true columnar" database and only reads the columns that are specified in your query. So, when you specify the two small columns, only those two columns have to be read at all. However, when you add in the third, large column, the work that Redshift has to do increases dramatically.
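
You can see this effect directly in the scan statistics. A sketch, assuming it is run in the same session immediately after one of the queries above (SVL_QUERY_SUMMARY and pg_last_query_id() are standard Redshift system objects):

-- bytes read per step for the query that just ran; compare the scan
-- steps' bytes across the three queries
select seg, step, label, rows, bytes
from svl_query_summary
where query = pg_last_query_id()
order by seg, step;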

This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.), where the entire row is stored together. In a row store, adding or removing query columns does not make much difference in response time, because the database has to read the whole row anyway.

Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each of your columns in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
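
That per-block min/max metadata (the "zone maps") can be inspected in the STV_BLOCKLIST system table. A sketch; note that for non-numeric columns minvalue/maxvalue appear in an internal encoded form:

-- zone-map bounds Redshift uses to decide which blocks to skip
select b.col, b.blocknum, b.num_values, b.minvalue, b.maxvalue
from stv_blocklist b
join svv_table_info t on b.tbl = t.table_id
where t."table" = 'logs'
order by b.col, b.blocknum;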

The LIMIT clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order everything by log_timestamp descending. The problem is that the ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big it's slow.
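
You can confirm this by comparing the plans: the unrestricted query has to sort every row before the limit is applied, while the range predicate lets the scan be restricted on the sort key first.

explain
select log_timestamp, message from logs
order by log_timestamp desc limit 5;

explain
select log_timestamp, message from logs
where log_timestamp > '2014-06-18'
order by log_timestamp desc limit 5;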

#2 (1 vote)

Out of curiosity, how long does this take?

select log_timestamp, message
from logs l join
     (-- find the 5th-highest timestamp by scanning only the narrow sort-key column
      select min(log_timestamp) as log_timestamp
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) top5
     ) cutoff
     on l.log_timestamp >= cutoff.log_timestamp;
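
If this version is fast, it supports the explanation in #1: the inner subquery finds the cutoff timestamp by scanning only the narrow sort-key column, and the scan of the wide message column is then restricted by the sort-key range, much like the explicit-range query in the question.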
