How can I improve the performance of regular expression queries in PostgreSQL 8?

Time: 2020-12-09 03:58:21

I'm performing a regular expression match on a column of type character varying(256) in PostgreSQL 8.3.3. The column currently has no indices. I'd like to improve the performance of this query if I can.

Will adding an index help? Are there other things I can try to help improve performance?

6 solutions

#1


5  

You cannot create an index that will speed up any generic regular expression; however, if you have one or a limited number of regular expressions that you are matching against, you have a few options.

As Paul Tomblin mentions, you can use an extra column or columns to indicate whether a given row matches that regex or those regexes. Such a column can be indexed and queried efficiently.

If you want to go further than that, this paper discusses an interesting-sounding technique for indexing against regular expressions: it looks for long substrings in the regex and indexes the text by whether those substrings are present, in order to generate candidate matches. That cuts down the number of rows that you actually need to check the regex against. You could probably implement this using GiST indexes, though that would be a non-trivial amount of work.
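
To make the candidate-filtering idea concrete, here is a minimal in-memory sketch in Python. It is not the paper's algorithm and not a GiST implementation: it assumes fixed 3-character substrings (trigrams), and the literal that every match must contain is passed in by hand as `required_literal` instead of being extracted from the regex. The sample rows and all function names are made up for illustration.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(rows):
    """Map each trigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for gram in trigrams(text):
            index[gram].add(row_id)
    return index


def search(rows, index, pattern, required_literal):
    """Run the regex only on rows containing every trigram of required_literal.

    required_literal is a substring that any match must contain. It is
    supplied by the caller here (an assumption of this sketch).
    """
    candidates = set(rows)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    matches = sorted(rid for rid in candidates if regex.search(rows[rid]))
    return matches, len(candidates)


# Hypothetical sample data standing in for the table column.
rows = {
    1: "error: disk full on /dev/sda1",
    2: "warning: disk usage at 91%",
    3: "error: network unreachable",
    4: "info: backup finished",
}
index = build_index(rows)

# Only the rows containing "disk" are candidates, so the regex runs on
# those rows rather than on the whole table.
matches, checked = search(rows, index, r"disk full on /dev/sd[a-z]\d", "disk")
```

The index lookup can only narrow the candidate set; the regex still has the final say on each candidate, so a trigram that happens to appear in a non-matching row costs one extra check rather than a wrong result.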

#2


4  

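An index can't do anything with an arbitrary regular expression; at best, a B-tree index can be used when the pattern is a constant anchored to the start of the string (for example '^foo'). Otherwise you're going to have to do a full table scan.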

If at all possible, for example if you're querying for the same regex all the time, you could add a column that records whether each row matches that regex and maintain it on inserts and updates.
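
A rough, runnable sketch of that idea follows. It uses Python's built-in `sqlite3` module purely as a stand-in so the example can run anywhere; in PostgreSQL the same shape would be a boolean column, a trigger function, and the `~` operator. The table `items`, the column `code`, the flag column `code_matches`, and the pattern are all hypothetical.

```python
import re
import sqlite3

PATTERN = r"^[A-Z]{2}-\d{4}$"  # hypothetical fixed regex


def matches_pattern(value):
    """Return 1 if value matches PATTERN, else 0 (NULL counts as no match)."""
    return int(value is not None and re.search(PATTERN, value) is not None)


conn = sqlite3.connect(":memory:")
# Allow the Python function to be called from triggers even on SQLite
# builds that distrust the schema by default (ignored by older versions).
conn.execute("PRAGMA trusted_schema = ON")
conn.create_function("matches_pattern", 1, matches_pattern)

conn.executescript("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        code VARCHAR(256),
        code_matches INTEGER NOT NULL DEFAULT 0
    );
    -- The flag column is cheap to index, unlike the regex itself.
    CREATE INDEX items_code_matches_idx ON items (code_matches);

    -- Recompute the flag whenever a row is inserted...
    CREATE TRIGGER items_flag_insert AFTER INSERT ON items
    BEGIN
        UPDATE items SET code_matches = matches_pattern(NEW.code)
        WHERE id = NEW.id;
    END;

    -- ...or whenever its code column is updated.
    CREATE TRIGGER items_flag_update AFTER UPDATE OF code ON items
    BEGIN
        UPDATE items SET code_matches = matches_pattern(NEW.code)
        WHERE id = NEW.id;
    END;
""")

conn.executemany("INSERT INTO items (code) VALUES (?)",
                 [("AB-1234",), ("nope",), ("ZZ-0001",)])
conn.execute("UPDATE items SET code = 'CD-5678' WHERE code = 'nope'")
conn.execute("UPDATE items SET code = 'changed' WHERE code = 'ZZ-0001'")

# The query filters on the precomputed flag and never evaluates the regex.
matching = [row[0] for row in conn.execute(
    "SELECT code FROM items WHERE code_matches = 1 ORDER BY id")]
```

The regex cost moves from every select to each insert and update, which is the trade-off to weigh for a write-heavy table.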

#3


0  

Regex matches do not perform well on fairly big text columns. Try to accomplish this without the regex, or do the match in code if the dataset is not large.

#4


0  

This may be one of those times when you don't want to use a regex. What does your regex look like? Maybe there's a way to speed it up.

#5


0  

If you have a limited set of regexes to match against, you could create a second table that holds the primary key of your table and a field indicating whether the row matches that regex. You would update it from a trigger, and index your table's key in that second table. This trades a small decrease in update and insert speed for a probably large speedup on selects.

Alternatively, you could write a function which compares your field to that regex (or even pass the regex along with the field you are matching to the function), then create a functional index on your table against that function. This also assumes a fixed set of regexes (but you can add new regex matches more easily this way).

If the regex is dynamically created from user input, you might have to live with the table scan, or change the user app to produce a simpler search such as field LIKE 'value%', which can use an index on field ('%value%' can't).
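
The difference between 'value%' and '%value%' is easier to see with a toy model. In the sketch below a sorted Python list plays the role of a B-tree index (an analogy, not how PostgreSQL stores anything): a prefix search becomes a contiguous range found with two binary searches, while a substring search has no such range and must look at every entry. The sample values and function names are invented for illustration, and strings are compared character by character.

```python
from bisect import bisect_left

# A sorted list of column values, standing in for a B-tree index.
sorted_column = sorted(
    ["apple", "value", "value_added", "values", "valve", "zvalue"])


def prefix_search(sorted_values, prefix):
    """Return entries starting with a non-empty prefix via two binary searches."""
    lo = bisect_left(sorted_values, prefix)
    # Bump the last character to get the smallest string that sorts after
    # every string with this prefix, e.g. "value" -> "valuf".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]


def substring_search(sorted_values, needle):
    """Return entries containing needle; the sort order gives no shortcut."""
    return [v for v in sorted_values if needle in v]


prefix_hits = prefix_search(sorted_column, "value")        # like 'value%'
substring_hits = substring_search(sorted_column, "value")  # like '%value%'
```

Matches for the prefix sit next to each other in sorted order, whereas "zvalue" shows that substring matches can be anywhere, which is why only the prefix form benefits from an ordered index.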

#6


0  

If you do manage to reduce your needs to a simple LIKE query, look up indexes with text_pattern_ops to speed those up.
