Concatenate two PostgreSQL tsvector fields from separate tables into a single PostgreSQL view to enable joined full-text search

Date: 2020-11-25 22:48:21

I have a PostgreSQL view that combines 3 tables:

-- "user" is a reserved word in PostgreSQL, so the table name must be double-quoted
create view search_view as
select u.first_name, u.last_name, a.notes, a.summary, a.search_index
from "user" as u
join connector as c on c.user_id = u.id
join assessor as a on a.connector_id = c.id;

However, I need to concatenate the tsvector fields from 2 of the 3 tables into a single tsvector field in the view, providing full-text search across 4 fields: 2 from one table and 2 from another.

I've read in the documentation that I can use the concatenation operator (||) to combine two tsvector fields, but I'm not certain what this looks like syntactically, or whether there are potential gotchas with this implementation.

I'm looking for example code that concatenates two tsvector fields from separate tables into a view, along with commentary on whether this is good or bad practice in PostgreSQL land.

2 solutions

#1


1  

I was wondering the same thing. I don't think we are supposed to be combining tsvectors from multiple tables like this. The best solution is to:

  1. Create a new tsv column in each of your tables (user, assessor, connector).

  2. Update the new tsv column in each table with all of the text you want to search. For example, in the user table you would update the tsv column of every record with the concatenation of the first_name and last_name columns.

  3. Create an index on the new tsv column; this will be faster than indexing the individual columns.

  4. Run your queries as usual and let Postgres do the "thinking" about which indexes to use. It may or may not use all of the indexes in queries involving more than one table.

  5. Use EXPLAIN and EXPLAIN ANALYZE to look at how Postgres is utilizing your new indexes for particular queries; this will give you insight into speeding things up further.
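
The steps above could be sketched roughly as follows for a single table. This is only a minimal sketch under assumptions: the table name app_user is a hypothetical stand-in for the question's user table, the index name is made up, and the 'simple' text search configuration is my own choice for names (it avoids stemming), not something the question specifies.

```sql
-- Hypothetical stand-in for the question's "user" table.
CREATE TABLE app_user (
    id         serial PRIMARY KEY,
    first_name text,
    last_name  text
);

-- Step 1: add a tsvector column.
ALTER TABLE app_user ADD COLUMN tsv tsvector;

-- Step 2: fill it from the text columns; coalesce guards against NULLs,
-- because concatenating text with NULL yields NULL.
UPDATE app_user
SET tsv = to_tsvector('simple',
                      coalesce(first_name, '') || ' ' || coalesce(last_name, ''));

-- Step 3: index the new column (GIN is the usual choice for tsvector).
CREATE INDEX app_user_tsv_idx ON app_user USING GIN (tsv);

-- Steps 4 and 5: run a query and inspect the plan Postgres picked.
EXPLAIN ANALYZE
SELECT id, first_name, last_name
FROM app_user
WHERE tsv @@ to_tsquery('simple', 'smith');
```

Note that the UPDATE in step 2 is a one-off: rows inserted or changed later would need a trigger or a generated column to keep tsv current.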

This will be my approach, at least. I too have been doing lots of reading and have found that people aren't combining data from multiple tables into tsvectors. In fact, I don't think this is possible: it may only be possible to use the columns of the current table when creating a tsvector.

#2


1  

Concatenating tsvector columns works, but as per the comments, an index is probably not used this way (I'm not an expert, so I can't say whether it is or not).

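-- coalesce: for newsletters without a campaign, tsv || NULL would be NULL and never match
SELECT * FROM newsletters LEFT JOIN campaigns ON newsletters.campaign_id = campaigns.id
WHERE (newsletters.tsv || coalesce(campaigns.tsv, ''::tsvector)) @@ to_tsquery(unaccent(?))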

The reason you'd want this is to search for an AND string like txt1 & txt2 & txt3, which is a very common usage scenario. If you simply split the search with an OR (WHERE newsletters.tsv @@ to_tsquery(unaccent(?)) OR campaigns.tsv @@ to_tsquery(unaccent(?))), this won't work: each condition tries to match all 3 tokens within a single tsv column, but the tokens could be spread across both columns.
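
To make the difference concrete, here is a tiny check that needs no tables at all. The literal tsvector values 'foo' and 'bar' are made-up stand-ins for the two tsv columns, each holding only one of the two searched tokens.

```sql
SELECT
    -- Split with OR: each side must contain BOTH tokens on its own.
    ('foo'::tsvector @@ 'foo & bar'::tsquery)
        OR ('bar'::tsvector @@ 'foo & bar'::tsquery)   AS split_match,
    -- Concatenated: the tokens may come from either side.
    ('foo'::tsvector || 'bar'::tsvector)
        @@ 'foo & bar'::tsquery                        AS concat_match;
-- split_match is false, concat_match is true
```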

One solution I found is to use triggers to insert and update the tsv column in table1 whenever table2 changes; see https://dba.stackexchange.com/questions/154011/postgresql-full-text-search-tsv-column-trigger-with-many-to-many. But this is not a definitive answer, and using that many triggers is error-prone and hacky.
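
For illustration only, a trigger setup along those lines might look like the sketch below. It is not taken from the linked answer. The column names name and subject, the function names, and the trigger names are all hypothetical, and EXECUTE FUNCTION assumes PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE).

```sql
-- Hypothetical minimal schema.
CREATE TABLE campaigns (
    id   serial PRIMARY KEY,
    name text
);
CREATE TABLE newsletters (
    id          serial PRIMARY KEY,
    campaign_id integer REFERENCES campaigns (id),
    subject     text,
    tsv         tsvector
);

-- Rebuild newsletters.tsv from the newsletter's own text plus its campaign's name.
CREATE FUNCTION newsletters_tsv_refresh() RETURNS trigger AS $$
BEGIN
    NEW.tsv := to_tsvector('english', coalesce(NEW.subject, ''))
            || to_tsvector('english', coalesce(
                   (SELECT c.name FROM campaigns c WHERE c.id = NEW.campaign_id), ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER newsletters_tsv_trg
BEFORE INSERT OR UPDATE ON newsletters
FOR EACH ROW EXECUTE FUNCTION newsletters_tsv_refresh();

-- When a campaign is renamed, touch its newsletters so the trigger above reruns.
CREATE FUNCTION campaigns_touch_newsletters() RETURNS trigger AS $$
BEGIN
    UPDATE newsletters SET campaign_id = campaign_id WHERE campaign_id = NEW.id;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER campaigns_touch_trg
AFTER UPDATE OF name ON campaigns
FOR EACH ROW EXECUTE FUNCTION campaigns_touch_newsletters();
```

The upside is that newsletters.tsv alone can then be indexed and searched. The downside is exactly the one mentioned above: every table that feeds the vector needs its own trigger.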

The official documentation and some tutorials also show concatenating all the wanted columns into a tsvector on the fly, without using a tsv column. But it is unclear how much slower the on-the-fly approach is compared with the tsv column approach; I can't find a single benchmark or explanation about this. The documentation simply states:

Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. (This is more important when using a GiST index than a GIN index; see Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk space since the tsvector representation is not stored explicitly.
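
As a rough side-by-side of the two approaches that passage compares, here is a sketch on a hypothetical table. The table docs, its columns title and body, and the index names are assumptions for illustration, and the generated column in option B needs PostgreSQL 12 or later.

```sql
-- Hypothetical table; the names are illustrative only.
CREATE TABLE docs (
    id    serial PRIMARY KEY,
    title text,
    body  text
);

-- Option A: expression index, no stored tsvector.
CREATE INDEX docs_fts_expr_idx ON docs
USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')));

-- The query has to repeat the indexed expression for the index to be considered.
SELECT id FROM docs
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'txt1 & txt2');

-- Option B: stored tsvector column, kept current by Postgres itself.
ALTER TABLE docs ADD COLUMN tsv tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;
CREATE INDEX docs_tsv_idx ON docs USING GIN (tsv);

SELECT id FROM docs WHERE tsv @@ to_tsquery('english', 'txt1 & txt2');
```

Both options stay within one table. Neither an expression index nor a generated column can reference columns of another table, which is where the cross-table case gets awkward.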

All I can tell from this is that tsv columns are probably a waste of resources and just complicate things, but it'd be nice to see some hard numbers. And if you can concatenate tsv columns like this, then I guess it's no different from doing it in a WHERE clause.
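
Applied to the schema from the question, the concatenation could be moved into the view itself. This is a sketch under assumptions: the question only shows assessor.search_index, so a tsvector column of the same name on the user table is assumed here, and the table definitions are reduced to the columns the view needs.

```sql
-- Minimal, assumed table definitions ("user" must be quoted: it is a reserved word).
CREATE TABLE "user" (
    id           serial PRIMARY KEY,
    first_name   text,
    last_name    text,
    search_index tsvector   -- assumption: not shown in the question
);
CREATE TABLE connector (
    id      serial PRIMARY KEY,
    user_id integer REFERENCES "user" (id)
);
CREATE TABLE assessor (
    id           serial PRIMARY KEY,
    connector_id integer REFERENCES connector (id),
    notes        text,
    summary      text,
    search_index tsvector
);

CREATE VIEW search_view AS
SELECT u.first_name, u.last_name, a.notes, a.summary,
       -- coalesce keeps the result non-NULL when one side has no vector yet
       coalesce(u.search_index, ''::tsvector)
           || coalesce(a.search_index, ''::tsvector) AS search_index
FROM "user" AS u
JOIN connector AS c ON c.user_id = u.id
JOIN assessor  AS a ON a.connector_id = c.id;

-- One AND query now covers all four underlying text fields.
SELECT first_name, last_name, summary
FROM search_view
WHERE search_index @@ to_tsquery('english', 'txt1 & txt2');
```

Since an index belongs to a single table, the concatenated expression most likely cannot be served by the per-table GIN indexes, so EXPLAIN is worth checking on realistic data volumes.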
