What should my Lucene documents contain?

Date: 2023-02-06 03:04:50

I use Lucene.net to index content, documents, etc. on our CMS. This has worked well so far, but now I have to take account of the following additions to web pages:

  1. Publish date
  2. Expiry date
  3. Page 'is active'
  4. User authorisation

So the search results should only show pages that are within the Publish / Expiry window, are 'active' and that the current user is authorised to view.

Should I include the above information in the Lucene index? It will make the queries a little more complicated, but the hits collection will only return 'valid' documents which will make paging the results a lot easier.

On the other hand, I'll be repeating information that is already in the CMS database, so I'll be risking the integrity of my data, and I'll have to update the index whenever anything in the above list changes, not just when the content itself changes.

Anyone else had this problem? How did you solve it? Thanks.

Edit: I may need to use a 'FieldCache' (mentioned here) to pass the 'valid' doc ids into the lucene search?

2 solutions

#1


Query the CMS database first, and build a BitSet with all the matching documents (you'll need a FieldCache to translate between your app's doc IDs and Lucene's internal doc IDs). Then you can run your Lucene query on your index using a Filter (wrapping the BitSet).
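
To make the idea concrete, here is a minimal sketch in plain Java that uses only `java.util.BitSet` and no Lucene classes. The names `BitSetFilterSketch`, `buildAllowed` and `applyFilter` are made up for illustration, and the `idsByDoc` array is an assumption standing in for whatever a FieldCache lookup on your ID field would give you (one CMS ID per internal doc number). In a real search Lucene applies the Filter while it collects hits; the loop in `applyFilter` only imitates that effect.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

final class BitSetFilterSketch {

    // idsByDoc[i] is the CMS page ID stored for internal doc number i.
    // Assumption: in a real index this array would come from a FieldCache lookup.
    static BitSet buildAllowed(String[] idsByDoc, Set<String> validCmsIds) {
        BitSet allowed = new BitSet(idsByDoc.length);
        for (int doc = 0; doc < idsByDoc.length; doc++) {
            if (validCmsIds.contains(idsByDoc[doc])) {
                allowed.set(doc); // this doc passed the database checks
            }
        }
        return allowed;
    }

    // Keep only the hits whose bit is set, preserving the ranking order.
    static List<Integer> applyFilter(int[] rankedHits, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : rankedHits) {
            if (allowed.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] idsByDoc = {"page-10", "page-11", "page-12", "page-13"};
        // Pretend the CMS database said only these two pages are published,
        // active and visible to the current user.
        Set<String> valid = Set.of("page-11", "page-13");
        BitSet allowed = buildAllowed(idsByDoc, valid);

        int[] rankedHits = {2, 3, 0, 1}; // internal doc numbers, best match first
        System.out.println(applyFilter(rankedHits, allowed)); // prints [3, 1]
    }
}
```

Because the filtering happens before paging, page N of the results only ever contains documents the user is allowed to see.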

You keep all mutable data in your database (where it belongs), and you don't have to worry about updating or rebuilding your index. This will run very fast as well.

P.S. I've only used the Java version of Lucene, but this should work fine in Lucene.NET

#2


..so the search results should only show pages that are within the Publish / Expiry window, are 'active' and that the current user is authorised to view.

There are a few ways to handle the authorization issue. You could maintain multiple indexes (one per permission level), filter the results with the query (by storing the required permission with each document) or filter the results before you display them. If there are only a few levels, I think that I would maintain separate indexes - it seems safest.
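
The difference between the last two options shows up when you page. Here is a rough sketch in plain Java, no Lucene involved; the class and method names, and the idea of a numeric "required level" per page, are all assumptions for illustration. Filtering as part of the query gives full pages, while filtering just before display can leave a page short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PermissionFilterSketch {

    // A page is visible if the user's level meets the level stored for it.
    // Pages with no stored level are hidden (deny by default).
    static boolean visible(String id, Map<String, Integer> requiredLevel, int userLevel) {
        return requiredLevel.getOrDefault(id, Integer.MAX_VALUE) <= userLevel;
    }

    // Filter as part of the query: restrict first, then cut out the page.
    static List<String> filterThenPage(List<String> rankedIds, Map<String, Integer> requiredLevel,
                                       int userLevel, int offset, int pageSize) {
        List<String> allowed = new ArrayList<>();
        for (String id : rankedIds) {
            if (visible(id, requiredLevel, userLevel)) allowed.add(id);
        }
        int from = Math.min(offset, allowed.size());
        int to = Math.min(from + pageSize, allowed.size());
        return new ArrayList<>(allowed.subList(from, to));
    }

    // Filter before display: cut out the page first, then drop hidden hits.
    static List<String> pageThenFilter(List<String> rankedIds, Map<String, Integer> requiredLevel,
                                       int userLevel, int offset, int pageSize) {
        int from = Math.min(offset, rankedIds.size());
        int to = Math.min(from + pageSize, rankedIds.size());
        List<String> page = new ArrayList<>();
        for (String id : rankedIds.subList(from, to)) {
            if (visible(id, requiredLevel, userLevel)) page.add(id);
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("home", "admin", "news", "payroll", "about");
        Map<String, Integer> level = Map.of("home", 0, "admin", 2, "news", 0, "payroll", 2, "about", 0);

        System.out.println(filterThenPage(ranked, level, 0, 0, 2)); // prints [home, news]
        System.out.println(pageThenFilter(ranked, level, 0, 0, 2)); // prints [home]
    }
}
```

Separate indexes per permission level behave like `filterThenPage`, since the hidden pages are never in the index being searched.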

As for 'is active' - can you just rebuild your index with that in mind? Just rebuild your index in the background every so often and only add active content. You may have too much info to make that feasible - but Lucene is VERY fast.
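
A rough sketch of that rebuild-and-swap pattern in plain Java follows. The map stands in for a real Lucene index, and `RebuildSwapSketch`, `rebuild` and `contains` are hypothetical names. The point is only that searches keep using the old index until the fresh one, built from active pages only, is complete.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

final class RebuildSwapSketch {

    // Stand-in for the live index: page ID -> indexed body text.
    private final AtomicReference<Map<String, String>> live =
            new AtomicReference<>(Map.of());

    // Build a fresh index containing only active pages, then swap it in at once.
    // Searches running during the rebuild keep seeing the previous index.
    void rebuild(Map<String, String> bodies, Map<String, Boolean> active) {
        Map<String, String> fresh = new HashMap<>();
        for (Map.Entry<String, String> e : bodies.entrySet()) {
            if (Boolean.TRUE.equals(active.get(e.getKey()))) {
                fresh.put(e.getKey(), e.getValue());
            }
        }
        live.set(Map.copyOf(fresh));
    }

    // Stand-in for a search: is this page in the live index?
    boolean contains(String pageId) {
        return live.get().containsKey(pageId);
    }

    public static void main(String[] args) {
        RebuildSwapSketch index = new RebuildSwapSketch();
        index.rebuild(
                Map.of("home", "welcome", "old-promo", "sale ends soon"),
                Map.of("home", true, "old-promo", false));
        System.out.println(index.contains("home"));      // prints true
        System.out.println(index.contains("old-promo")); // prints false
    }
}
```

The trade-off is staleness: a page deactivated just after a rebuild stays searchable until the next one runs.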
