How to index and search .doc files

Time: 2023-01-15 00:21:41

I have an application that needs to accept uploaded .doc files. These documents should then be indexed, and the whole collection should be searchable. This will run on a Windows Server without Word installed, using IIS and SQL Server, but I'd rather not be tied to SQL Server's full-text indexing.

I was thinking of using Lucene.Net for the indexing part and was wondering what the best way to get the text out of the .doc files would be. I could probably extract the text by reading in the whole stream and then using a regex to pull out any regular characters, but that seems heavy-handed and prone to error.
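
For illustration, the regex idea can be sketched roughly as below. This is Python rather than .NET, purely for brevity, and `printable_runs` and its `min_len` parameter are hypothetical names. It assumes the text may sit in the stream either as 8-bit characters or as UTF-16LE, and it only looks for printable ASCII, so anything outside that range is lost.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII found in a raw byte stream.

    A naive sketch: it knows nothing about the .doc structure, so any
    string in the file that looks like text is returned, body text or not.
    """
    # Runs of printable ASCII stored one byte per character.
    narrow = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    # The same characters stored as UTF-16LE (each byte followed by a zero byte).
    wide = re.findall(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data)
    return [r.decode("ascii") for r in narrow] + [r.decode("utf-16-le") for r in wide]
```

Because nothing here distinguishes body text from other strings in the file, field codes and similar non-body strings would likely come through as well, which is one reason this route is error-prone.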

I saw an article on using iFilters that sounds promising, but I thought I'd put this out there since it's not something I'm familiar with.

P.S. If it matters, these .doc files will have mail-merge fields in them and there's no other current alternative for the .doc format.

3 answers

#1


As for a solution that doesn't require an external program, the iFilter approach looks like the way to go (though you might count an iFilter as an external program too).

Here's a simple CodeProject article, with code, on how it can be done: http://www.codeproject.com/KB/cs/IFilter.aspx

#2


In our PHP-based applications we always used external programs similar to this one: doc2txt. We then took the extracted text and saved it into the database. If you search Google for "doc2txt" you will find many different programs that do exactly the same thing. Just pick the one that suits you best.
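
As a rough sketch of that pattern (Python here for brevity; the same shape would work in PHP, or from .NET by launching a process): run a converter that prints plain text to stdout, then store the result. The `antiword` default is only an assumption standing in for whichever doc2txt-style tool you pick, SQLite stands in for the real database, and the function and table names are hypothetical.

```python
import sqlite3
import subprocess


def extract_text(doc_path, converter=("antiword",)):
    """Run an external converter that writes the document's text to stdout.

    `converter` is the command prefix; the .doc path is appended as the
    last argument. "antiword" is an assumed default, not a requirement.
    """
    result = subprocess.run([*converter, str(doc_path)], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def store_document(conn, doc_path, text):
    """Save the extracted text, keyed by file path."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents (path TEXT PRIMARY KEY, body TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO documents (path, body) VALUES (?, ?)",
        (str(doc_path), text),
    )
    conn.commit()


def search(conn, term):
    """Plain substring search; a real index would replace this scan."""
    rows = conn.execute(
        "SELECT path FROM documents WHERE body LIKE ? ORDER BY path",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

The `LIKE` query is a linear scan and is only there to show the round trip. With the text extracted this way, it could just as well be fed to a proper indexer instead of being searched in the database.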

#3


Maybe you'd like to check out Solr.
