Full-text search on PDF files in SQL Server 2005

Time: 2022-09-01 22:12:56

I've got a strange problem with indexing PDF files in SQL Server 2005, and hope someone can help. My database has a table called MediaFile with the following fields: MediaFileId int identity pk, FileContent image, and FileExtension varchar(5). My web application stores file contents in this table without problems, and full-text search works on doc, xls, etc.; the only file extension not working is PDF. When I run full-text searches on this table for words that I know exist inside PDF files saved in the table, those files are not returned in the search results.

The OS is Windows Server 2003 SP2, and I've installed Adobe iFilter 6.0. Following the instructions on this blog entry, I executed the following commands:

exec sp_fulltext_service 'load_os_resources', 1;
exec sp_fulltext_service 'verify_signature', 0;

After this, I restarted SQL Server and verified that the iFilter for the .pdf extension is installed correctly by executing the following command:

select document_type, path from sys.fulltext_document_types where document_type = '.pdf' 

This returns the following information, which looks correct:

document_type: .pdf
path: C:\Program Files\Adobe\PDF IFilter 6.0\PDFFILT.dll

Then I (re)created the index on the MediaFile table, selecting FileContent as the column to index and the FileExtension as its type. The wizard creates the index and completes successfully. To test, I'm performing a search like this:

SELECT MediaFileId, FileExtension FROM MediaFile WHERE CONTAINS(*, '"house"');

This returns DOC files which contain this term, but not any PDF files, although I know that there are definitely PDF files in the table which contain the word house.

Incidentally, I got this working once for a few minutes, where the search above returned the correct PDF files, but then it just stopped working again for no apparent reason.

Any ideas as to what could be stopping SQL Server 2005 from indexing PDFs, even though Adobe iFilter is installed and appears to be loaded?

2 Solutions

#1


7  

Thanks Ivan. I eventually managed to get this working by starting everything from scratch. It seems like the order in which things are done makes a big difference, and the advice given on the linked blog to turn off the 'load_os_resources' setting after loading the iFilter probably isn't the best option, as this will cause the iFilter not to be loaded when SQL Server is restarted.

If I recall correctly, the sequence of steps that eventually worked for me was as follows:

  1. Ensure that the table does not have an index already (and if so, delete it)
  2. Install Adobe iFilter
  3. Execute the command exec sp_fulltext_service 'load_os_resources', 1;
  4. Execute the command exec sp_fulltext_service 'verify_signature', 0;
  5. Restart SQL Server
  6. Verify PDF iFilter is installed
  7. Create full-text index on table
  8. Do full re-index
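
For anyone who prefers scripting these steps over using the wizard, here is a rough T-SQL sketch of the same sequence. It is not the exact script I ran: the dbo schema, the catalog name MediaFileCatalog, the key index name PK_MediaFile, and the language 1033 (US English) are all assumptions, so substitute your own values.

```sql
-- Sketch only. dbo, MediaFileCatalog, PK_MediaFile and LANGUAGE 1033 are
-- assumed names/values for illustration; they are not from the original setup.

-- Step 1: drop the existing full-text index on the table, if there is one.
IF EXISTS (SELECT 1 FROM sys.fulltext_indexes
           WHERE object_id = OBJECT_ID('dbo.MediaFile'))
    DROP FULLTEXT INDEX ON dbo.MediaFile;

-- Step 2 (outside SQL Server): install the Adobe PDF iFilter.

-- Steps 3 and 4: let full-text search load OS-registered filters,
-- and skip the signature check on them.
EXEC sp_fulltext_service 'load_os_resources', 1;
EXEC sp_fulltext_service 'verify_signature', 0;

-- Step 5: restart the SQL Server service here, then continue in a new session.

-- Step 6: confirm that a filter is registered for .pdf.
SELECT document_type, path
FROM sys.fulltext_document_types
WHERE document_type = '.pdf';

-- Step 7: create a catalog and the full-text index, without populating it yet.
-- (Skip CREATE FULLTEXT CATALOG if the catalog already exists.)
CREATE FULLTEXT CATALOG MediaFileCatalog;

CREATE FULLTEXT INDEX ON dbo.MediaFile
    (FileContent TYPE COLUMN FileExtension LANGUAGE 1033)
    KEY INDEX PK_MediaFile
    ON MediaFileCatalog
    WITH CHANGE_TRACKING OFF, NO POPULATION;

-- Step 8: run the full re-index.
ALTER FULLTEXT INDEX ON dbo.MediaFile START FULL POPULATION;
```

Creating the index with NO POPULATION and then starting the population explicitly keeps steps 7 and 8 separate, which makes it easier to see which of the two is failing.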

Although this did the trick, I'm quite sure I performed these steps a few times before it eventually started working properly.
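
One possible reason it looks like the steps need repeating is that the search is run before the full population has finished. A quick way to check, sketched here under the assumption that the table lives in the dbo schema, is to query the table's full-text properties:

```sql
-- Sketch only: assumes the table is dbo.MediaFile.
SELECT
    -- 0 means idle; a non-zero value means a population is still in progress
    OBJECTPROPERTYEX(OBJECT_ID('dbo.MediaFile'), 'TableFulltextPopulateStatus') AS populate_status,
    -- number of rows successfully full-text indexed
    OBJECTPROPERTYEX(OBJECT_ID('dbo.MediaFile'), 'TableFulltextItemCount')      AS items_indexed,
    -- number of rows that full-text search failed to index
    OBJECTPROPERTYEX(OBJECT_ID('dbo.MediaFile'), 'TableFulltextFailCount')      AS items_failed;
```

If items_failed is roughly the number of PDF rows in the table, that would suggest the filter is registered but is not actually processing the documents.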

#2


0  

I've just struggled with this for an hour, but finally got it working. I did everything you did, so try simplifying the query (I replaced * with the field name and removed the double quotes around the term):

SELECT MediaFileId, FileExtension FROM MediaFile WHERE CONTAINS(FileContent, 'house')

Also, when you create the full-text index, make sure you specify the language. Finally, you could try changing the field type from image to varbinary(max).
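
A rough sketch of both suggestions follows. The dbo schema, the language 1033 (US English), and the NULL on the altered column are assumptions on my part; I would also expect the column change to need the full-text index dropped first and recreated afterwards, though I have not verified that.

```sql
-- Sketch only: dbo, LANGUAGE 1033 and the column's nullability are assumptions.

-- List the languages full-text search knows about, with their LCIDs,
-- to pick the value for LANGUAGE when creating the index.
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

-- The same language can also be passed explicitly in the query.
SELECT MediaFileId, FileExtension
FROM dbo.MediaFile
WHERE CONTAINS(FileContent, 'house', LANGUAGE 1033);

-- Change the column from image to varbinary(max).
-- Use NULL or NOT NULL to match the existing column definition.
ALTER TABLE dbo.MediaFile
    ALTER COLUMN FileContent varbinary(max) NULL;
```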
