How can I do a full-text search of PDF files from Perl?

Date: 2022-09-19 17:58:17

I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string. To date I have been using this:


my @search_results = `grep -i -l \"$string\" *.pdf`;

where $string is the text to look for. However, this fails for most PDFs because the file format is obviously not ASCII.


What can I do that's easiest?


Clarification: There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely with each other, but since I don't know the names of the PDFs, I can't find the right syntax yet.


Final solution using Adam Bellaire's suggestion below:


@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
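Two notes on the loop above: `for i in $( ls )` breaks on filenames containing spaces (globbing with `*.pdf` is safer), and `--label` is what makes `grep -l` print the filename instead of `(standard input)` when grep reads from a pipe. A minimal, self-contained demo of the `--label` behavior (the filename `report.pdf` and the search text are made up):

```shell
# grep -l on a pipe normally reports "(standard input)"; GNU grep's
# --label substitutes a name of our choosing, e.g. the real filename.
printf 'Quarterly expenses\n' | grep --label="report.pdf" -il "expenses"
# prints: report.pdf
```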

6 Answers

#1


9  

The PerlMonks thread here talks about this problem.


It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:


my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;

#2


2  

I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow, but it does its job. As for full-text search, try Plucene or KinoSearch to store the full-text index.


#3


2  

My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:


my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
   my $text = $doc->getPageText($pagenum);
   print $text;
}

#4


2  

You may want to look at PDF::Core.


#5


1  

The easiest full-text index/search I've used is MySQL. You just insert into a table with the appropriate FULLTEXT index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy SQL.

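A sketch of that weighting idea, assuming a MyISAM table with one FULLTEXT index per field (the table and column names here are made up, not from the original post):

```sql
-- Separate FULLTEXT indexes per field let each match be scored on its own,
-- so a title hit can be weighted more heavily than a body hit.
CREATE TABLE docs (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body  MEDIUMTEXT,
    FULLTEXT (title),
    FULLTEXT (body)
) ENGINE = MyISAM;

SELECT id,
       3 * MATCH (title) AGAINST ('search terms')
         + MATCH (body)  AGAINST ('search terms') AS score
  FROM docs
HAVING score > 0
 ORDER BY score DESC;
```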

Plucene is deprecated (there hasn't been any active work on it in the last two years, afaik) in favour of KinoSearch. KinoSearch grew, in part, out of an understanding of the architectural limitations of Plucene.


If you have ~300 PDFs, then once you've extracted the text from them (assuming the PDFs contain text and not just images of text ;), depending on your query volumes you may find grep is sufficient.


However, I'd strongly suggest the MySQL/KinoSearch route, as they have covered a lot of ground (stemming, stopwords, term weighting, token parsing) that you won't benefit from getting bogged down in yourself.


KinoSearch is probably faster than the MySQL route, but the MySQL route gives you more widely used standard software/tools/developer experience. And you get the ability to use the power of SQL to augment your free-text search queries.


So unless you're talking HUGE data sets and insane query volumes, my money would be on MySQL.


#6


0  

You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast and I know that PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere in CPAN. Even if you can't find something that already adds PDF files to a Lucene index it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.


There's also a very quick and dirty way. Text in a PDF file is often stored as plain text, at least when the content streams aren't compressed. If you open a PDF in a text editor or use 'strings' you can see the text in there. The binary junk is usually embedded fonts, images, and compressed streams.

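A minimal sketch of that strings-plus-grep approach, using a throwaway file in place of a real PDF (the filename, content, and search string are all made up); note it only finds text that happens to be stored uncompressed:

```shell
# Fake "PDF" containing one uncompressed text-showing operator plus some
# binary junk; a real PDF has far more structure around it.
printf 'BT (Quarterly expenses) Tj ET\n\000\001\002' > sample.bin
# strings pulls out the printable runs; grep filters for the phrase we want.
strings sample.bin | grep -i "expenses"
rm -f sample.bin
```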
