How do I convert a PDF to text so that I can parse the text with PHP?

Time: 2022-10-30 12:10:41

I have PDFs that are mostly simply formatted text. I would like to parse the text with PHP. I realize that the PDF is binary so I need a utility or library to convert it to text.

Any recommendations?

3 answers

#1


4  

Third-party software can dump the text contents of a PDF file, for example:

  • xdoc2txt (Windows-only, used in WinMerge plugins)

  • pdftotext, part of Xpdf
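
As a rough illustration of the pdftotext route, here is a minimal shell sketch. The helper name pdf_to_text and the default output name are assumptions made for this example; only the pdftotext command and its -layout option come from the tool itself. From PHP, a command like this could be run with shell_exec() and the resulting .txt file read with file_get_contents().

```shell
# Convert one PDF to a plain-text file with pdftotext.
# pdf_to_text is a hypothetical helper name, not part of Xpdf.
pdf_to_text() {
    pdf=$1
    # Default output: same path with .pdf replaced by .txt (an assumption of this sketch)
    txt=${2:-${pdf%.pdf}.txt}

    # Fail early if pdftotext is not available
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 127
    fi

    # -layout asks pdftotext to keep the physical layout of the page where it can
    pdftotext -layout "$pdf" "$txt"
}
```

Usage would look like: pdf_to_text report.pdf (writes report.txt), or pdf_to_text report.pdf /tmp/out.txt. Whether -layout helps depends on how the text will be parsed afterwards; dropping it gives reading-order output instead.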

#2


4  

I ended up using Xpdf (which includes pdftotext). This works great, and I use it in production to extract text from millions of PDFs uploaded to our servers.

Below is the install process on CentOS Linux:

  1. Download version 3.03 from here: http://foolabs.com/xpdf/download.html

  2. Extract the tar.gz: tar -zxvf xpdfbin-linux-3.03.tar.gz

  3. Create the required directories for the install (some or all of these might exist already):
    • sudo mkdir /usr/local/man/
    • sudo mkdir /usr/local/man/man1/
    • sudo mkdir /usr/local/man/man5/
    • sudo mkdir /usr/local/etc/xpdfrc/

  4. Move the files from the extracted folders (cd into the folder where xpdf was just unpacked):
    • move all the executables from the bin64 directory (xpdf, pdftotext ... all the files) to /usr/local/bin/
    • move the sample-xpdfrc file to /usr/local/etc/xpdfrc (this can be used as is)
    • move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)

  5. Xpdf should now be installed and ready to use.

  6. You can delete the downloaded tar.gz file and the folder where it was unpacked.
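
The directory and file steps above can be sketched as one shell function. This is only an outline under a few assumptions: the function name install_xpdf and its optional prefix argument are made up for this example, it copies files rather than moving them so the extracted folder stays intact, and it uses mkdir -p so directories that already exist are not an error.

```shell
# Sketch of the manual install steps as a function.
# install_xpdf and the optional prefix argument are hypothetical,
# introduced here for illustration.
install_xpdf() {
    src=$1                     # folder produced by extracting the tarball
    prefix=${2:-/usr/local}    # install root; defaults to the paths used in the answer

    # Create the target directories (no error if they already exist)
    mkdir -p "$prefix/bin" \
             "$prefix/man/man1" \
             "$prefix/man/man5" \
             "$prefix/etc/xpdfrc" &&
    # Executables from bin64 (xpdf, pdftotext, ...)
    cp "$src"/bin64/* "$prefix/bin/" &&
    # Sample configuration file, usable as is
    cp "$src/sample-xpdfrc" "$prefix/etc/xpdfrc/" &&
    # Manual pages: section 1 and section 5
    cp "$src"/doc/*.1 "$prefix/man/man1/" &&
    cp "$src"/doc/*.5 "$prefix/man/man5/"
}
```

With the default prefix the function needs root privileges, so it would have to be run from a root shell. Passing a second argument, for example install_xpdf ./extracted-folder "$HOME/xpdf", installs into a directory you own instead; the folder name here is a placeholder for wherever the tarball was unpacked.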

#3


1  

You can't do that with file_get_contents() alone: it only returns the raw bytes of the PDF, and the text inside a PDF is usually stored in compressed, encoded streams rather than as plain text. To read or modify a PDF file you can use a third-party library. Take a look at:

And don't forget
