Extracting columns of text from a PDF file using iText

Time: 2023-02-10 22:18:40

I need to extract text from PDF files using iText.

The problem is that some PDF files contain two columns, and when I extract the text I get a text file in which the columns are merged (i.e. text from both columns ends up on the same line).

This is the code:

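import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;

// Package names below assume iText 5.x
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class pdf
{
    private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf" ;
    private static String OUTPUTFILE = "c:/new3.pdf";

    public static void main(String[] args) throws DocumentException, IOException {
        // Copy every page of the source PDF into a new document
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
        document.open();

        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages();

        PdfImportedPage page;

        // Go through all pages
        for (int i = 1; i <= n; i++) {
            page = writer.getImportedPage(reader, i);
            Image instance = Image.getInstance(page);
            document.add(instance);
        }

        document.close();
        reader.close();

        // Extract the text of each page of the copy and append it to a text file
        PdfReader readerN = new PdfReader(OUTPUTFILE);
        for (int i = 1; i <= n; i++) {
            String myLine = PdfTextExtractor.getTextFromPage(readerN, i);
            System.out.println(myLine);

            // try-with-resources closes the writer even if write() fails
            try (FileWriter fw = new FileWriter("c:/yo.txt", true)) {
                fw.write(myLine);
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        readerN.close();
    }
}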

Could you please help me with the task?

6 answers

#1


24  

I am the author of the iText text extraction sub-system. What you need to do is develop your own text extraction strategy (if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that you can provide a pluggable strategy).

How you determine where columns start and stop is entirely up to you. This is a difficult problem: PDF doesn't have any concept of columns (heck, it doesn't even have a concept of words; just putting together the text extraction that the default strategy provides is quite tricky). If you know in advance where the columns are, then you can use a region filter on the text render listener callback (there is code in the iText library for doing this, and the latest version of the iText In Action book gives a detailed example).
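
When the column boundaries are known in advance, the region-filter idea amounts to keeping only the text chunks that start inside a rectangle and then putting them in reading order. iText 5's parser package provides RegionTextRenderFilter and FilteredTextRenderListener for this purpose; the sketch below does not call them and only illustrates the underlying logic in plain Java. The `Chunk` type, the `extractRegion` method and all coordinates are hypothetical stand-ins for what a render listener would receive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RegionFilterSketch {
    /** A piece of text with the coordinates of its start point (hypothetical type). */
    static class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Text of all chunks starting inside [left, right) x [bottom, top), top line first. */
    static String extractRegion(List<Chunk> chunks, float left, float bottom, float right, float top) {
        List<Chunk> inside = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.x >= left && c.x < right && c.y >= bottom && c.y < top) {
                inside.add(c);
            }
        }
        // PDF y coordinates grow upwards, so a larger y comes earlier; ties are ordered by x.
        inside.sort(Comparator.comparingDouble((Chunk c) -> -c.y).thenComparingDouble(c -> c.x));

        StringBuilder sb = new StringBuilder();
        float lastY = Float.NaN;
        for (Chunk c : inside) {
            if (sb.length() > 0) {
                // Same baseline: same line. Different baseline: new line.
                sb.append(c.y == lastY ? ' ' : '\n');
            }
            sb.append(c.text);
            lastY = c.y;
        }
        return sb.toString();
    }
}
```

Calling `extractRegion` once per column rectangle, left column first, yields the columns one after the other instead of interleaved line by line.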

If you need to obtain columns from arbitrary data, you've got some algorithm work ahead of you (if you get something working, I'd love to take a look). Some ideas on how to approach this:

  1. Use an algorithm similar to that used in the default text extraction strategy (LocationAware...) to obtain a list of words and X/Y locations (be sure to account for rotation angle as well).
  2. For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X position.
  3. While scanning, also look for words that intersect the X position (but do not start on the X position). This will give you potential locations for column start/stop Y positions on the page.
  4. Once you have the column X and Y positions, you can fall back on a region-filtered approach.
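
The scan described in the steps above can be sketched in plain Java. This is a simplified illustration, not iText code: the `Word` type, the `columnStarts` method, the tolerance and the `minStarts` threshold are all assumptions, and the sketch ignores rotation and treats every column as running the full height of the page (so it only finds X positions, not start/stop Y positions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class ColumnEdgeSketch {
    /** A word with its horizontal extent and baseline (hypothetical type). */
    static class Word {
        final String text;
        final float x0;
        final float x1;
        final float y;
        Word(String text, float x0, float x1, float y) {
            this.text = text; this.x0 = x0; this.x1 = x1; this.y = y;
        }
    }

    /**
     * Returns X positions at which at least minStarts words begin
     * and which no other word crosses: candidate left edges of columns.
     */
    static List<Float> columnStarts(List<Word> words, float tolerance, int minStarts) {
        TreeSet<Float> edges = new TreeSet<>();
        for (Word candidate : words) {
            float x = candidate.x0;          // imaginary vertical line through this word's start
            int starts = 0;
            boolean crossed = false;
            for (Word w : words) {
                if (Math.abs(w.x0 - x) <= tolerance) {
                    starts++;                // another word starting on the line
                } else if (w.x0 < x && w.x1 > x) {
                    crossed = true;          // a word runs through the line
                    break;
                }
            }
            if (!crossed && starts >= minStarts) {
                edges.add(x);
            }
        }
        // Merge candidates that lie within the tolerance of each other.
        List<Float> result = new ArrayList<>();
        for (Float x : edges) {
            if (result.isEmpty() || x - result.get(result.size() - 1) > tolerance) {
                result.add(x);
            }
        }
        return result;
    }
}
```

Each returned X can serve as the left boundary of a region, with the next X (or the page width) as its right boundary.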

Another approach that may be equally feasible would be to analyze the drawing operations and look for long horizontal and vertical lines (assuming the columns are demarcated by ruled lines, as in a table). Right now, the iText content parser doesn't have callbacks for these operations, but it would be possible to add them without major difficulty.
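
Assuming the line segments drawn on a page were somehow available (as noted above, the iText content parser does not report them, so the `Segment` list here is purely hypothetical input), picking out long vertical rules could look like the following plain-Java sketch. The length and slant thresholds are assumptions to be tuned per document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RuleLineSketch {
    /** A straight line segment from (x1, y1) to (x2, y2) (hypothetical type). */
    static class Segment {
        final float x1;
        final float y1;
        final float x2;
        final float y2;
        Segment(float x1, float y1, float x2, float y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Sorted X positions of near-vertical segments that are at least minLength long. */
    static List<Float> verticalRules(List<Segment> segments, float minLength, float maxSlant) {
        List<Float> xs = new ArrayList<>();
        for (Segment s : segments) {
            boolean vertical = Math.abs(s.x1 - s.x2) <= maxSlant;
            boolean longEnough = Math.abs(s.y1 - s.y2) >= minLength;
            if (vertical && longEnough) {
                xs.add((s.x1 + s.x2) / 2);   // use the midpoint as the rule's X position
            }
        }
        Collections.sort(xs);
        return xs;
    }
}
```

The resulting X positions are candidate column separators; the same test with the roles of x and y swapped would find horizontal rules.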

#2


1  

You could also try PdfBox, but it all goes back to the lack of structure in PDF: it's primarily a final output format for display.

#3


1  

I know my answer is a bit late, but I'm using the following code to read certain pages from PDF files. I didn't have any problem reading columns: there is no merged text, and each column is printed separately from the other.

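import java.io.IOException;
import java.io.StringWriter;

// Package names below assume iText 5.x
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

/**
 * Get plain text from a specific page in a pdf file.
 * @param pdfPath path of the PDF file
 * @param pageNumber 1-based number of the page to read
 * @return the text of that page
 * @throws IOException if the file cannot be read
 */
public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PdfReader reader = new PdfReader(pdfPath);

    StringWriter output = new StringWriter();

    try {
        // SimpleTextExtractionStrategy keeps text in the order it appears in the content stream
        output.append(PdfTextExtractor.getTextFromPage(reader, pageNumber, new SimpleTextExtractionStrategy()));
    } catch (OutOfMemoryError e) {
        // Very large pages can exhaust the heap; report it and return what was read
        e.printStackTrace();
    } finally {
        reader.close();
    }

    return output.toString();
}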

If you want to extract only part of a page, say one column, then you need the dimensions of that column. It's still a bit tricky, but you might be able to work it out if you already know the beginning text of the column (as a way to estimate the width and height). This can be done by using a rectangular area. See the code below, and sorry if I got the point measurements wrong. In the code below I try to use the whole page dimensions.

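import java.awt.Rectangle;
import java.io.IOException;
import java.util.List;

// Uses the PDFBox 1.x API (PDDocument.load(String), getAllPages())
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public static String getPageContent(String pdfPath, int pageNumber) throws IOException
{
    PDDocument pdDoc = PDDocument.load(pdfPath);
    try {
        // Pages are stored 0-based, pageNumber is 1-based
        List allPages = pdDoc.getDocumentCatalog().getAllPages();
        PDPage specPage = (PDPage) allPages.get( pageNumber-1 );

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition( true );

        // The media box is already expressed in points, so no unit conversion is needed
        float width = specPage.getMediaBox().getWidth();
        float height = specPage.getMediaBox().getHeight();

        // Region covering the whole page; narrow it to select a single column
        Rectangle rect = new Rectangle( 0, 0, Math.round(width), Math.round(height));
        stripper.addRegion( "class1", rect );
        stripper.extractRegions( specPage );

        return stripper.getTextForRegion( "class1" );
    } finally {
        pdDoc.close();
    }
}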
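
To target a single column rather than the whole page, the rectangle has to be narrowed. One simple assumption is that the page is divided into equal-width columns; the helper below (hypothetical names, plain `java.awt.Rectangle`, dimensions in the same units as the media box) computes one full-height rectangle per column. Real layouts with margins or unequal columns would need measured values instead.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

class ColumnRegions {
    /** Splits a page of the given size into equal-width, full-height column rectangles. */
    static List<Rectangle> equalColumns(float pageWidth, float pageHeight, int columns) {
        List<Rectangle> regions = new ArrayList<>();
        float columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            // Round both edges so that neighbouring rectangles touch without overlapping
            int left = Math.round(i * columnWidth);
            int right = Math.round((i + 1) * columnWidth);
            regions.add(new Rectangle(left, 0, right - left, Math.round(pageHeight)));
        }
        return regions;
    }
}
```

Each rectangle can then be registered as its own named region, and the text of the regions read out in left-to-right order.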

#4


1  

PDFTextStream is the one! At least I am able to identify the column values. Earlier, I was using iText and got stuck defining a strategy. It's hard.

This API separates column cells by inserting extra spaces. The spacing is fixed, so you can build logic on it (this was missing in iText).

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
   }
}
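
If the extracted text separates column cells with wider runs of spaces, as described above, a line can be split into cells with a regular expression. This is a sketch in plain Java that does not depend on PDFTextStream; the class and method names are made up for illustration, and the threshold of two or more spaces is an assumption that would need checking against real output.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SpaceSplitSketch {
    /** Splits a line into cells wherever two or more consecutive spaces occur. */
    static List<String> splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        // Single spaces stay inside a cell; runs of 2+ spaces separate cells
        return Arrays.asList(trimmed.split(" {2,}"));
    }
}
```

Collecting the first cell of every line, then the second, and so on, rebuilds each column as a separate block of text.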

A related question has been asked on *!

#5


0  

The file you are extracting from is pretty complex for data extraction purposes. There are tables, images and multiple columns. You will need special algorithms to determine the reading order and also to process the table data.

What are you trying to achieve here? You could use a commercial OCR engine instead, let it do all the hard work, and then process the data from there.

#6


0  

Tables do not exist as structures in PDF unless the file uses Structured content. Do you understand what a PDF file is? I wrote a blog article explaining the issues of text extraction at http://www.jpedal.org/PDFblog/?p=228
