Java – PDFBox – Line/rectangle extraction

I’m trying to extract text coordinates and line (or rectangle) coordinates from a PDF.

The TextPosition class has the methods getXDirAdj() and getYDirAdj(), which return coordinates adjusted to the text direction of the respective TextPosition object (corrected based on a comment by @mkl). The final output is consistent irrespective of the page rotation.

The coordinates needed in the output are relative to an origin (0,0) at the upper-left corner of the page.

This is a slight modification of the solution by @Tilman Hausherr. The y coordinates are inverted (height - y) to keep them consistent with the coordinates from the text extraction process, and the output is written to a CSV file.

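import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine
{
    private static final GeneralPath linePath = new GeneralPath();
    private static ArrayList<Rectangle2D> rectList = new ArrayList<Rectangle2D>();
    private int clipWindingRule = -1;
    private static String headerRecord = "Text| Page|x|y|width|height|space|font";

    public LineCatcher(PDPage page)
    {
        super(page);
    }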

    public static void main(String[] args) throws IOException
    {
        // expected arguments: input directory, input file name, output directory, output file name
        if( args.length != 4 )
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            FileOutputStream fop = null;
            File file;
            Writer osw = null;
            int numPages;
            double page_height;
            try
            {
                document = PDDocument.load( new File(args[0], args[1]) );
                numPages = document.getNumberOfPages();
                file = new File(args[2], args[3]);
                fop = new FileOutputStream(file);

                // if the file doesn't exist, then create it
                if (!file.exists()) {
                    file.createNewFile();
                }

                osw = new OutputStreamWriter(fop, "UTF8");
                osw.write(headerRecord + System.lineSeparator());
                System.out.println("Line Processing numPages:" + numPages);
                for (int n = 0; n < numPages; n++) {
                    System.out.println("Line Processing page:" + n);
                    // start with an empty list for each page
                    rectList = new ArrayList<Rectangle2D>();
                    PDPage page = document.getPage(n);
                    page_height = page.getCropBox().getUpperRightY();
                    LineCatcher lineCatcher = new LineCatcher(page);
                    lineCatcher.processPage(page);

                    try {
                        for (Rectangle2D rect : rectList) {
                            String pageNum = Integer.toString(n + 1);
                            String x = Double.toString(rect.getX());
                            // invert y so that it increases downwards from the top of the page
                            String y = Double.toString(page_height - rect.getY());
                            String w = Double.toString(rect.getWidth());
                            String h = Double.toString(rect.getHeight());
                            writeToFile(pageNum, x, y, w, h, osw);
                        }
                        rectList = null;
                        page = null;
                        lineCatcher = null;
                    }
                    catch (IOException io) {
                        // keep the original exception as the cause
                        throw new IOException("Failed to Parse document for line processing. Incorrect document format. Page:" + n, io);
                    }
                }
            }
            catch (IOException io) {
                throw new IOException("Failed to Parse document for line processing. Incorrect document format.", io);
            }
            finally
            {
                if ( osw != null ) {
                    osw.close();
                }
                if( document != null )
                {
                    document.close();
                }
            }
        }
    }

    private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
        String c = "^" + "|" +
                pageNum + "|" +
                x + "|" +
                y + "|" +
                w + "|" +
                h + "|" +
                "999" + "|" +
                "marker-only";
        osw.write(c + System.lineSeparator());
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());

        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();
    }

    @Override
    public void strokePath() throws IOException
    {
        rectList.add(linePath.getBounds2D());
        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }

    /**
     * This will print the usage for this program.
     */
    private static void usage()
    {
        System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-dir> <input-pdf> <output-dir> <output-file>");
    }
}

The PDFGraphicsStreamEngine class is used to extract the line and rectangle coordinates. However, the coordinates of the lines and rectangles do not align with the text coordinates.

Plot of the extracted text and lines:

Green: Text
Red: Line coordinates obtained as-is
Black: The expected coordinates (obtained after applying a transformation to the output).

I tried using the setRotation() method to correct the rotation before running the line extraction. However, the results were not consistent.

What are the possible options in PDFBox for taking the page rotation into account and getting consistent line/rectangle coordinates in the output?

Solution

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper-left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right and y coordinates increasing downwards, in PDF default user space units (usually 1/72 inch).

In this coordinate system, he needs to extract (horizontal or vertical) lines in the form of

  • the coordinates of the left/top endpoint and
  • the width/height.

Transforming the LineCatcher results

The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Also, for vertical lines it returns the bottom endpoint, not the top endpoint. Therefore, a coordinate transformation has to be applied to the LineCatcher results.
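The bottom-endpoint behaviour follows from how the bounding box is built. The following is a minimal sketch using only java.awt.geom (no PDFBox), assuming the path coordinates are in PDF user space with y increasing upwards. The class and method names are hypothetical and only serve the illustration.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration class, not part of PDFBox or the original LineCatcher.
final class VerticalLineBounds {

    /** Bounding box of one straight segment, built the same way strokePath() records it. */
    static Rectangle2D boundsOfLine(float x0, float y0, float x1, float y1) {
        GeneralPath path = new GeneralPath();
        path.moveTo(x0, y0);
        path.lineTo(x1, y1);
        return path.getBounds2D();
    }

    /** Upper end of the box when y increases upwards (PDF user space). */
    static double topY(Rectangle2D bounds) {
        return bounds.getY() + bounds.getHeight();
    }

    public static void main(String[] args) {
        // vertical line from y = 200 up to y = 500
        Rectangle2D b = boundsOfLine(100f, 200f, 100f, 500f);
        System.out.println("getY() = " + b.getY() + ", top = " + topY(b));
    }
}
```

Since getY() is the smallest y value of the box, it corresponds to the lower end of a vertical line as long as y increases upwards; the upper end is getY() + getHeight().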

To do this, simply replace

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x = Double.toString(rect.getX());
    String y = Double.toString(page_height - rect.getY()) ;
    String w = Double.toString(rect.getWidth());
    String h = Double.toString(rect.getHeight());
    writeToFile(pageNum, x, y, w, h, osw);
}

with

int pageRotation = page.getRotation();
PDRectangle pageCropBox = page.getCropBox();

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x, y, w, h;
    switch(pageRotation) {
    case 0:
        x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        // the top edge of the box is at rect.getY() + rect.getHeight()
        y = Double.toString(pageCropBox.getUpperRightY() - (rect.getY() + rect.getHeight()));
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 90:
        x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    case 180:
        x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 270:
        // the top edge of the box is at rect.getY() + rect.getHeight()
        x = Double.toString(pageCropBox.getUpperRightY() - (rect.getY() + rect.getHeight()));
        y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    default:
        // n is the zero-based page index of the enclosing loop
        throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, n + 1));
    }
    writeToFile(pageNum, x, y, w, h, osw);
}

(ExtractLinesWithDir test testExtractLineRotationTestWithDir)
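To check the per-rotation mapping in isolation, the same arithmetic can be expressed as a small pure function. This is only a sketch under stated assumptions: the class name, the method name and the plain double parameters for the crop box corners are hypothetical stand-ins for the PDRectangle getters, and the top edge of a box is taken to be getY() + getHeight() in PDF user space.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, independent of PDFBox; crop box corners are passed as plain doubles.
final class RotationAwareTransform {

    /**
     * Maps a bounding box from PDF user space (origin lower left, y upwards)
     * to the visible page (origin upper left, y downwards) for a given /Rotate value.
     */
    static Rectangle2D toVisiblePage(Rectangle2D r, int rotation,
                                     double llx, double lly, double urx, double ury) {
        double right = r.getX() + r.getWidth();  // right edge in user space
        double top = r.getY() + r.getHeight();   // top edge in user space
        switch (rotation) {
        case 0:
            return new Rectangle2D.Double(r.getX() - llx, ury - top, r.getWidth(), r.getHeight());
        case 90:
            return new Rectangle2D.Double(r.getY() - lly, r.getX() - llx, r.getHeight(), r.getWidth());
        case 180:
            return new Rectangle2D.Double(urx - right, r.getY() - lly, r.getWidth(), r.getHeight());
        case 270:
            return new Rectangle2D.Double(ury - top, urx - right, r.getHeight(), r.getWidth());
        default:
            throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }

    public static void main(String[] args) {
        // vertical line from (100, 200) to (100, 500) on a 600 x 800 crop box
        Rectangle2D line = new Rectangle2D.Double(100, 200, 0, 300);
        for (int rotation : new int[] {0, 90, 180, 270}) {
            System.out.println(rotation + ": " + toVisiblePage(line, rotation, 0, 0, 600, 800));
        }
    }
}
```

For rotations of 90 and 270 degrees, width and height swap, so a vertical line in user space is reported as a horizontal line on the visible page.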

Relation to the TextPosition.get?DirAdj() coordinates

The OP describes the coordinates by referencing the TextPosition class methods getXDirAdj() and getYDirAdj(). In effect, these methods return coordinates in a coordinate system with the origin in the upper-left corner of the page and y coordinates increasing downwards, after rotating the page so that the text is drawn upright.

In the case of the sample document, all text is drawn to stand upright after applying page rotation. My understanding of the requirements written above is derived from this.

However, the problem with using the TextPosition.get?DirAdj() values as global coordinates is that in documents containing pages with text drawn in different directions, the collected text coordinates are suddenly relative to different coordinate systems. Therefore, a general solution should not collect text positions indiscriminately like this. Instead, it should first determine the page orientation (for example, the orientation given by the page rotation or the orientation shared by most of the text) and then use coordinates in the fixed coordinate system given by that orientation, plus an indication of the writing direction of the piece of text in question.
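One way to pick "the orientation shared by most of the text" is a simple majority vote over the per-fragment directions. The sketch below assumes the directions have already been collected as integer angles in degrees (0, 90, 180 or 270); in PDFBox they would presumably come from TextPosition.getDir(), but that wiring is an assumption and is not shown. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: majority vote over text directions given in degrees.
final class DominantDirection {

    /**
     * Returns the direction that occurs most often. Ties are resolved in favour
     * of the smaller angle; an empty list yields 0.
     */
    static int of(List<Integer> directions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int direction : directions) {
            counts.merge(direction, 1, Integer::sum);
        }
        int best = 0;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            int direction = entry.getKey();
            int count = entry.getValue();
            if (count > bestCount || (count == bestCount && direction < best)) {
                best = direction;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // mostly upright text with a few rotated fragments
        System.out.println(of(Arrays.asList(0, 0, 90, 0, 270)));
    }
}
```

The result could then select the fixed coordinate system for the whole page, while each text fragment keeps its own direction as a separate attribute.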
