PyMuPDF is a Python binding for MuPDF, a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit. It offers many features for manipulating PDF and other documents, such as extracting text and images, creating and modifying pages, adding annotations and form fields, encrypting and decrypting files, and more.
One of the advanced features of PyMuPDF is the ability to detect multi-column pages in supported document types. This can be useful for processing documents that have complex layouts, such as reports, newspapers, magazines, or academic papers. By identifying the text belonging to different columns on the page, you can extract it more accurately and preserve its logical structure.
In this blog post, we will show you how to use a PyMuPDF utility for detecting multiple columns in pages and extracting text along these columns. The utility supports a variable number of columns on the page. Optionally, text written on top of images can be excluded, and footer lines can be skipped by setting an appropriate bottom margin.
The utility is a Python script named multi_column.py, which can be used as a command-line tool or imported as a module. The script contains a function named column_boxes, which takes a PyMuPDF page object as input and returns a list of text boundary boxes that correspond to the columns on the page.
The function uses MuPDF’s text block detection capability to identify text blocks and uses their bounding boxes as the primary structuring principle. It also supports ignoring footers via a footer margin parameter and optionally ignoring text written on top of images.
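To make the idea concrete, here is a minimal sketch of the footer filtering step only. It is not the code from multi_column.py: the helper name drop_footer_blocks and the sample data are made up for illustration, and the tuples merely follow the shape that page.get_text("blocks") returns in PyMuPDF.

```python
def drop_footer_blocks(blocks, page_height, footer_margin=50):
    """Keep text blocks that end above the footer stripe.

    `blocks` holds tuples shaped like the items of PyMuPDF's
    page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type).
    This helper is hypothetical and only illustrates the principle.
    """
    limit = page_height - footer_margin  # y grows downward on a page
    kept = []
    for block in blocks:
        x0, y0, x1, y1, text, block_no, block_type = block
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        if y1 > limit:  # block reaches into the footer stripe
            continue
        kept.append(block)
    return kept


# Made-up sample data standing in for page.get_text("blocks")
sample = [
    (50, 100, 280, 400, "left column", 0, 0),
    (300, 100, 540, 400, "right column", 1, 0),
    (50, 800, 540, 820, "page 3", 2, 0),
]
bboxes = [b[:4] for b in drop_footer_blocks(sample, page_height=842)]
print(bboxes)
```

With a real page, the call would be along the lines of drop_footer_blocks(page.get_text("blocks"), page.rect.height). The actual utility does more than this, since it also has to merge blocks into columns and handle text on images.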
The function has the following signature:
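Going by the parameter descriptions in this post, the signature looks roughly like the stub below. Only the names and defaults are taken from the text; the body is deliberately left out, so treat it as a reading aid rather than the script's actual code.

```python
def column_boxes(page, footer_margin=50, no_image_text=True):
    """Stub mirroring the documented parameters; not the real implementation.

    page: a PyMuPDF page object
    footer_margin: height of the bottom stripe to ignore
    no_image_text: whether to ignore text written on top of images
    Returns a list of column boundary boxes in the real utility.
    """
    raise NotImplementedError("see multi_column.py for the actual code")
```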
The parameters are:
page: a PyMuPDF page object.
footer_margin: an integer that specifies the height of the bottom stripe to ignore on each page. Default is 50.
no_image_text: a boolean that indicates whether to ignore text written on top of images. Default is True.
The return value is a list of pymupdf.IRect objects that represent the column boundary boxes. The list is sorted ascending by their top-left coordinates.
There are two ways to use the utility: as a command-line tool or as a module. In either case, PyMuPDF must be installed. There are no other dependencies.
To use the utility as a command-line tool, run the following command:
python multi_column.py input.pdf footer_margin
Here, input.pdf is the name of the PDF file you want to process and footer_margin is the height of the footer margin you want to ignore. The code is currently intended for demonstration purposes: on every page of input.pdf, the identified column boundary boxes are given a red border. Inside each rectangle, near its top-left corner, the sequence number of the rectangle is written so that the order of text extraction can be easily followed.
Modify this code as needed, for instance to extract the text of each rectangle and write it to a file.
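The marking step of the demo could look roughly like the sketch below. The helper name mark_columns, the text offset, and numbering from 0 are assumptions made here for illustration; the script itself may draw the boxes differently. Page.draw_rect and Page.insert_text are regular PyMuPDF page methods.

```python
RED = (1, 0, 0)  # RGB color, each component in the range 0..1


def mark_columns(page, bboxes):
    """Outline each column box in red and write its sequence number inside.

    Hypothetical helper: `page` is expected to behave like a PyMuPDF page,
    `bboxes` like the list returned by column_boxes.
    """
    for i, rect in enumerate(bboxes):  # numbering from 0 is an assumption
        page.draw_rect(rect, color=RED)
        # Place the number a little inside the top-left corner of the box
        page.insert_text((rect.x0 + 5, rect.y0 + 15), str(i), color=RED)
    return len(bboxes)
```

After marking all pages, the document still has to be saved under a new name to see the result in a viewer.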
To use the utility as a module, you need to import it in your Python script and call the column_boxes function with a PyMuPDF page object as an argument.
For example, if you want to extract text from each column on each page of a PDF file named sample.pdf, you can write something like this:
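A minimal sketch, assuming multi_column.py sits next to your script so that it can be imported. The helper names column_texts and print_columns are made up for this example; column_boxes and Page.get_text with the clip and sort arguments are the parts that come from the utility and from PyMuPDF.

```python
def column_texts(page, column_boxes, footer_margin=50, no_image_text=True):
    """Return the text of each detected column, in the order of the boxes."""
    bboxes = column_boxes(
        page, footer_margin=footer_margin, no_image_text=no_image_text
    )
    # Restrict extraction to one column rectangle at a time
    return [page.get_text(clip=rect, sort=True) for rect in bboxes]


def print_columns(path):
    """Print the column texts of every page, with a dashed line after each page."""
    # Imported here so that column_texts stays usable without these modules
    import pymupdf
    from multi_column import column_boxes

    doc = pymupdf.open(path)
    for page in doc:
        for text in column_texts(page, column_boxes):
            print(text)
        print("-" * 80)
```

Calling print_columns("sample.pdf") then walks through the whole file.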
This will print the text from each column on each page separated by dashes.
Here are some examples of successful column detection:
Here are some examples of problem cases:
In this blog post, we learned how to use a PyMuPDF utility for detecting multi-column pages in supported documents. The utility can separate text with different background colors, ignore footers and text written on top of images, and supports a variable number of columns on the page.
The PyMuPDF library offers many other features to work with PDF documents, such as extracting images, annotations, and much more. Be sure to explore the official PyMuPDF documentation to discover more of its capabilities.
Another source of knowledge is the utilities repository. Whatever you plan to do with PDFs, you will probably find an example script there to get you started.
If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.