Mastering PDF Text with PyMuPDF’s ‘insert_htmlbox’: What You Need to Know

Harald Lieder·February 11, 2024

PyMuPDFStoryPDF Manipulation

Mastering PDF Text with PyMuPDF’s ‘insert_htmlbox’: What You Need to Know

In this article

PyMuPDF Overview
Solution Approach
Limitations of the Old Methods
Method insert_htmlbox
The Call Pattern
A Practical Example
Methods Comparison
What is Text Shaping?
Conclusion

Programmatically writing text to pages is among the most desired features when dealing with PDF documents.

Typical use cases include:

Fill forms and populate table cells with generating automatic line breaks and observing available space.
Add headers, footers or watermarks to existing pages.
Comment or annotate images, graphics and other objects on a page.
In language translation of manuals or articles, write translated text side-by-side to the original for easy content matching.

PyMuPDF Overview

PyMuPDF is a Python binding for the MuPDF library, which is a lightweight PDF, XPS and e-book viewer. The PyMuPDF library not only supports reading and rendering PDF (and other) documents but also provides powerful utilities for creating and manipulating PDFs.

It can help you with all of the above tasks in an easy, intuitive way.

Solution Approach

To accomplish the mentioned tasks, we require a way to write text within a specific area of the page. This area is defined by its coordinates and can be visualized as a rectangle.

Additionally, we need the ability to determine how many words can fit within a given line width. When necessary, we should start a new line to avoid overflowing the available space.

Lastly, it’s crucial to keep track of the space already utilized within the rectangle. We must ensure that our text remains within the designated area and does not extend beyond its boundaries.

For a long time, PyMuPDF has offered two ways of achieving this:

Method insert_textbox
Method fill_textbox

We do not want to delve into the subtle differences between the methods at this point. What’s important here is that both can fill a predefined rectangle with text, as described earlier.

We will soon demonstrate that both methods do have certain limitations that make their use challenging or even impossible — especially when dealing with languages featuring complex writing systems like Devanagari.

Therefore, we have developed a new method, insert_htmlbox that does not exhibit these limitations and, furthermore, provides significantly greater flexibility and convenience.

Limitations of the Old Methods

Here is a list of things that you cannot do with either insert_textbox or fill_textbox.

All characters in the text are written with the same font, font size and color.
Changes between bold, italic and regular text are not possible.
Mixtures of different script systems are not supported. For instance, the text cannot contain Latin and Hindi words.
Optional hyphenation at line ends is not supported.
Text is written character by character. There is no support for complex script systems like Devanagari. This makes both methods effectively unusable for Hindi, Bengali, Tamil and more than 120 other languages.
No or weak support for right-to-left languages.
Weak support in case the content won’t fit in the rectangle.

Method insert_htmlbox

Method insert_htmlbox was introduced in PyMuPDF version 1.23.8. It accepts a rectangle and writes text into it — like the other two methods do. It however addresses all of the above shortcomings by internally using a Story object to layout the content.

Note

If you want to learn more about the “Story” feature, please read our blog post on “Advanced PDF Layouts”.

The major difference is that the text may be enriched with HTML tags and styling instructions. This was our motivation for choosing the name.

Here is a feature overview:

The text may contain HTML tags with styling instructions. Supported syntax roughly comprises HTML version 4 and CSS version 2.
When necessary, line breaks are generated at word boundaries. The HTML “soft-hyphen” character () can be used for additional, hyphenated line breaks.
Forced line breaks are available via the HTML tag <br> (“\n” is ignored and treated as space).
Multiple spaces are replaced by a single space. Use the HTML non-breaking space character   when required.
Any mixture of styling effects like bold, italic, text color, text alignment, font size or font switching is possible.
The text may include arbitrary, including right-to-left languages.
The underlying Story object will determine whether some character is not contained by any of the specified fonts and will automatically include a suitable one.
Scripts like Devanagari and others, with a highly complex system of ligatures are supported via HarfBuzz (see section “What is Text Shaping?” below).
Images can be included via HTML tag <img> and will be appropriately laid out. This is an alternative option for inserting images, compared to Page method insert_image.
HTML tables (tag <table>) may be present in the text.
HTML hyperlinks (tag <a>) are supported.
Super- and subscripts are supported (formula support, tags <sub>, <sup>).
If content is too large for the rectangle, the developer has two choices:
- Either be just informed (and accept a no-op),
- Or let the content be automatically scaled down to fit.
The output may be given an overall opacity — which can be very handy for inserting non-transparent images with transparency.
The output can be made conditionally visible via attaching the cross reference number of an Optional Content Group
Put content above (foreground) or below (background) all other page content.
Content can be rotated by multiples of 90 degrees.

The Call Pattern

This is how the method can be invoked in a Python program:

rc = page.insert_htmlbox(  # page is a PDF Page object
     rect,         # rectangle inside the page
     text,         # text string or a Story object
     css=None,     # an optional string with CSS-compatible styling
     scale_low=0,  # limit scaling down when fitting content
     archive=None, # points to locations of fonts and images
     rotate=0,     # clockwise rotate content by this angle
     oc=0,         # assign xref of an OCG (conditional visibility)
     opacity=1,    # make content transparent (default: 1 = no)
     overlay=True, # put in foreground (default) or background
     )

The parameters rect and text are required and positional, all other parameters must be specified as keywords.

If the text contains no HTML and no extra styling instructions are specified (css=None), the following defaults apply:

The font is Charis Serif Regular and covers the basic Latin, Greek and Cyrillic letters. For Latin characters beyond this range, font Notos Serif Regular is used.
Letters outside the Latin range will cause inclusion of appropriate fonts from Google Notos fonts. This literally covers the languages world-wide.
Additional font files will also be included if text switches between bold, italic and regular (even without explicitly specifying any font).
Use styling instructions @font-face and font-family to define your own fonts.
Styling instruction “body {margin: 1px;}” is used to define a default margin for filling the rectangle. Override this as needed.
Text is aligned left by default. Use styling instructions to override, like <p style=”text-align: center”;> centers text inside a paragraph. There is no method parameter to achieve this.

By default, the content will always fit in the rectangle. If necessary, an iteration will be used to find an optimal down-scaling factor. The return code informs about the iteration result: rc=(spare_height, scale). The values have the following meaning.

spare_height — the height of the “stripe” inside the rectangle that remained unused. For instance, this area is located at the rectangle’s bottom if rotation is zero. If down-scaling has occurred, then scale is less than 1 and spare_height will be zero. If the content did not fit (because scaling was opted out), it is -1.
scale — the computed down-scaling factor. We always have 0 < scale ≤ 1.

You can prevent or limit scaling by setting scale_low to a positive value. The maximum value is 1 which prohibits scaling altogether. For example scale_low=0.2 means that content will be scaled down by at most 80%.

Method insert_htmlbox can easily lead to the inclusion of multiple font files in the PDF. To control the file size, we strongly recommend building subset fonts by executing the Document method subset_fonts. This can easily reduce the file size by one or even two orders of magnitude.

A Practical Example

Here is an example that prints the inevitable “Hello, World!” greetings in a dozen different languages. We do not specify any font and thus fully leave their selection to the Story.

import pymupdf


greetings = (
    "Hello, World!",  # english
    "Hallo, Welt!",  # german
    "سلام دنیا!",  # persian
    "வணக்கம், உலகம்!",  # tamil
    "สวัสดีชาวโลก!",  # thai
    "Привіт Світ!",  # ukrainian
    "שלום עולם!",  # hebrew
    "ওহে বিশ্ব!",  # bengali
    "你好世界！",  # chinese
    "こんにちは世界！",  # japanese
    "안녕하세요, 월드!",  # korean
    "नमस्कार, विश्व !",  # sanskrit
    "हैलो वर्ल्ड!",  # hindi
)
doc = pymupdf.open()  # make a new empty PDF
page = doc.new_page()  # give it an empty page
rect = (50, 50, 200, 500)  # define a small rectangle


# concatenate the greetings into one string.
text = " ... ".join([t for t in greetings])


page.insert_htmlbox(rect, text)  # place into the rectangle


# make subset fonts
doc.subset_fonts()
doc.save("out.pdf", garbage=3, deflate=True)

This will be generated:

To properly produce the above, the Story has identified and included eight different fonts. Because we are creating font subsets, the resulting PDF has a size of 97 KB – otherwise it would have been 2 megabytes, a size reduction factor of 20!

Methods Comparison

Let's look at differences and similarities of PyMuPDF’s ways to fill content into a rectangle.

Given the comfort of insert_htmlbox, an obvious question is:

Why would you still want to use one of the “old” methods for writing into text boxes?

Here are a number of situations when insert_textbox and fill_textbox remain viable choices:

Performance: The old methods can be much faster — here is why. A lot of things are happening behind the curtain to make insert_htmlbox work:
- The text is processed by a HTML parser.
- Every character is confirmed to be contained in its designated font. In case of failing this, a suitable font must be searched and loaded.
- Once the text string is finalized, the text shaper (HarfBuzz) is called to produce the sequence of resulting glyphs.
- The result is laid out in the given rectangle. If exceeding available space, perform an iterative repetition of the layout with varying scaling factors. Measurements show that the number of these iterations may reach 20 or more.
Special effects:
- Support of parameter morph. This can lead to arbitrary changes of the rectangle’s final appearance: any rotation angle is possible, as well as up-down and left-right flips along some line. Other possible uses include the simulation of italic fonts and independent shrinking / stretching of x- and y-values.
- Support of parameter render_mode: Writing invisible / hidden text (as used by OCR engines) and controlling thickness and color of the border of letters. Allows simulation of bold fonts and other text effects.

What is Text Shaping?

The Story class supports “text shaping”. Here is a brief explanation of the term and why we need it.

Not everyone realizes that outputting text on a document page can be much more complex than writing character by character.

But already outputting a mixture of Arabic and English is not trivial: unlike Western writing systems, Arabic is written from right to left. Therefore, in a compound text of Arabic and English you will have multiple changes of writing directions. The same is true for the right-to-left languages Hebrew, Persian etc.

On top of this, there are languages where letters must be joined with each other in a way that is situation-dependent: it depends on the sequence in which the letters happen to occur in a word.

E.g. for writing the text “another text” in Persian, it would be wrong to simply output the single letters like this:

The correct result instead looks like this:

In fact, this is just scratching the surface:

Many languages in South East Asia (Hindi, Sanskrit, Bengali, Tamil, Nepali, Thai etc.) use scripting systems that have dozens of so-called “ligatures”. These are graphical symbols (“glyphs”) that represent multiple characters. The same letters, in a different situation or in a different sequence may be represented by different glyphs. It would yield illegible output if we would ignore this and simply write the text character by character.

Note

Alphabets that are based on ASCII or Latin characters, only have few ligatures. In many or most cases one can guess the constituting letters. An example for a ligature in ASCII text is “&”. This symbol has evolved from the letters “e” and “t” of the word “et” (Latin for “and”). German and Scandinavian letters like “ä”, “æ” and “ß” are other examples.

For a more detailed impression of the challenges involved, we recommend visiting the Wikipedia website for Devanagari (देवनागरी). Devanagari is the fourth most-often used script in the world. More than 120 languages are written with it.

And Devanagari is just one example out of many others — each coming with its own set of glyphs, glyph-building rules and exceptions.

The term “text shaping” refers to a software capability that knows how to deal with all this — and will output the correct result when being given an arbitrary text string.

One of the most popular such software packages is called HarfBuzz. Its support for languages, scripting systems and fonts worldwide is deemed to be (among) the most complete. It is used by numerous software applications like browsers (Chromium, Firefox), office applications (LibreOffice) but also Adobe InDesign and Photoshop.

MuPDF uses HarfBuzz inside its story feature, as does PyMuPDF in its corresponding Story class.

Conclusion

Method insert_htmlbox is a powerful, yet elegant way to write content to PDF pages that combines the advanced features of the Story class with the expressiveness of HTML syntax in an easy-to-use, intuitive way.

Have a look at many more interesting articles on our blog. Other resources are our excellent documentation, the #pymupdf channel on Discord and the interactive, installation-free playground pymupdf.io.