Programmatically writing text to pages is among the most desired features when dealing with PDF documents.
Typical use cases include:
PyMuPDF is a Python binding for the MuPDF library, which is a lightweight PDF, XPS and e-book viewer. The PyMuPDF library not only supports reading and rendering PDF (and other) documents but also provides powerful utilities for creating and manipulating PDFs.
It can help you with all of the above tasks in an easy, intuitive way.
To accomplish the mentioned tasks, we require a way to write text within a specific area of the page. This area is defined by its coordinates and can be visualized as a rectangle.
Additionally, we need the ability to determine how many words can fit within a given line width. When necessary, we should start a new line to avoid overflowing the available space.
Lastly, it’s crucial to keep track of the space already utilized within the rectangle. We must ensure that our text remains within the designated area and does not extend beyond its boundaries.
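The three requirements above (a target rectangle, line breaks at word boundaries, and bookkeeping of the space already used) can be illustrated with a small, library-independent sketch. It assumes, purely for illustration, that every character has the same width; real fonts need per-glyph measurements. The names wrap_words and fit_in_box are hypothetical and not part of PyMuPDF.

```python
def wrap_words(text, line_width, char_width):
    """Greedily break text into lines no wider than line_width.

    Assumes every character (including spaces) is char_width wide.
    A single word wider than the line is placed on a line of its own.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or len(candidate) * char_width <= line_width:
            current = candidate  # the word still fits on the current line
        else:
            lines.append(current)  # line is full: start a new one
            current = word
    if current:
        lines.append(current)
    return lines


def fit_in_box(text, box_width, box_height, char_width=6.0, line_height=14.0):
    """Return (lines, spare_height); a negative spare_height signals overflow."""
    lines = wrap_words(text, box_width, char_width)
    used_height = len(lines) * line_height
    return lines, box_height - used_height


lines, spare = fit_in_box("the quick brown fox jumps over the lazy dog", 60, 100)
```

Returning the spare height, rather than just a success flag, lets the caller decide whether to continue writing below the text or to react to an overflow.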
For a long time, PyMuPDF has offered two ways of achieving this:
insert_textbox
fill_textbox
We do not want to delve into the subtle differences between the methods at this point. What’s important here is that both can fill a predefined rectangle with text, as described earlier.
We will soon demonstrate that both methods do have certain limitations that make their use challenging or even impossible — especially when dealing with languages featuring complex writing systems like Devanagari.
Therefore, we have developed a new method, insert_htmlbox, that does not exhibit these limitations and, furthermore, provides significantly greater flexibility and convenience.
Here is a list of things that you cannot do with either insert_textbox or fill_textbox.
Method insert_htmlbox was introduced in PyMuPDF version 1.23.8. It accepts a rectangle and writes text into it, like the other two methods do. However, it addresses all of the above shortcomings by internally using a Story object to lay out the content.
The major difference is that the text may be enriched with HTML tags and styling instructions. This was our motivation for choosing the name.
Here is a feature overview:
Soft hyphens (&shy;) can be used for additional, hyphenated line breaks.
Forced line breaks require the HTML tag <br> (“\n” is ignored and treated as a space).
… when required.
Styling such as text color, text alignment, font size or font switching is possible.
Images can be included via <img> and will be appropriately laid out. This is an alternative option for inserting images, compared to the Page method insert_image.
Tables (<table>) may be present in the text.
Links (<a>) are supported.
Subscripts and superscripts (<sub>, <sup>) are supported.
This is how the method can be invoked in a Python program:
The parameters rect and text are required and positional; all other parameters must be specified as keywords.
If the text contains no HTML and no extra styling instructions are specified (css=None), the following defaults apply:
Use @font-face and font-family to define your own fonts.
“body {margin: 1px;}” is used to define a default margin for filling the rectangle. Override this as needed.
<p style="text-align: center;"> centers text inside a paragraph. There is no method parameter to achieve this.
By default, the content will always fit in the rectangle. If necessary, an iteration will be used to find an optimal down-scaling factor. The return code reports the result of this iteration: rc = (spare_height, scale). The values have the following meaning.
spare_height: the height of the “stripe” inside the rectangle that remained unused. For instance, this area is located at the rectangle’s bottom if rotation is zero. If down-scaling has occurred, then scale is less than 1 and spare_height will be zero. If the content did not fit (because scaling was disabled or limited), spare_height is -1.
scale: the computed down-scaling factor. We always have 0 < scale ≤ 1.
You can prevent or limit scaling by setting scale_low to a positive value. The maximum value is 1, which prohibits scaling altogether. For example, scale_low=0.2 means that content will be scaled down by at most 80%.
Method insert_htmlbox can easily lead to the inclusion of multiple font files in the PDF. To control the file size, we strongly recommend building subset fonts by executing the Document method subset_fonts. This can reduce the file size by one or even two orders of magnitude.
Here is an example that prints the inevitable “Hello, World!” greetings in a dozen different languages. We do not specify any font and thus fully leave their selection to the Story.
This will be generated:
To properly produce the above, the Story has identified and included eight different fonts. Because we are creating font subsets, the resulting PDF has a size of 97 KB – otherwise it would have been 2 megabytes, a size reduction factor of 20!
Let's look at differences and similarities of PyMuPDF’s ways to fill content into a rectangle.
Criteria | insert_htmlbox | insert_textbox | fill_textbox |
Auto line breaks | Words and soft hyphens | Words | Words |
Space control method | Scale down | Detect overflow | Detect overflow |
Text alignment | Full (HTML styling) | Full (“align”) | Full (“align”) |
Supported content | Text, tables, images, links | Text | Text |
Font support | Multiple (user fonts and auto detection) | One font only | One font, plus one fallback font |
Language support | All | No right-to-left, no text shaping | No text shaping |
Text orientation | Arbitrary | Arbitrary | Arbitrary |
Text styling | Full (all HTML features) | No | No |
Text shaping | Full support (HarfBuzz) | No | No |
Transparency | Yes | No | No |
Optional Content support | Yes | Yes | Yes |
Fore- / background | Yes | Yes | Yes |
Given the comfort of insert_htmlbox, an obvious question is:
Why would you still want to use one of the “old” methods for writing into text boxes?
Here are a number of situations when insert_textbox and fill_textbox remain viable choices:
insert_htmlbox
work:
morph: This can lead to arbitrary changes of the rectangle’s final appearance: any rotation angle is possible, as well as up-down and left-right flips along some line. Other possible uses include the simulation of italic fonts and independent shrinking / stretching of x- and y-values.
render_mode: Writing invisible / hidden text (as used by OCR engines) and controlling thickness and color of the border of letters. Allows simulation of bold fonts and other text effects.
The Story class supports “text shaping”. Here is a brief explanation of the term and why we need it.
Not everyone realizes that outputting text on a document page can be much more complex than writing character by character.
Even outputting a mixture of Arabic and English is not trivial: unlike Western writing systems, Arabic is written from right to left. Therefore, in a text that combines Arabic and English, the writing direction changes multiple times. The same is true for other right-to-left languages such as Hebrew and Persian.
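To make these direction changes visible, here is a small sketch based on Python's standard unicodedata module. It is a deliberate simplification and not the full Unicode Bidirectional Algorithm: neutral characters such as spaces and punctuation are simply attached to the current run. The function name direction_runs is hypothetical.

```python
import unicodedata


def direction_runs(text):
    """Split text into consecutive left-to-right and right-to-left runs."""
    runs = []  # list of [direction, substring]
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            direction = "LTR"
        elif bidi_class in ("R", "AL"):  # Hebrew and Arabic-script letters
            direction = "RTL"
        else:  # neutral character: stay in the current run
            direction = runs[-1][0] if runs else "LTR"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, part) for direction, part in runs]


sample = "PDF سلام and שלום"
runs = direction_runs(sample)
for direction, part in runs:
    print(direction, repr(part))
```

A layout engine has to detect such runs and place each one in its own direction while keeping the overall line order intact.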
On top of this, there are languages where letters must be joined with each other in a way that is situation-dependent: it depends on the sequence in which the letters happen to occur in a word.
E.g. for writing the text “another text” in Persian, it would be wrong to simply output the single letters like this:
The correct result instead looks like this:
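The situation-dependent joining can also be observed at the Unicode level. The sketch below uses the Arabic letter BEH, which is also part of the Persian alphabet: text stores one abstract letter (U+0628), but Unicode additionally lists four legacy "presentation form" code points that show the shapes a shaping engine has to choose from. This only illustrates the concept; it is not how HarfBuzz works internally.

```python
import unicodedata

# The four presentation forms of the Arabic letter BEH (U+FE8F to U+FE92).
forms = [chr(code_point) for code_point in range(0xFE8F, 0xFE93)]
for glyph in forms:
    print(f"U+{ord(glyph):04X}  {unicodedata.name(glyph)}")

# All four are compatibility variants of the same abstract letter, U+0628.
base_letters = {unicodedata.normalize("NFKC", glyph) for glyph in forms}
print(base_letters == {"\u0628"})
```

Which of the four shapes is correct depends on the neighboring letters, which is exactly the decision that text shaping automates.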
In fact, this is just scratching the surface:
Many languages in South and Southeast Asia (Hindi, Sanskrit, Bengali, Tamil, Nepali, Thai etc.) use writing systems that have dozens of so-called “ligatures”. These are graphical symbols (“glyphs”) that represent multiple characters. The same letters, in a different situation or in a different sequence, may be represented by different glyphs. Ignoring this and simply writing the text character by character would yield illegible output.
For a more detailed impression of the challenges involved, we recommend reading the Wikipedia article on Devanagari (देवनागरी). Devanagari is the fourth most widely used script in the world. More than 120 languages are written with it.
And Devanagari is just one example out of many others — each coming with its own set of glyphs, glyph-building rules and exceptions.
The term “text shaping” refers to a software capability that knows how to deal with all this — and will output the correct result when being given an arbitrary text string.
One of the most popular such software packages is called HarfBuzz. Its support for languages, writing systems and fonts worldwide is deemed to be among the most complete. It is used by numerous software applications, including browsers (Chromium, Firefox), office applications (LibreOffice), and Adobe InDesign and Photoshop.
MuPDF uses HarfBuzz inside its story feature, as does PyMuPDF in its corresponding Story class.
Method insert_htmlbox is a powerful, yet elegant way to write content to PDF pages that combines the advanced features of the Story class with the expressiveness of HTML syntax in an easy-to-use, intuitive way.
Have a look at many more interesting articles on our blog. Other resources are our excellent documentation, the #pymupdf channel on Discord and the interactive, installation-free playground pymupdf.io.
If you want to learn more about the “Story” feature, please read our blog post on “Advanced PDF Layouts”.