PyMuPDF’s New ‘Story’ Feature Provides Advanced PDF Layout Styling - Part 1

By Harald Lieder - Wednesday, November 09, 2022

How to Lay Out Articles Using PyMuPDF

In the earlier blog posts Text Extraction with PyMuPDF and Advanced Text Manipulation Using PyMuPDF, I wrote about PyMuPDF’s text manipulation capabilities, among them reading, searching, highlighting, and updating or deleting text.

In these blogs, I did not mention the many other text features, which allow inserting text at certain positions or within rectangles, with the desired properties (font, font size, color, writing direction, text rotation, etc.).

Instead of elaborating on this, I want to focus today on PyMuPDF’s new ‘Story’ feature in version 1.21.0, released in early November 2022.

The ‘Story’ Concept

In version 1.21.0 PyMuPDF implements a concept called ‘Story’. The origins of this concept are in desktop publishing and article layout in newspapers and print magazines.

The programmer uses a Story as intermediate storage for their text with styling information (like font, bold, italic, color, etc.) and potentially interspersed images.

Once the story is complete, they determine a sequence of rectangles into which the story should be laid out and start the process of actually writing the text. At this point, no more changes to the story are possible.

Which parts of the content are placed in which rectangle is fully under the control of the Story’s output method: it can be neither influenced nor predicted. The method does, however, report back how much of each rectangle has actually been used and where (page number, position) portions of the content have been placed.

When inspecting the output method’s feedback and/or the produced result, the programmer may decide to restart the output process with a different layout.

This approach keeps the content (the story) separate from the areas into which it should be written.

How Stories are Implemented

The content of a Story is specified using HTML and, optionally, additional styling information via CSS (Cascading Style Sheets).

HTML has the benefit of being feature-rich, well-known, well-understood, well-documented, and covered by a wide range of tutorials. The underlying C library MuPDF supports a subset of HTML4 and CSS2.

To define/fill the Story, the programmer has the option to either provide HTML and/or CSS strings (coming from any source, e.g. files) or build the Story’s content completely programmatically from the ground up.

It is also possible to combine both ways and modify some externally provided HTML within the script pretty arbitrarily.

A particularly interesting feature of (Py)MuPDF’s Story implementation is its support for HTML templates. A template is standard HTML source containing named variables.

This allows making multiple copies of the HTML’s template portion and replacing the variables of each copy with information pulled out of a database. Think of examples like generating financial reports.

Solution Demos

PyMuPDF’s documentation contains a number of demo scripts that provide a gentle introduction to the most important concepts of Stories. Here is an overview of the topics covered:

  • A simple “Hello World” example.
  • Using file-based versus program-generated HTML source.
  • Make a PDF with two-columned pages.
  • Determine the free areas on each page of a PDF and use a Story to fill those rectangles with text.
  • Identify the headers in a story and automatically generate a Table of Contents from them.
  • Use an HTML template to report the content of an SQL database – explained in the next section.

Details of the SQL Database Example

Suppose you want to create a report of the films being shown at a film festival, where the films and their actors are stored in an SQL database.

Our example database (SQLite) contains two tables:

  1. Table “films” with the fields “title” (film title, string), “director” (string), and “year” (year of release, integer).
  2. Table “actors” with the fields “name” (actor name, string) and “title” (the film title where the actor had been cast, string).
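
For readers who want to experiment, a database of this shape could be set up along the following lines. This is only a sketch: the column names follow the table description above, the column types are an assumption, and the sample rows are placeholders invented for illustration, not the data of the actual demo.

```python
import sqlite3

# In-memory database, for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE films (title TEXT, director TEXT, year INTEGER);
    CREATE TABLE actors (name TEXT, title TEXT);
    """
)

# Placeholder rows, made up for this sketch
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Film B", "Director Two", 2001), ("Film A", "Director One", 1999)],
)
conn.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [("Actor Y", "Film A"), ("Actor X", "Film A"), ("Actor Z", "Film B")],
)
conn.commit()

# Read all films, sorted by title
films = conn.execute(
    "SELECT title, director, year FROM films ORDER BY title"
).fetchall()
```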

We want to report each film with its data, plus enumerate all actors that were part of the cast. The overall layout is defined as HTML within the script like this:


festival_template = (
    '<h1 style="text-align:center">Hook Norton Film Festival</h1>'
    "<ol>"
    # the film template starts here:
    '<li id="filmtemplate">'
    # receives film title:
    '<b id="filmtitle"></b>'
    "<dl>"
    # receives director:
    '<dt>Director</dt><dd id="director">'
    # receives year:
    '</dd><dt>Release Year</dt><dd id="filmyear">'
    # receives list of actors:
    '</dd><dt>Cast</dt><dd id="cast">'
    "</dd></dl>"
    "</li>"
    "</ol>"
    ""
)

Modifying this HTML changes the final look: you can choose a font, a text color, page breaks between films, and so on.

The script logic works like this:

  1. Locate the template part within the HTML.
  2. Read a row from the “films” table. SQL statement: "SELECT title, director, year FROM films ORDER BY title".
  3. Make a copy (“clone”) of the template and replace the three variable names of the template copy with the contents of the database row’s fields.
  4. Read those rows of the “actors” table where “title” equals the film title. SQL statement: 'SELECT name FROM actors WHERE film = "%s" ORDER BY name' - the Python placeholder "%s" will receive the respective film title.
  5. Concatenate the retrieved actor names with line breaks and replace the “cast” template variable with the result.
  6. Append the resulting film content to the body of the Story.
  7. Repeat from point 2 above.

This is the relevant code part:


story = fitz.Story(festival_template) # define story
body = story.body  # access the HTML body detail
template = body.find(None, "id", "filmtemplate")  # point 1.

cursor_films.execute(select_films)  # point 2
films = cursor_films.fetchall()  # still point 2

for title, director, year in films:  # iterate through films
    film = template.clone()  # point 3
    # replace the three variables with table data
    film.find(None, "id", "filmtitle").add_text(title)
    film.find(None, "id", "director").add_text(director)
    film.find(None, "id", "filmyear").add_text(str(year))

    # locate the actors of this film
    cursor_casts.execute(select_casts % title)  # point 4
    casts = cursor_casts.fetchall()  # still point 4
    # each actor appears in its own tuple, so we do this:
    actors = "\n".join([c[0] for c in casts])
    film.find(None, "id", "cast").add_text(actors)
    body.append_child(film) # add this film to our Story
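
The cast lookup above builds its SQL statement with Python’s % formatting, so a film title containing a double quote would break the statement. One possible alternative, sketched here with the standard sqlite3 module, is to pass the title as a query parameter. The helper name `cast_of`, the sample rows, and the use of the column name “title” (taken from the table description; the statement quoted in step 4 filters on “film”) are assumptions of this sketch, so adjust them to the actual database.

```python
import sqlite3

# In-memory table with made-up rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (name TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO actors VALUES (?, ?)",
    [
        ("Actor Two", 'The "Quoted" Film'),
        ("Actor One", 'The "Quoted" Film'),
        ("Actor Three", "Another Film"),
    ],
)

# "?" is sqlite3's placeholder; the driver fills in the value safely
select_casts = "SELECT name FROM actors WHERE title = ? ORDER BY name"


def cast_of(connection, title):
    """Return the actor names of one film, joined by line breaks."""
    rows = connection.execute(select_casts, (title,)).fetchall()
    # each row is a 1-tuple holding the actor name
    return "\n".join(name for (name,) in rows)


actors = cast_of(conn, 'The "Quoted" Film')
```

The resulting string can be handed to `add_text` in the same way as in the loop above.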

Conclusion

This post introduced the new ‘Story’ feature in PyMuPDF and walked through a demo script that builds a simple Story layout from an HTML template and an SQL database. Stories are an easy yet flexible way to format and style content in your PDF documents. Story layouts are a great option for styled text, programmatic content, navigation trees, and complex search functionality.

With PyMuPDF, integrating advanced PDF functionality into your Python application is fast and easy.

PyMuPDF is a large, full-featured document-handling Python package. Apart from its superior performance and top rendering quality, it is also known for its excellent documentation: the PDF version today has over 420 pages in Letter format — more than 70 of which are devoted to recipes in How-To format — certainly a worthwhile read.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.

Related PyMuPDF Articles

Advanced Text Manipulation Using PyMuPDF

Text Extraction with PyMuPDF