Resilience of PyMuPDF in Handling Interruptions

Harald Lieder·May 23, 2023

PyMuPDFJournaling

Resilience of PyMuPDF in Handling Interruptions

In this article

Why Journal?
A Basic PyMuPDF Journaling Walkthrough
- Create a PDF and Enable Journaling
- Investigating and Navigating the Journal
Wrapping up

Why Journal?

Long-running applications, updating multiple logically interconnected databases, cannot afford to be restarted from the beginning after crashes happen, like power or hardware failures. This would not only entail repeating previous work, but require prior restoration of all involved databases — often enough long-running jobs by themselves.

A properly designed application restarted in emergency restart mode will read the logfile, look up the last point at which all involved databases had been in a consistent state (a so-called checkpoint), roll back any logged updates thereafter and resume its work from that point on.

Modern database management systems (DBMS) include journaling facilities. This feature allows logging changes to databases in special files, so-called journals or logfiles.

Note

DBMS journaling provides several benefits, including data integrity, fault tolerance, and the ability to recover from system failures. It ensures that database modifications are recorded and can be redone or undone as necessary, maintaining the consistency and durability of the data.

In short PyMuPDF Journaling provides fail safe features for your application.

PyMuPDF & Journaling

PyMuPDF supports a journaling feature which makes the above DBMS concepts available for updating PDFs. This means that applications which require database updating may now integrate PDF changes in their “checkpoint-restart logic”.

But independent from any databases, it is also possible to put incremental updates of a PDF under the control of PDF Journaling. Together with the journal file, it is possible to reinstate earlier versions of the document.

A note on terminology:

Updates to (SQL) databases are grouped in “Logical Units of Work” (LUWs). Consistent database configurations exist at the start and end of a LUW. Writing a checkpoint therefore always happens after the end of an LUW — never in between.
In PyMuPDF the term “operation” plays the role of an LUW.

A Basic PyMuPDF Journaling Walkthrough

This blog post is the first of a series of two posts dealing with the topic “PDF Journals”. Its goal is to acquaint the reader with the concept of “Journaling”, the terminology and the basic workings. The second blog post in this series will describe concrete use cases, including dealing with journal files.

Note

PDF Journaling is a unique, unequaled feature of PyMuPDF. It allows keeping advanced control over updates to PDF documents, granular undoing of changes and an independent way to detect unauthorized changes.

The basic concepts of PDF Journaling using PyMuPDF include:

How to activate journaling and define operations.
How to make changes and how to navigate between them.

Create a PDF and Enable Journaling

We create a new empty PDF and enable journaling for it as follows:

doc = pymupdf.open()  # work with an empty PDF
doc.journal_enable()  # enable journaling

After journaling is enabled for a PDF, all updates are being logged, and therefore must be executed within the scope of some operation (called LUW in DBMS, Logical Unit of Work).

We first try an update without having an active operation — just to see how the raised exception looks like:

try:
    page = doc.new_page()
except Exception as e:
    print(e)

No journaling operation started.

Operations are started and stopped via methods journal_start_op() and journal_stop_op(). Between these two statements, any number of updates to the PDF may happen — this is entirely your decision. Undoing an operation will revert all updates within it, so you should plan the design of your operations with due diligence.

In the following, we add a new page to the document and wrap it in an operation.

doc.journal_start_op("add new page")  # define start of an operation
page = doc.new_page(width=200, height=150)
doc.journal_stop_op()  # define stop of operation

This time it worked: we have a page. Now insert some text lines — each within its own operation, so they can be individually undone.

The name of an operation is entirely documentary and is left to your discretion.

for i in range(5):
    doc.journal_start_op(f"add line-{i}")
    # insert next line 20 points below previous one
    page.insert_text((50, 40 + 20*i), f"This is line {i}.")
    doc.journal_stop_op()

Investigating and Navigating the Journal

PyMuPDF will append all update activity to an internal log, called “journal” (or sometimes also “history”).

There are methods that let us investigate the journal content in various ways. It is possible to navigate up and down inside the journal.

What is our current operation number and what is the total number of operations?

This is likely to be the first piece of information that we are interested in.

pos, count = doc.journal_position()  # returns: (current position, operations count)
print(f"Log position {pos}, operation count {count}.")

Log position 6, operation count 6.

Okay, so this tells us that the current operation was the sixth out of a total number of six. We are positioned at the end of the journal.

How about a list of all the operations?

for i in range(count):  # loop over total number of operations
    print(f"Operation {i}: {doc.journal_op_name(i)}")  # show the operation name

Operation 0: add new page
Operation 1: add line-0
Operation 2: add line-1
Operation 3: add line-2
Operation 4: add line-3
Operation 5: add line-4

Undoing & Redoing

From our current position let’s see what we can do next.

The are always two options:

“undo” — this causes the updates of the current operation to be reverted (undone) and moves us one position backwards in the journal. Use method journal_undo().
“redo” — this will re-apply the updates of the operation that follows the current one and makes it the current operation. Use method journal_redo().

As we are currently positioned at the very last operation, there is no “next” operation. Therefore, we get the following result:

doc.journal_can_do()  # what can we do?

{‘undo’: True, ‘redo’: False}

Let us nevertheless try to redo something to see the exception:

Already at end of history

Only undo activities are possible, currently. Let us try one, but first we look at what is on the page:

show_image(page, "Page Content")

Undo the current operation and display the journaling status as above.

doc.journal_undo()  # undo an operation
doc.journal_position()  # where are we in the journal?

(5, 6)

doc.journal_can_do()  # what can we do?

{‘undo’: True, ‘redo’: True}

So, again as expected:

Our current position inside the journal is reduced by one — but there are still all six entries in the journal.
We can do both: undo and redo.

If everything worked as expected, we should only see 4 lines of text.

show_image(page, "Page Content after Undo")

Good! The last text insertion was reverted.

Let us change our mind again and redo (re-apply) the undone operation. This will re-execute everything contained in the operation following the current one. In other words, our fifth text line should reappear.

doc.journal_redo()  # redo reverted operation

True

And let’s confirm that we have five text lines again:

show_image(page, "Page Content after Redo")

We have learned so far, that undoing and redoing uses the journal like a playbook to change the PDF. These actions do not change the journal itself.

Adding a New Operation to a Journal

However, if, after some undos, a new operation is executed, all journal entries will be deleted that follow the current position, and the new operation will become the last / current one. Look at the following example.

We undo the last three operations (i.e. delete the last three lines) and then draw a rectangle as new operation.

for i in range(3):  # revert last three operations
    doc.journal_undo()
pos, count = doc.journal_position()
print(f"journal position: {pos}")
print(f"operations count: {count} (unchanged)")

journal position: 3 operations count: 6 (unchanged)

# draw rectangle
doc.journal_start_op("draw rectangle")
page.draw_rect((50, 80, 120, 130), fill=pymupdf.pdfcolor["blue"])
doc.journal_stop_op()

show_image(page, "Page Image after Drawing an Object")

Also look what happened to the journal:

pos, count = doc.journal_position()
print(f"journal position: {pos}")
print(f"operations count: {count}")
print("\nList of operations:")
for i in range(count):
    print(f"Operation {i}: {doc.journal_op_name(i)}")

journal position: 4
operations count: 4

List of operations:
Operation 0: add new page
Operation 1: add line-0
Operation 2: add line-1
Operation 3: draw rectangle

Wrapping up

Journaling is a useful method to control your document fidelity and history.

Some key points:

The journal is kept inside a memory area maintained by PyMuPDF.
The journal remains available until the PDF is closed.
Once enabled, journaling of a PDF cannot be disabled again.
Journal data can be exported to a file (or file-like object).
“Undo” reverts the operation at the current position. If the current position is the top of the journal, an exception is raised.
“Redo” re-executes the operation following the current position. If the current position is at the end of the journal, an exception is raised.
Executing operations after an undo removes all journal entries following the current position.

In this blog post we have learned the basics about PDF Journaling using PyMuPDF:

How to enable journaling for a PDF.
How to extract journal information.
How to navigate in a journal and undo or redo operations.
How adding new operations affects a journal.

Continue on to Part 2: How to Journal With PyMuPDF to learn how to log changes, restart sessions, and detect unauthorized edits.

The PyMuPDF library offers a plethora of other features to work with PDF documents, such as extracting text, images, annotations, and much more. Make sure to explore the official PyMuPDF documentation to discover more about its capabilities.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.