How to Journal With PyMuPDF

Harald Lieder·May 30, 2023

In this article

Demonstrating PDF Journaling
How to Save Journal and PDF for Later Restart
How to Open a PDF Snapshot and the Associated Journal
How to Use Journaling for Detecting Unauthorized Updates
Conclusion

Demonstrating PDF Journaling

For an introduction to PDF journaling, please see Part 1: Resilience of PyMuPDF in Handling Interruptions.

In this blog post we demonstrate how to save a journaled PDF together with its journal so that both can be reopened later, for instance to continue journaled updating or to undo and redo operations. You will learn how to:

Log changes to an existing file and save the current state.
Restart or continue a previously saved journaling session.
Use the journaling feature to detect unauthorized changes.

How to Save Journal and PDF for Later Restart

We open an existing PDF and add a new page with some text lines on it, much as we did in the previous blog post.

import pymupdf
from pprint import pprint

if tuple(map(int, pymupdf.VersionBind.split("."))) < (1, 19, 0):
    raise ValueError("Need PyMuPDF v1.19.0 or higher")

doc = pymupdf.open("1page.pdf")  # work with an existing PDF
doc.journal_enable()  # enable journaling for it
doc.journal_start_op("new page")
page = doc.new_page()
doc.journal_stop_op()
# insert 5 text lines, each within its own operation:
for i in range(5):
    doc.journal_start_op("insert-%i" % i)
    page.insert_text((100, 100 + 20*i), "This is line %i." % i)
    doc.journal_stop_op()

We now take a snapshot of the current PDF and save its journal. This is useful, for example, when you want to submit the document to someone for review before any redactions are applied.

snapname = doc.name.replace(".pdf", "-snap.pdf")
logname = doc.name.replace(".pdf", "-snap.log")

doc.save_snapshot(snapname)
doc.journal_save(logname)
doc.close()

How to Open a PDF Snapshot and the Associated Journal

The resulting file, “1page-snap.pdf”, is a valid PDF in every respect: it can be displayed or printed, text can be extracted, etc.

When opening the snapshot PDF and loading the associated journal, any changes applied during journaling can be undone, or more changes can be applied. When finished, take another snapshot and save the journal file again, and so forth.

doc = pymupdf.open(snapname)  # open last update state of the PDF
doc.journal_load(logname)  # load the - matching! - journal file

When the journal file is loaded as above, the following actions take place:

Read the content of the journal and confirm that the document matches it.
If successful, journaling is automatically enabled and the current journal position is established.

If the journal does not match the PDF, an exception is raised. This can be used to detect changes to a PDF, as described in the section on detecting unauthorized updates.

We now make a few checks to see what we have got:

print(f"Snapshot PDF '{snapname}' has the following update status:")
print()
pos, count = doc.journal_position()
print(f"Journal position {pos}, operations count {count}.")
for i in range(count):
    print("Operation %i: '%s'" % (i, doc.journal_op_name(i)))

actions = doc.journal_can_do()
print()
print("Possible actions:")
print("    undo: '%s'" % actions["undo"])
print("    redo: '%s'" % actions["redo"])

Snapshot PDF '1page-snap.pdf' has the following update status:

Journal position 6, operations count 6.
Operation 0: 'new page'
Operation 1: 'insert-0'
Operation 2: 'insert-1'
Operation 3: 'insert-2'
Operation 4: 'insert-3'
Operation 5: 'insert-4'

Possible actions:
    undo: 'True'
    redo: 'False'

How to Use Journaling for Detecting Unauthorized Updates

As a side benefit, the journaling feature can be used to confirm that a PDF is still the expected version, or to detect unauthorized changes.

Follow this approach to store the current PDF state:

doc = pymupdf.open("input.pdf")
doc.journal_enable()
doc.journal_save(doc.name + "-status.log")  # choose a suitable journal filename
doc.close()

Even if no updates have been made, the journal contains at least a so-called fingerprint (a hash value) that can be used to confirm the PDF’s identity. The fingerprint is independent of any password protection, the file name, and PDF-internal /ID field values. In our case, the journal looks like this:

%!MuPDF-Journal-100

journal
<<
/NumSections 0
/FileSize 210721
/Fingerprint <57c84501e4baddef56fd26959a808cfc>
/HistoryPos 0
>>
endjournal
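
For a quick look at what such a journal records, the header fields can be read with a few regular expressions. This sketch is based only on the sample shown above and is not a parser for the journal format in general; `parse_journal_header` is a hypothetical helper name.

```python
import re

# The sample journal from above (whitespace between tokens does not matter here).
SAMPLE = """%!MuPDF-Journal-100

journal
<<
/NumSections 0
/FileSize 210721
/Fingerprint <57c84501e4baddef56fd26959a808cfc>
/HistoryPos 0
>>
endjournal
"""


def parse_journal_header(text):
    """Return the header fields found in a journal text as a dict.

    Hypothetical helper for illustration; it only knows the four
    fields that appear in the sample.
    """
    fields = {}
    for key in ("NumSections", "FileSize", "HistoryPos"):
        match = re.search(r"/%s\s+(\d+)" % key, text)
        if match:
            fields[key] = int(match.group(1))
    match = re.search(r"/Fingerprint\s*<([0-9a-fA-F]+)>", text)
    if match:
        fields["Fingerprint"] = match.group(1).lower()
    return fields


header = parse_journal_header(SAMPLE)
print(header)
```

Reading the header this way is only a convenience for inspection. The actual identity check is the one performed by `journal_load()`.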

When processing the PDF in some downstream application, perform the following check:

filename = "input.pdf"
doc = pymupdf.open(filename)
try:
    doc.journal_load(filename + "-status.log")  # load previously saved journal
except Exception:
    print(f"Unauthorized changes to file '{filename}' detected.")
    raise
print(f"Confirming: file '{filename}' is in expected state.")
doc.close()  # to switch off journaling
doc = pymupdf.open(filename)  # reopen without a journal
print(f"Journaling enabled: {doc.journal_is_enabled()}.")  # confirming: journaling disabled

Confirming: file 'input.pdf' is in expected state.
Journaling enabled: False.

Conclusion

In this blog post we have learned:

How to log updates to an existing PDF and save the current state.
How to resume a previous journaling session and continue updating.
How to confirm the expected state of a PDF in downstream applications.

The PyMuPDF library offers a plethora of other features to work with PDF documents, such as extracting text, images, annotations, and much more. Make sure to explore the official PyMuPDF documentation to discover more about its capabilities.

Another knowledge source is the utilities repository. Whatever you plan to do with PDFs, you will probably find an example script there to get you started.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.