How to Save a PDF Document with PyMuPDF: Encryption and Much More!

Harald Lieder·April 25, 2023

PyMuPDFSecurity

How to Save a PDF Document with PyMuPDF: Encryption and Much More!

In this article

1. Installing PyMuPDF
2. Opening a PDF Document
3. Saving a PDF Document
4. Closing the Document
Conclusion

Handling PDF files is a common task in the world of software development, whether it’s reading, writing, or editing PDF documents. Python, a versatile and powerful programming language, has numerous libraries to aid in this process. One of these libraries is PyMuPDF, a wrapper around the popular MuPDF library. PyMuPDF provides a simple and efficient interface to manipulate PDF files with ease.

In this blog post, we’ll walk through the process of saving a PDF document using Python and PyMuPDF. We will cover the following topics:

Installing PyMuPDF
Opening a PDF document
Saving a PDF document
Closing a PDF document

1. Installing PyMuPDF

Like with every Python package, before we can use PyMuPDF we must first install it. You can install PyMuPDF using pip, the standard Python package installer. Open a terminal or command prompt and enter the following command:

pip install pymupdf

2. Opening a PDF Document

To open a PDF document, we'll use the pymupdf.Document() function from the PyMuPDF library. First, we need to import the library:

import pymupdf

Now, we can open a PDF document by specifying its file path:

input_pdf_path = "example.pdf"
doc = pymupdf.open(input_pdf_path)

3. Saving a PDF Document

If you updated your PDF document, you would certainly want to save these changes. The Document method save() will do this job for you.

But even if you did not change the document at all, you may nevertheless need to create a new version. For example, you may want to encrypt the document’s content, or you want to store it on a web site and you therefore need a file format that supports fast viewing in a browser.

Yet another situation exists if you only made minor changes and your document file is large. In that case, you will be interested in a solution that does not write the full file new version but somehow lets you “append” your updates to the file.

The parameters of method Document.save() let you adapt to a wide range of requirements.

3.1 Encrypting and Decrypting a PDF

This allows you to protect the content of your document against unauthorized access by password-protecting it. The PDF concept allows you to define two different passwords for a PDF document:

Owner password: provides full access to the document, including decryption and defining new password versions.
User password: defines a subset of permissions when dealing with the document. For example, all access to the document may be password protected. Or viewing only is allowed but not any changes, or only changes of a certain category.

Here’s how to create a protected document version with “full-access” as the owner password and “restricted-access” as the user password. We’ll also define a collection of permissions for those using the user password. Keep in mind, both passwords must be UTF-8 strings of length forty or less.

output_pdf_path = "encrypted.pdf"
owner_pw = ”full-access”
user_pw = ”resticted-access”
encr_method = pymupdf.PDF_ENCRYPT_AES_256  # strongest encryption

# define a bit field of user permissions:
perm = (pymupdf.PDF_PERM_ACCESSIBILITY  # always use this
    | pymupdf.PDF_PERM_PRINT  # permit printing
 | pymupdf.PDF_PERM_COPY  # permit copying
 | pymupdf.PDF_PERM_ANNOTATE  # permit adding annotations
)
doc.save(output_pdf_path,
 owner_pw=owner_pw,  # set the owner password
 user_pw=user_pw,  # set the user password
 encryption=encr_method,  # set the encryption method
 permissions=perm,  # set the user permissions
)

Users attempting to open “encrypted.pdf” will be met with a prompt to enter a password. Permissions will be granted depending on whether the owner or user password is provided. This protection works the same way for apps trying to access the document.

Now, let’s see how PyMuPDF can open a password-protected PDF document and create a decrypted version:

doc = pymupdf.open(“encrypted.pdf”)

# check if password protection exists and let the user enter a password
if doc.needs_pass:  # we need the owner password!
    owner_pw = input(“enter owner password:”)

rc = doc.authenticate(owner_pw)  # authenticate with the password string
if rc not in (1, 4, 6):  # authorization levels including ownership
    sys.exit(“Bad owner password.”)

# you now have full (owner) document access
# save a decrypted version of the document
encr_method = pymupdf.PDF_ENCRYPT_NONE  # option for removing encryption
doc.save(“decrypted.pdf”,
    encryption=encr_method,
)

3.2 Creating a PDF for Fast Web Access

PDF documents are often large. If stored online, having to wait for a multi-megabyte download to see the first page is less than ideal. That’s why the PDF specification supports a “fast web access” or “linearized” file format that lets you see the first pages while the document is still downloading.

Here’s how to save a linearized version of your PDF:

input_file = “input.pdf”
output_file = “linearized.pdf”
doc = pymupdf.open(input_file)

# we can also check if the document already is linear:
If doc.is_fast_webaccess:
    sys.exit(“Document already linear!”)
doc.save(output_file, linear=True)

# upload the file “linearized.pdf” to some internet site as you normally would

3.3 Saving Incrementally

Appending changes to a PDF document without creating a new file is known as “incremental” save. Choose this option if you’ve made minor changes to a large document — if the file size is large, maybe dozens of megabytes, just appending a few hundred bytes will be multiple order magnitude faster than rewriting the whole file — or if your PDF is digitally signed and you can’t or don’t want to break this state. An advanced PDF viewer will still report the document as signed and note any subsequent changes.

Because incremental updates modify the PDF in place, the output path name must equal the input. Using the Document object’s property name, do the following:

doc.save(doc.name, incremental=True, encryption=pymupdf.PDF_ENCRYPT_KEEP)

Or use the shorter alias: doc.saveIncr().

The Document property version_count will increase by one after this.

Incremental saves aren’t always possible. Check Document.can_save_incrementally() to prevent exceptions in cases like damaged documents, newly created PDFs, or structural changes like encryption or compression.

3.4 Controlling File Size: Compression and Garbage Collection

The PDF file format is based on ASCII text — formatted in a unique way such that its objects can be located rapidly. To save space, some objects may be stored in compressed format or, conversely, decompressed to help with debugging or manual tinkering.

Earlier document updates may have generated outdated versions of objects which are still physically present in the file, occupying space, but are no longer accessible — like ghosts. The save method offers multiple ways to bust these ghosts and compress objects within the PDF. Here’s a quick overview of the most important parameters:

garbage=n: With an integer n, taking values from 0 to 4. Zero means no garbage collection. With increasing thoroughness, unused objects can be removed, and duplicate objects or even binary content can be joined. A typical, recommended value is garbage=3.
expand=True: Decompresses binary content. Mainly for debugging purposes, but some apps may be incapable of dealing with compressed object types.
deflate=True: Compresses binary content (where possible and beneficial). For finer control, try deflate_images=True and deflate_fonts=True.

If you want to go beyond these possibilities offered by the save method, consider the following two Document methods:

Document.subset_fonts(): This can have a large file size effect if you inserted text using your own font files. It shrinks font file sizes by keeping only the characters actually used in the document. Fonts supporting Asian characters benefit a lot here.
Document.scrub(): With over a dozen options, this method removes information and objects. Primarily for data protection but removing thumbnail images can also slim down your PDF.

3.5 How to Handle Documents in Memory

PyMuPDF can read documents from a memory area instead of a local disk. Downloads from popular cloud services (Google, AWS, Microsoft Azure) usually deliver Python bytes objects representing PDF documents.

Instead of storing them on your computer, you can directly open these objects via doc = pymupdf.open(stream=bytes_object). Nothing else in your app needs to be changed to work with a document created like this. This has obvious advantages.

Of special interest in the context of this article is the method Document.tobytes().

This is the version of Document.save() that saves to memory — not to disk. It supports exactly the same parameters as save — so everything you’ve learned so far still applies. You can use the produced bytes object to upload it to your cloud storage, for example.

Here’s a code snippet that downloads a PDF from AWS, converts it for fast web access, and uploads it again:

import pymupdf
import boto3


s3 = boto3.client("s3")


# fill in your credentials to access the cloud
response = s3.get_object(
    Bucket="string",  # choose your appropriate value
    Key="string"  # choose your appropriate value
)


body = response["Body"]


# define Document with these data
doc = pymupdf.open(stream=body.read())


# make a linear version
linear_version = doc.tobytes(linear=True)


# upload to cloud service again
rc = s3.write_get_object_response(
    Body=linear_version,
    RequestRoute=request_route,  # choose your appropriate value
    RequestToken=request_token,  # choose your appropriate value
)
doc.close()

Note

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python.

4. Closing the Document

When you’re done working with the PDF document, it’s good practice to properly close it with doc.close(). This releases associated resources and relinquishes control of the file to the operating system, thus making it available to other processes before your app ends.

Python experts will appreciate that a Document in PyMuPDF is a Python context manager, enabling this construct:

with pymupdf.open(input_file) as doc:
    # do some work with the document, then
    doc.subset_fonts()  # optionally compress fonts
    doc.save(…)  # save when finished

# the document will automatically have been closed at this point

Conclusion

In this blog post, we’ve learned how to save a PDF document using Python and the PyMuPDF library. By following these simple steps, you can easily create, modify, and save PDF files in your Python projects.

Remember, PyMuPDF offers a cornucopia of other features to work with PDF documents, like extracting text, images, annotations, and more. Don’t forget to explore the official PyMuPDF documentation to learn even more about its capabilities: https://pymupdf.readthedocs.io/en/latest/.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.