PyMuPDF Explored: Low-Level Access to PDF Objects

Harald Lieder·December 5, 2022

In this article

PDF Objects
Inspecting Object Definitions
Accessing PDF Trailer and Catalog
- Accessing the Trailer
- Accessing the Catalog
Interpreting Single Object Definitions
- Example: Returning the PDF Page Layout
Updating PDF Objects
- Example: Setting PDF Page Layout
Conclusion
Related PyMuPDF Articles

This article is part of a series on the functionality of PyMuPDF.

PyMuPDF is a Python binding for MuPDF – a lightweight PDF, XPS, eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

(Py-) MuPDF can access files in PDF, XPS, OpenXPS, CBZ, MOBI, EPUB, and FB2 (eBooks) formats, and it is known for its top performance and high rendering quality.

PyMuPDF’s homepage is on GitHub. It can be installed from PyPI via "pip install pymupdf".

PDF Objects

The internal structure of a PDF consists of objects, each identified by a cross-reference number (xref for short). The xref is the key for locating an object in the file. At the end of the file (the “trailer”) there is normally a table of the xref numbers, which records each object’s starting position. This allows rapid object access.

PDF objects are defined using ASCII text. The syntax of these definitions is specified in the ISO standard 32000-1 and can be consulted in PDF reference manuals.

For the most frequent uses, like text extraction/output or rendering of page images, you need not be concerned with this technical detail.

For special requirements, however, PyMuPDF does provide access to internal PDF structures. Such requirements may include:

- General interest.
- Special debugging purposes: verifying that applications work correctly, tracing the cause of issues, etc.
- Creating Python apps that complement PyMuPDF’s high-level features. An example could be reading and setting the viewing options in the PDF catalog (PageMode, PageLayout, and similar).
- Locating and repairing damage to the PDF structure.

Inspecting Object Definitions

Use this snippet to print all object definitions of a PDF by walking through the XREF table. An item’s position in this table equals its xref number. Please note that item 0 is reserved for technical purposes and must not be touched:

import pymupdf
doc = pymupdf.open("file.pdf")  # open the file
xreflen = doc.xref_length()  # the number of entries in the XREF table
for xref in range(1, xreflen):  # skip item 0!
	print("")
	print(f"Object {xref}, stream: {doc.xref_is_stream(xref)}")
	print(doc.xref_object(xref, compressed=False))

The output looks like this:

Object 1, stream: False
<<
	/ModDate (D:20170314122233-04'00')
	/PXCViewerInfo (PDF-XChange Viewer;2.5.312.1;Feb  9 2015;12:00:06;D:20170314122233-04'00')
>>

Object 2, stream: False
<<
	/Type /Catalog
	/Pages 3 0 R
>>
...
Object 4, stream: False
<<
	/Type /Page
	/Annots [ 6 0 R ]
	/Parent 3 0 R
	/Contents 7 0 R
	/MediaBox [ 0 0 595 842 ]
	/Resources 8 0 R
>>
...
Object 7, stream: True
<<
	/Length 494
	/Filter /FlateDecode
>>
...
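
For objects reported as streams, the listing above shows only the stream's dictionary, not its data. The following is a minimal sketch of how the data could be inspected as well. It assumes that doc is an open PyMuPDF document and relies on doc.xref_stream(xref), which returns the decompressed stream content as bytes. The helper name stream_sizes is made up for this illustration.

```python
def stream_sizes(doc):
    """Map the xref of each stream object to the size of its decompressed data."""
    sizes = {}
    for xref in range(1, doc.xref_length()):  # skip item 0
        if not doc.xref_is_stream(xref):
            continue  # plain object without stream data
        data = doc.xref_stream(xref)  # decompressed bytes
        sizes[xref] = len(data)
    return sizes
```

Calling stream_sizes(pymupdf.open("file.pdf")) gives a quick overview of where the bulk of a file's content lives, for example page contents, images, or embedded fonts.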

Accessing PDF Trailer and Catalog

The trailer and the catalog are special objects, which contain global parameters (number of pages, encryption information) and pointers to other structures (the page tree, table of contents, metadata, embedded files, etc.). Both objects must be present in every PDF.

Accessing the Trailer

The trailer object is the only one without an xref number, so we use -1 instead.

import pymupdf
doc = pymupdf.open("PyMuPDF.pdf")
print(doc.xref_object(-1))  # or: print(doc.pdf_trailer())

The output looks like this:

<<
/Type /XRef
/Index [ 0 8263 ]
/Size 8263
/W [ 1 3 1 ]
/Root 8260 0 R					% points to the catalog, required
/Info 8261 0 R					% points to the metadata, optional
/Length 19883
/Filter /FlateDecode
>>

The trailer’s /Root value is the xref of the catalog (8260). The /Info key points to the metadata (xref 8261).
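
Instead of reading the catalog's xref off the printed output, it can also be extracted programmatically. The sketch below uses xref_get_key (explained further down in this article) with -1 to address the trailer. It assumes that the /Root entry is reported with type "xref" and a value like "8260 0 R". The function name catalog_xref_from_trailer is hypothetical.

```python
def catalog_xref_from_trailer(doc):
    """Return the catalog's xref number as found in the trailer's /Root key."""
    objtype, value = doc.xref_get_key(-1, "Root")  # -1 addresses the trailer
    if objtype != "xref":  # /Root must be an indirect reference
        raise ValueError(f"unexpected /Root entry: {value!r}")
    # value looks like "8260 0 R": xref number, generation number, "R"
    return int(value.split()[0])
```

For an intact PDF, the result should agree with doc.pdf_catalog(), which is used in the next section.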

Accessing the Catalog

import pymupdf
doc = pymupdf.open("PyMuPDF.pdf")
xref = doc.pdf_catalog()  # get xref (= 8260 above) of the catalog
print(doc.xref_object(xref))  # print object definition

The output looks like this:

<<
	/Type/Catalog				 % object type
	/Pages 3593 0 R			   % points to page tree
	/OpenAction 225 0 R		   % action to perform on open
	/Names 3832 0 R			   % points to global names tree
	/PageMode /UseOutlines		% initially show the TOC
	/PageLabels<</Nums[0<</S/D>>2<</S/r>>8<</S/D>>]>> % page label definitions
	/Outlines 3835 0 R			% points to table of contents
>>

Interpreting Single Object Definitions

The PDF specification defines the following object types:

- boolean (true, false)
- integer
- real (i.e. float)
- string (comparable to Python strings, but always enclosed in brackets “()” or “<>”)
- name (similar to Python identifiers / variable names, always starts with a slash “/”)
- array (similar to Python lists, always enclosed in “[]”)
- dictionary (similar to Python dictionaries, always enclosed in “<<>>”)
- stream (roughly comparable to Python bytes, always enclosed by the keywords “stream” / “endstream”)
- null (like Python None)
- indirect object (a reference to an xref number, format “nnn 0 R”, where nnn is a positive number)

Probably the most important object type is the dictionary, which is a list of key-value pairs. Dictionary keys are always name objects; values can be any of the object types above, including other dictionaries.

To tell apart dictionary types, a dictionary in most cases has a /Type or a /Subtype key. For example, pages, images, and fonts are special dictionary types.

There are dozens of dictionary types. Please consult the specification manuals for details.

In PyMuPDF, many objects have a property xref, which is a positive integer for PDF documents (and 0 otherwise) and can then be used to access the respective definition. To display a PDF page definition, we can therefore do this:

import pymupdf
doc = pymupdf.open("pymupdf.pdf")
page = doc[0]  # load the first page
print(doc.xref_object(page.xref))  # show its object definition

The output looks like this:

<<
	/Type /Page				% this is a page dictionary
	/Contents 1297 0 R		 % rendering commands in xref 1297
	/Resources 1296 0 R		% fonts, images etc. in xref 1296
	/MediaBox [ 0 0 612 792 ]  % the page rectangle
	/Parent 1301 0 R		   % points to the page tree structure
>>

PyMuPDF helps interpret PDF dictionaries:

- xref_get_key(xref, key) returns the type and the value of a dictionary key. The return value is always a tuple of strings (type, value), regardless of the type. For an array, for example, it would look like ("array", "[0 0 612 792]"). The type string must therefore be used to decide how to interpret the value string. Note in this example that array items are separated by spaces, not by commas as in Python.
- xref_get_keys(xref) returns a tuple of the dictionary’s keys, similar to Python’s keys() method for dictionaries.

So for our page example, we can do this:

print(doc.xref_get_keys(page.xref))
('Type', 'Contents', 'Resources', 'MediaBox', 'Parent')

To compute the page’s MediaBox rectangle we can do this:

objtype, val = doc.xref_get_key(page.xref, "MediaBox")
# strip off [] brackets and convert items to a float tuple
mediabox = tuple(map(float, val[1:-1].split()))
# this is equal to the corresponding page property:
mediabox == tuple(page.mediabox)
True
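
The same kind of string inspection can be generalized. The following is a sketch, not part of PyMuPDF, of a converter that turns a (type, value) pair into a Python object where that is straightforward. It assumes the type strings "null", "bool", "int", "float", "name", "xref" and "array" as we understand them from the PyMuPDF documentation. Strings, dictionaries and non-numeric arrays are deliberately left untouched. The name pdf_value_to_python is hypothetical.

```python
def pdf_value_to_python(objtype, value):
    """Convert a (type, value) string pair to a Python object where this is simple."""
    if objtype == "null":
        return None
    if objtype == "bool":
        return value == "true"
    if objtype == "int":
        return int(value)
    if objtype == "float":
        return float(value)
    if objtype == "name":
        return value[1:]  # drop the leading slash
    if objtype == "xref":
        return int(value.split()[0])  # "1297 0 R" -> 1297
    if objtype == "array":
        items = value[1:-1].split()  # strip off [] brackets
        try:
            return [float(item) for item in items]  # numeric arrays only
        except ValueError:
            return value  # mixed or nested array: leave as string
    return value  # strings and dictionaries are returned unchanged
```

With this helper, pdf_value_to_python(*doc.xref_get_key(page.xref, "MediaBox")) yields a list of four floats for the page above.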

As in Python, PDF dictionaries can be nested: the value of a PDF dictionary key can again be a dictionary. This is regularly the case for the /Resources dictionary of a page, where you will usually find the /Font sub-dictionary, which in turn contains a sub-dictionary for every font used on the page.

Such dictionary hierarchies can be expressed by using a path-like notation as the key: doc.xref_get_key(page.xref, "Resources/Font") for example, will directly deliver the /Font sub-dictionary. There is no nesting level limit.
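
To see what sits at each level of such a path, one could query every prefix of it in turn. This is a small sketch under the assumption that doc is an open PyMuPDF document; describe_path is a made-up helper name. A level that does not exist is expected to be reported with type "null".

```python
def describe_path(doc, xref, path):
    """Return (sub_path, type) pairs for every level of a key path."""
    parts = path.split("/")
    levels = []
    for i in range(1, len(parts) + 1):
        sub_path = "/".join(parts[:i])  # "Resources", then "Resources/Font", ...
        objtype, _ = doc.xref_get_key(xref, sub_path)
        levels.append((sub_path, objtype))
    return levels
```

For example, describe_path(doc, page.xref, "Resources/Font") shows whether /Resources is stored inline or as an indirect reference, and whether a /Font sub-dictionary is present at all.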

Example: Returning the PDF Page Layout

For demonstration purposes, we define our own Python function that returns a PDF’s default page layout.

def page_layout(doc):
	"""Return the PDF standard page layout."""
	xref = doc.pdf_catalog()  # xref of the catalog
	_, val = doc.xref_get_key(xref, "PageLayout")
	if val == "null":  # not defined
		return "undefined"
	else:
		return val
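
The same pattern works for other name-valued catalog entries such as /PageMode. Below is a slightly more general sketch. The function catalog_name and its default parameter are our own invention for this illustration, and unlike page_layout above it strips the leading slash from the result.

```python
def catalog_name(doc, key, default=None):
    """Return the name stored under a catalog key, without the leading slash."""
    objtype, value = doc.xref_get_key(doc.pdf_catalog(), key)
    if objtype != "name":  # key is missing ("null") or is not a name object
        return default
    return value[1:]
```

For the catalog shown earlier, catalog_name(doc, "PageMode") would be expected to return "UseOutlines", while catalog_name(doc, "PageLayout", "undefined") falls back to the default because that key is absent.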

Updating PDF Objects

It is possible to create new objects in a PDF by generating a new xref number.

Existing objects may be updated via their xref number using method doc.update_object(xref, source). The string "source" must contain the new object definition.

The format of the source string determines the type of the object as defined above: a string bracketed by “[]” will result in an array, “<<>>” will make a dictionary, etc.

So obviously, this method should be used only if one is sure about what will happen. Errors are likely to render (parts of) the PDF unusable.
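
As an illustration of creating a new object, the following sketch reserves a new xref with doc.get_new_xref(), initializes it as an empty dictionary via doc.update_object(), and then fills it key by key. The helper name add_dictionary_object and the shape of its entries argument (a Python dict of key names to PDF syntax strings) are assumptions made for this example.

```python
def add_dictionary_object(doc, entries):
    """Create a new dictionary object and return its xref number."""
    xref = doc.get_new_xref()  # reserve a new entry in the XREF table
    doc.update_object(xref, "<<>>")  # start with an empty dictionary
    for key, value in entries.items():
        doc.xref_set_key(xref, key, value)  # values are PDF syntax strings
    return xref
```

Note that an object created this way is not referenced from anywhere yet. To become part of the document it still has to be linked from another object, for example with a value like f"{xref} 0 R".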

For updating dictionary objects, the method doc.xref_set_key(xref, key, value) offers a more elegant and much safer alternative.

Example: Setting PDF Page Layout

In continuation of the above example, we extend PyMuPDF with another homegrown function that sets the page layout by modifying the catalog:

def set_page_layout(doc, layout):
	"""Set the PDF standard page layout."""
	if not layout.startswith("/"):  # accept the value with or without slash
		layout = "/" + layout
	if layout not in (
		"/SinglePage", "/OneColumn", "/TwoColumnLeft",
		"/TwoColumnRight", "/TwoPageLeft", "/TwoPageRight",
		):
		raise ValueError("bad page layout value")
	doc.xref_set_key(doc.pdf_catalog(), "PageLayout", layout)

Please note the slash prefixing “layout”: the value of dictionary key /PageLayout must be a PDF name and thus start with “/”.
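
To undo such a setting, the key can be removed again. As far as we understand the PyMuPDF documentation, writing the value "null" with xref_set_key has this effect. The sketch below relies on that assumption, and clear_page_layout is a hypothetical name.

```python
def clear_page_layout(doc):
    """Remove the /PageLayout key from the catalog, if present."""
    doc.xref_set_key(doc.pdf_catalog(), "PageLayout", "null")
```

In either case, the change only affects the document in memory. It becomes permanent once the document is saved, for instance with doc.save("output.pdf").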

Conclusion

This post introduced PyMuPDF’s low-level functions, with sample code for inspecting object definitions, accessing the PDF trailer and catalog, and reading and setting the PDF page layout. Access to low-level PDF structures enables numerous customizations of PDF documents.

With PyMuPDF, integrating advanced PDF functionality into your Python application is fast and easy.

PyMuPDF is a large, full-featured document-handling Python package. Apart from its superior performance and top rendering quality, it is also known for its excellent documentation: the PDF version today has over 420 pages in Letter format — more than 70 of which are devoted to recipes in How-To format — certainly a worthwhile read.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.
