Optional Content: Discovering the PDF Layers

Harald Lieder·May 9, 2023


Handling PDF files is a common task in the world of software development, whether it is reading, writing, or editing PDF documents. Python, a versatile and powerful programming language, has numerous libraries to aid in this process. One of these libraries is PyMuPDF, a wrapper around the popular MuPDF library.

PyMuPDF offers an intuitive and efficient interface to easily work with PDF files. Both its broad range of functions and its top performance make it a leader among Python PDF libraries.

In this blog post, we will look at a PDF feature that hardly any other package seems to address: PDF Optional Content. We will cover the following topics:

  1. Overview: What Are PDF Optional Content Layers?
  2. How to Create a PDF With Layers
  3. How to Access the Content of Layers

Overview: What Are PDF Optional Content Layers?

Optional Content is a PDF feature to show or hide objects. The Optional Content capability is useful in complex PDF documents containing items such as CAD drawings, technical construction plans, layered artwork, maps, or multi-language documents.

An object’s visibility can be made dependent on the boolean value of another, special type of object: a so-called Optional Content Group (OCG).

OCGs can be true or false, which translates to “ON” or “OFF”. A PDF object (like text, image, or vector graphics) will be hidden or shown depending on the attached OCG’s status. Each object can only have one OCG attached to it.

A PDF using the optional content feature will usually contain multiple OCGs. Different configurations of individual OCG states can be stored in separate layers. You can activate these layers temporarily or permanently, changing the visibility of many PDF objects at once.

A PDF with Optional Content always has a standard OC layer; a PDF without Optional Content support has none. The standard layer is the one that is activated when the PDF is opened.

More complexity comes into play with OCMDs (Optional Content Membership Dictionaries): these are logical expressions over the states of one or more OCGs within the currently active layer.

As mentioned, the visibility of any PDF object can only depend on one item. But that item may also be an OCMD, the state of which can depend on multiple OCGs.

In this way, a logical condition like “show this text if OCGx is not ON” can be put in an OCMD and attached to the text.
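
To make the idea of such a condition concrete, here is a minimal pure-Python sketch of how an OCMD’s state could be derived from the states of its OCGs. The four policy names (AllOn, AnyOn, AnyOff, AllOff) are the ones defined by the PDF specification; the function name ocmd_is_on is made up for this illustration and is not part of PyMuPDF.

```python
def ocmd_is_on(ocg_states, policy="AnyOn"):
    """Return True if an OCMD with this policy counts as ON.

    ocg_states: iterable of booleans, one per OCG referenced by the OCMD.
    """
    states = list(ocg_states)
    if not states:
        # Assumption based on our reading of the PDF specification:
        # an OCMD without any OCGs does not restrict visibility.
        return True
    if policy == "AllOn":
        return all(states)
    if policy == "AnyOn":
        return any(states)
    if policy == "AnyOff":
        return not all(states)
    if policy == "AllOff":
        return not any(states)
    raise ValueError(f"unknown policy: {policy!r}")


# "show this text if OCGx is not ON" corresponds to policy "AllOff" on [OCGx]
print(ocmd_is_on([True], "AllOff"))   # OCGx is ON, so the text is hidden: False
print(ocmd_is_on([False], "AllOff"))  # OCGx is OFF, so the text is shown: True
```

With a single OCG in the list, “AllOff” simply negates that OCG’s state, which is the kind of condition described above.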

How to Create a PDF With Layers

In this section we will create a bilingual PDF page from scratch. It shows either English or German text in the same rectangle, which we will give a gray background and a blue border.

The standard page appearance will show the English text and the rectangle. The German text will be shown automatically whenever English is set to OFF.

Note

We assume you already know how to install PyMuPDF and open or close documents using this package. If you’re unsure, check the documentation.

The code below is designed to run as a Jupyter Notebook. You can find and run the full version here.

This first code block is just for setting up the Jupyter environment:

# making sure we have required packages installed
!python -m pip install pymupdf
!python -m pip install ipyplot


import fitz  # import PyMuPDF


def show_image(pix):
    """Display a pixmap.

    Just to display images - ignore the man behind the curtain.
    """
    import ipyplot, numpy as np
    # assumes an RGB pixmap without alpha channel (3 bytes per pixel)
    img = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples)
    _ = ipyplot.plot_images([img], labels=[""], custom_texts=[""], img_width=500)

Here is the actual application code. We will define a new empty PDF document with one page. We also define the rectangle which will contain both alternatives of the text we want to show (English and German), and the text versions themselves.

Note

The English text is quoted from this Wikipedia page. The German text is quoted from here.

# This is the rectangle we want to draw as text background
rect = fitz.Rect(50, 50, 400, 300)


doc = fitz.open()  # make a new PDF
# also give it a page - a bit larger than the rectangle
page = doc.new_page(height=400, width=500)


# This is the English, respectively German text we want to show:
text_en = (
    "The false killer whale (Pseudorca crassidens) is a species of oceanic"
    " dolphin that is the only extant representative of the genus Pseudorca."
    " It is found in oceans worldwide but mainly in tropical regions."
    " It was first described in 1846 as a species of porpoise based on a skull,"
    " which was revised when the first carcasses were observed in 1861."
    " The name 'false killer whale' comes from having a skull similar to the"
    " orca (Orcinus orca), or killer whale."
    "\nThe false killer whale reaches a maximum length of 6 m (20 ft), though"
    " size can vary around the world. It is highly sociable, known to form"
    " pods of up to 50 members, and can also form pods with other dolphin"
    " species, such as the common bottlenose dolphin (Tursiops truncatus)."
)
text_de = (
    "Der Kleine Schwertwal (Pseudorca crassidens), auch bekannt als Unechter"
    " oder Schwarzer Schwertwal, ist eine Art der Delfine (Delphinidae) und"
    " der einzige rezente Vertreter der Gattung Pseudorca. Er ähnelt dem"
    " Orca in Form und Proportionen, ist aber einfarbig schwarz und mit"
    " einer Maximallänge von etwa sechs Metern deutlich kleiner. Kleine"
    " Schwertwale bilden Schulen von durchschnittlich zehn bis fünfzig"
    " Tieren, wobei sie sich auch mit anderen Delfinen vergesellschaften und"
    " sich meistens abseits der Küsten aufhalten. Sie sind in allen Ozeanen"
    " gemäßigter, subtropischer und tropischer Breiten beheimatet, sind"
    " jedoch vor allem in wärmeren Jahreszeiten auch bis in die gemäßigte"
    " bis subpolare Zone südlich der Südspitze Südamerikas, vor Nordeuropa"
    " und bis vor Kanada anzutreffen."
)

The following three code lines define the necessary Optional Content specifications:

  • oc_gr and oc_en are OCGs, which are set to ON by default. Only the name is mandatory.
  • oc_de is an OCMD. The policy parameter makes it dependent on the status of oc_en: with “AllOff”, its status is ON whenever all the OCGs in the list ocgs are OFF. In our case this means that its visibility status is always the opposite of oc_en.

oc_gr = doc.add_ocg("graphics")
oc_en = doc.add_ocg("english")
oc_de = doc.set_ocmd(ocgs=[oc_en], policy="AllOff")

The remaining code …

  • Draws the gray rectangle giving it the “graphics” OCG.
  • Inserts the English text in the rectangle, giving it the “english” OCG.
  • Inserts the German text in the same rectangle, giving it the OCMD.

# Draw rectangle as background for the text
page.draw_rect(
    rect + (-5, -5, 5, 5),  # enlarge by 5 pt for nicer text appearance
    fill=fitz.pdfcolor["gray80"],  # fill color some gray
    color=fitz.pdfcolor["blue"],  # border color blue
    oc=oc_gr,  # give it OCG "graphics"
    dashes="[3 1] 0",  # dashed border: 3 pt dash, 1 pt gap, phase 0
)


# Write the text into the rectangle.
page.insert_textbox(
    rect,  # the "box"
    text_en,  # English text
    fontsize=12,  # font size
    oc=oc_en,  # give it the "english" OCG
)


page.insert_textbox(
    rect,  # the "box"
    text_de,  # German text
    fontsize=12,  # font size
    oc=oc_de,  # give it the non-"english" OCMD
)


# ------------------------------------------------------------------------------------
# Just some technical stuff - required by our notebook environment only:
# Recycle the PDF reopening it in memory, then show what we have done so far.
# ------------------------------------------------------------------------------------
pdfbytes = doc.tobytes(clean=True)
doc.close()
doc = fitz.open(stream=pdfbytes, filetype="pdf")
page = doc[0]
show_image(page.get_pixmap())

As expected, the gray rectangle and the English text are both visible.

Once defined, an OCG or OCMD may be attached to as many objects as required within the document.

In the next section we will show how to programmatically explore the Optional Content situation and how to show or hide objects.

How to Access the Content of Layers

In this section we will show you how to navigate through a PDF’s Optional Content information. You will learn how to detect content that is not visible in the normal view, and then switch visibility states to make that content available.

The two main methods are:

  1. doc.layer_ui_configs(): Returns a tuple of dictionaries. If the empty tuple () is returned, the PDF has no Optional Content. Each dictionary describes an OCG with the following important keys:
    - ‘locked’: (bool) whether the state may be changed at all.
    - ‘number’: (int) identifying number.
    - ‘on’: (bool) the current visibility state.
    - ‘text’: (str) the name given to the OCG.
  2. doc.set_layer_ui_config(number, action): Use this to change the state of the OCG with this sequence number. This method does the same thing as offered by supporting PDF viewers. The action is an integer with the following meanings: 0 = set to ON (default), 1 = toggle ON/OFF, 2 = set to OFF.

# display the modifiable OCGs and their state
for item in doc.layer_ui_configs():
    print(item)

{'number': 0, 'text': 'graphics', 'depth': 0, 'type': 'checkbox', 'on': True, 'locked': False}
{'number': 1, 'text': 'english', 'depth': 0, 'type': 'checkbox', 'on': True, 'locked': False}
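
Because the entries are plain dictionaries, ordinary Python is enough to work with them. The sketch below uses the two dictionaries printed above as sample data; the helper names layer_number and switchable are made up for this illustration.

```python
# Sample data: the two dictionaries printed above
configs = [
    {"number": 0, "text": "graphics", "depth": 0, "type": "checkbox",
     "on": True, "locked": False},
    {"number": 1, "text": "english", "depth": 0, "type": "checkbox",
     "on": True, "locked": False},
]


def layer_number(configs, name):
    """Return the 'number' of the first entry whose 'text' equals name."""
    for item in configs:
        if item["text"] == name:
            return item["number"]
    raise KeyError(name)


def switchable(configs):
    """Return the names of all entries that are not locked."""
    return [item["text"] for item in configs if not item["locked"]]


print(layer_number(configs, "english"))  # 1
print(switchable(configs))  # ['graphics', 'english']
```

Looking up the number by name avoids hard-coding sequence numbers, which may differ from one PDF to the next.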

First, we will switch “english” (with number = 1 above) to OFF and observe the appearance change of the page: the English text has been replaced by the German version.

Text extraction follows the same rule: if you extracted the page’s text at this point, you would get the German text.

doc.set_layer_ui_config(1, action=2)  # switch OFF "english"
show_image(page.get_pixmap())  # show page again

Now also switch off the background and look again:

doc.set_layer_ui_config(0, action=2)  # switch OFF "graphics"
show_image(page.get_pixmap())  # show page again

The gray rectangle has disappeared.

Just like switching OCGs on or off in a PDF viewer, the method above does not permanently modify the PDF.

Here is how the above situation would be shown by Adobe Acrobat:

Both OCGs are OFF

To permanently set a different behavior, like only showing German text with no background, we can use the following method and then save the document:

doc.set_layer(-1, off=[oc_en, oc_gr], on=[])  # -1 = the standard (default) layer

Conclusion

In this blog post, we’ve learned how to:

  1. Create an English-German bilingual PDF document,
  2. Detect hidden Optional Content and access it, and
  3. Permanently set visibility defaults

using PyMuPDF. You can now create your own documents, uncover hidden content, and change default behavior of Optional Content.

Remember, the PyMuPDF library offers a plethora of other features to work with PDF documents, such as extracting text, images, annotations, and much more. Make sure to explore the official PyMuPDF documentation to discover more about its capabilities.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.