Unmasking Fake Redactions With PyMuPDF

Harald Lieder·June 22, 2023

PyMuPDFRedactions

Learn how to detect fake redactions in PDF documents with the Python library PyMuPDF.

In this article

Detection Strategy
Wrapping Up

PyMuPDF, a Python binding for the PDF processing library MuPDF, is an exceptional tool for working with PDF and other document formats. Today, we’re going to delve into a fascinating use-case: Detecting fake redactions.

You may have encountered blacked-out sections in some documents. These are called redactions and they are meant to conceal sensitive information. But sometimes, what appears as a redaction might not be an actual deletion. The information is simply covered by a black rectangle, misleading the reader into thinking that the text is gone. In reality, the text is still there and can be extracted, potentially leaking sensitive information. This is called a “fake redaction”, and it is of particular concern in areas where it is crucial to protect personal information — for example, in a legal context.

Tolerating fake redactions poses significant legal and financial risks. Over the years there have been a number of high profile documents that have fallen victim to improperly redacted data. In 2021, the European Commission published an improperly redacted PDF version of its contract with multinational pharmaceutical company AstraZeneca, unintentionally revealing significant portions of the redacted text. In another instance, lawyers for Paul Manafort failed to redact sensitive content, allowing a reporter to reveal the full contents of the document in the case against the former campaign chairman. See this Artifex video for a detailed coverage of high-security PDF redactions.

Fortunately, PyMuPDF has ways to combat the problem of improper redactions.

Note

Do have a look at the “x-ray” project (part of the Free Law Project), which makes active use of PyMuPDF features to help detect fake redactions.

Let’s explore how PyMuPDF can help detect situations like these. The following picture shows the location of three black rectangles, which we will prove to be fake redactions.

An example document taken from the Free Law Project containing improperly redacted text.

This document has been taken from the examples of the above-mentioned Free Law Project.

Detection Strategy

Our detection strategy involves some understanding of how a page’s appearance is created in a PDF document: A page is rendered by using a set of commands that every PDF viewer executes one by one to build the visual appearance. Therefore, if some text is simply covered by drawing a black rectangle over it, then that draw command must occur after the writing command for the text.

Our detection approach therefore involves two key steps:

Find rectangular black, non-transparent vector graphics.
Find text characters positioned in one of these rectangles, that are written before the rectangle is drawn.

The following code snippet executes Step 1: find “suspicious” black rectangles:

# find black, non-transparent rectangles
vector_rects = {}  # black rectangles and their sequence number
for p in page.get_drawings():  # walk through page vector graphics
    if p["type"] == "s":  # ignore if "stroke" (no fill color)
        continue
    if p["fill_opacity"] != 1: # ignore if transparent
        continue
    if p["fill"] != black: # ignore if not black fill color
        continue
    for item in p["items"]:  # ignore if not a rectangle
        if not item[0] == "re":
            continue
        rect = item[1]  # the rectangle coordinates
        if rect.width <= 3 or rect.height <= 3:  # ignore if small
         continue
        vector_rects[rect] = p["seqno"]  # this is a true candidate

print(f"There are {len(vector_rects)} suspicious rectangles.")

There are 3 suspicious rectangles.

The above code will indeed discover the three black rectangles on the page — including the sequence number when they are drawn.

The piece of code for Step 2 identifies every single character on the page, together with its location (boundary box — “bbox”) and of course also the sequence number when it is being written. This looks a little more complex:

# make a list of all characters with bbox and writing sequence number
chars = []  # characters go in this list
for span in page.get_texttrace():  # this gives us all we need
    seqno = span["seqno"]  # writing sequence number of text piece
    for c, _, o, b in span["chars"]:  # iterate over character items
        ch = chr(c)  # convert character code to string
        if not ch.isalnum():  # only need alphanumeric text
         continue  # no spaces, punctuation etc.
        bbox = pymupdf.Rect(b)  # convert tuple to a Rect object
     chars.append((ch, bbox, seqno))  # store char info

Finally, we bring together the above two pieces of information. For each black rectangle, we check if it covers any text that is written before:

for i, r in enumerate(vector_rects.keys()):  # iterate over rectangles
 # list of characters (partly) covered by r and written earlier
 rseqno = vector_rects[r]  # sequence number of r
 contains = [  # list of characters intersecting r
     c for c, bbox, seqno in chars if bbox.intersects(r) and seqno < rseqno
 ]
 if contains != []:  # if this rectangle covers any text, show it
     text = "".join(contains)
     print(f"Rectangle {i} hides '{text}'.")

The previous code will print the following text lines:

Rectangle 0 hides ‘estate’.

Rectangle 1 hides ‘Estate’.

Rectangle 2 hides ‘Estate’.

If you want to try out the above code yourself, please download the respective Jupyter notebook here.

Wrapping Up

By running the above algorithm, we can accurately uncover fake redactions and expose the text that was not correctly erased. Using this approach, we have a powerful tool to detect insufficiently protected sensitive information, helping to maintain confidentiality in documents.

PyMuPDF, with its rich set of features, thus becomes an invaluable resource not just for PDF manipulation, but also for document security and privacy concerns. Remember, the true power of a programming language or a library is determined by how creatively you use it to solve problems.

So, next time you come across a redacted PDF, you know there’s a way to verify its integrity!

The PyMuPDF library offers many other features to work with PDF documents, such as extracting images, annotations, and much more. Be sure to explore the official PyMuPDF documentation to discover more of its capabilities.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.