Unmasking Fake Redactions With PyMuPDF
Harald Lieder·June 22, 2023

PyMuPDF, a Python binding for the PDF processing library MuPDF, is an exceptional tool for working with PDF and other document formats. Today, we’re going to delve into a fascinating use-case: Detecting fake redactions.
You may have encountered blacked-out sections in some documents. These are called redactions and they are meant to conceal sensitive information. But sometimes, what appears as a redaction might not be an actual deletion. The information is simply covered by a black rectangle, misleading the reader into thinking that the text is gone. In reality, the text is still there and can be extracted, potentially leaking sensitive information. This is called a “fake redaction”, and it is of particular concern in areas where it is crucial to protect personal information — for example, in a legal context.
Tolerating fake redactions poses significant legal and financial risks. Over the years there have been a number of high profile documents that have fallen victim to improperly redacted data. In 2021, the European Commission published an improperly redacted PDF version of its contract with multinational pharmaceutical company AstraZeneca, unintentionally revealing significant portions of the redacted text. In another instance, lawyers for Paul Manafort failed to redact sensitive content, allowing a reporter to reveal the full contents of the document in the case against the former campaign chairman. See this Artifex video for a detailed coverage of high-security PDF redactions.
Fortunately, PyMuPDF has ways to combat the problem of improper redactions.
Note
Do have a look at the “x-ray” project (part of the Free Law Project), which makes active use of PyMuPDF features to help detect fake redactions.
Let’s explore how PyMuPDF can help detect situations like these. The following picture shows the location of three black rectangles, which we will prove to be fake redactions.

This document has been taken from the examples of the above-mentioned Free Law Project.
Detection Strategy
Our detection strategy involves some understanding of how a page’s appearance is created in a PDF document: A page is rendered by using a set of commands that every PDF viewer executes one by one to build the visual appearance. Therefore, if some text is simply covered by drawing a black rectangle over it, then that draw command must occur after the writing command for the text.
Our detection approach therefore involves two key steps:
- Find rectangular black, non-transparent vector graphics.
- Find text characters positioned in one of these rectangles, that are written before the rectangle is drawn.
The following code snippet executes Step 1: find “suspicious” black rectangles:
# find black, non-transparent rectangles
vector_rects = {} # black rectangles and their sequence number
for p in page.get_drawings(): # walk through page vector graphics
if p["type"] == "s": # ignore if "stroke" (no fill color)
continue
if p["fill_opacity"] != 1: # ignore if transparent
continue
if p["fill"] != black: # ignore if not black fill color
continue
for item in p["items"]: # ignore if not a rectangle
if not item[0] == "re":
continue
rect = item[1] # the rectangle coordinates
if rect.width <= 3 or rect.height <= 3: # ignore if small
continue
vector_rects[rect] = p["seqno"] # this is a true candidate
print(f"There are {len(vector_rects)} suspicious rectangles.")
There are 3 suspicious rectangles.
The above code will indeed discover the three black rectangles on the page — including the sequence number when they are drawn.
The piece of code for Step 2 identifies every single character on the page, together with its location (boundary box — “bbox”) and of course also the sequence number when it is being written. This looks a little more complex:
# make a list of all characters with bbox and writing sequence number
chars = [] # characters go in this list
for span in page.get_texttrace(): # this gives us all we need
seqno = span["seqno"] # writing sequence number of text piece
for c, _, o, b in span["chars"]: # iterate over character items
ch = chr(c) # convert character code to string
if not ch.isalnum(): # only need alphanumeric text
continue # no spaces, punctuation etc.
bbox = pymupdf.Rect(b) # convert tuple to a Rect object
chars.append((ch, bbox, seqno)) # store char info
Finally, we bring together the above two pieces of information. For each black rectangle, we check if it covers any text that is written before:
for i, r in enumerate(vector_rects.keys()): # iterate over rectangles
# list of characters (partly) covered by r and written earlier
rseqno = vector_rects[r] # sequence number of r
contains = [ # list of characters intersecting r
c for c, bbox, seqno in chars if bbox.intersects(r) and seqno < rseqno
]
if contains != []: # if this rectangle covers any text, show it
text = "".join(contains)
print(f"Rectangle {i} hides '{text}'.")
The previous code will print the following text lines:
Rectangle 0 hides ‘estate’.
Rectangle 1 hides ‘Estate’.
Rectangle 2 hides ‘Estate’.
If you want to try out the above code yourself, please download the respective Jupyter notebook here.
Wrapping Up
By running the above algorithm, we can accurately uncover fake redactions and expose the text that was not correctly erased. Using this approach, we have a powerful tool to detect insufficiently protected sensitive information, helping to maintain confidentiality in documents.
PyMuPDF, with its rich set of features, thus becomes an invaluable resource not just for PDF manipulation, but also for document security and privacy concerns. Remember, the true power of a programming language or a library is determined by how creatively you use it to solve problems.
So, next time you come across a redacted PDF, you know there’s a way to verify its integrity!
The PyMuPDF library offers many other features to work with PDF documents, such as extracting images, annotations, and much more. Be sure to explore the official PyMuPDF documentation to discover more of its capabilities.
Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.
If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.