OCR and PDF Redactions – A Tale of Two Technologies

By Lisa Fenn / Robin Watts / Michael Vrhel - Tuesday, September 29, 2020

A Quick Overview of OCR

Optical Character Recognition (OCR) is the process of converting printed paper documents, scanned raster images, digital camera images, etc. into a searchable text format.

The use cases apply to any industry that needs to digitize documents or records: passport recognition at airports, assistive technology for visually impaired users, and digitizing files, records, and forms to build searchable PDF libraries.

This past year, the engineers at Artifex Software uncovered a unique use case that solves a serious problem: closing known and unknown leaks in redacted PDF documents. This technique, which we've dubbed High-Security Redactions, marries an OCR component with our redaction tool.

PDF Redactions and Their Common Fails

PDF redaction software has become a standard business tool over the years, and there are many to choose from. A typical tool allows the user to choose areas to redact and then remove the information. A sanitize step removes metadata from the document. But how do you know the redaction software truly works as intended? Do you know what the code is doing under the hood?

As we all know, PDF redaction fails are all too common. Over the past ten years, a number of high-profile documents have been released with improperly redacted data. As recently as 2019, lawyers for Paul Manafort submitted improperly redacted PDF documents, exposing content that was meant to stay hidden.

There are a number of obvious reasons a redaction can go wrong: human error, bugs in software, or improper methods such as changing the font color or blacking out text with a comment tool. There are also less obvious reasons. PDF is a complex format, and sensitive data can easily hide within the file structures.
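
To make the "blacking out" failure concrete, here is a minimal sketch in Python using only the standard library. It is not our redaction code. It assumes a toy, uncompressed page content stream (not a complete PDF file) in which a line of text is drawn and then covered by a filled black rectangle, and it shows that the text is still sitting in the stream. The sample stream, the sample string, and the helper name `extract_shown_text` are all made up for illustration; real PDFs usually compress their streams and use font encodings that make extraction more involved.

```python
import re

# A toy, uncompressed PDF page content stream (an assumption for illustration).
# It shows one line of text, then paints a filled black rectangle over it,
# which is roughly what "blacking out" with a drawing tool amounts to.
CONTENT_STREAM = b"""BT
/F1 12 Tf
72 700 Td
(Secret: Jane Doe) Tj
ET
0 0 0 rg
70 695 150 16 re
f
"""

# Matches a literal string operand followed by the Tj (show text) operator.
# Simplified on purpose: it handles backslash escapes, but not nested
# parentheses, hex strings, or TJ arrays.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_shown_text(stream: bytes) -> list[str]:
    """Return the literal strings passed to Tj operators (hypothetical helper)."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


if __name__ == "__main__":
    # The rectangle hides the text on screen, but not in the file.
    print(extract_shown_text(CONTENT_STREAM))
```

The black rectangle only changes what is painted on the page. The string handed to the text-showing operator is untouched, so anything that reads the stream (copy and paste, a text extractor, or a few lines of script) can still recover it.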

Through our testing, we’ve identified a number of places redacted information can hide and methods for recovering this data. We tested a number of popular solutions and found they all had areas of data leakage.

Check out the talk!

Artifex Principal Engineer Dr. Michael Vrhel discusses our High-Security Redaction technique and closing known routes for leakage in redacted documents.

https://youtu.be/1VfUyvmbCFQ