Optimizing PDF File Size with PyMuPDF: Three Essential Techniques
Harald Lieder·June 18, 2025

In today’s fast-paced workflows, bulky PDFs can become a bottleneck — slowing down email attachments, consuming valuable storage, and frustrating mobile viewers. When high-resolution images, embedded fonts, and hidden metadata bloat your files, a targeted optimization strategy is essential. This article walks you through three core techniques — stripping metadata and other dead weight, image compression, and font subsetting — and shows how PyMuPDF’s straightforward API makes it easy to turn oversized documents into leaner and faster PDFs.
1. Dead-Weight Removal
Why it’s popular
PDFs tend to accumulate “dead weight” in the form of hidden metadata (author names, timestamps, revision histories), page thumbnails, embedded files, long annotation chains — and even stale form-field values. All that extra baggage not only inflates file size but can leak sensitive information.
Typical use cases
- Publishing whitepapers or specifications publicly
- Embedding PDFs in websites or apps
- Stripping private data before distribution
How PyMuPDF helps
With one call to Document.scrub(), you can clean out everything you don’t need:
import pymupdf
doc = pymupdf.open("input.pdf")
doc.scrub(
    metadata=True,  # Clears basic metadata
    xml_metadata=True,  # Removes XML metadata
    attached_files=True,  # Deletes file attachments
    embedded_files=True,  # Deletes embedded files
    thumbnails=True,  # Strips page thumbnails
    reset_fields=True,  # Reverts form fields to their defaults
    reset_responses=True,  # Removes annotation replies
)
doc.ez_save("lean.pdf")
Here, scrub() wipes out unwanted objects, and ez_save() (pronounced “easy save”) guarantees that logically deleted content is physically purged from the output. The result is a smaller, privacy-safe PDF.
For optimum results, call scrub() only once, immediately before saving the file.
For details on file saving, see section “4. Advanced Save Options”.
2. Font Subsetting
Why it’s popular
Embedding full font files — often tens or hundreds of kilobytes each — turns a simple PDF into a heavy download, especially when the document only uses a handful of characters.
Typical use cases
- Creating multilingual manuals with large character sets
- Creating rich-text annotations
- Creating or updating rich-text widgets (form fields)
How PyMuPDF helps
PyMuPDF can automatically subset embedded fonts, keeping only the glyphs actually used for each font:
doc.subset_fonts()
doc.ez_save("output.pdf")
This process slashes font-related overhead without sacrificing visual fidelity.
Important
Execute this method once only, immediately before saving the file.
3. Advanced Image Compression
Why it’s popular
High-resolution images are often the single biggest contributor to PDF bloat. A few 300 DPI photos can add tens of megabytes — killing upload speeds, clogging inboxes, and frustrating mobile users.
Typical use cases
- Emailing slide decks, product catalogs, or brochures
- Publishing lightweight PDFs for mobile apps
- Archiving scanned documents on space-restricted drives
How PyMuPDF helps
PyMuPDF’s Document.rewrite_images() gives you fine-grained control over downsampling, recompression, and conversion to grayscale:
import pymupdf
doc = pymupdf.open("input.pdf")
doc.rewrite_images(
    dpi_threshold=100,  # only process images above 100 DPI
    dpi_target=72,  # downsample to 72 DPI
    quality=60,  # JPEG quality level
    lossy=True,  # process lossy images (False to skip them)
    lossless=True,  # process lossless images (False to skip them)
    bitonal=True,  # process monochrome images (False to skip them)
    color=True,  # process colored images (False to skip them)
    gray=True,  # process grayscale images (False to skip them)
    set_to_gray=True,  # convert images to grayscale before recompressing
)
doc.ez_save("compressed_images.pdf")
In this example, every image above 100 DPI is downsampled to 72 DPI, converted to grayscale, and re-encoded as JPEG at quality level 60, which often cuts image size by 70–90%.
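To get a feel for what a DPI threshold means in practice, the arithmetic can be worked through by hand. The sketch below assumes that an image's effective resolution is its pixel width divided by its displayed width in inches (72 points per inch); how rewrite_images() measures resolution internally is not covered here, and both helper names and the sample image dimensions are hypothetical.

```python
def effective_dpi(pixel_width, display_width_pt):
    """DPI of an image shown display_width_pt points wide (72 points = 1 inch)."""
    return pixel_width / (display_width_pt / 72)


def downsampled_width(pixel_width, current_dpi, dpi_target):
    """Pixel width after scaling the image down to dpi_target."""
    return round(pixel_width * dpi_target / current_dpi)


# Hypothetical image: 2550 pixels wide, shown across a Letter page (612 pt = 8.5 in).
dpi = effective_dpi(2550, 612)
new_width = downsampled_width(2550, dpi, 72)
# The pixel count shrinks with the square of the linear scale factor.
pixel_ratio = (dpi / 72) ** 2

print(dpi, new_width, round(pixel_ratio, 1))  # prints: 300.0 612 17.4
```

Going from 300 DPI to 72 DPI therefore leaves roughly one seventeenth of the original pixels before JPEG recompression is even applied, which is why downsampling tends to dominate the savings.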
Absolute Minimum Size
If you truly don’t need images, you can remove them entirely via redaction annotations:
for page in doc:
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_REMOVE,  # remove images
        graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,  # don't touch graphics
        text=pymupdf.PDF_REDACT_TEXT_NONE,  # don't touch text
    )
doc.ez_save("images_stripped.pdf")
Here, redaction annotations purge all page images, leaving only text and vector graphics behind.
4. Advanced Save Options
All the scrubbing, image downsampling, and font subsetting you have done so far happened only in memory. Unless unreferenced objects (“ghosts”) are physically purged and the PDF’s internal streams are compressed when the file is written, your file will remain just as bulky.
PyMuPDF’s Document.save() parameters can trigger garbage collection and compression at write-time:
- garbage=3: De-duplicates and removes all objects no longer referenced.
- deflate=True: Applies zlib compression to any uncompressed streams (images, fonts, etc.).
- use_objstms=True: Converts text-based PDF object definitions into streams that can be compressed, often cutting a further 25% or more off the file size.
doc.save(
    "output.pdf",
    garbage=3,  # de-duplicate and drop unreferenced objects
    deflate=True,  # zlib-compress any loose streams
    use_objstms=True,  # convert text objects into compressible streams
)
# Or simply:
doc.ez_save("output.pdf")
The ez_save() method applies those options under the hood, ensuring your on-disk PDF truly reflects the optimizations you’ve made.
Conclusion
Combining dead-weight removal, image optimization, and font subsetting turns oversized PDFs into sleek, lean documents — ideal for email, mobile apps, and web publishing. PyMuPDF’s straightforward API puts these powerful techniques at your fingertips, so you can focus on your content rather than delivery constraints. Ready to supercharge your PDF workflow? Dive into the PyMuPDF documentation and start trimming!
Learn more
Share your projects and connect with others on the PyMuPDF Forum.