How to Search and Replace Text in PDFs Using PyMuPDF
Jamie Lemon·July 29, 2025

- What is PyMuPDF?
- Installation
- Basic Text Search and Replace
- Advanced Search and Replace with Better Formatting
- Handling Multiple Replacements
- Case-Insensitive Search
- Regular Expression Support
- Error Handling and Best Practices
- But Wait? What About the Replaced Text?
- Redacting before Replacing
- Limitations and Considerations
- Conclusion
PDF manipulation has always been a challenging task for developers, but PyMuPDF makes it surprisingly straightforward. Whether you need to update company names, fix typos, or replace outdated information across multiple documents, PyMuPDF provides powerful tools for searching and replacing text in PDF files.
What is PyMuPDF?
PyMuPDF is a Python binding for MuPDF, a lightweight PDF toolkit. It's fast, memory-efficient, and offers comprehensive PDF manipulation capabilities including text extraction, rendering, and modification. Unlike some PDF libraries that create new documents, PyMuPDF can modify existing PDFs while preserving their structure and formatting.
Installation
First, install PyMuPDF using pip:
pip install PyMuPDF
Basic Text Search and Replace
Here's a simple example that demonstrates the core functionality:
import pymupdf


def search_and_replace_text(pdf_path, search_text, replace_text, output_path):
    # Open the PDF document
    doc = pymupdf.open(pdf_path)
    # Iterate through each page
    for page in doc:
        # search_for returns one rectangle per match
        for rect in page.search_for(search_text):
            # Add a white rectangle to cover the old text
            page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
            # insert_text expects the start of the text baseline, so anchor at
            # the bottom-left corner (approximate) rather than the top-left
            page.insert_text(rect.bl, replace_text, fontsize=12, color=(0, 0, 0))
    # Save the modified document
    doc.save(output_path)
    doc.close()


# Usage example
search_and_replace_text(
    "input.pdf",
    "Hello World",
    "Goodbye!",
    "output.pdf"
)
Advanced Search and Replace with Better Formatting
The basic approach above works but doesn't preserve the original font formatting. Here's an improved version that attempts to match the original text properties:
import pymupdf


def advanced_search_replace(pdf_path, search_text, replace_text, output_path):
    doc = pymupdf.open(pdf_path)
    for page in doc:
        # Get text blocks with formatting information
        blocks = page.get_text("dict")
        for block in blocks["blocks"]:
            # Image blocks have no "lines" key
            if "lines" not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    # Match with the same case sensitivity that str.replace uses
                    if search_text not in span["text"]:
                        continue
                    # Extract font information
                    size = span["size"]
                    # Span colors are sRGB integers; insert_text wants floats in 0..1
                    color = pymupdf.sRGB_to_pdf(span["color"])
                    # Get the bounding box
                    rect = pymupdf.Rect(span["bbox"])
                    # Replace the text
                    updated_text = span["text"].replace(search_text, replace_text)
                    # Cover old text with white rectangle
                    page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
                    # Insert new text at the span's baseline origin,
                    # reusing the original size and color
                    page.insert_text(
                        span["origin"],
                        updated_text,
                        fontsize=size,
                        color=color
                    )
    doc.save(output_path)
    doc.close()


# Usage example
advanced_search_replace(
    "input.pdf",
    "Hello World",
    "Goodbye!",
    "output.pdf"
)
Note
Matching the font name has not been attempted here, as this is a little more involved and requires matching the extracted font name against its reference ID. See the details in the documentation for fontname.
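Besides the font name, each span stores its color as a single sRGB integer and its style as a bit field in flags. As a rough sketch, the helpers below decode both with plain Python. The function names are made up for this illustration, and the bit meanings are the ones listed in the PyMuPDF documentation for span flags.

```python
def srgb_int_to_rgb(color):
    """Split an sRGB integer such as 0xFF0000 into (r, g, b) floats in 0..1."""
    r = (color >> 16) & 0xFF
    g = (color >> 8) & 0xFF
    b = color & 0xFF
    return (r / 255, g / 255, b / 255)


def describe_flags(flags):
    """Return the style names encoded in a span's flags bit field."""
    names = []
    if flags & 1:
        names.append("superscript")
    if flags & 2:
        names.append("italic")
    if flags & 4:
        names.append("serifed")
    if flags & 8:
        names.append("monospaced")
    if flags & 16:
        names.append("bold")
    return names


print(srgb_int_to_rgb(0xFF0000))
print(describe_flags(18))
```

The float triple is the form that the color argument of insert_text accepts, while the decoded flags can guide the choice of a bold or italic replacement font.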
Handling Multiple Replacements
For bulk replacements, you can create a more flexible function that accepts a dictionary of search-replace pairs:
import pymupdf


def bulk_search_replace(pdf_path, replacements, output_path):
    """
    Replace multiple text strings in a PDF.

    Args:
        pdf_path: Path to input PDF
        replacements: Dictionary with search terms as keys and replacements as values
        output_path: Path for output PDF
    """
    doc = pymupdf.open(pdf_path)
    for page in doc:
        for search_text, replace_text in replacements.items():
            for rect in page.search_for(search_text):
                # Cover the old text, then write the new text on the
                # (approximate) baseline at the bottom-left of the match
                page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
                page.insert_text(rect.bl, replace_text, fontsize=12)
    doc.save(output_path)
    doc.close()


# Usage example
replacements = {
    "Acme Corp": "Super Corp",
    "2023": "2024",
    "john@acme.com": "john@supercorp.com"
}
bulk_search_replace("input.pdf", replacements, "output.pdf")
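One case the loop above does not cover is overlapping search terms, for example "Acme" and "Acme Corp": the covered text is still on the page, so the shorter term would match again inside the longer one. A possible guard, sketched here with plain tuples instead of pymupdf.Rect objects, is to collect all hits first, prefer longer search terms, and skip any hit whose rectangle intersects one already accepted. The function names and the (search_text, rect, replacement) layout are assumptions for illustration.

```python
def rects_intersect(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def drop_overlapping_hits(hits):
    """Keep the hit with the longest search term wherever rectangles overlap.

    hits: list of (search_text, rect, replacement), rect as (x0, y0, x1, y1).
    """
    accepted = []
    # Longest search terms first, so "Acme Corp" wins over "Acme"
    for hit in sorted(hits, key=lambda h: len(h[0]), reverse=True):
        if not any(rects_intersect(hit[1], kept[1]) for kept in accepted):
            accepted.append(hit)
    return accepted


hits = [
    ("Acme", (72, 100, 100, 112), "Super"),
    ("Acme Corp", (72, 100, 130, 112), "Super Corp"),
    ("2023", (72, 200, 96, 212), "2024"),
]
kept = drop_overlapping_hits(hits)
print([h[0] for h in kept])
```

In a real run the rectangles would come from page.search_for, and only the accepted hits would be covered and rewritten.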
Case-Insensitive Search
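Page.search_for already ignores upper and lower case for ASCII characters, but it only returns rectangles. To rewrite the matched text inside its span regardless of case, you'll need to handle the matching manually: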
import re

import pymupdf


def case_insensitive_replace(pdf_path, search_text, replace_text, output_path):
    doc = pymupdf.open(pdf_path)
    # Compile once; re.escape makes the search text match literally
    pattern = re.compile(re.escape(search_text), re.IGNORECASE)
    for page in doc:
        # Get all text on the page
        text_dict = page.get_text("dict")
        for block in text_dict["blocks"]:
            if "lines" not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    original_text = span["text"]
                    # Replace all occurrences (case-insensitive); the function
                    # form inserts replace_text literally, backslashes included
                    new_text = pattern.sub(lambda m: replace_text, original_text)
                    if new_text != original_text:
                        rect = pymupdf.Rect(span["bbox"])
                        # Replace text
                        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
                        page.insert_text(
                            span["origin"],
                            new_text,
                            fontsize=span["size"],
                            # Span colors are sRGB integers; convert to 0..1 floats
                            color=pymupdf.sRGB_to_pdf(span["color"])
                        )
    doc.save(output_path)
    doc.close()


# Usage example
case_insensitive_replace(
    "input.pdf",
    "HeLlo WoRlD",
    "Goodbye!",
    "output.pdf"
)
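The string handling in this example can be tried without a PDF. The sketch below, using a hypothetical helper name, shows why re.escape matters when the search text contains regex metacharacters, and uses a function as the replacement so that backslashes in the replacement text are inserted literally.

```python
import re


def replace_ignoring_case(text, search_text, replace_text):
    """Replace every case-insensitive occurrence of search_text in text."""
    # re.escape makes characters such as "." and "(" match literally
    pattern = re.compile(re.escape(search_text), re.IGNORECASE)
    # A function replacement bypasses backslash processing in replace_text
    return pattern.sub(lambda match: replace_text, text)


print(replace_ignoring_case("Say HELLO world, hello World!", "hello world", "Goodbye"))
# The "." in "v1.0" matches only a literal dot, so "V1x0" is left alone
print(replace_ignoring_case("Version v1.0 and V1x0", "v1.0", r"v2\0"))
```

Without re.escape, the pattern "v1.0" would also match "V1x0", because an unescaped dot matches any character.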
Regular Expression Support
For more complex pattern matching, you can use regular expressions:
import re

import pymupdf


def regex_replace(pdf_path, pattern, replacement, output_path):
    r"""
    Replace text using regular expressions.

    Args:
        pdf_path: Path to input PDF
        pattern: Regular expression pattern to search for
        replacement: Replacement string (can include group references like \1, \2)
        output_path: Path for output PDF
    """
    doc = pymupdf.open(pdf_path)
    compiled_pattern = re.compile(pattern)
    for page in doc:
        text_dict = page.get_text("dict")
        for block in text_dict["blocks"]:
            if "lines" not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    original_text = span["text"]
                    new_text = compiled_pattern.sub(replacement, original_text)
                    if new_text != original_text:
                        rect = pymupdf.Rect(span["bbox"])
                        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
                        # insert_text expects the baseline origin, which the span records
                        page.insert_text(
                            span["origin"],
                            new_text,
                            fontsize=span["size"]
                        )
    doc.save(output_path)
    doc.close()


# Example: Replace all email addresses
regex_replace(
    "input.pdf",
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "email@hidden.com",
    "output.pdf"
)
Note
The above example works one span at a time, so it will not replace emails that are split across multiple lines. Double check your output for these edge cases!
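Group references in the replacement string behave exactly as they do with re.sub on ordinary strings. As a small illustration, with a date pattern chosen purely as an example, the following reorders US-style dates into ISO order:

```python
import re

# Three capture groups: month, day, year
date_pattern = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")

span_text = "Invoice issued 07/29/2025, due 08/28/2025"
# \3, \1 and \2 refer to the year, month and day groups
new_text = date_pattern.sub(r"\3-\1-\2", span_text)
print(new_text)
```

The same pattern and replacement strings could be passed to regex_replace. Keep them as raw strings so that Python does not interpret the backslashes itself.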
Error Handling and Best Practices
Always include proper error handling in production code. The code below checks whether the PDF is password protected and catches any exception raised during processing, reporting the error and returning False instead of crashing.
import pymupdf


def safe_search_replace(pdf_path, search_text, replace_text, output_path):
    try:
        # The context manager closes the document even on early return or error
        with pymupdf.open(pdf_path) as doc:
            if doc.is_encrypted:
                print("PDF is password protected")
                return False
            changes_made = False
            for page in doc:
                for rect in page.search_for(search_text):
                    changes_made = True
                    page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
                    # Anchor the new text on the (approximate) baseline
                    page.insert_text(rect.bl, replace_text, fontsize=12)
                    print("Text replacement made")
            if changes_made:
                doc.save(output_path)
                print(f"Successfully saved modified PDF to {output_path}")
            else:
                print(f"No instances of '{search_text}' found")
            return True
    except Exception as e:
        print(f"Error processing PDF: {e}")
        return False


# Usage example
safe_search_replace(
    "input.pdf",
    "Hello World",
    "Goodbye!",
    "output.pdf"
)
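One more check worth considering: PyMuPDF refuses a plain save onto the file the document was opened from, so it can help to validate the paths before doing any work. A minimal sketch, with a hypothetical helper name, using only the standard library:

```python
from pathlib import Path


def check_paths(pdf_path, output_path):
    """Raise early if the input is missing or the output would overwrite it."""
    src = Path(pdf_path)
    dst = Path(output_path)
    if not src.is_file():
        raise FileNotFoundError(f"Input PDF not found: {src}")
    # resolve() normalises relative paths so "./a.pdf" and "a.pdf" compare equal
    if src.resolve() == dst.resolve():
        raise ValueError("Output path must differ from the input path")
    return src, dst
```

Calling check_paths at the top of a function such as safe_search_replace turns a late save error into an immediate, descriptive one.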
But Wait? What About the Replaced Text?
So far, the examples locate the rectangle around each match and cover it with a filled rectangle (white, on the assumption that the page background is white). That only hides the text visually: the original characters remain in the page content, so text extraction, search, or copy and paste will still return them. To remove the text completely before replacing it, use redactions.
Redacting before Replacing
The following example removes the existing text and replaces it with our new data:
import pymupdf


def search_redact_and_replace_text(
    pdf_path,
    search_text,
    replace_text,
    output_path,
    fill_color=(1, 1, 1),
    text_color=(0, 0, 0),
    fontname="tiro",
    fontsize=14,
):
    # Open the PDF document
    doc = pymupdf.open(pdf_path)
    # Iterate through each page
    for page in doc:
        # Search for text instances
        text_instances = page.search_for(search_text)
        # Mark each instance for redaction
        for rect in text_instances:
            # Create redaction annotation carrying the replacement text
            redact_area = page.add_redact_annot(
                rect,
                text=replace_text,
                fill=fill_color,
                text_color=text_color,
                fontname=fontname,
                fontsize=fontsize,
            )
            # Set additional properties
            redact_area.set_info(content="Redacted sensitive information")
            redact_area.update()
        # Remove the marked text and write the replacements for this page
        page.apply_redactions()
    # Save the modified document
    doc.save(output_path)
    doc.close()


# Usage example
search_redact_and_replace_text(
    "input.pdf",
    "Hello World",
    "Goodbye!",
    "output.pdf"
)
This uses PyMuPDF's add_redact_annot method and sets some defaults for the text options. It is up to you to choose the font properties and background color that best suit the look and feel of your PDF; the earlier examples hint at ways to read those from the existing text.
Using redactions is more secure and probably the method you will want to employ for your search and replace.
Note
Once the document is saved, any original text marked for redaction is completely removed, so make a copy of the original file first if required.
Limitations and Considerations
While PyMuPDF is powerful, there are some important limitations to keep in mind:
Font Matching: The library may not always perfectly match the original font, especially with embedded or custom fonts. Test your results carefully.
Layout Preservation: Complex layouts with overlapping elements or precise positioning might be affected by text replacement. If the replacement is longer than the original, it can easily overlap the next word in the sentence. PDFs are not like Word documents: the layout does not reflow as you insert or remove characters, and new lines or pages are not created automatically if you insert large blocks of text.
Text Recognition: PyMuPDF works with the actual text content in PDFs. It cannot replace text that's embedded as images or in scanned documents.
Performance: For large PDFs or batch processing, consider processing pages in chunks or using multiprocessing for better performance.
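For the layout point above, one simple mitigation is to shrink the replacement's font size until it fits the width of the original rectangle. Text width scales linearly with font size, so the sketch below only needs the width measured at one size; in PyMuPDF that measurement could come from pymupdf.get_text_length. The helper name and the min_size floor are assumptions for illustration.

```python
def fit_fontsize(text_width, fontsize, max_width, min_size=4.0):
    """Return a font size at which text of the given width fits max_width.

    text_width is the rendered width of the replacement at `fontsize`.
    Returns None if the text would have to shrink below min_size.
    """
    if text_width <= max_width:
        return fontsize  # already fits, keep the original size
    # Width is proportional to font size, so scale by the width ratio
    scaled = fontsize * max_width / text_width
    return scaled if scaled >= min_size else None


print(fit_fontsize(50, 12, 100))
print(fit_fontsize(200, 12, 100))
print(fit_fontsize(1000, 12, 100))
```

A None result signals that the replacement is too long for the available space and needs manual attention, for example a shorter wording.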
Conclusion
PyMuPDF provides a robust solution for text search and replacement in PDF documents. While the basic functionality is straightforward to implement, achieving perfect formatting preservation requires more careful handling of font properties and text positioning. The examples provided here should give you a solid foundation for building PDF text manipulation tools tailored to your specific needs.
When using the replace-by-redaction method, always test your replacements on sample documents first, and consider creating backups of important PDFs before performing bulk modifications.