Text Extraction Strategies with PyMuPDF
Jamie Lemon · May 30, 2025

Have you ever wondered why your extracted data feels incomplete, or suspected that not all of your document was being considered during extraction? Or has your pipeline been needlessly held up by long, unwieldy document processing times? In this article we discuss the two main approaches to text extraction, native and OCR, and look into smart strategies for choosing how and when to use each.
Understanding Native Text Extraction
As you can imagine, this technique uses core PyMuPDF functionality to simply read the text from a document. We use the Page.get_text() method to extract any actual content which is identified as "text" within the PDF.
- What it is: Extracting text that's already digitally embedded in the PDF
- How it works: Direct access to the text objects in the PDF structure (see the sketch after this list)
- Advantages:
  - Lightning-fast processing
  - Perfect accuracy (when text exists)
  - Preserves original formatting and fonts
  - Low computational requirements
- Limitations:
  - Only works with digitally-created PDFs
  - Fails completely with scanned documents
  - Can struggle with complex layouts
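For example, the output format argument of Page.get_text() gives direct access to those text objects at different granularities. A minimal sketch (the file name is a placeholder):

import pymupdf

doc = pymupdf.open("a.pdf")  # any digitally-created PDF
page = doc[0]
plain = page.get_text("text")     # plain text with line breaks
words = page.get_text("words")    # list of (x0, y0, x1, y1, word, ...) tuples
blocks = page.get_text("blocks")  # text grouped into layout blocks, with coordinates
doc.close()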
Understanding OCR (Optical Character Recognition)
This method uses open-source, third-party technology (Tesseract) to scan the page for images and convert that imagery into text. Imagine PDFs which contain screenshots of information: these will just be identified as "image" within the PDF, but we still want machine-readable text. This method uses PyMuPDF's Page.get_textpage_ocr() method to take on the heavy lifting.
- What it is: Converting images of text into machine-readable text
- How it works: Image processing and pattern recognition (see the parameter sketch after this list)
- Advantages:
  - Works with any PDF (scanned, photographed, or image-based)
  - Can handle handwritten text (with advanced models)
  - Processes visual elements that native extraction misses
- Limitations:
  - Slower processing times
  - Accuracy depends on image quality
  - Higher computational and memory requirements
  - May introduce errors in recognition
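Before committing to OCR it helps to know the main knobs on Page.get_textpage_ocr(). A minimal sketch, assuming Tesseract and its language data are installed (the file name and parameter values are illustrative):

import pymupdf

doc = pymupdf.open("scan.pdf")
page = doc[0]
# full=False (the default) OCRs only the images on the page;
# full=True renders the whole page and OCRs it as a single image
textpage = page.get_textpage_ocr(language="eng", dpi=300, full=True)
text = page.get_text(textpage=textpage)  # read the OCR results via the usual API
doc.close()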
When to Use Native Text Extraction
The main reason to use native extraction is speed, especially for high volumes of documents. Additionally, if you know your PDFs contain no images, then OCR will be of no benefit! However, in the real world many documents are scanned representations of older documents, or even PDFs which have been deliberately flattened or "baked" to present an image of the page.
- Typical scenarios:
  - Digitally-created documents (Word exports, generated reports)
  - High-volume processing where speed matters
  - When perfect accuracy is required
  - Clean, well-structured business documents
- Red flags that suggest native won't work:
  - Scanned documents
  - PDFs created from photos
  - Documents with embedded images containing text
Code sample
import pymupdf
doc = pymupdf.open("a.pdf") # open a document
out = open("output.txt", "wb") # create a text output
for page in doc: # iterate the document pages
    text = page.get_text().encode("utf8") # get plain text and encode as UTF-8
out.write(text) # write text of page
out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
out.close()
When to Use OCR
If you know you are dealing with scanned documents then this is a no-brainer: you simply have to rely on OCR to extract the text content!
- Typical scenarios:
  - Scanned documents or faxes
  - PDFs created from photos
  - Historical documents
  - Mixed content (native text + images with text)
  - When you need to extract text from images within PDFs
- Quality considerations:
  - Resolution requirements (minimum 300 DPI)
  - Clean vs. degraded source material
  - Language and font considerations
Code sample
import pymupdf
doc = pymupdf.open("a.pdf") # open a document
for page in doc: # iterate the document pages
    textpage = page.get_textpage_ocr() # run OCR on the page (requires Tesseract)
    # analyse the text page as required, e.g. read the recognized text:
    text = page.get_text(textpage=textpage)
doc.close()
Hybrid Approaches: Getting the Best of Both Worlds
The following is a guideline to getting the most out of your PDF data extraction.
The smart strategy
To make your data extraction more robust, it is recommended to try native text extraction first and then fall back to OCR. For example, if you are looking for a specific field of information in a document and don't find it via native text extraction, then pass the document back to PyMuPDF for an OCR pass, as sketched below.
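A minimal sketch of this fallback, assuming an arbitrary min_chars threshold for deciding that native extraction came up short (the threshold and helper name are illustrative, not PyMuPDF defaults):

import pymupdf

def extract_text_with_fallback(page, min_chars=20):
    # Native extraction first: fast and exact when text is embedded
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return text
    # Too little native text found: fall back to an OCR pass (requires Tesseract)
    textpage = page.get_textpage_ocr(full=True)
    return page.get_text(textpage=textpage)

doc = pymupdf.open("a.pdf")
for page in doc:
    print(extract_text_with_fallback(page))
doc.close()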
Detection methods
If you are dealing with large volumes of documents, try filtering the documents by type and processing them on different pipelines. For example, if you count the images in a PDF and discover that it is very image-heavy, then consider sending it straight to your OCR pipeline. Another detection signal is document size: small PDFs will likely contain little or no imagery. A routing sketch along these lines follows.
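A minimal routing sketch with arbitrary, illustrative thresholds (route_document is a hypothetical helper name, not a PyMuPDF API):

import os
import pymupdf

def route_document(pdf_path, max_native_mb=2.0, max_images=5):
    # File size is a cheap first signal: small PDFs rarely carry much imagery
    size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
    doc = pymupdf.open(pdf_path)
    image_count = sum(len(page.get_images()) for page in doc)
    doc.close()
    # Image-heavy or large files go straight to the OCR pipeline
    if size_mb > max_native_mb or image_count > max_images:
        return "ocr_pipeline"
    return "native_pipeline"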
OCR pipeline red-flags (a per-page check is sketched after this list):
- page is completely covered by an image
- no text exists on the page
- thousands of small vector graphics (indicating simulated text)
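A minimal per-page sketch of these red flags (the coverage and max_drawings thresholds are illustrative assumptions):

import pymupdf

def page_needs_ocr(page, coverage=0.9, max_drawings=2000):
    # Red flag: no native text exists on the page
    if not page.get_text().strip():
        return True
    # Red flag: a single image covers (almost) the whole page
    page_area = page.rect.get_area()
    for img in page.get_images(full=True):
        for rect in page.get_image_rects(img[0]):
            if rect.get_area() / page_area >= coverage:
                return True
    # Red flag: thousands of small vector graphics simulating text
    if len(page.get_drawings()) > max_drawings:
        return True
    return False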
Implementation workflow
- Step 1: Filter document types (set 1: small PDFs with little imagery; set 2: larger PDFs with many images, or fully scanned "baked" PDFs, i.e. files which have already been "detected" as likely to benefit from OCR)
- Step 2: Attempt native extraction on set 1
- Step 3: Evaluate the results (documents yielding empty, garbled, or incomplete text are moved to set 2)
- Step 4: Selectively OCR the set 2 documents
PyMuPDF's API supports this hybrid approach. Ideally you should analyse the full document's pages and figure out which pages require OCR and which do not, in order to reduce computation time. For example, you may have flagged a 100-page document as requiring OCR, but on closer analysis of its pages you find that only 30 of them would actually benefit. In that case you want to selectively determine which pages to send through OCR. The example code below uses PyMuPDF to analyse a PDF and report back with a summary of the pages containing large images.
Code sample
import pymupdf # PyMuPDF
import os
from datetime import datetime
from typing import Dict, Tuple
def analyze_images_in_pdf(pdf_path: str, size_threshold_mb: float = 1.0,
dimension_threshold: Tuple[int, int] = (800, 600)) -> Dict:
"""
Analyze a PDF document for large images on each page.
Args:
pdf_path (str): Path to the PDF file
size_threshold_mb (float): Minimum file size in MB to consider an image "large"
dimension_threshold (tuple): Minimum (width, height) to consider an image "large"
Returns:
dict: Analysis results containing image information for each page
"""
try:
doc = pymupdf.open(pdf_path)
total_pages = len(doc)
print(f"Analyzing {total_pages} pages in: {os.path.basename(pdf_path)}")
print(f"Size threshold: {size_threshold_mb} MB")
print(f"Dimension threshold: {dimension_threshold[0]}x{dimension_threshold[1]} pixels")
print("-" * 60)
results = {
'pdf_path': pdf_path,
'total_pages': total_pages,
'size_threshold_mb': size_threshold_mb,
'dimension_threshold': dimension_threshold,
'pages_with_large_images': [],
'summary': {
'images': 0,
'total_large_images': 0,
'pages_with_large_images': 0,
'total_image_size_mb': 0,
'largest_image': None
}
}
largest_image_size = 0
        # Analyze each page (capped at the first 100 pages to bound processing time)
pages_to_analyze = min(total_pages, 100)
for page_num in range(pages_to_analyze):
page = doc[page_num]
page_info = {
'page_number': page_num + 1,
'images': [],
'large_images': [],
'total_images_on_page': 0,
'large_images_count': 0
}
# Get all images on the page
image_list = page.get_images()
page_info['total_images_on_page'] = len(image_list)
for img_index, img in enumerate(image_list):
try:
# Extract image information
xref = img[0] # xref number
pix = pymupdf.Pixmap(doc, xref)
                    # Convert to RGB if the image has an alpha channel
if pix.alpha:
pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
# Get image properties
width = pix.width
height = pix.height
image_size_bytes = len(pix.tobytes())
image_size_mb = image_size_bytes / (1024 * 1024)
print(f"Found image with size:{image_size_bytes} bytes")
# Check if image meets "large" criteria
is_large_by_size = image_size_mb >= size_threshold_mb
is_large_by_dimensions = (width >= dimension_threshold[0] and
height >= dimension_threshold[1])
if is_large_by_size or is_large_by_dimensions:
image_info = {
'image_index': img_index + 1,
'xref': xref,
'width': width,
'height': height,
'size_mb': round(image_size_mb, 2),
'size_bytes': image_size_bytes,
'colorspace': pix.colorspace.name if pix.colorspace else 'Unknown',
'reason_large': []
}
if is_large_by_size:
image_info['reason_large'].append('Size')
if is_large_by_dimensions:
image_info['reason_large'].append('Dimensions')
page_info['large_images'].append(image_info)
page_info['large_images_count'] += 1
results['summary']['total_large_images'] += 1
results['summary']['total_image_size_mb'] += image_size_mb
# Track largest image
if image_size_mb > largest_image_size:
largest_image_size = image_size_mb
results['summary']['largest_image'] = {
'page': page_num + 1,
'size_mb': round(image_size_mb, 2),
'dimensions': f"{width}x{height}",
'xref': xref
}
results['summary']['images'] += 1
pix = None # Clean up
except Exception as e:
print(f"Error processing image {img_index + 1} on page {page_num + 1}: {e}")
continue
# Only add pages that have large images
if page_info['large_images_count'] > 0:
results['pages_with_large_images'].append(page_info)
results['summary']['pages_with_large_images'] += 1
# Progress indicator
if (page_num + 1) % 10 == 0:
print(f"Processed {page_num + 1} pages...")
doc.close()
results['summary']['total_image_size_mb'] = round(results['summary']['total_image_size_mb'], 2)
return results
except Exception as e:
print(f"Error analyzing PDF: {e}")
return None
def print_analysis_results(results: Dict):
"""Print formatted analysis results."""
if not results:
print("No results to display.")
return
print("\n" + "="*60)
print("PDF IMAGE ANALYSIS RESULTS")
print("="*60)
# Summary
summary = results['summary']
print(f"Total pages analyzed: {results['total_pages']}")
print(f"Total images: {summary['images']}")
print(f"Pages with large images: {summary['pages_with_large_images']}")
print(f"Total large images found: {summary['total_large_images']}")
print(f"Total size of large images: {summary['total_image_size_mb']} MB")
if summary['largest_image']:
largest = summary['largest_image']
print(f"Largest image: {largest['size_mb']} MB ({largest['dimensions']}) on page {largest['page']}")
print("\n" + "-"*60)
print("DETAILED RESULTS BY PAGE")
print("-"*60)
# Detailed results
for page_info in results['pages_with_large_images']:
print(f"\nPage {page_info['page_number']}:")
print(f" Total images on page: {page_info['total_images_on_page']}")
print(f" Large images: {page_info['large_images_count']}")
for img in page_info['large_images']:
reasons = ", ".join(img['reason_large'])
print(f" Image {img['image_index']}: {img['width']}x{img['height']} pixels, "
f"{img['size_mb']} MB ({reasons})")
def save_analysis_to_file(results: Dict, output_file: str):
"""Save analysis results to a text file."""
if not results:
print("No results to save.")
return
with open(output_file, 'w') as f:
f.write("PDF IMAGE ANALYSIS RESULTS\n")
f.write("="*60 + "\n")
f.write(f"PDF File: {results['pdf_path']}\n")
f.write(f"Analysis Date: {pymupdf.Document().metadata.get('creationDate', 'Unknown')}\n")
f.write(f"Size Threshold: {results['size_threshold_mb']} MB\n")
f.write(f"Dimension Threshold: {results['dimension_threshold'][0]}x{results['dimension_threshold'][1]}\n\n")
# Summary
summary = results['summary']
f.write("SUMMARY\n")
f.write("-"*20 + "\n")
f.write(f"Total pages analyzed: {results['total_pages']}\n")
f.write(f"Pages with large images: {summary['pages_with_large_images']}\n")
f.write(f"Total large images: {summary['total_large_images']}\n")
f.write(f"Total size of large images: {summary['total_image_size_mb']} MB\n")
if summary['largest_image']:
largest = summary['largest_image']
f.write(f"Largest image: {largest['size_mb']} MB ({largest['dimensions']}) on page {largest['page']}\n")
# Detailed results
f.write("\nDETAILED RESULTS\n")
f.write("-"*20 + "\n")
for page_info in results['pages_with_large_images']:
f.write(f"\nPage {page_info['page_number']}:\n")
f.write(f" Total images: {page_info['total_images_on_page']}\n")
f.write(f" Large images: {page_info['large_images_count']}\n")
for img in page_info['large_images']:
reasons = ", ".join(img['reason_large'])
f.write(f" Image {img['image_index']}: {img['width']}x{img['height']} px, "
f"{img['size_mb']} MB, {img['colorspace']} ({reasons})\n")
print(f"Analysis results saved to: {output_file}")
# Example usage
if __name__ == "__main__":
# Replace with your PDF file path
pdf_file = "test.pdf"
# Customize thresholds as needed
size_threshold = 1.0 # MB
dimension_threshold = (800, 600) # width x height pixels
# Run analysis
results = analyze_images_in_pdf(
pdf_path=pdf_file,
size_threshold_mb=size_threshold,
dimension_threshold=dimension_threshold
)
if results:
# Print results to console
print_analysis_results(results)
# Optionally save to file
output_file = f"image_analysis_{os.path.splitext(os.path.basename(pdf_file))[0]}.txt"
save_analysis_to_file(results, output_file)
else:
print("Analysis failed. Please check your PDF file path and try again.")
Real-World Scenarios
Smart document filtering for text extraction can save real time and processing effort: financial services deal with mixed document types, legal firms typically hold digitized historical case files, and many academic papers contain graphical visualizations carrying key data. In all these cases it makes sense to use a smart strategy that performs OCR only on the pages that require it.
Documentation & Videos
YouTube: Advanced PyMuPDF Text Extraction Techniques | Full Tutorial