Text Extraction Strategies with PyMuPDF

Jamie Lemon·May 30, 2025

Have you ever wondered why your extracted data feels incomplete, or suspected that not all of your document was being considered during a data extraction process? Or has your pipeline been needlessly held up by long document processing times? In this article we discuss the two main approaches to text extraction, Native and OCR, and look into smart strategies for choosing how and when to use each.

Understanding Native Text Extraction

As you can imagine, this technique uses core PyMuPDF functionality to simply get the text from a document. We use the Page.get_text() method to extract any content that is identified as “text” within the PDF.

  • What it is: Extracting text that's already digitally embedded in the PDF
  • How it works: Direct access to text objects in the PDF structure
  • Advantages:
    • Lightning-fast processing
    • Perfect accuracy (when text exists)
    • Preserves original formatting and fonts
    • Low computational requirements
  • Limitations:
    • Only works with digitally-created PDFs
    • Fails completely with scanned documents
    • Can struggle with complex layouts

Understanding OCR (Optical Character Recognition)

This method uses open-source, third-party technology (Tesseract) to scan the page for images and convert that imagery into text. Imagine PDFs which contain screenshots of information: these will just be identified as “image” within the PDF, yet we still want machine-readable text. PyMuPDF’s Page.get_textpage_ocr() function takes on the heavy lifting.

  • What it is: Converting images of text into machine-readable text
  • How it works: Image processing and pattern recognition
  • Advantages:
    • Works with any PDF (scanned, photographed, or image-based)
    • Can handle handwritten text (with advanced models)
    • Processes visual elements that native extraction misses
  • Limitations:
    • Slower processing times
    • Accuracy depends on image quality
    • Higher computational and memory requirements
    • May introduce errors in recognition

When to Use Native Text Extraction

The main reasons to use native extraction are speed and high document volumes. Additionally, if you know your PDFs contain no images, then OCR will be of no benefit! However, in the real world many documents are scanned representations of older documents, or even PDFs which have been deliberately flattened or “baked” to present an image version of the page.

  • Typical scenarios:
    • Digitally-created documents (Word exports, generated reports)
    • High-volume processing where speed matters
    • When perfect accuracy is required
    • Clean, well-structured business documents
  • Red flags that suggest native won't work:
    • Scanned documents
    • PDFs created from photos
    • Documents with embedded images containing text

Code sample

import pymupdf

doc = pymupdf.open("a.pdf") # open a document
out = open("output.txt", "wb") # create a text output
for page in doc: # iterate the document pages
    text = page.get_text().encode("utf8") # get plain text (is in UTF-8)
    out.write(text) # write text of page
    out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
out.close()

When to Use OCR

If you know you are dealing with scanned documents then this is a no-brainer: you simply have to rely on OCR to extract the text content!

  • Typical scenarios:
    • Scanned documents or faxes
    • PDFs created from photos
    • Historical documents
    • Mixed content (native text + images with text)
    • When you need to extract text from images within PDFs
  • Quality considerations (see the sketch after the code sample below):
    • Resolution requirements (minimum 300 DPI)
    • Clean vs. degraded source material
    • Language and font considerations

Code sample

import pymupdf

doc = pymupdf.open("a.pdf") # open a document

for page in doc: # iterate the document pages
    textpage = page.get_textpage_ocr() # run OCR on the page (requires a Tesseract installation)
    text = page.get_text(textpage=textpage) # extract the recognised text
    # analyse the text as required!

doc.close()
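
The quality considerations above map onto keyword arguments of Page.get_textpage_ocr(): dpi controls the rendering resolution, language selects the Tesseract language model, and full decides whether the whole page or only its embedded images are OCRed. The snippet below is a minimal sketch assuming a fully scanned, English-language page; the file name is a placeholder.

import pymupdf

doc = pymupdf.open("scanned.pdf") # placeholder file name
page = doc[0]

# render at 300 DPI and OCR the entire page image
textpage = page.get_textpage_ocr(dpi=300, language="eng", full=True)
print(page.get_text(textpage=textpage)) # print the recognised text

doc.close()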

Hybrid Approaches: Getting the Best of Both Worlds

The following is a guideline for getting the most out of your PDF data extraction.

The smart strategy

To make your data extraction more robust, it is recommended to try native text extraction first and fall back to OCR. For example, if you are looking for a specific field of information in a document and don’t find it via native text extraction, pass the document back to PyMuPDF for an OCR pass, as sketched below.
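
For instance, here is a minimal sketch of that fallback, assuming we are searching for an invoice-number label; the find_field() helper, the file name and the search term are placeholders rather than part of PyMuPDF:

import pymupdf
from typing import Optional

def find_field(pdf_path: str, needle: str) -> Optional[str]:
    """Try native extraction first; fall back to OCR only if the term is not found."""
    doc = pymupdf.open(pdf_path)
    try:
        for use_ocr in (False, True): # first pass: native, second pass: OCR
            for page in doc:
                if use_ocr:
                    textpage = page.get_textpage_ocr() # requires a Tesseract installation
                    text = page.get_text(textpage=textpage)
                else:
                    text = page.get_text()
                if needle in text:
                    return text # hand the page text on to your field parser
        return None
    finally:
        doc.close()

result = find_field("invoice.pdf", "Invoice No.") # placeholder file and search term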

Detection methods

If you are dealing with large volumes of documents, try filtering them by type and processing them on different pipelines. For example, if you count the images in a PDF and discover that it is very image-heavy, consider sending it straight to your OCR pipeline. Another detection method is document size (small PDFs will likely contain little or no imagery).

OCR pipeline red flags (see the detection sketch after this list):

  • page is completely covered by an image
  • no text exists on the page
  • thousands of small vector graphics (indicating simulated text)
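
As a rough guide, these page-level red flags can be tested with standard PyMuPDF calls before deciding to send a page to OCR. The following is a heuristic sketch, not a definitive rule: the coverage and vector-count thresholds are illustrative assumptions, and the page_needs_ocr() helper is not part of PyMuPDF. The document-level checks mentioned above (counting images with Page.get_images(), looking at file size) follow the same idea.

import pymupdf

def page_needs_ocr(page: pymupdf.Page,
                   coverage_threshold: float = 0.9,
                   vector_threshold: int = 2000) -> bool:
    """Heuristic check for the red flags above; thresholds are illustrative assumptions."""
    # red flag: no native text exists on the page
    if not page.get_text().strip():
        return True

    # red flag: a single image covers (almost) the whole page
    page_area = abs(page.rect) # abs() of a Rect is its area
    for img in page.get_images(full=True):
        for rect in page.get_image_rects(img[0]): # img[0] is the image xref
            if abs(rect) / page_area >= coverage_threshold:
                return True

    # red flag: thousands of small vector graphics (often simulated text)
    if len(page.get_drawings()) >= vector_threshold:
        return True

    return False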

Implementation workflow

  • Step 1: Filter document types (set 1: small PDFs with little imagery; set 2: larger PDFs with many images, or fully scanned “baked” PDFs, i.e. files which have previously been “detected” to benefit from OCR)
  • Step 2: Attempt native extraction on set 1
  • Step 3: Evaluate the results (documents with empty, garbled, or incomplete text move into set 2)
  • Step 4: Selectively OCR the set 2 documents (a minimal batch sketch follows)
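
A minimal batch sketch of steps 2 to 4, assuming a list of PDF paths; the batch_extract() helper and the character threshold used to judge a page as “empty” are illustrative assumptions, and an up-front filter such as the detection sketch above would slot in as step 1:

import pymupdf

def batch_extract(pdf_paths, min_chars_per_page: int = 25):
    """Native-first extraction with selective OCR for pages whose native text looks empty."""
    extracted = {}  # path -> list of page texts
    needs_ocr = []  # "set 2": (path, page indices) flagged by the evaluation step

    # steps 2 & 3: attempt native extraction and evaluate the result per page
    for path in pdf_paths:
        doc = pymupdf.open(path)
        pages = [page.get_text() for page in doc]
        thin = [i for i, text in enumerate(pages) if len(text.strip()) < min_chars_per_page]
        if thin:
            needs_ocr.append((path, thin))
        extracted[path] = pages
        doc.close()

    # step 4: selectively OCR only the flagged pages (requires a Tesseract installation)
    for path, page_indices in needs_ocr:
        doc = pymupdf.open(path)
        for i in page_indices:
            textpage = doc[i].get_textpage_ocr()
            extracted[path][i] = doc[i].get_text(textpage=textpage)
        doc.close()

    return extracted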

PyMuPDF’s API supports this hybrid approach. Ideally, you should analyse all of a document’s pages and work out which ones require OCR and which do not, in order to reduce computation time. For example, you may have flagged a 100-page document as requiring OCR, but on closer analysis of its pages you find that only 30 of them will actually benefit. In that case you want to selectively determine which pages to OCR. The example code below uses PyMuPDF to analyse a PDF and report back with a summary of the pages on which large images were detected.

Code sample

import pymupdf  # PyMuPDF
import os
from datetime import datetime
from typing import Dict, Tuple

def analyze_images_in_pdf(pdf_path: str, size_threshold_mb: float = 1.0, 
                         dimension_threshold: Tuple[int, int] = (800, 600)) -> Dict:
    """
    Analyze a PDF document for large images on each page.
    
    Args:
        pdf_path (str): Path to the PDF file
        size_threshold_mb (float): Minimum file size in MB to consider an image "large"
        dimension_threshold (tuple): Minimum (width, height) to consider an image "large"
    
    Returns:
        dict: Analysis results containing image information for each page
    """
    
    try:
        doc = pymupdf.open(pdf_path)
        total_pages = len(doc)
        
        print(f"Analyzing {total_pages} pages in: {os.path.basename(pdf_path)}")
        print(f"Size threshold: {size_threshold_mb} MB")
        print(f"Dimension threshold: {dimension_threshold[0]}x{dimension_threshold[1]} pixels")
        print("-" * 60)
        
        results = {
            'pdf_path': pdf_path,
            'total_pages': total_pages,
            'size_threshold_mb': size_threshold_mb,
            'dimension_threshold': dimension_threshold,
            'pages_with_large_images': [],
            'summary': {
                'images': 0,
                'total_large_images': 0,
                'pages_with_large_images': 0,
                'total_image_size_mb': 0,
                'largest_image': None
            }
        }
        
        largest_image_size = 0
        
        # Analyze each page (limited to the first 100 pages to keep the analysis time bounded)
        pages_to_analyze = min(total_pages, 100)
        
        for page_num in range(pages_to_analyze):
            page = doc[page_num]
            page_info = {
                'page_number': page_num + 1,
                'images': [],
                'large_images': [],
                'total_images_on_page': 0,
                'large_images_count': 0
            }
            
            # Get all images on the page
            image_list = page.get_images()
            page_info['total_images_on_page'] = len(image_list)
            
            for img_index, img in enumerate(image_list):
                try:
                    # Extract image information
                    xref = img[0]  # xref number
                    pix = pymupdf.Pixmap(doc, xref)
                    
                    # Convert to RGB if the image has an alpha channel
                    if pix.alpha:
                        pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
                    
                    # Get image properties
                    width = pix.width
                    height = pix.height
                    image_size_bytes = len(pix.tobytes())
                    image_size_mb = image_size_bytes / (1024 * 1024)

                    print(f"Found image with size:{image_size_bytes} bytes")
                    
                    # Check if image meets "large" criteria
                    is_large_by_size = image_size_mb >= size_threshold_mb
                    is_large_by_dimensions = (width >= dimension_threshold[0] and 
                                            height >= dimension_threshold[1])
                    
                    if is_large_by_size or is_large_by_dimensions:
                        image_info = {
                            'image_index': img_index + 1,
                            'xref': xref,
                            'width': width,
                            'height': height,
                            'size_mb': round(image_size_mb, 2),
                            'size_bytes': image_size_bytes,
                            'colorspace': pix.colorspace.name if pix.colorspace else 'Unknown',
                            'reason_large': []
                        }
                        
                        if is_large_by_size:
                            image_info['reason_large'].append('Size')
                        if is_large_by_dimensions:
                            image_info['reason_large'].append('Dimensions')
                        
                        page_info['large_images'].append(image_info)
                        page_info['large_images_count'] += 1
                        results['summary']['total_large_images'] += 1
                        results['summary']['total_image_size_mb'] += image_size_mb
                        
                        # Track largest image
                        if image_size_mb > largest_image_size:
                            largest_image_size = image_size_mb
                            results['summary']['largest_image'] = {
                                'page': page_num + 1,
                                'size_mb': round(image_size_mb, 2),
                                'dimensions': f"{width}x{height}",
                                'xref': xref
                            }
                    
                    results['summary']['images'] += 1
                    pix = None  # Clean up
                    
                except Exception as e:
                    print(f"Error processing image {img_index + 1} on page {page_num + 1}: {e}")
                    continue
            
            # Only add pages that have large images
            if page_info['large_images_count'] > 0:
                results['pages_with_large_images'].append(page_info)
                results['summary']['pages_with_large_images'] += 1
            
            # Progress indicator
            if (page_num + 1) % 10 == 0:
                print(f"Processed {page_num + 1} pages...")
        
        doc.close()
        results['summary']['total_image_size_mb'] = round(results['summary']['total_image_size_mb'], 2)
        
        return results
        
    except Exception as e:
        print(f"Error analyzing PDF: {e}")
        return None

def print_analysis_results(results: Dict):
    """Print formatted analysis results."""
    
    if not results:
        print("No results to display.")
        return
    
    print("\n" + "="*60)
    print("PDF IMAGE ANALYSIS RESULTS")
    print("="*60)
    
    # Summary
    summary = results['summary']
    print(f"Total pages analyzed: {results['total_pages']}")
    print(f"Total images: {summary['images']}")
    print(f"Pages with large images: {summary['pages_with_large_images']}")
    print(f"Total large images found: {summary['total_large_images']}")
    print(f"Total size of large images: {summary['total_image_size_mb']} MB")
    
    if summary['largest_image']:
        largest = summary['largest_image']
        print(f"Largest image: {largest['size_mb']} MB ({largest['dimensions']}) on page {largest['page']}")
    
    print("\n" + "-"*60)
    print("DETAILED RESULTS BY PAGE")
    print("-"*60)
    
    # Detailed results
    for page_info in results['pages_with_large_images']:
        print(f"\nPage {page_info['page_number']}:")
        print(f"  Total images on page: {page_info['total_images_on_page']}")
        print(f"  Large images: {page_info['large_images_count']}")
        
        for img in page_info['large_images']:
            reasons = ", ".join(img['reason_large'])
            print(f"    Image {img['image_index']}: {img['width']}x{img['height']} pixels, "
                  f"{img['size_mb']} MB ({reasons})")

def save_analysis_to_file(results: Dict, output_file: str):
    """Save analysis results to a text file."""
    
    if not results:
        print("No results to save.")
        return
    
    with open(output_file, 'w') as f:
        f.write("PDF IMAGE ANALYSIS RESULTS\n")
        f.write("="*60 + "\n")
        f.write(f"PDF File: {results['pdf_path']}\n")
        f.write(f"Analysis Date: {pymupdf.Document().metadata.get('creationDate', 'Unknown')}\n")
        f.write(f"Size Threshold: {results['size_threshold_mb']} MB\n")
        f.write(f"Dimension Threshold: {results['dimension_threshold'][0]}x{results['dimension_threshold'][1]}\n\n")
        
        # Summary
        summary = results['summary']
        f.write("SUMMARY\n")
        f.write("-"*20 + "\n")
        f.write(f"Total pages analyzed: {results['total_pages']}\n")
        f.write(f"Pages with large images: {summary['pages_with_large_images']}\n")
        f.write(f"Total large images: {summary['total_large_images']}\n")
        f.write(f"Total size of large images: {summary['total_image_size_mb']} MB\n")
        
        if summary['largest_image']:
            largest = summary['largest_image']
            f.write(f"Largest image: {largest['size_mb']} MB ({largest['dimensions']}) on page {largest['page']}\n")
        
        # Detailed results
        f.write("\nDETAILED RESULTS\n")
        f.write("-"*20 + "\n")
        
        for page_info in results['pages_with_large_images']:
            f.write(f"\nPage {page_info['page_number']}:\n")
            f.write(f"  Total images: {page_info['total_images_on_page']}\n")
            f.write(f"  Large images: {page_info['large_images_count']}\n")
            
            for img in page_info['large_images']:
                reasons = ", ".join(img['reason_large'])
                f.write(f"    Image {img['image_index']}: {img['width']}x{img['height']} px, "
                        f"{img['size_mb']} MB, {img['colorspace']} ({reasons})\n")
    
    print(f"Analysis results saved to: {output_file}")

# Example usage
if __name__ == "__main__":
    # Replace with your PDF file path
    pdf_file = "test.pdf"
    
    # Customize thresholds as needed
    size_threshold = 1.0  # MB
    dimension_threshold = (800, 600)  # width x height pixels
    
    # Run analysis
    results = analyze_images_in_pdf(
        pdf_path=pdf_file,
        size_threshold_mb=size_threshold,
        dimension_threshold=dimension_threshold
    )

    if results:
        # Print results to console
        print_analysis_results(results)

        # Optionally save to file
        output_file = f"image_analysis_{os.path.splitext(os.path.basename(pdf_file))[0]}.txt"
        save_analysis_to_file(results, output_file)
    else:
        print("Analysis failed. Please check your PDF file path and try again.")

Real-World Scenarios

Document filtering for text extraction can save time and processing when dealing with the mixed document types handled by financial services or the digitized historical case files typical of legal firms. Furthermore, many academic papers contain graphical visualizations with key data. In all of these cases, a smart strategy that performs OCR only on the pages that require it makes sense.

Documentation & Videos

YouTube: Advanced PyMuPDF Text Extraction Techniques | Full Tutorial