Extracting Tables from PDFs with PyMuPDF

Harald Lieder·June 6, 2025

In this article

Why Table Extraction is Important
Typical Use Cases
How PyMuPDF Enhances Table Extraction
Sample Code
Conclusion

Today, we explore the process of extracting tables from PDFs using PyMuPDF, with a focus on its practical applications in various professional settings. Many PDFs, especially those originating from spreadsheets or data exports, contain structured tables that need to be converted into a usable format. This article outlines the importance of table extraction, its common use cases, and how PyMuPDF’s enhanced capabilities, such as Markdown conversion and direct export to pandas DataFrames, facilitate this process.

Why Table Extraction is Important

Many documents encountered in finance, academia, and business are created from data exports or spreadsheets, resulting in PDFs that contain well-organized tables. However, because PDFs are designed mainly for fixed-layout document rendering, they do not inherently maintain a table structure. Extracting these tables into a format that preserves the data’s organization is critical for tasks such as automated processing and detailed analysis.

Typical Use Cases

The extraction of tables from PDFs is particularly useful in scenarios such as:

Invoice Processing: Automate the extraction of itemized details from invoices and receipts to facilitate financial record-keeping.
Research Data Extraction: Retrieve tables embedded in academic papers, reducing the need for time-consuming manual entry.
Compliance Audits: Quickly gather structured data from reports to verify adherence to regulatory requirements.

In each of these applications, converting a PDF’s static content into dynamic, structured data significantly improves operational efficiency.

How PyMuPDF Enhances Table Extraction

PyMuPDF provides the find_tables method on a Page object, which identifies and extracts the tables on a PDF page. It addresses the lack of an inherent table structure in PDF documents through three main capabilities:

Markdown Conversion: The table finder can convert detected tables into Markdown text. This feature is particularly useful for integrating extracted data with Large Language Models (LLMs) for further automated processing.
Export to DataFrames: For users who prefer handling data in pandas, each table detected by find_tables can be exported directly to a pandas DataFrame. This streamlines further processing, such as refining the data with pandas or converting it to one of the more than 20 formats pandas supports, among them Excel, JSON, and CSV.
Export to Python list objects: You can also retrieve a table as a native Python list of lists and match each table cell’s text with its exact position (bounding box) on the page.
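The third capability can be sketched as follows. This is a minimal illustration, not the library's own recipe: the helper names pair_text_with_boxes and cells_on_page are hypothetical, and the sketch assumes that table.extract() returns the cell texts row by row and that each entry of table.rows exposes a matching cells list of bounding boxes, with None for a missing cell.

```python
def pair_text_with_boxes(text_rows, box_rows):
    """Pair each cell's text with its bounding box (hypothetical helper).

    text_rows: list of rows of cell strings (None for empty text)
    box_rows:  list of rows of (x0, y0, x1, y1) tuples (None for no cell)
    """
    paired = []
    for r, (texts, boxes) in enumerate(zip(text_rows, box_rows)):
        for c, (text, box) in enumerate(zip(texts, boxes)):
            if box is None:  # no cell at this grid position
                continue
            paired.append(
                {"row": r, "col": c, "text": text or "", "bbox": tuple(box)}
            )
    return paired


def cells_on_page(pdf_path, page_number=0):
    """Return one list of paired cells per table found on the page."""
    import pymupdf  # imported here so the helper above stays dependency-free

    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number]
        result = []
        for table in page.find_tables().tables:
            text_rows = table.extract()                   # list of lists of text
            box_rows = [row.cells for row in table.rows]  # matching cell boxes
            result.append(pair_text_with_boxes(text_rows, box_rows))
        return result
```

Calling cells_on_page("example.pdf") would return, for each table, a flat list of dictionaries that can be used, for example, to highlight or redact individual cells on the page.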

PyMuPDF’s find_tables method also includes an advanced feature that automatically detects column headers within tables. This capability distinguishes header rows from data rows during the extraction process. The result is a structured table object that not only provides rows and columns but also clearly identifies the header cells. This information is then readily accessible for conversion into Markdown or for exporting directly to pandas DataFrames.
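To illustrate how the detected header can be used, here is a small sketch that turns a table into a list of dictionaries keyed by column name. It relies on the table's header object, whose names attribute holds the column names and whose external flag indicates whether the header lies outside the table body. The helper names rows_to_records and records_from_table are hypothetical, and the fallback names "Col0", "Col1", and so on for empty header cells are a choice made for this sketch.

```python
def rows_to_records(header_names, rows, header_is_external):
    """Build one dict per data row, keyed by column name (hypothetical helper)."""
    # If the header is part of the table body, the first extracted row
    # repeats it, so that row is dropped from the data.
    data_rows = rows if header_is_external else rows[1:]
    # Assumption of this sketch: empty header cells get a placeholder name.
    names = [name if name else f"Col{i}" for i, name in enumerate(header_names)]
    return [dict(zip(names, row)) for row in data_rows]


def records_from_table(table):
    """Apply rows_to_records to a table object returned by find_tables."""
    header = table.header
    return rows_to_records(header.names, table.extract(), header.external)
```

Note that duplicate column names would overwrite each other in this simple version; for real data, table.to_pandas() is usually the more robust route.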

Sample Code

The following sample code demonstrates how to use PyMuPDF’s find_tables method to extract tables from a PDF and convert them into both Markdown and pandas DataFrame formats:

import pymupdf

# Open the PDF document
doc = pymupdf.open("example.pdf")
page = doc[0]  # Process the first page

# Detect tables on the page using table finder
tables = page.find_tables()

if not tables.tables:
    print("No tables found on this page.")
else:
    for index, table in enumerate(tables):
        print(f"\nTable {index+1} found:")

        # Convert the table to Markdown text
        md_table = table.to_markdown()
        print("\nMarkdown representation:")
        print(md_table)

        # Convert the table to a pandas DataFrame
        df_table = table.to_pandas()
        print("\nPandas DataFrame:")
        print(df_table)

        # Optional: Export to CSV, JSON, or Excel
        df_table.to_csv(f"table_{index+1}.csv", index=False)
        # df_table.to_excel(f"table_{index+1}.xlsx", index=False)
        # df_table.to_json(f"table_{index+1}.json", orient="records")

print("\nTable extraction complete!")

In this example, the script opens a PDF file, uses find_tables to identify tables on the first page, and then converts each detected table into a Markdown representation and a pandas DataFrame. This dual approach enables both quick data preview and further data manipulation.
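The sample above looks at the first page only. A possible extension to a whole document is sketched below; the function names collect_tables and collect_tables_from_pdf are hypothetical, and the sketch relies only on the calls already used above (find_tables, its tables list, and to_markdown).

```python
def collect_tables(pages):
    """Collect (page number, table number, Markdown text) for every table.

    pages: any iterable of page objects that offer find_tables(),
           for example an open PyMuPDF document.
    """
    found = []
    for page_number, page in enumerate(pages, start=1):
        tables = page.find_tables().tables
        for table_number, table in enumerate(tables, start=1):
            found.append((page_number, table_number, table.to_markdown()))
    return found


def collect_tables_from_pdf(pdf_path):
    """Open the PDF, scan all of its pages, and close it again."""
    import pymupdf  # imported here so collect_tables stays dependency-free

    with pymupdf.open(pdf_path) as doc:
        return collect_tables(doc)
```

Keeping the page and table numbers next to each Markdown string makes it easy to trace a result back to its location in the document, for instance when passing the tables to an LLM one at a time.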

Conclusion

While extracting tables from PDFs can be challenging due to the format’s inherent limitations, PyMuPDF’s advanced find_tables method offers an efficient and effective solution. By converting table data into Markdown for language model integration or exporting to pandas DataFrames for robust data manipulation, this method significantly simplifies the task of transforming static PDF content into structured, analyzable data.

We hope this guide serves as a useful introduction to the enhanced capabilities of PyMuPDF. For further details, please consult the articles "Table Recognition and Extraction With PyMuPDF" and "Solving Common Issues With Table Detection and Extraction" as well as the official documentation, and join the community discussions.