Extracting Tables from PDFs with PyMuPDF
Harald Lieder·June 6, 2025

Today, we explore the process of extracting tables from PDFs using PyMuPDF, with a focus on its practical applications in various professional settings. Many PDFs, especially those originating from spreadsheets or data exports, contain structured tables that need to be converted into a usable format. This article outlines the importance of table extraction, its common use cases, and how PyMuPDF’s enhanced capabilities, such as Markdown conversion and direct export to pandas DataFrames, facilitate this process.
Why Table Extraction is Important
Many documents encountered in finance, academia, and business are created from data exports or spreadsheets, resulting in PDFs that contain well-organized tables. However, because PDFs are designed mainly for fixed-layout document rendering, they do not inherently maintain a table structure. Extracting these tables into a format that preserves the data’s organization is critical for tasks such as automated processing and detailed analysis.
Typical Use Cases
The extraction of tables from PDFs is particularly useful in scenarios such as:
- Invoice Processing: Automate the extraction of itemized details from invoices and receipts to facilitate financial record-keeping.
- Research Data Extraction: Retrieve tables embedded in academic papers, reducing the need for time-consuming manual entry.
- Compliance Audits: Quickly gather structured data from reports to verify adherence to regulatory requirements.
In each of these applications, converting a PDF’s static content into dynamic, structured data significantly improves operational efficiency.
How PyMuPDF Enhances Table Extraction
PyMuPDF includes a powerful feature, the find_tables method on a Page object, which simplifies the process of identifying and extracting tables from a PDF. This feature addresses the inherent challenges of PDF document structures through three main capabilities:
- Markdown Conversion: The table finder can convert detected tables into Markdown text. This feature is particularly useful for integrating extracted data with Large Language Models (LLMs) for further automated processing.
- Export to DataFrames: For users who prefer handling data in pandas, the find_tables method allows for direct export of tables into pandas DataFrames. This streamlines further processing, such as refining the data with pandas or converting it to one of over 20 formats, among them Excel, JSON, and CSV.
- Export to Python list objects: You can also retrieve a native Python list of lists and match each table cell’s text with its exact position (bounding box) on the page.
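As a rough illustration of the list-of-lists export, the sketch below pairs each cell's text with its bounding box. It assumes a table object that behaves like a PyMuPDF Table, where extract() returns the rows as lists of strings and each entry in rows carries a cells list of (x0, y0, x1, y1) tuples, with None where no cell exists. The helper name cells_with_positions is hypothetical and not part of PyMuPDF.

```python
def cells_with_positions(table):
    """Pair each cell's text with its bounding box.

    Assumption: `table` behaves like a PyMuPDF Table, i.e. `extract()`
    returns a list of rows (each a list of strings or None) and `rows`
    is a list of objects whose `cells` attribute holds one
    (x0, y0, x1, y1) tuple, or None, per column.
    """
    paired = []
    for r, (texts, row) in enumerate(zip(table.extract(), table.rows)):
        for c, (text, bbox) in enumerate(zip(texts, row.cells)):
            if bbox is None:
                # No cell at this grid position (for example, a merged cell).
                continue
            paired.append(
                {"row": r, "col": c, "text": text or "", "bbox": tuple(bbox)}
            )
    return paired


# Usage sketch (needs PyMuPDF and a real file; "example.pdf" is a placeholder):
#   import pymupdf
#   page = pymupdf.open("example.pdf")[0]
#   for table in page.find_tables().tables:
#       for cell in cells_with_positions(table):
#           print(cell)
```

Keeping the row and column indices next to each bounding box makes it possible to trace an extracted value back to its place on the page, for example to highlight it or to re-read that area with different settings.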
PyMuPDF’s find_tables method also includes an advanced feature that automatically detects column headers within tables. This capability distinguishes header rows from data rows during the extraction process. The result is a structured table object that not only provides rows and columns but also clearly identifies the header cells. This information is then readily accessible for conversion into Markdown or for exporting directly to pandas DataFrames.
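To make the header information more concrete, here is a minimal sketch of how it might be inspected. It assumes the table object exposes a header attribute with names, external and bbox, as PyMuPDF's Table does, and that when external is false the first extracted row is the header row. The helper names describe_header and data_rows, and the col_N placeholder for empty names, are hypothetical.

```python
def describe_header(table):
    """Summarize the detected header of a table.

    Assumption: `table.header` has `names` (column names), `external`
    (True if the header lies outside the table body) and `bbox`.
    """
    header = table.header
    # Substitute a placeholder for empty or missing column names.
    names = [name if name else f"col_{i}" for i, name in enumerate(header.names)]
    return {
        "names": names,
        "external": bool(header.external),
        "bbox": tuple(header.bbox),
    }


def data_rows(table):
    """Return the extracted rows without the header row.

    Assumption: an internal header (external == False) is the first
    extracted row; an external header is not among the extracted rows.
    """
    rows = table.extract()
    return rows if table.header.external else rows[1:]


# Usage sketch (needs PyMuPDF and a real file; "example.pdf" is a placeholder):
#   import pymupdf
#   page = pymupdf.open("example.pdf")[0]
#   for table in page.find_tables().tables:
#       print(describe_header(table))
#       print(data_rows(table))
```

Separating the header from the data rows this way is mainly useful when working with the plain list output; to_markdown and to_pandas already use the detected header for the column names.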
Sample Code
The following sample code demonstrates how to use PyMuPDF’s find_tables method to extract tables from a PDF and convert them into both Markdown and pandas DataFrame formats:
import pymupdf

# Open the PDF document
doc = pymupdf.open("example.pdf")
page = doc[0]  # Process the first page

# Detect tables on the page using table finder
tables = page.find_tables()

if not tables.tables:
    print("No tables found on this page.")
else:
    for index, table in enumerate(tables):
        print(f"\nTable {index+1} found:")

        # Convert the table to Markdown text
        md_table = table.to_markdown()
        print("\nMarkdown representation:")
        print(md_table)

        # Convert the table to a pandas DataFrame
        df_table = table.to_pandas()
        print("\nPandas DataFrame:")
        print(df_table)

        # Optional: Export to CSV, JSON, or Excel
        df_table.to_csv(f"table_{index+1}.csv", index=False)
        # df_table.to_excel(f"table_{index+1}.xlsx", index=False)
        # df_table.to_json(f"table_{index+1}.json", orient="records")

print("\nTable extraction complete!")
In this example, the script opens a PDF file, uses find_tables to identify tables on the first page, and then converts each detected table into a Markdown representation and a pandas DataFrame. This dual approach enables both quick data preview and further data manipulation.
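The sample above looks only at the first page. One possible way to cover a whole document is sketched below. It assumes the document can be iterated page by page and that each page's find_tables() result has a tables list, as in PyMuPDF. The generator name iter_tables is hypothetical.

```python
def iter_tables(doc):
    """Yield (page_index, table_index, table) for every table in a document.

    Assumption: iterating `doc` yields page objects whose `find_tables()`
    returns a finder with a `tables` list, as PyMuPDF pages do.
    Both indices are zero-based.
    """
    for page_index, page in enumerate(doc):
        finder = page.find_tables()
        for table_index, table in enumerate(finder.tables):
            yield page_index, table_index, table


# Usage sketch (needs PyMuPDF and a real file; "example.pdf" is a placeholder):
#   import pymupdf
#   doc = pymupdf.open("example.pdf")
#   for page_index, table_index, table in iter_tables(doc):
#       print(f"Page {page_index + 1}, table {table_index + 1}")
#       print(table.to_markdown())
```

Pages without tables simply contribute nothing, so the calling code does not need a separate "no tables found" branch for each page.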
Conclusion
While extracting tables from PDFs can be challenging due to the format’s inherent limitations, PyMuPDF’s advanced find_tables method offers an efficient and effective solution. By converting table data into Markdown for language model integration or exporting to pandas DataFrames for robust data manipulation, this method significantly simplifies the task of transforming static PDF content into structured, analyzable data.
We hope this guide serves as a useful introduction to the enhanced capabilities of PyMuPDF. For further details, please consult the articles Table Recognition and Extraction With PyMuPDF and Solving Common Issues With Table Detection and Extraction, as well as the official documentation, and join the community discussions.