Introducing PyMuPDF4LLM: A Breakthrough in PDF to Markdown Conversion for Python Developers

By Kayla Klein - Friday, May 03, 2024

We are excited to announce the release of PyMuPDF4LLM. Building on the foundation of PyMuPDF, recognized as the fastest PDF extraction tool in the Python ecosystem, PyMuPDF4LLM extends its capabilities specifically for developers working with large language models and related technologies.

Key Features of PyMuPDF4LLM

PyMuPDF4LLM introduces powerful features designed to streamline the process of converting PDF pages into Markdown format:

  • Markdown Conversion: Converts entire PDF documents into clean, GitHub-compatible Markdown text. This includes standard text and tables, ensuring they are in the correct reading sequence for seamless integration and further processing.
  • Advanced Text Formatting: The tool intelligently detects and formats header lines based on font size, applying appropriate Markdown heading tags. It also supports bold, italic, and monospaced text, as well as code blocks, which are crucial for technical documentation.
  • List Detection: Both ordered and unordered lists within the PDF are detected and correctly formatted, preserving the document's original structure and intent in the Markdown output.
  • Flexible Page Selection: Users can opt to convert the entire document or specify a subset of pages by providing a list of 0-based page numbers, offering flexibility for targeted document processing.
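
As a rough illustration of page selection, the sketch below converts a page range as it would appear in a PDF viewer (1-based, inclusive) into the 0-based list described above and passes it through the `pages` argument of `to_markdown`. The helper names `zero_based_pages` and `convert_page_range` are hypothetical and not part of the package; treat this as one possible approach rather than the recommended one.

```python
def zero_based_pages(first: int, last: int) -> list[int]:
    """Convert an inclusive 1-based page range to a list of 0-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def convert_page_range(pdf_path: str, first: int, last: int) -> str:
    """Convert only pages first..last (1-based, inclusive) of a PDF to Markdown."""
    # Imported here so the helper above can be used without the package installed.
    import pymupdf4llm

    # Assumption: the page subset is supplied as a list of 0-based page numbers.
    return pymupdf4llm.to_markdown(pdf_path, pages=zero_based_pages(first, last))
```

With these helpers, `convert_page_range("input.pdf", 2, 4)` would request pages `[1, 2, 3]`, that is, the second through fourth pages of the document.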

Ideal for Developers and Technologists

PyMuPDF4LLM is designed to cater to the needs of developers, especially those working with retrieval-augmented generation (RAG) and large language models (LLMs). The ability to swiftly turn complex PDF documents into Markdown format greatly enhances productivity and accuracy in developing applications and systems that rely on structured textual data.
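
For RAG pipelines, a common next step is to split the Markdown into chunks before indexing. The sketch below is one simple way to do that with the standard library only: it splits Markdown text at heading lines. The function `split_by_headings` is a hypothetical helper, not part of PyMuPDF4LLM, and it assumes that every line starting with one to six `#` characters followed by whitespace is a heading (so `#` comment lines inside code blocks would be misread).

```python
import re

# A Markdown ATX heading: 1 to 6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(md_text: str) -> list[dict]:
    """Split Markdown text into chunks, one per heading, keeping the heading title."""
    chunks = []
    current = {"heading": "", "lines": []}

    def flush(chunk):
        # Keep a chunk only if it has a heading or some non-blank text.
        if chunk["heading"] or any(line.strip() for line in chunk["lines"]):
            chunks.append(
                {"heading": chunk["heading"], "text": "\n".join(chunk["lines"]).strip()}
            )

    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush(current)
    return chunks
```

Each chunk is a dictionary with a `heading` and a `text` key; text that appears before the first heading ends up in a chunk with an empty heading. The input would typically be the string returned by `pymupdf4llm.to_markdown`.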

Getting Started

To retrieve your document content as Markdown, install the package and then use a couple of lines of Python code.

Install the package via pip with:


pip install pymupdf4llm

Then in your Python script do:


import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

To save the Markdown to disk, for example as a UTF-8 encoded file, do:


import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

We invite you to further explore the documentation and integrate PyMuPDF4LLM into your projects. Transform your PDF files into Markdown with ease and precision, and take your productivity to new heights.

We are committed to continually enhancing PyMuPDF4LLM to meet the evolving needs of our developer community. Your feedback is invaluable to us as we strive to make this tool not just useful but indispensable for all your document processing needs.