Introducing PyMuPDF4LLM

Jamie Lemon·May 28, 2024

PyMuPDF4LLMReleaseRAGLLMPyMuPDF
Introducing PyMuPDF4LLM

Recently we decided to enhance our RAG/LLM solutions for PyMuPDF with a new convenience library to quickly enable typical operations for RAG. 

We wanted a library to make it trivial to extract information from PDF ( and other ) files in Python. We also wanted a library which:

  • is easy to install
  • is as simple to use as possible
  • gives the AI community exactly the API they need for RAG & LLM

Installation

Install via pip with:

pip install pymupdf4llm

PyMuPDF4LLM features

PyMuPDF4LLM is based on top of the tried and tested PyMuPDF and utilizes the library behind the scenes to achieve the following:

  • Support for multi-column pages
  • Support for image and vector graphics extraction (and inclusion of references in the MD text)
  • Support for page chunking output
  • Direct support for output as LlamaIndex Documents

Multi-column pages

The text extraction can handle document layouts with multiple columns and meaning that “newspaper” type layouts are supported. The associated Markdown output will faithfully represent the intended reading order.

Multi-column pages

Multi-column layout and reading order

Image Support

PyMuPDF4LLM will also output image files alongside the Markdown if we request write_images:

import pymupdf4llm
output = pymupdf4llm.to_markdown("input.pdf", write_images=True)

The resulting output will create a markdown text output with references to any images that may have been found in the document. The images will be saved to the location from where you have run the Python script and the markdown will have logically referenced them with the correct markdown syntax for images.

Page Chunking

We can obtain output with enriched semantic information if we request page_chunks:

import pymupdf4llm
output = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

This delivers a list of dictionary objects for each page of the document with the following schema:

  • metadata - dictionary consisting of the document’s metadata.
  • Toc_items - list of Table of Contents items pointing to the page.
  • tables - list of tables on this page.
  • images - list of images on the page.
  • graphics - list of vector graphics rectangles on the page.
  • text - page content as Markdown text.

In this way page chunking allows for more structured results for your LLM input.

LlamaIndex Documents Output

If you are using LlamaIndex for your LLM application then you are in luck! PyMuPDF4LLM has a seamless integration as follows:

import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")

With these simple 3 lines of code you will receive LLamaIndex document objects from the PDF file input for use with your LLM application!


Conclusion

We hope that you find the new library convenient and easy to use, please contact us on Discord @ #pymupdf with any questions or feedback. 

We also welcome new ideas and any issues, please contact us on our Github PyMuPDF RAG Issue board to explain and discuss.

Finally, for full documentation, including the API reference, see the: PyMuPDF4LLM documentation.

We hope you enjoy the new library!