AI Solutions for Modern Document Processing

We empower your AI solutions with fast, efficient & precise document processing and data extraction.

Reduce Hallucination
High-fidelity data extraction
Perfect for RAG & LLM environments

Empowering Innovators with AI-Driven Solutions

PyMuPDF4LLM

The SDK Developers Need for AI Pipelines

Our PyMuPDF4LLM SDK integrates seamlessly with Hugging Face, LangChain, and LlamaIndex, simplifying document processing. It handles data extraction for you, so you can focus on building AI apps.

See License

Our Extraction Features

Support for multi-column pages & tables
Support for image and vector graphics extraction, with references to the extracted files included in the Markdown text
Support for page chunking output
Direct support for output as LlamaIndex Documents
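
As a rough illustration of the page chunking and LlamaIndex features listed above, the sketch below assumes the `pymupdf4llm` package is installed and that `to_markdown(..., page_chunks=True)` returns one dictionary per page with `text` and `metadata` keys. The helper `summarize_chunks` is hypothetical and exists only for illustration; it is not part of the SDK.

```python
def extract_page_chunks(pdf_path):
    """Return one Markdown chunk (a dict) per page of the PDF."""
    # Imported lazily so summarize_chunks can be used without the package.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, page_chunks=True)


def extract_llama_documents(pdf_path):
    """Return the PDF as LlamaIndex Documents (requires llama_index)."""
    import pymupdf4llm

    return pymupdf4llm.LlamaMarkdownReader().load_data(pdf_path)


def summarize_chunks(chunks, preview_chars=80):
    """Hypothetical helper: build (page number, text preview) pairs."""
    summary = []
    for index, chunk in enumerate(chunks, start=1):
        # Assumption: the page number is stored under metadata["page"];
        # fall back to the chunk's position in the list if it is absent.
        page = chunk.get("metadata", {}).get("page", index)
        summary.append((page, chunk.get("text", "")[:preview_chars]))
    return summary
```

With a real file, `summarize_chunks(extract_page_chunks("example.pdf"))` would give a quick per-page preview before the chunks are passed on to an embedding or indexing step. The file name here is a placeholder.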

Fast, Chunked, and Reliable Data Extraction

Our data extraction is designed for speed and efficiency, delivering results in chunks with dependable per-page markdown. Get structured, accurate data without delays.

from langchain_community.document_loaders import PyMuPDFLoader

# Load the PDF file
pdf_path = "example.pdf"  # Replace with your actual PDF file
loader = PyMuPDFLoader(pdf_path)

# Load and extract document data
documents = loader.load()

# Print extracted text from each page
for i, doc in enumerate(documents):
    print(f"Page {i+1}:\n")
    print(doc.page_content[:1000])  # Print first 1000 characters

See Full Docs

Learn How Our AI Solutions Will Help You

Building a Multimodal LLM Application with PyMuPDF4LLM

Extracting text from PDFs is a crucial and often challenging step in many AI and LLM (Large Language Model) applications. High-quality text extraction improves downstream processes such as tokenization, embedding creation, and indexing in a vector database, which in turn improves the overall performance of the application. PyMuPDF is a popular library for this task due to its simplicity, high speed, and reliable text extraction quality.

RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF

In Large Language Model (LLM) and Retrieval-Augmented Generation (RAG) environments, feeding data in Markdown text format is especially important. The article discusses the considerations in detail.
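
One reason Markdown suits RAG pipelines is that its headings give natural chunk boundaries. The sketch below is a minimal illustration of that idea in plain Python, not the method used in the article; the function name `split_markdown_by_heading` is hypothetical. It ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown_text):
    """Split Markdown into (heading, body) sections, one per heading line."""
    sections = []
    heading, body = "", []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if heading or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the final section, if it has any content.
    if heading or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections
```

Each resulting section can then be embedded and indexed on its own, with the heading kept as context for retrieval.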

Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF

In this tutorial we will walk you through how to start creating your own chatbot for a web browser. We will use a variety of Python libraries, including PyMuPDF, along with your ChatGPT API key, to create a graphical user interface (GUI) that can answer a user's questions about an uploaded PDF document. We will demonstrate how to combine backend and frontend technology to deliver an effective solution for the web.

Extracting Text from Multi-Column Pages: A Practical PyMuPDF Guide

This tutorial will teach you ways to extract text from multi-column pages using PyMuPDF. Pages where text appears in multiple columns are frequently encountered in newspapers or scientific articles.
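
To show why multi-column pages need special handling, here is a deliberately naive sketch of reading-order recovery for a two-column page. It is not necessarily the approach taken in the tutorial. It assumes text blocks shaped like the tuples returned by PyMuPDF's `page.get_text("blocks")`, whose first five entries are `x0, y0, x1, y1, text`; the function name `order_two_column_blocks` is hypothetical.

```python
def order_two_column_blocks(blocks, page_width):
    """Order text blocks left column first, then right, top to bottom."""
    midpoint = page_width / 2
    left, right = [], []
    for block in blocks:
        x0, y0, x1, y1, text = block[:5]
        # Assign the block to a column by the horizontal centre of its box.
        center_x = (x0 + x1) / 2
        (left if center_x < midpoint else right).append((y0, x0, text))
    # Within each column, smaller y0 means higher on the page.
    return [text for _, _, text in sorted(left) + sorted(right)]
```

A simple midpoint split like this misplaces full-width elements such as titles or wide tables, which is one reason real pages call for a more careful column detection step.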

Unleash More With a License

Enjoy the freedom to customize, distribute, and scale without limits. Upgrade to a commercial license and make our product truly yours.

Unlimited distribution without requirements

No need to disclose your code

Technical support available

Learn More

Contact Sales

Not Just SDKs, Artifex Also Provides AI-Powered Solutions for Everyone

AI-Powered Invoice Parsing for Effortless PDF Automation

Our AI Invoice Parser API streamlines document processing for non-developers. Easily connect with Make, Zapier, and 7,000+ other tools. No coding required.