How to Manipulate Images in a PDF Using PyMuPDF

Harald Lieder·December 19, 2022

In this article

Abstract
Extracting Images
Inserting Images
Replacing or Deleting Images
- Finding the Image XREF
Reposition Images
Conclusion
Related PyMuPDF Articles

This article is part of a series on the functionality of PyMuPDF.

PyMuPDF is a Python binding for MuPDF – a lightweight PDF, XPS, eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

(Py-) MuPDF can access files in PDF, XPS, OpenXPS, CBZ, MOBI, EPUB, and FB2 (eBooks) formats, and it is known for its top performance and high rendering quality.

PyMuPDF’s homepage is on GitHub. It can be installed from PyPI via "pip install pymupdf".

Abstract

Beyond extracting and inserting text, you will often want to handle the images in a PDF in much the same way.

Image extraction: You may want to extract images, all or a selected few, that are embedded in a document and store them as conventional image files, like PNG or JPEG.
Image insertion: Or you are creating a PDF and want to insert images at certain positions alongside your text.
Image replacement: A requirement we see frequently: a PDF is too large because of its embedded images. Many are stored in color or at an unnecessarily high resolution where grayscale versions at a moderate resolution would have done the same job.
Image deletion: Maybe an image should not, or need not, be displayed at all.
Image repositioning: Even trickier: an image is shown in the wrong position, in a box that is too small, or with an incorrect rotation.

In all these situations PyMuPDF is there to help.

Extracting Images

There are at least two ways to do this:

Method 1 is available for all document types – not just PDF. Images are delivered as part of some page text extraction variants mentioned in the article Text Extraction with PyMuPDF.

A general coding pattern could be this:

import pymupdf

doc = pymupdf.open("some.file")  # open some supported document

# iterate over the pages
for page in doc:
	img_number = 0  # for enumerating images per page
	# iterate over the image blocks
	for block in page.get_text("dict")["blocks"]:
		# skip if no image block
		if block["type"] != 1:
			continue
		# build filename, like 'img17-3.jpg'
		name = f"img{page.number}-{img_number}.{block['ext']}"
		with open(name, "wb") as out:  # file is closed even if writing fails
			out.write(block["image"])  # write the binary content
		img_number += 1  # increase image counter

A lot of metadata is available in each image block, which can help you to select relevant images, avoid storing potential duplicates and more.
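
As one possible illustration of using that metadata, the sketch below skips exact duplicates by hashing each block's binary content. The helper name unique_image_blocks is hypothetical and not part of PyMuPDF. It relies only on the "type" and "image" keys used above, so it can be fed the list from page.get_text("dict")["blocks"].

```python
import hashlib


def unique_image_blocks(blocks, seen=None):
    """Yield image blocks whose binary content has not been seen before.

    `blocks` is a list like page.get_text("dict")["blocks"]; `seen` is a set
    of digests that can be shared across pages to deduplicate document-wide.
    """
    if seen is None:
        seen = set()
    for block in blocks:
        if block["type"] != 1:  # not an image block
            continue
        digest = hashlib.sha256(block["image"]).digest()
        if digest in seen:  # the same bytes were already yielded
            continue
        seen.add(digest)
        yield block
```

Passing one shared set for all pages means that an image shown repeatedly, such as a logo, is written only once. Byte-identical content is a strict criterion: re-encoded copies of the same picture still count as different images.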

Method 2 is available for PDF documents only. Extracting text or even accessing single pages is not required, because we can use PDF-specific information:

We iterate through the PDF’s object definitions and select only image objects. Because no pages are accessed, images can often be extracted even when the internal structure of the PDF is damaged. Damaged PDFs are unfortunately not rare, and the most common cause is an incomplete download.

import pymupdf

doc = pymupdf.open("some.pdf")  # open the PDF

xreflen = doc.xref_length()  # count of all objects in file
# we will iterate through all objects in the PDF and select images
for i in range(1, xreflen):  # do not access item 0 of the table
	if doc.xref_get_key(i, "Subtype")[1] != "/Image":  # check if image
		continue  # not an image, skip
	# this is an image!
	img = doc.extract_image(i)  # extract it and store its content
	# build filename, like 'img-4711.png'
	name = f"img-{i}.{img['ext']}"
	with open(name, "wb") as out:  # file is closed even if writing fails
		out.write(img["image"])  # write the binary content

For PDF documents other variations of this task are also available. We have created scripts you can choose from to achieve the best results:

extract.py is a standalone script following the above strategy. It additionally selects images by criteria such as being large enough and not unicolor.
extract-from-pages.py extracts images by page, applying selection criteria similar to those of the previous script.
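
To give an idea of what such selection criteria might look like, here is a minimal sketch of a size filter. It is not taken from the scripts above: the function name and the threshold values are hypothetical, and it only assumes the "width", "height" and "image" entries of the dictionary returned by doc.extract_image(). Detecting unicolor images is not covered.

```python
def is_worth_keeping(img, min_side=100, min_bytes=2048):
    """Decide whether an extracted image passes simple size thresholds.

    `img` is a dictionary like the one returned by doc.extract_image(xref).
    The default thresholds are arbitrary illustration values.
    """
    if min(img["width"], img["height"]) < min_side:  # tiny icon or thin line
        return False
    if len(img["image"]) < min_bytes:  # hardly any content
        return False
    return True
```

In the Method 2 loop, a check like this could sit right after the extract_image() call, skipping the file output whenever it returns False.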

Inserting Images

Do you want to enhance a PDF page by showing an image? Or put a company’s logo in the upper left corner of every page? Or add a watermark?

All this can be done with just one method of PyMuPDF’s Page class: insert_image().

The method supports input from three different sources: image files, images in memory and MuPDF’s own image format Pixmap.

An image can be inserted into a given rectangle on the page. That rectangle can be any size, and its width-height ratio can be different from that of the image.

The image will be scaled and placed such that its center and the rectangle center coincide.

An optional image rotation by 90, 180 or 270 degrees can also be chosen.
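
The scaling and centering can be worked out by hand. The sketch below is plain arithmetic, not PyMuPDF code, and assumes no rotation and that the image keeps its aspect ratio (in current PyMuPDF versions this appears to be controlled by the keep_proportion parameter, which defaults to True). The function name fitted_rect is hypothetical.

```python
def fitted_rect(rect, img_width, img_height):
    """Return (x0, y0, x1, y1) of an image fitted and centered in `rect`.

    Assumes no rotation and that the image's aspect ratio is preserved.
    """
    x0, y0, x1, y1 = rect
    box_w, box_h = x1 - x0, y1 - y0
    # largest scale factor at which the image still fits into the box
    scale = min(box_w / img_width, box_h / img_height)
    w, h = img_width * scale, img_height * scale
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center shared by box and image
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# a square image in a 200 x 100 box is limited by the box height
print(fitted_rect((0, 0, 200, 100), 100, 100))  # (50.0, 0.0, 150.0, 100.0)
```

A calculation like this helps to predict how much of the rectangle stays empty when its proportions differ from those of the image.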

A lot of care is taken to make the insertion process as fast as possible:

The method automatically keeps track of images that it has already inserted elsewhere.
In addition, you can identify an already embedded image yourself by passing its xref in the method parameters.

In both cases, just a reference to the existing image is placed into the page’s object definition.

So even the insertion of a logo on one hundred thousand pages will happen at blinding speed and minimal file size impact!

This is the important part of the method’s call pattern:

page.insert_image(
	rect,  # the desired rectangle
	filename=filename,  # image file
	stream=buffer,  # image in memory
	pixmap=pixmap,  # image from pixmap
	rotate=angle,  # 0, 90, 180, 270
	xref=0,  # >0 refers to an existing image
	# more parameters
)
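
As a sketch of how the xref parameter can be put to use for the logo scenario: insert_image() returns the xref of the embedded image, so the value from the first call can be passed to all later calls. The helper name stamp_on_all_pages is hypothetical.

```python
def stamp_on_all_pages(doc, rect, image_file):
    """Show the same image in `rect` on every page of `doc`.

    `doc` is an open PDF document, `rect` a rectangle or 4-tuple and
    `image_file` the path of an image file. Returns the image's xref.
    """
    xref = 0  # 0 means: image not embedded yet
    for page in doc:
        # The first call embeds the file. Later calls receive xref > 0
        # and only add a reference to the already embedded image.
        xref = page.insert_image(rect, filename=image_file, xref=xref)
    return xref
```

A call could look like stamp_on_all_pages(doc, (20, 20, 120, 70), "logo.png") followed by doc.save(...), where the coordinates and file name are placeholders.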

Replacing or Deleting Images

There are several use cases, including:

You have PDFs showing an outdated company logo image on each page, and you want to replace it without recreating the files.
You want to replace an image, maybe because it is in PNG format and you would rather have a JPEG version, a version with transparency, or a different colorspace (e.g., grayscale instead of color).
You want to remove an image.

You can use the following recipe to replace or remove an image embedded in a PDF.

The new image will be shown wherever the old image had been shown, i.e., on all pages using the old image. It will cover the same rectangle areas everywhere.

The rendering instructions of each page are not changed and therefore assume the same aspect ratio as before. If the new image has a different aspect ratio, it will be displayed distorted.
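
Whether a replacement would be distorted can be estimated up front by comparing the pixel dimensions of both images. The following sketch is plain arithmetic with a hypothetical function name. The old image's width and height could be taken from the page.get_images() output described further down.

```python
def aspect_deviation(old_size, new_size):
    """Return the relative difference between two width/height ratios.

    Both arguments are (width, height) pairs in pixels. A result of 0.0
    means the new image has exactly the proportions of the old one.
    """
    old_ratio = old_size[0] / old_size[1]
    new_ratio = new_size[0] / new_size[1]
    return abs(new_ratio - old_ratio) / old_ratio


print(aspect_deviation((400, 200), (300, 200)))  # 0.25
```

What deviation is still acceptable depends on the image content. A logo usually tolerates far less stretching than a background texture.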

You need the following items to perform this task:

Import the function img_replace from the file replacer.py.
You need to know the xref of the old image.
You need one page of the PDF. This will usually be a page showing the old image, but this is not required.
You need the new image. This may be a file name, a memory area containing the image or a Pixmap.

Then all you need to do is this:

import pymupdf
from replacer import img_replace

doc = pymupdf.open("your.pdf")
page = doc[nnn]  # any page of doc
# ----------------------------------------
# find out the xref of the old image
# see some hints further down
# ----------------------------------------
filename = "image.jpg"  # some image
# or equivalently a bytes object containing it
# or equivalently a pymupdf.Pixmap
img_replace(page, xref,
	filename=filename,  # or one of the following:
	stream=None,  # alternatively
	pixmap=None,  # alternatively
)
doc.save("your-new.pdf", garbage=4)  # save changed PDF

To remove the image at xref, make the following changes in the snippet above:

pix = pymupdf.Pixmap(pymupdf.csGRAY, (0, 0, 1, 1), 1)  # 1 x 1 pixmap with alpha channel
pix.clear_with()  # set all bytes to zero, making it fully transparent
img_replace(page, xref,
	pixmap=pix,
)

Here, we have constructed a small (1 x 1) transparent pixmap and used it to replace the original image. This pixmap is not visible and thus has the same effect as if no image were shown. Overall, it needs less than 200 bytes in the file.
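
Newer PyMuPDF releases appear to offer both operations directly on the page object, as Page.replace_image() and Page.delete_image() (to our knowledge since version 1.21.0). The sketch below assumes such a version. The wrapper name replace_or_delete is hypothetical, so please check the documentation of your installed version before relying on it.

```python
def replace_or_delete(page, xref, new_image=None):
    """Replace the image at `xref` with an image file, or delete it.

    Assumes a PyMuPDF version that provides Page.replace_image() and
    Page.delete_image(). `new_image` is a file path, or None to delete.
    """
    if new_image is None:
        page.delete_image(xref)  # the image is no longer shown anywhere
    else:
        page.replace_image(xref, filename=new_image)
```

As with the recipe above, the change affects every page that shows the image, and the document still has to be saved afterwards.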

Finding the Image XREF

So how can you find the cross reference number (xref) of the image?

Create a list of images shown on the page like this:

In [1]: import pymupdf
In [2]: doc = pymupdf.open("original.pdf")
In [3]: page = doc[0]
In [4]: page.get_images()
Out[4]: [(46, 0, 439, 501, 8, 'DeviceRGB', '', 'fzImg0', 'FlateDecode')]

Things are easy here: there is only one image on that page, and its xref is 46. If there are multiple images, display each image’s location on the page (its bounding box, or bbox) like this:

In [5]: for item in page.get_images():
   ...:	 xref = item[0]
   ...:	 print(xref, page.get_image_rects(xref))
   ...:
46 [Rect(200.00001525878906, 240.63067626953125,
		 497.6600341796875, 580.3292236328125)]

You probably know where on the page the image is shown and can thus find the right xref.
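
If you know a point that lies inside the image, the lookup can also be automated. The sketch below uses hypothetical names (ImageInfo, images_at). It assumes the nine-item tuples shown above and uses only the get_images() and get_image_rects() calls from this section.

```python
from collections import namedtuple

# Field order of the tuples returned by page.get_images()
ImageInfo = namedtuple(
    "ImageInfo",
    "xref smask width height bpc colorspace alt_colorspace name filter",
)


def images_at(page, x, y):
    """Return ImageInfo entries of all images whose bbox contains (x, y)."""
    hits = []
    for item in page.get_images():
        info = ImageInfo(*item[:9])
        for rect in page.get_image_rects(info.xref):
            if rect.x0 <= x <= rect.x1 and rect.y0 <= y <= rect.y1:
                hits.append(info)
                break  # one matching location is enough for this image
    return hits
```

For the example page above, images_at(page, 300, 300) should return a single entry with xref 46, because that point lies inside the displayed rectangle.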

Reposition Images

When inserting an image, it is sometimes difficult to foresee just how it will look on the page:

Is it positioned correctly? Is the rectangle large enough? Are there overlaps with other content?

We have developed a GUI script that may help you here.

To use it, you must install the latest version of wxPython (python -m pip install wxpython). Then simply start it.

It will let you select a PDF and you are ready to go! Flip through pages, select any existing images, or insert new ones choosing from image files.

You can move images around on the page, change the size of the display rectangle, or delete an image entirely.

Here is a visual impression of the interface.

Conclusion

This post provides you with an overview of PyMuPDF’s image handling capabilities.

Images can be extracted from all document types that are supported by MuPDF. This includes PDF, XPS, EPUB, MOBI, internet formats like HTML and XML, and several others.

Image insertion is possible for PDF files.

Tools exist that will help replace, remove, or reposition images in existing PDF documents.

PyMuPDF is a large, full-featured document-handling Python package. Apart from its superior performance and top rendering quality, it is also known for its excellent documentation: the PDF version today has over 420 pages in Letter format, more than 70 of which are devoted to recipes in How-To format. It is certainly a worthwhile read.

Another knowledge source is the utilities repository. Whatever you plan to do when dealing with PDFs: you will probably find some example script there that gives you a start.

If you have questions about PyMuPDF, you can reach the devs on the #pymupdf Discord channel.