MuPDF API Overview

Basic MuPDF Usage Example
Common Function Arguments
Error Handling
Multithreading
Cloning the Context
Coding Style
Progressive Loading

Basic MuPDF Usage Example

For an example of how to use MuPDF in the most basic way, see docs/examples/example.c. To limit the complexity and give an easier introduction, this code has no error handling at all, but any serious piece of code using MuPDF should use the error handling strategies described below.

 
 

Common Function Arguments

Most functions in MuPDF’s interface take a context argument.

A context contains global state used by MuPDF inside functions when parsing or rendering pages of the document. It contains, for example:

  • an exception stack (see error handling below),
  • a memory allocator (allowing for custom allocators),
  • a resource store (for caching of images, fonts, etc.),
  • a set of locks and (un-)locking functions (for multi-threading).

Other functions in MuPDF’s interface take arguments such as document, stream, and device which contain state for each type of object. Those arguments each have a reference to a context and therefore act as proxies for a context.

Without the set of locks and accompanying functions, the context and its proxies may only be used in a single-threaded application.

 
 

Error Handling

MuPDF uses a set of exception handling macros to simplify error return and clean up. Conceptually, they work a lot like C++’s try/catch system but do not require any special compiler support.

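The basic formulation is fz_try(ctx) { … } fz_always(ctx) { … } fz_catch(ctx) { … }: the fz_try block holds the code that may throw, the fz_always block runs whether or not an exception was thrown, and the fz_catch block runs only if one was.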

The fz_always block is optional, and can safely be omitted.

The macro-based nature of this system has 3 main limitations:

  1. Never return from within fz_try (or ‘goto’ or longjmp out of it). This upsets the internal housekeeping of the macros and will cause problems later on. The code will detect such things happening, but by then it is too late to give a helpful error report as to where the original infraction occurred.
  2. The fz_try(ctx) { … } fz_always(ctx) { … } fz_catch(ctx) { … } construct is not one atomic C statement. That is to say, if you write if (condition) fz_try(ctx) { … } fz_catch(ctx) { … } without braces around the whole construct, you will not get what you want. Use if (condition) { fz_try(ctx) { … } fz_catch(ctx) { … } } instead.
  3. The macros are implemented using setjmp and longjmp, and so the standard C restrictions on the use of those functions apply to fz_try/fz_catch too. In particular, any “truly local” variable that is set between the start of fz_try and something in fz_try throwing an exception may become undefined as part of the process of throwing that exception.
    As a way of mitigating this problem, we provide a fz_var() macro that tells the compiler to ensure that that variable is not unset by the act of throwing the exception.
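
The setjmp/longjmp limitation above is plain C behaviour and can be reproduced without MuPDF. The following is a minimal sketch that uses setjmp and longjmp directly. The names toy_handler, toy_throw and toy_allocate_then_fail are hypothetical and exist only for this illustration, and the volatile qualifier merely plays the role that fz_var() plays in MuPDF code; it is not a description of how fz_var() is implemented.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical stand-in for an exception stack: a single saved jump target. */
static jmp_buf toy_handler;
static int toy_error_code;

/* Hypothetical stand-in for a throwing call: records a code and jumps back. */
static void toy_throw(int code)
{
	assert(code != 0);
	toy_error_code = code;
	longjmp(toy_handler, 1);
}

/*
 * Allocates a buffer inside the "try" part and then fails.
 * Returns the error code seen by the handler (0 if nothing was thrown).
 * *freed is set to 1 if the cleanup code saw and released the buffer.
 */
int toy_allocate_then_fail(int *freed)
{
	/*
	 * buf is a local that is modified after setjmp and read after longjmp.
	 * Without volatile its value would be indeterminate at the cleanup step.
	 */
	char *volatile buf = NULL;

	assert(freed != NULL);
	*freed = 0;
	toy_error_code = 0;

	if (setjmp(toy_handler) == 0)
	{
		/* "try" part */
		buf = malloc(16);
		toy_throw(42);
	}

	/* cleanup part: reached after the jump */
	if (buf != NULL)
	{
		free(buf);
		*freed = 1;
	}
	return toy_error_code;
}
```

Calling toy_allocate_then_fail reports the thrown code and releases the buffer. Removing the volatile qualifier would make the cleanup depend on whether the compiler happened to keep buf in memory.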

A model piece of code using these macros might build a house: it calls make_tiles to obtain tiles t before fz_try starts, then inside fz_try it makes a building material, builds walls w and a roof r from it, and combines them into the result, with cleanup performed in fz_always.

Things to note about this:

  1. If make_tiles throws an exception, this will immediately be handled by some higher level exception handler. If it succeeds, t will be set before fz_try starts, so there is no need to call fz_var(t).
  2. We first try to make some bricks as our building material. If this fails, we fall back to straw. If that also fails, we’ll end up in the fz_catch, and the process will fail neatly.
  3. We assume in this code that combine takes a new reference to both the walls and the roof it uses, and therefore that w and r need to be cleaned up in all cases.
  4. We assume the standard C convention that it is safe to destroy NULL things.
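
The control flow described in these notes can be sketched in plain C, with setjmp and longjmp standing in for the MuPDF macros. This is an illustration only, not MuPDF code: the exception stack, toy_throw, make_bricks, make_straw, drop_thing and the integer “objects” are all invented for the example, and the outer handler simply returns a failure value where real code might pass the exception on.

```c
#include <assert.h>
#include <setjmp.h>

enum { NOTHING = 0, BRICKS = 1, STRAW = 2, TILES = 3 };

/* Hypothetical exception stack: one saved jump target per nesting level. */
#define TOY_MAX_DEPTH 8
static jmp_buf toy_stack[TOY_MAX_DEPTH];
static int toy_depth;
static int dropped;

/* Pops the innermost handler and jumps to it. */
static void toy_throw(void)
{
	assert(toy_depth > 0);
	toy_depth--;
	longjmp(toy_stack[toy_depth], 1);
}

static int make_tiles(void) { return TILES; }
static int make_bricks(int available) { if (!available) toy_throw(); return BRICKS; }
static int make_straw(int available) { if (!available) toy_throw(); return STRAW; }

/* Dropping NOTHING is allowed and does nothing (note 4). */
static void drop_thing(int thing) { if (thing != NOTHING) dropped++; }

/* Returns the material the house was built from, or NOTHING on failure. */
int build_house(int have_bricks, int have_straw)
{
	/* Set before the outer "try" and never changed: no volatile needed (note 1). */
	int tiles = make_tiles();
	/* Modified inside the "try", read afterwards: must be volatile. */
	volatile int material = NOTHING;
	volatile int house = NOTHING;

	toy_depth++;
	if (setjmp(toy_stack[toy_depth - 1]) == 0)
	{
		/* outer "try" */
		toy_depth++;
		if (setjmp(toy_stack[toy_depth - 1]) == 0)
		{
			/* inner "try" */
			material = make_bricks(have_bricks);
			toy_depth--;
		}
		else
		{
			/* inner "catch": fall back to straw; a failure here reaches the outer handler (note 2) */
			material = make_straw(have_straw);
		}
		house = material;
		toy_depth--;
	}

	/* "always" part: runs whether or not anything was thrown (note 3) */
	drop_thing(material);
	drop_thing(tiles);
	return house;
}
```

With bricks available the house is built from bricks, without bricks it falls back to straw, and with neither the function returns NOTHING after still releasing the tiles.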


 
 

Multithreading

First off, study the basic usage example in docs/examples/example.c and make sure you understand how it works as the data structures manipulated there will be referred to in this section too.

MuPDF can usefully be built into a multithreaded application without the library needing to know anything about threading at all. If the library opens a document in one thread and then sits there as a ‘server’ requesting pages and rendering them for other threads that need them, then the library is only ever being called from this one thread.

Other threads can still be used to handle UI requests etc, but as far as MuPDF is concerned it is only being used in a single threaded way. In this instance, there are no threading issues with MuPDF at all, and it can safely be used without any locking, as described in the previous sections.

This section explains how to use MuPDF in the more complex case, where we genuinely want to call the MuPDF library concurrently from multiple threads within a single application.

MuPDF can be invoked with a user-supplied set of locking functions. It uses these to take mutexes around operations that would conflict if performed concurrently in multiple threads. By leaving the exact implementation of locks to the caller, MuPDF remains threading library agnostic.

The following simple rules should be followed to ensure that multi-threaded operations run smoothly:

  1. “No simultaneous calls to MuPDF in different threads are allowed to use the same context.”
    Most of the time it is simplest to just use a different context for every thread; just create a new context at the same time as you create the thread. For more details see “Cloning the context” below.
  2. “No simultaneous calls to MuPDF in different threads are allowed to use the same document.”
    Only one thread can be accessing a document at a time, but once display lists are created from that document, multiple threads at a time can operate on them.
    The document can be used from several different threads as long as there are safeguards in place to prevent the usages being simultaneous.
  3. “No simultaneous calls to MuPDF in different threads are allowed to use the same device.”
    Calling a device simultaneously from different threads will cause it to get confused and may crash. Calling a device from several different threads is perfectly acceptable as long as there are safeguards in place to prevent the calls being simultaneous.

So, how does a multi-threaded example differ from a non-multithreaded one?

Firstly, when we create the first context, we call fz_new_context as before, but the second argument should be a pointer to a set of locking functions.

The calling code should provide FZ_LOCK_MAX mutexes, which will be locked/unlocked by MuPDF calling the lock/unlock function pointers in the supplied structure with the user pointer from the structure and the lock number, i (0 <= i < FZ_LOCK_MAX). These mutexes can safely be recursive or non-recursive as MuPDF only calls in a non-recursive style.
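
As an illustration of the shape such a set of locking functions might take, here is a minimal sketch built on POSIX threads. It is self-contained and does not include the MuPDF headers, so toy_locks is a hypothetical mirror of the structure described above (a user pointer plus lock and unlock function pointers), and TOY_LOCK_MAX is an assumed stand-in for FZ_LOCK_MAX, whose real value comes from the library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Assumption: stand-in for FZ_LOCK_MAX; the real value is defined by MuPDF. */
#define TOY_LOCK_MAX 4

/* Hypothetical mirror of the structure described in the text. */
typedef struct
{
	void *user;
	void (*lock)(void *user, int lock);
	void (*unlock)(void *user, int lock);
} toy_locks;

static pthread_mutex_t toy_mutexes[TOY_LOCK_MAX];

/* Called with the user pointer and a lock number i, 0 <= i < TOY_LOCK_MAX. */
static void toy_lock(void *user, int i)
{
	pthread_mutex_t *mutexes = user;
	assert(i >= 0 && i < TOY_LOCK_MAX);
	if (pthread_mutex_lock(&mutexes[i]) != 0)
		abort();
}

static void toy_unlock(void *user, int i)
{
	pthread_mutex_t *mutexes = user;
	assert(i >= 0 && i < TOY_LOCK_MAX);
	if (pthread_mutex_unlock(&mutexes[i]) != 0)
		abort();
}

/* Initialises the mutexes and fills in the structure. Returns 0 on success. */
int toy_locks_init(toy_locks *locks)
{
	int i;

	for (i = 0; i < TOY_LOCK_MAX; i++)
	{
		if (pthread_mutex_init(&toy_mutexes[i], NULL) != 0)
		{
			/* undo the mutexes that were already created */
			while (i-- > 0)
				pthread_mutex_destroy(&toy_mutexes[i]);
			return -1;
		}
	}
	locks->user = toy_mutexes;
	locks->lock = toy_lock;
	locks->unlock = toy_unlock;
	return 0;
}

/* Destroys the mutexes once no context is using them any more. */
void toy_locks_fini(void)
{
	int i;

	for (i = 0; i < TOY_LOCK_MAX; i++)
		pthread_mutex_destroy(&toy_mutexes[i]);
}
```

In a real application the equivalent structure from the MuPDF headers would be filled in the same way and passed as the second argument to fz_new_context. Default (non-recursive) mutexes are enough here because, as noted above, MuPDF only calls in a non-recursive style.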

To make subsequent contexts, the user should NOT call fz_new_context again (as this will fail to share important resources such as the store and glyph cache), but should rather call fz_clone_context. Each of these cloned contexts can be freed by fz_drop_context (named fz_free_context in older releases) as usual. They will share the important data structures (like store, glyph cache etc) with the original context, but will have their own exception stacks.

To open a document, call fz_open_document as usual, passing a context and a filename. It is important to realize that only one thread at a time can be accessing the document itself.

This means that only one thread at a time can perform operations such as fetching a page or rendering that page to a display list. Once a display list has been obtained, however, it can be rendered from any other thread (or even from several threads simultaneously, giving banded rendering).

This means that an implementer has 2 basic choices when constructing an application to use MuPDF in multi-threaded mode: either construct it so that a single nominated thread opens the document and then acts as a ‘server’ creating display lists for other threads to render, or add a mutex around all calls to MuPDF that use the document. The former is likely to be far more efficient in the long run.

For an example of how to do multithreading see docs/examples/multi-threaded.c which has a main thread and one rendering thread per page.

 
 

Cloning the Context

As described above, every context contains an exception stack which is manipulated during the course of nested fz_try/fz_catches. For obvious reasons, the same exception stack cannot be used from more than one thread at a time.

If, however, we simply created a new context (using fz_new_context) for every thread, we would end up with separate stores/glyph caches etc, which is not (generally) what is desired. MuPDF, therefore, provides a mechanism for “cloning” a context. This creates a new context that shares everything with the given context, except for the exception stack.

A commonly used general scheme is, therefore, to create a ‘base’ context at program start up, and to clone this repeatedly to get new contexts that can be used on new threads.
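
The idea of “shares everything except the exception stack” can be pictured with a small model. The sketch below is not how MuPDF implements contexts: toy_context, toy_store and the three functions are hypothetical, the store is reduced to a reference count, and an integer stands in for the per-context exception stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared resource store, reduced to its reference count. */
typedef struct
{
	int refs;
} toy_store;

/* Hypothetical context: a shared store plus private error state. */
typedef struct
{
	toy_store *store;
	int error_depth;
} toy_context;

/* Creates a base context with a fresh store. Returns NULL on failure. */
toy_context *toy_new_context(void)
{
	toy_context *ctx = malloc(sizeof *ctx);

	if (ctx == NULL)
		return NULL;
	ctx->store = malloc(sizeof *ctx->store);
	if (ctx->store == NULL)
	{
		free(ctx);
		return NULL;
	}
	ctx->store->refs = 1;
	ctx->error_depth = 0;
	return ctx;
}

/* Creates a context that shares the store but has its own error state. */
toy_context *toy_clone_context(toy_context *base)
{
	toy_context *ctx;

	assert(base != NULL);
	ctx = malloc(sizeof *ctx);
	if (ctx == NULL)
		return NULL;
	ctx->store = base->store;
	/* In a threaded program this increment would have to happen under a lock. */
	ctx->store->refs++;
	ctx->error_depth = 0;
	return ctx;
}

/* Releases a context; the store goes away with its last user. */
void toy_drop_context(toy_context *ctx)
{
	if (ctx == NULL)
		return;
	if (--ctx->store->refs == 0)
		free(ctx->store);
	free(ctx);
}
```

A base context created at start up and cloned once per worker thread then gives every thread the same store while keeping each thread’s error state separate.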

 
 

Coding Style

Names

Functions should be named according to one of the following schemes:

  • verb_noun
  • verb_noun_with_noun
  • noun_attribute
  • set_noun_attribute
  • noun_from_noun – convert from one type to another (avoid noun_to_noun)

Prefixes are mandatory for exported functions, macros, enums, globals, and types.

  • fz for common code
  • pdf, xps, etc., for interpreter specific code

Prefixes are optional (but encouraged) for private functions and types.

Avoid using ‘get’ as this is a meaningless and redundant filler word.

These words are reserved for reference counting schemes:

  • new, find, load, open, keep – return objects that you are responsible for freeing.
  • drop – relinquish ownership of the object passed in.

When searching for an object or value, the name used depends on whether returning the value is passing ownership:

  • lookup – return a value or borrowed pointer
  • find – return an object that the caller is responsible for freeing

Types

Various different integer types are used throughout MuPDF.

In general:

  • int is assumed to be 32bit at least.
  • short is assumed to be exactly 16 bits.
  • char is assumed to be exactly 8 bits.
  • array sizes, string lengths, and allocations are measured using size_t. size_t is 32bit in 32bit builds, and 64bit on all 64bit builds.
  • buffers of data use unsigned chars (or uint8_t).
  • Offsets within files/streams are represented using fz_off_t. fz_off_t is 64bits in 64bit builds, or in 32bit builds with FZ_LARGEFILE defined. Otherwise, it is a native int (so 32bit in 32bit builds).

In addition, we use floats (and avoid doubles when possible), assumed to be IEEE compliant.

Reference counting

Reference counting uses special words in functions to make it easy to remember and follow the rules.

Words that take ownership: new, find, load, open, keep.

Words that release ownership: drop.

If an object is returned by a function with one of the special words that take ownership, you are responsible for freeing it by calling “drop”, “free”, or “close” before you return. You may pass ownership of an owned object by returning it only if you name the function using one of the special words.

Any objects returned by functions that do not have any of these special words, are borrowed and have a limited lifetime. Do not hold on to them past the duration of the current function, or stow them away inside structs. If you need to keep the object for longer than that, you have to either “keep” it or make your own copy.
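
The naming rules above can be illustrated with a toy reference-counted object. Everything in this sketch is hypothetical (toy_font, toy_page and their functions are not MuPDF types); it only shows how “new”, “keep” and “drop” pair up, and how a pointer returned by a noun_attribute style function is merely borrowed.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
	int refs;
	const char *name;
} toy_font;

typedef struct
{
	toy_font *font;
} toy_page;

/* Number of fonts currently allocated, so that leaks are visible. */
static int toy_live_fonts;

/* "new": the caller owns the returned object and must drop it. */
toy_font *toy_new_font(const char *name)
{
	toy_font *font = malloc(sizeof *font);

	if (font == NULL)
		return NULL;
	font->refs = 1;
	font->name = name;
	toy_live_fonts++;
	return font;
}

/* "keep": the caller gains a reference of its own and must drop it. */
toy_font *toy_keep_font(toy_font *font)
{
	if (font != NULL)
		font->refs++;
	return font;
}

/* "drop": gives up one reference; dropping NULL is allowed. */
void toy_drop_font(toy_font *font)
{
	if (font == NULL)
		return;
	assert(font->refs > 0);
	if (--font->refs == 0)
	{
		toy_live_fonts--;
		free(font);
	}
}

/* No special word in the name: the result is borrowed from the page. */
toy_font *toy_page_font(toy_page *page)
{
	return page->font;
}
```

A caller that wants the font to outlive the page would write toy_keep_font(toy_page_font(&page)) and later balance it with toy_drop_font.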

 
 

Progressive Loading

What is progressive loading?

The idea of progressive loading is that as you download a PDF file into a browser, you can display the pages as they appear.

MuPDF can make use of 2 different mechanisms to achieve this. The first relies on the file being “linearized”, the second relies on the caller of MuPDF having fine control over the http fetch and on the server supporting byte-range fetches.

For optimum performance, a file should be both linearized and be available over a byte-range supporting link, but benefits can still be had with either one of these alone.

Progressive download using “linearized” files

Adobe defines “linearized” PDFs as being ones that have both a specific layout of objects and a small amount of extra information to help avoid seeking within a file. The stated aim is to deliver the first page of a document in advance of the whole document downloading, whereupon subsequent pages will become available. Adobe also refers to these as “Optimized for fast web view” or “Web Optimized”.

In fact, the standard outlines (poorly) a mechanism by which ‘hints’ can be included that enable the subsequent pages to be found within the file too. Unfortunately, this is very poorly supported by many tools, and so the hints have to be treated with suspicion.

MuPDF will attempt to use hints if they are available, but will also use a linear search of the file to discover pages if not. This means that the first page will be displayed quickly, and then subsequent ones will appear with ‘incomplete’ renderings that improve over time as more and more resources are gradually delivered.

Essentially the file starts with a slightly modified header, and the first object in the file is a special one (the linearization object) that a) indicates that the file is linearized, and b) gives some useful information (like the number of pages in the file etc).

This object is then followed by all the objects required for the first page, then the “hint stream”, then the set of objects for each subsequent page in turn, then the shared objects required for those pages, then various other random things.

[Yes, really. While page 1 is sent with all the objects that it uses, shared or otherwise, subsequent pages do not get shared resources until after all the unshared page objects have been sent.]

The Hint Stream

Adobe intended the hint stream to facilitate the display of subsequent pages, but has never used it itself. Consequently, you can’t trust people to write it properly – indeed, Adobe outputs something that doesn’t quite conform to the spec.

As a result, very few people actually use it. MuPDF will use it after sanity checking the values, and should cope with illegal or incorrect values.

So how does MuPDF handle progressive loading?

MuPDF has made various extensions to its mechanisms for handling progressive loading.

Progressive streams

At its lowest level, MuPDF reads file data from a fz_stream, using the fz_open_document_with_stream call. (fz_open_document is implemented by calling this). We have extended the fz_stream slightly, giving the system a way to ask for meta information (or perform meta operations) on a stream.

Using this mechanism MuPDF can query:

  • whether a stream is progressive or not (i.e. whether the entire stream is accessible immediately),
  • what the length of a stream should ultimately be (which an http fetcher should know from the Content-Length header).

With this information, MuPDF can decide whether to use its normal object reading code, or whether to make use of a linearized object. Knowing the length enables us to check the length value given in the linearized object – if these differ, the assumption is that an incremental save has taken place, thus the file is no longer linearized.

When data is pulled from a progressive stream, if we attempt to read data that is not currently available, the stream should throw a FZ_ERROR_TRYLATER error. This particular error code will be interpreted by the caller as an indication that it should retry the parsing of the current objects at a later time.

When a MuPDF call is made on a progressive stream, such as fz_open_document_with_stream, or fz_load_page, the caller should be prepared to handle a FZ_ERROR_TRYLATER error as meaning that more data is required before it can continue. No indication is directly given as to exactly how much more data is required, but as the caller will be implementing the progressive fz_stream that it has passed into MuPDF to start with, it can reasonably be expected to figure out an estimate for itself.

Cookie

Once a page has been loaded, if its contents are to be ‘run’ as normal (using e.g. fz_run_page) any error (such as failing to read a font, or an image, or even a content stream belonging to the page) will result in a rendering that aborts with an FZ_ERROR_TRYLATER error. The caller can catch this and display a placeholder instead.

If each page’s data was entirely self-contained and sent in sequence this would perhaps be acceptable, with each page appearing one after the other. Unfortunately, the linearization procedure as laid down by Adobe does NOT do this: objects shared between multiple pages (other than the first) are not sent with the pages themselves, but rather AFTER all the pages have been sent.

This means that a document that has a title page, then contents that share a font used on pages 2 onwards, will not be able to correctly display page 2 until after the font has arrived in the file, which will not be until all the page data has been sent.

To mitigate this, MuPDF provides a way for callers to indicate that they are prepared to accept an ‘incomplete’ rendering of the file (perhaps with missing images, or with substitute fonts).

Callers prepared to tolerate such renderings should set the ‘incomplete_ok’ flag in the cookie, then call fz_run_page etc as normal. If a FZ_ERROR_TRYLATER error is thrown at any point during the page rendering, the error will be swallowed, the ‘incomplete’ field in the cookie will become non-zero and rendering will continue. When control returns to the caller the caller can check the value of the ‘incomplete’ field and know that the rendering it received is not authoritative.
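
The behaviour of the two cookie fields can be modelled in a few lines. This is a sketch of the logic only: toy_cookie contains just the two fields discussed above, toy_run_page and the TOY_ constants are hypothetical, and a “page” is reduced to an array saying which of its resources have arrived.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_OK = 0, TOY_TRYLATER = 1 };

/* Hypothetical cookie holding only the two fields discussed in the text. */
typedef struct
{
	int incomplete_ok;
	int incomplete;
} toy_cookie;

/*
 * Pretends to render a page made of n resources. available[i] is non-zero
 * when resource i has arrived. *drawn receives the number of resources
 * that were rendered. Returns TOY_OK, or TOY_TRYLATER when a resource is
 * missing and the caller has not opted in to incomplete renderings.
 */
int toy_run_page(const int *available, int n, toy_cookie *cookie, int *drawn)
{
	int i;

	assert(available != NULL && drawn != NULL);
	*drawn = 0;
	for (i = 0; i < n; i++)
	{
		if (!available[i])
		{
			/* the rendering is aborted unless the caller opted in */
			if (cookie == NULL || !cookie->incomplete_ok)
				return TOY_TRYLATER;
			/* swallow the error, note it, and carry on */
			cookie->incomplete++;
			continue;
		}
		(*drawn)++;
	}
	return TOY_OK;
}
```

After a call that returns TOY_OK, a non-zero incomplete field tells the caller that the result should be redrawn once more data has arrived.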

Progressive loading using byte range requests

If the caller has control over the http fetch, then it is possible to use byte range requests to fetch the document ‘out of order’. This enables non-linearized files to be progressively displayed as they download, and fetches complete renderings of pages earlier than would otherwise be the case. This process requires no changes within MuPDF itself, but rather in the way the progressive stream learns from the attempts MuPDF makes to fetch data.

Consider, for example, an attempt to fetch a hypothetical file from a server.

  • The initial http request for the document is sent with a “Range:” header to pull down the first (say) 4k of the file.
  • As soon as we get the header in from this initial request, we can respond to meta stream operations to give the length, and whether byte requests are accepted.
    • If the header indicates that byte ranges are acceptable the stream proceeds to go into a loop fetching chunks of the file at a time (not necessarily in order). Otherwise, the server will ignore the Range: header, and just serve the whole file.
    • If the header indicates a content-length, the stream returns that.
  • MuPDF can then decide how to proceed based upon these flags and whether the file is linearized or not. (If the file contains a linearized object and the content length matches, then the file is considered to be linear, otherwise, it is not).

If the file is linear:

  • We proceed to read objects out of the file as it downloads. This will provide us the first page and all its resources. It will also enable us to read the hint streams (if present).
  • Once we have read the hint streams, we unpack (and sanity check) them to give us a map of where in the file each object is predicted to live, and which objects are required for each page. If any of these values are out of range, we treat the file as if there were no hint streams.
  • If we have hints, any attempt to load a subsequent page will cause MuPDF to attempt to read exactly the objects required. This will cause a sequence of seeks in the fz_stream followed by reads. If the stream does not have the data to satisfy that request yet, the stream code should remember the location that was fetched (and fetch that block in the background so that future retries will succeed) and should raise an FZ_ERROR_TRYLATER error.
  • [Typically therefore when we jump to a page in a linear file on a byte request capable link, we will quickly see a rough rendering, which will improve fairly fast as images and fonts arrive.]
  • Regardless of whether we have hints or byte requests, on every fz_load_page call, MuPDF will attempt to process more of the stream (that is assumed to be being downloaded in the background). As linearized files are guaranteed to have pages in order, pages will gradually become available. In the absence of byte requests and hints, however, we have no way of getting resources early, so the renderings for these pages will remain incomplete until much more of the file has arrived.
  • [Typically therefore when we jump to a page in a linear file on a non byte request capable link, we will see a rough rendering of that page as soon as data arrives for it (which will typically take much longer than would be the case with byte range capable downloads), and that will improve much more slowly as images and fonts may not appear until almost the whole file has arrived.]
  • When the whole file has arrived, then we will attempt to read the outlines for the file.

For a nonlinearized PDF on a byte request capable stream:

  • MuPDF will immediately seek to the end of the file to attempt to read the trailer. This will fail with a FZ_ERROR_TRYLATER due to the data not being here yet, but the stream code should remember that this data is required and it should be prioritized in the background fetch process.
  • Repeated attempts to open the stream should therefore eventually succeed. As MuPDF jumps through the file trying to read first the xrefs, then the page tree objects, then the page contents themselves etc, the background fetching process will be driven by the attempts to read the file in the foreground.
  • [Typically, therefore, the opening of a nonlinearized file will be slower than a linearized one, as the xrefs/page trees for a nonlinear file can be 20%+ of the file data. Once past this initial point, however, pages and data can be pulled from the file almost as fast as with a linearized file.]

For a nonlinearized PDF on a non-byte request capable stream:

  • MuPDF will immediately seek to the end of the file to attempt to read the trailer. This will fail with a FZ_ERROR_TRYLATER due to the data not being here yet. Subsequent retries will continue to fail until the whole file has arrived, whereupon the whole file will be instantly available.
  • [This is the worst case situation – nothing at all can be displayed until the entire file has downloaded.]

A typical structure for a fetcher process (see curl-stream.c in mupdf-curl as an example) might, therefore, look like this:

  • We consider the file as an (initially empty) buffer which we are filling by making requests. In order to ensure that we make maximum use of our download link, we ensure that whenever one request finishes, we immediately launch another. Further, to avoid the overheads for the request/response headers being too large, we may want to divide the file into ‘chunks’, perhaps 4k or 32k in size.
  • We can then have a receiver process that sits there in a loop requesting chunks to fill this buffer. In the absence of any other impetus, the receiver should request the next ‘chunk’ of data from the file that it does not yet have, following the last fill point. Initially, we start the fill point at the beginning of the file, but this will move around based on the requests made of the progressive stream.
  • Whenever MuPDF attempts to read from the stream, we check to see if we have data for this area of the file already. If we do, we can return it. If not, we remember this as the next “fill point” for our receiver process and throw a FZ_ERROR_TRYLATER error.
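
The fetcher structure described above can be sketched as follows. This is a simplified model rather than the code in curl-stream.c: all names are hypothetical, an in-memory array stands in for the server, the chunk size is tiny so that the behaviour is easy to trace, and wrapping round to fill earlier gaps once the end of the file is reached is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CHUNK 4
#define TOY_MAX_CHUNKS 16

enum { TOY_OK = 0, TOY_TRYLATER = 1 };

typedef struct
{
	const unsigned char *remote; /* stands in for the file on the server */
	size_t length;               /* total length, e.g. from Content-Length */
	unsigned char data[TOY_CHUNK * TOY_MAX_CHUNKS];
	unsigned char have[TOY_MAX_CHUNKS]; /* non-zero once a chunk has arrived */
	size_t fill_point;           /* chunk the receiver should try next */
} toy_fetcher;

void toy_fetcher_init(toy_fetcher *f, const unsigned char *remote, size_t length)
{
	memset(f, 0, sizeof *f);
	assert(length <= sizeof f->data);
	f->remote = remote;
	f->length = length;
}

/* Receiver step: fetch the first missing chunk at or after the fill point.
 * Returns 1 if a chunk was fetched, 0 once the whole file has arrived. */
int toy_receive_one(toy_fetcher *f)
{
	size_t n = (f->length + TOY_CHUNK - 1) / TOY_CHUNK;
	size_t k;

	for (k = 0; k < n; k++)
	{
		size_t i = (f->fill_point + k) % n;
		if (!f->have[i])
		{
			size_t off = i * TOY_CHUNK;
			size_t left = f->length - off;
			size_t len = left < TOY_CHUNK ? left : TOY_CHUNK;
			/* a real fetcher would issue a byte range request here */
			memcpy(f->data + off, f->remote + off, len);
			f->have[i] = 1;
			f->fill_point = (i + 1) % n;
			return 1;
		}
	}
	return 0;
}

/* Stream read: copies the bytes if every chunk they touch has arrived.
 * Otherwise moves the fill point to the first missing chunk and reports
 * TOY_TRYLATER. */
int toy_read(toy_fetcher *f, size_t offset, unsigned char *out, size_t len)
{
	size_t first, last, i;

	assert(len > 0 && offset < f->length && len <= f->length - offset);
	first = offset / TOY_CHUNK;
	last = (offset + len - 1) / TOY_CHUNK;
	for (i = first; i <= last; i++)
	{
		if (!f->have[i])
		{
			f->fill_point = i;
			return TOY_TRYLATER;
		}
	}
	memcpy(out, f->data + offset, len);
	return TOY_OK;
}
```

A read of the last few bytes of an empty buffer fails with TOY_TRYLATER and moves the fill point to the final chunk, so the next receiver step fetches exactly the data that a retry of the same read needs.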