An Improved MuPDF API Using C++

Julian Smith · November 30, 2022


In this article

Overview
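No `fz_context *ctx` function arguments
MuPDF `fz_try/fz_catch` exceptions are converted into C++ exceptions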
Automatic reference counting
- Class wrappers
- Class-aware wrapper functions
Putting it all together

Overview

MuPDF probably goes as far as it is possible to go in providing a consistent, thread-safe C API that behaves the same way across multiple operating systems.

But C++ offers ways to create abstractions beyond what is possible in C. So we've recently added an auto-generated MuPDF C++ API which abstracts away some of the details of the MuPDF C API:

- No fz_context *ctx function arguments.
- MuPDF fz_try/fz_catch exceptions are converted into C++ exceptions.
- Automatic reference counting.

This C++ API takes advantage of modern C++'s standardized support for threads, as well as its well-established support for exceptions and classes. Taken together, these features allow MuPDF's auto-generated C++ API to provide a very convenient way of using the MuPDF library.

Let's have a look at how these features work.

No `fz_context *ctx` function arguments

A fundamental part of the MuPDF C API is the fz_context structure. As explained in https://www.mupdf.com/docs/mupdf_explored.pdf#chapter.5, this is used to store global state (for example default levels of anti-aliasing) and per-thread state (for example exception stacks).

Typical usage involves a master fz_context being created on startup in the main thread, and then each new thread has its own fz_context created by calling fz_clone_context() on the master fz_context.

Modern C++ has convenient native support for thread-local storage, which one can use to provide a separate fz_context for each thread automatically.

We start with a global object that contains the master fz_context:

struct internal_state
{
    /* Constructor. */
    internal_state();

    fz_context* m_ctx;

    /* State used when setting `m_ctx`. */
    std::mutex          m_mutexes[FZ_LOCK_MAX];
    fz_locks_context    m_locks;
};
internal_state  s_state;

The constructor uses fz_new_context() to set m_ctx:

internal_state::internal_state()
{
    m_locks.user = this;
    m_locks.lock = lock;
    m_locks.unlock = unlock;
    m_ctx = fz_new_context(nullptr /*alloc*/, &m_locks, FZ_STORE_DEFAULT);
}
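
The constructor above refers to `lock` and `unlock` callbacks that are not shown in this article. As a minimal sketch, assuming they simply forward to the array of `std::mutex` objects, they might look something like the code below. To keep the example self-contained it uses stand-ins (`locks_context_sketch`, `LOCK_MAX_SKETCH`) that only mirror the general shape of `fz_locks_context` and `FZ_LOCK_MAX`; all names ending in `_sketch` are hypothetical and are not part of MuPDF.

```cpp
#include <cassert>
#include <mutex>

// Stand-ins for FZ_LOCK_MAX and fz_locks_context (assumed shape: a user
// pointer plus lock/unlock callbacks that take a lock index).
constexpr int LOCK_MAX_SKETCH = 4;
struct locks_context_sketch
{
    void* user;
    void (*lock)(void* user, int lock);
    void (*unlock)(void* user, int lock);
};

struct state_sketch
{
    state_sketch();
    std::mutex           m_mutexes[LOCK_MAX_SKETCH];
    locks_context_sketch m_locks;
};

// Hypothetical callbacks: recover the owning object from `user`, then
// lock or unlock the mutex with the requested index.
static void lock_sketch(void* user, int lock)
{
    static_cast<state_sketch*>(user)->m_mutexes[lock].lock();
}

static void unlock_sketch(void* user, int lock)
{
    static_cast<state_sketch*>(user)->m_mutexes[lock].unlock();
}

state_sketch::state_sketch()
{
    m_locks.user = this;
    m_locks.lock = lock_sketch;
    m_locks.unlock = unlock_sketch;
}
```

Passing `this` as the user pointer is what lets plain C function pointers reach the C++ object that owns the mutexes.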

So on the main thread, one could just use s_state.m_ctx as the fz_context* that is passed to most MuPDF C functions.

But what about non-main threads? We need a way of calling fz_clone_context() to create a fz_context for each thread. This can be done by making a new object that is defined as thread_local:

struct internal_thread_state
{
    /* Return per-thread context. */
    fz_context* get_context();

    fz_context* m_ctx = nullptr;
};
thread_local internal_thread_state  s_thread_state;

The use of thread_local means that a separate instance of s_thread_state exists in each thread. So internal_thread_state::get_context() looks like this:

fz_context* internal_thread_state::get_context()
{
    if (!m_ctx)
    {
        /* This is the first time we have been called in this thread. So
        clone the master context. */
        m_ctx = fz_clone_context(s_state.m_ctx);
    }
    return m_ctx;
}

The generated C++ code simply calls s_thread_state.get_context() to get a fz_context* suitable for use with any MuPDF C function, regardless of what thread it is running in.

[We could instead set m_ctx in an internal_thread_state constructor, but doing it lazily in get_context() avoids this overhead in threads that never call MuPDF functions.]
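
The lazy per-thread pattern can be tried out without MuPDF. The following is a minimal sketch under stated assumptions: `context_sketch` stands in for `fz_context`, `clone_context_sketch()` stands in for `fz_clone_context()`, and the `id` field exists only so that the demo can tell contexts apart. The destructor that frees the clone at thread exit is also an assumption of this sketch; the article does not show the corresponding cleanup.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for fz_context.
struct context_sketch
{
    int id;
};

static context_sketch   g_master{0};   // stand-in for s_state.m_ctx
static std::atomic<int> g_next_id{1};  // gives each clone a distinct id

// Stand-in for fz_clone_context(): returns a new context for the caller.
static context_sketch* clone_context_sketch(context_sketch* master)
{
    (void) master;  // a real clone would share state with the master
    return new context_sketch{g_next_id++};
}

struct thread_state_sketch
{
    context_sketch* get_context()
    {
        // Clone lazily, the first time this thread asks for a context.
        if (!m_ctx) m_ctx = clone_context_sketch(&g_master);
        return m_ctx;
    }
    ~thread_state_sketch() { delete m_ctx; }  // runs when the thread exits
    context_sketch* m_ctx = nullptr;
};
static thread_local thread_state_sketch t_state;

// Returns true if repeated calls in one thread give the same context
// and a second thread is given a different one.
bool contexts_are_per_thread()
{
    context_sketch* mine = t_state.get_context();
    bool same_in_thread = (mine == t_state.get_context());
    int my_id = mine->id;
    int other_id = my_id;
    std::thread t([&other_id] { other_id = t_state.get_context()->id; });
    t.join();
    return same_in_thread && other_id != my_id;
}
```

Because `t_state` is `thread_local`, the lambda running in the second thread sees its own instance with `m_ctx` still null, so it triggers its own clone.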

MuPDF `fz_try/fz_catch` exceptions are converted into C++ exceptions

The MuPDF C API uses setjmp()/longjmp()-based exceptions for error handling; for details, see: https://www.mupdf.com/docs/mupdf_explored.pdf#chapter.6

Converting MuPDF exceptions into native C++ exceptions is straightforward. Doing so in hand-written wrapper functions would be pretty tedious of course, but the C++ bindings are auto-generated (by a Python program) so it's easy enough to generate C++ wrappers for all functions.

Here's what a typical low-level C++ wrapper function looks like:

fz_device* ll_fz_begin_page(fz_document_writer* wri, fz_rect mediabox)
{
    fz_context* auto_ctx = internal_context_get();
    fz_device*  ret;
    fz_try(auto_ctx)
    {
        ret = ::fz_begin_page(auto_ctx, wri, mediabox);
    }
    fz_catch(auto_ctx)
    {
        internal_throw_exception(auto_ctx);
    }
    return ret;
}

Notice how ll_fz_begin_page() has the same prototype as fz_begin_page() except for missing the initial fz_context* ctx argument. internal_context_get() is a function that essentially calls the s_thread_state.get_context() method described earlier.
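
To illustrate the general technique of turning a setjmp()/longjmp()-style error into a C++ exception, here is a deliberately simplified sketch. It does not use MuPDF's actual macros or error stack; the single global jump buffer and all of the function names (`c_raise()`, `c_checked_divide()`, `checked_divide()`) are hypothetical.

```cpp
#include <cassert>
#include <csetjmp>
#include <stdexcept>

// Hypothetical C-style error state: one jump buffer and one message.
static std::jmp_buf g_env;
static const char*  g_message = "";

// C-style "throw": record the message and jump back to the setjmp() site.
[[noreturn]] static void c_raise(const char* message)
{
    g_message = message;
    std::longjmp(g_env, 1);
}

// C-style function that reports errors via c_raise().
static int c_checked_divide(int a, int b)
{
    if (b == 0) c_raise("division by zero");
    return a / b;
}

// C++ wrapper: same arguments, but errors surface as C++ exceptions.
int checked_divide(int a, int b)
{
    volatile int ret = 0;  // volatile: written between setjmp() and longjmp()
    if (setjmp(g_env) == 0)
    {
        ret = c_checked_divide(a, b);
    }
    else
    {
        // We arrive here via longjmp(); convert to a native exception.
        throw std::runtime_error(g_message);
    }
    return ret;
}
```

The C-style frames that longjmp() skips over contain no C++ objects with destructors, which is what makes the jump safe; the C++ throw only happens once control is back in the wrapper.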

internal_throw_exception() uses information about the current MuPDF exception to execute a C++ throw statement:

void internal_throw_exception(fz_context* ctx)
{
    int code = fz_caught(ctx);
    const char* text = fz_caught_message(ctx);
    if (code == FZ_ERROR_NONE)     throw FzErrorNone    (text);
    if (code == FZ_ERROR_MEMORY)   throw FzErrorMemory  (text);
    if (code == FZ_ERROR_GENERIC)  throw FzErrorGeneric (text);
    if (code == FZ_ERROR_SYNTAX)   throw FzErrorSyntax  (text);
    if (code == FZ_ERROR_MINOR)    throw FzErrorMinor   (text);
    if (code == FZ_ERROR_TRYLATER) throw FzErrorTrylater(text);
    if (code == FZ_ERROR_ABORT)    throw FzErrorAbort   (text);
    if (code == FZ_ERROR_REPAIRED) throw FzErrorRepaired(text);
    if (code == FZ_ERROR_COUNT)    throw FzErrorCount   (text);
    throw FzErrorBase(code, text);
}

The exception classes used above are auto-generated from the FZ_ERROR_* enum values, and look like this:

/** Base class for exceptions. */
struct FzErrorBase : std::exception
{
    int         m_code;
    std::string m_text;
    const char* what() const throw();
    FzErrorBase(int code, const char* text);
};
struct FzErrorNone : FzErrorBase
{
    FzErrorNone(const char* message);
};
struct FzErrorMemory : FzErrorBase
{
    FzErrorMemory(const char* message);
};
struct FzErrorGeneric : FzErrorBase
{
    FzErrorGeneric(const char* message);
};
struct FzErrorSyntax : FzErrorBase
{
    FzErrorSyntax(const char* message);
};
struct FzErrorMinor : FzErrorBase
{
    FzErrorMinor(const char* message);
};
struct FzErrorTrylater : FzErrorBase
{
    FzErrorTrylater(const char* message);
};
struct FzErrorAbort : FzErrorBase
{
    FzErrorAbort(const char* message);
};
struct FzErrorRepaired : FzErrorBase
{
    FzErrorRepaired(const char* message);
};
struct FzErrorCount : FzErrorBase
{
    FzErrorCount(const char* message);
};
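
Because every generated exception class derives from a common base, calling code can handle one specific error and fall back to the base class for everything else. The sketch below shows that pattern with stand-in classes (`ErrorBaseSketch` and friends) that only mirror the shape of the generated ones; the numeric codes are placeholders rather than MuPDF's real FZ_ERROR_* values, and `classify()` is a hypothetical caller.

```cpp
#include <cassert>
#include <exception>
#include <string>

// Placeholder codes; not MuPDF's FZ_ERROR_* values.
enum { ERROR_MEMORY_SKETCH = 1, ERROR_SYNTAX_SKETCH = 2 };

struct ErrorBaseSketch : std::exception
{
    int         m_code;
    std::string m_text;
    ErrorBaseSketch(int code, const char* text) : m_code(code), m_text(text) {}
    const char* what() const noexcept override { return m_text.c_str(); }
};
struct ErrorMemorySketch : ErrorBaseSketch
{
    explicit ErrorMemorySketch(const char* message)
    : ErrorBaseSketch(ERROR_MEMORY_SKETCH, message) {}
};
struct ErrorSyntaxSketch : ErrorBaseSketch
{
    explicit ErrorSyntaxSketch(const char* message)
    : ErrorBaseSketch(ERROR_SYNTAX_SKETCH, message) {}
};

// Same idea as internal_throw_exception(): pick the class from the code.
[[noreturn]] void throw_for_code_sketch(int code, const char* text)
{
    if (code == ERROR_MEMORY_SKETCH) throw ErrorMemorySketch(text);
    if (code == ERROR_SYNTAX_SKETCH) throw ErrorSyntaxSketch(text);
    throw ErrorBaseSketch(code, text);
}

// Caller-side handling: most specific handler first, base class last.
std::string classify(int code)
{
    std::string result;
    try
    {
        throw_for_code_sketch(code, "example");
    }
    catch (const ErrorSyntaxSketch&)
    {
        result = "syntax";
    }
    catch (const ErrorBaseSketch& e)
    {
        result = "other:" + std::to_string(e.m_code);
    }
    return result;
}
```

Handlers are tried in order, so the base-class handler has to come after any derived-class handlers it would otherwise shadow.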

Automatic reference counting

The MuPDF C API has a simple and reliable system of reference counting for many fz_* and pdf_* structs, provided by various fz_keep_*(), fz_drop_*(), pdf_keep_*() and pdf_drop_*() functions. For example, see https://www.mupdf.com/docs/mupdf_explored.pdf#chapter.8

However, calling these functions correctly can be tricky, especially when cleaning up after errors with fz_try(), fz_catch() and fz_finally() blocks. An incorrect extra call of a *_keep_*() function or a missing call of a *_drop_*() function will result in a resource leak, while a missing call of a *_keep_*() function or an incorrect extra call of a *_drop_*() function will often result in illegal use of freed memory, which typically causes a crash.

Class wrappers

The MuPDF C++ API provides a C++ wrapper class for each MuPDF fz_* and pdf_* struct. For MuPDF structs that use reference counting, these wrapper classes contain a pointer to an instance of the MuPDF struct, and define copy constructors, assignment operators and destructors that automatically call the appropriate *_keep_*() and *_drop_*() functions.

For example the C++ wrapper class for the MuPDF C struct fz_document looks like this:

/** Wrapper class for struct `fz_document`. */
struct FzDocument
{
    /** Copy constructor using `fz_keep_document()`. */
    FZ_FUNCTION FzDocument(const FzDocument& rhs);

    /** operator= using `fz_keep_document()` and `fz_drop_document()`. */
    FZ_FUNCTION FzDocument& operator=(const FzDocument& rhs);

    /** Constructor using raw copy of pre-existing `fz_document`. */
    FZ_FUNCTION FzDocument(fz_document* internal=NULL);

    /** Destructor using fz_drop_document(). */
    FZ_FUNCTION ~FzDocument();

    /** Pointer to wrapped data. */
    fz_document* m_internal;
};

And the implementations of these methods look like this:

/** Copy constructor using `fz_keep_document()`. */
FZ_FUNCTION FzDocument::FzDocument(const FzDocument& rhs)
: m_internal(ll_fz_keep_document(rhs.m_internal))
{
}

/* operator= using `fz_keep_document()` and `fz_drop_document()`. */
FZ_FUNCTION FzDocument& FzDocument::operator=(const FzDocument& rhs)
{
    /* Keep before dropping, so that self-assignment cannot free the
    document while it is still in use. */
    ll_fz_keep_document(rhs.m_internal);
    ll_fz_drop_document(this->m_internal);
    this->m_internal = rhs.m_internal;
    return *this;
}

/** Constructor using raw copy of pre-existing `::fz_document`. */
FZ_FUNCTION FzDocument::FzDocument(::fz_document* internal)
: m_internal(internal)
{
}

/** Destructor using `fz_drop_document()`. */
FZ_FUNCTION FzDocument::~FzDocument()
{
    ll_fz_drop_document(m_internal);
}

Thus wrapper class instances can be freely copied, assigned, passed around by value and so on, safe in the knowledge that the refcounts will be updated and the underlying MuPDF structs' lifetimes will be exactly as required.

Incidentally, you may have noticed that the constructor from a raw fz_document* does not call fz_keep_document(); the MuPDF C++ API has a convention that raw pointers passed to a wrapper class constructor must already be owned, and this ownership is transferred to the newly-created wrapper class. Most of the time this convention simplifies things, though in a small number of places a raw pointer can be a borrowed reference, so the generated code inserts an explicit call to a *_keep_*() function before creating a wrapper class instance.
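
The effect of these rules on the reference count can be observed with a small stand-alone model. This is a sketch under stated assumptions, not the generated code: `doc_sketch` stands in for a reference-counted MuPDF struct, `new_doc()`, `keep_doc()` and `drop_doc()` stand in for the corresponding MuPDF functions, and `wrap_borrowed()` is a hypothetical helper showing the explicit keep that is needed for a borrowed pointer.

```cpp
#include <cassert>

// Stand-in for a reference-counted MuPDF struct such as fz_document.
struct doc_sketch
{
    int refs;
};

static int g_live = 0;  // number of doc_sketch objects currently allocated

static doc_sketch* new_doc()
{
    ++g_live;
    return new doc_sketch{1};
}
static doc_sketch* keep_doc(doc_sketch* d)
{
    if (d) ++d->refs;
    return d;
}
static void drop_doc(doc_sketch* d)
{
    if (d && --d->refs == 0)
    {
        --g_live;
        delete d;
    }
}

struct DocSketch
{
    // Takes over an already-owned raw pointer; no keep here.
    explicit DocSketch(doc_sketch* internal = nullptr) : m_internal(internal) {}

    DocSketch(const DocSketch& rhs) : m_internal(keep_doc(rhs.m_internal)) {}

    DocSketch& operator=(const DocSketch& rhs)
    {
        // Keep first, then drop, so that self-assignment is harmless.
        doc_sketch* kept = keep_doc(rhs.m_internal);
        drop_doc(m_internal);
        m_internal = kept;
        return *this;
    }

    ~DocSketch() { drop_doc(m_internal); }

    doc_sketch* m_internal;
};

// A borrowed pointer is not ours to give away, so take a reference
// of our own before handing it to the wrapper.
DocSketch wrap_borrowed(doc_sketch* borrowed)
{
    return DocSketch(keep_doc(borrowed));
}
```

Each copy adds one reference and each destructor removes one, so once every wrapper has gone out of scope the underlying struct is freed exactly once.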

Class-aware wrapper functions

The MuPDF C++ API defines "class-aware" wrappers for most MuPDF C functions, which take references to C++ wrapper classes instead of pointers to MuPDF structs. If a MuPDF C function returns a pointer to a new MuPDF struct, the corresponding class-aware wrapper will return a wrapper class instance by value.

For example the MuPDF C function fz_new_buffer_from_page() looks like this:

fz_buffer *fz_new_buffer_from_page(fz_context *ctx, fz_page *page, const fz_stext_options *options);

And the corresponding class-aware wrapper is:

FzBuffer fz_new_buffer_from_page(const FzPage& page, FzStextOptions& options);

The implementation of the class-aware wrapper is straightforward:

/* Class-aware wrapper for `::fz_new_buffer_from_page()`.  */
FZ_FUNCTION FzBuffer fz_new_buffer_from_page(const FzPage& page, FzStextOptions& options)
{
    ::fz_buffer* temp = mupdf::ll_fz_new_buffer_from_page(page.m_internal,  options.internal());
    auto ret = FzBuffer(temp);
    return ret;
}

Putting it all together

The class-aware wrapper functions, along with the wrapper classes themselves, are the core of the MuPDF C++ API, offering the three abstractions we have talked about - no fz_context args, native C++ exceptions, and automatic reference counting.

To see how the C++ API simplifies usage of MuPDF, consider this C code derived from a function in PyMuPDF (a widely-used Python library built on top of the MuPDF C API):

PyObject *_newPage(pdf_document *pdf, int pno, float width, float height)
{
    fz_rect mediabox = fz_unit_rect;
    mediabox.x1 = width;
    mediabox.y1 = height;
    pdf_obj *resources = NULL, *page_obj = NULL;
    fz_buffer *contents = NULL;
    fz_var(contents);
    fz_var(page_obj);
    fz_var(resources);
    fz_try(gctx)
    {
        if (pno < -1)
        {
            RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError);
        }
        // create /Resources and /Contents objects
        resources = pdf_add_new_dict(gctx, pdf, 1);
        page_obj = pdf_add_page(gctx, pdf, mediabox, 0, resources, contents);
        pdf_insert_page(gctx, pdf, pno, page_obj);
    }
    fz_always(gctx)
    {
        fz_drop_buffer(gctx, contents);
        pdf_drop_obj(gctx, page_obj);
        pdf_drop_obj(gctx, resources);
    }
    fz_catch(gctx)
    {
        return NULL;
    }

    Py_RETURN_NONE;
}

This can be rewritten to use the MuPDF C++ API (which is in C++ namespace mupdf):

PyObject* _newPage(mupdf::PdfDocument& pdf, int pno, float width, float height)
{
    mupdf::FzRect mediabox( 0, 0, width, height);
    if (pno < -1)
    {
        RAISEPY(gctx, MSG_BAD_PAGENO, PyExc_ValueError);
    }
    // create /Resources and /Contents objects
    mupdf::PdfObj resources = mupdf::pdf_add_new_dict(pdf, 1);
    mupdf::FzBuffer contents;
    mupdf::PdfObj page_obj = mupdf::pdf_add_page(pdf, mediabox, 0, resources, contents);
    mupdf::pdf_insert_page(pdf, pno, page_obj);
    Py_RETURN_NONE;
}

The C++ version is clearly much simpler. This is mostly because automatic reference counting means that there is no need for explicit cleanup code: when the mupdf::PdfObj and mupdf::FzBuffer go out of scope, their destructors automatically call pdf_drop_obj() and fz_drop_buffer() as appropriate. Thus we have eliminated a whole class of bugs that can cause resource leaks and crashes.

The use of native C++ exceptions means that there is also no need to mark local variables with fz_var(); the rules for when one should use fz_var() are necessarily subtle, so this also eliminates a potential source of bugs. And in this particular case, because the code is used in SWIG Python bindings, which automatically convert C++ exceptions into Python exceptions, we don't need any catch() block to return a special value after errors.
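
The cleanup-on-error behaviour described above relies on ordinary C++ stack unwinding, which can be checked in isolation. In this minimal sketch, `ResourceSketch` is a hypothetical stand-in for a wrapper class such as mupdf::PdfObj, the counter `g_outstanding` plays the role of references that have been taken but not yet dropped, and the thrown `std::runtime_error` simulates a MuPDF error.

```cpp
#include <cassert>
#include <stdexcept>

static int g_outstanding = 0;  // resources acquired but not yet released

// Stand-in for a wrapper class whose destructor drops its reference.
struct ResourceSketch
{
    ResourceSketch() { ++g_outstanding; }
    ~ResourceSketch() { --g_outstanding; }
    ResourceSketch(const ResourceSketch&) = delete;
    ResourceSketch& operator=(const ResourceSketch&) = delete;
};

// Acquires two resources, then optionally fails part-way through.
// There is no explicit cleanup code on either path.
void build_page_sketch(bool fail)
{
    ResourceSketch resources;
    ResourceSketch contents;
    if (fail)
    {
        throw std::runtime_error("simulated MuPDF error");
    }
    // ... normal work would go here ...
}
```

Whether the function returns normally or throws, both destructors run before control leaves it, so the counter is back at zero in either case.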