A Conversation With the Creator of pdf2docx, Yun Jian

Jamie Lemon · January 16, 2024


Artifex recently acquired the popular Python library, pdf2docx. Below is a conversation between Yun Jian, the creator of pdf2docx, and Jamie Lemon, Senior Product Manager at Artifex.

JL: What was the motivation behind creating the pdf2docx repo?
YJ: The motivation came from a project my department took on in 2019, for which we spent a substantial amount of money hiring an external translation company to translate thousands of pages of our product manuals from English to Chinese. At the time, I wondered if there was a more efficient and cost-effective way to accomplish this task by first converting the PDFs to Word documents and then using machine translation. After some research, I found that commercial software was prohibitively expensive, and that online document conversion services produced poor-quality results and posed data security risks. What surprised me most was that there were no viable open source solutions available. Existing open source code could extract text and images but couldn't extract tables or preserve document formatting. So I decided to develop a tool myself, with the goal of making it the best possible open source PDF-to-DOCX converter. The pdf2docx repo was born out of the desire to fill this gap and help others facing similar challenges.

JL: How long did it take to write and when was the first release?
YJ: Reviewing the git commit history, I found that the first line of code was committed on June 20, 2019. The first release, v0.0.1, was published on June 30, 2020. The first few months were spent mainly exploring text extraction and machine translation. Intensive development toward a pdf2docx library began in April 2020 and occupied almost every evening, sometimes running overnight into the early morning.

JL: Regarding the great developer experience, can you explain how you've obtained so many users?
YJ: The primary reason is that PDF to Word conversion is an extremely common need for both students and office workers. Additionally, although pdf2docx is not yet perfect, it is arguably the best open source PDF to Word tool available in Python. There is strong demand for PDF to Word conversion, but most people do not need it daily and are unlikely to purchase commercial software outright. Online document conversion services that charge per conversion also raise data security issues. So I suspect many users find pdf2docx the way I went looking myself: by searching the open source community for a solution. Only this time they don't walk away disappointed as I initially did. By filling an important gap with a quality solution, pdf2docx has been able to gain many users.

JL: Can you explain some of the main challenges of maintaining the repo?
YJ: Work-life balance is the most challenging part. As with any side project, finding time becomes more and more difficult as my family and job responsibilities continue to grow. My free time is increasingly limited. The second challenge is issue handling. Diagnosing, reproducing and resolving issues reported on GitHub can be time-consuming. Some issues are also difficult to address because of the current development stage and feature limitations. So co-maintaining with Artifex will help me a lot.

JL: Now that you've sold the repo to Artifex, what are your next plans?
YJ: As pdf2docx relies heavily on PyMuPDF, Artifex taking over is the best outcome for its future and its ability to help more people. My passion for this project remains, with issue handling as the top priority to ensure the project stays functional and available to users. After that, developing new features such as support for headers and footers, titles and subtitles, and tables of contents is next on my roadmap to keep enhancing quality. With Artifex's resources, I believe we can accelerate development on these fronts.