PDFMathTranslate: PDF scientific paper translation with preserved formats

PDF scientific paper translation and bilingual comparison.

  • 📊 Preserve formulas, charts, table of contents, and annotations.
  • 🌐 Support multiple languages and diverse translation services.
  • 🤖 Provide a commandline tool, an interactive user interface, and Docker.

Preview

Installation and Usage

We provide four methods for using this project: Commandline, Portable, GUI, and Docker.

Method I. Commandline

Python installed (3.8 <= version <= 3.12)

Install our package:

pip install pdf2zh

Run the translation; the output files are generated in the current working directory:

pdf2zh document.pdf

Method II. Portable

No pre-installed Python environment is required.

Download setup.bat and double-click it to run.

Method III. GUI

Python installed (3.8 <= version <= 3.12)

Install our package:

pip install pdf2zh

Start using in browser:

pdf2zh -i

If your browser does not open automatically, go to

http://localhost:7860/

See documentation for GUI for more details.

Method IV. Docker

Pull and run:

docker pull byaidu/pdf2zh
docker run -d -p 7860:7860 byaidu/pdf2zh

Open in browser:

http://localhost:7860/

Advanced Options

Run the translation command on the command line to generate the translated document example-mono.pdf and the bilingual document example-dual.pdf in the current working directory. Google is used as the default translation service.

Full / partial document translation

Entire document

pdf2zh example.pdf

Part of the document

pdf2zh example.pdf -p 1-3,5
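
To make the page syntax concrete, here is a minimal Python sketch of how a spec such as 1-3,5 can be expanded into individual page numbers. The function name parse_page_spec is hypothetical and this is not pdf2zh's own parser; it assumes pages are counted from 1 and that every range has both a start and an end.

```python
from typing import List, Set


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec such as "1-3,5" into sorted, de-duplicated page numbers."""
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("1-3,5"))  # → [1, 2, 3, 5]
```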

Specify source and target languages

See Google Languages Codes, DeepL Languages Codes

pdf2zh example.pdf -li en -lo ja

Translate with Different Services

Use -s service or -s service:model to specify service:

pdf2zh example.pdf -s openai:gpt-4o-mini

Or specify model with environment variables:

set OPENAI_MODEL=gpt-4o-mini
pdf2zh example.pdf -s openai
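
The two ways of choosing a model can be pictured as one lookup: take the model from the service:model spec if present, otherwise fall back to an environment variable. The sketch below is illustrative only. resolve_service is a hypothetical helper, and the <SERVICE>_MODEL naming rule is an assumption generalized from OPENAI_MODEL above; pdf2zh may resolve models differently.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_service(
    spec: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, Optional[str]]:
    """Split "service" or "service:model"; fall back to <SERVICE>_MODEL (assumed naming)."""
    env = os.environ if env is None else env
    # Split on the first colon only, so model names containing ":" stay intact.
    service, sep, model = spec.partition(":")
    if not sep or not model:
        model = env.get(f"{service.upper()}_MODEL")  # None if the variable is unset
    return service, model


print(resolve_service("openai:gpt-4o-mini", env={}))
print(resolve_service("openai", env={"OPENAI_MODEL": "gpt-4o-mini"}))
```

Both calls yield the same service and model pair, which mirrors the two equivalent commands above.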

Translate with exceptions

Use regex to specify formula fonts and characters that need to be preserved:

pdf2zh example.pdf -f "(CM[^RT].*|MS.*|.*Ital)" -c "(\(|\||\)|\+|=|\d|[\u0080-\ufaff])"
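
To see what these two patterns select, the following Python sketch applies them to a few sample font names and characters. It assumes the patterns are matched from the start of the font name or character, as re.match does; the helper names and the sample inputs are illustrative and not taken from pdf2zh.

```python
import re

# The -f and -c patterns from the command above.
FONT_PATTERN = r"(CM[^RT].*|MS.*|.*Ital)"
CHAR_PATTERN = r"(\(|\||\)|\+|=|\d|[\u0080-\ufaff])"


def is_formula_font(font_name: str) -> bool:
    """True if the font name matches the -f pattern (anchored at the start)."""
    return re.match(FONT_PATTERN, font_name) is not None


def is_formula_char(char: str) -> bool:
    """True if the character matches the -c pattern."""
    return re.match(CHAR_PATTERN, char) is not None


for name in ["CMMI10", "CMSY10", "CMR10", "MSBM10", "Times-Italic", "Helvetica"]:
    print(f"{name:<14}{is_formula_font(name)}")

for char in ["=", "7", "α", "a"]:
    print(f"{char!r:<14}{is_formula_char(char)}")
```

Under this assumption, Computer Modern math fonts such as CMMI10 and CMSY10 are preserved, while the roman text font CMR10 is excluded by [^RT] and gets translated. Digits, operators, and non-ASCII symbols such as α are preserved; ordinary letters are not.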

Preserve Latex, Mono, Code, Italic, Symbol and Math fonts by default:

pdf2zh example.pdf -f "(CM[^R]|(MS|XY|MT|BL|RM|EU|LA|RS)[A-Z]|LINE|LCIRCLE|TeX-|rsfs|txsy|wasy|stmary|.*Mono|.*Code|.*Ital|.*Sym|.*Math)"

Specify threads

Use -t to specify how many threads to use in translation:

pdf2zh example.pdf -t 1
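
As a rough picture of what a thread count controls, the sketch below sends text segments to a translation function through a fixed-size thread pool. This is an assumption about the general technique, not a description of pdf2zh's internals; translate_all and fake_translate are hypothetical names, and fake_translate stands in for a real network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def translate_all(
    paragraphs: List[str], translate_one: Callable[[str], str], threads: int = 4
) -> List[str]:
    """Translate paragraphs concurrently with at most `threads` workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(translate_one, paragraphs))


def fake_translate(text: str) -> str:
    """Stand-in for a request to a translation service."""
    return text.upper()


print(translate_all(["alpha", "beta"], fake_translate, threads=1))  # → ['ALPHA', 'BETA']
```

Because translation requests mostly wait on the network, more threads usually shorten the total time, while a low value such as 1 can help stay within a service's rate limits.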

API

Python

from pdf2zh import translate, translate_stream

params = {"lang_in": "en", "lang_out": "zh", "service": "google", "thread": 4}
file_mono, file_dual = translate(files=["example.pdf"], **params)[0]
with open("example.pdf", "rb") as f:
    stream_mono, stream_dual = translate_stream(stream=f.read(), **params)

HTTP

pip install pdf2zh[backend]
pdf2zh --flask
pdf2zh --celery worker


curl http://localhost:11008/v1/translate -F "file=@example.pdf" -F "data={\"lang_in\":\"en\",\"lang_out\":\"zh\",\"service\":\"google\",\"thread\":4}"
{"id":"d9894125-2f4e-45ea-9d93-1a9068d2045a"}

curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a
{"info":{"n":13,"total":506},"state":"PROGRESS"}

curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a
{"state":"SUCCESS"}

curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a/mono --output example-mono.pdf

curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a/dual --output example-dual.pdf

curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a -X DELETE
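
The polling and download steps shown with curl above can also be scripted. The following is a minimal sketch using only the Python standard library and the endpoints from the curl examples. The function names, the polling interval, and the poll limit are assumptions, and only the PROGRESS and SUCCESS states shown above are handled.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:11008/v1/translate"


def fetch_state(task_id: str) -> dict:
    """GET the task status, e.g. {"state": "PROGRESS", "info": {...}}."""
    with urllib.request.urlopen(f"{BASE_URL}/{task_id}") as response:
        return json.load(response)


def wait_for_success(
    task_id: str,
    fetch: Callable[[str], dict] = fetch_state,
    interval: float = 1.0,
    max_polls: int = 600,
) -> None:
    """Poll until the task reports SUCCESS; raise TimeoutError after max_polls."""
    for _ in range(max_polls):
        if fetch(task_id).get("state") == "SUCCESS":
            return
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")


def download(task_id: str, kind: str, target: str) -> None:
    """Save the "mono" or "dual" result to a local file."""
    urllib.request.urlretrieve(f"{BASE_URL}/{task_id}/{kind}", target)
```

With the server running and a task id returned by the upload request, calling wait_for_success(task_id) followed by download(task_id, "mono", "example-mono.pdf") reproduces the curl sequence above.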