PDFMathTranslate: PDF scientific paper translation with preserved formats
PDF scientific paper translation and bilingual comparison.
- 📊 Preserve formulas, charts, table of contents, and annotations.
- 🌐 Support multiple languages and diverse translation services.
- 🤖 Provide a command-line tool, an interactive user interface, and a Docker image.
Preview
Installation and Usage
We provide four methods for using this project: Commandline, Portable, GUI, and Docker.
Method I. Commandline
Python installed (3.8 <= version <= 3.12)
Install our package:
pip install pdf2zh
Run the translation; the output files are generated in the current working directory:
pdf2zh document.pdf
Method II. Portable
No need to pre-install Python environment
Download setup.bat and double-click to run
Method III. GUI
Python installed (3.8 <= version <= 3.12)
Install our package:
pip install pdf2zh
Start using in browser:
pdf2zh -i
If your browser does not open automatically, go to
http://localhost:7860/
See documentation for GUI for more details.
Method IV. Docker
Pull and run:
docker pull byaidu/pdf2zh
docker run -d -p 7860:7860 byaidu/pdf2zh
Open in browser:
http://localhost:7860/
Advanced Options
Run the translation command on the command line to generate the translated document example-mono.pdf and the bilingual document example-dual.pdf in the current working directory. Google is the default translation service.
Full / partial document translation
Entire document
pdf2zh example.pdf
Part of the document
pdf2zh example.pdf -p 1-3,5
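To make the page syntax concrete, here is a minimal sketch of how a spec such as 1-3,5 expands into individual page numbers. The helper parse_page_spec is hypothetical and written only for illustration; it is not pdf2zh's own parser, and it assumes page numbers are written as they appear on the command line.

```python
from typing import List, Set


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers.

    Hypothetical helper for illustration; not part of pdf2zh.
    """
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_spec("1-3,5"))  # [1, 2, 3, 5]
```

Under this reading, -p 1-3,5 selects pages 1, 2, 3, and 5.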
Specify source and target languages
See Google Languages Codes, DeepL Languages Codes
pdf2zh example.pdf -li en -lo ja
Translate with Different Services
Use -s service or -s service:model to specify service:
pdf2zh example.pdf -s openai:gpt-4o-mini
Or specify the model with an environment variable (Windows cmd syntax shown; in a POSIX shell use export OPENAI_MODEL=gpt-4o-mini):
set OPENAI_MODEL=gpt-4o-mini
pdf2zh example.pdf -s openai
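The two ways of choosing a model can be pictured as a simple precedence rule: a model given after the colon wins, otherwise the environment variable is consulted. The sketch below illustrates that idea only. The functions split_service_spec and resolve_model are hypothetical names, not pdf2zh's API, and the fallback is shown only for the openai service and OPENAI_MODEL variable that appear above.

```python
import os
from typing import Mapping, Optional, Tuple


def split_service_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split "service" or "service:model" at the first colon.

    Hypothetical helper for illustration; not part of pdf2zh.
    """
    service, sep, model = spec.partition(":")
    return service, (model if sep and model else None)


def resolve_model(
    spec: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, Optional[str]]:
    """Prefer the model in the spec; otherwise fall back to OPENAI_MODEL.

    The fallback is an assumption for illustration and covers only "openai".
    """
    service, model = split_service_spec(spec)
    if model is None and service == "openai":
        model = env.get("OPENAI_MODEL")
    return service, model


print(resolve_model("openai:gpt-4o-mini", {}))
print(resolve_model("openai", {"OPENAI_MODEL": "gpt-4o-mini"}))
```

Both calls yield the pair ("openai", "gpt-4o-mini"), mirroring the two command-line forms shown above.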
Translate with exceptions
Use regex to specify formula fonts and characters that need to be preserved:
pdf2zh example.pdf -f "(CM[^RT].*|MS.*|.*Ital)" -c "(\(|\||\)|\+|=|\d|[\u0080-\ufaff])"
By default, LaTeX, Mono, Code, Italic, Symbol, and Math fonts are preserved:
pdf2zh example.pdf -f "(CM[^R]|(MS|XY|MT|BL|RM|EU|LA|RS)[A-Z]|LINE|LCIRCLE|TeX-|rsfs|txsy|wasy|stmary|.*Mono|.*Code|.*Ital|.*Sym|.*Math)"
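Before passing custom patterns to -f and -c, it can help to check them against a few font names and characters in Python. The sketch below uses the patterns from the first example above. It assumes the patterns are applied with re.match, that is, anchored at the start of the font name or character; how pdf2zh applies them internally is not stated here, and the font names are only sample inputs.

```python
import re

# Patterns copied from the -f and -c example above.
FONT_PATTERN = re.compile(r"(CM[^RT].*|MS.*|.*Ital)")
CHAR_PATTERN = re.compile(r"(\(|\||\)|\+|=|\d|[\u0080-\ufaff])")


def is_formula_font(font_name: str) -> bool:
    """True if the font name matches the formula-font pattern (assumes re.match)."""
    return FONT_PATTERN.match(font_name) is not None


def is_formula_char(char: str) -> bool:
    """True if the character matches the preserved-character pattern."""
    return CHAR_PATTERN.match(char) is not None


for name in ["CMMI10", "CMSY10", "CMR10", "CMTT10", "MSBM10", "Times-Italic", "Helvetica"]:
    print(f"{name:14} formula font: {is_formula_font(name)}")

for ch in ["+", "=", "7", "α", "a"]:
    print(f"{ch!r:5} preserved: {is_formula_char(ch)}")
```

With these patterns, CMMI10, CMSY10, MSBM10, and Times-Italic count as formula fonts, while CMR10 and CMTT10 (Computer Modern Roman and Typewriter) are excluded by the [^RT] class and are therefore translated as ordinary text.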
Specify threads
Use -t to specify how many threads to use in translation:
pdf2zh example.pdf -t 1
API
Python
from pdf2zh import translate, translate_stream
params = {"lang_in": "en", "lang_out": "zh", "service": "google", "thread": 4}
file_mono, file_dual = translate(files=["example.pdf"], **params)[0]
with open("example.pdf", "rb") as f:
    stream_mono, stream_dual = translate_stream(stream=f.read(), **params)
HTTP
pip install pdf2zh[backend]
pdf2zh --flask
pdf2zh --celery worker
curl http://localhost:11008/v1/translate -F "file=@example.pdf" -F "data={\"lang_in\":\"en\",\"lang_out\":\"zh\",\"service\":\"google\",\"thread\":4}"
{"id":"d9894125-2f4e-45ea-9d93-1a9068d2045a"}
curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a
{"info":{"n":13,"total":506},"state":"PROGRESS"}
curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a
{"state":"SUCCESS"}
curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a/mono --output example-mono.pdf
curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a/dual --output example-dual.pdf
curl http://localhost:11008/v1/translate/d9894125-2f4e-45ea-9d93-1a9068d2045a -X DELETE
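The status checks above can also be scripted. The following is a minimal polling sketch using only the Python standard library. The base URL and the PROGRESS and SUCCESS states come from the curl session above. The function names, the polling interval, and the treatment of FAILURE and REVOKED as terminal errors are assumptions: the latter two are standard Celery state names, not states confirmed for this server.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

BASE_URL = "http://localhost:11008/v1/translate"


def fetch_status(task_id: str) -> Dict[str, Any]:
    """GET the task status, as in the curl calls above."""
    with urllib.request.urlopen(f"{BASE_URL}/{task_id}") as resp:
        return json.load(resp)


def wait_for_success(
    task_id: str,
    fetch: Callable[[str], Dict[str, Any]] = fetch_status,
    interval: float = 1.0,
    max_polls: int = 600,
) -> Dict[str, Any]:
    """Poll until the task reports SUCCESS; raise on failure or timeout."""
    for _ in range(max_polls):
        status = fetch(task_id)
        state = status.get("state")
        if state == "SUCCESS":
            return status
        # Assumption: standard Celery terminal error states.
        if state in ("FAILURE", "REVOKED"):
            raise RuntimeError(f"task {task_id} ended in state {state}")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")


# Usage against a running server (not executed here):
#   task_id = "d9894125-2f4e-45ea-9d93-1a9068d2045a"
#   wait_for_success(task_id)
#   urllib.request.urlretrieve(f"{BASE_URL}/{task_id}/mono", "example-mono.pdf")
#   urllib.request.urlretrieve(f"{BASE_URL}/{task_id}/dual", "example-dual.pdf")
```

The fetch parameter is injectable so the polling loop can be exercised without a server, for example by passing a function that returns canned status dictionaries.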