docling: get your documents ready for gen AI
Docling parses documents and exports them to the desired format with ease and speed.
Features
- Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
- Advanced PDF document understanding incl. page layout, reading order & table structures
- Unified, expressive DoclingDocument representation format
- Easy integration with LlamaIndex & LangChain for powerful RAG / QA applications
- OCR support for scanned PDFs
- Simple and convenient CLI
Installation
To use Docling, install the docling package with your Python package manager, e.g. pip:
pip install docling
Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.
Usage
Conversion
Convert a single document
To convert individual PDF documents, use convert(), for example:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
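A common next step is to write the Markdown export to disk rather than print it. The following is a minimal sketch that wraps the calls shown above. The helper names `markdown_name_for` and `convert_to_markdown_file`, and the output naming scheme, are illustrative assumptions and not part of Docling.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def markdown_name_for(source: str) -> str:
    """Derive an output file name from a local path or URL (hypothetical naming scheme)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # For URLs, take the last segment of the URL path.
        name = PurePosixPath(parsed.path).name
    else:
        # For local paths, take the file name as the OS understands it.
        name = Path(source).name
    # Append ".md" instead of replacing the suffix, so a name such as
    # "2408.09869" is not cut off at its dot.
    return f"{name or 'document'}.md"


def convert_to_markdown_file(source: str, out_dir: Path) -> Path:
    """Convert one document and write its Markdown export into out_dir."""
    # Imported lazily so the naming helper can be used without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / markdown_name_for(source)
    out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out_path
```

Under this naming scheme, `convert_to_markdown_file("https://arxiv.org/pdf/2408.09869", Path("out"))` would write its result to `out/2408.09869.md`.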
CLI
You can also use Docling directly from the command line to convert individual files (local or by URL) or whole directories.
A simple example would look like this:
docling https://arxiv.org/pdf/2206.01062
To see all available options (export formats etc.), run docling --help. More details are available in the CLI reference page.
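If you need to drive the CLI from a script, one option is to call it once per source with `subprocess`. The sketch below uses only the bare `docling <source>` form shown above and assumes no additional flags. The function names `build_command` and `convert_all` are hypothetical, and the default dry run only collects the commands without executing them.

```python
import shutil
import subprocess
from typing import List


def build_command(source: str) -> List[str]:
    """Build the argument list for converting one source with the docling CLI."""
    # Only the plain `docling <source>` invocation is used; no flags are assumed.
    return ["docling", source]


def convert_all(sources: List[str], dry_run: bool = True) -> List[List[str]]:
    """Run one CLI call per source, or only collect the commands if dry_run is True."""
    commands = [build_command(source) for source in sources]
    if not dry_run:
        if shutil.which("docling") is None:
            raise RuntimeError("docling executable not found on PATH")
        for command in commands:
            # check=True raises CalledProcessError if a conversion exits with an error.
            subprocess.run(command, check=True)
    return commands


# Dry run: prints the commands that would be executed.
print(convert_all(["https://arxiv.org/pdf/2206.01062"]))
```

Passing `dry_run=False` executes the commands and requires the docling executable to be on your PATH.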
Chunking
You can chunk a Docling document using a chunker, such as a HybridChunker, as shown below (for more details check out this example):
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
doc = conv_res.document
chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5") # set tokenizer as needed
chunk_iter = chunker.chunk(doc)
An example chunk would look like this:
print(list(chunk_iter)[11])
# {
# "text": "In this paper, we present the DocLayNet dataset. [...]",
# "meta": {
# "doc_items": [{
# "self_ref": "#/texts/28",
# "label": "text",
# "prov": [{
# "page_no": 2,
# "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
# }], ...,
# }, ...],
# "headings": ["1 INTRODUCTION"],
# }
# }
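To illustrate how the chunk metadata might be used downstream, here is a small sketch that prefixes the chunk text with its headings and collects the page numbers it came from. It assumes the chunk is available as a plain dictionary with the shape printed above, restricted to the fields visible there. The helpers `contextualize` and `pages_of` are hypothetical and not part of Docling.

```python
from typing import Any, Dict, List

# Sample data copied from the chunk shown above (only the visible fields).
chunk: Dict[str, Any] = {
    "text": "In this paper, we present the DocLayNet dataset. [...]",
    "meta": {
        "doc_items": [{
            "self_ref": "#/texts/28",
            "label": "text",
            "prov": [{
                "page_no": 2,
                "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37},
            }],
        }],
        "headings": ["1 INTRODUCTION"],
    },
}


def contextualize(chunk: Dict[str, Any]) -> str:
    """Prefix the chunk text with its headings, one per line."""
    headings = chunk["meta"].get("headings") or []
    return "\n".join([*headings, chunk["text"]])


def pages_of(chunk: Dict[str, Any]) -> List[int]:
    """Return the sorted, de-duplicated page numbers the chunk was taken from."""
    pages = {
        prov["page_no"]
        for item in chunk["meta"]["doc_items"]
        for prov in item.get("prov", [])
    }
    return sorted(pages)


print(contextualize(chunk))
print(pages_of(chunk))  # [2]
```

Prepending headings in this way is one simple option for giving an embedding model or LLM some context about where in the document a chunk sits.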
Examples
Go hands-on with examples that demonstrate how to address different application use cases with Docling.