docling: get your documents ready for gen AI

Dec 18, 2024

https://ds4sd.github.io/docling/
https://github.com/DS4SD/docling

Docling parses documents and exports them to the desired format with ease and speed.

Features

Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
Advanced PDF document understanding incl. page layout, reading order & table structures
Unified, expressive DoclingDocument representation format
Easy integration with LlamaIndex & LangChain for powerful RAG / QA applications
OCR support for scanned PDFs
Simple and convenient CLI

Installation

To use Docling, install the docling package with your Python package manager, e.g. pip:

pip install docling

Works on macOS, Linux, and Windows, with support for both x86_64 and arm64 architectures.

Usage

Conversion

Convert a single document

To convert an individual document, pass its local path or URL to convert(), for example:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"
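
To keep the converted output instead of printing it, the Markdown string can be written to disk. The following is a minimal sketch that relies only on the convert() and export_to_markdown() calls shown above. The helper name save_markdown, its parameters, and the optional converter argument (which lets one DocumentConverter be reused across calls) are illustrative assumptions, not part of Docling.

```python
from pathlib import Path


def save_markdown(source, out_path, converter=None):
    """Convert `source` (path or URL) and write its Markdown export to `out_path`.

    Hypothetical helper: only convert() and export_to_markdown() come from Docling.
    """
    if converter is None:
        # Imported lazily so a pre-built converter can be passed in instead.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    result = converter.convert(source)
    markdown = result.document.export_to_markdown()

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

For example, save_markdown("https://arxiv.org/pdf/2408.09869", "out/report.md") would write the converted report to out/report.md, creating the out folder if needed.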

CLI

You can also use Docling directly from the command line to convert individual files (local or by URL) or whole directories.

A simple example would look like this:

docling https://arxiv.org/pdf/2206.01062

To see all available options (export formats etc.), run docling --help. More details are available on the CLI reference page.
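
If the CLI needs to be driven from a script, one option is to call it through Python's subprocess module. The sketch below uses only the bare docling <source> invocation shown above and assumes no additional options. The function names build_docling_command and convert_with_cli are hypothetical.

```python
import shutil
import subprocess


def build_docling_command(source):
    """Return the argument list for `docling <source>` (no extra options)."""
    return ["docling", str(source)]


def convert_with_cli(sources):
    """Run the docling CLI once per source and return {source: exit code}."""
    if shutil.which("docling") is None:
        raise RuntimeError("docling executable not found; run `pip install docling` first")

    exit_codes = {}
    for source in sources:
        # check=False: record failures instead of raising, so the loop continues.
        completed = subprocess.run(build_docling_command(source), check=False)
        exit_codes[str(source)] = completed.returncode
    return exit_codes
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths or URLs that contain spaces or special characters.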

Chunking

You can chunk a Docling document using a chunker, such as HybridChunker, as shown below (the documentation includes a more detailed example):

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

conv_res = DocumentConverter().convert("https://arxiv.org/pdf/2206.01062")
doc = conv_res.document

chunker = HybridChunker(tokenizer="BAAI/bge-small-en-v1.5")  # set tokenizer as needed
chunk_iter = chunker.chunk(doc)

An example chunk would look like this:

print(list(chunk_iter)[11])
# {
#   "text": "In this paper, we present the DocLayNet dataset. [...]",
#   "meta": {
#     "doc_items": [{
#       "self_ref": "#/texts/28",
#       "label": "text",
#       "prov": [{
#         "page_no": 2,
#         "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37, ...},
#       }], ...,
#     }, ...],
#     "headings": ["1 INTRODUCTION"],
#   }
# }
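
The metadata attached to each chunk is useful for citing sources in RAG answers. As a rough sketch, assuming a chunk is available as a plain dictionary with the shape printed above (how to obtain such a dictionary from a chunk object is not covered here), the headings and page numbers can be pulled out as follows. The helper names chunk_pages and describe_chunk are hypothetical.

```python
def chunk_pages(chunk):
    """Return the sorted, de-duplicated page numbers a chunk dict refers to."""
    pages = set()
    for item in chunk.get("meta", {}).get("doc_items", []):
        for prov in item.get("prov", []):
            pages.add(prov["page_no"])
    return sorted(pages)


def describe_chunk(chunk, max_chars=60):
    """Build a one-line summary: headings, pages, and a text preview."""
    headings = " > ".join(chunk.get("meta", {}).get("headings") or ["(no heading)"])
    pages = ", ".join(str(p) for p in chunk_pages(chunk)) or "?"
    preview = chunk["text"][:max_chars]
    return f"[{headings} | p. {pages}] {preview}"


# Sample data copied from the example output above, with the elided parts left out.
sample_chunk = {
    "text": "In this paper, we present the DocLayNet dataset. [...]",
    "meta": {
        "doc_items": [{
            "self_ref": "#/texts/28",
            "label": "text",
            "prov": [{
                "page_no": 2,
                "bbox": {"l": 53.29, "t": 287.14, "r": 295.56, "b": 212.37},
            }],
        }],
        "headings": ["1 INTRODUCTION"],
    },
}

print(describe_chunk(sample_chunk))
# [1 INTRODUCTION | p. 2] In this paper, we present the DocLayNet dataset. [...]
```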

Examples

Go hands-on with the examples, which demonstrate how to address different application use cases with Docling.