What it does

Key features

On-premise: fully local PDF conversion with no cloud API calls
Table detection: TableFormer-based recognition of complex table structure
Reading order: correct column ordering for multi-column documents
Figure extraction: embedded images and charts preserved with their metadata
LlamaIndex/LangChain: direct integration into RAG pipelines
Open-source: MIT-licensed, from IBM Research, suitable for enterprise use

When to reach for it

Best for

APAC enterprises with data sovereignty requirements that need on-premise PDF conversion for RAG pipelines, particularly financial institutions, government agencies, and other regulated industries where confidential documents cannot be sent to cloud parsing APIs.

Don't get burned

Limitations to know

! Slower than the LlamaParse API for high-volume processing: local inference needs a GPU to reach comparable speed
! CJK document accuracy is improving but does not yet match specialized OCR tools
! Smaller community than Unstructured.io, with fewer enterprise integrations and examples

Context

About Docling

Docling is an open-source document conversion toolkit from IBM Research. It converts PDFs to Markdown and JSON with table detection, figure extraction, and reading order correction, and it runs entirely on-premise without cloud API calls. APAC enterprises with data sovereignty requirements use Docling to convert confidential documents (financial statements, regulatory filings, internal reports) into clean formats for LLM ingestion without sending document content to external services.
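A minimal sketch of a local conversion, for orientation. `DocumentConverter`, `convert`, and `export_to_markdown` belong to Docling's Python API as documented for version 2. The helper names `markdown_path_for` and `convert_pdf_locally` and the `converted` output directory are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def markdown_path_for(pdf_path: str, out_dir: str = "converted") -> Path:
    """Map an input PDF path to the Markdown file it will be written to."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def convert_pdf_locally(pdf_path: str, out_dir: str = "converted") -> Path:
    """Convert one PDF to Markdown on this machine and return the output path."""
    # Imported lazily so the path helper above works where Docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)  # layout and table models run locally
    markdown = result.document.export_to_markdown()

    target = markdown_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target
```

Calling `convert_pdf_locally("reports/annual_2023.pdf")` would write `converted/annual_2023.md`. Model weights are normally fetched the first time the converter runs, so an air-gapped host would likely need them staged in advance.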

Docling uses deep learning models for layout analysis and table structure recognition, and runs them on-device: TableFormer for table structure recognition and a layout model trained on the DocLayNet dataset for page segmentation. For APAC regulatory documents with complex tables (MAS consultation papers, APRA guidelines, financial institution annual reports), Docling reconstructs table structure including merged cells, multi-row headers, and nested table hierarchies.
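To make "merged cells and multi-row headers" concrete, here is a toy sketch of how recognized cells with row and column spans can be expanded into a dense grid for downstream use. The `Cell` dataclass and `expand_to_grid` are hypothetical and are not Docling's data model. The figures in the sample table are invented.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    row: int           # top-left row index
    col: int           # top-left column index
    row_span: int = 1  # number of rows the cell covers
    col_span: int = 1  # number of columns the cell covers


def expand_to_grid(cells):
    """Copy each cell's text into every grid position it covers."""
    n_rows = max(c.row + c.row_span for c in cells)
    n_cols = max(c.col + c.col_span for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid


# Two-row header: "Segment" spans two rows, "Revenue" spans two columns.
cells = [
    Cell("Segment", 0, 0, row_span=2),
    Cell("Revenue", 0, 1, col_span=2),
    Cell("2022", 1, 1),
    Cell("2023", 1, 2),
    Cell("Retail", 2, 0),
    Cell("410", 2, 1),
    Cell("455", 2, 2),
]
grid = expand_to_grid(cells)
```

Repeating the merged text in every covered position keeps each data value aligned with all of its header labels, which is what a flat text extraction loses.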

Docling's reading order correction handles multi-column documents, where naive top-to-bottom text extraction interleaves content from different columns. The layout model identifies column boundaries and emits text in the correct reading order. For academic papers and regulatory documents with two- or three-column layouts, correct reading order is critical for RAG chunking that preserves semantic coherence.
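The difference between naive and column-aware ordering can be shown with a small geometric sketch. This is a simplified heuristic, not Docling's layout model: it assumes every block belongs to exactly one column and that columns are separated by a clear jump in the left edge. `Block`, `reading_order`, and the `column_gap` threshold are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge of the text block
    y0: float  # top edge, growing downward


def reading_order(blocks, column_gap=50.0):
    """Group blocks into columns by left edge, then read each column top to bottom."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        # Start a new column when the left edge jumps by more than column_gap.
        if columns and block.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return [b.text for b in ordered]


blocks = [
    Block("A1", 40, 100), Block("B1", 320, 100),
    Block("A2", 40, 300), Block("B2", 320, 300),
]
# Naive extraction sorts by vertical position only and interleaves the columns.
naive = [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]
column_aware = reading_order(blocks)
```

Here `naive` alternates between the two columns, while `column_aware` finishes the left column before starting the right one. Full-width titles and footnotes would need extra handling that this sketch leaves out.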

Docling integrates directly with LlamaIndex and LangChain through document loaders, so APAC teams can add it to existing RAG pipelines with minimal code changes. Its output can be chunked at semantic boundaries (section headers, table boundaries) rather than at fixed character counts, which improves RAG retrieval quality. As an IBM Research project, Docling is designed for enterprise reliability and receives regular updates that broaden document format support.
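As an illustration of chunking at semantic boundaries, the sketch below splits exported Markdown at heading lines so that a table always stays inside the section it belongs to. It uses only the standard library and is not Docling's own chunker. `chunk_by_heading` is a hypothetical helper and the sample text is invented.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown):
    """Split Markdown into one chunk per heading, keeping tables with their section."""
    chunks = []
    heading, lines = "", []

    def flush():
        # Skip sections that contain only blank lines.
        if any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = """# Overview
This section introduces the report.

## Figures
| Item | Value |
|------|-------|
| A    | 1     |
"""
chunks = chunk_by_heading(sample)
```

Each chunk carries its heading as metadata, which can be prepended to the text before embedding so that a retrieved table still says which section it came from. A fixed character window could instead cut the table between its header row and its data rows.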
