APAC RAG Document Ingestion: PDF Parsing, Multi-Format ETL, and On-Premise Conversion
The quality of an APAC RAG application is bounded by the quality of its document ingestion pipeline. Poorly parsed PDFs, with garbled tables, merged columns, and lost structure, produce noisy chunks that cause retrieval failures regardless of the embedding model or vector database. This guide covers the parsing and conversion tools APAC teams use to produce clean, accurately structured content from the diverse document formats found in APAC enterprise knowledge bases.
Three tools address the APAC document ingestion layer:
LlamaParse — LLM-powered PDF parsing service handling complex APAC document layouts including multi-column layouts, embedded tables, and figures.
Unstructured — open-source document ETL framework parsing 20+ file formats with enterprise source connectors for APAC SharePoint, Confluence, and S3.
Docling — IBM open-source PDF-to-Markdown converter running on-premise with table detection and reading order correction for APAC data-sovereign environments.
APAC Document Ingestion Decision Framework
APAC Document Scenario → Tool → Why
Complex PDFs (annual reports, regulatory docs, research papers) → LlamaParse → LLM understands layout; handles tables spanning pages
Multi-format enterprise repository (SharePoint, Confluence, S3 mix) → Unstructured → Handles PDF + DOCX + HTML; enterprise connectors
Confidential APAC documents (financial statements, IP docs) → Docling → On-premise only; no cloud API calls
High-volume simple PDF pipeline (text-dominant, few tables) → pypdf/pdfplumber → Fast; good enough for simple APAC PDFs
APAC scanned documents (OCR needed) → Unstructured + Tesseract → OCR pipeline built in; handles image-only PDFs
Mixed APAC + English documents (CJK regulatory filings) → LlamaParse → LLM-based extraction handles CJK natively
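The decision framework above can be sketched as a small routing helper. This is a hypothetical dispatcher, not part of any of the three libraries; the trait flags and tool labels are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class DocTraits:
    """Illustrative traits of an incoming APAC document."""
    is_confidential: bool = False
    is_scanned: bool = False
    multi_format_repo: bool = False
    has_complex_tables: bool = False


def choose_parser(traits: DocTraits) -> str:
    """Apply the decision framework in priority order: data
    sovereignty first, then OCR need, then format diversity,
    then layout complexity, with a fast fallback."""
    if traits.is_confidential:
        return "docling"                  # on-premise only, no cloud calls
    if traits.is_scanned:
        return "unstructured+tesseract"   # built-in OCR pipeline
    if traits.multi_format_repo:
        return "unstructured"             # PDF+DOCX+HTML, enterprise connectors
    if traits.has_complex_tables:
        return "llamaparse"               # LLM layout understanding
    return "pypdf"                        # fast path for simple text PDFs
```

The priority order encodes that constraints (confidentiality, OCR) dominate preferences (table quality, speed).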
LlamaParse: APAC Complex PDF Parsing
LlamaParse APAC setup and parsing
# APAC: LlamaParse — LLM-powered PDF parsing for complex APAC documents
import os

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# APAC: Initialize LlamaParse
apac_parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # APAC: structured Markdown output
    parsing_instruction=(
        "This is an APAC regulatory document. Extract all tables accurately. "
        "Preserve section hierarchy. Note Singapore dollar amounts as SGD."
    ),
    language="en",  # APAC: or "zh" for Chinese documents
    verbose=True,
)

# APAC: Parse single complex PDF
apac_parsed = apac_parser.load_data("mas_circular_2026_ai_governance.pdf")
print(apac_parsed[0].text[:500])
# → "# MAS Circular on AI Governance for Financial Institutions\n\n
#    ## 1. Introduction\n\nThe Monetary Authority of Singapore (MAS)..."
# APAC: Heading hierarchy preserved, not collapsed to flat text

# APAC: Integrate with LlamaIndex for RAG ingestion
file_extractor = {".pdf": apac_parser}
apac_documents = SimpleDirectoryReader(
    input_dir="/apac/regulatory-docs/",
    file_extractor=file_extractor,
).load_data()

# APAC: All PDFs in directory parsed via LlamaParse
print(f"APAC documents loaded: {len(apac_documents)}")
print(f"Avg chars per APAC doc: {sum(len(d.text) for d in apac_documents) // len(apac_documents)}")
LlamaParse APAC table extraction quality
# APAC: LlamaParse vs naive PDF extraction — table quality comparison
import pypdf # naive parser for comparison
# APAC: Example: MAS FEAT assessment table in PDF
apac_pdf_path = "mas_feat_assessment_criteria.pdf"
# Naive PyPDF extraction (problematic for tables):
apac_pypdf = pypdf.PdfReader(apac_pdf_path)
apac_naive_text = apac_pypdf.pages[3].extract_text()
print("Naive extraction:")
print(apac_naive_text[:300])
# → "Criterion Score Description Weight Fairness 4 Model outputs 0.25..."
# APAC: Table columns merged, structure lost, scores mixed with descriptions
# LlamaParse extraction (accurate table structure):
apac_parsed = apac_parser.load_data(apac_pdf_path)
print("\nLlamaParse extraction:")
print(apac_parsed[3].text[:400])
# → "| Criterion | Score | Description | Weight |
# |-----------|-------|-------------|--------|
# | Fairness | 4 | Model outputs do not discriminate | 0.25 |
# | Ethics | 3 | Aligned with APAC ethical AI principles | 0.20 |"
# APAC: Markdown table with correct column alignment
# → RAG chunks over this table retrieve correctly for "MAS FEAT scores"
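Since LlamaParse emits Markdown tables, the downstream chunking step should avoid splitting a table across chunk boundaries, otherwise a query like "MAS FEAT scores" retrieves half a table. A minimal sketch, using blank-line-separated blocks as the indivisible unit; this is our own helper, not a LlamaParse or LlamaIndex API.

```python
def chunk_preserving_tables(markdown: str, max_chars: int = 1500) -> list[str]:
    """Greedily pack blank-line-separated Markdown blocks into chunks
    without ever splitting a block, so a table stays whole."""
    blocks = [b for b in markdown.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if len(candidate) <= max_chars or not current:
            # fits, or the block alone is oversized (keep it whole anyway)
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks
```

An oversized table is emitted as one oversized chunk rather than being cut mid-row; truncation, if needed, is better handled explicitly downstream.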
Unstructured: APAC Multi-Format Document ETL
Unstructured APAC multi-format parsing
# APAC: Unstructured — parse any APAC document format to structured elements
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import (
    Title, NarrativeText, Table, ListItem,
)

# APAC: Partition handles any format automatically
def apac_parse_document(file_path: str) -> list:
    """Parse any APAC document type to structured elements."""
    apac_elements = partition(
        filename=file_path,
        strategy="hi_res",  # APAC: high-quality with OCR fallback
        languages=["eng", "chi_sim", "chi_tra", "jpn"],  # APAC: CJK support
    )
    return apac_elements
# APAC: Parse diverse document types with same API
apac_pdf_elements = apac_parse_document("apac_annual_report_2026.pdf")
apac_word_elements = apac_parse_document("apac_policy_document.docx")
apac_html_elements = apac_parse_document("mas_website_guidance.html")
apac_pptx_elements = apac_parse_document("apac_ai_strategy_deck.pptx")
# APAC: Filter by element type for selective processing
apac_tables = [e for e in apac_pdf_elements if isinstance(e, Table)]
apac_headings = [e for e in apac_pdf_elements if isinstance(e, Title)]
apac_narrative = [e for e in apac_pdf_elements if isinstance(e, NarrativeText)]
print(f"APAC tables found: {len(apac_tables)}")
print(f"APAC section headings: {len(apac_headings)}")
# APAC: Semantic chunking by title boundaries
apac_chunks = chunk_by_title(
    apac_pdf_elements,
    max_characters=1500,  # APAC: max chunk size for embedding
    new_after_n_chars=1200,  # APAC: start new chunk after this
    combine_text_under_n_chars=500,  # APAC: merge small APAC elements
)
print(f"APAC semantic chunks: {len(apac_chunks)}")
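Before embedding, each chunk needs a stable id and provenance metadata so retrieval hits can be traced back to their source document. A minimal sketch of payload assembly; it assumes only that `str(chunk)` yields the chunk text (true for Unstructured elements), and the payload shape is illustrative rather than any vector-DB API.

```python
def to_embedding_payloads(chunks: list, source_file: str) -> list[dict]:
    """Attach a deterministic id and source metadata to each chunk."""
    payloads = []
    for i, chunk in enumerate(chunks):
        payloads.append({
            "id": f"{source_file}#chunk-{i}",     # stable, re-ingestion-safe id
            "text": str(chunk),                   # element text for embedding
            "metadata": {
                "source": source_file,            # provenance for citations
                "chunk_index": i,
            },
        })
    return payloads
```

Deterministic ids make re-ingestion idempotent: re-running the pipeline over the same file upserts rather than duplicates.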
Unstructured APAC enterprise connector setup
# APAC: Unstructured — SharePoint connector for APAC enterprise intranet
# Note: the unstructured ingest connector API changes between versions;
# check your installed version's docs for the exact class names.
import os

from unstructured.ingest.connector.sharepoint import SharepointConnector
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig

# APAC: Ingest from SharePoint Online (APAC intranet)
apac_sharepoint_connector = SharepointConnector(
    client_id=os.environ["APAC_SHAREPOINT_CLIENT_ID"],
    client_credential=os.environ["APAC_SHAREPOINT_SECRET"],
    site="apac-enterprise.sharepoint.com/sites/AIGovernance",
    path="Shared Documents/Regulatory",
)

# APAC: Configure processing pipeline
apac_processor = ProcessorConfig(
    partition_config=PartitionConfig(
        strategy="hi_res",
        languages=["eng"],
    ),
    num_processes=4,  # APAC: parallel APAC document processing
    output_dir="/apac/parsed-docs/",
)

# APAC: Run ingestion — processes all APAC SharePoint docs
apac_sharepoint_connector.run(apac_processor)
# APAC: Parsed elements saved to /apac/parsed-docs/ as JSON
# → Ready for embedding and APAC vector DB ingestion
Docling: APAC On-Premise PDF Conversion
Docling APAC setup and conversion
# APAC: Docling — on-premise PDF parsing (no cloud API required)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# APAC: Configure Docling pipeline
apac_pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # APAC: enable OCR for scanned PDFs
    do_table_structure=True,  # APAC: enable table structure recognition
)
# APAC: match cells across merged regions
apac_pipeline_options.table_structure_options.do_cell_matching = True

# APAC: Initialize converter
apac_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=apac_pipeline_options,
        )
    }
)
# APAC: Convert confidential document locally — no internet required
apac_result = apac_converter.convert("apac_confidential_financial_report.pdf")
apac_markdown = apac_result.document.export_to_markdown()
print(apac_markdown[:600])
# → "# APAC Corporation Annual Financial Report 2025\n\n
# ## Revenue Summary\n\n
# | Quarter | Revenue (SGD M) | YoY Growth |\n
# |---------|-----------------|------------|\n
# | Q1 2025 | 234.5 | 12.3% |\n..."
# APAC: Tables correctly extracted; document never left the APAC server
# APAC: Save for RAG ingestion
with open("/apac/parsed/financial_report_2025.md", "w") as f:
    f.write(apac_markdown)
Related APAC Document AI Resources
For the vector search and embedding tools (Jina AI, Weaviate Cloud, Marqo) that consume LlamaParse, Unstructured, and Docling outputs, embedding the clean structured Markdown into APAC vector databases for semantic retrieval, see the APAC vector search and embedding guide.
For the structured output tools (Outlines, Guidance AI) that work downstream of document parsing, converting cleaned APAC regulatory document text into Pydantic objects with guaranteed schema conformance, see the APAC structured LLM output guide.
For the RAG infrastructure tools (pgvector, Haystack, Instructor) that provide the vector storage, retrieval pipeline, and LLM integration layer consuming parsed APAC document content after LlamaParse or Unstructured preprocessing, see the APAC RAG infrastructure guide.