Sixty-one engineers, architects, and technical leads from across APAC attended our January session. The focus: what actually goes wrong when RAG systems go to production in APAC enterprise environments.
This was a technical session — we assumed familiarity with RAG concepts and focused on the failure modes, mitigations, and APAC-specific challenges that are underrepresented in English-language documentation.
What We Covered
Section 1 — The five most common RAG production failures
We opened with a live audit of anonymised production RAG systems from five client deployments. Across all five, we identified the same five failure modes:
- Chunking strategy chosen for development, not production — Most RAG systems are prototyped with fixed-size chunking (e.g., 512 tokens). In production, this fails for long-form documents (policy manuals, regulations, technical specifications) where the relevant context spans multiple chunks and retrieval returns partial answers.
- Embedding model mismatch — The embedding model used to index the corpus is different from (or an older version of) the model used to embed the query at inference time. This causes silent retrieval degradation — the system returns results, but they are semantically misaligned. We saw this in two deployments where the corpus was indexed months before production and the embedding model had been updated.
- Context window exhaustion — Retrieving 5–8 chunks that, combined, exceed the context window causes the LLM to silently truncate. The model does not error — it just ignores the truncated content. Systems with long retrieved documents need explicit context budget management.
- No reranking step — Vector similarity retrieval is fast but imprecise. The top-k results often include semantically adjacent but factually irrelevant chunks. Without a reranking step (using a cross-encoder or BM25 hybrid), retrieval precision degrades significantly on longer corpora.
- Stale index — The corpus is updated weekly (or monthly), but queries reference current information. For dynamic content (pricing, regulatory updates, product specifications), the retrieval index needs to be refreshed at the same frequency as the source content updates.
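The context budget management point above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the chunks are assumed to arrive already sorted by retrieval score, and the whitespace token counter is a stand-in for your model's real tokeniser (e.g., tiktoken).

```python
def fit_context_budget(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    `chunks` is assumed to be in retrieval-score order (best first).
    The default token counter is a crude whitespace proxy -- swap in
    your model's actual tokeniser before relying on the counts.
    """
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget_tokens:
            break  # stop explicitly rather than let the model silently truncate
        kept.append(chunk)
        used += n
    return kept
```

Dropping lower-ranked chunks explicitly is the key design choice: the failure mode is not that content is cut, but that it is cut silently at the model boundary where you cannot observe it.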
Section 2 — APAC-specific language challenges
This was the section with the most attendee engagement.
Chinese-language (Simplified and Traditional) — Chinese text has no whitespace word boundaries, so tokenisation works fundamentally differently than in English. Most embedding models are trained on English-dominant corpora; retrieval quality degrades significantly on Chinese text, particularly for:
- Technical terminology (AI/ML terms translated into Chinese vary significantly by market — Taiwan uses different calques than Mainland China)
- Informal language and dialectal expressions (common in customer service applications)
- Mixed-language documents (Chinese technical documents often contain English acronyms and model names)
Mitigation: Use embedding models with strong multilingual training (Cohere Embed Multilingual, text-embedding-3-large from OpenAI with multilingual data, or models fine-tuned on Traditional/Simplified Chinese). Test retrieval quality on a held-out Chinese-language query set before production.
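The held-out query test can be as simple as measuring recall@k against a small set of labelled query/gold-chunk pairs. A minimal sketch, assuming you have per-query lists of retrieved chunk IDs and one labelled gold chunk per query (the function name and data shape are ours, not from any particular library):

```python
def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of queries whose labelled gold chunk appears in the top-k results.

    `retrieved_ids` is a list of per-query result-ID lists (ranked);
    `gold_ids` is the matching list of gold chunk IDs, one per query.
    """
    hits = 0
    for results, gold in zip(retrieved_ids, gold_ids):
        if gold in results[:k]:
            hits += 1
    return hits / len(gold_ids)
```

Run this on the Chinese-language query set for each candidate embedding model before committing; a few hundred labelled pairs is usually enough to separate models.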
Japanese-language — Japanese presents an additional challenge: three writing systems (hiragana, katakana, kanji) that are often mixed within a single sentence. Katakana is used for technical terms, many of which are transliterations of English AI terms. Chunking that splits across character type boundaries produces poor retrieval.
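One way to avoid splitting inside a katakana technical term is to segment text into same-script runs before chunking. The sketch below uses only the standard library; the classification rules are our illustration (a rough Unicode-name heuristic), not a production Japanese segmenter:

```python
import unicodedata

def script_of(ch):
    """Rough script classifier for Japanese text via Unicode character names."""
    name = unicodedata.name(ch, "")
    # Check KATAKANA first: the prolonged sound mark's name contains
    # both "KATAKANA" and "HIRAGANA", and it should stay with katakana runs.
    if "KATAKANA" in name:
        return "katakana"
    if "HIRAGANA" in name:
        return "hiragana"
    if "CJK UNIFIED" in name:
        return "kanji"
    return "other"

def script_runs(text):
    """Split text into maximal runs of a single script, so a chunker
    can avoid cutting inside a katakana technical term."""
    runs, cur, cur_script = [], "", None
    for ch in text:
        s = script_of(ch)
        if s != cur_script and cur:
            runs.append(cur)
            cur = ""
        cur, cur_script = cur + ch, s
    if cur:
        runs.append(cur)
    return runs
```

A chunker can then treat run boundaries (rather than arbitrary character offsets) as candidate split points.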
Korean-language — Korean agglutinative morphology means that a single word can encode what English encodes in multiple words (verb stems with tense, aspect, and politeness markers combined). Simple tokenisation misses this; retrieving by surface form fails on Korean queries that use different morphological forms of the same root.
Section 3 — Latency at scale
We showed latency profiles from three production deployments at different scales:
- 1,000 queries/day: vector retrieval P95 < 200ms; total RAG pipeline P95 < 2.5s
- 50,000 queries/day: vector retrieval P95 300–500ms; total P95 3–8s
- 500,000 queries/day: vector retrieval P95 1–3s unless using approximate nearest neighbour (ANN) with HNSW; total P95 5–15s
The lesson: HNSW-based approximate retrieval is essential at scale, but it requires tuning the ef_construction and M parameters for your specific corpus size and query pattern. The default parameters in most vector databases are optimised for small-to-medium corpora.
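As a starting point, the parameters below are illustrative values for a corpus in the tens of millions of vectors, not recommendations; the `tune_ef_search` helper is our own sketch of the usual tuning loop (measure recall against exact brute-force results on a held-out query set, raise ef until it plateaus):

```python
# Illustrative HNSW settings -- tune against your own recall/latency
# measurements, not these numbers.
hnsw_params = {
    "M": 32,                # graph connectivity: higher = better recall, more memory
    "ef_construction": 256, # build-time candidate list: higher = better index, slower build
    "ef_search": 128,       # query-time candidate list: raise until recall plateaus
}

def tune_ef_search(measure_recall, target=0.95, start=32, limit=1024):
    """Double ef_search until measured recall@k meets the target.

    `measure_recall` is caller-supplied: ef -> recall on a held-out
    query set, scored against exact (brute-force) nearest neighbours.
    """
    ef = start
    while ef <= limit:
        if measure_recall(ef) >= target:
            return ef
        ef *= 2
    return limit
```

The trade-off to keep in mind: M and ef_construction are fixed at index build time, while ef_search can be adjusted per query, so it is the cheapest knob to revisit as the corpus grows.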
Selected Q&A
Q: What is the right chunk size for enterprise documents?
A: It depends on the retrieval granularity you need. For policy documents where you need paragraph-level answers, 256–512 tokens with 50-token overlap works well. For technical reference documents where you need section-level context, 1024–2048 tokens. The critical insight: chunk size should match the granularity of the questions you expect, not the granularity of the source documents.
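The overlap scheme described above can be sketched as follows. The tokens here are plain words for simplicity; in practice you would chunk over your embedding model's own token IDs:

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Fixed-size chunking with overlap over a pre-tokenised document.

    `tokens` can be any sequence; consecutive chunks share `overlap`
    tokens so that answers spanning a boundary survive in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks
```

Changing `size` from 512 to 1024–2048 for section-level retrieval is then a one-parameter switch, which makes it easy to A/B the two granularities on the same evaluation set.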
Q: We are seeing high hallucination rates even when the relevant content is in the retrieved chunks. What is going wrong?
A: The most common cause is prompt structure — if the retrieved context is presented after the question, some models use their parametric knowledge before "seeing" the context. Put the retrieved context before the question in the prompt. Also check whether your retrieved chunks actually contain the answer — run a retrieval quality evaluation (not just end-to-end accuracy) to isolate whether the failure is retrieval or generation.
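The recommended ordering can be captured in a simple prompt builder (a sketch; the exact instruction wording is ours and should be tuned per model):

```python
def build_rag_prompt(context_chunks, question):
    """Place retrieved context BEFORE the question, and instruct the
    model to answer only from that context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks also makes it easy to ask the model to cite which chunk supports each claim, which helps when auditing hallucinations later.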
Q: Is Chroma still the right vector database for a 10M document corpus?
A: Chroma is well-suited for development and small-to-medium production deployments (up to ~1M documents in our experience). At 10M documents, we recommend Qdrant, Weaviate, or managed services (Pinecone, Zilliz) — they offer better performance on ANN at scale, more mature HNSW implementations, and production-grade persistence and replication.
Q: How do you handle document updates in a production RAG system?
A: The cleanest approach is a metadata-filtered deletion + re-index. When a document changes, delete all vectors associated with its document ID (using metadata filtering in your vector store), then re-embed and re-index the updated chunks. This is more reliable than attempting partial updates.
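The delete-then-reindex pattern looks like this in outline. The store below is a toy in-memory stand-in (and `embed=len` in the test is a fake embedding) purely to show the control flow; in production the `delete_by_doc_id` step would be a metadata-filtered delete in your actual vector store:

```python
class InMemoryVectorStore:
    """Toy stand-in for a real vector store, to illustrate the update pattern."""

    def __init__(self):
        self.records = {}  # vector_id -> {"doc_id", "text", "embedding"}
        self._next_id = 0

    def upsert(self, doc_id, chunks, embed):
        for text in chunks:
            self.records[self._next_id] = {
                "doc_id": doc_id, "text": text, "embedding": embed(text),
            }
            self._next_id += 1

    def delete_by_doc_id(self, doc_id):
        stale = [vid for vid, r in self.records.items() if r["doc_id"] == doc_id]
        for vid in stale:
            del self.records[vid]

def reindex_document(store, doc_id, new_chunks, embed):
    """Delete every vector for the document, then embed and insert the
    updated chunks -- more reliable than attempting partial updates."""
    store.delete_by_doc_id(doc_id)
    store.upsert(doc_id, new_chunks, embed)
```

Tagging every vector with its source document ID at ingest time is the prerequisite: without that metadata, the deletion step has nothing to filter on.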
Resources
- RAG encyclopedia entry
- Vector databases encyclopedia entry
- Multi-Agent AI Systems: Design Patterns
- Enterprise AI Evaluation Framework
Our next technical webinar covers LLM fine-tuning for APAC enterprise use cases. Contact us to register.