AIMenta
Acronym foundational · RAG & Retrieval

Retrieval-Augmented Generation (RAG)

A pattern that grounds LLM responses in retrieved documents — the standard approach for building fact-anchored AI over proprietary knowledge bases.

Retrieval-Augmented Generation (RAG) is the pattern of grounding a language model's response in documents retrieved at inference time from an external knowledge source. The flow is simple in outline: the user's query is embedded into a vector, a vector database returns the most semantically similar passages from a corpus, those passages are inserted into the LLM's prompt as context, and the model generates a response grounded in that context rather than solely in pretraining memory. Introduced by Lewis et al. (2020), RAG became the default pattern for enterprise AI as soon as foundation models with long-enough context windows made it practical.
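The flow above can be sketched end to end. This is a minimal illustration, not a production implementation: a toy bag-of-words embedder and cosine similarity stand in for a real embedding model and vector database, and the assembled prompt would be sent to an LLM rather than returned.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an
    # embedding model and get back a dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank corpus passages by similarity to the query embedding;
    # a vector database does this at scale with ANN search.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Ground the model: retrieved passages become prompt context.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer only from the provided context.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

corpus = [
    "Annual leave accrues at 1.5 days per month of service.",
    "The VPN requires multi-factor authentication on all devices.",
    "Expense claims must be filed within 30 days of purchase.",
]
query = "how many days of annual leave do I get"
passages = retrieve(query, corpus)
prompt = build_prompt(query, passages)
```

In a real stack, `prompt` would be passed to the generation model; the structure (embed, retrieve, assemble, generate) is the same regardless of which embedder, index, or LLM fills each slot.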

A production RAG stack has four moving parts. **Ingestion and chunking** — documents are split into semantically coherent passages (typically 256–1024 tokens each with overlap). **Embedding** — each chunk is encoded into a vector using an embedding model (OpenAI text-embedding-3, Cohere embed-v3, BGE, Jina, GTE, or a fine-tuned domain-specific embedder). **Retrieval** — at query time the query is embedded and matched against the vector store via approximate nearest neighbour search (HNSW, IVF, PQ), often combined with keyword search (BM25) for hybrid retrieval. **Generation** — top-k retrieved passages are placed in the LLM prompt, usually with explicit "answer only from the provided context" instructions.
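The ingestion step is the easiest to sketch. A minimal fixed-size chunker with overlap, as described above, might look like this (the token list and window sizes are illustrative; production chunkers typically also respect sentence and section boundaries):

```python
def chunk(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split a token sequence into fixed-size windows with overlap,
    so text cut at a chunk boundary still appears whole in one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(600)]  # stand-in for a tokenised document
parts = chunk(tokens, size=256, overlap=32)
```

Each chunk is then embedded and indexed; the overlap (here 32 tokens) is what keeps a sentence straddling a boundary retrievable from at least one chunk.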

For APAC mid-market enterprises, RAG is the correct starting architecture for nearly every LLM-over-proprietary-data project — internal knowledge bases, customer support over product documentation, legal research, policy Q&A, compliance search. It gives you freshness (update the corpus, not the model), auditability (you can cite the retrieved passages), and governance (sensitive documents can be filtered pre-retrieval rather than relying on model-level policy). Fine-tuning or training a custom model is rarely the right first move; RAG handles the long tail of business knowledge far more cheaply.
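The governance point above — filtering sensitive documents before retrieval rather than trusting model-level policy — can be made concrete. A hypothetical sketch (the `classification` and `department` fields, clearance levels, and rules are illustrative assumptions, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    classification: str  # assumed levels: "public", "internal", "restricted"
    department: str

def allowed(chunk: Chunk, user_clearance: str, user_dept: str) -> bool:
    # Governance gate applied BEFORE similarity search: chunks the user
    # may not see never enter the candidate pool, so they can never
    # leak into the LLM prompt regardless of model behaviour.
    order = {"public": 0, "internal": 1, "restricted": 2}
    if order[chunk.classification] > order[user_clearance]:
        return False
    if chunk.classification == "restricted" and chunk.department != user_dept:
        return False
    return True

index = [
    Chunk("Holiday calendar 2025", "public", "hr"),
    Chunk("Salary bands by grade", "restricted", "hr"),
    Chunk("Incident response runbook", "internal", "it"),
]
# Only permitted chunks are handed to the vector search step.
candidates = [c for c in index if allowed(c, "internal", "it")]
```

The key design choice is that access control runs on metadata before ranking, so auditability comes for free: every passage that could appear in a response passed an explicit policy check.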

The non-obvious operational truth: **RAG failures are almost always retrieval failures, not generation failures**. Teams invest heavily in prompt engineering and model selection while their retrieval quality — chunking strategy, embedding choice, reranking, query rewriting — remains unexamined. Instrumented retrieval metrics (recall@k on labelled queries, hit-rate-at-position, MRR) usually reveal that the model could answer correctly if only the right passages were retrieved. Budget explicitly for retrieval evaluation; it pays back more than any other single investment.
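The metrics named above are cheap to compute once you have a labelled query set (query plus the IDs of passages that should be retrieved). A minimal sketch, with illustrative document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents appearing in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

# Labelled evaluation set: (ranked retrieval output, relevant doc IDs).
labelled = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d2", "d5", "d4"], {"d2"}),  # first relevant hit at rank 1
]
score = mrr(labelled)
```

Tracking these numbers per chunking strategy, embedder, and reranker is what turns "the model hallucinated" into the usually more accurate diagnosis "the right passage was never retrieved".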

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
