TL;DR
- Production LLM systems typically waste 50-70% of inference spend on sub-optimal patterns.
- Five techniques (caching, routing, distillation, prompt compression, batching) recover most of it.
- Cost engineering should start when monthly inference spend exceeds US$5,000. Below that, optimisation is premature.
Why now
LLM inference cost has fallen 70-80% per token over 2023-2025 as frontier model providers compete. Costs in absolute terms are still significant for high-volume use cases. A customer-support copilot serving 50,000 conversations per month at US$0.08 per conversation costs US$48,000 per year. The same workload with cost engineering can run at US$18,000 per year.
The cost differential is meaningful at mid-market scale. Bain's Technology Report 2025 tracks inference cost as one of the top three operational concerns for enterprise AI deployers in 2025.[^1] Yet most teams treat inference cost as a fixed input rather than an engineering problem.
This article describes five techniques used by production teams to reduce LLM inference cost by 50-70% without quality regression.
Where the cost actually is
Before optimising, instrument. The biggest cost lines in a typical RAG-based system are:
- Generation tokens (output): often 40-60% of total cost
- Context tokens (input): often 30-50% of total cost
- Embeddings (at query time and index time): typically 5-10%
- Vector store and infrastructure: typically 5-15%
Generation is expensive per token; context is expensive per call because RAG systems pass large contexts. Together the two often dominate. Embeddings and infrastructure are usually small enough to deprioritise for cost optimisation.
Once you know the breakdown, the techniques apply selectively.
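As a concrete sketch, the breakdown above can be computed from per-call token counts. All prices below are illustrative placeholders, not real provider rates, and the component names are a suggested schema:

```python
# Sketch of a per-call cost breakdown for a RAG system.
# All prices are illustrative placeholders, not real provider rates.

PRICES = {
    "input_per_1k": 0.003,    # US$ per 1k context tokens
    "output_per_1k": 0.015,   # US$ per 1k generated tokens
    "embed_per_1k": 0.0001,   # US$ per 1k embedded tokens
}

def call_cost(input_tokens, output_tokens, embed_tokens):
    """Return the cost components for one inference call."""
    return {
        "context": input_tokens / 1000 * PRICES["input_per_1k"],
        "generation": output_tokens / 1000 * PRICES["output_per_1k"],
        "embeddings": embed_tokens / 1000 * PRICES["embed_per_1k"],
    }

def breakdown(calls):
    """Aggregate component costs across calls; return (cost, share) pairs."""
    totals = {"context": 0.0, "generation": 0.0, "embeddings": 0.0}
    for c in calls:
        for component, cost in call_cost(**c).items():
            totals[component] += cost
    grand = sum(totals.values())
    return {k: (v, v / grand) for k, v in totals.items()}
```

Feeding this with real per-call token logs (see the instrumentation step in the playbook below) gives the percentage shares that decide which techniques are worth applying.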
Technique 1: response caching
Most production LLM workloads have repetitive queries. The same FAQ question. The same document summary requested by multiple users. The same agent task with the same inputs.
Response caching: if the input matches a previous input, return the previous output. Implementations range from exact-match caching (simple, often 10-20% hit rate) to semantic caching (embed the input, find similar past inputs, return their cached output if similarity exceeds threshold; 30-50% hit rate is common).
The cost reduction equals the cache hit rate times the per-call cost saved, net of cache lookup costs. A 30% hit rate at US$0.05 per call (each hit replaced by a US$0.001 cache lookup) saves 0.3 × (US$0.05 − US$0.001) ≈ US$0.0147 per call averaged across the entire workload. On a 50,000-call-per-month workload, that is roughly US$735 per month saved.
Cautions: semantic caching can produce wrong-context returns (a similar question with a different correct answer). Validate carefully. Apply caching only where context-independence is reasonable.
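A minimal semantic cache can be sketched as follows. The `embed_fn` parameter is an assumption of this sketch: any function that maps text to a fixed-length vector (typically an embedding model client) will do, and the 0.92 threshold is a starting point to tune, not a recommendation:

```python
import math

class SemanticCache:
    """Minimal semantic cache sketch. `embed_fn` is assumed to be any
    function mapping text to a fixed-length vector (e.g. an embedding
    model client); it is a placeholder here, not a specific API."""

    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (vector, cached_response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        """Return the cached response of the most similar past query,
        or None if nothing clears the similarity threshold."""
        qv = self.embed_fn(query)
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = self._cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

A production version would replace the linear scan with a vector index and, per the caution above, should gate cache hits on context (user, tenant, document version) rather than similarity alone.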
Technique 2: model routing
Not every query needs the most capable (and most expensive) model. A model router classifies the incoming query and routes it to the appropriate model: simple queries to a cheap small model, complex queries to a frontier model.
A working router typically achieves 60-80% of queries on cheaper models with no quality regression on a curated eval set. With the cheaper model at 1/10 the cost of the frontier model, the average cost per query drops 50-65%.
Router design options:
- Rule-based router (cheap, simple, brittle to new query patterns)
- Small classifier model (more flexible, requires training data)
- LLM-as-router (use a small LLM to classify, balances cost and flexibility)
Run the router itself on a cheap fast model. The router decision should add less than 10% to the average per-call latency.
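The rule-based option is the cheapest to prototype. The sketch below uses hypothetical model names and illustrative trigger patterns; the patterns and length cutoff are assumptions you would calibrate against your own eval set:

```python
import re

# Hypothetical model names for illustration only.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Illustrative patterns suggesting a query needs the frontier model.
COMPLEX_PATTERNS = [
    r"\bwhy\b", r"\bcompare\b", r"\bexplain\b",
    r"\bstep[- ]by[- ]step\b", r"\banalys[ei]s?\b",
]

def route(query: str, max_cheap_len: int = 300) -> str:
    """Rule-based router: long or analytically phrased queries go to
    the frontier model; everything else to the cheap model."""
    if len(query) > max_cheap_len:
        return FRONTIER_MODEL
    lowered = query.lower()
    if any(re.search(p, lowered) for p in COMPLEX_PATTERNS):
        return FRONTIER_MODEL
    return CHEAP_MODEL
```

The brittleness noted above shows up exactly here: every new query pattern needs a new rule, which is why teams graduate to a small classifier once they have labelled routing data.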
Technique 3: prompt compression
RAG systems pass large contexts, and much of that context is not strictly necessary. Prompt compression reduces context size with little or no quality loss.
Approaches:
- Better retrieval: tighter chunks, better re-ranking, fewer chunks per query (often the highest-payoff approach)
- LLM-based summarisation of retrieved chunks before insertion (use a cheap model)
- Token-level compression using specialised models (LLMLingua and similar)
- Structured prompt templates that omit unnecessary boilerplate
A well-tuned compression strategy can reduce context tokens by 40-60% with negligible quality regression on most use cases. The cost saving is significant because context tokens are usually a large share of total cost.
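The "fewer chunks per query" approach can be as simple as enforcing a token budget over re-ranked chunks. This sketch approximates token counts with word counts, which is a simplification; a real implementation would use the model's tokenizer:

```python
def compress_context(chunks, token_budget=2000):
    """Keep the highest-scoring retrieved chunks until the token budget
    is exhausted. `chunks` is a list of (relevance_score, text) pairs;
    token counts are approximated by whitespace word count here."""
    kept, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        n_tokens = len(text.split())  # crude proxy for real tokenization
        if used + n_tokens > token_budget:
            continue  # skip chunks that would blow the budget
        kept.append(text)
        used += n_tokens
    return "\n\n".join(kept)
```

Tightening `token_budget` is a direct lever on the input-token line in the cost breakdown, and its quality impact is easy to measure on an eval set.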
Technique 4: distillation
For high-volume use cases with consistent task patterns, distil the behaviour into a smaller fine-tuned model.
The pattern:
- Run a smaller candidate model in shadow alongside the production frontier-model system
- Capture frontier model outputs as training data
- Fine-tune the smaller model on the captured data
- Evaluate against the curated test set
- Switch traffic to the fine-tuned model when quality matches
A distilled small model often runs at 10-25% of the frontier model cost while reaching 90-95% of the quality on the specific task. For a high-volume task this can be the largest single cost optimisation.
Caution: distillation only pays back at sufficient volume. Below 100,000 calls per month the engineering investment exceeds the savings. Above 1 million calls per month it is often the dominant cost technique.
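The capture step of the pattern above can be sketched as an append-only training log. The JSONL schema here is generic and hypothetical; adapt the field names to whatever your fine-tuning provider expects:

```python
import json

def capture_training_example(path, prompt, frontier_output, metadata=None):
    """Append one (prompt, completion) pair produced by the production
    frontier model to a JSONL file for later fine-tuning. The record
    schema is a generic sketch, not any provider's required format."""
    record = {
        "prompt": prompt,
        "completion": frontier_output,
        "metadata": metadata or {},
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

In practice this hook sits in the production serving path, filtered to the target use case, so the captured distribution matches the traffic the distilled model will eventually serve.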
Technique 5: batching
Generation can be batched: multiple requests served in a single inference call where the model and infrastructure support it. Some providers expose batch APIs at lower per-token cost (often 50% off) for non-real-time workloads.
Workloads that can use batching:
- Document processing pipelines that run on a schedule
- Research and analysis tasks where 30-minute or 24-hour latency is acceptable
- Backfills and re-processing of historical data
For these workloads the cost saving is mechanical: switch from real-time inference to batch inference, get the discount.
Batching is not appropriate for user-facing real-time workloads. Use it where latency budgets allow.
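On the client side, the minimum machinery is an accumulator that collects latency-tolerant jobs and hands them off in one go. The `submit_fn` below stands in for a provider batch API call and is an assumption of this sketch:

```python
import time

class BatchQueue:
    """Minimal accumulator for latency-tolerant requests: collect jobs
    and hand them to a batch submitter in one call. `submit_fn` stands
    in for a provider batch API and is an assumption of this sketch."""

    def __init__(self, submit_fn, max_size=100, max_wait_s=1800):
        self.submit_fn = submit_fn
        self.max_size = max_size      # flush when this many jobs queue up
        self.max_wait_s = max_wait_s  # ...or when the oldest job is this stale
        self.jobs = []
        self.oldest = None

    def add(self, job):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.jobs.append(job)
        if len(self.jobs) >= self.max_size or self._expired():
            self.flush()

    def _expired(self):
        return (time.monotonic() - self.oldest) >= self.max_wait_s

    def flush(self):
        if self.jobs:
            self.submit_fn(self.jobs)
            self.jobs, self.oldest = [], None
```

The `max_wait_s` ceiling is what keeps this safely inside the 30-minute or 24-hour latency budgets mentioned above.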
Combining the techniques
Most production cost optimisations combine several techniques. A working pattern for a high-volume customer-support copilot:
- Semantic caching catches 30% of queries with no model call
- Model routing sends 70% of remaining queries to a small cheap model
- Prompt compression on the retrieved context reduces input tokens by 45%
- Distillation may apply at higher scale
- Batch processing for offline analysis tasks
Combined cost reduction: 60-75% versus a naive frontier-model-only deployment. Quality maintained on the curated eval set. Latency unchanged or improved.
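The stacking can be expressed as a rough multiplicative cost model. The defaults mirror the worked example above but the model is deliberately simplified (it ignores interaction effects between routing, caching, and compression), so treat its output as indicative only:

```python
def combined_cost_factor(cache_hit=0.30, routed=0.70, cheap_ratio=0.10,
                         input_reduction=0.45, input_share=0.40):
    """Rough multiplicative model of the stacked techniques. Returns
    the fraction of the naive frontier-only baseline cost remaining.
    Defaults mirror the worked example and are illustrative."""
    # Caching removes cache_hit of all calls outright.
    after_cache = 1.0 - cache_hit
    # Routing: `routed` share of remaining calls runs at cheap_ratio cost.
    after_routing = after_cache * (routed * cheap_ratio + (1 - routed))
    # Compression only shrinks the input-token share of remaining cost.
    after_compress = after_routing * (1 - input_share * input_reduction)
    return after_compress
```

Plugging in your own measured hit rates and routing shares turns this from an illustration into a planning tool for the thresholds discussed next.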
When to start cost engineering
Cost engineering investment pays back once monthly inference spend exceeds US$5,000. Below that it is premature optimisation.
Rough thresholds for a mid-market enterprise:
- Monthly inference spend below US$2,000: optimisation is premature; build the system, measure, learn
- US$2,000-US$5,000: instrument and document the cost structure; do not yet optimise
- US$5,000-US$20,000: start with response caching and model routing
- US$20,000-US$100,000: add prompt compression and consider distillation for the highest-volume use cases
- Above US$100,000: comprehensive cost engineering programme; expect 50-70% reduction
The order matters. Caching and routing are usually the largest gains for the lowest engineering investment. Distillation is the biggest gain at the largest engineering investment.
Implementation playbook
A 90-day cost engineering programme for a system spending US$10,000-US$50,000 per month on inference.
- Days 1-15: Instrumentation. Tag every inference call with use case, model, input tokens, output tokens, latency. Build a per-use-case cost dashboard.
- Days 16-30: Caching. Implement exact-match cache (week 1), semantic cache (week 2). Measure hit rate and cost reduction.
- Days 31-50: Routing. Build a simple rule-based or small-classifier router. Define quality eval set. Validate quality on cheaper models. Roll out gradually.
- Days 51-70: Compression. Audit retrieval: are too many chunks being fetched? Is the context bloated with boilerplate? Apply LLM-based compression where retrieval tuning is exhausted.
- Days 71-90: Measurement and iteration. Compare cost-per-successful-outcome to baseline. Identify remaining cost drivers. Plan next quarter's work (often distillation for the highest-volume cases).
- Quarterly: Re-instrument, re-measure. Frontier model pricing changes. Workload patterns shift. Re-evaluate.
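The days 1-15 instrumentation step can be sketched as a tagged call record plus a per-use-case aggregator. The field names are a suggested schema and the per-1k-token prices are illustrative, not real provider rates:

```python
from dataclasses import dataclass, field
import time

@dataclass
class InferenceCallRecord:
    """One tagged inference call, per the days 1-15 instrumentation
    step. Field names are a suggested schema, not a standard."""
    use_case: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    timestamp: float = field(default_factory=time.time)

def per_use_case_cost(records, price_in_per_1k, price_out_per_1k):
    """Aggregate US$ cost by use case for a dashboard. Prices are
    illustrative per-1k-token rates passed in by the caller."""
    totals = {}
    for r in records:
        cost = (r.input_tokens / 1000 * price_in_per_1k
                + r.output_tokens / 1000 * price_out_per_1k)
        totals[r.use_case] = totals.get(r.use_case, 0.0) + cost
    return totals
```

In production these records would land in your existing telemetry pipeline; the point is that every call carries the tags, so the dashboard is a query, not a project.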
What good looks like
A team that has run cost engineering well has:
- A real-time per-use-case cost dashboard
- A documented cost structure: where the money goes, in what proportions
- Active caching with measured hit rates
- A model router with quality validation per route
- Compression strategy documented per use case
- Quarterly review of cost-per-successful-outcome trends
Cost discipline is a culture as much as a technology stack. Teams that talk about cost in their weekly engineering reviews are the teams whose costs are under control.
Counter-arguments
"Inference cost will keep falling; do not bother." It will. But optimisation today still pays back today, and the savings compound. More importantly, engineering for cost is what builds the cost-per-outcome discipline that scales.
"Caching produces wrong answers." Naive caching can. Well-designed semantic caching with appropriate similarity thresholds and context awareness does not. The risk is real but manageable with good engineering.
"Routing introduces complexity." It does. The complexity pays back once monthly inference spend exceeds US$5,000. Below that threshold the complexity is not justified.
Bottom line
Production LLM systems typically have substantial cost waste that engineering can recover. Five techniques (caching, routing, distillation, prompt compression, batching) deliver 50-70% cost reduction in production deployments with no quality loss. The right time to start is when monthly inference spend exceeds US$5,000. Below that threshold, build the system right, measure, and revisit.
If your team is spending US$30,000+ per month on inference today and has not run a cost engineering exercise, the savings are likely to fund the engineering investment within 60 days.
Next read
- Choosing a Vector Database in 2026: 7 Options Compared
- What Asian Mid-Market AI Pilots Actually Cost in 2026
By Hyejin Lee, Director, CFO Advisory.
[^1]: Bain & Company, Technology Report 2025, October 2025, p. 51.