Inference is the operational phase of machine learning — running a trained model to produce predictions on new inputs. The terminology distinguishes inference from training, which is the phase where a model's parameters are learned. For most production AI systems, inference is where the recurring cost lives: training happens once (or on a periodic retraining cadence); inference happens on every request, for every user, across the product's lifetime. As models have scaled, inference cost and latency have become dominant engineering concerns rivalling or exceeding training cost for many workloads.
The inference stack has several layers. **Hardware** — GPUs (H100, H200, B100 for largest models; A100, L40S, A10 for mid-tier; T4, L4 for smaller), TPUs, Apple Silicon, increasingly specialised inference chips (Groq, Cerebras, AWS Inferentia, SambaNova). **Runtime** — PyTorch, TensorFlow, ONNX Runtime, vLLM, TensorRT-LLM, MLC-LLM, llama.cpp — each with different performance characteristics and hardware targets. **Serving frameworks** — Triton, KServe, vLLM server, BentoML, SageMaker / Vertex / Azure ML endpoints — handle batching, queueing, autoscaling, and multi-model routing. **Optimisation techniques** — quantisation (INT8, FP8, 4-bit), speculative decoding, continuous batching, KV-cache management, tensor / pipeline parallelism — squeeze more throughput from the hardware.
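To make one of the optimisation techniques concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in pure Python. It is illustrative only — production quantisation uses per-channel or per-group scales and calibration data, and runs as fused kernels on the accelerator — but the arithmetic is the same idea.

```python
# Symmetric per-tensor INT8 quantisation (illustrative sketch, not a
# production kernel). One scale maps the float range onto [-127, 127].

def quantize_int8(weights):
    """Map float weights to int8 values plus a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.52, -1.3, 0.07, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Rounding error per weight is bounded by roughly scale / 2.
```

The payoff in practice is memory: INT8 weights take a quarter of the space of FP32 (half of FP16), which directly lifts the batch size and KV-cache budget a given GPU can hold.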
For APAC mid-market enterprises, the right inference strategy depends on workload shape. **High-volume low-latency** (user-facing chat, real-time classification) favours dedicated serving infrastructure with continuous batching and aggressive quantisation. **Bulk asynchronous** (overnight batch processing, large-scale document analysis) tolerates higher per-request latency in exchange for throughput-optimised serving at lower cost per token. **Mixed** workloads usually benefit from separate serving tiers rather than one-size-fits-all. The most expensive mistake is sizing inference infrastructure for peak load with no autoscaling — GPU hours idle at 3am are cash set on fire.
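The peak-sizing mistake is easy to quantify with a back-of-envelope model. The numbers below — hourly GPU cost, fleet size, average utilisation — are illustrative assumptions, not vendor pricing; substitute your own.

```python
# Back-of-envelope comparison: a fleet sized for peak load running 24/7
# versus an autoscaled fleet paying only for GPU-hours actually used.
# All figures are assumed, illustrative values.

HOURLY_GPU_COST = 4.0     # assumed $/GPU-hour for a mid-tier card
PEAK_GPUS = 8             # GPUs needed to serve peak traffic
HOURS_PER_MONTH = 730

def static_monthly_cost():
    """Fixed fleet sized for peak, billed around the clock."""
    return PEAK_GPUS * HOURLY_GPU_COST * HOURS_PER_MONTH

def autoscaled_monthly_cost(avg_utilisation=0.35):
    """Fleet scales with load; cost tracks average utilisation of peak."""
    return PEAK_GPUS * avg_utilisation * HOURLY_GPU_COST * HOURS_PER_MONTH

static = static_monthly_cost()      # 23,360
scaled = autoscaled_monthly_cost()  # 8,176
savings = static - scaled           # 15,184 per month at these assumptions
```

At a typical diurnal traffic curve (35% average utilisation of peak), the static fleet burns roughly 65% of its spend on idle capacity — the "GPU hours idle at 3am" line item.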
The non-obvious operational note: **LLM inference latency is dominated by the decode phase**, not the prefill phase. Each output token requires a full forward pass through the model with the current KV cache. Techniques that attack this — speculative decoding (small draft model proposes tokens, large model verifies), Medusa-style parallel decoding, prompt-caching across similar requests — can materially reduce latency and cost. Benchmark your actual workload's prefill/decode ratio before choosing optimisations; they target different parts of the curve.
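A toy latency model makes the prefill/decode split and the speculative-decoding payoff concrete. The per-token timings, draft length, and acceptance rate below are assumptions chosen for illustration — benchmark your own workload to get real numbers.

```python
# Toy latency model: prefill is parallel over the prompt, decode is
# sequential per output token. Timings and acceptance rate are assumed.

def request_latency_ms(prompt_tokens, output_tokens,
                       prefill_ms_per_tok=0.2, decode_ms_per_tok=20.0):
    """Total latency = one parallel prefill pass + one decode step per token."""
    return prompt_tokens * prefill_ms_per_tok + output_tokens * decode_ms_per_tok

def speculative_decode_ms(output_tokens, decode_ms_per_tok=20.0,
                          draft_len=4, acceptance=0.7, draft_ms_per_tok=2.0):
    """Each verify step costs one large-model pass plus draft_len cheap draft
    passes, and accepts on average draft_len * acceptance tokens."""
    tokens_per_step = draft_len * acceptance
    steps = output_tokens / tokens_per_step
    return steps * (decode_ms_per_tok + draft_len * draft_ms_per_tok)

baseline = request_latency_ms(1000, 300)  # 200 + 6000 = 6200 ms
spec = speculative_decode_ms(300)         # 3000 ms at these assumptions
```

Even with a 1,000-token prompt, decode accounts for 6,000 of the 6,200 ms here — which is why speculative decoding, attacking only the decode phase, still roughly halves end-to-end latency under these assumptions.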