Quantization compresses a neural network by representing weights and activations in lower-precision formats — typically INT8, INT4, or FP4 — instead of the FP16/BF16/FP32 precision used during training. The memory footprint drops proportionally (INT8 halves FP16, INT4 quarters it), and inference latency improves because lower-precision arithmetic is faster and memory-bandwidth pressure drops on most hardware. The trade-off is quality: aggressive quantization loses numerical fidelity, and the question for every deployment is how far precision can be pushed before the model degrades materially. Modern techniques preserve quality well enough that most frontier LLMs are served in INT8 or INT4 in production, not in their training precision.
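The core mechanics are simple: pick a scale that maps the tensor's range onto the integer grid, round, and store the integers plus the scale. A minimal sketch of symmetric per-tensor absmax INT8 quantization (one common PTQ recipe, not the only one):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor absmax quantization: largest magnitude maps to 127."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original tensor."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4096,)).astype(np.float32)
q, scale = quantize_int8(w)

# One byte per value vs two for FP16: the 2x memory reduction cited above.
assert q.nbytes * 2 == w.astype(np.float16).nbytes
# Round-trip error is bounded by half a quantization step.
assert np.abs(dequantize(q, scale) - w).max() <= scale / 2 + 1e-6
```

The per-tensor scale is the weak point: one large outlier stretches the grid and wastes resolution on everything else, which is exactly what the per-group and activation-aware methods below address.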
The 2026 landscape has sophisticated tooling. **Post-training quantization (PTQ)** quantizes a trained model without retraining — cheap, fast, and usually sufficient for INT8. **Quantization-aware training (QAT)** simulates quantization during training so the model adapts — necessary for aggressive INT4 or when PTQ regresses quality. **Weight-only quantization** compresses weights while keeping activations in higher precision — the common pattern for LLM inference. **GPTQ** (one-shot, group-wise weight quantization guided by approximate second-order information) and **AWQ** (Activation-aware Weight Quantization) are the dominant LLM weight-only techniques; **SmoothQuant** migrates activation outliers into the weights to make full INT8 (W8A8) quantization viable. **bitsandbytes** ships fast INT8/INT4 inference kernels for PyTorch. **GGUF** (the llama.cpp format) dominates CPU/Apple-silicon LLM inference. **FP4** on Blackwell GPUs is the 2025+ frontier.
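GPTQ- and AWQ-style methods share one structural idea: quantize weights in small groups (commonly 128) with an independent scale per group, so a single outlier only degrades its own group. A sketch of just the grouping, without GPTQ's error compensation or AWQ's activation-aware scale search:

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Weight-only symmetric INT4 with one scale per group of `group_size` weights.

    Sketch of the grouping used by GPTQ/AWQ-style methods; real implementations
    layer error compensation (GPTQ) or activation-aware scaling (AWQ) on top.
    """
    flat = w.reshape(-1, group_size)  # assumes size divisible by group_size
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # signed INT4: [-7, 7]
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_grouped(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(1).normal(size=(1024,)).astype(np.float32)
q, scales = quantize_int4_grouped(w)
err = np.abs(dequantize_grouped(q, scales) - w).max()
# Each group's error is bounded by half of its own quantization step.
assert err <= scales.max() / 2 + 1e-6
```

In a packed format the two INT4 values per byte plus one FP16 scale per 128 weights give roughly 4.1 bits per weight, which is where the "INT4 quarters FP16" arithmetic comes from.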
For APAC mid-market teams, the pragmatic rule is **quantize inference aggressively, keep training at higher precision**. For serving LLMs, INT8 is usually a free win (negligible quality loss, material speedup and memory savings), and INT4 is often acceptable after a brief evaluation pass on your actual workload. Training stays in BF16 or FP8, where quality matters most. For embedders and classifiers, INT8 PTQ works reliably; INT4 typically needs QAT or a dedicated quality-eval pass. Ship the strongest inference quantization that passes your eval thresholds — the cost savings at scale are substantial.
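The "ship the strongest quantization that passes your eval" rule can be mechanized: evaluate each candidate precision on your workload and pick the most aggressive one within a regression budget of the full-precision baseline. A hypothetical sketch — the precision names, scores, and 1-point budget are illustrative, not a recommendation:

```python
# Candidates ordered from most to least aggressive.
CANDIDATES = ["int4", "int8", "bf16"]

def pick_precision(eval_scores: dict, baseline: float,
                   max_regression: float = 0.01) -> str:
    """Most aggressive precision whose eval score is within
    `max_regression` of the full-precision baseline."""
    for precision in CANDIDATES:
        if eval_scores[precision] >= baseline - max_regression:
            return precision
    return CANDIDATES[-1]  # fall back to training precision

# Illustrative workload-eval scores, not real benchmark numbers.
scores = {"int4": 0.842, "int8": 0.861, "bf16": 0.865}
assert pick_precision(scores, baseline=scores["bf16"]) == "int8"
```

Here INT4 regresses 2.3 points and is rejected, while INT8's 0.4-point regression clears the budget, so INT8 ships.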
The non-obvious failure mode is **quality regression on the tails**. Quantization hurts the long tail of the input distribution more than the common middle: rare tokens, non-English content, long-context reasoning, and low-resource APAC languages (Burmese, Khmer, Lao, Tagalog variants) often regress materially while English-benchmark scores barely move. If your workload includes non-English or specialised content, evaluate quantized models on that distribution specifically, not on English MMLU or GSM8K. The generic-benchmark story that quantization is free hides real failures for the workloads APAC mid-market enterprises actually run.
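Aggregate accuracy can look flat while a tail slice collapses, so quantization evals should report per-slice metrics — by language, token rarity, or context length. A minimal sketch with hypothetical slice labels and results:

```python
from collections import defaultdict

def sliced_accuracy(records) -> dict:
    """Accuracy per slice; `records` are (slice_label, correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for label, correct in records:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical INT4 eval: the dominant English slice looks healthy,
# but the small Burmese slice has regressed badly — invisible in the
# aggregate because it contributes only ~9% of examples.
records = ([("en", True)] * 95 + [("en", False)] * 5
           + [("my", True)] * 6 + [("my", False)] * 4)
acc = sliced_accuracy(records)
assert acc["en"] == 0.95 and acc["my"] == 0.6
```

Gating each slice against its own full-precision baseline, rather than gating only the aggregate, is what catches this class of regression.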