
llm-compressor

by Neural Magic

Neural Magic's open-source LLM compression library, integrated with vLLM — giving APAC ML engineering teams pruning, quantization (GPTQ, W8A8, W4A16), and knowledge distillation in a single framework for compressing large language models to meet regional deployment latency and cost requirements.

AIMenta verdict
Decent fit
4/5

"Neural Magic LLM compression for APAC deployment — llm-compressor applies pruning, quantization, and knowledge distillation to reduce LLM size and inference cost, enabling APAC teams to compress large models for resource-constrained APAC cloud and edge infrastructure."

What it does

Key features

  • W8A8 quantization: 8-bit weights + activations for a 2× memory reduction
  • Pruning: 50% structured sparsity for CPU-deployable compressed models
  • GPTQ support: standard 4-bit post-training quantization pipeline
  • vLLM integration: compressed models load directly into the vLLM serving engine
  • Knowledge distillation: teacher-guided accuracy recovery after pruning
  • Calibration data: APAC-language calibration sets for domain-appropriate compression
When to reach for it

Best for

  • APAC deployment teams that serve with vLLM in production and need compression beyond AutoGPTQ/AutoAWQ — in particular, organizations that need structured pruning for CPU deployment, W8A8 activation quantization for the best inference-throughput trade-off with vLLM, or knowledge distillation to recover accuracy after aggressive compression of APAC-language fine-tuned models.
Don't get burned

Limitations to know

  • ! More complex than AutoGPTQ or AutoAWQ for simple weight-only quantization use cases
  • ! Pruning + distillation requires teacher-model inference during training — GPU compute intensive
  • ! Sparse acceleration only pays off on CPU hardware with sparse tensor support
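The teacher-inference cost in the second watch-out comes from the distillation loss itself. A minimal sketch using KL divergence on temperature-softened logits (a standard knowledge-distillation formulation for illustration; llm-compressor's exact loss may differ):

```python
import math

def softmax(logits, temperature=1.0):
    # temperature-softened softmax over a list of logits
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student): needs the teacher's logits, which is why
    # every training batch also requires a full teacher forward pass
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when student and teacher distributions match and grows as they diverge — but computing it means running the (large, uncompressed) teacher on every batch, which dominates GPU cost during recovery training.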
Context

About llm-compressor

llm-compressor is an open-source model compression library from Neural Magic (now part of the vLLM project) that gives APAC ML engineering teams a unified framework for applying multiple compression techniques to large language models. It supports structured and unstructured pruning, weight quantization (GPTQ, W8A8, W4A16), activation quantization, and knowledge distillation, producing compressed models optimized for inference with vLLM — letting deployment teams keep compute cost from scaling linearly with model size.
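The memory impact of those quantization schemes follows from back-of-envelope arithmetic — weight storage scales with bits per parameter (the 7B parameter count below is an illustrative round number, not a benchmark):

```python
def model_weight_gb(num_params: int, bits_per_weight: int) -> float:
    # weight storage in GB: params * bits / (8 bits per byte) / 1e9 bytes per GB
    return num_params * bits_per_weight / 8 / 1e9

# a 7B-parameter model under the schemes llm-compressor supports
fp16  = model_weight_gb(7_000_000_000, 16)  # 14.0 GB baseline
w8a8  = model_weight_gb(7_000_000_000, 8)   # 7.0 GB — the 2x reduction
w4a16 = model_weight_gb(7_000_000_000, 4)   # 3.5 GB
```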

llm-compressor's W8A8 quantization (8-bit weights + 8-bit activations) achieves a 2× memory reduction with near-zero accuracy loss on most NLP tasks. It differs from weight-only quantization (GPTQ/AWQ) by also quantizing the activations in matrix multiplications, enabling more aggressive compression via Neural Magic's SparseGPT-derived calibration. APAC serving teams deploying compressed models with vLLM typically measure 1.5–2× throughput improvement over uncompressed inference at equivalent quality on most text generation tasks.
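A minimal sketch of the symmetric int8 quantization underlying W8A8 — scale and rounding are chosen for illustration; llm-compressor's actual calibration is considerably more sophisticated:

```python
def quantize_int8(values):
    # symmetric quantization: map the max |x| onto the int8 limit 127,
    # then round each value to its nearest 1-byte code
    scale = max(abs(v) for v in values) / 127.0
    quants = [max(-128, min(127, round(v / scale))) for v in values]
    return quants, scale

def dequantize(quants, scale):
    # recover approximate float values from the int8 codes
    return [q * scale for q in quants]

acts = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(acts)   # ints in [-128, 127], 1 byte each
approx = dequantize(codes, scale)    # close to the originals, small rounding error
```

Each float16 value (2 bytes) becomes a 1-byte code plus a shared scale, which is where the 2× memory reduction comes from; applying the same idea to activations is what separates W8A8 from weight-only schemes.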

llm-compressor's pruning removes less important weights to create sparse models: 50% structured sparsity (removing half of the model's parameters), followed by knowledge distillation to recover accuracy, yields roughly 2× inference speed on CPUs that support sparse tensor operations. APAC organizations deploying LLMs on CPU-based infrastructure — enterprises without GPU servers, or edge deployments without GPU hardware — use pruned + quantized models to reach acceptable inference throughput without GPU compute.
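The pruning step can be illustrated with the simplest criterion, magnitude pruning — a sketch only; llm-compressor's structured-sparsity and SparseGPT-style methods are more involved:

```python
def magnitude_prune(weights, sparsity=0.5):
    # zero out the fraction of weights with the smallest magnitude;
    # the resulting zeros are what sparse CPU kernels skip at inference time
    k = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in by_magnitude[:k]:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.1, 0.4, -0.05]
print(magnitude_prune(weights))   # [0.9, 0.0, 0.4, 0.0]
```

At 50% sparsity, half the multiply-accumulates can be skipped — but only kernels aware of the sparsity pattern actually skip them, which is why the speedup materializes on sparse-capable CPUs rather than on ordinary dense hardware.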

llm-compressor integrates directly with vLLM: compressed models load into vLLM's inference engine with full continuous-batching and PagedAttention support, combining compression-based memory savings with vLLM's throughput optimizations. APAC teams already serving with vLLM compress their fine-tuned APAC-language models with llm-compressor before deployment, reducing GPU memory consumption and raising batch-size capacity for higher throughput.
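The batch-capacity effect follows from simple memory accounting — GPU size, weight footprints, and per-sequence KV-cache cost below are illustrative round numbers, not measured figures:

```python
def max_batch_size(gpu_gb: float, weights_gb: float, kv_cache_gb_per_seq: float) -> int:
    # memory left after loading weights, divided by per-sequence KV-cache cost
    return int((gpu_gb - weights_gb) // kv_cache_gb_per_seq)

# 80 GB GPU, 7B model, ~2 GB of KV cache per concurrent sequence
fp16_batch = max_batch_size(80, 14, 2)   # 33 concurrent sequences
int8_batch = max_batch_size(80, 7, 2)    # 36 concurrent sequences
```

Every GB freed by compression goes straight into KV-cache headroom, which vLLM's continuous batching converts into additional concurrent sequences and hence throughput.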

Beyond this tool

Where this tool category meets hands-on practice.

A tool only matters in context. Browse the service pillars that operationalise it, the industries where it ships, and the Asian markets where AIMenta runs adoption programs.