Why APAC Enterprises Are Self-Hosting LLMs in 2026
The assumption that enterprise AI runs exclusively through managed API endpoints (OpenAI, Anthropic, Google) is breaking down across APAC. Three converging pressures have made self-hosted LLM infrastructure a serious engineering priority for APAC mid-market and enterprise organizations:
Data sovereignty requirements. Singapore's MAS, Japan's FSA, South Korea's KISA, and Indonesia's Kominfo all impose data residency obligations on financial services data. An APAC bank using OpenAI's API sends customer query text to US infrastructure, a compliance risk that, in 2026, is no longer acceptable to APAC regulators or APAC enterprise legal teams.
Cost at inference scale. An APAC e-commerce platform generating 10 million AI-assisted product descriptions per month (roughly 1,000 output tokens each) at $0.002 per thousand output tokens pays $20,000 per month in inference costs, with no GPU ownership to show for it. Self-hosted inference on 4× A100 80GB nodes in Singapore colocation can serve the same workload for $8,000 per month in GPU rental, with marginal cost declining as utilization increases.
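A quick sanity check on those figures, assuming roughly 1,000 output tokens per description:

```python
# Back-of-envelope check of the API vs self-hosted numbers above.
# tokens_per_description is an assumed average, not a measured figure.
descriptions_per_month = 10_000_000
tokens_per_description = 1_000          # assumed
price_per_1k_output_tokens = 0.002      # USD, from the text

api_cost = descriptions_per_month * tokens_per_description / 1_000 * price_per_1k_output_tokens
self_hosted_cost = 8_000                # 4x A100 80GB rental, from the text

print(f"API inference:        ${api_cost:,.0f}/month")        # $20,000/month
print(f"Self-hosted (rental): ${self_hosted_cost:,}/month")
print(f"Monthly difference:   ${api_cost - self_hosted_cost:,.0f}")
```

The gap widens further as utilization rises, since the rental cost is fixed while API cost scales linearly with tokens.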
Model customization requirements. APAC enterprises investing in fine-tuned models for Japanese customer service, Bahasa Indonesia document processing, or Traditional Chinese financial text cannot serve those models through external API providers. Self-hosted inference is the only path to deploying proprietary fine-tunes at scale.
vLLM, Ollama, and LiteLLM form a coherent APAC LLM infrastructure stack — production GPU serving, developer workstation access, and unified routing proxy — covering the full lifecycle from local APAC developer experimentation to production APAC inference deployment.
vLLM: Production LLM Serving for APAC GPU Infrastructure
PagedAttention and APAC throughput economics
vLLM is the dominant open-source LLM inference server for GPU clusters, developed at UC Berkeley and now maintained by the vLLM community with contributions from NVIDIA, AMD, and major APAC hyperscalers. Its defining technical contribution is PagedAttention — a KV cache management algorithm that borrows from OS virtual memory paging to eliminate the GPU memory fragmentation that makes naive LLM inference servers inefficient.
The practical implication for APAC GPU infrastructure: vLLM can serve 2-4× more concurrent users per GPU than naive implementations, because it eliminates the KV cache reservation overhead that wastes 60-80% of allocated GPU memory in standard serving approaches.
For APAC enterprises paying $2-4 per GPU-hour in Singapore, Tokyo, or Sydney colocation, the 2-4× throughput improvement translates directly to infrastructure cost reduction.
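A rough sketch of where that wasted memory comes from: a naive server reserves KV cache for the maximum sequence length of every request, while PagedAttention allocates fixed-size cache blocks on demand. The model dimensions and request lengths below are illustrative assumptions for a Llama-3.1-8B-class model, not vLLM internals:

```python
# Illustration of the KV cache reservation waste PagedAttention removes.
# Assumed config: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes/element).
layers, kv_heads, head_dim, bytes_per_el = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # K and V

max_seq_len = 8192       # a naive server reserves this much per request
actual_avg_len = 2048    # assumed typical request length

reserved = max_seq_len * kv_bytes_per_token
used = actual_avg_len * kv_bytes_per_token
waste = 1 - used / reserved

print(f"KV cache per token:   {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Reserved per request: {reserved / 2**20:.0f} MiB, used: {used / 2**20:.0f} MiB")
print(f"Wasted reservation:   {waste:.0%}")  # 75% under these assumptions
```

With these (assumed) numbers, three quarters of each request's reservation sits idle, which is squarely inside the 60-80% waste range cited above.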
vLLM deployment for APAC production workloads
# Install vLLM (requires CUDA 11.8+ and Python 3.9+)
pip install vllm
# Serve Llama 3.1 8B on single GPU (APAC dev environment)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192
# Serve Qwen2.5 72B on multi-GPU (APAC production: 4× A100)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
vLLM exposes an OpenAI-compatible API — the same endpoint format as api.openai.com/v1/chat/completions — so APAC applications built against the OpenAI SDK can switch to self-hosted vLLM by changing the base_url parameter, with no application code changes.
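A minimal sketch of that switch, shown with only the Python standard library so the wire format is visible (the hostname is illustrative; in practice you would simply pass base_url to the OpenAI SDK client and change nothing else):

```python
# The only client-side change when moving from api.openai.com to
# self-hosted vLLM is the base URL; the request body and the
# /v1/chat/completions path are identical.
import json
import urllib.request

VLLM_BASE_URL = "http://vllm.apac-inference.internal:8000/v1"  # was https://api.openai.com/v1

payload = {
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [{"role": "user", "content": "Summarise MAS Notice 655 in one paragraph."}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    f"{VLLM_BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; construction alone shows the shape.
print(req.full_url)
```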
APAC model selection for vLLM
For APAC enterprises, model selection is shaped by language requirements that US-centric benchmarks underweight:
| APAC Use Case | Recommended Model | Rationale |
|---|---|---|
| Japanese customer service | Qwen2.5-72B or Llama-3-ELYZA-JP-8B | Strong JA/ZH multilingual; ELYZA fine-tuned on JP |
| Bahasa Indonesia/Malaysia | Llama 3.1 70B or SEA-LION | Strong EN+BM/ID performance; SEA-LION APAC-tuned |
| Traditional Chinese (HK/TW) | Qwen2.5-32B | Best ZH-HK/ZH-TW coverage in open-weight class |
| Korean enterprise | EXAONE 3.5 or Llama-3-KoELECTRA | LG AI Research EXAONE trained on Korean corpora |
| Multilingual APAC general | Qwen2.5-72B | Best open-weight multilingual across 9 APAC languages |
vLLM Kubernetes deployment for APAC platform teams
For APAC platform teams running Kubernetes, vLLM deploys as a standard container workload with GPU node selection:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen25-72b
  namespace: apac-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen25-72b
  template:
    metadata:
      labels:
        app: vllm-qwen25-72b
    spec:
      nodeSelector:
        accelerator: nvidia-a100-80gb  # APAC GPU node pool
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-72B-Instruct"
            - "--tensor-parallel-size"
            - "4"
            - "--max-model-len"
            - "32768"
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: apac-model-cache-pvc  # shared model storage
APAC platform teams typically share a model cache PVC across inference pods, downloading a 72B model once (roughly 140GB) rather than once per pod, which at typical APAC cloud data-transfer rates ($0.08-0.12/GB) represents significant cost savings.
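Back-of-envelope savings from the shared cache, under assumed pod-churn numbers (the per-GB rate is the midpoint of the range quoted above; the pod count is invented for illustration):

```python
# Rough monthly savings from sharing one model-cache PVC instead of
# re-downloading weights per pod.
model_size_gb = 140        # 72B model, from the text
transfer_per_gb = 0.10     # midpoint of the quoted $0.08-0.12/GB
pod_downloads = 8          # assumed: 2 replicas x 4 restarts/rollouts per month

per_pod_cost = pod_downloads * model_size_gb * transfer_per_gb
shared_pvc_cost = 1 * model_size_gb * transfer_per_gb  # downloaded once

print(f"Per-pod downloads: ${per_pod_cost:,.2f}/month")
print(f"Shared PVC (once): ${shared_pvc_cost:,.2f}")
```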
Ollama: APAC Developer Workstation LLM Access
The local LLM developer experience problem
Before Ollama, running an LLM locally on an APAC developer's MacBook or Windows workstation required: downloading model weights in raw safetensors format, installing llama.cpp with Metal/CUDA compilation, writing inference scripts, managing model quantization manually. The setup friction meant most APAC developers used ChatGPT or Claude.ai for experimentation rather than models they'd actually deploy in production.
Ollama collapsed this workflow into a single binary with a Docker-like UX: ollama pull llama3.1:8b downloads the model, and ollama run qwen2.5:7b starts serving it in an interactive chat. This matters for APAC AI engineering teams because when developers experiment locally with the same model family that runs in production, prompt iteration carries over to production deployment with much higher fidelity.
Ollama for APAC developer environments
# macOS/Linux install
curl -fsSL https://ollama.com/install.sh | sh
# Pull APAC-relevant models
ollama pull qwen2.5:7b # Alibaba Qwen2.5 7B — strong multilingual APAC
ollama pull llama3.1:8b # Meta Llama 3.1 8B — good baseline
ollama pull gemma2:9b # Google Gemma 2 — efficient for Apple Silicon
# Run interactive chat (Japanese system prompt example)
ollama run qwen2.5:7b
# Serve OpenAI-compatible API on localhost (for APAC dev integration testing)
# Ollama auto-starts REST server at localhost:11434
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "こんにちは、今日の天気は?"}]
}'
Ollama in APAC CI/CD pipelines
APAC platform teams also use Ollama in CI/CD pipelines for LLM-assisted test generation and code review without hitting external APIs — important for APAC financial services teams with strict data egress controls:
# Buildkite pipeline step: LLM-assisted code review (no external API calls)
- label: "AI Code Review"
  commands:
    - docker run -d --name ollama-ci ollama/ollama
    - sleep 5  # give the Ollama server a moment to start
    - docker exec ollama-ci ollama pull qwen2.5:7b
    - |
      git diff HEAD~1 | docker exec -i ollama-ci ollama run qwen2.5:7b \
        "Review this APAC financial services code change for security issues and compliance gaps. Respond in English." \
        > ai-review-output.txt
  artifact_paths: "ai-review-output.txt"
Ollama for APAC model development iteration
APAC data science teams fine-tuning models for language-specific tasks use Ollama to load their fine-tuned GGUF exports for rapid evaluation before pushing to vLLM production:
# Load custom APAC fine-tuned model (GGUF format from training output)
# The Modelfile's FROM line points at the GGUF file
ollama create apac-finserv-ja -f ./Modelfile
# Modelfile:
# FROM ./apac-finserv-qwen2.5-finetuned-Q4_K_M.gguf
# SYSTEM "You are an APAC financial services assistant. Answer in Japanese."
# Test the custom model
ollama run apac-finserv-ja "MASの規制要件について説明してください"
LiteLLM: Unified APAC LLM Proxy and Routing Layer
The APAC multi-provider routing problem
APAC enterprises in 2026 rarely use a single LLM provider. A common APAC AI infrastructure pattern:
- Primary provider: Self-hosted vLLM (Qwen2.5-72B) for sensitive APAC workloads (data residency)
- Fallback: Anthropic Claude Sonnet for complex reasoning where the open model underperforms
- Batch processing: AWS Bedrock with Llama 3.1 70B for cost-optimized bulk APAC document processing
- Developer testing: Local Ollama for iteration without API costs
Four providers, four SDKs, four API formats, four billing dashboards. Coordinating retry logic, fallback chains, cost tracking, and rate limit handling across this stack requires either a bespoke routing layer or a purpose-built proxy.
LiteLLM is that purpose-built proxy: a Python library and Docker-deployable server that exposes a single OpenAI-compatible endpoint, routing requests to 100+ LLM providers (vLLM, Ollama, Anthropic, OpenAI, AWS Bedrock, Azure OpenAI, Google Gemini, Cohere, and APAC providers such as Alibaba Tongyi, Baidu ERNIE, and MiniMax) behind a unified API.
LiteLLM deployment for APAC routing
# litellm-config.yaml: APAC multi-provider routing configuration
model_list:
# Primary: self-hosted vLLM (APAC Singapore cluster)
- model_name: apac-primary
litellm_params:
model: openai/qwen2.5-72b # vLLM uses 'openai/' prefix for custom models
api_base: http://vllm-service.apac-inference.svc.cluster.local:8000/v1
api_key: "none" # vLLM without auth
# Fallback: Anthropic Claude (non-sensitive APAC workloads)
- model_name: apac-fallback
litellm_params:
model: claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
# Batch: AWS Bedrock Singapore (ap-southeast-1)
- model_name: apac-batch
litellm_params:
model: bedrock/meta.llama3-1-70b-instruct-v1:0
aws_region_name: ap-southeast-1
# Dev: local Ollama (developer environments)
- model_name: apac-dev
litellm_params:
model: ollama/qwen2.5:7b
api_base: http://localhost:11434
router_settings:
routing_strategy: least-busy # route to least-loaded APAC endpoint
fallbacks:
- apac-primary: ["apac-fallback"] # auto-fallback if vLLM unavailable
num_retries: 3
retry_after: 30
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
# Deploy LiteLLM proxy (APAC Kubernetes)
docker run -d \
-p 4000:4000 \
-v $(pwd)/litellm-config.yaml:/app/config.yaml \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-e LITELLM_MASTER_KEY=$LITELLM_MASTER_KEY \
ghcr.io/berriai/litellm:main-latest \
--config /app/config.yaml \
--port 4000
All APAC application teams point to http://litellm-proxy:4000/v1 — the same OpenAI SDK format — and specify model names from the config. When the Singapore vLLM cluster is down, LiteLLM automatically routes to Anthropic without application code changes.
APAC cost tracking with LiteLLM
LiteLLM's spend tracking is particularly useful for APAC enterprises with multiple business units sharing LLM infrastructure:
# APAC application: tag requests for cost attribution
import openai
client = openai.OpenAI(
base_url="http://litellm-proxy:4000/v1",
api_key="your-litellm-master-key"
)
response = client.chat.completions.create(
model="apac-primary",
messages=[{"role": "user", "content": "Summarize this APAC report..."}],
extra_body={
"metadata": {
"user_id": "apac-finserv-team",
"project": "apac-document-processing",
"spend_category": "internal" # APAC cost center attribution
}
}
)
LiteLLM's /spend/logs endpoint aggregates costs by user_id, project, or spend_category — enabling APAC finance teams to chargeback AI infrastructure costs to the consuming business unit at the token level, regardless of which underlying provider served the request.
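A sketch of that chargeback aggregation, run over illustrative records shaped like /spend/logs output (the team names and spend values are invented; in production you would fetch the logs from http://litellm-proxy:4000/spend/logs with the master key):

```python
# Aggregate per-team spend from LiteLLM spend-log records, keyed on the
# metadata tags attached in the request example above.
from collections import defaultdict

spend_logs = [  # illustrative records, not real API output
    {"metadata": {"user_id": "apac-finserv-team"}, "spend": 1.84},
    {"metadata": {"user_id": "apac-finserv-team"}, "spend": 0.66},
    {"metadata": {"user_id": "apac-ecommerce-team"}, "spend": 3.10},
]

by_team = defaultdict(float)
for entry in spend_logs:
    by_team[entry["metadata"]["user_id"]] += entry["spend"]

for team, usd in sorted(by_team.items()):
    print(f"{team}: ${usd:.2f}")
```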
The Complete APAC LLM Inference Stack
vLLM, Ollama, and LiteLLM address different stages of the APAC LLM lifecycle:
APAC LLM Infrastructure Stack:
DEVELOPER LAYER (local iteration, no API costs)
└── Ollama: developer workstation LLM serving
- ollama pull qwen2.5:7b
- localhost:11434 OpenAI-compatible API
- APAC developer prompt iteration before production
PRODUCTION LAYER (GPU cluster, data residency)
└── vLLM: GPU inference server with PagedAttention
- 2-4× throughput vs naive serving
- Kubernetes deployment with GPU node selection
- Tensor parallelism for large APAC models (72B+)
- Model cache shared across APAC inference pods
ROUTING LAYER (unified API, multi-provider)
└── LiteLLM: OpenAI-compatible proxy
- Single endpoint for all APAC providers
- Automatic fallback: vLLM → Anthropic → Bedrock
- Per-request APAC cost attribution and spend limits
- Rate limit handling across provider quotas
APAC data sovereignty routing pattern
The most common production pattern for APAC regulated industries:
# Application code routes by data classification
def get_model_for_request(data_classification: str) -> str:
routing = {
"sensitive": "apac-primary", # vLLM Singapore (data never leaves APAC)
"internal": "apac-batch", # Bedrock ap-southeast-1 (APAC region)
"non-sensitive": "apac-fallback", # Anthropic (best quality, non-regulated)
"development": "apac-dev", # Ollama (no cost, local iteration)
}
return routing.get(data_classification, "apac-fallback")
APAC compliance teams can audit the routing table to verify that data classified as "sensitive" under MAS Notice 655, PDPA (Singapore), or APPI (Japan) never leaves the APAC-hosted vLLM cluster — a concrete, auditable compliance control that external API-only architectures cannot provide.
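That audit can itself be automated. A minimal sketch, with the routing table copied from the function above and the APAC-resident set derived from the LiteLLM config (Anthropic, as the only non-APAC-hosted target, is deliberately excluded):

```python
# Auditable control: data classified "sensitive" may only ever resolve
# to the self-hosted vLLM endpoint; nothing sensitive routes off-APAC.
ROUTING = {
    "sensitive": "apac-primary",       # vLLM Singapore
    "internal": "apac-batch",          # Bedrock ap-southeast-1
    "non-sensitive": "apac-fallback",  # Anthropic
    "development": "apac-dev",         # local Ollama
}
APAC_RESIDENT_MODELS = {"apac-primary", "apac-batch", "apac-dev"}

assert ROUTING["sensitive"] == "apac-primary"
assert ROUTING["sensitive"] in APAC_RESIDENT_MODELS
print("sensitive-data routing audit passed")
```

Run as a CI check, this turns the compliance claim into a failing build the moment someone edits the routing table.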
APAC Deployment Considerations
GPU procurement for APAC vLLM clusters: APAC GPU availability varies significantly by market. Singapore has the best NVIDIA H100/A100 availability (AWS ap-southeast-1, GCP asia-southeast1, Azure Southeast Asia). Japan (Tokyo) is second. Seoul and Sydney have limited spot availability. APAC platform teams targeting <$8/GPU-hour in 2026 typically use reserved GPU instances in Singapore colocation or AWS Reserved Instances with 1-year commitment.
Model weight storage for APAC compliance: HuggingFace model weights are cached on first pull from the HuggingFace CDN (US-hosted). APAC enterprises with strict data residency requirements should pre-mirror approved model weights to a Singapore-region S3 bucket or GCS bucket and point the Hugging Face cache (the HF_HOME environment variable; the older TRANSFORMERS_CACHE is deprecated) at the APAC mirror, ensuring model weights never traverse from US storage to APAC GPU nodes during production inference.
Network latency for LiteLLM routing: Place the LiteLLM proxy in the same APAC region as the primary vLLM cluster (Singapore for SEA workloads, Tokyo for Japan workloads). Cross-region routing adds 80-200ms latency per request — acceptable for async batch but problematic for APAC real-time customer-facing applications with <500ms SLA requirements.
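The latency budget behind that recommendation, using the midpoint of the quoted cross-region range and an assumed inference time:

```python
# Why cross-region LiteLLM routing breaks a tight real-time SLA.
sla_ms = 500
inference_ms = 400          # assumed p95 model generation time
same_region_hop_ms = 2      # assumed intra-region proxy hop
cross_region_hop_ms = 140   # midpoint of the quoted 80-200 ms

same_region_total = inference_ms + same_region_hop_ms
cross_region_total = inference_ms + cross_region_hop_ms

print(f"Same region:  {same_region_total} ms (meets {sla_ms} ms SLA: {same_region_total <= sla_ms})")
print(f"Cross region: {cross_region_total} ms (meets {sla_ms} ms SLA: {cross_region_total <= sla_ms})")
```

With most of the budget consumed by generation itself, even the low end of the cross-region range leaves little headroom, and the midpoint blows the SLA outright.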
Related APAC AI Infrastructure Resources
For the Kubernetes platform that hosts APAC vLLM clusters, see the APAC Kubernetes platform engineering guide covering vCluster, External Secrets, and ExternalDNS.
For the DevSecOps controls governing what images run in APAC GPU clusters, see the APAC Kubernetes DevSecOps guide covering Kyverno, Cosign, and Kubescape.
For the Infrastructure as Code layer provisioning APAC GPU node pools, see the APAC Infrastructure as Code guide covering OpenTofu, Ansible, and AWS CDK.