APAC AI Infrastructure: Serverless Inference, eBPF Observability, and LLM Experiment Tracking
APAC AI teams running production systems face three infrastructure gaps that standard tooling does not address: deploying open-source LLMs as production APIs without managing GPU clusters, understanding microservice behavior without instrumenting every service, and systematically tracking LLM experiment quality across prompt versions and model changes. This guide covers three tools that close those gaps.
Lepton AI — serverless GPU platform for deploying open-source LLMs and custom APAC models as production API endpoints with sub-second cold starts.
Coroot — open-source eBPF-based observability that automatically discovers APAC service topology and detects performance anomalies without code instrumentation.
Braintrust — LLM experiment tracking and prompt management platform for APAC AI teams systematically improving LLM application quality.
APAC Tool Selection Framework
APAC Scenario → Tool → Why
Open-source LLM as production API (no infrastructure team) → Lepton AI → Serverless GPU; no K8s; pay-per-GPU-second
Sporadic APAC LLM inference jobs (not persistent API) → Modal Labs → Task-based; best for batch + scheduled runs
Always-on APAC LLM service (guaranteed latency SLA) → Anyscale → Ray Serve; persistent; autoscale + HA replicas
APAC K8s service mesh visibility (legacy + uninstrumented services) → Coroot → eBPF; no code changes; auto service map
Full APAC distributed tracing (span-level debugging) → Jaeger/OTel → Span-level; requires instrumentation
APAC LLM experiment comparison (prompt iteration workflow) → Braintrust → Prompt versioning + multi-scorer evaluation
APAC LLM production tracing (data sovereignty required) → Langfuse → Open-source; self-hosted; session + cost tracking
Lepton AI: APAC Serverless LLM Inference
Lepton AI APAC deployment
# APAC: Lepton AI — deploy open-source LLM as serverless API endpoint
import os

from leptonai.photon import Photon


# APAC: Define the inference endpoint as a Python class
class ApacQwenEndpoint(Photon):
    """APAC: Qwen 2.5 7B inference endpoint for CJK language tasks."""

    # APAC: Lepton downloads and caches the model on the first cold start
    requirement_dependency = ["vllm>=0.4.0", "transformers>=4.40.0"]

    def init(self):
        from vllm import LLM, SamplingParams

        # APAC: Qwen for Chinese/Japanese/Korean + English multilingual tasks
        self.llm = LLM(
            model="Qwen/Qwen2.5-7B-Instruct",
            dtype="float16",
            gpu_memory_utilization=0.90,
        )
        self.params = SamplingParams(temperature=0.7, max_tokens=512)

    @Photon.handler
    def apac_generate(self, prompt: str, system: str = "") -> str:
        """Generate an LLM response for the given prompt."""
        apac_messages = []
        if system:
            apac_messages.append({"role": "system", "content": system})
        apac_messages.append({"role": "user", "content": prompt})
        apac_formatted = self.llm.get_tokenizer().apply_chat_template(
            apac_messages, tokenize=False, add_generation_prompt=True
        )
        apac_outputs = self.llm.generate([apac_formatted], self.params)
        return apac_outputs[0].outputs[0].text


# APAC: Deploy to Lepton AI cloud (single command)
# lepton photon run -n apac-qwen-endpoint -e ApacQwenEndpoint --resource-shape gpu.a10

# APAC: Call the deployed endpoint
import requests

LEPTON_API_KEY = os.environ["LEPTON_API_KEY"]

apac_response = requests.post(
    "https://apac-qwen-endpoint.lepton.ai/apac_generate",
    headers={"Authorization": f"Bearer {LEPTON_API_KEY}"},
    json={
        "prompt": "新加坡MAS对AI治理有哪些要求?",  # Chinese: "What AI governance requirements does Singapore's MAS impose?"
        "system": "You are an APAC regulatory compliance expert.",
    },
)
print(apac_response.json())
# APAC: Response comes back in Chinese — Qwen handles CJK natively without translation
Lepton AI APAC cost comparison
APAC Inference Cost Comparison (Qwen 2.5 7B, ~1M tokens/day):
Provider        | GPU Type | Cost/Day | Cold Start      | Data Sovereignty
Lepton AI       | A10G     | ~$8-12   | <1 second       | APAC regions available
Modal Labs      | A10G     | ~$10-15  | 2-3 seconds     | US/EU only
Anyscale        | A10G     | ~$15-20  | On-demand       | APAC regions available
Self-hosted K8s | A10G     | ~$20-25  | N/A (always on) | Full control
OpenAI API      | N/A      | ~$5-8    | N/A             | US data centers
APAC Decision:
If APAC data sovereignty matters → Self-hosted K8s or Lepton (APAC region)
If cost sensitivity matters → OpenAI API or Lepton (serverless; pay only for busy GPU seconds)
If latency SLA matters → Self-hosted K8s (always-on, no cold start)
If ML experimentation pace matters → Lepton or Modal (fast deployment)
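The serverless-versus-always-on cost decision reduces to GPU utilization arithmetic. A minimal sketch, using hypothetical per-unit rates for illustration only (not vendor quotes):

```python
# Hypothetical rates for illustration — not vendor quotes.
SERVERLESS_USD_PER_GPU_SECOND = 0.0003  # pay-per-GPU-second (Lepton-style billing)
ALWAYS_ON_USD_PER_HOUR = 1.05           # reserved A10G node (self-hosted K8s)


def daily_cost_serverless(busy_gpu_seconds: float) -> float:
    """Serverless bills only the seconds a GPU spends serving requests."""
    return busy_gpu_seconds * SERVERLESS_USD_PER_GPU_SECOND


def daily_cost_always_on() -> float:
    """An always-on node bills all 24 hours regardless of traffic."""
    return ALWAYS_ON_USD_PER_HOUR * 24


def breakeven_busy_hours() -> float:
    """Daily busy hours above which always-on becomes cheaper than serverless."""
    return daily_cost_always_on() / (SERVERLESS_USD_PER_GPU_SECOND * 3600)


# Example: an endpoint busy 4 hours/day
print(round(daily_cost_serverless(4 * 3600), 2))  # 4.32
print(round(daily_cost_always_on(), 2))           # 25.2
print(round(breakeven_busy_hours(), 1))           # 23.3
```

With these assumed rates, an endpoint would need to be busy over 23 hours a day before a reserved node wins on cost — which is why sporadic APAC workloads favor serverless.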
Coroot: APAC eBPF Service Observability
Coroot APAC Kubernetes deployment
# APAC: Coroot — deploy on existing APAC Kubernetes cluster
# No code changes required — eBPF captures data at kernel level
# APAC: Install Coroot via Helm
helm repo add coroot https://coroot.github.io/helm-charts
helm repo update
helm install coroot coroot/coroot \
--namespace apac-monitoring --create-namespace \
--set corootNode.enabled=true \
--set prometheus.enabled=true # APAC: or use existing Prometheus
# APAC: Coroot agent (coroot-node-agent) runs as DaemonSet on every node
# Uses eBPF to capture:
# - Network connections between APAC pods
# - HTTP/gRPC request latency and error rates at network layer
# - CPU/memory/disk/network resource utilization per APAC pod
# - Database query performance (MySQL, PostgreSQL, Redis) via eBPF
# APAC: No changes required to:
# - APAC application code (no OpenTelemetry SDK)
# - APAC Kubernetes manifests (no sidecar injection)
# - APAC service configuration (no tracing headers)
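Because Coroot stores what the agent captures in Prometheus, the same data is scriptable over the standard Prometheus HTTP API. A minimal sketch of building an instant-query URL; the in-cluster endpoint and the metric name are assumptions for illustration, not Coroot's documented names — check your node agent's /metrics endpoint for the real ones:

```python
from urllib.parse import urlencode

# Assumed in-cluster Prometheus from the Helm install above (hypothetical URL).
PROM_URL = "http://prometheus.apac-monitoring.svc:9090/api/v1/query"


def prom_query_url(promql: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{PROM_URL}?{urlencode({'query': promql})}"


# Placeholder metric name — verify against what the agent actually exports.
apac_latency_q = 'rate(container_http_requests_total{namespace="apac-prod"}[5m])'
print(prom_query_url(apac_latency_q))
```

Fetching that URL (e.g. with requests) returns JSON with a `data.result` vector, one entry per matching APAC pod.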
Coroot APAC anomaly detection
APAC: Coroot automatic anomaly detection example
Service: apac-payment-service
Anomaly detected: 2026-05-17 14:32 SGT
Root cause surfaced by Coroot:
→ apac-payment-service → apac-database (PostgreSQL)
→ Query latency: normal 8ms → current 340ms (42x degradation)
→ CPU utilization on db-node-02: 95% (baseline: 30%)
Correlated signals:
→ apac-database disk IO wait: 87% (baseline: 5%)
→ apac-order-service error rate: 12% (baseline: 0.1%)
→ apac-payment-service p99 latency: 4200ms (baseline: 180ms)
Probable cause: APAC database disk IO saturation
→ Coroot shows this is a new pattern starting 14:28 SGT
→ Correlates with apac-analytics-service batch job started 14:25 SGT
APAC team action: throttle apac-analytics batch job or move to off-peak
→ Identified in 8 minutes vs typical 45-minute APAC incident investigation
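The detection above boils down to comparing live metrics against learned baselines and surfacing signals that deviate together. A deliberately simplified sketch of that idea (Coroot's actual statistical models are more sophisticated):

```python
def deviates(current: float, baseline: float, factor: float = 3.0) -> bool:
    """Flag a metric whose current value exceeds its baseline by `factor`x."""
    return baseline > 0 and current / baseline >= factor


# (current, baseline) pairs from the incident above
apac_signals = {
    "db_query_latency_ms": (340, 8),
    "db_node_cpu_pct": (95, 30),
    "order_error_rate_pct": (12, 0.1),
    "payment_p99_latency_ms": (4200, 180),
}

apac_anomalies = sorted(
    name for name, (cur, base) in apac_signals.items() if deviates(cur, base)
)
print(apac_anomalies)  # all four signals fire, pointing at one correlated incident
```

When several signals cross their baselines in the same window, the tool correlates them into a single incident instead of paging on each metric separately.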
Braintrust: APAC LLM Experiment Tracking
Braintrust APAC SDK instrumentation
# APAC: Braintrust — LLM experiment logging with a lightweight SDK
import os

import braintrust
from braintrust import current_span, traced

# APAC: Authenticate the SDK once per process
braintrust.login(api_key=os.environ["BRAINTRUST_API_KEY"])


# APAC: Log LLM experiments — wraps any LLM call
@traced
def apac_rag_answer(query: str) -> str:
    """APAC RAG pipeline with automatic Braintrust span logging."""
    # APAC: Trace the retrieval step
    with current_span().start_span("apac_retrieve") as span:
        apac_docs = vector_search(query, top_k=5)
        span.log(input=query, output=[d["text"][:100] for d in apac_docs])
    # APAC: Trace the generation step
    with current_span().start_span("apac_generate") as span:
        apac_context = "\n".join(d["text"] for d in apac_docs)
        apac_response = call_llm(
            system=f"Answer based on context: {apac_context}",
            user=query,
        )
        span.log(input={"query": query, "context": apac_context}, output=apac_response)
    return apac_response


# APAC: Run an experiment — all traced calls are logged to Braintrust
apac_experiment = braintrust.init(
    project="APAC Enterprise Chatbot", experiment="APAC RAG v2.3 — Qwen context"
)
for apac_query in apac_test_queries:
    apac_answer = apac_rag_answer(apac_query)
print(apac_experiment.summarize())
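Logged experiments are usually compared with scorers, and the table above mentions multi-scorer evaluation. A self-contained sketch of that idea; the harness and scorer names below are illustrative, not Braintrust's Eval API, and the model is stubbed:

```python
def exact_match(output: str, expected: str) -> float:
    """1.0 only when the answer matches the reference exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def mentions_regulation(output: str, expected: str) -> float:
    """Heuristic scorer: reward answers that name the expected regulator."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def score_run(cases, answer_fn, scorers):
    """Average each scorer over (query, expected) cases — one row per prompt version."""
    totals = {s.__name__: 0.0 for s in scorers}
    for query, expected in cases:
        output = answer_fn(query)
        for s in scorers:
            totals[s.__name__] += s(output, expected)
    return {name: round(total / len(cases), 2) for name, total in totals.items()}


# Stubbed model and tiny golden set, for illustration only
apac_cases = [("Which regulator issued FEAT?", "MAS"), ("Name the FEAT regulator.", "MAS")]
apac_stub = lambda q: "MAS issued the FEAT principles."
print(score_run(apac_cases, apac_stub, [exact_match, mentions_regulation]))
# {'exact_match': 0.0, 'mentions_regulation': 1.0}
```

Running the same golden set against two prompt versions yields two such score rows, which is exactly the comparison an experiment-tracking UI renders side by side.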
Braintrust APAC prompt version management
# APAC: Braintrust — manage APAC system prompts as versioned artifacts

# APAC: Fetch the live prompt from Braintrust (no code deployment needed to update it)
apac_prompt = braintrust.load_prompt(
    project="APAC Enterprise Chatbot",
    slug="apac-system-prompt",
)

# APAC: Use the prompt in an LLM call — the version is tracked automatically
apac_response = call_llm(
    system=apac_prompt.build(
        # APAC: Template variables filled at runtime
        market="Singapore",
        regulation="MAS FEAT",
        language="English",
    )["system"],
    user=user_query,
)
# APAC: Non-engineer APAC stakeholders iterate prompt in Braintrust UI:
# 1. Edit system prompt in Braintrust playground
# 2. Test against APAC golden test case dataset
# 3. Compare output scores against current production version
# 4. Promote to production — zero code deployment required
# 5. APAC engineering team sees version change in experiment history
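The promote/rollback workflow above depends on prompts being versioned, templated artifacts with a live pointer. A minimal local stand-in for that pattern; the in-memory registry, slug, and version labels are illustrative, not Braintrust's storage:

```python
# Illustrative in-memory registry: (slug, version) -> template
APAC_PROMPTS = {
    ("apac-system-prompt", "v1"): "You advise {market} enterprises. Reply in {language}.",
    ("apac-system-prompt", "v2"): (
        "You advise enterprises in {market}. Apply {regulation}. Reply in {language}."
    ),
}
APAC_LIVE = {"apac-system-prompt": "v2"}  # promotion = flipping this pointer


def build_prompt(slug: str, **variables) -> str:
    """Resolve the promoted version, then fill template variables at runtime."""
    version = APAC_LIVE[slug]
    return APAC_PROMPTS[(slug, version)].format(**variables)


print(build_prompt("apac-system-prompt",
                   market="Singapore", regulation="MAS FEAT", language="English"))
```

Promotion changes only the live pointer, so rollback is instant and every historical version stays addressable in experiment history.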
Related APAC AI Infrastructure Resources
For the LLM observability platforms (Arize Phoenix, AgentOps) that complement Braintrust's experiment tracking with production trace visualization, APAC session replay, and real-time cost attribution per LLM call in production traffic, see the APAC LLM observability guide.
For the serverless AI compute platforms (Modal Labs, E2B) that complement Lepton AI for APAC workloads beyond persistent LLM endpoints — batch inference jobs, AI agent code execution sandboxes, and scheduled ML pipeline tasks — see the APAC serverless AI compute guide.
For the Kubernetes observability tools (eBPF-based Hubble, Pixie) that operate at the same kernel level as Coroot but focus specifically on APAC network security policy enforcement and pod-to-pod traffic visualization in Cilium-managed clusters, see the APAC eBPF observability guide.