
APAC LLM Observability Guide 2026: Arize Phoenix, AgentOps, and Lunary

A practitioner guide for APAC AI engineering teams implementing LLM observability in 2026, covering three tools. Arize Phoenix is an open-source, OTel-based LLM tracing platform that gives span-level visibility into RAG retrieval quality and agent workflows, with automated hallucination detection and relevance scoring for CI/CD quality gates. AgentOps is an agent-focused observability platform with one-line framework instrumentation for LangChain, AutoGen, and CrewAI, providing session-level replay, step-by-step agent action tracing, and real-time cost anomaly detection. Lunary is a lightweight open-source LLM logging platform that captures prompts, responses, and costs across production deployments, with user feedback collection for CSAT tracking and self-hosted PostgreSQL deployment for APAC data sovereignty requirements.

By AIMenta Editorial Team

Why APAC Teams Need LLM-Specific Observability

Standard application monitoring (Datadog, Prometheus) tracks request counts, latency, and error rates — metrics that tell APAC teams that something went wrong but not why an LLM response was incorrect, why a RAG pipeline retrieved irrelevant context, or which step in a 20-step agent workflow caused task failure. LLM-specific observability tools capture the semantic content of AI interactions — prompts, responses, retrieved chunks, tool calls, and quality scores — providing the diagnostic layer that standard APAC observability tools cannot.

Three tools cover different layers of APAC LLM observability:

Arize Phoenix — open-source LLM observability with span-level tracing and evaluation metrics for APAC RAG pipelines and agent workflows.

AgentOps — agent-focused session tracing with step-by-step replay and cost tracking for APAC multi-step agent debugging.

Lunary — lightweight open-source LLM logging with user feedback collection and cost analytics for APAC production AI applications.


APAC LLM Observability Layer Map

APAC AI Application Quality Loop:

Production Logging (what happened):
  Lunary → logs prompts/responses/costs in production
  → "Claude responded incorrectly to APAC customer query"

Agent Debugging (what the agent did):
  AgentOps → replays session step-by-step
  → "Agent called wrong tool at step 7 due to ambiguous APAC context"

Evaluation Pipeline (why output was wrong):
  Phoenix → runs automated quality metrics over traced spans
  → "RAG retrieval relevance score: 0.31 (APAC threshold: 0.70)"
  → "Root cause: APAC chunking strategy splits regulatory context across chunks"

Fix → Test → Redeploy → Monitor (Lunary) → Repeat

Arize Phoenix: APAC RAG and Agent Tracing

Phoenix APAC instrumentation for RAG pipelines

# APAC: Arize Phoenix — RAG pipeline tracing

import phoenix as px
from phoenix.otel import register
from opentelemetry import trace

# APAC: Start Phoenix server and register OTel tracer
px.launch_app()
register(project_name="apac-rag-pipeline")

tracer = trace.get_tracer("apac-rag")

# APAC: Traced RAG pipeline: each step becomes a Phoenix span
# (vector_store and llm are assumed to be initialized elsewhere)
def retrieve_apac_context(query: str) -> list[str]:
    with tracer.start_as_current_span("apac.retrieval") as span:
        span.set_attribute("apac.query", query)
        span.set_attribute("apac.retrieval.top_k", 5)

        # APAC: Vector search
        apac_docs = vector_store.similarity_search(query, k=5)

        span.set_attribute("apac.retrieval.num_docs", len(apac_docs))
        span.set_attribute("apac.retrieval.doc_ids", [d.id for d in apac_docs])
        return [d.content for d in apac_docs]

def generate_apac_response(query: str, context: list[str]) -> str:
    with tracer.start_as_current_span("apac.llm_call") as span:
        span.set_attribute("apac.llm.model", "claude-sonnet-4-6")
        span.set_attribute("apac.llm.context_length", sum(len(c) for c in context))

        apac_prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {query}"
        apac_response = llm.invoke(apac_prompt)

        span.set_attribute("apac.llm.response_length", len(apac_response))
        return apac_response

# APAC: Phoenix UI shows full span tree:
# apac-rag-pipeline
#   ├── apac.retrieval (45ms, 5 docs retrieved)
#   └── apac.llm_call (1,230ms, 2,847 context chars → 412 response chars)

Phoenix APAC automated evaluation

# APAC: Phoenix evaluation — automated RAG quality scoring

import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

# APAC: Load traced dataset from Phoenix
apac_dataset = px.Client().get_dataset(name="apac-rag-traces")

# APAC: Judge model for the evaluators (reads OPENAI_API_KEY from the environment)
apac_eval_model = OpenAIModel(model="gpt-4o")

apac_evaluators = [
    HallucinationEvaluator(apac_eval_model),   # Is APAC response grounded in context?
    RelevanceEvaluator(apac_eval_model),        # Is retrieved context relevant to APAC query?
]

# APAC: run_evals returns one dataframe per evaluator, in evaluator order
apac_hallucination_df, apac_relevance_df = run_evals(
    dataframe=apac_dataset.as_dataframe(),
    evaluators=apac_evaluators,
    provide_explanation=True,
)

# APAC: Results show per-trace scores
# trace_id | hallucination_score | relevance_score | explanation
# abc123   | 0.0 (grounded)      | 0.85 (relevant) | "Context matches APAC query"
# def456   | 1.0 (hallucination) | 0.31 (poor)     | "APAC regulatory context missing"
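
To use scores like these as a CI/CD quality gate, a pipeline step can aggregate them and fail the build when they fall below a threshold. The sketch below assumes the per-trace scores have already been exported into plain Python objects. The 0.70 relevance threshold comes from the layer map earlier in this guide. The 5% hallucination-rate limit, the `TraceScore` class, and the `quality_gate` function are assumptions for illustration, not part of Phoenix.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TraceScore:
    """Hypothetical per-trace evaluation result exported from an eval run."""
    trace_id: str
    hallucinated: bool
    relevance: float


def quality_gate(scores, min_mean_relevance=0.70, max_hallucination_rate=0.05):
    """Return (passed, reasons). An empty score set fails the gate."""
    if not scores:
        return False, ["no evaluated traces"]
    failures = []
    mean_relevance = mean(s.relevance for s in scores)
    hallucination_rate = sum(s.hallucinated for s in scores) / len(scores)
    if mean_relevance < min_mean_relevance:
        failures.append(
            f"mean relevance {mean_relevance:.2f} below {min_mean_relevance:.2f}"
        )
    if hallucination_rate > max_hallucination_rate:
        failures.append(
            f"hallucination rate {hallucination_rate:.0%} above {max_hallucination_rate:.0%}"
        )
    return not failures, failures


# The two example traces from the results table above.
scores = [
    TraceScore("abc123", hallucinated=False, relevance=0.85),
    TraceScore("def456", hallucinated=True, relevance=0.31),
]
passed, reasons = quality_gate(scores)
print("PASS" if passed else "FAIL", reasons)
```

A CI job would typically exit with a non-zero status when `passed` is false, which blocks the deployment until the regression is fixed.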

AgentOps: APAC Agent Session Debugging

AgentOps APAC one-line agent instrumentation

# APAC: AgentOps — automatic agent session tracing

import os

import agentops
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

# APAC: Initialize AgentOps — instruments all subsequent LLM calls
agentops.init(
    api_key=os.environ["AGENTOPS_API_KEY"],
    tags=["apac-research-agent", "singapore", "production"],
)

apac_llm = ChatOpenAI(model="gpt-4o", temperature=0)
apac_tools = [apac_web_search_tool, apac_calculator_tool, apac_database_tool]

# APAC: Agent automatically traced — no code changes needed
apac_agent = initialize_agent(
    tools=apac_tools,
    llm=apac_llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=False,
)

apac_result = apac_agent.run(
    "Research AI adoption rates in Singapore financial services sector "
    "and calculate the projected market size for 2027 based on current trends."
)

# APAC: AgentOps records session:
# Session: apac-research-001
#   Step 1: apac_web_search("AI adoption Singapore finance") → 8 results
#   Step 2: apac_web_search("MAS AI governance framework statistics") → 5 results
#   Step 3: apac_database_tool("fintech_market_data WHERE region='SG'") → 23 rows
#   Step 4: apac_calculator_tool("CAGR(baseline=180, rate=0.34, years=3)") → 433.1
#   LLM calls: 6 | Total tokens: 8,432 | Cost: $0.043 | Duration: 12.3s

agentops.end_session("Success")

AgentOps APAC cost monitoring and anomaly detection

# APAC: AgentOps session cost tracking by tag

import os

import agentops

agentops.init(
    api_key=os.environ["AGENTOPS_API_KEY"],
    # APAC: max_wait_time is how long (in ms) the SDK waits before flushing
    # queued events; it is not a token or cost budget
    max_wait_time=30_000,
    tags=["apac-enterprise", "cost-monitored"],
)

# APAC: Track cost per task type in AgentOps dashboard
# Filter: tags["apac-enterprise"] | last 7 days
# Sessions: 1,243 | Avg cost: $0.031 | Max cost: $4.82 (anomaly flagged)
# Top cost drivers: apac_database_tool (42%), LLM reasoning (38%), search (20%)
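
The article does not describe how AgentOps decides that a session is a cost anomaly. As a rough illustration of the idea, the sketch below flags sessions whose cost exceeds a multiple of the median session cost. The function name, the 10x multiplier, and the sample session IDs and costs are all assumptions chosen to mirror the dashboard figures above.

```python
from statistics import median


def flag_cost_anomalies(session_costs: dict, multiplier: float = 10.0) -> list:
    """Return IDs of sessions costing more than `multiplier` times the median."""
    if not session_costs:
        return []
    typical = median(session_costs.values())
    limit = typical * multiplier
    return sorted(sid for sid, cost in session_costs.items() if cost > limit)


# Hypothetical per-session costs in USD, including one runaway session.
costs = {
    "s-001": 0.028,
    "s-002": 0.031,
    "s-003": 0.035,
    "s-004": 4.82,
    "s-005": 0.030,
}
anomalies = flag_cost_anomalies(costs)
print(anomalies)
```

The median is used instead of the mean so that a single very expensive session does not raise the baseline it is compared against.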

Lunary: APAC Production LLM Logging

Lunary APAC basic logging setup

# APAC: Lunary — lightweight LLM production logging

import os

from lunary import LunaryCallbackHandler
from langchain_openai import ChatOpenAI

# APAC: Add Lunary callback — logs all LLM calls automatically
apac_llm = ChatOpenAI(
    model="gpt-4o",
    callbacks=[
        LunaryCallbackHandler(
            app_id=os.environ["LUNARY_APP_ID"],
            # APAC: Tag for feature attribution
            metadata={
                "feature": "apac-customer-support",
                "market": "singapore",
                "user_tier": "enterprise",
            }
        )
    ]
)

# APAC: Standard LangChain usage — Lunary logs transparently
apac_response = apac_llm.invoke("Explain AI governance requirements for Singapore banks")
# APAC: Lunary captures:
# - Prompt: "Explain AI governance requirements..."
# - Response: "..."
# - Model: gpt-4o | Tokens: 847 | Cost: $0.0084 | Latency: 2.3s
# - Tags: feature=apac-customer-support, market=singapore
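
Tagging each call with a `feature` value is what makes feature-level cost attribution possible. The sketch below shows the aggregation on exported log records. The record shape (`cost_usd`, `metadata`) and the `cost_by_feature` helper are assumptions for illustration and do not reflect Lunary's actual export format.

```python
from collections import defaultdict


def cost_by_feature(records: list) -> dict:
    """Sum logged cost per `feature` tag; untagged calls are grouped together."""
    totals = defaultdict(float)
    for record in records:
        feature = record.get("metadata", {}).get("feature", "untagged")
        totals[feature] += record["cost_usd"]
    return dict(totals)


# Hypothetical exported log records.
records = [
    {"cost_usd": 0.0084, "metadata": {"feature": "apac-customer-support"}},
    {"cost_usd": 0.0100, "metadata": {"feature": "apac-customer-support"}},
    {"cost_usd": 0.0020, "metadata": {}},
]
feature_costs = cost_by_feature(records)
for feature, total in sorted(feature_costs.items()):
    print(f"{feature}: ${total:.4f}")
```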

Lunary APAC user feedback collection

# APAC: Lunary — user feedback on AI responses

import lunary

# APAC: Log LLM call with run ID for feedback tracking
apac_run_id = lunary.open_ai.wrap(openai_client).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": apac_user_query}],
    metadata={"user_id": apac_user.id, "session_id": apac_session_id}
).id

# APAC: When user rates the response (thumbs up/down in UI)
@app.post("/api/feedback")
def apac_submit_feedback(run_id: str, rating: int, comment: str | None = None):
    lunary.track_feedback(
        run_id=run_id,
        # APAC: Assuming a 1-5 rating: 4-5 maps to 1 (positive), 1-2 to -1 (negative), 3 to 0 (neutral)
        score=1 if rating > 3 else (-1 if rating < 3 else 0),
        comment=comment,
    )
    # APAC: Feedback now attached to run in Lunary dashboard
    # Trend: feature=apac-customer-support | CSAT: 82% (last 30 days)
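
A CSAT figure like the one in the trend comment can be derived from the same rating-to-score mapping. The sketch below assumes a 1-5 rating scale and defines CSAT as the share of positive ratings among all ratings received. That definition, and the helper names, are assumptions, since the article does not state how the dashboard computes it.

```python
from typing import Optional


def rating_to_score(rating: int) -> int:
    """Map a 1-5 rating to 1 (positive), -1 (negative), or 0 (neutral)."""
    if rating > 3:
        return 1
    if rating < 3:
        return -1
    return 0


def csat_percent(ratings: list) -> Optional[float]:
    """Percentage of ratings that are positive, or None if there is no feedback."""
    if not ratings:
        return None
    positive = sum(1 for r in ratings if rating_to_score(r) == 1)
    return 100.0 * positive / len(ratings)


# Hypothetical ratings collected for one feature.
sample_ratings = [5, 4, 4, 3, 2, 5, 1, 4, 5, 4]
csat = csat_percent(sample_ratings)
print(f"CSAT: {csat:.0f}%")
```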

APAC LLM Observability Selection

APAC Need                                                         → Tool                     → Why

APAC RAG quality debugging (retrieval relevance, hallucination)   → Phoenix                  → Span-level retrieval and LLM evaluation
APAC multi-step agent debugging (complex workflow failures)       → AgentOps                 → Session replay; step-by-step trace
APAC production cost tracking (feature-level LLM spend)           → Lunary, AgentOps         → Lightweight; per-feature attribution
APAC user satisfaction monitoring (CSAT on AI output quality)     → Lunary                   → Feedback API; CSAT trend tracking
APAC CI/CD quality regression (pre-deploy quality checks)         → Phoenix                  → Automated eval in deployment pipeline
APAC all-in-one LLM logging (existing Helicone users)             → Helicone (if existing)   → Proxy-based; zero-code logging

Related APAC AI Observability Resources

For the ML model monitoring tools (Evidently AI, whylogs, Fiddler) that monitor data drift and feature distributions in APAC ML models upstream of LLM pipelines, see the APAC ML model monitoring guide.

For the LLM evaluation platforms (promptfoo, DeepEval, Ragas) that run systematic offline evaluation suites against APAC AI application quality benchmarks, see the APAC AI evaluation guide.

For the AI agent frameworks (AutoGen, PydanticAI, smolagents) that AgentOps and Phoenix instrument to trace APAC multi-agent conversations and tool invocations, see the APAC AI agent frameworks guide.
