APAC LLM Observability Guide 2026: Arize Phoenix, AgentOps, and Lunary

Why APAC Teams Need LLM-Specific Observability

Standard application monitoring tools such as Datadog and Prometheus track request counts, latency, and error rates. Those metrics tell APAC teams that something went wrong, but not why an LLM response was incorrect, why a RAG pipeline retrieved irrelevant context, or which step in a 20-step agent workflow caused a task to fail. LLM-specific observability tools capture the semantic content of AI interactions (prompts, responses, retrieved chunks, tool calls, and quality scores) and so provide the diagnostic layer that standard observability tools lack.
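
To make the distinction concrete, the sketch below contrasts what a standard metrics record holds with what an LLM trace record would add. The class and field names (`RequestMetric`, `LLMTraceRecord`, `needs_review`) and the 0.70 default threshold are illustrative assumptions, not the schema of any tool discussed in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class RequestMetric:
    """What a standard APM tool typically records for one request."""
    route: str
    status_code: int
    latency_ms: float


@dataclass
class LLMTraceRecord:
    """Hypothetical record of the semantic content of one LLM interaction."""
    prompt: str
    response: str
    retrieved_chunks: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    quality_scores: dict[str, float] = field(default_factory=dict)

    def needs_review(self, threshold: float = 0.70) -> bool:
        # Flag the interaction if any recorded quality score is below threshold.
        return any(score < threshold for score in self.quality_scores.values())


metric = RequestMetric(route="/chat", status_code=200, latency_ms=1275.0)
record = LLMTraceRecord(
    prompt="Which disclosure rules apply?",
    response="...",
    retrieved_chunks=["chunk-a", "chunk-b"],
    quality_scores={"retrieval_relevance": 0.31},
)

# The HTTP metric looks healthy, yet the trace record reveals a quality problem.
print(metric.status_code, record.needs_review())
```

The request succeeded at the HTTP level, so a dashboard built only on `RequestMetric` would stay green; the low relevance score is visible only in the trace record.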

Three tools cover different layers of APAC LLM observability:

Arize Phoenix — open-source LLM observability with span-level tracing and evaluation metrics for APAC RAG pipelines and agent workflows.

AgentOps — agent-focused session tracing with step-by-step replay and cost tracking for APAC multi-step agent debugging.

Lunary — lightweight open-source LLM logging with user feedback collection and cost analytics for APAC production AI applications.

APAC LLM Observability Layer Map

APAC AI Application Quality Loop:

Production Logging (what happened):
  Lunary → logs prompts/responses/costs in production
  → "Claude responded incorrectly to APAC customer query"

Agent Debugging (what the agent did):
  AgentOps → replays session step-by-step
  → "Agent called wrong tool at step 7 due to ambiguous APAC context"

Evaluation Pipeline (why output was wrong):
  Phoenix → runs automated quality metrics over traced spans
  → "RAG retrieval relevance score: 0.31 (APAC threshold: 0.70)"
  → "Root cause: APAC chunking strategy splits regulatory context across chunks"

Fix → Test → Redeploy → Monitor (Lunary) → Repeat

Arize Phoenix: APAC RAG and Agent Tracing

Phoenix APAC instrumentation for RAG pipelines

# APAC: Arize Phoenix — RAG pipeline tracing

import phoenix as px
from phoenix.otel import register
from opentelemetry import trace

# APAC: Start Phoenix server and register OTel tracer
px.launch_app()
register(project_name="apac-rag-pipeline")

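tracer = trace.get_tracer("apac-rag")

# APAC: `vector_store` and `llm` are assumed to be created elsewhere.
# `vector_store` returns documents with `id` and `content` attributes,
# and `llm.invoke` is assumed to return a plain string.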

# APAC: Traced RAG pipeline — each step becomes a Phoenix span
def retrieve_apac_context(query: str) -> list[str]:
    with tracer.start_as_current_span("apac.retrieval") as span:
        span.set_attribute("apac.query", query)
        span.set_attribute("apac.retrieval.top_k", 5)

        # APAC: Vector search
        apac_docs = vector_store.similarity_search(query, k=5)

        span.set_attribute("apac.retrieval.num_docs", len(apac_docs))
        span.set_attribute("apac.retrieval.doc_ids", [d.id for d in apac_docs])
        return [d.content for d in apac_docs]

def generate_apac_response(query: str, context: list[str]) -> str:
    with tracer.start_as_current_span("apac.llm_call") as span:
        span.set_attribute("apac.llm.model", "claude-sonnet-4-6")
        span.set_attribute("apac.llm.context_length", sum(len(c) for c in context))

        apac_prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {query}"
        apac_response = llm.invoke(apac_prompt)

        span.set_attribute("apac.llm.response_length", len(apac_response))
        return apac_response

# APAC: When both calls run inside one parent span, the Phoenix UI shows the span tree:
# apac-rag-pipeline
#   ├── apac.retrieval (45ms, 5 docs retrieved)
#   └── apac.llm_call (1,230ms, 2,847 context chars → 412 response chars)

Phoenix APAC automated evaluation

# APAC: Phoenix evaluation — automated RAG quality scoring

import phoenix as px
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)

# APAC: Load traced dataset from Phoenix
apac_dataset = px.Client().get_dataset(name="apac-rag-traces")

# APAC: Run evaluation metrics
apac_eval_model = OpenAIModel(model="gpt-4o")

apac_evaluators = [
    HallucinationEvaluator(apac_eval_model),   # Is APAC response grounded in context?
    RelevanceEvaluator(apac_eval_model),        # Is retrieved context relevant to APAC query?
]

apac_results = run_evals(
    dataframe=apac_dataset.as_dataframe(),
    evaluators=apac_evaluators,
    provide_explanation=True,
)

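# APAC: run_evals returns one DataFrame per evaluator, in evaluator order,
# typically with label, score, and explanation columns.
# Illustrative combined view of both results per trace:
# trace_id | hallucination_score | relevance_score | explanation
# abc123   | 0.0 (grounded)      | 0.85 (relevant) | "Context matches APAC query"
# def456   | 1.0 (hallucination) | 0.31 (poor)     | "APAC regulatory context missing"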
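
Per-trace scores like those above can be turned into a pass/fail signal for a pre-deploy check. The sketch below is one possible way to do that, assuming the results have been flattened into plain rows; the row shape, the function names `summarize` and `gate`, and the 5% hallucination limit are assumptions for illustration, not part of Phoenix.

```python
# Hypothetical per-trace rows, shaped like the illustrative results above.
eval_rows = [
    {"trace_id": "abc123", "hallucination_score": 0.0, "relevance_score": 0.85},
    {"trace_id": "def456", "hallucination_score": 1.0, "relevance_score": 0.31},
]


def summarize(rows, relevance_threshold=0.70):
    """Return the hallucination rate and the trace IDs with weak retrieval."""
    if not rows:
        raise ValueError("no evaluation rows to summarize")
    hallucinated = sum(1 for r in rows if r["hallucination_score"] >= 0.5)
    low_relevance = [
        r["trace_id"] for r in rows if r["relevance_score"] < relevance_threshold
    ]
    return hallucinated / len(rows), low_relevance


def gate(rows, max_hallucination_rate=0.05, relevance_threshold=0.70):
    """Return True when the batch is good enough to deploy."""
    rate, low_relevance = summarize(rows, relevance_threshold)
    return rate <= max_hallucination_rate and not low_relevance


rate, low = summarize(eval_rows)
print(f"hallucination rate: {rate:.0%}, low-relevance traces: {low}")
print("deploy" if gate(eval_rows) else "block")
```

In a CI job, the boolean from `gate` could decide the exit code, so a regression in retrieval quality blocks the release instead of surfacing later in production.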

AgentOps: APAC Agent Session Debugging

AgentOps APAC one-line agent instrumentation

# APAC: AgentOps — automatic agent session tracing

import os

import agentops
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

# APAC: Initialize AgentOps — instruments all subsequent LLM calls
agentops.init(
    api_key=os.environ["AGENTOPS_API_KEY"],
    tags=["apac-research-agent", "singapore", "production"],
)

apac_llm = ChatOpenAI(model="gpt-4o", temperature=0)
apac_tools = [apac_web_search_tool, apac_calculator_tool, apac_database_tool]

# APAC: Agent automatically traced — no code changes needed
apac_agent = initialize_agent(
    tools=apac_tools,
    llm=apac_llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=False,
)

apac_result = apac_agent.run(
    "Research AI adoption rates in Singapore financial services sector "
    "and calculate the projected market size for 2027 based on current trends."
)

# APAC: AgentOps records session:
# Session: apac-research-001
#   Step 1: apac_web_search("AI adoption Singapore finance") → 8 results
#   Step 2: apac_web_search("MAS AI governance framework statistics") → 5 results
#   Step 3: apac_database_tool("fintech_market_data WHERE region='SG'") → 23 rows
#   Step 4: apac_calculator_tool("CAGR(baseline=180, rate=0.34, years=3)") → 433.1
#   LLM calls: 6 | Total tokens: 8,432 | Cost: $0.043 | Duration: 12.3s

agentops.end_session("Success")
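
Step 4 of the session above compounds a baseline of 180 at 34% per year over three years, that is, baseline × (1 + rate)^years. A minimal sketch for checking such a tool result by hand follows; `project_compound_growth` is a hypothetical helper, and the unit of the baseline is not stated in the session log.

```python
def project_compound_growth(baseline: float, rate: float, years: int) -> float:
    """Compound `baseline` at annual `rate` for `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return baseline * (1 + rate) ** years


# 180 * 1.34 ** 3
projected = project_compound_growth(baseline=180, rate=0.34, years=3)
print(round(projected, 1))  # prints 433.1
```

Re-running a tool call's arithmetic like this is a quick way to confirm whether a wrong final answer came from the calculator step or from the inputs the agent fed into it.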

AgentOps APAC cost monitoring and anomaly detection

# APAC: AgentOps — session cost tracking and budget alerts

agentops.init(
    api_key=os.environ["AGENTOPS_API_KEY"],
    # APAC: max_wait_time is how long (in ms) the SDK waits before flushing
    # queued events; it does not enforce a token or cost budget
    max_wait_time=30_000,
    tags=["apac-enterprise", "cost-monitored"],
)

# APAC: Track cost per task type in AgentOps dashboard
# Filter: tags["apac-enterprise"] | last 7 days
# Sessions: 1,243 | Avg cost: $0.031 | Max cost: $4.82 (anomaly flagged)
# Top cost drivers: apac_database_tool (42%), LLM reasoning (38%), search (20%)
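
The dashboard summary above flags a $4.82 session against a $0.031 average. One simple way such a flag could be reproduced offline from exported session costs is a median-based rule. The function name `flag_cost_anomalies`, the session IDs, and the multiplier of 20 are assumptions for illustration, not AgentOps behavior.

```python
from statistics import median


def flag_cost_anomalies(
    costs_by_session: dict[str, float], multiplier: float = 20.0
) -> list[str]:
    """Return session IDs whose cost exceeds `multiplier` times the median cost."""
    if not costs_by_session:
        return []
    limit = median(costs_by_session.values()) * multiplier
    return sorted(sid for sid, cost in costs_by_session.items() if cost > limit)


session_costs = {
    "s-001": 0.028,
    "s-002": 0.031,
    "s-003": 0.035,
    "s-004": 4.82,  # runaway session
    "s-005": 0.030,
}

print(flag_cost_anomalies(session_costs))
```

The median is used instead of the mean because a single runaway session inflates the mean and can hide itself; the median stays close to the typical session cost.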

Lunary: APAC Production LLM Logging

Lunary APAC basic logging setup

# APAC: Lunary — lightweight LLM production logging

import os

from lunary import LunaryCallbackHandler
from langchain_openai import ChatOpenAI

# APAC: Add Lunary callback — logs all LLM calls automatically
apac_llm = ChatOpenAI(
    model="gpt-4o",
    callbacks=[
        LunaryCallbackHandler(
            app_id=os.environ["LUNARY_APP_ID"],
            # APAC: Tag for feature attribution
            metadata={
                "feature": "apac-customer-support",
                "market": "singapore",
                "user_tier": "enterprise",
            }
        )
    ]
)

# APAC: Standard LangChain usage — Lunary logs transparently
apac_response = apac_llm.invoke("Explain AI governance requirements for Singapore banks")
# APAC: Lunary captures:
# - Prompt: "Explain AI governance requirements..."
# - Response: "..."
# - Model: gpt-4o | Tokens: 847 | Cost: $0.0084 | Latency: 2.3s
# - Tags: feature=apac-customer-support, market=singapore
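
Once each logged call carries tags such as `feature` and `market`, per-feature spend is a simple group-and-sum over the logs. The sketch below assumes the logs have been exported as plain rows; the field names and the second and third rows are hypothetical, and this is not Lunary's export format.

```python
from collections import defaultdict

# Hypothetical exported log rows; field names are assumptions for illustration.
logged_calls = [
    {"feature": "apac-customer-support", "market": "singapore", "tokens": 847, "cost_usd": 0.0084},
    {"feature": "apac-customer-support", "market": "singapore", "tokens": 1210, "cost_usd": 0.0121},
    {"feature": "apac-doc-summary", "market": "japan", "tokens": 3020, "cost_usd": 0.0302},
]


def spend_by(rows, key):
    """Sum calls, tokens, and cost per distinct value of `key`."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})
    for row in rows:
        bucket = totals[row[key]]
        bucket["calls"] += 1
        bucket["tokens"] += row["tokens"]
        bucket["cost_usd"] += row["cost_usd"]
    return dict(totals)


for feature, total in spend_by(logged_calls, "feature").items():
    print(f"{feature}: {total['calls']} calls, {total['tokens']} tokens, ${total['cost_usd']:.4f}")
```

The same function groups by `market` or any other tag, which is why consistent tagging at logging time matters more than the choice of aggregation.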

Lunary APAC user feedback collection

# APAC: Lunary — user feedback on AI responses

import lunary

# APAC: Log LLM call with run ID for feedback tracking
apac_run_id = lunary.open_ai.wrap(openai_client).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": apac_user_query}],
    metadata={"user_id": apac_user.id, "session_id": apac_session_id}
).id

# APAC: When the user rates the response (1-5 rating in the UI)
# `app` is assumed to be an existing FastAPI-style web application
@app.post("/api/feedback")
def apac_submit_feedback(run_id: str, rating: int, comment: str | None = None):
    lunary.track_feedback(
        run_id=run_id,
        # APAC: Score: 1 = positive, -1 = negative, 0 = neutral
        score=1 if rating > 3 else (-1 if rating < 3 else 0),
        comment=comment,
    )
    # APAC: Feedback now attached to run in Lunary dashboard
    # Trend: feature=apac-customer-support | CSAT: 82% (last 30 days)
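
The handler above maps a rating to a score of 1, 0, or -1, and the dashboard comment reports a CSAT percentage. A minimal sketch of both calculations follows. It assumes a 1-5 rating scale (implied by the comparisons against 3) and defines CSAT as the share of positive scores among all feedback; other CSAT definitions exist, and the function names are hypothetical.

```python
def rating_to_score(rating: int) -> int:
    """Map a 1-5 rating to 1 (positive), 0 (neutral), or -1 (negative)."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    if rating > 3:
        return 1
    if rating < 3:
        return -1
    return 0


def csat_percent(scores: list[int]) -> float:
    """Share of positive scores among all feedback, as a percentage."""
    if not scores:
        raise ValueError("no feedback collected yet")
    return 100 * sum(1 for s in scores if s == 1) / len(scores)


ratings = [5, 4, 3, 2, 5, 4, 1, 5, 4, 4]
scores = [rating_to_score(r) for r in ratings]
print(scores)
print(f"CSAT: {csat_percent(scores):.0f}%")
```

Validating the rating range before mapping keeps malformed client input from being silently counted as negative or positive feedback.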

APAC LLM Observability Selection

APAC Need → Tool → Why

APAC RAG quality debugging (retrieval relevance, hallucination) → Phoenix → Span-level retrieval and LLM evaluation

APAC multi-step agent debugging (complex workflow failures) → AgentOps → Session replay; step-by-step trace

APAC production cost tracking (feature-level LLM spend) → Lunary, AgentOps → Lightweight; per-feature attribution

APAC user satisfaction monitoring (CSAT on AI output quality) → Lunary → Feedback API; CSAT trend tracking

APAC CI/CD quality regression (pre-deploy quality checks) → Phoenix → Automated eval in deployment pipeline

APAC all-in-one LLM logging (existing Helicone users) → Helicone (if existing) → Proxy-based; zero-code logging

Related APAC AI Observability Resources

For the ML model monitoring tools (Evidently AI, whylogs, Fiddler) that monitor data drift and feature distributions in APAC ML models upstream of LLM pipelines, see the APAC ML model monitoring guide.

For the LLM evaluation platforms (promptfoo, DeepEval, Ragas) that run systematic offline evaluation suites against APAC AI application quality benchmarks, see the APAC AI evaluation guide.

For the AI agent frameworks (AutoGen, PydanticAI, smolagents) that AgentOps and Phoenix instrument to trace APAC multi-agent conversations and tool invocations, see the APAC AI agent frameworks guide.