APAC LLM Engineering Quality: Workflow, Tracing, and Testing
APAC AI engineering teams shipping LLM-powered features face three distinct quality challenges: managing prompt versions and experimentation without engineering bottlenecks, tracing LLM application behavior in production to debug quality issues, and applying systematic testing discipline to LLM outputs that don't have a single correct answer. This guide covers three complementary tools APAC teams use to address each layer of the LLM quality problem.
Vellum — LLM workflow platform for APAC product teams, combining prompt versioning, A/B testing, visual workflow builder, and production monitoring without code deployments.
Opik by Comet — open-source LLM tracing and testing platform for APAC ML teams, providing call-level observability and automated test suites that integrate with Comet ML experiment tracking.
Deepchecks — continuous testing framework for APAC LLM applications and ML models, running automated quality checks, drift detection, and RAG pipeline validation in Python-first CI/CD pipelines.
APAC LLM Quality Engineering Tool Selection
APAC Team Profile → Tool → Why
APAC product + eng team, LLM feature (non-technical PMs need visibility) → Vellum → prompt versioning + A/B testing without code deploys; product-friendly
APAC ML team, Comet ML users (open-source preferred, self-host) → Opik → tracing + testing + Comet ML integration; OSS self-host option
APAC data science, test discipline (CI/CD gates for AI quality) → Deepchecks → Python pytest-style quality gates; LLM + ML + RAG in one framework
APAC team, lightweight OTel tracing (already have Grafana/Jaeger) → OpenLLMetry → OTel spans to existing Jaeger/Grafana; no new platform needed
APAC team, full LLMOps (eval + human feedback + fine-tuning) → Humanloop → enterprise A/B + human feedback + fine-tuning at scale
APAC LLM Quality Stack by Maturity:
Starter (1 LLM feature): Vellum (prompt mgmt) + Parea AI (testing)
Intermediate (3+ features): Vellum + Opik/Langfuse (tracing) + Deepchecks (quality gates)
Advanced (ML + LLM): Opik + Comet ML + Deepchecks (unified experiment tracking)
Enterprise: Humanloop + Patronus AI + Galileo (full-stack)
Vellum: APAC Prompt Versioning and A/B Testing
Vellum APAC Python SDK integration
# APAC: Vellum — run production prompt via SDK without embedding prompt in code
import vellum
import os
apac_vellum = vellum.Vellum(api_key=os.environ["VELLUM_API_KEY"])
def apac_run_compliance_workflow(
    apac_regulation: str,
    apac_query: str,
    apac_context: str,
) -> str:
    """APAC: Run the production compliance assistant workflow from Vellum."""
    # APAC: Vellum fetches the CURRENTLY DEPLOYED prompt version automatically
    # APAC: Product team can update the prompt in the Vellum UI → this code uses the new version
    apac_result = apac_vellum.execute_workflow(
        workflow_deployment_name="apac-mas-compliance-assistant",
        release_tag="PRODUCTION",  # APAC: pinned to the production deployment tag
        inputs=[
            vellum.WorkflowRequestStringInputRequest(
                name="regulation",
                value=apac_regulation,
            ),
            vellum.WorkflowRequestStringInputRequest(
                name="query",
                value=apac_query,
            ),
            vellum.WorkflowRequestStringInputRequest(
                name="context_documents",
                value=apac_context,
            ),
        ],
    )
    # APAC: Extract the named output from the workflow execution
    apac_output = next(
        (o.value for o in apac_result.data.outputs if o.name == "compliance_response"),
        None,
    )
    if apac_output is None:
        raise ValueError("APAC: workflow returned no 'compliance_response' output")
    return apac_output

# APAC: Usage — product team changes the prompt in the Vellum UI
# APAC: No code deployment needed; the next API call uses the updated prompt
apac_response = apac_run_compliance_workflow(
    apac_regulation="MAS FEAT",
    apac_query="What are the explainability requirements for credit AI models?",
    apac_context="MAS FEAT Principles document...",
)
Vellum APAC A/B test setup
# APAC: Vellum — configure A/B test between two prompt variants
# APAC: In Vellum UI: Deployments → apac-mas-compliance-assistant → A/B Testing
# APAC: Configure:
# - Variant A: "apac-prompt-v2-concise" (80% traffic)
# - Variant B: "apac-prompt-v3-detailed" (20% traffic)
# - Routing: random per request (no sticky sessions)
# - Metrics: response_quality score from automated LLM-as-judge
# APAC: Application code unchanged — Vellum routes to correct variant
# APAC: After ~500 requests (~400 to variant A, ~100 to variant B at the 80/20 split):
# → Vellum dashboard shows: variant A score 0.81, variant B score 0.87
# → Product team promotes variant B to 100% without engineering deployment
# → Variant A traffic: 0% (archived but retrievable)
# APAC: Track custom metrics by logging quality signals back to Vellum
# APAC: apac_result below is the execution result returned by execute_workflow
apac_vellum.submit_completion_actuals(
    deployment_id="apac-mas-compliance-assistant",
    completion_id=apac_result.data.completion_id,
actuals=[
vellum.SubmitCompletionActualRequest(
quality=0.92, # APAC: automated LLM-as-judge score logged per response
label="accepted",
)
],
)
# APAC: Vellum aggregates quality scores per variant in real time
# APAC: Statistical significance calculated automatically — promotes winner at threshold
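The A/B comparison above hinges on the automated LLM-as-judge quality score. The judge call itself is model-specific, but the deterministic parts — parsing a judge's free-text reply into a 0–1 score and aggregating scores per variant — can be sketched in plain Python. The function names and the judge-reply format here are illustrative assumptions, not Vellum API surface.

```python
# APAC: hypothetical helpers — parse an LLM-as-judge reply into a 0-1 score
# and aggregate scores per A/B variant. The judge call itself is elided.
import re
from statistics import mean

def apac_parse_judge_score(judge_reply: str) -> float:
    """Extract the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {judge_reply!r}")
    return max(0.0, min(1.0, float(match.group())))

def apac_variant_means(scores: list[tuple[str, float]]) -> dict[str, float]:
    """Mean judge score per variant, e.g. [('A', 0.80), ('B', 0.87), ...]."""
    by_variant: dict[str, list[float]] = {}
    for variant, score in scores:
        by_variant.setdefault(variant, []).append(score)
    return {v: round(mean(s), 2) for v, s in by_variant.items()}

print(apac_parse_judge_score("Score: 0.87 — grounded and cites MAS FEAT"))  # → 0.87
print(apac_variant_means([("A", 0.80), ("A", 0.82), ("B", 0.87)]))
```

Per-variant means computed this way are what a dashboard like the one above would aggregate before a significance test decides the promotion.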
Vellum APAC workflow builder for RAG pipeline
APAC: Vellum Workflow Builder — visual RAG pipeline for APAC compliance assistant
┌─────────────────────────────────────────────────────────────┐
│ APAC Workflow: apac-mas-compliance-assistant │
│ │
│ Input: {regulation, query, user_segment} │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Search Node │───▶│ LLM Node │───▶│ Guard Node │ │
│ │ │ │ │ │ │ │
│ │ Qdrant search│ │ gpt-4o-mini │ │ Check: │ │
│ │ top_k=5 │ │ │ │ - no advice │ │
│ │ APAC docs │ │ Prompt v7 │ │ - cites reg │ │
│ └──────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Output Node │ │
│ │ │ │
│ │ compliance_ │ │
│ │ response │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
APAC: Each node version-controlled independently
APAC: Swap LLM node model without changing retrieval or guard logic
APAC: A/B test prompt changes in LLM node with 10/90 traffic split
APAC: Production deployment: tag workflow version as PRODUCTION
APAC: Rollback: re-tag previous version as PRODUCTION in Vellum UI
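The Guard Node in the diagram enforces two rules: no financial advice and a citation of the relevant regulation. A minimal sketch of what such a check might look like follows — the phrase list and regulation names are illustrative assumptions, not Vellum's built-in guard logic.

```python
# APAC: minimal sketch of the Guard Node's checks — phrase list and
# regulation names are assumptions for illustration only.
APAC_ADVICE_PHRASES = ("you should invest", "we recommend buying", "guaranteed return")
APAC_KNOWN_REGULATIONS = ("MAS FEAT", "HKMA", "OJK", "PDPA")

def apac_guard_check(response: str) -> dict[str, bool]:
    """Return pass/fail for each guard rule applied to an LLM response."""
    lowered = response.lower()
    return {
        "no_advice": not any(p in lowered for p in APAC_ADVICE_PHRASES),
        "cites_regulation": any(r.lower() in lowered for r in APAC_KNOWN_REGULATIONS),
    }

apac_checks = apac_guard_check("Under MAS FEAT, banks must document model explainability.")
print(apac_checks)  # → {'no_advice': True, 'cites_regulation': True}
```

Keeping the guard as a separate node means its rules can be tightened without touching the retrieval or prompt configuration, which is the point of the per-node versioning described above.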
Opik by Comet: APAC Open-Source LLM Tracing
Opik APAC SDK tracing setup
# APAC: Opik — instrument APAC LLM application with automatic call tracing
import os

import openai
import opik
from opik.integrations.openai import track_openai

opik.configure(api_key=os.environ["OPIK_API_KEY"])  # APAC: or point at a self-hosted Opik URL

# APAC: Auto-instrument the async OpenAI client — all calls traced automatically
apac_client = track_openai(openai.AsyncOpenAI())
# APAC: Decorator for function-level tracing
@opik.track(name="apac-compliance-rag")
async def apac_compliance_rag_pipeline(apac_query: str) -> str:
"""APAC: Full RAG pipeline traced end-to-end in Opik."""
# APAC: Step 1 — Retrieval (tracked as sub-span)
    @opik.track(name="apac-qdrant-retrieval")
    async def apac_retrieve(query: str) -> list[str]:
        # APAC: apac_qdrant_client is an application-level async Qdrant wrapper
        # defined elsewhere; its search signature is app-specific
        return await apac_qdrant_client.search(
            collection_name="apac-mas-regulations",
            query_text=query,
            limit=5,
        )
apac_chunks = await apac_retrieve(apac_query)
apac_context = "\n\n".join(apac_chunks)
# APAC: Step 2 — LLM call (auto-traced by track_openai)
apac_response = await apac_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"Answer based only on this context:\n{apac_context}",
},
{"role": "user", "content": apac_query},
],
)
apac_answer = apac_response.choices[0].message.content
return apac_answer
# APAC: Opik dashboard shows:
# apac-compliance-rag (1.34s total)
# ├── apac-qdrant-retrieval (0.08s, 5 docs)
# └── openai.chat.completions (1.26s, 312 tokens, $0.000047)
# APAC: Click any trace → view full input/output + token breakdown + cost
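The trace view above reports a per-call cost alongside token counts. A hedged sketch of how such a figure can be derived from token counts follows — the per-million-token rates are illustrative assumptions, not Opik's billing data.

```python
# APAC: sketch of per-call cost estimation from token counts — the rates
# below are assumptions for illustration, check your provider's pricing.
APAC_RATES_PER_MTOK = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},  # USD per 1M tokens (assumed)
}

def apac_estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one chat completion call."""
    rates = APAC_RATES_PER_MTOK[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# APAC: 312 input tokens at the assumed input rate ≈ $0.000047, matching the trace above
print(f"{apac_estimate_cost('gpt-4o-mini', 312, 0):.6f}")  # → 0.000047
```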
Opik APAC automated test dataset and scoring
# APAC: Opik — build test dataset and run LLM-as-judge scoring for CI/CD
import opik
from opik import DatasetItem
apac_opik_client = opik.Opik()
# APAC: Create test dataset in Opik (stored in Opik backend)
apac_dataset = apac_opik_client.get_or_create_dataset(
name="apac-mas-compliance-qa-v2",
)
# APAC: Add test cases (question + expected regulation citation)
apac_test_cases = [
{"question": "What are MAS FEAT fairness criteria?", "expected_regulation": "FEAT"},
{"question": "What are HKMA AI governance principles?", "expected_regulation": "HKMA"},
{"question": "How does OJK regulate AI in lending?", "expected_regulation": "OJK"},
]
for apac_case in apac_test_cases:
    apac_dataset.insert(
        [
            DatasetItem(
                input=apac_case["question"],
                expected_output=apac_case["expected_regulation"],
            )
        ]
    )
# APAC: Define scorer — checks if regulation is cited in response
def apac_regulation_cited_scorer(dataset_item, llm_output: str) -> float:
"""APAC: Score 1.0 if expected regulation cited in output, else 0.0."""
apac_expected = dataset_item.expected_output
return 1.0 if apac_expected.lower() in llm_output.lower() else 0.0
# APAC: Run test suite
apac_results = opik.run_experiment(
dataset=apac_dataset,
task=apac_compliance_rag_pipeline,
scoring_metrics=[apac_regulation_cited_scorer],
experiment_name="apac-compliance-rag-ci-run-v3",
)
apac_avg_score = apac_results.mean_score("apac_regulation_cited_scorer")
print(f"APAC: Test suite score: {apac_avg_score:.2f}")
# APAC: CI/CD gate
if apac_avg_score < 0.85:
raise SystemExit(f"APAC: Quality gate failed — {apac_avg_score:.2f} < 0.85 threshold")
Deepchecks: APAC Continuous AI Quality Testing
Deepchecks APAC LLM quality suite
# APAC: Deepchecks — run automated LLM quality checks before production deployment
# APAC: check class names follow the Deepchecks LLM evaluation docs — verify
# against the installed deepchecks version, as the LLM eval API surface changes
from deepchecks.llm_eval import LLMTestSuite
from deepchecks.llm_eval.checks import (
    ResponseCoherence,
    ContextAdherence,
    ResponseToxicity,
    Groundedness,
)
# APAC: Define LLM test suite with quality checks
apac_llm_suite = LLMTestSuite(
name="APAC Compliance Assistant Quality Suite",
checks=[
ResponseCoherence(min_score=0.75), # APAC: response must be logically coherent
ContextAdherence(min_score=0.80), # APAC: must stay within retrieved context
ResponseToxicity(max_score=0.05), # APAC: toxicity must be near-zero
Groundedness(min_score=0.85), # APAC: claims must be grounded in context
],
)
import asyncio

# APAC: Test data — compliance QA pairs with retrieved context
# APAC: apac_compliance_rag_pipeline is the async Opik-traced pipeline defined above
apac_test_samples = [
    {
        "question": "What are MAS FEAT accountability requirements?",
        "context": "MAS FEAT requires banks to designate a senior responsible individual...",
        "response": asyncio.run(apac_compliance_rag_pipeline("What are MAS FEAT accountability requirements?")),
    },
    {
        "question": "Explain PDPA data breach notification requirements.",
        "context": "PDPA Section 26C requires notification to PDPC within 3 business days...",
        "response": asyncio.run(apac_compliance_rag_pipeline("Explain PDPA data breach notification requirements.")),
    },
]
# APAC: Run test suite
apac_suite_result = apac_llm_suite.run(samples=apac_test_samples)
apac_suite_result.show()
# APAC: Output format:
# Check | Score | Status
# ResponseCoherence | 0.88 | PASS
# ContextAdherence | 0.82 | PASS
# ResponseToxicity | 0.01 | PASS
# Groundedness | 0.79 | FAIL ← blocks deployment
# APAC: Fix: review retrieval quality for PDPA breach notification context
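Turning a pass/fail table like the one above into a CI decision is a small amount of glue code. The sketch below uses a plain dict of scores and thresholds rather than Deepchecks' actual result object, so the shape is an assumption for illustration.

```python
# APAC: hedged sketch of a CI gate over suite-style results — the result
# shape is a plain dict here, not Deepchecks' real result object.
def apac_quality_gate(check_results: dict[str, tuple[float, float, str]]) -> list[str]:
    """Return the list of failing checks.

    check_results maps check name → (score, threshold, direction), where
    direction is 'min' (score must be >= threshold) or 'max' (score <= threshold).
    """
    failures = []
    for name, (score, threshold, direction) in check_results.items():
        ok = score >= threshold if direction == "min" else score <= threshold
        if not ok:
            failures.append(name)
    return failures

apac_failures = apac_quality_gate({
    "ResponseCoherence": (0.88, 0.75, "min"),
    "ContextAdherence": (0.82, 0.80, "min"),
    "ResponseToxicity": (0.01, 0.05, "max"),
    "Groundedness": (0.79, 0.85, "min"),
})
print(apac_failures)  # → ['Groundedness']
if apac_failures:
    # APAC: in CI, raise SystemExit here to block the deployment
    print(f"APAC: deployment blocked by failing checks: {apac_failures}")
```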
Deepchecks APAC ML model drift monitoring
# APAC: Deepchecks — monitor production ML model for feature drift
from deepchecks.tabular import Dataset as DCDataset
# APAC: TrainTestFeatureDrift is named FeatureDrift in newer deepchecks releases
from deepchecks.tabular.checks import TrainTestFeatureDrift
import pandas as pd
# APAC: Load training data distribution (stored as baseline)
apac_train_df = pd.read_parquet("apac_credit_model_train_features_2025q4.parquet")
apac_train_dataset = DCDataset(apac_train_df, label="default_risk", cat_features=["sector", "region"])
# APAC: Load recent production scoring data (last 30 days)
apac_prod_df = pd.read_parquet("apac_credit_model_prod_features_2026q2.parquet")
apac_prod_dataset = DCDataset(apac_prod_df, label="default_risk", cat_features=["sector", "region"])
# APAC: Run feature drift check
apac_drift_check = TrainTestFeatureDrift()
apac_drift_result = apac_drift_check.run(apac_train_dataset, apac_prod_dataset)
# APAC: Display drift results
apac_drift_result.show()
# APAC: Output:
# Feature | Drift Score | Method | Status
# annual_revenue | 0.12 | PSI | PASS
# debt_ratio | 0.31 | PSI | WARN ← investigate
# sector | 0.08 | Cramér's V | PASS
# region | 0.19 | Cramér's V | WARN ← investigate
#
# APAC: debt_ratio and region drifting → likely post-COVID credit environment shift
# APAC: Recommendation: retrain model on 2026 data within 30 days
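The drift table above reports PSI scores for numeric features. Deepchecks computes this internally, but a self-contained sketch of the Population Stability Index over pre-binned proportions makes the scores interpretable; the 0.1 / 0.25 thresholds are the common rule of thumb, not a Deepchecks default.

```python
# APAC: minimal PSI (Population Stability Index) sketch over pre-binned
# proportions — shown to make the drift scores in the table interpretable.
import math

def apac_psi(expected_props: list[float], actual_props: list[float], eps: float = 1e-6) -> float:
    """PSI = sum((actual - expected) * ln(actual / expected)) over bins."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e = max(e, eps)  # avoid log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# APAC: identical distributions → PSI of 0.0 (no drift)
print(round(apac_psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]), 4))  # → 0.0
# APAC: shifted distribution → positive PSI; > 0.25 is commonly read as major drift
print(round(apac_psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40]), 4))
```

Under the common rule of thumb, PSI below 0.1 is stable, 0.1–0.25 warrants investigation (the WARN rows above), and above 0.25 usually triggers retraining.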
APAC LLM Quality Engineering Tool Comparison
Dimension | Vellum | Opik (Comet) | Deepchecks
Primary audience | Product + eng | ML + data science | Data science + ML eng
Technical level | Low-medium | Medium-high | High
Primary use case | Prompt workflow | Tracing + testing | Systematic quality gates
Self-host option | No | Yes (OSS) | Yes (OSS)
LLM tracing | Via monitoring | Strong (core) | Limited
Prompt versioning | Core feature | Limited | No
A/B testing | Core feature | Via experiments | No
ML model testing | No | No | Core feature
RAG quality checks | Via monitoring | Via test datasets | Core feature (Groundedness)
Drift detection | No | No | Core feature
CI/CD integration | API-based | Native | Native (pytest-style)
APAC data residency | Cloud (review) | Self-host option | Self-host option
APAC stack recommendation:
→ Vellum + Deepchecks: product-centric team that wants prompt mgmt + quality gates
→ Opik + Deepchecks: ML-centric team building both models and LLM apps
→ All three: large APAC AI platform team covering all quality dimensions
Related APAC LLM Quality Resources
For the LLM safety and hallucination platforms (Patronus AI, Lakera Guard, Galileo AI) that extend Deepchecks' automated quality checks with dedicated red-teaming, prompt injection protection, and production faithfulness scoring — addressing adversarial safety beyond standard quality metrics — see the APAC LLM safety and hallucination guide.
For the lighter-weight APAC LLM testing and observability tools (Latitude, Parea AI, OpenLLMetry) that address similar problems to Vellum and Opik but with different trade-offs in pricing, openness, and OTel compatibility — see the APAC LLM prompt testing and observability guide.
For the LLMOps platforms (Humanloop, Pezzo, W&B Weave) that combine the prompt management capabilities of Vellum with enterprise-scale human feedback collection and fine-tuning workflows — positioning above Vellum in the maturity ladder for APAC teams ready for full LLMOps — see the APAC LLMOps and prompt management guide.