
APAC LLM Workflow and Testing Guide 2026: Vellum, Opik, and Deepchecks

A practitioner guide for APAC AI engineering and data science teams implementing LLM workflow management, tracing, and quality testing platforms in 2026. It covers Vellum, a product-friendly LLM workflow platform that lets APAC teams version prompts, run A/B tests across prompt variants, build visual multi-step RAG pipelines, and monitor production quality without engineering deployments; Opik by Comet, an open-source LLM tracing and testing platform that auto-instruments OpenAI and Anthropic calls as traces, builds automated test datasets with LLM-as-judge scoring, and integrates with Comet ML experiment tracking for unified ML and LLM observability; and Deepchecks, a continuous testing framework that applies software testing discipline to LLM output quality with automated coherence, groundedness, and toxicity checks, alongside ML data drift detection for teams managing both traditional ML models and LLM applications in production.

By AIMenta Editorial Team

APAC LLM Engineering Quality: Workflow, Tracing, and Testing

APAC AI engineering teams shipping LLM-powered features face three distinct quality challenges: managing prompt versions and experimentation without engineering bottlenecks, tracing LLM application behavior in production to debug quality issues, and applying systematic testing discipline to LLM outputs that don't have a single correct answer. This guide covers three complementary tools APAC teams use to address each layer of the LLM quality problem.

Vellum — LLM workflow platform for APAC product teams, combining prompt versioning, A/B testing, visual workflow builder, and production monitoring without code deployments.

Opik by Comet — open-source LLM tracing and testing platform for APAC ML teams, providing call-level observability and automated test suites that integrate with Comet ML experiment tracking.

Deepchecks — continuous testing framework for APAC LLM applications and ML models, running automated quality checks, drift detection, and RAG pipeline validation in Python-first CI/CD pipelines.


APAC LLM Quality Engineering Tool Selection

APAC Team Profile                      → Tool          → Why

APAC product + eng team, LLM feature   → Vellum         Prompt versioning + A/B without
(non-technical PMs need visibility)    →                code deploys; product-friendly

APAC ML team, Comet ML users           → Opik           Tracing + testing + Comet ML
(open-source preferred, self-host)     →                integration; OSS self-host option

APAC data science, test discipline     → Deepchecks     Python pytest-style quality gates;
(CI/CD gates for AI quality)           →                LLM + ML + RAG in one framework

APAC team, lightweight OTel tracing    → OpenLLMetry    OTel spans to existing Jaeger/
(already have Grafana/Jaeger)          →                Grafana — no new platform needed

APAC team, full LLMOps (eval+human     → Humanloop      Enterprise A/B + human feedback
feedback + fine-tuning)                →                + fine-tuning at scale

APAC LLM Quality Stack by Maturity:
  Starter (1 LLM feature):   Vellum (prompt mgmt) + Parea AI (testing)
  Intermediate (3+ features): Vellum + Opik/Langfuse (tracing) + Deepchecks (quality gates)
  Advanced (ML + LLM):        Opik + Comet ML + Deepchecks (unified experiment tracking)
  Enterprise:                 Humanloop + Patronus AI + Galileo (full-stack)

Vellum: APAC Prompt Versioning and A/B Testing

Vellum APAC Python SDK integration

# APAC: Vellum — run production prompt via SDK without embedding prompt in code

import vellum
import os

apac_vellum = vellum.Vellum(api_key=os.environ["VELLUM_API_KEY"])

# APAC: execute_workflow is a blocking call here, so a plain (non-async) function suffices
def apac_run_compliance_workflow(
    apac_regulation: str,
    apac_query: str,
    apac_context: str,
) -> str | None:
    """APAC: Run production compliance assistant workflow from Vellum."""

    # APAC: Vellum fetches the CURRENTLY DEPLOYED prompt version automatically
    # APAC: Product team can update prompt in Vellum UI → this code uses new version
    apac_result = apac_vellum.execute_workflow(
        workflow_deployment_name="apac-mas-compliance-assistant",
        release_tag="PRODUCTION",       # APAC: pinned to production deployment tag
        inputs=[
            vellum.WorkflowRequestStringInputRequest(
                name="regulation",
                value=apac_regulation,
            ),
            vellum.WorkflowRequestStringInputRequest(
                name="query",
                value=apac_query,
            ),
            vellum.WorkflowRequestStringInputRequest(
                name="context_documents",
                value=apac_context,
            ),
        ],
    )

    # APAC: Extract output from workflow execution; None if the output name is missing
    apac_output = next(
        (o.value for o in apac_result.data.outputs if o.name == "compliance_response"),
        None,
    )
    return apac_output

# APAC: Usage — product team changes prompt in Vellum UI
# APAC: No code deployment needed; next API call uses updated prompt
apac_response = apac_run_compliance_workflow(
    apac_regulation="MAS FEAT",
    apac_query="What are the explainability requirements for credit AI models?",
    apac_context="MAS FEAT Principles document...",
)

Vellum APAC A/B test setup

# APAC: Vellum — configure A/B test between two prompt variants

# APAC: In Vellum UI: Deployments → apac-mas-compliance-assistant → A/B Testing
# APAC: Configure:
#   - Variant A: "apac-prompt-v2-concise" (80% traffic)
#   - Variant B: "apac-prompt-v3-detailed" (20% traffic)
#   - Routing: random per request (no sticky sessions)
#   - Metrics: response_quality score from automated LLM-as-judge

# APAC: Application code unchanged — Vellum routes to correct variant
# APAC: After 500 requests (~400 to variant A, ~100 to variant B at the 80/20 split):
#   → Vellum dashboard shows: variant A score 0.81, variant B score 0.87
#   → Product team promotes variant B to 100% without engineering deployment
#   → Variant A traffic: 0% (archived but retrievable)

# APAC: Track custom metrics by logging quality signals back to Vellum
# APAC: apac_result is the execution result returned by execute_workflow earlier
apac_vellum.submit_completion_actuals(
    deployment_id="apac-mas-compliance-assistant",
    completion_id=apac_result.data.completion_id,
    actuals=[
        vellum.SubmitCompletionActualRequest(
            quality=0.92,  # APAC: automated LLM-as-judge score logged per response
            label="accepted",
        )
    ],
)

# APAC: Vellum aggregates quality scores per variant in real time
# APAC: Statistical significance calculated automatically — promotes winner at threshold
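Vellum calculates significance automatically; the comparison is easy to sanity-check offline. A minimal sketch (pure Python; the function name is ours, not a Vellum API) that treats each judged response as a Bernoulli accept/reject outcome and runs a two-sided two-proportion z-test:

```python
import math

def apac_two_proportion_ztest(accepts_a: int, n_a: int,
                              accepts_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: variants A and B share the same accept rate."""
    p_b = accepts_b / n_b
    p_a = accepts_a / n_a
    p_pool = (accepts_a + accepts_b) / (n_a + n_b)   # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf, then the two-sided tail probability
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# e.g. variant A accepted on 81/100 judged responses, variant B on 87/100
apac_p = apac_two_proportion_ztest(81, 100, 87, 100)
```

At 100 judged responses per variant, a 0.81 vs 0.87 gap gives p ≈ 0.25 — not yet significant at the usual 0.05 level, which is why a platform keeps routing traffic until its significance threshold is reached rather than promoting a winner early.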

Vellum APAC workflow builder for RAG pipeline

APAC: Vellum Workflow Builder — visual RAG pipeline for APAC compliance assistant

┌─────────────────────────────────────────────────────────────┐
│  APAC Workflow: apac-mas-compliance-assistant                │
│                                                             │
│  Input: {regulation, query, user_segment}                   │
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │ Search Node  │───▶│  LLM Node    │───▶│ Guard Node   │  │
│  │              │    │              │    │              │  │
│  │ Qdrant search│    │ gpt-4o-mini  │    │ Check:       │  │
│  │ top_k=5      │    │              │    │ - no advice  │  │
│  │ APAC docs    │    │ Prompt v7    │    │ - cites reg  │  │
│  └──────────────┘    └──────────────┘    └──────┬───────┘  │
│                                                 │          │
│                                                 ▼          │
│                                        ┌──────────────┐    │
│                                        │ Output Node  │    │
│                                        │              │    │
│                                        │ compliance_  │    │
│                                        │ response     │    │
│                                        └──────────────┘    │
└─────────────────────────────────────────────────────────────┘

APAC: Each node version-controlled independently
APAC: Swap LLM node model without changing retrieval or guard logic
APAC: A/B test prompt changes in LLM node with 10/90 traffic split
APAC: Production deployment: tag workflow version as PRODUCTION
APAC: Rollback: re-tag previous version as PRODUCTION in Vellum UI
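The Guard Node's two rules can be approximated outside Vellum with plain Python. A hedged sketch (the pattern list, regulation list, and function name are our own illustrations, not a Vellum API) of the checks shown in the diagram — rejecting financial-advice language and requiring a regulation citation:

```python
import re

# APAC: illustrative advice-language patterns — extend for production use
APAC_ADVICE_PATTERNS = [
    r"\byou should (buy|sell|invest)\b",
    r"\bwe recommend (buying|selling|investing)\b",
]
# APAC: illustrative citation allow-list
APAC_KNOWN_REGULATIONS = ["MAS FEAT", "HKMA", "OJK", "PDPA"]

def apac_guard_check(response: str) -> tuple[bool, list[str]]:
    """Return (passed, failure_reasons) for the two guard rules."""
    failures: list[str] = []
    for pattern in APAC_ADVICE_PATTERNS:
        if re.search(pattern, response, re.IGNORECASE):
            failures.append(f"advice language matched: {pattern}")
    if not any(reg.lower() in response.lower() for reg in APAC_KNOWN_REGULATIONS):
        failures.append("no known regulation cited")
    return (len(failures) == 0, failures)
```

A failing check would route the response back for regeneration or to a fallback message instead of reaching the Output Node.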

Opik by Comet: APAC Open-Source LLM Tracing

Opik APAC SDK tracing setup

# APAC: Opik — instrument APAC LLM application with automatic call tracing

import os

import opik
from opik.integrations.openai import track_openai

opik.configure(api_key=os.environ["OPIK_API_KEY"])  # APAC: or self-hosted URL

# APAC: Auto-instrument OpenAI client — all calls traced automatically
# APAC: AsyncOpenAI, because the pipeline below awaits the completion call
import openai
apac_client = track_openai(openai.AsyncOpenAI())

# APAC: Decorator for function-level tracing
@opik.track(name="apac-compliance-rag")
async def apac_compliance_rag_pipeline(apac_query: str) -> str:
    """APAC: Full RAG pipeline traced end-to-end in Opik."""

    # APAC: Step 1 — Retrieval (tracked as sub-span)
    # APAC: assumes an async Qdrant client initialized elsewhere as apac_qdrant_client
    @opik.track(name="apac-qdrant-retrieval")
    async def apac_retrieve(query: str) -> list[str]:
        return await apac_qdrant_client.search(
            collection_name="apac-mas-regulations",
            query_text=query,
            limit=5,
        )

    apac_chunks = await apac_retrieve(apac_query)
    apac_context = "\n\n".join(apac_chunks)

    # APAC: Step 2 — LLM call (auto-traced by track_openai)
    apac_response = await apac_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Answer based only on this context:\n{apac_context}",
            },
            {"role": "user", "content": apac_query},
        ],
    )
    apac_answer = apac_response.choices[0].message.content

    return apac_answer

# APAC: Opik dashboard shows:
# apac-compliance-rag (1.34s total)
#   ├── apac-qdrant-retrieval (0.08s, 5 docs)
#   └── openai.chat.completions (1.26s, 312 tokens, $0.000047)
# APAC: Click any trace → view full input/output + token breakdown + cost

Opik APAC automated test dataset and scoring

# APAC: Opik — build test dataset and run LLM-as-judge scoring for CI/CD

import opik
from opik import DatasetItem

apac_opik_client = opik.Opik()

# APAC: Create test dataset in Opik (stored in Opik backend)
apac_dataset = apac_opik_client.get_or_create_dataset(
    name="apac-mas-compliance-qa-v2",
)

# APAC: Add test cases (question + expected regulation citation)
apac_test_cases = [
    {"question": "What are MAS FEAT fairness criteria?", "expected_regulation": "FEAT"},
    {"question": "What are HKMA AI governance principles?", "expected_regulation": "HKMA"},
    {"question": "How does OJK regulate AI in lending?",  "expected_regulation": "OJK"},
]

for apac_case in apac_test_cases:
    apac_dataset.insert([DatasetItem(input=apac_case["question"], expected_output=apac_case["expected_regulation"])])

# APAC: Define scorer — checks if regulation is cited in response
def apac_regulation_cited_scorer(dataset_item, llm_output: str) -> float:
    """APAC: Score 1.0 if expected regulation cited in output, else 0.0."""
    apac_expected = dataset_item.expected_output
    return 1.0 if apac_expected.lower() in llm_output.lower() else 0.0

# APAC: Run test suite
apac_results = opik.run_experiment(
    dataset=apac_dataset,
    task=apac_compliance_rag_pipeline,
    scoring_metrics=[apac_regulation_cited_scorer],
    experiment_name="apac-compliance-rag-ci-run-v3",
)

apac_avg_score = apac_results.mean_score("apac_regulation_cited_scorer")
print(f"APAC: Test suite score: {apac_avg_score:.2f}")

# APAC: CI/CD gate
if apac_avg_score < 0.85:
    raise SystemExit(f"APAC: Quality gate failed — {apac_avg_score:.2f} < 0.85 threshold")
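The same gate drops naturally into a pytest-style suite so it fails CI like any other test. A self-contained sketch (hard-coded scores keep it runnable; in a real pipeline the scores would come from the Opik experiment results above):

```python
import statistics

APAC_QUALITY_THRESHOLD = 0.85

def apac_mean_score(scores: list[float]) -> float:
    """Aggregate per-sample scores into the single gated metric."""
    return statistics.fmean(scores)

def test_apac_compliance_quality_gate():
    # APAC: illustrative scores — CI would load real per-case results instead
    apac_scores = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
    assert apac_mean_score(apac_scores) >= APAC_QUALITY_THRESHOLD
```

Running this under pytest makes the quality gate indistinguishable from a unit-test failure, so existing CI alerting and merge-blocking rules apply unchanged.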

Deepchecks: APAC Continuous AI Quality Testing

Deepchecks APAC LLM quality suite

# APAC: Deepchecks — run automated LLM quality checks before production deployment

from deepchecks.llm_eval import LLMTestSuite, SingleSampleCheck
from deepchecks.llm_eval.checks import (
    ResponseCoherence,
    ContextAdherence,
    ResponseToxicity,
    Groundedness,
)

# APAC: Define LLM test suite with quality checks
apac_llm_suite = LLMTestSuite(
    name="APAC Compliance Assistant Quality Suite",
    checks=[
        ResponseCoherence(min_score=0.75),    # APAC: response must be logically coherent
        ContextAdherence(min_score=0.80),     # APAC: must stay within retrieved context
        ResponseToxicity(max_score=0.05),     # APAC: toxicity must be near-zero
        Groundedness(min_score=0.85),         # APAC: claims must be grounded in context
    ],
)

# APAC: Test data — compliance QA pairs with retrieved context
# APAC: asyncio.run is needed because apac_compliance_rag_pipeline is async
import asyncio

apac_test_samples = [
    {
        "question": "What are MAS FEAT accountability requirements?",
        "context": "MAS FEAT requires banks to designate a senior responsible individual...",
        "response": asyncio.run(apac_compliance_rag_pipeline("What are MAS FEAT accountability requirements?")),
    },
    {
        "question": "Explain PDPA data breach notification requirements.",
        "context": "PDPA Section 26D requires notification to the PDPC within 3 calendar days...",
        "response": asyncio.run(apac_compliance_rag_pipeline("Explain PDPA data breach notification requirements.")),
    },
]

# APAC: Run test suite
apac_suite_result = apac_llm_suite.run(samples=apac_test_samples)
apac_suite_result.show()

# APAC: Output format:
# Check                  | Score | Status
# ResponseCoherence      | 0.88  | PASS
# ContextAdherence       | 0.82  | PASS
# ResponseToxicity       | 0.01  | PASS
# Groundedness           | 0.79  | FAIL  ← blocks deployment
# APAC: Fix: review retrieval quality for PDPA breach notification context
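Deepchecks' Groundedness check is model-based; a far cruder token-overlap heuristic (illustrative only — not how Deepchecks scores it) conveys what the check measures: the share of the response's content that is actually supported by the retrieved context:

```python
import re

def apac_token_overlap_groundedness(response: str, context: str) -> float:
    """Fraction of the response's content tokens that also appear in the context."""
    def tokens(text: str) -> set[str]:
        # keep tokens of 4+ chars to skip stopwords like "the", "and", "to"
        return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}
    resp_tokens = tokens(response)
    if not resp_tokens:
        return 0.0
    return len(resp_tokens & tokens(context)) / len(resp_tokens)
```

A low score flags responses whose claims come from somewhere other than the retrieved chunks — the same failure mode the FAIL row above points at, where the fix is usually better retrieval rather than a different prompt.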

Deepchecks APAC ML model drift monitoring

# APAC: Deepchecks — monitor production ML model for feature drift

from deepchecks.tabular import Dataset as DCDataset
from deepchecks.tabular.checks import TrainTestFeatureDrift

import pandas as pd

# APAC: Load training data distribution (stored as baseline)
apac_train_df = pd.read_parquet("apac_credit_model_train_features_2025q4.parquet")
apac_train_dataset = DCDataset(apac_train_df, label="default_risk", cat_features=["sector", "region"])

# APAC: Load recent production scoring data (last 30 days)
apac_prod_df = pd.read_parquet("apac_credit_model_prod_features_2026q2.parquet")
apac_prod_dataset = DCDataset(apac_prod_df, label="default_risk", cat_features=["sector", "region"])

# APAC: Run feature drift check
apac_drift_check = TrainTestFeatureDrift()
apac_drift_result = apac_drift_check.run(apac_train_dataset, apac_prod_dataset)

# APAC: Display drift results
apac_drift_result.show()

# APAC: Output:
# Feature          | Drift Score | Method          | Status
# annual_revenue   | 0.12        | PSI             | PASS
# debt_ratio       | 0.31        | PSI             | WARN  ← investigate
# sector           | 0.08        | Cramér's V      | PASS
# region           | 0.19        | Cramér's V      | WARN  ← investigate
#
# APAC: debt_ratio and region drifting → likely post-COVID credit environment shift
# APAC: Recommendation: retrain model on 2026 data within 30 days
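The PSI column can be reproduced from first principles. A self-contained sketch (equal-width binning and epsilon smoothing are our assumptions; Deepchecks' exact binning may differ):

```python
import math

def apac_psi(baseline: list[float], current: list[float], n_bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant baseline

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # clamp out-of-range current values into the edge bins
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        eps = 1e-6  # smoothing avoids log(0) on empty bins
        total = len(values) + eps * n_bins
        return [(c + eps) / total for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

A common rule of thumb: PSI below 0.1 is stable, 0.1–0.25 is worth investigating, and above 0.25 signals a significant shift — consistent with the WARN flags on debt_ratio (0.31) and region (0.19) above.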

APAC LLM Quality Engineering Tool Comparison

Dimension               Vellum          Opik (Comet)      Deepchecks

Primary audience        Product + eng   ML + data science Data science + ML eng
Technical level         Low-medium      Medium-high       High
Primary use case        Prompt workflow Tracing + testing Systematic quality gates
Self-host option        No              Yes (OSS)         Yes (OSS)
LLM tracing             Via monitoring  Strong (core)     Limited
Prompt versioning       Core feature    Limited           No
A/B testing             Core feature    Via experiments   No
ML model testing        No              No                Core feature
RAG quality checks      Via monitoring  Via test datasets Core feature (Groundedness)
Drift detection         No              No                Core feature
CI/CD integration       API-based       Native            Native (pytest-style)
APAC data residency     Cloud (review)  Self-host option  Self-host option

APAC stack recommendation:
  → Vellum + Deepchecks: product-centric team that wants prompt mgmt + quality gates
  → Opik + Deepchecks: ML-centric team building both models and LLM apps
  → All three: large APAC AI platform team covering all quality dimensions

Related APAC LLM Quality Resources

For the LLM safety and hallucination platforms (Patronus AI, Lakera Guard, Galileo AI) that extend Deepchecks' automated quality checks with dedicated red-teaming, prompt injection protection, and production faithfulness scoring — addressing adversarial safety beyond standard quality metrics — see the APAC LLM safety and hallucination guide.

For the lighter-weight APAC LLM testing and observability tools (Latitude, Parea AI, OpenLLMetry) that address similar problems to Vellum and Opik but with different trade-offs in pricing, openness, and OTel compatibility — see the APAC LLM prompt testing and observability guide.

For the LLMOps platforms (Humanloop, Pezzo, W&B Weave) that combine the prompt management capabilities of Vellum with enterprise-scale human feedback collection and fine-tuning workflows — positioning above Vellum in the maturity ladder for APAC teams ready for full LLMOps — see the APAC LLMOps and prompt management guide.
