APAC Data-Centric AI: Fixing the Data, Not Just the Model
Model architecture improvements have diminishing returns when training data quality is poor. APAC ML teams increasingly adopt a data-centric approach — systematically improving training data quality through smarter annotation prioritization, team management at scale, and automated detection of labeling mistakes. This guide covers three platforms APAC ML teams use to implement data-centric AI practices across the annotation and data quality lifecycle.
Encord — active learning annotation platform helping APAC ML teams identify which unlabeled data will most improve model performance, reducing annotation volume by 40-60% without sacrificing accuracy gains.
SuperAnnotate — enterprise AI annotation platform with AI-assisted pre-labeling and annotator team management for APAC organizations producing training data at production scale across computer vision and NLP.
Cleanlab — automated data quality detection using confident learning, finding label errors, outliers, and near-duplicates in APAC training datasets that degrade model performance without any architectural changes.
APAC Data-Centric AI Tool Selection
APAC Problem → Tool → Why
"We annotate randomly but need to prioritize high-value samples" → Encord → Active learning prioritizes annotation effort by model ROI
"We manage 20+ annotators and need quality control at scale" → SuperAnnotate → Enterprise team management, QA workflows, task routing
"Model performance is stuck and we suspect our labels are noisy" → Cleanlab → Finds label errors that teach the model wrong patterns
"We want lowest-cost self-hosted annotation with full control" → Label Studio → Open-source; full annotation-type flexibility; no vendor lock-in
"We need RLHF + enterprise multimodal annotation from an external workforce" → Scale AI → Managed human annotation with APAC language coverage
APAC Data-Centric AI Framework:
Phase 1: Smart collection → Encord active learning (which data to label?)
Phase 2: Quality annotation → SuperAnnotate team mgmt (who labels it well?)
Phase 3: Data cleaning → Cleanlab error detection (what's wrong with labels?)
Phase 4: Model training → standard ML pipeline
Phase 5: Quality feedback → Encord analytics (where is model still weak?)
→ Repeat with Phase 1 informed by Phase 5 findings
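Stripped to its control flow, the five-phase loop can be sketched in a few lines. The helper names here are hypothetical placeholders for the platform calls described in this guide, not real APIs:

```python
# Minimal sketch of the data-centric loop: each iteration labels the
# least-confident samples first (Phase 1), then folds them into the
# labeled pool (Phases 2-3) before the next retrain-and-rescore pass.
def select_batch(unlabeled, model_scores, k):
    # Phase 1: pick the k least-confident samples to label next
    ranked = sorted(unlabeled, key=lambda s: model_scores[s])
    return ranked[:k]

def run_iteration(labeled, unlabeled, model_scores, k=2):
    batch = select_batch(unlabeled, model_scores, k)  # Phase 1: smart collection
    labeled = labeled | set(batch)                    # Phases 2-3: annotate + clean
    unlabeled = unlabeled - set(batch)                # Phases 4-5: retrain, re-score
    return labeled, unlabeled

labeled, unlabeled = {"a"}, {"b", "c", "d"}
scores = {"b": 0.9, "c": 0.4, "d": 0.6}  # model confidence per unlabeled sample
labeled, unlabeled = run_iteration(labeled, unlabeled, scores)
print(sorted(labeled))  # ['a', 'c', 'd'] — lowest-confidence samples labeled first
```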
Encord: APAC Active Learning Annotation
Encord APAC active learning pipeline
# APAC: Encord — active learning to prioritize annotation of highest-value samples
from encord import EncordUserClient
from encord.objects import OntologyStructure
apac_client = EncordUserClient.create_with_ssh_private_key(
ssh_private_key_path="/home/apac-user/.ssh/encord_ed25519",
)
# APAC: Get project with labeled and unlabeled data
apac_project = apac_client.get_project("apac-chest-xray-pneumonia-detection")
# APAC: Step 1 — Export current model predictions on unlabeled data
# APAC: (apac_inference_model is assumed to be the model trained on the current
# APAC: labeled subset, loaded earlier; it predicts on the unlabeled pool)
apac_unlabeled_predictions = []
for apac_data_unit in apac_project.list_unlabeled_data_units():
apac_pred = apac_inference_model.predict(apac_data_unit.image_path)
apac_unlabeled_predictions.append({
"data_unit_id": apac_data_unit.uid,
"predicted_class": apac_pred["class"],
"confidence": apac_pred["confidence"],
# APAC: Uncertainty = distance from decision boundary
# Low confidence = high uncertainty = most informative to label
"uncertainty": 1.0 - apac_pred["confidence"],
})
# APAC: Step 2 — Sort by uncertainty (highest uncertainty = most valuable to annotate)
apac_prioritized = sorted(
apac_unlabeled_predictions,
key=lambda x: x["uncertainty"],
reverse=True,
)
# APAC: Step 3 — Add top-100 most uncertain samples to annotation queue
apac_annotation_batch = apac_prioritized[:100]
for apac_sample in apac_annotation_batch:
apac_project.add_to_annotation_queue(
data_unit_id=apac_sample["data_unit_id"],
priority=apac_sample["uncertainty"], # APAC: Encord sorts by priority in UI
batch_name="apac-active-learning-batch-03",
)
print(f"APAC: Added {len(apac_annotation_batch)} high-uncertainty samples to annotation queue")
print(f"APAC: Uncertainty range: {apac_annotation_batch[0]['uncertainty']:.3f} - {apac_annotation_batch[-1]['uncertainty']:.3f}")
# APAC: Why this matters (illustrative numbers from one pilot):
# Random annotation of 1,000 samples → model accuracy: 84.2%
# Active learning on 400 samples → model accuracy: 84.7%
# APAC: 60% fewer labels for equal-or-better accuracy
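The script above scores uncertainty as 1 − confidence, which is adequate for binary tasks; for multiclass problems, margin and entropy sampling use the full probability vector. A minimal sketch in plain NumPy, independent of Encord:

```python
import numpy as np

# Two common multiclass uncertainty scores for active learning.
def margin_uncertainty(pred_probs):
    # Small gap between the top-2 classes = sample sits near a decision boundary
    part = np.sort(pred_probs, axis=1)
    return 1.0 - (part[:, -1] - part[:, -2])

def entropy_uncertainty(pred_probs):
    # High entropy = probability mass spread across many classes
    eps = 1e-12
    return -np.sum(pred_probs * np.log(pred_probs + eps), axis=1)

probs = np.array([
    [0.98, 0.01, 0.01],  # confident → low uncertainty
    [0.40, 0.35, 0.25],  # ambiguous → high uncertainty
])
print(margin_uncertainty(probs).round(2))   # [0.03 0.95]
print(entropy_uncertainty(probs).argmax())  # 1
```

Either score can replace the `1.0 - confidence` field in the prioritization step above without changing the rest of the pipeline.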
Encord APAC annotation quality monitoring
# APAC: Encord — monitor inter-annotator agreement across APAC annotation team
# APAC: Get annotation quality metrics for current batch
apac_quality_metrics = apac_project.get_annotation_quality_metrics(
batch_name="apac-active-learning-batch-03",
)
for apac_category in apac_quality_metrics["categories"]:
apac_iaa = apac_category["inter_annotator_agreement"]
if apac_iaa < 0.75:
print(
f"APAC: Low agreement on '{apac_category['name']}': "
f"IAA={apac_iaa:.2f} — review annotation guideline for this class"
)
# APAC: Typical output for APAC medical AI:
# Low agreement on 'early_infiltrate': IAA=0.61
# → APAC radiologists disagree on borderline cases
# → Action: add visual examples to APAC annotation guidelines for this class
# → Run consensus review: 3 radiologists review and vote on each disputed case
# APAC: Encord surfaces annotator-level performance
for apac_annotator in apac_quality_metrics["annotators"]:
if apac_annotator["accuracy_vs_consensus"] < 0.80:
print(
f"APAC: Annotator {apac_annotator['id']} below threshold: "
f"{apac_annotator['accuracy_vs_consensus']:.2f} — additional training needed"
)
SuperAnnotate: APAC Enterprise Annotation at Scale
SuperAnnotate APAC Python SDK integration
# APAC: SuperAnnotate — programmatic dataset and annotation management
from superannotate import SAClient
import os
# APAC: recent SDK versions authenticate through the SAClient class
sa = SAClient(token=os.environ["SUPERANNOTATE_TOKEN"])
# APAC: Create new annotation project for APAC retail object detection
apac_project = sa.create_project(
project_name="APAC Retail Shelf Detection Q2 2026",
project_description="Product detection and classification for APAC supermarket chain",
project_type="Vector", # APAC: bounding box and polygon annotation
settings=[
{"attribute": "ImageQuality", "value": "compressed"},
{"attribute": "AnnotatorType", "value": "internal"},
],
)
# APAC: Upload images from APAC cloud storage
sa.attach_items_from_s3(
project=apac_project["name"],
s3_bucket="apac-retail-images",
s3_folder="shelf-photos/singapore-2026-q2/",
annotation_status="NotStarted",
)
# APAC: Assign annotation tasks to APAC team members
apac_team_members = ["[email protected]", "[email protected]", "[email protected]"]
apac_unassigned = sa.search_items(project=apac_project["name"], annotation_status="NotStarted")
sa.assign_items(
    project=apac_project["name"],
    items=[apac_item["name"] for apac_item in apac_unassigned[:300]],
    user=apac_team_members[0],
)
print(f"APAC: Assigned {min(300, len(apac_unassigned))} images to {apac_team_members[0]}")
# APAC: Upload pre-labels (from existing model predictions)
# APAC: apac_items_with_predictions is assumed to pair item names with
# APAC: annotation JSON in SuperAnnotate's format (upload call varies by SDK version)
for apac_item in apac_items_with_predictions:
    sa.upload_image_annotations(
        project=apac_project["name"],
        image_name=apac_item["name"],
        annotation_json=apac_item["predicted_annotations"],
    )
sa.set_annotation_statuses(
    project=apac_project["name"],
    annotation_status="InProgress",  # APAC: annotators review pre-labels rather than starting from scratch
    items=[apac_item["name"] for apac_item in apac_items_with_predictions],
)
print("APAC: Pre-labels uploaded — annotators verify and correct instead of annotating from scratch")
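The pre-label upload assumes annotations are already in SuperAnnotate's vector format. A sketch of converting raw model detections into that shape; the field names are an assumed approximation of the schema, so verify them against the SDK docs before use:

```python
# Convert model detections into a SuperAnnotate-style vector annotation
# payload. Schema details (assumed here) are version-dependent.
def to_sa_prelabel(item_name, detections):
    instances = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        instances.append({
            "type": "bbox",
            "className": det["label"],
            "probability": det["score"],  # annotator sees model confidence
            "points": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
        })
    return {"metadata": {"name": item_name}, "instances": instances}

payload = to_sa_prelabel(
    "shelf_001.jpg",
    [{"box": (10, 20, 110, 220), "label": "cereal_box", "score": 0.91}],
)
print(payload["instances"][0]["className"])  # cereal_box
```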
SuperAnnotate APAC NLP annotation for multilingual models
# APAC: SuperAnnotate — NLP annotation for APAC multilingual intent classification
# APAC: Create NLP annotation project for APAC customer service chatbot training
apac_nlp_project = sa.create_project(
project_name="APAC Customer Service Intent Annotation",
project_description="Intent and entity annotation for APAC multilingual chatbot training",
project_type="Conversational", # APAC: conversation annotation type for chatbot data
)
# APAC: Configure APAC NLP taxonomy
apac_nlp_ontology = {
"intents": [
"account_inquiry",
"payment_dispute",
"product_return",
"technical_support",
"general_inquiry",
"escalation_request",
],
"entities": [
{"name": "APAC_ACCOUNT_NUMBER", "type": "ENTITY"},
{"name": "APAC_PRODUCT_SKU", "type": "ENTITY"},
{"name": "APAC_DATE", "type": "ENTITY"},
{"name": "APAC_AMOUNT_SGD", "type": "ENTITY"},
],
"languages": ["en", "zh", "ms", "id"],  # APAC: English, Mandarin, Malay, Indonesian
}
# APAC: register the intent taxonomy as an annotation class
# APAC: (attribute-group schema abbreviated; check the SDK docs for the full shape)
sa.create_annotation_class(
    project=apac_nlp_project["name"],
    name="Intent",
    color="#2684FF",
    attribute_groups=[{
        "name": "intent",
        "attributes": [{"name": apac_intent} for apac_intent in apac_nlp_ontology["intents"]],
    }],
)
# APAC: Route annotation tasks to language-specific annotators
apac_language_routing = {
"en": ["[email protected]"],
"zh": ["[email protected]", "[email protected]"],
"ms": ["[email protected]"],
"id": ["[email protected]"],
}
# APAC: Each language's conversations routed to native-speaker annotators
# APAC: Quality consensus: a 15% QA sample is double-annotated (at least 2 annotators) so agreement can be measured
print("APAC: NLP annotation project configured for 4-language APAC chatbot training")
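The routing and QA-sampling policy above can be sketched as plain Python. The emails, routing table, and helper are hypothetical, not SuperAnnotate API calls:

```python
import random

# Route each conversation to a native-speaker annotator; a 15% QA sample
# gets a second annotator from the same language pool for agreement checks.
language_routing = {
    "en": ["[email protected]"],
    "zh": ["[email protected]", "[email protected]"],
}

def assign(conversations, routing, qa_rate=0.15, seed=42):
    rng = random.Random(seed)
    assignments = []
    for conv in conversations:
        pool = routing[conv["lang"]]
        primary = rng.choice(pool)
        # double-annotate the QA sample when the pool has a second annotator
        needs_qa = rng.random() < qa_rate and len(pool) > 1
        secondary = next(a for a in pool if a != primary) if needs_qa else None
        assignments.append({"id": conv["id"], "primary": primary, "secondary": secondary})
    return assignments

convs = [{"id": i, "lang": "zh" if i % 2 else "en"} for i in range(10)]
result = assign(convs, language_routing)
print(sum(1 for a in result if a["secondary"]))  # count of double-annotated conversations
```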
Cleanlab: APAC Automated Label Error Detection
Cleanlab APAC Python library for label error detection
# APAC: Cleanlab — find label errors in APAC credit risk training dataset
from cleanlab.filter import find_label_issues
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# APAC: Load APAC credit risk dataset with potentially noisy labels
apac_df = pd.read_parquet("apac_credit_risk_labeled_2025.parquet")
# APAC: feature columns = everything except identifiers and the label
apac_features = [c for c in apac_df.columns if c not in ("customer_id", "risk_label")]
apac_X = apac_df[apac_features].values  # APAC: feature matrix
apac_y = apac_df["risk_label"].values   # APAC: labels: 0=low, 1=medium, 2=high
# APAC: Step 1 — Get cross-validated predicted probabilities
# APAC: (Cleanlab needs model probabilities per class, not just predictions)
from sklearn.model_selection import cross_val_predict
apac_model = RandomForestClassifier(n_estimators=200, random_state=42)
apac_pred_probs = cross_val_predict(
apac_model,
apac_X,
apac_y,
cv=5,
method="predict_proba",
)
# APAC: Step 2 — Find label issues using confident learning
apac_label_issues = find_label_issues(
labels=apac_y,
pred_probs=apac_pred_probs,
return_indices_ranked_by="self_confidence", # APAC: most likely errors first
)
print(f"APAC: Found {len(apac_label_issues)} potential label errors out of {len(apac_y)} samples")
print(f"APAC: Estimated label error rate: {len(apac_label_issues)/len(apac_y)*100:.1f}%")
# APAC: Typical output:
# Found 847 potential label errors out of 12,400 samples
# Estimated label error rate: 6.8%
# APAC: error rates in this range are common — published audits of popular ML
# APAC: benchmarks find errors in roughly 3% of test labels, some datasets far higher
# APAC: Step 3 — Inspect the most suspicious label errors
apac_error_df = apac_df.iloc[apac_label_issues[:20]].copy()
apac_error_df["predicted_class"] = apac_pred_probs[apac_label_issues[:20]].argmax(axis=1)
print("\nAPAC: Top 20 most likely label errors:")
print(apac_error_df[["customer_id", "risk_label", "predicted_class", "annual_revenue", "debt_ratio"]])
# APAC: Output shows cases where:
# - Model predicts "low risk" but label says "high risk" → likely annotation error
# - Annotator may have confused currency units (USD vs SGD) when assessing revenue
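Under the hood, find_label_issues applies confident learning: a sample is suspect when the model confidently predicts a different class than its given label, with "confidently" calibrated per class. A stripped-down version of that rule on synthetic data with injected noise; this illustrates the principle, not cleanlab's exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset with 10% of labels deliberately flipped
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.0, random_state=0)
rng = np.random.default_rng(0)
noisy_y = y.copy()
flipped = rng.choice(len(y), size=100, replace=False)
noisy_y[flipped] = 1 - noisy_y[flipped]

# Out-of-sample probabilities, as in the credit-risk example above
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, noisy_y, cv=5, method="predict_proba"
)
pred_label = pred_probs.argmax(axis=1)
# Per-class threshold: mean predicted probability of class k among samples labeled k
thresholds = np.array([pred_probs[noisy_y == k, k].mean() for k in range(2)])
# Suspect = model disagrees with the given label AND exceeds the class threshold
suspect = np.where(
    (pred_label != noisy_y) & (pred_probs.max(axis=1) >= thresholds[pred_label])
)[0]
precision = np.isin(suspect, flipped).mean()
print(f"Flagged {len(suspect)} suspect labels; {precision:.0%} are injected flips")
```

Most of the flagged samples are the deliberately flipped ones, which is why correcting the top-ranked issues first (as in Step 3 above) pays off quickly.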
Cleanlab APAC near-duplicate and outlier detection
# APAC: Cleanlab — find near-duplicates and outliers in APAC image dataset
from cleanlab.outlier import OutOfDistribution
from cleanlab.dataset import find_overlapping_classes
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
# APAC: Extract image embeddings using a pretrained ResNet
# APAC: (`pretrained=True` is deprecated in recent torchvision; use the weights enum)
apac_feature_extractor = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
apac_feature_extractor.fc = torch.nn.Identity()
apac_feature_extractor.eval()  # APAC: inference mode for feature extraction
apac_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# APAC: Extract embeddings for all training images
# APAC: apac_image_paths is assumed to list the dataset's image files
apac_embeddings = []
for apac_image_path in apac_image_paths:
apac_img = Image.open(apac_image_path).convert("RGB")
apac_tensor = apac_transform(apac_img).unsqueeze(0)
with torch.no_grad():
apac_embedding = apac_feature_extractor(apac_tensor).squeeze().numpy()
apac_embeddings.append(apac_embedding)
apac_embeddings = np.array(apac_embeddings)
# APAC: Detect out-of-distribution outliers
apac_ood = OutOfDistribution()
# APAC: feature-based OOD scoring; labels are not needed in this mode
apac_ood_scores = apac_ood.fit_score(features=apac_embeddings)
# APAC: Flag images with very low in-distribution scores
apac_outlier_threshold = np.percentile(apac_ood_scores, 2) # bottom 2%
apac_outliers = np.where(apac_ood_scores < apac_outlier_threshold)[0]
print(f"APAC: Detected {len(apac_outliers)} outlier images to review")
# APAC: Outliers often: corrupted images, wrong category, extreme edge cases
# APAC: Review and remove before training to reduce noise in model learning
# APAC: Detect overlapping class definitions (ambiguous category boundaries)
# APAC: labels/pred_probs must come from THIS image classifier (cross-validated
# APAC: probabilities, computed as in the tabular example); apac_class_names is
# APAC: the ordered list of image class names
apac_overlaps = find_overlapping_classes(
    labels=apac_y,
    pred_probs=apac_pred_probs,
    class_names=apac_class_names,
)
print("\nAPAC: Most confused class pairs:")
# APAC: find_overlapping_classes returns a DataFrame ranked by joint probability
for _, apac_row in apac_overlaps.head(3).iterrows():
    print(
        f"  '{apac_row['Class Name A']}' ↔ '{apac_row['Class Name B']}': "
        f"{apac_row['Joint Probability']:.3f} joint probability"
    )
# APAC: High overlap → annotation guidelines for these classes need clarification
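The section heading also mentions near-duplicates, which the snippet above does not cover. A simple stand-in for cleanlab's Datalab duplicate check, using cosine similarity over the same embeddings; fine for a few thousand images, but use approximate nearest-neighbor search beyond that:

```python
import numpy as np

# Flag image pairs whose embedding cosine similarity exceeds a threshold —
# near-duplicates that can leak between train and test splits.
def find_near_duplicates(embeddings, threshold=0.97):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # keep the upper triangle only, which excludes self-similarity on the diagonal
    i, j = np.where(np.triu(sims, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))

emb = np.array([
    [1.0, 0.0, 0.0],
    [0.999, 0.02, 0.0],  # near-duplicate of row 0
    [0.0, 1.0, 0.0],
])
print(find_near_duplicates(emb))  # [(0, 1)]
```

Review flagged pairs before training: keep one copy, and never let the two land on opposite sides of a train/test split.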
APAC Data-Centric AI ROI: Before and After
Case study: APAC medical imaging model (pneumonia detection from chest X-rays)
Starting state:
- 8,000 labeled images from 3-year annotation campaign
- Model accuracy: 82.3% on held-out test set
- 3 months of additional model architecture experimentation: +0.8% accuracy gain
- Engineering conclusion: "We've hit an architecture ceiling"
Data-centric investigation with Cleanlab + Encord:
Cleanlab analysis:
- Label error rate: 9.4% (751 images with likely incorrect labels)
- Most common error: "pneumonia" labeled as "normal" for borderline cases
- Root cause: annotation guidelines had ambiguous criteria for early-stage pneumonia
After correcting 600 highest-confidence label errors:
- Model accuracy: 85.1% (+2.8 points with no architecture change)
Encord active learning for next 500 annotations:
- Targeted borderline early-stage pneumonia cases (highest model uncertainty)
- Model accuracy: 87.4% (a further +2.3 points from 500 targeted labels)
Total improvement: +5.1 accuracy points (82.3% → 87.4%)
Architecture experimentation: +0.8 points (3 months of engineering time)
Data-centric intervention: +5.1 points (6 weeks of data-team effort)
APAC lesson: For production models in mature categories,
data quality often has more headroom than architecture.
Cleanlab finds the ceiling; Encord helps you push through it.
Related APAC Data Labeling Resources
For the enterprise annotation platforms (Scale AI, Labelbox, V7 Labs) that handle high-volume APAC annotation production, where an external workforce, RLHF data, and model-assisted active learning are the primary requirements, see the APAC enterprise data labeling guide; they complement Encord and SuperAnnotate for large-scale data production.
For the open-source annotation tools (Label Studio, Roboflow, Argilla) that provide the labeling interface where APAC teams correct the errors Cleanlab surfaces, see the APAC ML data labeling guide.
For the ML monitoring platforms (Evidently AI, WhyLabs, Arize AI) that detect production data drift and model degradation, the production-time complement to Cleanlab's training-time data quality analysis, see the APAC ML model monitoring guide.