APAC ML Reproducibility: Versioning Data, Code, and Experiments Together
APAC data science teams face a reproducibility crisis: a model trained on dataset version 3 at code commit A7F produces different results than the same commit run on version 4, and most teams have no systematic way to track which combination of data, code, and hyperparameters produced which model. This guide covers three open-source-first tools APAC teams use to establish ML reproducibility through data versioning, experiment tracking, and integrated project collaboration.
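The core of the fix is making the (data, code, hyperparameters) triple identifiable. As a minimal illustration of the idea the tools below automate, a run fingerprint can be derived by hashing the three together; all names and hash values here are hypothetical:

```python
import hashlib
import json

def run_fingerprint(data_digest: str, code_commit: str, params: dict) -> str:
    """Derive a stable ID for a training run from its data, code, and params.

    data_digest: content hash of the training dataset (e.g. from DVC)
    code_commit: Git commit SHA of the training code
    params: hyperparameters, serialized deterministically
    """
    payload = json.dumps(
        {"data": data_digest, "code": code_commit, "params": params},
        sort_keys=True,  # deterministic key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Same inputs always yield the same ID; any change yields a new one.
fp_a = run_fingerprint("d4f1", "a7f3c91", {"learning_rate": 0.05, "max_depth": 4})
fp_b = run_fingerprint("d4f1", "a7f3c91", {"max_depth": 4, "learning_rate": 0.05})
fp_c = run_fingerprint("e902", "a7f3c91", {"learning_rate": 0.05, "max_depth": 4})
print(fp_a == fp_b, fp_a == fp_c)  # key order ignored; data change detected
```

DVC, Aim, and DagsHub each record this linkage automatically rather than asking teams to maintain it by hand.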
DagsHub — ML project collaboration platform combining Git hosting, DVC data versioning, experiment tracking, and model registry in a GitHub-like interface for APAC data science teams.
Aim — open-source self-hosted ML experiment tracker for APAC research teams who need full data sovereignty, rich run comparison, and hyperparameter visualization without cloud vendor dependency.
DVC — Git-compatible data version control for APAC ML teams, versioning large datasets and model artifacts using familiar Git workflows stored in APAC cloud storage.
APAC ML Reproducibility Tool Selection
APAC Team Profile → Tool → Why
APAC DS team, git-native, DVC users (GitHub for ML; want one platform) → DagsHub → full ML collaboration: code + data + experiments + model registry
APAC research team, data sovereignty (gov, defense, regulated industries) → Aim → self-hosted; no cloud vendor; rich run comparison; free OSS
APAC team, data versioning priority (large datasets, reproducible pipelines) → DVC → Git extension for large files; pipeline caching; storage-agnostic
APAC team, cloud-native, rich UI (comfortable with cloud, budget OK) → W&B → best-in-class UI; real-time training viz; sweep automation
APAC team, enterprise MLOps platform (production model registry needed) → MLflow → battle-tested; model registry; Databricks integration
APAC ML Reproducibility Stack:
Code versioning: Git (universal)
Data versioning: DVC (open-source) or DagsHub (managed)
Experiment tracking: Aim (self-hosted) or DagsHub (integrated) or MLflow/W&B
Model registry: DagsHub model registry or MLflow model registry
Collaboration: DagsHub (all-in-one) or separate Git + experiment tracking
DVC: APAC Data Version Control Foundation
DVC APAC setup and dataset versioning
# APAC: DVC setup for ML project with S3 remote storage
# Initialize DVC in existing Git repo
git init apac-credit-model
cd apac-credit-model
dvc init
git add .dvc/
git commit -m "APAC: Initialize DVC for ML project"
# APAC: Configure remote storage (APAC S3 bucket for data sovereignty)
dvc remote add -d apac-s3-remote s3://apac-ml-data-singapore/credit-model/
dvc remote modify apac-s3-remote region ap-southeast-1
git add .dvc/config
git commit -m "APAC: Configure S3 remote storage (Singapore region)"
# APAC: Track first dataset version
dvc add data/apac_credit_training_v1.parquet
# Creates: data/apac_credit_training_v1.parquet.dvc (metadata file)
# .gitignore updated: data/apac_credit_training_v1.parquet (excluded from Git)
git add data/apac_credit_training_v1.parquet.dvc data/.gitignore
git commit -m "APAC: Add training dataset v1 (47,230 records, Singapore 2024)"
# APAC: Push data to S3 remote
dvc push
# Uploads: apac_credit_training_v1.parquet → S3 (any team member can dvc pull)
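For reference, the `.dvc` metadata file that `dvc add` writes is a small YAML pointer that Git versions in place of the data itself; the hash and size below are illustrative, not real values:

```
# data/apac_credit_training_v1.parquet.dvc (illustrative contents)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # content hash of the data (illustrative)
  size: 18734592                          # bytes (illustrative)
  path: apac_credit_training_v1.parquet
```

Checking out any Git commit and running `dvc checkout` restores the exact dataset version that commit's `.dvc` files reference.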
DVC APAC pipeline definition
# APAC: DVC pipeline — define ML stages as reproducible DAG
# File: dvc.yaml
# APAC: Define pipeline stages
stages:
  preprocess:
    cmd: python src/preprocess.py --input data/raw/ --output data/processed/
    deps:
      - src/preprocess.py
      - data/raw/                          # APAC: tracked by DVC
    outs:
      - data/processed/features.parquet    # APAC: DVC caches this output
    params:
      - params.yaml:
          - preprocessing.outlier_threshold
          - preprocessing.encoding_method
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/features.parquet    # APAC: must complete after preprocess
    outs:
      - models/apac_credit_model_v1.pkl    # APAC: tracked as DVC artifact
    metrics:
      - metrics/train_results.json:
          cache: false                     # APAC: metrics tracked in Git, not DVC
    params:
      - params.yaml:
          - model.learning_rate
          - model.max_depth
          - model.n_estimators
# APAC: Run only changed pipeline stages (DVC caches unchanged outputs)
dvc repro
# APAC: Output:
# Stage 'preprocess': cached (inputs unchanged — skipping)
# Stage 'train': running (hyperparameters changed in params.yaml)
# → Only the training stage reruns; preprocessing output reused from cache
# APAC: 40-minute preprocessing step skipped → 3-minute experiment iteration
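The params.yaml file that both stages reference might look like this; the values are illustrative:

```
# params.yaml — single source of truth for pipeline parameters (illustrative values)
preprocessing:
  outlier_threshold: 3.0
  encoding_method: target
model:
  learning_rate: 0.05
  max_depth: 4
  n_estimators: 500
```

Because DVC tracks these keys as stage dependencies, editing `model.learning_rate` invalidates only the train stage while the preprocess cache stays valid.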
DVC APAC experiment management
# APAC: DVC experiments — run and compare hyperparameter configurations
# APAC: Run experiment with modified hyperparameters (without changing params.yaml)
dvc exp run --set-param model.learning_rate=0.01 --set-param model.max_depth=6
dvc exp run --set-param model.learning_rate=0.05 --set-param model.max_depth=4
dvc exp run --set-param model.learning_rate=0.10 --set-param model.max_depth=8
# APAC: Compare all experiments
dvc exp show
# APAC: Output table:
# Experiment | model.lr | model.depth | val_auc | val_precision
# workspace | 0.05 | 4 | 0.847 | 0.891
# exp-abc123 | 0.01 | 6 | 0.831 | 0.876
# exp-def456 | 0.10 | 8 | 0.852 | 0.883 ← best AUC
# main | 0.05 | 4 | 0.847 | 0.891
# APAC: Promote best experiment to production branch
dvc exp apply exp-def456
git add .
git commit -m "APAC: Apply best hyperparams from DVC experiment (lr=0.10, depth=8, AUC=0.852)"
dvc push # Push winning model artifact to S3
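Picking the winner can also be scripted; `dvc exp show` supports machine-readable output (`--json` in recent versions). A sketch of selecting the best run from already-parsed rows, where the row shape is a simplified, hypothetical stand-in for the parsed output:

```python
# Pick the experiment with the highest validation AUC from parsed rows.
# Row shape is a simplified, hypothetical stand-in for parsed `dvc exp show` output.
rows = [
    {"name": "workspace",  "learning_rate": 0.05, "max_depth": 4, "val_auc": 0.847},
    {"name": "exp-abc123", "learning_rate": 0.01, "max_depth": 6, "val_auc": 0.831},
    {"name": "exp-def456", "learning_rate": 0.10, "max_depth": 8, "val_auc": 0.852},
]

best = max(rows, key=lambda r: r["val_auc"])
print(f"dvc exp apply {best['name']}")  # command to promote the winning run
```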
Aim: APAC Self-Hosted Experiment Tracking
Aim APAC Python SDK integration
# APAC: Aim — log training metrics with full data sovereignty (self-hosted)
from aim import Run
import torch
import torch.nn as nn
# APAC: Initialize Aim run (logs to local Aim repository, no cloud)
apac_run = Run(
    repo="/opt/apac-ml/aim-repo",  # APAC: self-hosted Aim repository path
    experiment="apac-nlp-intent-classification-v3",
)

# APAC: Log hyperparameters
apac_run["hyperparameters"] = {
    "model_type": "bert-base-multilingual",
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 10,
    "apac_languages": ["en", "zh", "ja", "ko"],
    "max_sequence_length": 128,
}

# APAC: Log APAC dataset metadata
apac_run["dataset"] = {
    "name": "apac-customer-service-intents-v4",
    "train_samples": 45_230,
    "val_samples": 5_650,
    "num_classes": 18,
    "languages": ["en", "zh", "ja", "ko"],
}

# APAC: Training loop with metric logging
for apac_epoch in range(10):
    apac_train_loss = apac_train_one_epoch(apac_epoch)
    apac_val_loss, apac_val_accuracy = apac_validate(apac_epoch)

    # APAC: Log scalar metrics per epoch
    apac_run.track(apac_train_loss, name="train_loss", epoch=apac_epoch)
    apac_run.track(apac_val_loss, name="val_loss", epoch=apac_epoch)
    apac_run.track(apac_val_accuracy, name="val_accuracy", epoch=apac_epoch)

    # APAC: Log per-language validation accuracy
    for apac_lang in ["en", "zh", "ja", "ko"]:
        apac_lang_acc = apac_validate_language(apac_epoch, apac_lang)
        apac_run.track(apac_lang_acc, name=f"val_accuracy_{apac_lang}", epoch=apac_epoch)

    print(f"APAC Epoch {apac_epoch}: train_loss={apac_train_loss:.4f}, val_acc={apac_val_accuracy:.4f}")

apac_run.close()
print("APAC: Run logged to self-hosted Aim repository — no data sent externally")
Aim APAC run comparison query
# APAC: Aim — query and compare experiment runs programmatically
from aim import Repo
apac_repo = Repo("/opt/apac-ml/aim-repo")
# APAC: Query all runs from the APAC NLP experiment
apac_runs = apac_repo.query_runs(
    "run.experiment == 'apac-nlp-intent-classification-v3'"
)

# APAC: Find best runs by validation accuracy
apac_results = []
for apac_run in apac_runs.iter_runs():
    apac_metrics = apac_run.get_metric("val_accuracy")
    if apac_metrics:
        apac_best_acc = max(apac_metrics.values.tolist())
        apac_results.append({
            "run_hash": apac_run.hash,
            "learning_rate": apac_run["hyperparameters"]["learning_rate"],
            "batch_size": apac_run["hyperparameters"]["batch_size"],
            "best_val_accuracy": apac_best_acc,
        })

# APAC: Sort by validation accuracy — identify top experiments
apac_results.sort(key=lambda x: x["best_val_accuracy"], reverse=True)

print("APAC: Top 5 runs by validation accuracy:")
for i, apac_result in enumerate(apac_results[:5]):
    print(
        f"  {i + 1}. Run {apac_result['run_hash'][:8]}: "
        f"lr={apac_result['learning_rate']}, "
        f"batch={apac_result['batch_size']}, "
        f"acc={apac_result['best_val_accuracy']:.4f}"
    )
DagsHub: APAC Integrated ML Project Collaboration
DagsHub APAC MLflow integration
# APAC: DagsHub — log experiments with MLflow SDK to DagsHub backend
import mlflow
import os
# APAC: Point MLflow to DagsHub tracking server
os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/apac-team/credit-model.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "apac-ml-user"
os.environ["MLFLOW_TRACKING_PASSWORD"] = os.environ["DAGSHUB_TOKEN"]
# APAC: Standard MLflow experiment logging — all data goes to DagsHub
mlflow.set_experiment("apac-credit-risk-v3")  # set_experiment takes a name; start_run's experiment_id expects a numeric ID
with mlflow.start_run():
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("n_estimators", 500)
    mlflow.log_param("apac_markets", "SG,HK,MY,ID")

    # APAC: Train model
    apac_model = apac_train_credit_model(apac_params)

    # APAC: Log metrics
    mlflow.log_metric("val_auc", apac_val_auc)
    mlflow.log_metric("val_precision", apac_val_precision)
    mlflow.log_metric("val_recall", apac_val_recall)
    mlflow.log_metric("apac_sg_auc", apac_sg_auc)  # APAC: per-market breakdown
    mlflow.log_metric("apac_hk_auc", apac_hk_auc)
    mlflow.log_metric("apac_my_auc", apac_my_auc)

    # APAC: Log model artifact
    mlflow.xgboost.log_model(apac_model, "credit-risk-model")
# APAC: DagsHub shows:
# - Experiment in MLflow-compatible UI
# - Linked to Git commit (current code state)
# - Linked to DVC data version (dataset v4 used for this run)
# - Team members can view, comment, and compare in DagsHub UI
print("APAC: Run logged to DagsHub — code + data + experiment linked")
DagsHub APAC model registry promotion
# APAC: DagsHub model registry — promote trained model to production
# APAC: Register model in DagsHub model registry
dvc push # Push model artifact to DagsHub remote storage
# APAC: Tag model version in Git (DagsHub creates registry entry automatically)
git tag -a "model-v3.2-production" -m "APAC: Credit risk model v3.2 — AUC 0.852, all APAC markets"
git push origin "model-v3.2-production"
# APAC: DagsHub model registry now shows:
# Model: apac-credit-risk-model
# Version: v3.2
# Stage: production
# Linked commit: a7f3c91
# Linked DVC data version: data/processed/features_v4.parquet
# Metrics: val_auc=0.852, val_precision=0.883
# Promoted by: [email protected]
# Promoted at: 2026-06-03 14:23 SGT
# APAC: MLOps team deploys from model registry (reproducible deployment)
# APAC: If production issues arise → DagsHub shows exactly which data + code to reproduce
APAC ML Reproducibility ROI
Problem: APAC bank's credit model unexpectedly degrades after Q4 retraining
Without ML reproducibility tools:
Investigation time: 3-4 weeks
- Engineers manually compare training logs (if they exist)
- Dataset version unknown — may have changed since Q3
- Code changes since Q3 model mixed with new changes
- Root cause: never definitively identified
Outcome: retrain with best guess; risk of repeating the issue
With DVC + Aim + DagsHub:
Investigation time: 2-3 hours
- git checkout <Q3-commit> → restore exact Q3 code and .dvc metadata
- dvc checkout → sync the working tree to the exact Q3 dataset version
- Aim comparison: Q3 vs Q4 training curves side-by-side
- Finding: Q4 training data included new demographic field
with 28% missing values → model learned noise pattern
- Fix: retrain with Q3 data quality standards applied to Q4 data
Outcome: root cause identified; fix applied; recurrence prevented
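The data quality regression in this scenario is cheap to catch before training. A minimal sketch of the missingness gate that would have flagged the Q4 field, with hypothetical column data and `None` standing in for missing entries:

```python
# Flag columns whose missing-value fraction exceeds a threshold before training.
def high_missingness(columns: dict[str, list], threshold: float = 0.20) -> dict[str, float]:
    """Return {column: missing_fraction} for columns above the threshold."""
    flagged = {}
    for name, values in columns.items():
        missing = sum(1 for v in values if v is None) / len(values)
        if missing > threshold:
            flagged[name] = round(missing, 2)
    return flagged

# Hypothetical Q4 sample: the new demographic field is mostly missing.
q4_sample = {
    "income":          [52_000, 61_000, 48_500, 70_200, 55_000],
    "new_demographic": [None, "A", None, None, "B"],
}
print(high_missingness(q4_sample))  # only the sparse new field is flagged
```

Run as a pipeline stage before train (e.g. a DVC stage depending on the processed features), this check fails fast instead of letting the model learn a noise pattern.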
Related APAC ML Infrastructure Resources
For the ML experiment tracking platforms (MLflow, W&B, Comet ML) that complement DVC data versioning with richer UI, managed cloud infrastructure, and enterprise team features — the mature cloud-native alternatives when APAC teams need production-grade MLOps without self-hosted infrastructure management — see the APAC AI tools catalog.
For the ML feature store platforms (Feast, Tecton, Hopsworks) that manage the structured training features produced from DVC-versioned datasets — serving features to training pipelines tracked in Aim or DagsHub — see the APAC feature store guide.
For the data quality platforms (Encord, SuperAnnotate, Cleanlab) that detect label errors and annotation quality issues in DVC-versioned APAC training datasets — identifying when data problems explain poor model metrics visible in Aim or DagsHub experiment tracking — see the APAC data-centric AI guide.
Beyond this insight
Cross-reference our practice depth: if this article matches your stage of thinking, the underlying capabilities ship across all six pillars, ten verticals, and nine Asian markets.