
Data Readiness for AI: The APAC Enterprise Playbook

67% of APAC AI pilots that failed to reach production cited data quality or availability as a primary factor — yet only 23% of enterprises ran a formal data readiness assessment before starting. A practical framework covering the five dimensions of data readiness (quality, volume, labelling, governance, infrastructure), a 20-point pre-project checklist, and use-case-specific minimums for the most common APAC enterprise AI applications.

By AIMenta Editorial Team

What This Playbook Covers

Data readiness is the most commonly underestimated prerequisite for enterprise AI. Most mid-market Asian enterprises that have attempted an AI pilot — and stalled — were stopped by data problems, not technology problems. This playbook provides a practical framework for assessing and improving data readiness before committing to an AI deployment.

This is not a data engineering textbook. It is a decision-making guide for senior leaders and their AI advisory partners to determine whether your organisation's data is ready for the specific AI use case you are considering — and what it will take to get there if it isn't.


The Data Readiness Gap in APAC Enterprise

A 2025 survey of mid-market enterprises across Singapore, Hong Kong, Malaysia, and Korea found that 67% of AI pilots that failed to reach production cited "data quality or availability" as a primary or contributing factor. The same survey found that only 23% of enterprises had conducted a formal data readiness assessment before starting their AI project.

The gap between having data and having AI-ready data is significant:

You have data when: Your CRM has customer records. Your ERP has transaction history. Your production floor has sensor outputs. Your HR system has employee records.

You have AI-ready data when: Your data is clean enough for model training or retrieval, structured consistently enough for feature extraction, labelled appropriately for supervised tasks, complete enough to avoid systematic bias, governed well enough to pass compliance review, and accessible in the right format for the AI system you are building.

Most mid-market enterprises have the first. Very few have the second — without deliberate preparation.


The Five Dimensions of Data Readiness

Dimension 1: Data Quality

Data quality for AI has specific requirements that differ from data quality for reporting or analytics:

Completeness: Are the fields your AI model needs populated consistently? Missing values are manageable in reporting (fill with N/A) but create systematic bias in AI models. A credit risk model trained on data where income is missing for 30% of records will not behave as you expect.

Consistency: Are the same concepts represented the same way across your data? Customer names might appear in three different formats across CRM, ERP, and billing systems. Products might use different SKU codes in different systems. AI models don't automatically reconcile inconsistency — you either clean it in the pipeline or accept it as noise.

Accuracy: Is the data actually correct? This is harder to assess than completeness or consistency. Common accuracy problems: manual data entry errors (especially in legacy systems migrated from paper records), stale data that hasn't been updated (customer contact information that was accurate 3 years ago), and derived data that was calculated incorrectly.

Timeliness: Is the data current enough for your use case? A predictive maintenance model trained on sensor data from 2022-2024 may not reflect equipment upgrades made in 2025. A recommendation engine trained on purchase data from before a product range overhaul will recommend products that no longer exist.

Assessment approach: Run a data profiling exercise on the specific tables and fields your AI use case will depend on. Profiling tools (dbt, Great Expectations, or even Excel summary statistics) reveal missing rate, value distribution, and duplicate rate. Aim for <5% missing rate on critical fields before starting model development.
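The profiling pass described above can be sketched in a few lines of pandas. This is a minimal illustration, not a substitute for a full profiling tool; the DataFrame and field names are hypothetical.

```python
import pandas as pd

def profile_critical_fields(df: pd.DataFrame, critical_fields: list[str]) -> pd.DataFrame:
    """Summarise missing rate, distinct values, and duplicate rate per field."""
    rows = []
    for field in critical_fields:
        col = df[field]
        rows.append({
            "field": field,
            "missing_rate": col.isna().mean(),
            "distinct_values": col.nunique(dropna=True),
            "duplicate_rate": col.duplicated().mean(),
        })
    return pd.DataFrame(rows)

# Hypothetical CRM extract with a sparsely populated field
crm = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "industry_sector": ["Retail", None, None, "Logistics"],
})
report = profile_critical_fields(crm, ["customer_id", "industry_sector"])
# industry_sector has a missing rate of 0.5, far above the <5% target
```

Running this against the actual tables your use case depends on turns "we think the data is fine" into a number you can compare against the threshold.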

Dimension 2: Data Volume

AI models — particularly machine learning models that learn from examples — need sufficient training data. "Sufficient" varies enormously by use case:

Large language model fine-tuning: Requires thousands to tens of thousands of examples in the target domain. If you are fine-tuning a model on your company's customer service transcripts to build a customer-service assistant, you need a minimum of 5,000-10,000 representative transcripts, not 200.

Supervised classification (fraud detection, document routing, sentiment): Typically requires 1,000-10,000 labelled examples per category for good initial performance. If you have 3 fraud categories and 100 examples per category, your model will be unreliable.

Retrieval-augmented generation (RAG): Volume requirements are lower — RAG retrieves from your document corpus rather than training on it. But the corpus must cover the topics users will ask about. A knowledge base with 50 documents covering only product specs will fail when users ask about return policy, warranties, or installation guides.

Computer vision (defect detection, quality control): Typically requires 1,000-5,000 images per defect class for production-quality performance. Rare defects (which often appear less than 0.1% of the time) require special oversampling strategies.
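One common oversampling strategy for rare defect classes is simply duplicating minority examples (with replacement) until each class reaches a target count. A minimal sketch, with hypothetical file names and labels; production pipelines would combine this with augmentation rather than raw duplication:

```python
import random

def oversample_minority(samples: list[tuple[str, str]], target_per_class: int,
                        seed: int = 0) -> list[tuple[str, str]]:
    """Duplicate rare-class examples (with replacement) up to target_per_class."""
    rng = random.Random(seed)
    by_class: dict[str, list[tuple[str, str]]] = {}
    for image_path, label in samples:
        by_class.setdefault(label, []).append((image_path, label))
    balanced = []
    for label, items in by_class.items():
        balanced.extend(items)
        if len(items) < target_per_class:
            # Top up the rare class by resampling its existing examples
            balanced.extend(rng.choices(items, k=target_per_class - len(items)))
    return balanced

# Hypothetical dataset: one common defect class, one rare one
data = [("img_a.png", "scratch")] * 900 + [("img_b.png", "hairline_crack")] * 30
balanced = oversample_minority(data, target_per_class=900)
# hairline_crack is now represented 900 times instead of 30
```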

Common mistake: Scoping an AI project assuming data volume is adequate, then discovering during development that historical records are too sparse or too old to train a reliable model. Fix: Run a data volume count against your target use case before project kickoff.
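The pre-kickoff volume count can be a one-screen script: count records per class or segment and compare against the minimums for your chosen approach. A sketch, assuming the thresholds from this section and hypothetical category names:

```python
# Assumed minimum-volume thresholds, taken from Dimension 2 above
MINIMUMS = {
    "llm_fine_tuning": 5_000,
    "supervised_classification_per_class": 1_000,
    "cv_per_defect_class": 1_000,
}

def volume_check(counts: dict[str, int], approach: str) -> list[str]:
    """Return the classes/segments that fall below the minimum for the approach."""
    minimum = MINIMUMS[approach]
    return [name for name, n in counts.items() if n < minimum]

# Hypothetical fraud dataset: three categories, counted before project kickoff
shortfalls = volume_check(
    {"card_fraud": 4_200, "account_takeover": 350, "merchant_fraud": 96},
    "supervised_classification_per_class",
)
# shortfalls lists the two categories that are too sparse to train on reliably
```

If the shortfall list is non-empty, that is the conversation to have before kickoff, not in month two of development.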

Dimension 3: Data Labelling

Many AI use cases require labelled data — examples where a human expert has marked the correct answer. This is true for:

  • Classification tasks (is this document a contract, an invoice, or a policy?)
  • Named entity recognition (mark all company names, dates, and financial figures in these documents)
  • Sentiment analysis (is this customer review positive, negative, or neutral?)
  • Defect detection (circle all visible cracks in these product photos)

Labelling is expensive, slow, and often requires domain expertise. Mid-market enterprises routinely underestimate labelling requirements:

Time: A labeller can typically annotate 100-300 documents per day for simple classification, 50-100 documents per day for complex structured extraction, and 10-30 images per day for detailed bounding-box annotation. At those rates, a dataset of 5,000 documents with complex extraction requirements takes 50-100 working days, i.e. two to five months for a single full-time labeller.

Cost: Outsourced data labelling in APAC costs USD 5-25 per hour for simple tasks. Specialist labelling (medical imaging, legal documents, financial statements) costs USD 50-150 per hour or more for domain-expert annotators.

Consistency: Human labellers disagree. Define clear labelling guidelines with worked examples, run inter-annotator agreement checks (target >80% agreement on the same sample before full labelling), and build a review workflow.
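Inter-annotator agreement is straightforward to compute once two annotators have labelled the same sample. A minimal sketch showing raw percent agreement (the >80% target above) alongside Cohen's kappa, which corrects for chance agreement; the labels are illustrative:

```python
from collections import Counter

def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two annotators gave the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement: 1.0 is perfect, 0.0 is no better than chance."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled at random with their own rates
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

a = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "pos", "neu", "pos"]
b = ["pos", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neu", "pos"]
# percent_agreement(a, b) is 0.8, exactly at the >80% threshold
```

Kappa is the more honest metric when one label dominates, since raw agreement can look high purely by chance.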

Alternative strategies: Consider using large language models to generate initial labels quickly and at low cost, then have human experts review and correct — this can reduce labelling time by 40-60% for text classification tasks.
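The review workflow behind that strategy is the important part: confident machine pre-labels pass through, low-confidence ones are queued for humans, and human corrections always win. A sketch of that merge step, with the LLM call itself left out (the pre-labels, confidence scores, and ticket IDs are hypothetical):

```python
def merge_labels(pre_labels: dict[str, tuple[str, float]],
                 human_review: dict[str, str],
                 confidence_threshold: float = 0.8) -> tuple[dict[str, str], list[str]]:
    """Accept confident machine pre-labels; queue the rest for human review.
    Human corrections always override machine labels."""
    final, needs_review = {}, []
    for doc_id, (label, confidence) in pre_labels.items():
        if doc_id in human_review:
            final[doc_id] = human_review[doc_id]
        elif confidence >= confidence_threshold:
            final[doc_id] = label
        else:
            needs_review.append(doc_id)
    return final, needs_review

# Hypothetical LLM pre-labels as (label, model confidence) per ticket
pre = {"t1": ("billing", 0.95), "t2": ("billing", 0.55), "t3": ("refund", 0.90)}
final, queue = merge_labels(pre, human_review={"t3": "warranty"})
# final keeps t1's confident machine label and t3's human correction; t2 is queued
```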

Dimension 4: Data Governance and Compliance

AI training and inference on your data has compliance implications that differ from traditional data processing:

Personal data in training sets: If your customer service transcripts contain customer names, account numbers, or contact details, training an AI model on those transcripts constitutes processing of personal data. Under PDPA (Singapore/Malaysia), PDPO (Hong Kong), PIPA (Korea), or PIPL (China), you need a legal basis for this processing.

Employee data: Using HR data, performance records, or internal communications to train AI has additional employment law implications in most APAC jurisdictions. In Japan, employee data use in AI requires specific disclosure under employment contracts.

Right to erasure: If a customer invokes their data subject rights and you have used their data in model training, what is your obligation? Most AI governance frameworks distinguish between data in a training set (removable by re-training without that record) vs data represented in trained model weights (more complex). Know your position before training on customer data.

Data lineage: Regulators and enterprise risk committees are increasingly asking "what data was this model trained on?" If you cannot answer this question with a documented data provenance record, you have a governance gap that will surface during audit.

Practical step: Before using any personal data for AI training or RAG ingestion, run it through your legal/compliance team with a specific question: "Given our data processing agreements and regulatory jurisdiction, what is the legal basis for using this data to train or run an AI model?" The answer shapes your data pipeline design.

Dimension 5: Data Infrastructure and Accessibility

Data readiness is not just about the data itself — it is about whether your data systems can support AI workloads:

Data centralisation: Most mid-market enterprises in APAC have data distributed across multiple systems — ERP (SAP, Oracle, local systems), CRM (Salesforce, HubSpot, custom), HR system, financial system, document stores. AI typically needs data from multiple systems to be useful. Do you have an integration layer (data warehouse, data lake, or ETL pipeline) that brings these together?

API accessibility: Can your data systems expose data programmatically? Many older APAC enterprise systems (especially in Korea, Japan, and Taiwan where legacy systems from the 1990s are still running) don't have APIs. Data extraction from these systems requires batch exports, custom connectors, or complete system migration — all significant work.

Real-time vs batch: Some AI use cases (fraud detection, real-time recommendation) require real-time data access. If your current infrastructure is batch-export only, building a real-time AI layer requires stream processing infrastructure (Kafka, Kinesis, or equivalent) that may not exist.

Storage and compute for AI: AI workloads — particularly vector databases for RAG, model inference, and training pipelines — have different infrastructure requirements from transactional workloads. A database sized for your ERP transactions is not necessarily sized for a RAG system that runs embedding searches across 100,000 documents.
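A back-of-envelope sizing for that RAG example makes the point concrete. Assuming (hypothetically) ~10 chunks per document and 1536-dimensional float32 embeddings, the raw vector storage alone is:

```python
def embedding_storage_gb(num_documents: int, chunks_per_doc: int,
                         embedding_dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage for a RAG corpus; index structures add overhead on top."""
    vectors = num_documents * chunks_per_doc
    return vectors * embedding_dims * bytes_per_dim / 1024**3

# 100,000 documents, ~10 chunks each, 1536-dim float32 embeddings (assumed figures)
gb = embedding_storage_gb(100_000, 10, 1536)
# roughly 5.7 GB of raw vectors, before index overhead and replicas
```

For search latency, most vector databases want those vectors (plus index) in memory, which is a very different sizing exercise from a transactional ERP database.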


The Data Readiness Assessment: A 20-Point Checklist

Before starting an AI project, run through this checklist for your specific use case:

Quality

  • Missing rate on critical fields is <10% (target <5%)
  • Field values are consistent across systems (same entity, same format)
  • Data accuracy has been spot-checked against source records
  • Data is current (updated within the recency window required by the use case)
  • Duplicates have been identified and handled

Volume

  • Record count meets minimum threshold for the AI approach (see Dimension 2 above)
  • Historical depth covers the time period needed for training
  • Minority classes (rare defects, edge cases) are adequately represented

Labelling

  • Labelling requirements have been scoped (what needs to be annotated, by whom)
  • Labelling budget and timeline are allocated before project kickoff
  • Labelling guidelines have been written and validated

Governance

  • Legal basis for data processing in AI context has been confirmed
  • Personal data in training sets complies with applicable data protection law
  • Data lineage and provenance documentation exists for training data
  • Data subject rights implications have been assessed

Infrastructure

  • Data is accessible from a central location (or a plan exists to centralise it)
  • API or export mechanism for AI pipeline exists
  • Real-time access is available if the use case requires it (or batch is confirmed acceptable)
  • Infrastructure capacity for AI workloads (vector DB, inference, pipeline) has been sized
  • Data backup and recovery for AI-specific data assets is included in DR planning
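The checklist above lends itself to a simple tracked scoring pass, so the gap list falls out of the assessment automatically. A minimal sketch; the item names are hypothetical shorthand for the checks above:

```python
def readiness_score(checklist: dict[str, bool]) -> tuple[int, list[str]]:
    """Count passed checks and return the open gaps by name."""
    gaps = [item for item, passed in checklist.items() if not passed]
    return len(checklist) - len(gaps), gaps

# Hypothetical partial assessment for a RAG chatbot use case
result = readiness_score({
    "missing_rate_under_10pct": True,
    "legal_basis_confirmed": False,
    "api_access_exists": True,
    "data_lineage_documented": False,
})
# result pairs the pass count with the named gaps feeding the remediation plan
```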

Common Data Readiness Failure Patterns in APAC

The "we have it in the system" problem: Data exists in the ERP, but optional fields were never consistently populated. A customer's industry sector, which your AI needs for segmentation, was only captured for 40% of accounts because salespeople treated the field as optional.

The silos-without-keys problem: Customer records in the CRM use a CRM ID; orders in the ERP use an ERP order number; support tickets use a ticket number. None of them share a common customer identifier, making cross-system joins unreliable. AI that needs to see the full customer picture (CRM + orders + support history) can't.

The label schema problem: You have 50,000 historical customer service tickets, but they were categorised into 85 inconsistently-applied categories by a team of agents who each interpreted the taxonomy differently. Category "Billing inquiry" and "Payment question" are the same thing in 60% of cases. Cleaning this down to a coherent label schema takes weeks.

The legacy system prison: Critical data is in a system that exports only monthly batch files with a 30-day lag, in a proprietary format that requires licensed software to parse. Building the AI pipeline depends on solving this upstream problem first — and solving it may cost more than the AI itself.

The consent time bomb: You begin training your AI on customer email data, then discover that the consent mechanism in your original email marketing sign-up did not cover use of customer data for AI training. The legal team requires you to either get fresh consent from 300,000 customers or reconstruct your training set without email data.


Data Readiness by AI Use Case: Quick Reference

| Use case | Minimum volume | Key quality requirement | Governance risk |
| --- | --- | --- | --- |
| Customer service chatbot (RAG) | 500+ documents | Coverage of all query topics | Low (internal docs) |
| Credit scoring | 5,000+ labelled loan outcomes | Complete financial features | High (personal data, explainability) |
| Document extraction (contracts) | 1,000+ annotated documents | Consistent annotation schema | Medium (confidential data) |
| Predictive maintenance | 6+ months sensor history | Timestamped failure events | Low (equipment data) |
| HR performance analytics | 2+ years employee records | Consistent evaluation criteria | High (employee data law) |
| Demand forecasting | 2+ years transaction history | SKU-level weekly granularity | Low |
| Fraud detection | 10,000+ flagged and clean cases | Balanced class representation | High (personal data) |
| Sentiment analysis (surveys) | 2,000+ labelled responses | Inter-annotator >80% agreement | Low |
| Computer vision (QC) | 1,000+ images per defect class | Bounding-box annotation | Low |

What to Do If Your Data Isn't Ready

Data readiness is a solvable problem — but it takes time and investment. If your assessment reveals gaps, the path forward depends on severity:

Minor gaps (one or two quality issues, borderline volume): Address in parallel with AI development. A 3-month AI project can accommodate 4-6 weeks of parallel data cleaning work if the scope is defined at project start.

Structural gaps (missing integration layer, major system silos): Address before AI project starts. Building a data warehouse or integration layer typically takes 3-9 months, depending on complexity. This is not a delay — it is the prerequisite work. AI built on top of unresolved structural data problems fails in production.

Governance gaps (unclear legal basis, consent issues): Resolve with legal counsel before data collection or model training begins. A governance problem discovered mid-project can shut down the entire project. Resolving it upfront adds 2-4 weeks to the timeline but avoids catastrophic risk.

Volume gaps (not enough labelled data): Address with a labelling programme, data augmentation, or by pivoting to a use case with better data coverage. Alternatively, consider a pre-trained model approach (RAG or fine-tuning from a foundation model) that requires less labelled data than training from scratch.

AIMenta's AI Readiness Assessment service maps your data state against your target AI use case and produces a data readiness gap report with a remediation plan and timeline. Contact us if you want to run a structured assessment before your next AI initiative.
