TL;DR
- Of the 12 agentic AI deployments reviewed at Asian mid-market enterprises, five are running successfully in production, four are running with significant constraints, and three were rolled back.
- The successful deployments share three traits: narrow scope, strong human checkpoints, and explicit cost ceilings.
- Multi-agent systems with broad autonomy remain the highest-risk category. Single-agent, narrow-scope deployments have a much better track record.
Why now
Agentic AI moved from research demo to enterprise pilot in 2024 and into production in 2025. Vendors marketed agentic capabilities aggressively. Enterprises bought the marketing. The production track record is now visible enough to draw lessons.
Gartner's Hype Cycle for AI 2024 placed agentic AI at the peak of inflated expectations.[^1] By mid-2025, the trough of disillusionment was clearly arriving for unconstrained multi-agent systems. The narrow-scope deployments, by contrast, were quietly delivering value.
This article reviews 12 production deployments across Asia in 2024-2025, anonymised. The patterns are stable enough to act on.
What "agentic" means here
Agentic AI in this article means LLM-based systems that:
- Decompose tasks into multiple steps
- Choose actions from a defined toolset
- Execute multi-step plans with limited or no human intervention per step
- Maintain state across the task
The boundary with "automation with LLM-in-the-loop" is fuzzy. The deployments reviewed here all involve at least three sequential tool calls per task and at least one decision point where the model chose between alternative actions.
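The definition above can be made concrete with a minimal sketch. The loop below is a hypothetical stand-in, not any specific vendor's framework: the `tools` dict and `choose` policy are assumptions for illustration, but the shape matches the criteria used here, sequential tool calls, at least one decision point, and state maintained across the task.

```python
# Minimal sketch of what counts as "agentic" in this article:
# multi-step, chooses among tools, keeps state across the task.
# Tool implementations and the choose() policy are hypothetical stand-ins.

def run_agent(task: str, tools: dict, choose, max_steps: int = 5) -> dict:
    """Run a task through sequential tool calls with a decision point per step.

    choose(state) inspects accumulated state and returns the name of the
    next tool to call, or None when the task is done.
    """
    state = {"task": task, "trace": []}
    for _ in range(max_steps):
        tool_name = choose(state)          # decision point: pick among alternatives
        if tool_name is None:
            break
        result = tools[tool_name](state)   # one sequential tool call
        state["trace"].append((tool_name, result))
        state[tool_name] = result          # state maintained across steps
    return state
```

A system that makes three or more such calls per task, with a real choice at `choose`, falls inside the boundary; a fixed pipeline that always calls the same tools in the same order does not.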
The 12 deployments
Multiple sectors across five markets. Anonymised but real.
- Customer-support triage agent (650-person SaaS, Singapore). Successful in production. Narrow scope.
- Document-extraction agent (380-person professional services, Hong Kong). Successful. Tight tooling.
- Sales-research agent (520-person enterprise software, Tokyo). Successful with constraints.
- Code-review agent (280-person fintech, Singapore). Successful in production for limited code categories.
- Compliance-monitoring agent (700-person specialty insurer, Tokyo). Successful. Heavy human checkpointing.
- Marketing-content agent (450-person retail, Hong Kong). Running with constraints. Quality variance.
- HR-screening agent (340-person staffing firm, Seoul). Running with constraints. Bias monitoring overhead.
- IT-incident-response agent (600-person manufacturer, Penang). Running with constraints. Limited to specific incident classes.
- Procurement-negotiation agent (480-person retailer, Hong Kong). Running with constraints. Human approval at every commitment point.
- Multi-agent research assistant (320-person consulting firm, Singapore). Rolled back. Quality and cost issues.
- Multi-agent campaign-planning system (270-person digital agency, Tokyo). Rolled back. Coordination failures.
- Autonomous trading-research agent (190-person asset manager, Singapore). Rolled back. Compliance and risk concerns.
The five successful deployments are all single-agent, narrow-scope, with strong human checkpoints. The three rolled-back deployments are all multi-agent or broad-scope autonomous.
Pattern 1: narrow scope wins
The successful deployments solve one task. The customer-support triage agent classifies inbound tickets and routes them; it does not also draft responses, escalate to managers, or update the CRM. The document-extraction agent extracts fields from a defined document type; it does not also interpret meaning or take downstream action.
Narrow scope makes evaluation tractable, failure modes finite, and human review feasible. Broad scope explodes the surface area faster than any team can govern.
McKinsey's Generative AI in the Enterprise tracks scope as a predictor of deployment success and finds that "narrowly scoped, single-task agents" reach production at a 4x higher rate than "general-purpose autonomous agents."[^2]
Pattern 2: human checkpoints, not human review
The successful deployments do not require humans to review every output. They require humans at specific checkpoints where the cost of an autonomous mistake exceeds the cost of waiting.
The compliance-monitoring agent at the Tokyo insurer checks contracts against policy. The agent flags suspected violations. A human reviews flagged items only. False negatives (missed violations) are caught by sampling; false positives (incorrect flags) are caught at human review. The agent operates autonomously most of the time; humans operate at the point of consequence.
This is different from "human in the loop" defined as "human reviews every output." The latter eliminates the productivity benefit. The former preserves it.
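The routing logic behind this pattern is small. The sketch below is a hypothetical illustration of the compliance-monitoring setup described above (the `SAMPLE_RATE` value and function names are assumptions, not details from the deployment): flagged items always reach a human, a random sample of unflagged items is audited to catch false negatives, and everything else passes through autonomously.

```python
import random

SAMPLE_RATE = 0.05  # fraction of unflagged items audited for false negatives

def route(item, agent_flagged: bool) -> str:
    """Decide who handles an agent-processed item.

    Flagged items always go to human review (the point of consequence).
    A small random sample of unflagged items goes to an audit queue to
    catch false negatives. Everything else passes through autonomously.
    """
    if agent_flagged:
        return "human_review"
    if random.random() < SAMPLE_RATE:
        return "audit_sample"
    return "auto_pass"
```

The `SAMPLE_RATE` is the tuning knob: raise it while trust in the agent is being established, lower it once the sampled false-negative rate is stable.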
Pattern 3: explicit cost ceilings
Agentic systems can rack up large inference costs through deep reasoning loops. The successful deployments have explicit per-task cost ceilings enforced in code.
The sales-research agent at the Tokyo software company has a US$0.40 per-task cost ceiling. If the agent has not produced a satisfactory output by the ceiling, it returns its best partial output and flags the task for human completion. This prevents runaway loops where a stuck agent generates US$50 of inference cost on a single low-value task.
The rolled-back multi-agent research assistant had no such ceiling. The design estimated an average cost of US$1.20 per task; actual production cost averaged US$8.40, driven by inter-agent argument loops. The economics did not work.
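Enforcing the ceiling in code is straightforward. The sketch below is a minimal illustration, assuming a hypothetical `step_fn` that represents one agent iteration; it is not the Tokyo deployment's implementation, only the shape of it: accumulate cost per step, and on hitting the ceiling return the best partial output flagged for human completion.

```python
from dataclasses import dataclass

COST_CEILING_USD = 0.40  # per-task ceiling, mirroring the figure in the text

@dataclass
class TaskResult:
    output: str
    complete: bool
    cost_usd: float
    needs_human: bool

def run_with_ceiling(step_fn, max_steps: int = 20) -> TaskResult:
    """Run agent steps until done or the cost ceiling is hit.

    step_fn() stands in for one agent iteration and returns
    (partial_output, done_flag, step_cost_usd).
    """
    cost, output = 0.0, ""
    for _ in range(max_steps):
        partial, done, step_cost = step_fn()
        cost += step_cost
        output = partial
        if done:
            return TaskResult(output, True, cost, needs_human=False)
        if cost >= COST_CEILING_USD:
            # Ceiling hit: return best partial output, flag for human completion
            return TaskResult(output, False, cost, needs_human=True)
    return TaskResult(output, False, cost, needs_human=True)
```

The `max_steps` cap is a second, independent guard: even if per-step cost accounting is wrong, a stuck loop still terminates.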
Pattern 4: single-agent over multi-agent
All three rolled-back deployments are multi-agent. All five successful deployments are single-agent or single-agent-with-tools.
Multi-agent systems are appealing in design (specialised agents, division of labour) and difficult in production (coordination overhead, error compounding, cost variance). The current state of the technology favours single-agent designs with good tool calling over multi-agent designs with elaborate coordination.
This may change. Frontier model providers are investing in multi-agent coordination. As of mid-2026 the production data favours single-agent.
Pattern 5: tools over reasoning
The successful deployments rely on the agent calling well-defined tools rather than on the agent reasoning through everything. The document-extraction agent calls a structured-output tool that enforces a JSON schema. The customer-support triage agent calls a classification tool with a fixed taxonomy. The reasoning happens in the tool design and the tool docstrings, not in the agent's free-form thought.
The pattern: invest engineering effort in well-designed tools. Let the agent be a smart router between tools, not a free-form reasoner. Anthropic's Building effective agents guide aligns with this view.[^3]
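What "reasoning in the tool design" looks like in practice: the tool, not the model, enforces the output contract. The sketch below is a hypothetical version of the triage classifier described above (the taxonomy labels and fallback name are assumptions for illustration): whatever free-form label the model proposes, the tool normalises it and rejects anything outside the fixed taxonomy.

```python
# Hypothetical fixed taxonomy for a ticket-triage tool. The agent routes
# to tools; the tool enforces the contract on what reaches downstream systems.
TAXONOMY = {"billing", "bug_report", "feature_request", "account_access"}

def classify_ticket(model_label: str) -> str:
    """Classification tool: normalise a model-proposed label and enforce
    the fixed taxonomy, falling back to a safe default rather than
    letting free-form model output leak into routing logic."""
    label = model_label.strip().lower().replace(" ", "_")
    return label if label in TAXONOMY else "needs_triage"
```

The fallback label matters: an out-of-taxonomy answer becomes a visible triage item instead of a silent misroute.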
What killed the rollbacks
The three rolled-back deployments share specific failure modes.
Multi-agent argument loops. Two agents disagree, argue at length, and produce neither resolution nor useful output. Cost ballooned, quality dropped.
Coordination failures. Multi-agent systems require shared state, planning, and recovery logic. Building these well is harder than the agent demos suggest, and most teams underestimated the effort.
Compliance and risk. The autonomous trading-research agent could not satisfy the asset manager's risk and compliance review. The agent's reasoning was not auditable enough; the chain of decisions could not be reconstructed for regulatory inquiry.
Each failure mode is fixable in principle. None was fixable in the timeframe and budget the deployments had.
Implementation playbook
How to scope an agentic deployment with a higher chance of success.
- Pick a narrow scope. One task, one input shape, one output shape. Resist the temptation to bundle.
- Design tools first. Before designing the agent, design the tools the agent will call. Each tool should be evaluable independently.
- Build single-agent. Move to multi-agent only after a single-agent system has been mature in production for at least six months.
- Define human checkpoints at points of consequence. Where would an autonomous mistake cost more than waiting for a human? Insert a checkpoint there.
- Set explicit cost ceilings. Per-task cost cap enforced in code. The ceiling is a forcing function for tight design.
- Plan the rollback path. Production agents need an off switch. Know how to disable the agent in under 60 seconds and fall back to the prior workflow.
- Measure cost-per-successful-outcome, not cost-per-call. Fold any human time spent at checkpoints into the cost of each successful outcome, then compare against the baseline workflow.
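The last metric in the playbook is a one-line formula, but it is easy to get wrong by counting only inference spend. A minimal sketch, with illustrative parameter names of my choosing:

```python
def cost_per_successful_outcome(
    inference_cost_usd: float,
    checkpoint_minutes: float,
    hourly_rate_usd: float,
    successful_outcomes: int,
) -> float:
    """Fully loaded cost per successful outcome: inference spend plus the
    human time consumed at checkpoints, divided by the count of successes.
    Failed or abandoned tasks still contribute cost but not successes."""
    human_cost = (checkpoint_minutes / 60.0) * hourly_rate_usd
    return (inference_cost_usd + human_cost) / successful_outcomes
```

For example, US$40 of inference plus two hours of checkpoint review at US$60/hour across 100 successful outcomes gives US$1.60 per outcome, four times the headline inference-only figure of US$0.40.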
Counter-arguments
"Multi-agent is the future." Possibly. The future is not yet production-ready for most enterprise contexts. Adopt single-agent today and revisit multi-agent in 12-18 months.
"Narrow scope misses the productivity ceiling." It does. It also misses the failure ceiling. The trade-off is rational at the current state of the technology.
"This pattern advice is conservative." Yes. The 5 successful deployments delivered measurable value. The 3 rolled-back deployments wasted budget and credibility. The conservative pattern wins in the current generation.
Bottom line
Agentic AI is real but immature. The successful production deployments are narrow-scope, single-agent, with human checkpoints and explicit cost ceilings. Broad-scope multi-agent systems remain the highest-risk category and have a poor production track record in mid-market Asian deployments through 2025.
If your team is designing an agentic deployment now, pressure-test it against the five patterns above. If it fails any of them, scope down before launch.
Next read
- Hallucination Risk: A Practical Containment Framework for LLM Deployments
- From Pilot to Production: An MLOps Maturity Model for Mid-Market Teams
By Maya Tan, Practice Lead, AI Strategy.
[^1]: Gartner, Hype Cycle for Artificial Intelligence, 2024, July 2024.

[^2]: McKinsey & Company, Generative AI in the Enterprise, June 2025.

[^3]: Anthropic, Building effective agents, December 2024.