This is a recap of AIMenta's April 10 webinar, attended by 218 enterprise AI leads across 9 APAC markets.
The question we started with
We opened with a poll: "Has your organisation run an agentic AI pilot?"
- 61% said yes (of whom 44% said it was "still running" and 38% said it had been "paused or discontinued")
- 39% said no
That discontinuation rate — 38% of pilots paused or stopped — is the number we spent most of the session on. Because the failure modes are remarkably consistent, and remarkably avoidable.
What agentic AI actually means in enterprise context
Before the failure modes, a definitions moment that generated significant chat activity.
Most of the audience was using "agentic AI" to mean different things:
- Single-agent with tools: A language model that can call external functions (search, database query, send email) to complete a task. This is the most common "agentic" deployment and the one with the highest success rate.
- Multi-agent orchestration: Multiple specialised AI agents coordinating — a planner agent, an executor agent, a quality-checker agent, a memory agent. More powerful but dramatically more complex.
- Autonomous long-horizon tasks: Agents that operate over hours or days without human checkpoints to complete multi-step objectives. Currently the hardest to deploy reliably.
Our recommendation: start with single-agent-with-tools. Most enterprise use cases that are being pitched as "multi-agent" problems are actually single-agent-with-tools problems that have been over-architected.
The five failure modes we see most often
1. Tool reliability underestimated
The most common failure mode in production agentic systems: the tools the agent calls are unreliable. The database returns a timeout. The API returns an unexpected format. The web search returns stale results.
LLM orchestrators handle tool failures poorly unless explicitly trained to. An agent that calls a tool, gets an error, and then hallucinates a plausible-looking result — rather than surfacing the error to the human — is a production incident waiting to happen.
Fix: Design for graceful degradation. Every tool call needs an explicit error handling path. The agent should be prompted to escalate uncertainty rather than fill in gaps.
2. Context window management
Multi-step agentic tasks accumulate context fast. A task that requires 10 tool calls, each returning 2,000 tokens of data, has consumed 20,000 tokens before the agent has written a word of output. At 50 tool calls — common in complex document processing tasks — you've hit the limit of most deployed models.
Fix: Summarise intermediate results aggressively. Don't carry full tool outputs in the context chain; carry extracted key facts. This requires deliberate prompt engineering at each step, not an afterthought.
3. Human-in-the-loop designed out
Many enterprise agentic pilots are designed as fully autonomous — the appeal of "it just runs" is strong. But in production, fully autonomous agents that make consequential decisions (send an email to a client, update a database record, submit a form) are a compliance and trust risk.
The agents that succeed in production almost universally have structured human checkpoint moments: after planning, before execution of high-stakes actions, before delivering outputs to external parties.
Fix: Design human-in-the-loop as a feature, not a concession. The checkpoint interface — how the human reviews and approves the agent's proposed action — should be as carefully designed as the agent itself.
4. Prompt brittleness at scale
An agent that works perfectly on the 30 test cases used in development fails on 15% of production inputs because the prompt doesn't handle edge cases. Edge cases that were rare in the test set become common at production volume.
Fix: Red-team your agent's prompts with adversarial examples before production. Ask: what happens when the input is in a different language? When a field is missing? When the document has an unusual structure? When the user asks a question the agent wasn't designed for?
5. Observability treated as optional
Most pilot agentic systems have no logging of what the agent actually decided and why. When something goes wrong in production, you can't diagnose it because you don't have a record of the agent's reasoning chain.
Fix: Log everything. Every tool call, every intermediate reasoning step, every output. Store it in a structured format that lets you query "show me all cases where the agent used tool X and then produced output Y." This is not optional in enterprise production.
What's working well
The audience Q&A surfaced several success patterns:
Document extraction pipelines: Single-agent systems that read a document, extract structured data, and route it for human review are the most reliably successful agentic pattern. Success rate in the room: 8 of 11 teams that had deployed this pattern reported it in production. The key: a narrow, well-defined task scope.
Customer service triage: Agents that read an incoming customer query, classify it, extract key information, and route it to the right team (with a pre-drafted response for simple cases) are working well. The key: the agent is not autonomous — it routes for human action, it doesn't act.
Internal knowledge Q&A: RAG-backed agents that answer employee questions from internal documentation. Success in the room: 6 of 8 teams. The failure case is when the agent tries to answer questions that the documentation doesn't cover, rather than escalating to a human.
The operational cost conversation
We spent 20 minutes on operational costs because this is consistently the biggest surprise for teams moving from pilot to production.
The token cost of an agentic task is 5–20× higher than a single-shot LLM call, because agentic tasks involve multiple LLM calls per user request. At production volume (10,000 agent tasks/day), this is a meaningful infrastructure line item.
The hidden cost is human review time. If your agent has a 10% error rate on high-stakes tasks and each error requires 15 minutes of human remediation, at 1,000 tasks/day that's 25 hours of remediation per day — the equivalent of 3 full-time staff.
Model for total cost of agentic AI = (token costs) + (compute/infrastructure) + (human review at error rate) + (monitoring/logging) + (retraining cadence). Most pilots only model the first two.
Audience question highlights
"What frameworks should we use for multi-agent orchestration?" LangGraph for Python teams, Microsoft AutoGen for .NET environments, CrewAI for simpler orchestration patterns. Our practical view: the framework matters less than the task design. A badly designed agentic task in any framework fails; a well-designed task in any major framework works.
"Should we build or buy?" For the orchestration layer: buy or use open source (LangGraph, AutoGen). For the domain-specific agents: build, because your tools and task definitions are specific to your business context and a vendor won't know them better than you.
"How do we handle PII in agentic systems?" Design the agent to not ingest PII unless strictly necessary for the task. If it must handle PII, ensure all tool calls are to systems within your data perimeter, log PII handling with appropriate controls, and ensure the model you're using has a data processing agreement compatible with your jurisdiction's requirements.
Next webinar: May 20 — "RAG in Production for APAC Languages: Chunking, Retrieval, and Quality at Scale." Registration opens next week.
Where this applies
How AIMenta turns these ideas into engagements — explore the relevant service lines, industries, and markets.
Beyond this insight
Cross-reference our practice depth.
If this article matches your stage of thinking, the underlying capabilities ship across all six pillars, ten verticals, and nine Asian markets.