An agent breaks in production. The team's first instinct: debug the prompt. Swap the model. Add more guardrails. Sometimes that fixes it. Often it doesn't, because the problem isn't the AI. It's that nobody specified clearly enough what the AI was supposed to do.
This happens everywhere. And now, there's evidence for what practitioners have suspected: most agent failures aren't AI problems at all.
The specification problem
UC Berkeley's MAST taxonomy, presented at NeurIPS 2025, analysed 1,600+ annotated execution traces across seven agent frameworks, including MetaGPT and ChatDev. The researchers categorised 14 distinct failure modes. The largest category was specification and system design failures. Not model hallucination. Not reasoning errors. Not tool limitations. These failures happened because the specification was wrong, incomplete, or misunderstood.
This validates something I've been building into the AGENTIC Framework from the start: you cannot engineer your way out of a specification problem. If the spec is wrong, the build will be wrong. And no amount of prompt tuning fixes that.
The five core specification failures show up repeatedly (a minimal spec sketch covering them follows the list):
- Unclear role specification. The agent doesn't know what persona or constraints it operates under. Am I validating proposals or creating them? Can I reject requests or only fulfil them? Do I have the authority to make judgment calls or do I escalate everything?
- Incorrect agent-task pairing. The agent is built for one task but deployed against another. The system was designed for summarising documents, but someone asks it to evaluate them. Similar-looking tasks. Different requirements. Failure.
- Poor system architecture. Too many agents, unclear handoffs, responsibilities scattered. One agent's output feeds into another's, but nobody specified what happens when the output doesn't match what the downstream agent expects.
- Insufficient tool access. The agent can't reach the data or systems it needs to complete the task. It's asked to verify information it can't access, or update records in a system it doesn't have permission to touch.
- Missing termination conditions. The agent doesn't know when to stop, when to escalate, or when to admit defeat. It loops, it improvises, it makes decisions it shouldn't.
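Each of those five failures maps to something a specification can state explicitly. Here's a minimal sketch in Python; `AgentSpec` and every field name are hypothetical illustrations, not part of MAST or any agent framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """A minimal agent spec that makes the five failure modes checkable."""
    role: str                # persona and authority: validate, create, reject?
    task: str                # the task this agent was designed for, not a lookalike
    upstream: list[str] = field(default_factory=list)    # agents whose output this one consumes
    downstream: list[str] = field(default_factory=list)  # agents that consume this one's output
    tools: list[str] = field(default_factory=list)       # systems the agent can actually reach
    max_steps: int = 20      # hard termination condition
    escalate_to: str = ""    # where the agent goes instead of improvising

    def gaps(self) -> list[str]:
        """Flag the unspecified fields behind the most common failures."""
        issues = []
        if not self.role:
            issues.append("unclear role specification")
        if not self.tools:
            issues.append("insufficient tool access")
        if self.max_steps <= 0 or not self.escalate_to:
            issues.append("missing termination/escalation conditions")
        # Task pairing and system architecture can't be linted from fields
        # alone; those gaps surface in the Assess conversation.
        return issues
```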
If you're debugging prompts when the problem is your specification, you'll iterate forever. The prompt is downstream of the spec. Fix the spec first.
The Berkeley researchers observed these same modes across all seven frameworks. The failure patterns were identical. MetaGPT failed on unclear specs. ChatDev failed on unclear specs. A handwriting-recognition agent and a data-extraction agent both failed when termination conditions weren't specified. The failures weren't framework-specific. They were structural problems with how agents were defined.
Three categories, one routing decision
The MAST taxonomy organises failures into three categories. Each category requires a different fix.
Category 1: Specification and design. These failures originate in Assess. Unclear roles, mismatched tasks, poor architecture, missing conditions, insufficient tool access. They're upstream problems. They don't get fixed by adjusting the deployed agent. They get fixed by respecifying the workflow.
Category 2: Inter-agent misalignment. These are coordination failures introduced at Engineer that emerge fully at Nurture. Task derailment when one agent's output doesn't match what the downstream agent expects. Information loss at handoffs. Conflicting behaviours when agents optimise for different success criteria. Communication breakdowns. Cascading errors, where one agent's mistake propagates into everything downstream.
Category 3: Task verification. These failures happen because success criteria are undefined or loose. Inadequate output verification. False positive completion, where the agent thinks it succeeded when it didn't. Criteria mismatch, where different stakeholders define success differently. Missing edge cases that only show up at scale.
The routing matters. Without categorisation, teams default to tweaking prompts for every failure. With categorisation, you identify the root cause and fix the actual problem. A specification gap needs respecification, not re-prompting. A coordination gap needs architectural redesign. A verification gap needs better test case coverage.
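The routing is mechanical once a failure is categorised. A sketch of that decision in Python; the category-to-fix mapping paraphrases the taxonomy's three categories, while the scaffolding (`MASTCategory`, `route_failure`) is illustrative, not from the paper:

```python
from enum import Enum

class MASTCategory(Enum):
    SPECIFICATION = "specification and design"
    COORDINATION = "inter-agent misalignment"
    VERIFICATION = "task verification"

# Each category routes to a different fix. Note what never appears:
# "tweak the prompt".
REMEDIATION = {
    MASTCategory.SPECIFICATION: "respecify the workflow in Assess",
    MASTCategory.COORDINATION: "redesign handoffs and architecture in Engineer",
    MASTCategory.VERIFICATION: "tighten success criteria and test coverage",
}

def route_failure(category: MASTCategory) -> str:
    """Return the root-cause fix for a categorised production failure."""
    return REMEDIATION[category]
```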
What a good specification prevents
Before a specification leaves Assess, it passes three checks. Roles and responsibilities are unambiguous. Handoffs and termination conditions are explicit. Success criteria match what the workflow actually needs to produce.
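In code, that gate might look like the following, a minimal sketch reusing the hypothetical AgentSpec from earlier; `success_criteria` and `workflow_outputs` are stand-ins for whatever the Assess conversation surfaces:

```python
def assess_gate(spec, success_criteria: set[str], workflow_outputs: set[str]) -> list[str]:
    """Return blocking issues; an empty list means the spec may leave Assess."""
    blockers = []
    # Check 1: roles and responsibilities unambiguous.
    if not spec.role or "TBD" in spec.role:
        blockers.append("role is ambiguous or unspecified")
    # Check 2: handoffs and termination conditions explicit.
    if spec.max_steps <= 0 or not spec.escalate_to:
        blockers.append("termination or escalation conditions are implicit")
    # Check 3: success criteria match what the workflow actually produces.
    if not success_criteria <= workflow_outputs:
        missing = success_criteria - workflow_outputs
        blockers.append(f"criteria reference outputs the workflow never produces: {missing}")
    return blockers
```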
That third check is where most teams get caught. The documented process describes the ideal workflow. The actual workflow includes workarounds, exceptions, shortcuts. Someone has been handling edge cases in a particular way for three years. The documentation doesn't mention it. The new specification doesn't account for it. The agent hits that edge case, breaks, and everyone's confused about why the agent is failing when it's deployed exactly as specified.
This is why Assess is conversation-based, not document-based. Feeding existing SOPs into an AI gives you the official process. You need the real one. The gaps between these are where deployments fail.
When a production failure happens at Nurture, categorise it using the MAST framework. Specification gap? Inter-agent coordination problem? Verification failure? Log it in the Vault. Over time, new workflows can check the failure registry during Assess to anticipate common problems in similar domains. The Vault learns from real failures, not theoretical ones.
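A sketch of what that registry could be, assuming a simple JSON-lines log; `FailureRecord`, its fields, and the file path are assumptions for illustration, not actual Vault internals:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

REGISTRY = Path("vault/failure_registry.jsonl")  # hypothetical location

@dataclass
class FailureRecord:
    workflow: str
    domain: str        # e.g. "invoice-processing"
    category: str      # MAST category: specification / coordination / verification
    failure_mode: str  # one of the 14 modes
    notes: str

def log_failure(record: FailureRecord) -> None:
    """Append a categorised production failure to the registry."""
    REGISTRY.parent.mkdir(parents=True, exist_ok=True)
    with REGISTRY.open("a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def known_failures(domain: str) -> list[dict]:
    """During Assess, surface failures already logged in similar domains."""
    if not REGISTRY.exists():
        return []
    records = (json.loads(line) for line in REGISTRY.read_text().splitlines() if line)
    return [r for r in records if r["domain"] == domain]
```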
The compounding cost of specification debt
When someone patches an agent without updating the specification, you accumulate specification debt. The spec and reality diverge. The next person who reads the spec is working from a false picture. The Agent at Track tries to improve the workflow and makes changes based on outdated documentation. Cascade.
The Builder at Nurture should modify the spec first, then derive implementation changes. Any implementation change that happens without a spec update compounds the debt. Over enough changes, nobody trusts the spec. You're flying blind, and you don't even know it.
The AGENTIC Framework treats the specification as the source of truth. Everything downstream derives from it. The test suite validates against the spec. The monitoring at Track checks for drift between spec and execution. If the spec is wrong, the whole system is wrong. If the spec is right but execution drifts, Track catches it.
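One way Track could catch that drift: stamp every execution trace with the spec version the agent was built from, then flag traces that ran against a stale spec or past the spec's termination bound. A sketch under those assumptions; none of these names are real Track internals:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    workflow: str
    spec_version: str  # the spec the deployed agent was built from
    steps_taken: int

def drift_alerts(traces: list[Trace], current_spec_version: str, max_steps: int) -> list[str]:
    """Flag divergence between the spec of record and what's actually running."""
    alerts = []
    for t in traces:
        if t.spec_version != current_spec_version:
            # Someone changed the agent or the spec without propagating the change.
            alerts.append(f"{t.workflow}: built from spec {t.spec_version}, "
                          f"current is {current_spec_version}")
        if t.steps_taken > max_steps:
            alerts.append(f"{t.workflow}: exceeded specified termination bound "
                          f"({t.steps_taken} > {max_steps})")
    return alerts
```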
The Berkeley research confirms what practitioners already know. Most agent failures aren't AI problems. They're people-and-process problems, rooted in specification gaps. The research also confirms the solution. Rigorous specification before you build. Clear categorisation when things break. A specification-first methodology where the spec is the source of truth and everything else derives from it.
The gap between "the AI can theoretically do this" and "this actually works in production" is wider than it looks. That gap is where specifications matter. And once the specification is right, the next question is how much autonomy to give the agent, which is a design decision, not a capability outcome.