If you've been following the AI adoption conversation, you've probably heard that "the hard part isn't the AI." You might have nodded and moved on. But what does that actually mean in practice? Where does all that extra work live? And how much is too much?
A team at Mass General Brigham set out to find out. They built an AI agent to detect immune-related adverse events from clinical notes. They tracked every hour. Less than 20% was prompt engineering and model development. Over 80% was the sociotechnical work: data integration, regulatory navigation, stakeholder alignment, governance, and proving economic value. They called it the 80/20 rule, and they weren't describing a failure of planning. They were describing the actual shape of AI deployment work.
The field guide that confirms it
The research comes from Gallifant et al. (2025), published in the Kellogg Review under the title "The Hidden Costs of Clinical AI: A Field Guide to the 80/20 Rule." The team built an agent called irAE-Agent for detecting immune-related adverse events in oncology patients. It works. The model is good. But here's what the hours revealed: the part everyone assumes is the hard part, the model and the prompts, took the least time.
They identified five specific "heavy lifts" that consumed that 80%: data integration and pipeline reliability, model validation and parallel evaluation, economic value demonstration, system drift detection and response, and governance architecture. Not all of these required technical work. Many of them didn't. But all of them had to happen for the agent to move from a working prototype to a production system.
This matches what I've seen applying the AGENTIC Framework across two organisations. The interesting problems in AI deployment are almost never technical. They're organisational. The technology is ready long before the people and processes around it are. You'll have a working model weeks before you have alignment on who's accountable for deciding when it fails, or who controls the override, or how decisions get documented.
The interesting problems in AI deployment are rarely technical. They're organisational. The tech works faster than the people and processes around it can adapt.
Here's what matters about the 80/20 ratio: when you know where the work actually is, you can staff for it, plan for it, and build a system that accounts for it from the start. Most AI adoption plans don't. They build the governance layer last, or bolt it on halfway through, or worse, treat it as something the technical team should handle on the side. By then, you've already lost time and momentum, and you're asking the wrong people to do the work.
The five heavy lifts
Let me walk through what each one looks like and where it sits in the AGENTIC Framework.
Data integration and reliability
This is usually the biggest time sink. Getting data into the system reliably, at the right quality, with the right schema, with proper lineage tracking, and with audit controls. The Kellogg team spent more time on data pipelines and validation than on prompt engineering. They had to map where data lived across multiple systems, negotiate access, build ETL workflows, set up monitoring for data quality, and design rollback procedures for when the data goes wrong.
This maps directly to the Assess stage of the AGENTIC Framework. Assess is where you map the actual data flows, not just the workflow steps on paper. You surface the undocumented rules, the edge cases, the places where human judgment is hiding in a spreadsheet instead of being explicit. The work is tedious and unsung, but it's foundational. Data that's well-understood at Assess means everything downstream moves faster.
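To make that concrete, here's a minimal sketch of the kind of quality gate that sits in front of an agent like this, assuming each batch of extracted notes arrives as a pandas DataFrame. The field names and thresholds are illustrative assumptions, not details from the study.

```python
import pandas as pd

# Hypothetical quality rules for a batch of extracted clinical notes.
# Field names and thresholds are illustrative, not from the irAE-Agent study.
REQUIRED_COLUMNS = {"note_id", "patient_id", "note_text", "note_date", "source_system"}
MAX_NULL_RATE = 0.01      # tolerate at most 1% missing values per column
MAX_STALENESS_DAYS = 2    # newest note in the batch must be at most this old

def quality_gate(batch: pd.DataFrame) -> list[str]:
    """Return human-readable failures; an empty list means the batch may proceed."""
    failures = []

    missing = REQUIRED_COLUMNS - set(batch.columns)
    if missing:
        failures.append(f"schema: missing columns {sorted(missing)}")

    for col in REQUIRED_COLUMNS & set(batch.columns):
        null_rate = batch[col].isna().mean()
        if null_rate > MAX_NULL_RATE:
            failures.append(f"quality: {col} null rate {null_rate:.1%} exceeds {MAX_NULL_RATE:.0%}")

    if "note_date" in batch.columns:
        staleness = (pd.Timestamp.now() - pd.to_datetime(batch["note_date"]).max()).days
        if staleness > MAX_STALENESS_DAYS:
            failures.append(f"freshness: newest note is {staleness} days old")

    return failures

# Usage: if anything fails, hold the batch (don't feed the agent) and alert a human.
# failures = quality_gate(todays_batch)
# if failures: hold_and_alert(failures)   # hold_and_alert is a hypothetical hook
```

The specific checks matter less than the pattern: bad data gets held before it reaches the agent, and someone owns the response when it does.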
Model validation and continuous evaluation
You have a working model. Great. But how do you know it still works next month? What's the baseline for "it's working"? How do you catch degradation before it affects real decisions? Validation isn't a single gate. It's a continuous process. Shadow deployment, parallel runs, ongoing evaluation against ground truth. The Kellogg team ran their agent in parallel with the existing system for weeks, tracking every prediction against what the human reviewers said. Only after that evidence accumulated did they move to primary decision-making.
This is the Engineer stage: sandbox, parallel-run, evidence-based trust. You're not asking "does it work in the lab?" You're asking "does it work in the world, with our data, with our edge cases, at the scale we need?" The second question takes time to answer properly.
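Here's a minimal sketch of what the comparison step in a parallel run can look like, assuming you've logged the agent's flag and the human reviewer's decision for each note. The record shape and the promotion thresholds are illustrative assumptions, not the study's.

```python
# Minimal parallel-run comparison: agent predictions vs. human reviewer decisions
# on the same notes. Record shape and thresholds are illustrative assumptions.

def parallel_run_report(records: list[dict]) -> dict:
    """Each record: {"note_id": ..., "agent_flag": bool, "human_flag": bool}."""
    tp = sum(r["agent_flag"] and r["human_flag"] for r in records)
    fp = sum(r["agent_flag"] and not r["human_flag"] for r in records)
    fn = sum(not r["agent_flag"] and r["human_flag"] for r in records)
    tn = sum(not r["agent_flag"] and not r["human_flag"] for r in records)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    agreement = (tp + tn) / len(records) if records else 0.0

    return {"n": len(records), "precision": precision, "recall": recall, "agreement": agreement}

# Hypothetical promotion rule: only move from shadow mode to primary decision-making
# once enough evidence has accumulated and recall clears a pre-agreed bar.
def ready_to_promote(report: dict, min_n: int = 500, min_recall: float = 0.95) -> bool:
    return report["n"] >= min_n and report["recall"] >= min_recall
```

Weeks of this, logged and reviewed, is what evidence-based trust looks like in practice.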
Economic value demonstration
Someone has to prove this saves money or time or both. Not just theoretically. Demonstrably. The agent flags 5,000 clinical notes a year. If a human reviewer would spend 30 minutes on each one, that's 2,500 hours of human time saved annually. But only if the agent is accurate enough that the human can trust it without rechecking. So the economic calculation isn't just hours saved. It's hours saved minus quality assurance overhead minus failure costs if the agent gets it wrong.
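As a back-of-the-envelope sketch, with every number below an illustrative assumption rather than a figure from the study:

```python
# Back-of-the-envelope net value calculation. Every number here is an
# illustrative assumption, not a figure from the irAE-Agent study.
notes_per_year = 5_000
minutes_saved_per_note = 30          # human review time the agent replaces
qa_sample_rate = 0.10                # fraction of agent outputs a human still rechecks
qa_minutes_per_note = 30
error_rate = 0.02                    # agent mistakes that reach a clinician
minutes_per_error = 120              # cost of investigating and correcting each one

hours_saved = notes_per_year * minutes_saved_per_note / 60
qa_overhead = notes_per_year * qa_sample_rate * qa_minutes_per_note / 60
failure_cost = notes_per_year * error_rate * minutes_per_error / 60

net_hours = hours_saved - qa_overhead - failure_cost
print(f"gross: {hours_saved:.0f} h, QA: {qa_overhead:.0f} h, "
      f"failures: {failure_cost:.0f} h, net: {net_hours:.0f} h/year")
# With these assumptions: 2500 - 250 - 200 = 2050 hours per year net.
```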
If you can't show this value clearly, the project gets killed. Not because it's a bad idea. Because there's no evidence it's worth the resources. The AGENTIC scoring dimensions include cost-benefit for this reason. It's easy to build an impressive prototype. It's much harder to build one that pays for itself, and that's what matters to an organisation running on finite resources.
System drift detection and response
Models change. Data distributions shift. The world moves on. Clinical guidelines update. Patient populations change. If you deploy an agent and never look at it again, performance degrades silently. By the time someone notices, the agent is making decisions that don't match the current clinical standard. That's not okay. Drift detection means monitoring performance continuously, setting alert thresholds, and having a response process for when performance falls below them.
This is exactly what the Nurture stage is built for. Nurture is where you capture overrides, detect drift, collect continuous feedback, and maintain the relationship with the system over time. It's not a one-time build and deploy. It's ongoing maintenance and evolution. The difference between an agent that stays valuable and one that slowly becomes irrelevant.
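A minimal sketch of what that monitoring can look like, assuming you keep collecting some ground truth after go-live. The window size and tolerance are illustrative assumptions, not figures from the study.

```python
from collections import deque

# Minimal drift monitor: compare recent agreement with human reviewers against a
# baseline established during the parallel run. Window and tolerance are assumptions.
class DriftMonitor:
    def __init__(self, baseline_agreement: float, window: int = 8, tolerance: float = 0.05):
        self.baseline = baseline_agreement
        self.recent = deque(maxlen=window)   # e.g. one agreement score per week
        self.tolerance = tolerance

    def record(self, weekly_agreement: float) -> bool:
        """Add this week's agreement score; return True if drift should be flagged."""
        self.recent.append(weekly_agreement)
        if len(self.recent) < self.recent.maxlen:
            return False                     # not enough evidence to judge yet
        avg = sum(self.recent) / len(self.recent)
        return avg < self.baseline - self.tolerance

# Usage: a flag triggers the response process (review, retrain, or pull the agent).
# monitor = DriftMonitor(baseline_agreement=0.94)
# if monitor.record(this_weeks_agreement): open_drift_incident()  # hypothetical hook
```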
Governance and accountability
Who's responsible when the agent makes a bad decision? Who has authority to override it? How does the decision get documented? What's the audit trail? Who approves changes to the model or the data? These aren't optional questions. They're especially not optional in regulated environments like healthcare. But even in less regulated settings, they matter. If nobody knows who's accountable, you'll find out the hard way, usually at the worst possible time.
The AI Governance Stream in the AGENTIC Framework runs as a parallel track alongside the technical pipeline from day one. Not because governance is fun. Because building it in parallel means you don't have to retrofit it later. You design the system with governance in mind, not as an afterthought. Decisions about oversight, auditing, and control authority get made at Assess, not discovered at deployment.
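One concrete piece of that architecture is deciding up front what a single auditable decision record contains. The sketch below is hypothetical, implied by the questions above rather than taken from the study or the framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical audit record answering the governance questions up front:
# what the agent decided, who could override it, who actually did, and
# which version of the model and data produced the output.
@dataclass
class DecisionRecord:
    note_id: str
    agent_version: str            # which model/prompt version produced this decision
    data_snapshot_id: str         # which data pipeline run fed it
    agent_flag: bool              # what the agent decided
    accountable_owner: str        # role accountable for this class of decision
    override_authority: str       # role allowed to overrule the agent
    overridden_by: str | None = None
    override_reason: str | None = None
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Every decision gets written to an append-only log, so the audit trail exists
# before anyone asks for it, not after.
```

The exact schema matters less than the fact that the answers to those questions are decided, named, and recorded before the first production decision.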
Why the ratio matters for planning
Most AI project plans allocate budget and time for the build. They underestimate everything else. Then the project runs over. The team gets frustrated. The pilot succeeds technically but struggles to get resources for production. Or worse, the pilot never makes it to production because the organisational wrapper wasn't ready.
The data backs this up. RAND's analysis of enterprise AI projects: over 80% never reach meaningful production deployment. McKinsey's research: 70% of companies struggle with AI implementation beyond pilots. BCG/MIT: 74% of companies report difficulty achieving and scaling AI value at speed. Gartner's prediction: over 40% of agentic AI projects initiated in 2024 will be cancelled by end of 2027. The pattern is consistent. Technically successful pilots. Organisationally unsuccessful deployments.
Technically successful pilots that die because nobody built the organisational wrapper alongside the technical one. That's the most common AI deployment failure mode, and it's entirely preventable.
Building the 80% in
The AGENTIC Framework is structured around this ratio. It doesn't treat governance and adoption as add-ons. They're built in from the start.
The AI Governance Stream and AI Adoption Stream run as parallel tracks alongside the AGENT Pipeline. They start at Kickoff, not at deployment. Governance decisions about oversight, authority, auditing, and risk tolerance get made while you're still mapping the workflow. Adoption strategy gets built while you're validating the model. By the time you reach production, the organisational layer is already supporting the technical one.
The AGENTIC Vault captures everything: specifications, governance decisions, override logs, adoption data, validation evidence. When the pilot succeeds and someone asks "can we do this at scale?" or "can we apply this to a similar workflow?", the answer is in the documentation. It's not in one person's head or scattered across email and meeting notes.
The Prioritisation Matrix scores organisational readiness (team readiness, owner buy-in, staff dependency) alongside workflow characteristics and implementation factors. A workflow that's technically easy but organisationally unready gets flagged before resources are committed. A workflow that's technically complex but has strong organisational support usually finds a way to succeed. The scoring reflects that difference.
A technically easy workflow that's organisationally unready will fail. A technically hard workflow with strong organisational support will find a way. The scoring should reflect that.
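Here's a sketch of how that scoring can be made explicit. The dimensions mirror the ones named above; the weights and the candidate scores are invented for illustration.

```python
# Illustrative weighted scoring of candidate workflows. Weights and scores are
# invented for the example; the dimensions come from the paragraph above.
WEIGHTS = {
    "team_readiness": 0.2,
    "owner_buy_in": 0.2,
    "staff_dependency": 0.1,
    "technical_simplicity": 0.2,   # simpler implementations score higher
    "cost_benefit": 0.3,
}

def priority_score(scores: dict[str, float]) -> float:
    """Weighted sum of 1-5 scores across organisational and technical dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

candidates = {
    "easy tech, unready org": {"team_readiness": 1, "owner_buy_in": 2, "staff_dependency": 2,
                               "technical_simplicity": 5, "cost_benefit": 3},
    "hard tech, strong support": {"team_readiness": 5, "owner_buy_in": 5, "staff_dependency": 4,
                                  "technical_simplicity": 2, "cost_benefit": 4},
}

for name, scores in candidates.items():
    print(f"{name}: {priority_score(scores):.1f}")
# The organisationally ready workflow scores higher (4.0 vs 2.7) despite the harder tech.
```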
This isn't about slowing down the technical work. It's the opposite. When you do the organisational work in parallel with the technical work, the technical work has somewhere to land. You're not waiting until deployment to figure out governance. You're not discovering stakeholder misalignment when it's too late to fix it. You're building the full system, technical and organisational, all at once.
The ratio is useful information
The 80/20 breakdown isn't bad news. It's useful information. When you know where the work actually is, you can plan for it, staff for it, and build a system that accounts for it from the start.
If someone comes to you with an AI project and says "it's mostly prompt engineering," you now know that's a red flag. Either they don't understand the scope, or they're planning to skip the hard parts and deal with the consequences later. The research on why AI agents fail says the same thing: the biggest failure category is specification and design, not the AI itself.
If someone says "we're treating governance as phase two," that's also a red flag. Governance that gets bolted on after the technical build is almost always incomplete, and you spend months retrofitting it.
The right approach: acknowledge the ratio. Plan for it. Staff for it. Build the technical and organisational work in parallel. Document everything. Validate continuously. And understand that the pilot succeeding technically is just the beginning. The real work is making it stick.