Everyone's asking the wrong question. In strategy sessions, in vendor presentations, on team calls, the question is always the same: "How capable is this agent?" But capability and autonomy are not the same thing. A highly capable agent processing sensitive data might need to operate at a low autonomy level. A simple agent handling internal formatting might safely run fully autonomously. Capability sets the ceiling. Governance sets the level.
This distinction matters because organisations are making autonomy decisions by accident, not by design. They build an agent, it works in testing, and it ships at whatever autonomy level the deployment just happens to support. Nobody asked whether that level made sense for the data, the stakes, or the organisation's appetite for error. Nobody documented why. And nobody specified what would have to change for the autonomy level to move.
The capability trap
The confusion comes from a conflation in how we talk about AI. When we say an agent is "capable," we usually mean it can do the task well. When we say it's "autonomous," we mean it does the task without asking for permission. Those are different dimensions. A highly intelligent system can be built to ask for confirmation on every step. A simple system can be built to run without oversight.
Research from the University of Washington formalises what practitioners already know from experience. Feng et al. (2025) describe five operational levels: Operator, where the human commands each action; Collaborator, where the human and agent work together; Consultant, where the agent proposes and the human decides; Approver, where the agent executes and the human reviews; and Observer, where the agent runs fully independently. The same system, the same model, the same tools can operate at any of these levels depending on how you configure it.
The AGENTIC Framework maps these to the Collaboration Spectrum: Human-run, Human-led/AI-assisted, AI-led/human-verified, AI-run. The terminology differs but the insight is identical. Autonomy is not a property that emerges from capability. It's a design choice.
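To make the spectrum concrete, here is a minimal Python sketch of the levels as an ordered enum. The class and member names are illustrative, not part of either framework, and the mapping between the five research levels and the four Collaboration Spectrum levels is approximate.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """AGENTIC Collaboration Spectrum, ordered from least to most autonomous.
    Comments note the roughly corresponding Feng et al. (2025) level(s)."""
    HUMAN_RUN = 1               # Operator: the human commands each action
    HUMAN_LED_AI_ASSISTED = 2   # Collaborator / Consultant: the agent proposes, the human decides
    AI_LED_HUMAN_VERIFIED = 3   # Approver: the agent executes, the human reviews
    AI_RUN = 4                  # Observer: the agent runs fully independently
```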
Stop asking "how capable is this agent?" Start asking "how much autonomy should we give it?" Capability sets the ceiling. Governance sets the level.
This is where frameworks like the Cloud Security Alliance's enterprise AI taxonomy (2026) become practical. They acknowledge that autonomy levels exist on a spectrum and that the operational choice matters more than the capability. An organisation can have the most sophisticated agent available and choose to run it at a collaborator level for sensitive workflows. That's not a limitation of the technology. It's a governance choice.
The autonomy rationale
In the AGENTIC Framework, this choice is explicit. When a workflow reaches the Greenlight stage, every step gets an autonomy rationale. Three questions:
First: what's the maximum autonomy level this step should operate at, and why? Not the minimum the technology can support, but the maximum that makes sense given the data, the stakes, the organisation's readiness, and the regulatory environment. If the step involves financial transactions or legal interpretation, the maximum might be quite low. If it's internal data transformation, it might be higher.
Second: what would have to be true for it to move up one level? This is where the insight lives. For a step to move from Human-run to Human-led, you might need lower error rates on similar data. For Human-led to AI-led, you might need audit trails, better exception handling, or domain expert confidence. The conditions make promotion possible without requiring someone to write a new specification later.
Third: what would trigger an immediate reduction? A spike in error rates. A change in the type of data. A new regulatory requirement. New personnel handling the workflow who don't have the judgment history. These are the circuit breakers. They exist in the rationale so Nurture and Track know what to monitor and when to pull back.
A single workflow might need four different autonomy levels across its steps. Setting one level for the whole workflow either over-automates the sensitive parts or under-automates the routine ones.
The rationale documents consequence of error, reversibility, data sensitivity, edge case frequency, and regulatory constraints. It travels with the specification so Engineer knows what control mechanisms matter most. Nurture knows what to monitor. Track knows when the conditions for promotion have been met or when demotion is warranted.
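Continuing the Python sketch above, the rationale could be captured as a simple record per step. The field names are illustrative rather than prescribed by the framework; they follow the three questions and the documented factors described here.

```python
from dataclasses import dataclass, field

@dataclass
class AutonomyRationale:
    """One rationale per workflow step, written at Greenlight and carried
    through Engineer, Nurture, and Track. Field names are illustrative."""
    step: str
    max_level: AutonomyLevel           # question 1: the ceiling, not the technical minimum
    why: str                           # data, stakes, readiness, regulatory environment
    promotion_conditions: list[str]    # question 2: what must be true to move up one level
    demotion_triggers: list[str]       # question 3: circuit breakers that force a step down
    consequence_of_error: str
    reversibility: str
    data_sensitivity: str
    edge_case_frequency: str
    regulatory_constraints: list[str] = field(default_factory=list)
```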
Why per-step matters
Most frameworks set one autonomy level per workflow. That approach prevents a more effective design: different autonomy levels across the steps within a single workflow. A data ingestion step might be fully AI-run because it's straightforward and the cost of error is low. The formatting step after it might also be fully automated. A judgment call in the middle that requires understanding context might stay at Human-led. The final sign-off on an edge case might be Human-run. Same workflow. Four different levels.
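As a hypothetical configuration, that example might look like the following; the step names are invented for illustration and reuse the enum sketched earlier.

```python
# Hypothetical per-step levels for the workflow described above:
# one workflow, four different autonomy levels.
workflow_steps = {
    "data_ingestion":     AutonomyLevel.AI_RUN,                 # straightforward, low cost of error
    "formatting":         AutonomyLevel.AI_RUN,                 # routine transformation
    "contextual_review":  AutonomyLevel.HUMAN_LED_AI_ASSISTED,  # judgment call requiring context
    "edge_case_signoff":  AutonomyLevel.HUMAN_RUN,              # final sign-off stays with the human
}
```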
Designing per step rather than per workflow prevents two common problems. The first is over-automation: you see the workflow overall as "ready for autonomy" so you set everything to AI-run, but one step in the middle actually requires human judgment and now that judgment gets bypassed. The data quality suffers. The second is under-automation: one step legitimately needs human review so the whole workflow gets set to Human-run, and now you're paying for human time on routine steps that would be faster and cheaper if automated.
The per-step approach also acknowledges something that most organisations know but most frameworks ignore: workflows evolve. A step that was Human-run for six months might now be ready to move to Human-led. Another step that you thought was safe to automate turns out to have edge cases. With per-step autonomy rationales, those changes become evolutionary adjustments, not wholesale redesigns. Change one step's autonomy level. Keep the others stable. This is how operational agility actually works.
Promotion isn't just performance
When the data looks good, when override rates stay low, when errors decline, it's tempting to promote the autonomy level. Let the agent run without as much human oversight. The team is comfortable with it. Performance metrics support it. What could go wrong?
The dangerous assumption is that low override rates mean the human doesn't need to be involved anymore. Sometimes low overrides mean the system is running well. Sometimes they mean the human isn't catching errors because they're not paying close attention anymore. This is the core insight from Bainbridge's paradox of automation: the more reliable the automation, the less humans practise their judgment, and the worse they become at intervening when they need to. When autonomy increases, the skill required to intervene actually increases.
Promotion requires three things. First, the performance data, yes. But second, an honest assessment from the team about whether they can still intervene effectively at the higher autonomy level if they need to. Can they step back in? Do they maintain the skills? And third, governance sign-off. The governance role doesn't run the project. But the governance role sets the boundaries the project operates within. If the governance model isn't adequate for the higher autonomy level, you don't promote just because the metrics are good. You upgrade the governance first.
Low override rates might mean the human isn't catching errors, not that there are none. Good performance data is necessary for promotion but not sufficient.
The most common mistake is treating good override rates as evidence that the human can step back further. The override rate tells you the system is working as intended. It doesn't tell you the human has the judgment depth to manage a failure if one happens. Promotion in the AGENTIC Framework requires metrics and sign-off and an honest conversation about what the team would need to do if the autonomous system broke tomorrow. If that conversation reveals gaps, you stay at the current level until you fix them. That's not being cautious. That's being operational.
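A sketch of that gate, under the assumption that the three inputs are assessed separately: all three are required, and metrics alone never carry the decision.

```python
def promotion_approved(performance_ok: bool,
                       team_can_still_intervene: bool,
                       governance_signed_off: bool) -> bool:
    """Illustrative promotion gate: good metrics are necessary but not
    sufficient; the team assessment and governance sign-off must also hold."""
    return performance_ok and team_can_still_intervene and governance_signed_off
```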
How this travels through the framework
The autonomy rationale gets written at Greenlight. It travels into Engineer as part of the specification. Engineer builds the control mechanisms that enforce the declared autonomy level. If the level is Human-led, the system asks for confirmation on high-impact decisions. If it's AI-run, the system executes immediately and logs for audit.
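A minimal sketch of such a control mechanism, assuming the host system supplies the action, a confirmation prompt, and an audit logger as callables; for brevity it collapses the two human-forward levels into a single confirmation path.

```python
def run_step(action, level: AutonomyLevel, high_impact: bool, confirm, audit_log):
    """Illustrative runtime gate enforcing the declared autonomy level.
    `action`, `confirm`, and `audit_log` are assumed callables from the host system."""
    if level <= AutonomyLevel.HUMAN_LED_AI_ASSISTED and high_impact:
        if not confirm(action):      # Human-run / Human-led: pause for confirmation on high-impact decisions
            return None              # human declined, the step does not run
    result = action()                # AI-run (and confirmed) paths execute immediately
    audit_log(level=level, action=action, result=result)  # every execution is logged for audit
    return result
```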
At Nurture, the system monitors against the declared level. Are the decision patterns staying within the expected boundaries? Have the conditions for promotion been met? Are any of the circuit breakers being triggered? The autonomy rationale tells Nurture exactly what to measure.
At Track, when capability evolves or organisational readiness changes, the review looks back at the rationale. Is it still accurate? Should this step move up a level? Should it move down? The rationale makes that conversation explicit and data-driven instead of vague and political. It answers the question "why are we operating at this autonomy level?" with evidence, not instinct.
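A sketch of the per-step review logic Nurture and Track can apply, assuming the rationale's conditions and triggers have already been evaluated against the monitoring data; the precedence shown here (circuit breakers win over promotion signals) is an assumption consistent with the "immediate reduction" wording above, not a stated rule of the framework.

```python
def review_step(promotion_conditions_met: bool, circuit_breaker_tripped: bool) -> str:
    """Illustrative per-step review outcome for Nurture and Track."""
    if circuit_breaker_tripped:
        return "demote"               # immediate reduction, per the rationale's triggers
    if promotion_conditions_met:
        return "propose_promotion"    # still needs the team assessment and governance sign-off
    return "hold"                     # keep the current level and keep monitoring
```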
Autonomy design is where governance becomes operational
Abstract principles about responsible AI are useful. Frameworks that say "ensure human oversight" are correct but vague. The AGENTIC autonomy rationale makes it concrete. Not "this step should have human oversight." Rather: "this step operates at Human-led/AI-assisted because the cost of error on edge cases is high and we've only seen the system handle 92% of variation. When we hit 97% on the test set, and the Workflow Owner confirms the team's judgment is still sharp, we'll move to AI-led. If error rate exceeds 8% in production, we'll drop back to Human-run."
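Written as a record in the terms of the earlier sketch, that rationale might look like this; the percentages come from the example above, while the remaining field values are invented for illustration.

```python
edge_case_handling = AutonomyRationale(
    step="edge_case_handling",
    max_level=AutonomyLevel.HUMAN_LED_AI_ASSISTED,
    why="cost of error on edge cases is high; the system handles 92% of observed variation",
    promotion_conditions=[
        "97% handled on the test set",
        "Workflow Owner confirms the team's judgment is still sharp",
    ],
    demotion_triggers=["production error rate exceeds 8%"],
    consequence_of_error="high",        # illustrative value
    reversibility="low",                # illustrative value
    data_sensitivity="moderate",        # illustrative value
    edge_case_frequency="recurring",    # illustrative value
)
```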
That's operational. That's what governance actually looks like when it's built into how you work, not added on top as a compliance box.