Where to Draw the Line Between AI and Human Work
A framework for deciding what to delegate to AI and what your team should own
The story in enterprise AI right now is the agentic model. AI agents do the work. Humans oversee. It sounds efficient, and in some contexts it will be.
But the question underneath it is reliability.
There is an important distinction that I think most of the conversation skips over. AI capability has been advancing rapidly. Arguably faster than any technology in recent memory. But reliability, the kind that lets you take a human out of the loop and trust the outcome, has improved much more slowly.
Automation depends on reliability, not capability.
A system that is 95% capable and 60% reliable is not ready for autonomous operation. It is ready for supervised use. Those are fundamentally different operating models with fundamentally different human requirements. And the choice between them may be the most consequential design decision organizations face right now: where do you draw the line between what AI does and what humans do?
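Here is a back-of-the-envelope sketch of why. The numbers are illustrative, not measurements of any particular system, but notice that chaining ten steps at 95% per-step reliability lands you right around 60% end to end. That is one way a 95%-capable system ends up behaving 60%-reliable in autonomous operation.

```typescript
// Illustrative arithmetic, not a benchmark: per-step reliability compounds
// across an unsupervised chain of agent actions.
function endToEndReliability(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

for (const steps of [1, 5, 10, 20]) {
  const pct = endToEndReliability(0.95, steps) * 100;
  console.log(`${steps} unsupervised steps: ${pct.toFixed(1)}% end-to-end`);
}
// 1  -> 95.0%
// 5  -> 77.4%
// 10 -> 59.9%
// 20 -> 35.8%
```

In this simple model, supervised use breaks the chain: every human checkpoint resets the compounding.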
The Gap Between Capable and Reliable
You can see this dynamic play out across industries. Autonomous driving is the most visible example. Waymo spent over a decade closing the gap between a vehicle that could handle most driving scenarios and one reliable enough to operate without a human ready to take over. That gap was not about capability. It was about reliability.
But you do not need to leave software to see it. Every QE leader has lived a version of this. A test automation framework that works for 90% of scenarios and fails unpredictably on the other 10% is not a reliable framework. It is a framework that generates maintenance overhead and erodes trust. The capability was there. The reliability was not. And the team paid for it.
AI plays out the same dynamic at a much larger scale. Its failure modes are novel, inconsistent, and difficult to anticipate. That is a real constraint on how much autonomy you can safely delegate to it, and it should inform how you design the human-AI relationship in your organization.
A Risk Worth Naming
The default agentic model looks like this: AI agents perform the work. Humans monitor the output. Humans intervene when something goes wrong.
On paper, this preserves human judgment. But I think there is a real risk that it hollows it out over time.
Monitoring is not the same as doing.
When a QE lead actively designs a test strategy, they are building and reinforcing expertise. Working through scenarios. Triaging failures against domain knowledge. Deciding what coverage a release needs. Each decision sharpens their mental model. Each edge case deepens their judgment.
When that same QE lead is repositioned as a monitor, something different happens. Reviewing AI-generated test plans. Approving AI-made triage decisions. Watching dashboards for anomalies. The cognitive engagement drops. The expertise that was supposed to backstop the system begins to atrophy. Not immediately. Not obviously. But steadily.
I think the agentic enterprise often assumes that supervision sustains human judgment. My experience suggests that expertise develops through active engagement, not passive observation. When reliability is still evolving, and in enterprise AI it absolutely is, that distinction matters. It may be the difference between a system that gets safer over time and one that gets more fragile.
The Delegation Line
This is the question I keep coming back to: where do you draw the line between what AI does and what humans do?
I think many organizations will default to drawing it in a place that feels efficient but creates long-term risk. The natural tendency is to delegate judgment to AI and ask humans to monitor, because that maximizes the amount of work the AI handles. But it creates the risk I described above: the human’s role becomes reactive, supervisory, and increasingly disconnected from the cognitive work that built their expertise.
The model I have arrived at draws the line differently. Delegate execution to AI. Keep humans doing the thinking.
I call this the Delegation Line. On one side: the cognitive work. Design, strategy, risk assessment, judgment calls, domain reasoning. Humans own this. On the other side: the mechanical work. The repetitive, time-consuming, well-defined tasks that consume most of a team’s capacity but do not require human judgment on each instance. AI owns this.
The distinction matters because of where reliability breaks down. AI fails most dangerously in novel situations requiring judgment. Those are exactly the situations where you need human expertise to be sharp. AI is most reliable in repetitive execution against well-defined parameters. That is exactly the work that consumes the most human capacity today.
Draw the line at execution:
Human expertise stays sharp because humans are actively engaged in the hard problems
AI reliability is highest because it is operating in its most predictable mode
The human is not a monitor. They are a decision-maker who happens to have an execution engine
Draw the line at judgment:
Human expertise atrophies because the cognitive work has been offloaded
AI reliability is lowest because it is making novel decisions in ambiguous contexts
The human is a monitor whose ability to catch AI errors may degrade over time
One model gets safer as it scales. The other gets more fragile.
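If it helps to see the line as something you can operationalize, here is a minimal sketch of the Delegation Line as a routing rule. The task names and the requiresJudgment flag are my own illustration, not a prescribed taxonomy.

```typescript
// Hypothetical sketch: judgment stays with humans, well-defined execution
// goes to agents. The routing key is the task, not the technology.
type Owner = "human" | "agent";

interface Task {
  name: string;
  requiresJudgment: boolean; // novel, ambiguous, or risk-bearing decisions
}

const delegationLine = (task: Task): Owner =>
  task.requiresJudgment ? "human" : "agent";

const backlog: Task[] = [
  { name: "set release risk posture", requiresJudgment: true },
  { name: "decide coverage for the payments flow", requiresJudgment: true },
  { name: "generate regression scripts", requiresJudgment: false },
  { name: "prepare test data", requiresJudgment: false },
  { name: "run the suite and compile the report", requiresJudgment: false },
];

backlog.forEach((task) => console.log(`${task.name} -> ${delegationLine(task)}`));
```

The routing key is not "can AI do this task" but "does this instance require judgment." That is the whole line.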
A Model You May Already Operate
Here is what I find interesting. If you run a QE organization in the enterprise, you may already operate a version of the Delegation Line. You just do not call it that.
Most enterprise QE functions split into two layers. Your FTE QE leads and managers own the cognitive work: test strategy, coverage decisions, risk assessment, release readiness, standards. They are accountable for quality. That does not change regardless of who or what executes underneath them.
The execution layer is handled by a delivery team. Writing scripts, preparing test data, running tests, triaging results, compiling reports. In many organizations, this is an SI partner managing onshore and offshore resources. The QE lead sets the direction. The delivery team converts that direction into output.
This is the Delegation Line, drawn by hand. The QE lead thinks. The delivery team executes. The accountability model is clear. And it works.
The constraint is not the model. It is that the execution layer scales linearly. When a sprint delivers more changes than the delivery team can absorb, the options are: delay the release, reduce coverage, or add headcount. All three are expensive. Capacity is directly proportional to how many people you can staff, train, and manage.
Now consider applying AI to that same model. Not by replacing the QE lead with a monitor, but by evolving the execution layer.
Above the line, the QE lead’s cognitive work stays the same and gets amplified. They co-design test strategy with AI as a thinking partner. Not monitoring AI output. Working alongside it, the same way a senior tester works with a peer in pair testing. Challenging assumptions. Surfacing edge cases. Applying domain expertise that comes from years inside the business. The QE lead makes the final design decisions.
Below the line, the execution layer becomes agentic. AI agents handle scripting, data preparation, test execution, triage, and reporting. The same work the delivery team does today, but without the linear scaling constraint. The work is well-defined and repetitive. It is exactly where AI reliability is highest and where human capacity is most constrained.
The accountability model does not change. The QE lead is still accountable for quality. Nothing ships without human approval. The difference is that the execution capacity underneath them is no longer limited by headcount.
And the QE lead never becomes a monitor. They are doing the same cognitive work they do today, with more leverage.
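To make the thinking-partner side concrete, here is a minimal sketch under my own assumptions. proposeEdgeCases stands in for whatever model call you use; it is hypothetical, not a real API.

```typescript
// Above the line: the AI widens the search for edge cases, the human
// applies domain judgment, and the final design decision stays human.
interface EdgeCase {
  scenario: string;
  rationale: string;
}

// Hypothetical stand-in for a model call that brainstorms candidates.
async function proposeEdgeCases(feature: string): Promise<EdgeCase[]> {
  return [
    { scenario: `${feature}: two users edit the same record`, rationale: "race condition" },
    { scenario: `${feature}: session expires mid-transaction`, rationale: "auth boundary" },
  ];
}

async function coDesignTestStrategy(
  feature: string,
  leadAccepts: (candidate: EdgeCase) => boolean, // the QE lead's judgment
): Promise<EdgeCase[]> {
  const candidates = await proposeEdgeCases(feature);
  return candidates.filter(leadAccepts); // design decisions remain human
}
```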
The Execution Layer Is Evolving
I want to be clear about what I am suggesting and what I am not.
This is not about replacing delivery teams. It is about recognizing what the execution layer is going to look like in 18 months. The managed services model that most enterprises already operate is, in my view, the right starting architecture for agentic AI. The accountability structure works. The roles and responsibilities are clear. The evolution is in how the execution layer delivers: from purely labor-based to AI-augmented, and eventually to agentic.
I have been building a version of this. Five specialized AI agents working under one human QA analyst for Playwright E2E testing. The human retains full oversight. Nothing gets tested without approval. Nothing gets reported without review. But the human is not watching AI work. The human is doing the thinking. The AI is doing the execution.
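For flavor, here is how those two gates might be wired. This is a simplified sketch, not the actual five-agent implementation; the agent and function names are hypothetical.

```typescript
// Below the line: agents draft and execute, but a human gate sits before
// execution and again before anything is reported.
interface TestRun {
  specs: string[]; // e.g. Playwright spec files drafted by an agent
  results?: string;
}

// Stubbed agents: in practice, these are model-driven workers.
const scriptingAgent = async (stories: string[]): Promise<TestRun> => ({
  specs: stories.map((story) => `${story}.spec.ts`),
});

const executionAgent = async (run: TestRun): Promise<TestRun> => ({
  ...run,
  results: `ran ${run.specs.length} specs`,
});

async function gatedPipeline(
  stories: string[],
  analystApproves: (run: TestRun) => boolean, // the human QA analyst's gate
): Promise<TestRun> {
  const draft = await scriptingAgent(stories);
  if (!analystApproves(draft)) throw new Error("Nothing gets tested without approval");

  const run = await executionAgent(draft);
  if (!analystApproves(run)) throw new Error("Nothing gets reported without review");

  return run; // only human-reviewed results leave the pipeline
}
```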
The architecture maps directly to the operating model QE organizations already run. The QE lead’s role does not change. The execution capacity underneath them does.
Why This Matters Now
Gartner projects that 40% of enterprise applications will integrate task-specific AI agents by end of 2026. Goldman Sachs has deployed AI agents across its 12,000-person technology organization. Amazon used AI agents to upgrade tens of thousands of production applications, saving an estimated 4,500 developer-years.
The shift is here. Agents are going to gain capability faster than most organizations can absorb it. The question is not whether to adopt them. It is how.
And “how” comes down to where you draw the Delegation Line.
Organizations that delegate judgment to AI and position humans as monitors risk a compounding problem: the expertise they need to oversee AI safely may erode precisely because the humans are no longer doing the work that built that expertise. The more they rely on AI judgment, the less capable the humans become of catching AI errors.
Organizations that delegate execution to AI and keep humans engaged in cognitive work will likely build a different kind of system. One where human expertise stays sharp, AI operates in its most reliable mode, and the whole system gets stronger over time.
The Delegation Line is not a technical decision. It is an operating model decision. And the operating model most enterprises already use, where senior people own the thinking and delivery teams handle the execution, is a strong foundation to build on.
Draw the line in the right place. Delegate execution, not judgment. Keep your people thinking.
That is how I believe you build an agentic system that actually works.
