Inside the Machine: How a 5-Agent AI Testing System Actually Works
This is the third piece in a series. Meet Your AI-Powered QA Team introduced the concept. Where to Draw the Line Between AI and Human Work established the framework. This article shows you what the system looks like from the inside - how the pieces connect, why they're shaped the way they are, and what changes when you run it against a real application.
The Problem This Solves
Most QE organizations hit the same wall. Test automation can’t keep pace with delivery. The backlog of unautomated scenarios grows every sprint. Regression suites take longer to maintain than they save. And the team is stuck in a loop: write scripts, fix scripts, rewrite scripts, fall behind, repeat.
The usual response is to add headcount or buy a platform. Both scale linearly. Double the work, double the cost. And neither addresses the core issue: the cognitive work of test design, coverage assessment, and risk evaluation is bottlenecked on the same people doing the mechanical work of writing and maintaining scripts.
What if you separated those two things?
Not in theory. In architecture.
The Operating Model
In Where to Draw the Line Between AI and Human Work, I described an operating model where humans own the cognitive work and AI handles execution. This isn’t a new idea in QE. Most enterprise testing organizations already work this way with people: FTE QE leads own strategy, coverage decisions, and release readiness. A delivery team handles scripting, data prep, test runs, and triage.
The system I built applies that same structure to AI agents. Five of them, each with a defined role, communicating through files, coordinated by a lead agent that acts as the single interface between the human and the execution layer.
The human stays above the Delegation Line. The agents sit below it.
That sentence sounds simple, but it drives every architectural decision in the system. Who can create test scenarios? The agent drafts them, but the human approves them. Who decides what to test? The human. Who writes the Playwright code? The agent. Who decides if a test failure is a bug or a flake? The human - with the agent's analysis as input.
The line is structural, not aspirational. The system enforces it.
Five Agents, One Team
The system uses five specialized agents. Not one general-purpose AI assistant. Five, each with a narrow scope and clear boundaries.
The Architect
This is the lead. It’s the only agent the human talks to. It receives requests, decides what needs to happen, delegates to specialists, reviews their outputs, and assembles deliverables. It analyzes applications, runs impact analysis on code changes, manages the review workflow, and interprets test results.
What it doesn’t do: write test scenarios or Playwright code. That’s what the other agents are for. The Architect coordinates. It doesn’t do everything.
The Scenario Designer
Designs structured test scenarios across four coverage categories: happy path, edge cases, negative scenarios, and error conditions. Every scenario has four layers: the user flow, validation points, expected results, and instrumentation (screenshots, network captures, console logs). Outputs go to markdown files with a parallel review file for the human to approve.
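The four-layer structure can be sketched as a TypeScript type. Field names here are my assumptions for illustration; the real system emits these as markdown files, not typed objects:

```typescript
// Hypothetical shape of one scenario, covering the four layers described above.
type Category = "happy-path" | "edge-case" | "negative" | "error-condition";

interface Scenario {
  id: string;
  category: Category;
  flow: string[];          // layer 1: the user flow, step by step
  validations: string[];   // layer 2: validation points
  expected: string[];      // layer 3: expected results
  instrumentation: {       // layer 4: evidence captured at run time
    screenshots: boolean;
    networkCapture: boolean;
    consoleLogs: boolean;
  };
}

// Example instance, purely illustrative.
const example: Scenario = {
  id: "TC-001",
  category: "happy-path",
  flow: ["open /login", "enter valid credentials", "submit"],
  validations: ["dashboard URL reached", "welcome banner visible"],
  expected: ["user lands on the dashboard"],
  instrumentation: { screenshots: true, networkCapture: true, consoleLogs: true },
};
```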
The Script Engineer
Generates Playwright TypeScript from approved scenarios. Creates test specs, Page Objects, fixtures, and helpers. Enforces a hard gate: it won’t generate code for scenarios the human hasn’t reviewed. This is the architectural enforcement of the Delegation Line - cognitive work must be complete before execution begins.
The Reporter
Produces formal deliverables: Requirements Traceability Matrix, coverage reports, release readiness assessments, version delta comparisons. Everything traceable. Everything in markdown that can go into a release package or a stakeholder review.
The Validator
Quality control for the agents themselves. Validates outputs from every other agent before the human sees them. Three modes: scenario validation (structure, ID uniqueness, coverage completeness), script validation (TypeScript compilation, import resolution, selector accuracy), and results validation (failure categorization, accuracy). Issues go back to the originating agent for fixes - two rounds max - then escalate to the human.
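The fix-then-escalate loop might look like this sketch. All names are hypothetical, and the placeholder check (unique scenario IDs) stands in for the much richer structural and technical rules described above:

```typescript
// Sketch of the Validator's two-round retry loop (hypothetical interfaces).
type ValidationResult = { ok: boolean; issues: string[] };

interface Agent {
  // The originating agent produces output, optionally given validator feedback.
  produce(feedback: string[]): string;
}

function validate(output: string): ValidationResult {
  // Placeholder structural check: every scenario needs a unique "ID:" line.
  const ids = [...output.matchAll(/^ID:\s*(\S+)/gm)].map(m => m[1]);
  const issues: string[] = [];
  if (ids.length === 0) issues.push("no scenario IDs found");
  if (new Set(ids).size !== ids.length) issues.push("duplicate scenario IDs");
  return { ok: issues.length === 0, issues };
}

const MAX_FIX_ROUNDS = 2; // two rounds max, then escalate to the human

function validateWithRetry(agent: Agent): { output: string; escalated: boolean } {
  let output = agent.produce([]);
  for (let round = 0; round < MAX_FIX_ROUNDS; round++) {
    const result = validate(output);
    if (result.ok) return { output, escalated: false };
    // Issues go back to the originating agent for a fix attempt.
    output = agent.produce(result.issues);
  }
  return { output, escalated: !validate(output).ok };
}
```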
Why Five?
Because specialization creates reliability. A single general-purpose agent that does everything is capable but unpredictable. Narrow agents with clear boundaries are more consistent. Each one has a defined input, a defined output, and a defined scope of responsibility. When something goes wrong, you know exactly where to look.
This mirrors how high-performing QE teams work. You don’t have one person who does strategy, scripting, execution, reporting, and quality control. You have specialists who are excellent at their specific jobs, coordinated by a lead who understands the full picture.
File-Based Communication
The agents don’t share memory. They don’t use message queues or databases. They communicate through files.
This is a deliberate choice, not a limitation. Files are auditable. Files are git-traceable. Files survive session boundaries. A new session can reconstruct the entire project state by reading the file system. No context to rebuild. No state to restore. Just read the files.
The practical impact: every test scenario, every review decision, every script, every result, every report is a file you can inspect, version, diff, and trace. When a VP asks “what changed between v2.3 and v2.4 and how do we know the regression suite covers it?” - the answer is in the files. Not in someone’s head. Not in a tool’s database. In files your team can read.
The Review Workflow
This is where most AI testing approaches fall apart. They generate tests and assume they’re correct. Or they put a human “in the loop” as a rubber stamp - a monitor watching AI work, clicking approve on things they barely read because the volume is too high.
This system does the opposite. The human is the decision-maker, not the monitor.
When the Scenario Designer produces test scenarios, it also produces a review file. Each scenario gets a line item: ACCEPT, REJECT, or NEEDS CHANGES. The human reads each scenario, applies judgment (is this the right coverage? does this match how real users behave? is this priority correct?), and marks their decision. The system won’t proceed until the review is complete.
This isn’t a bottleneck. This is the point. The cognitive work - deciding what to test and why - is the highest-value activity in QE. Automating it away doesn’t make your testing better. It makes your testing unexamined.
The review gate enforces this structurally. The Script Engineer literally cannot generate code for scenarios that haven’t been approved. It checks. If the review file doesn’t exist or has unresolved items, it stops and tells the human.
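A sketch of how such a gate could work, assuming a hypothetical one-line-per-scenario review format (the real file format is not shown here):

```typescript
// Hypothetical review-file format: one line per scenario, e.g.
//   TC-001: ACCEPT
//   TC-002: NEEDS CHANGES
// A line with no decision is unresolved and blocks the run.
type Decision = "ACCEPT" | "REJECT" | "NEEDS CHANGES";

interface ReviewItem { id: string; decision?: Decision }

function parseReview(text: string): ReviewItem[] {
  return text
    .split("\n")
    .map(l => l.trim())
    .filter(l => l.startsWith("TC-"))
    .map(l => {
      const [id, rest] = l.split(":").map(s => s.trim());
      const decision = ["ACCEPT", "REJECT", "NEEDS CHANGES"].includes(rest)
        ? (rest as Decision)
        : undefined;
      return { id, decision };
    });
}

// The Script Engineer's gate: only ACCEPTed scenarios may be scripted,
// and any unresolved line stops the whole run and reports back to the human.
function approvedForScripting(items: ReviewItem[]): string[] {
  const unresolved = items.filter(i => i.decision === undefined);
  if (unresolved.length > 0) {
    throw new Error(`Review incomplete: ${unresolved.map(i => i.id).join(", ")}`);
  }
  return items.filter(i => i.decision === "ACCEPT").map(i => i.id);
}
```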
Impact Analysis and Version Control
Here’s where the system earns its keep in a real release cycle.
When a new version of the application lands, the Architect reads the git diff and sorts the work into four buckets: NEW (net-new features needing full test design), MUST UPDATE (existing tests that will break), CHECK (existing tests that might be affected), and UNAFFECTED (tests with no relation to any change, inherited as-is).
This produces a version manifest - a structured record of what changed, what’s affected, and what needs work before you can run the regression suite. The manifest chains to previous versions, so you can trace coverage decisions across the entire release history.
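The four-bucket categorization could be sketched like this, under the simplifying assumption that each test declares which source areas it covers (real impact analysis works from the git diff itself, which is messier):

```typescript
// Illustrative impact-analysis sketch; all names are assumptions.
type Bucket = "NEW" | "MUST UPDATE" | "CHECK" | "UNAFFECTED";

interface TestRecord { id: string; covers: string[] } // source areas covered
interface Change { area: string; kind: "added" | "modified" | "adjacent" }

function categorize(tests: TestRecord[], changes: Change[]) {
  const modified = new Set(changes.filter(c => c.kind === "modified").map(c => c.area));
  const adjacent = new Set(changes.filter(c => c.kind === "adjacent").map(c => c.area));
  const covered = new Set(tests.flatMap(t => t.covers));

  const buckets = new Map<string, Bucket>();
  for (const t of tests) {
    if (t.covers.some(a => modified.has(a))) buckets.set(t.id, "MUST UPDATE");
    else if (t.covers.some(a => adjacent.has(a))) buckets.set(t.id, "CHECK");
    else buckets.set(t.id, "UNAFFECTED");
  }
  // Added areas with no covering test are net-new and need full design.
  const needsDesign = changes
    .filter(c => c.kind === "added" && !covered.has(c.area))
    .map(c => c.area);
  return { buckets, needsDesign };
}
```

The manifest is then just this result serialized to a file, with a pointer to the previous version's manifest so coverage decisions chain across releases.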
The practical outcome: instead of running the full suite and hoping, you know exactly what needs attention before a single test executes. Your team’s time goes to the tests that matter, not to re-running 400 unaffected scenarios to see green checkmarks.
For organizations running biweekly or continuous delivery, this is the difference between a testing practice that keeps up and one that’s perpetually behind.
What This Looks Like in Practice
A typical workflow for a new sprint:
The QE lead gets the sprint scope. They run impact analysis against the code changes. The system produces a manifest showing 3 new features needing test design, 8 existing tests that need updates, and 12 that should be verified. The remaining 180 tests are unaffected and inherit from the previous version.
The lead runs test design for the new features. The Scenario Designer produces 15 scenarios across the 3 features. The lead reviews them - accepts 12, marks 2 for changes (one is missing an edge case, one has the wrong priority), rejects 1 (it’s duplicating coverage from another component). The system incorporates the feedback.
Scripts are generated from approved scenarios. The Validator checks them for compilation errors, missing imports, and selector issues before the lead sees them. Tests run. Results come back categorized: 2 real failures (bugs), 1 environment issue, 1 flaky test.
The lead files the bugs, flags the environment issue for the DevOps team, and marks the flaky test for investigation. A release readiness report is generated automatically, showing coverage percentages, open defects, and risk areas.
Total cognitive effort from the lead: the review decisions, the failure triage, and the release call. Everything else was execution - and the agents handled it.
What Changes at Scale
The system supports multiple applications from the same repository. Each app is an isolated project with its own profile, test cases, scripts, and reports. The methodology stays the same. The agents stay the same. Only the context switches.
For a QE organization managing 5 or 10 or 20 applications, this means the operating model is consistent across the portfolio. The same workflow, the same review discipline, the same traceability, the same reporting structure - regardless of which app you’re looking at.
The scaling math changes too. Adding a new application to the portfolio doesn’t require proportional headcount. The cognitive work - strategy, coverage design, risk assessment - still needs a human. But the execution work - scripting, running, triaging, reporting - is handled by agents that don’t have a capacity ceiling the way people do.
This doesn’t mean zero headcount growth. It means the ratio changes. One strong QE lead can cover more ground when the mechanical work isn’t consuming their calendar.
The Scope - and the Starting Point
What I’ve described so far is a system for web UI end-to-end functional testing. It handles the full lifecycle - scenario design, script generation, execution, triage, impact analysis, version management, and reporting. For that specific scope, it’s comprehensive.
But web UI E2E is one layer of a testing organization’s responsibility. And this system is deliberately scoped to that layer as a starting point, not because it’s the only thing that matters, but because it’s the right place to prove the model before expanding.
What’s not covered yet
Other testing types. API testing, ETL testing, data validation, contract testing - these are different execution patterns with different agent architectures. The Delegation Line applies to all of them, but the agents below the line look different. A web UI agent team needs Playwright and browser automation. An API testing team needs HTTP clients, schema validators, and contract tools. The methodology transfers. The implementation doesn’t.
Non-functional testing. Security testing, performance testing, accessibility, load testing - these are specialized disciplines with their own toolchains, expertise requirements, and risk models. Each one could have its own agent team below the Delegation Line, but the cognitive work above the line requires domain-specific expertise that’s distinct from functional testing.
Test data operations. Building test data, seeding backend databases, managing environment state, provisioning test accounts - these are operational capabilities that sit underneath the testing lifecycle. The current system assumes test data exists. A mature implementation needs agents that can construct and manage it.
Tool integrations. Requirements management systems, test management platforms, defect tracking, CI/CD pipelines - the current system is self-contained. In an enterprise context, these agent teams need to integrate with the existing toolchain: pulling requirements from Jira, syncing coverage to a TMS, filing defects automatically, triggering from pipeline events.
Exploratory testing. This system is structured, scenario-based automation. Exploratory testing still needs a human with curiosity and domain knowledge probing the application in ways no scenario anticipated. That's above the Delegation Line by definition.
The point is not that these gaps are problems. The point is that this is a starting point - a proven architecture for one well-defined scope that’s designed to extend.
Governance for Non-Deterministic Systems
Here’s the part most AI testing conversations skip entirely.
These are agentic systems. They use large language models. By their nature, they are non-deterministic. Give the same agent the same input twice and you may get slightly different outputs. The scenarios might be worded differently. The scripts might use different selector strategies. The failure analysis might emphasize different factors.
This is not a flaw to be eliminated. It’s a characteristic to be governed.
The review workflow is the first layer of governance - nothing reaches production without human approval. But governance goes deeper than review gates. It includes auditability (every decision is in a file, every file is versioned), validation (the QA Validator checks agent outputs against structural and technical rules before a human ever sees them), and traceability (every test scenario traces to a requirement, every script traces to a scenario, every result traces to a script).
For QE leaders evaluating agentic approaches, this is the question that separates serious implementations from demos: how do you control a system that doesn’t produce identical outputs every time? The answer isn’t to make it deterministic - that would eliminate the flexibility that makes it useful. The answer is to build control structures around the non-determinism so that variability in execution never becomes variability in quality decisions.
The human decides. The system executes. The governance layer ensures that boundary holds even when the execution path varies.
What This Doesn’t Replace
It doesn’t replace QE judgment. The system is deliberately designed to keep humans making the hard calls - what to test, whether coverage is sufficient, whether a failure matters, whether the release is ready. AI is capable of attempting those decisions. It is not yet reliable enough to make them autonomously. That’s the Delegation Line.
It doesn’t work without expertise. The human lead needs to understand testing methodology, risk assessment, and the application under test. The system amplifies expertise. It doesn’t generate it. A junior engineer with this system will produce more than a junior engineer without it - but they won’t produce what a senior lead with this system produces.
It doesn’t eliminate maintenance entirely. Scripts still break when applications change. The difference is that impact analysis tells you which scripts will break before you run them, and the agents can regenerate code from approved scenarios rather than requiring manual fixes.
The Underlying Bet
Every QE organization is going to have to figure out how to work with AI agents. The question isn’t whether - it’s how. And the organizations that figure it out first will have a structural advantage that compounds over time.
The bet this system makes is that the right architecture separates thinking from execution, enforces human authority at the decision points, and lets AI handle the work where it’s most reliable and humans are most constrained. It’s not about replacing your team. It’s about changing what your team spends their time on.
The QE leads who are currently spending 60% of their week on script maintenance and regression triage? They could be spending that time on coverage strategy, risk modeling, and release quality - the work that actually determines whether your releases are safe.
That’s not a technology decision. That’s an operating model decision. And it’s one that gets harder to make the longer you wait, because the organizations that move first will have the methodology, the institutional knowledge, and the velocity advantage.
The Rollout Mental Model
You don’t deploy agentic AI testing teams the way you deploy a tool. You don’t install it on Monday and have it running your regression suite by Friday. These systems need to be developed and rolled out systematically, in the context of your specific applications, environments, team structures, and quality standards.
The mental model is closer to standing up a new team than installing new software. You start with one application. You configure the agents for that application’s tech stack, selector patterns, data requirements, and environment topology. You run the first design cycle. You review the scenarios - not as a formality, but because the agents are learning your application and you’re learning the agents. You iterate. You build trust through evidence.
Then you expand. A second application. A third. Each one goes through the same onboarding discipline, because each application has its own context that the agents need to understand. But the methodology is the same, the governance is the same, and the operating model is the same. What you learned on app one accelerates app two.
The broader roadmap follows the same pattern. Web UI E2E first - that’s the system described here. Then API testing, with agent teams built for that execution pattern. Then test data operations. Then integrations with your requirements and test management systems. Then non-functional testing disciplines, each with their own specialized agent teams. Each layer follows the Delegation Line. Each layer gets its own governance. Each layer earns trust independently.
This is not a big bang transformation. It’s a systematic expansion of a proven model, one capability at a time, with human oversight at every stage.
Where the Line Moves
The Delegation Line doesn’t have to be static. As AI reliability improves for specific, measurable tasks, you can move the line - carefully. Not all at once. Not based on capability demos. Based on measured reliability over time in your environment, with your applications, against your quality bar.
Today the line sits at: agents execute, humans decide. Tomorrow, some of those decisions might move below the line - but only when the data supports it. Auto-approving scenarios that match established patterns with 99%+ accuracy. Auto-triaging failures that have been categorized correctly 50 times in a row. Progressive trust, earned through evidence.
The architecture supports this because the review gates and validation checkpoints are configurable, not hardcoded. You don’t rebuild the system to move the line. You adjust the gates based on measured confidence.
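An adjustable gate might be as simple as a policy check over a measured track record. Names and thresholds here are illustrative, echoing the examples above:

```typescript
// Sketch of a configurable review gate driven by measured reliability.
interface GatePolicy {
  autoApproveThreshold: number; // e.g. 0.99 measured accuracy
  minStreak: number;            // e.g. 50 consecutive correct outcomes
}

interface TrackRecord {
  correct: number; // correct agent decisions over the measurement window
  total: number;   // total decisions in the window
  streak: number;  // current run of consecutive correct decisions
}

// The line moves only when measured reliability clears both bars;
// anything short of that stays above the Delegation Line with the human.
function requiresHumanReview(record: TrackRecord, policy: GatePolicy): boolean {
  const accuracy = record.total > 0 ? record.correct / record.total : 0;
  return accuracy < policy.autoApproveThreshold || record.streak < policy.minStreak;
}
```

Tightening or loosening the gate is a config change to the policy, not a rebuild of the system.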
But that’s tomorrow. Today, the system works as designed: you think, agents execute, Playwright runs. And that’s already a meaningful shift from where most QE organizations are operating.
This is the architecture behind the system I introduced in Meet Your AI-Powered QA Team. What I’ve described here is the first layer - web UI E2E - and the operating model that governs it. The layers that come next - API testing, test data operations, tool integrations, non-functional testing - follow the same architecture and the same Delegation Line. But they need to be built in context: your applications, your environments, your quality standards, your team.
If you’re a QE leader thinking about how to bring agentic AI into your testing organization - or if you’re running a team that’s hitting the scaling wall and wondering what the next operating model looks like - I’d like to hear from you.
