
The role that splits in two — one judges the agent, the other judges the product

Evolved from: QA Engineer, SDET, QA Lead, Test Engineer
HELM role: QA Engineer + Evaluation Lead

The Shift

This role does not merely evolve — it divides. Traditional QA held a single mandate: own quality end to end. That meant test plans and cases, automation suites, regression cycles, bug triage, and release sign-off. One craft, one lens: does the software behave as specified before it reaches users?

In the AI era, two quality surfaces sit on top of each other. First, agent output quality: whether the system that produces work — code, copy, tests, designs — is instruction-faithful, structurally sound, and fit to enter the product pipeline. Second, product quality: whether what ships actually serves the user — correct behavior against intent, accessible interfaces, coherent experience. The first asks whether the machine did its job well enough to merit human and product scrutiny. The second asks whether the result deserves the user’s trust.

Those questions demand different tools, mental models, and accountability. HELM makes the line explicit in the Leadership Guide: the Evaluation Lead owns agent output correctness; the QA Engineer owns product correctness. One judges the agent; the other judges the product. Splitting the role is not optional nuance — it is how serious teams avoid collapsing two problems into one overloaded job description.

What the Traditional Job Description Looked Like

Typical postings asked for experience writing test plans and test cases, proficiency in automation frameworks (Selenium, Cypress, Jest), CI/CD integration for tests, bug tracking and regression discipline, and fluency in SDLC and QA methodology. “Quality” lived in a single column on the org chart. Technical correctness and user-facing quality were assumed to be the same problem, solvable by the same person with the same playbook.

The Transformed Role

Core Mission

Own quality across two surfaces — agent output correctness (Evaluation Lead) and product correctness from the user's perspective (QA Engineer) — ensuring both are covered without gaps or duplication.

Key Responsibilities

  • Evaluation Lead: Design evaluation suites for agent behavior that go beyond conventional unit and integration tests
  • Evaluation Lead: Define passing thresholds and quality bars per agent workflow so "green" means something operational
  • Evaluation Lead: Track quality metrics over time to catch drift early — before it shows up as customer pain or rework
  • Evaluation Lead: Enforce evaluation as a precondition to ship: no green eval, no merge
  • Evaluation Lead: Build automated evaluation that can validate agent output without defaulting to human-in-the-loop
  • Evaluation Lead: Partner with the AI Architect on what "correct" means per task type
  • QA Engineer: Translate acceptance criteria into testable assertions that reflect real user outcomes
  • QA Engineer: Build or curate suites that stress UX regressions, accessibility, copy, and interaction quality
  • QA Engineer: Monitor product-side drift: issues that clear agent evaluation but still violate user expectations
  • QA Engineer: Coordinate with the Evaluation Lead so coverage spans both agent output and end-to-end product behavior
  • QA Engineer: Review agent-generated UI for design-system fit, accessibility, and interaction quality
  • QA Engineer: Own "does it work for the user?" at every Verify phase of the operating loop
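The Evaluation Lead's "no green eval, no merge" rule can be sketched in a few lines. This is a minimal illustration, assuming per-workflow pass-rate thresholds; the names (`THRESHOLDS`, `eval_gate`) are hypothetical, not defined by HELM:

```python
def pass_rate(results):
    """Fraction of eval cases that passed."""
    return sum(results) / len(results) if results else 0.0

# Per-workflow passing bars, so "green" means something operational.
# These numbers are illustrative assumptions, not HELM prescriptions.
THRESHOLDS = {"codegen": 0.95, "copywriting": 0.90}

def eval_gate(workflow, results):
    """Return (ok, rate): ok is True only if the suite clears the bar."""
    rate = pass_rate(results)
    return rate >= THRESHOLDS[workflow], rate

# 19 of 20 cases pass: clears the 0.95 bar for codegen.
ok, rate = eval_gate("codegen", [True] * 19 + [False])
```

In a CI pipeline, `ok` would gate the merge step; the point is that the threshold is explicit and per-workflow, not a single global green light.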

Required Competencies

  • Evaluation design — Defining eval suites for non-deterministic or open-ended output — where "correct" is graded, not always unique.
  • Drift detection — Spotting gradual degradation that no single PR or green build exposes.
  • Product judgment — Assessing experience quality beyond functional pass/fail.
  • Statistical thinking — Setting thresholds, confidence, and sampling when binary gates mislead.
  • Automation at scale — Infrastructure that keeps up with high-volume agent output without drowning the team in manual review.
  • Cross-functional communication — Turning quality signals into concrete, prioritized feedback for engineering and product.
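Drift of the kind named above never shows up in a single PR or green build. One simple detection approach, offered here as an assumption rather than a HELM-prescribed method, compares a recent window of daily pass rates against the preceding baseline:

```python
def detect_drift(daily_pass_rates, window=7, tolerance=0.03):
    """Flag drift when the mean of the most recent `window` days falls
    more than `tolerance` below the mean of all preceding days.
    `window` and `tolerance` are illustrative defaults."""
    if len(daily_pass_rates) < 2 * window:
        return False  # not enough history to judge
    recent = daily_pass_rates[-window:]
    baseline = daily_pass_rates[:-window]
    baseline_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return baseline_mean - recent_mean > tolerance
```

A real system would likely smooth the series and segment by workflow, but even this shape catches the failure mode: every individual day looks acceptable while the trend erodes.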

What We No Longer Screen For

  • Manual test execution as the primary value proposition
  • Depth in a single framework without judgment about what to automate and why
  • Quality defined only as absence of defects, ignoring intent and experience
  • Assumptions that all code is human-written, reviewed at human cadence, and stable between releases
  • QA as a final gate after development, disconnected from the continuous Build-Verify rhythm

How We Interview

  • Evaluation design — An agent generates API endpoints. Design an evaluation suite that decides whether the output is production-ready. What do you measure beyond tests passing?
  • Drift detection — CI is green on agent PRs, but customer bug reports are up ~15% month over month. How do you investigate?
  • Product quality — An agent-built checkout passes all functional tests. What do you still verify? (Probe UX, accessibility, copy, edge cases, trust.)
  • Threshold setting — For a task with no single right answer, how do you define "good enough"? Walk through your framework.
  • Process design — For a team at Maturity Level 3, design the quality workflow. Where does evaluation run? Where does product QA run? How do they hand off and escalate?
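For the threshold-setting question, one defensible framework (sketched here as an illustration, not as the expected answer) grades sampled outputs against a rubric and calls the work "good enough" only when a confidence-adjusted pass proportion clears the bar, so small samples cannot fake quality:

```python
import math

def wilson_lower(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a pass proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def good_enough(scores, bar=0.7, min_confidence=0.8):
    """"Good enough" for open-ended output: be statistically confident
    that at least `min_confidence` of graded samples clear the rubric
    `bar`, rather than demanding every sample be perfect.
    `bar` and `min_confidence` are hypothetical defaults."""
    passes = sum(s >= bar for s in scores)
    return wilson_lower(passes, len(scores)) >= min_confidence
```

The design choice worth probing in the interview: the same observed pass rate should pass with 100 samples and fail with 10, because a binary gate on a tiny sample misleads.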

Day in the Life

Evaluation Lead: Morning starts with overnight metrics — pass rates, score distributions, and workflow-level regressions in agent output. You flag a pattern: coverage metrics met the bar, but a class of edge cases never appears in eval data; you tighten scenarios and thresholds with the AI Architect. Before lunch you block a merge where eval green-lighted structurally valid code that violates architectural rules for that service — structurally valid agent output is not automatically product-ready.

QA Engineer: In the afternoon you walk a new feature, built largely by agents, through acceptance criteria and accessibility checks. Functional tests pass; you still catch a flow that obeys the design system but feels wrong — order of steps, unclear error states, keyboard traps. You file crisp, user-centered issues and update assertions so the next cycle catches that class of failure.

Both: You end the day aligned on coverage — where eval stopped and product QA picked up, what slipped through both, and how today's incidents refine tomorrow's criteria. The theme is constant: quality is now two disciplines, each deep enough to stand alone, interlocked enough to protect the whole system.

Connection to HELM

This page maps directly to QA Engineer and Evaluation Lead in the HELM Leadership Guide, including the explicit split of ownership: agent output correctness versus product correctness. In the Practitioner Guide, that split lands in Layer 2 (Quality Guardrails): evaluations and product checks are guardrails, not optional polish. Operationally, both roles anchor the Verify phase of the HELM operating loop — evaluation before merge and integration, product QA before release confidence.

Failure Mode 3: Silent Quality Drift is the shared enemy: metrics look fine while experience and agent behavior erode. The KPI Dashboard's Quality section should reflect both surfaces — agent-eval health and product-quality signals — so leadership sees drift before it becomes a narrative crisis. HELM treats quality as infrastructure for human-first execution; splitting QA into Evaluation Lead and QA Engineer is how that infrastructure stays honest in the AI era.