
Leadership Guide

How we structure, adopt, govern, and measure agentic operations.

Organizational Model

The Structural Shift

Traditional software teams organize in horizontal layers: frontend team, backend team, QA team, DevOps team. Work moves laterally through handoffs. This model breaks under agentic workflows because a single agent action can span all layers simultaneously — modifying frontend components, backend APIs, database schemas, and infrastructure configs in one task.

What works instead: vertical cross-functional pods where small teams own features end-to-end, with AI agents acting as connective tissue between layers.

Before (horizontal silos): Product, Frontend, Backend, QA, and DevOps teams, with handoff delays at every boundary. After (vertical pods): Pods A, B, and C, each staffed with a PM, Designer, Architect, Engineers, and an Agent, and each owning one feature end-to-end.
The Six Organizational Shifts

1. Workflows: Design for AI-first, not AI-assisted

Stop asking "where can an agent help in this process?" Start asking "if we built this process from scratch with agents, what would it look like?" Redesign the workflow before automating it.

2. Leadership: From directing execution to defining constraints

Leaders stop specifying how work gets done and start defining what "good" looks like, what boundaries exist, and what must not happen. PMs write constraints and quality bars, not step-by-step specifications.

3. Talent: From specialists to T-shaped integrators

Individual contributors need breadth across the stack because agents blur layer boundaries. Hire and develop for judgment across domains, not just depth in one.

4. Culture: Build continuous reinvention

Tools, models, and patterns change quarterly. Build a culture where workflows are versioned and revisited, rules files are living documents, and "the way we do things" is explicitly up for revision.

5. Structure: From functional teams to outcome-oriented pods

Reorganize around outcomes (features, services, customer journeys) rather than functions (frontend, backend, QA). Each pod includes product roles alongside engineering roles.

6. People systems: Measure impact, not output volume

Agent-assisted teams will produce more PRs, more designs, more docs, more tests. None of these are meaningful measures of contribution. Measure product outcomes, quality, and decision quality.

Senior leads oversee pods delivering complete product slices. Agents handle bounded implementation within each pod. Handoffs between teams are replaced by direct ownership.

Roles with Explicit Authority

Nine roles that must carry real authority, not secondary responsibilities stacked on existing jobs. The distinction matters: when evaluation quality is “someone’s side project,” it doesn’t get done until an incident forces it.

Engineering Roles

AI Architect

Owns: End-to-end orchestration and structural decisions.

Responsibilities:

  • Selects models and defines which model handles which task
  • Designs data flow from input to output
  • Decides orchestration pattern (single agent, multi-agent, workflow)
  • Defines failure modes and recovery paths
  • Makes the structural decisions the rest of the team builds on
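
These decisions are concrete enough to write down. Below is a minimal sketch of an Architect-owned routing table, assuming a Python codebase; the task types, model names, and recovery policies are illustrative assumptions, not a recommended stack:

```python
# Hypothetical routing table an AI Architect might own. Model names, task
# types, and recovery policies are illustrative assumptions, not a stack
# recommendation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str           # which model handles this task type
    pattern: str         # "single_agent", "multi_agent", or "workflow"
    on_failure: str      # "retry", "fallback_model", or "escalate_to_human"
    fallback: str | None = None

ROUTES: dict[str, Route] = {
    "code_generation": Route("large-coding-model", "single_agent",
                             "fallback_model", fallback="small-coding-model"),
    "test_generation": Route("small-coding-model", "workflow", "retry"),
    "refactor":        Route("large-coding-model", "workflow", "escalate_to_human"),
}

def route(task_type: str) -> Route:
    """One reviewed source of truth for structural decisions."""
    if task_type not in ROUTES:
        # Unrouted work is itself a failure mode: never guess, always escalate.
        raise ValueError(f"No route defined for task type {task_type!r}")
    return ROUTES[task_type]
```

The value is less the code than the single source of truth: when routing lives in one reviewed file, the Architect's structural decisions are inspectable rather than folklore.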

For the full transformation story, see Staff / Principal Engineer: Roles in the AI Era.

AI Reliability Engineer

Owns: Observability, cost measurement, and failure recovery. The SRE equivalent for AI systems.

Responsibilities:

  • Defines what to measure to know the system works
  • Monitors cost per execution and flags unsustainable patterns
  • Manages failure detection and recovery mechanisms
  • Owns the guardrail stack implementation and enforcement
  • Runs incident response for agent-related failures
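
As one hedged sketch of the cost responsibility, assuming per-run token counts are already collected; the prices and alert threshold below are placeholders, not real rates:

```python
# Illustrative cost-per-execution check, assuming token counts per run are
# already collected. Prices and the alert threshold are placeholder
# assumptions, not real rates.
PRICE_PER_1K_INPUT = 0.003   # USD per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (assumed)
COST_ALERT_USD = 0.50        # flag any single execution above this (assumed)

def execution_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def check_run(run_id: str, input_tokens: int, output_tokens: int) -> None:
    cost = execution_cost(input_tokens, output_tokens)
    if cost > COST_ALERT_USD:
        # In practice this pages someone or opens a ticket, not just prints.
        print(f"[cost-alert] run {run_id}: ${cost:.2f} > ${COST_ALERT_USD:.2f}")

check_run("run-42", input_tokens=150_000, output_tokens=8_000)  # alerts at $0.57
```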

For the full transformation story, see SRE / DevOps Engineer: Roles in the AI Era.

Evaluation Lead

Owns: Test coverage and evaluation strategy. Not unit tests for code — evaluation coverage for agent outputs.

Responsibilities:

  • Defines “how do we know this is good enough to ship?”
  • Designs eval suites for agent behavior (beyond standard test suites)
  • Sets passing thresholds and quality bars
  • Ensures evaluation runs before every ship decision
  • Tracks quality metrics over time to detect drift
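
A minimal sketch of what the ship gate could look like in code, assuming eval cases and a checker already exist; the 90% threshold is an illustrative bar, not a recommendation:

```python
# Minimal shape of a ship gate: run eval cases over agent output and compare
# the pass rate to a threshold the Evaluation Lead owns. The cases, checker,
# and 90% bar are illustrative assumptions.
from typing import Callable

PASS_THRESHOLD = 0.90  # assumed quality bar; in practice set per task type

def ship_gate(cases: list[dict], check: Callable[[dict], bool]) -> bool:
    if not cases:
        return False  # no evidence is not the same as passing
    pass_rate = sum(1 for case in cases if check(case)) / len(cases)
    print(f"eval pass rate: {pass_rate:.1%} (threshold {PASS_THRESHOLD:.0%})")
    return pass_rate >= PASS_THRESHOLD
```

Recording the pass rate on every run is also what makes drift detectable over time.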

The Evaluation Lead emerges from the traditional QA role splitting in two. For the full transformation story, see QA Engineer / SDET: Roles in the AI Era.

Product Engineer

Owns: Feature velocity and integration.

Responsibilities:

  • Runs the agent execution loop for scoped delivery tasks
  • Creates and maintains task templates and agent instructions
  • Integrates agent-generated output into the product
  • Ensures agent output meets product requirements and UX standards
  • Manages the Plan-Execute-Verify-Ship-Learn loop (see Practitioner Guide)
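
One way to make task templates concrete is sketched below; the fields are assumptions about what a reusable template might capture, mapped onto the Plan-Execute-Verify-Ship-Learn loop:

```python
# Illustrative task template mapped onto the Plan-Execute-Verify-Ship-Learn
# loop. Field names are assumptions about what a reusable template captures.
from dataclasses import dataclass, field

@dataclass
class TaskTemplate:
    name: str                  # e.g. "add-api-endpoint"
    scope: str                 # Plan: what the agent may and may not touch
    instructions: str          # Execute: the reusable prompt or spec
    verify: list[str]          # Verify: checks that must pass before review
    ship_requires: list[str]   # Ship: human gates before merge
    learnings: list[str] = field(default_factory=list)  # Learn: codified insights

endpoint_task = TaskTemplate(
    name="add-api-endpoint",
    scope="Only files under api/ and tests/; no schema changes.",
    instructions="Implement the endpoint per the ticket's acceptance criteria.",
    verify=["pytest tests/", "lint", "eval-suite:api"],
    ship_requires=["senior review", "PM acceptance check"],
)
endpoint_task.learnings.append("Spell out the error-format convention up front.")
```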

For the full transformation story, see Software Engineer: Roles in the AI Era.

Platform Engineer

Owns: Infrastructure, model hosting, and inference serving. Optional early, required at scale.

Responsibilities:

  • Manages compute infrastructure for agent execution
  • Optimizes cost-efficiency at the infrastructure layer
  • Handles latency and reliability of model inference
  • Implements the governance layer (registry, access control, observability)
  • Manages secrets, API keys, and secure agent-to-system connectivity

For the full transformation story, see Platform Engineer: Roles in the AI Era.

Engineering Manager

Owns: Team capability, adoption equity, and outcome-based measurement. The person who executes the Six Organizational Shifts day-to-day.

Responsibilities:

  • Assesses and advances the team’s maturity level using the Maturity Model (Levels 1-5)
  • Ensures adoption equity: the team’s effective maturity level is set by its least-adopted member in a critical-path role
  • Defines and enforces the team’s operating rhythm around the Plan-Execute-Verify-Ship-Learn loop
  • Restructures team workflows from horizontal silos to vertical cross-functional pods
  • Establishes KPIs that measure product impact, not output volume
  • Coaches engineers on judgment, review quality, and context engineering
  • Manages the human side of AI adoption: resistance, identity shifts, and role redefinition

For the full transformation story, see Engineering Manager: Roles in the AI Era.

Product Roles

Product Manager

Owns: Problem definition, acceptance criteria, and product quality.

Responsibilities:

  • Defines what to build and why (the “Plan” phase of the operating loop)
  • Writes acceptance criteria that agents can execute against
  • Reviews agent output for product correctness (does it solve the user’s problem?)
  • Defines constraints and quality bars instead of writing detailed specifications
  • Tracks product outcome metrics alongside delivery metrics
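
What "acceptance criteria that agents can execute against" can look like is sketched below. The given/when/then structure and the eval-suite references are assumptions; the point is that every criterion ends in something checkable, not narrative prose:

```python
# Illustrative: acceptance criteria as checkable statements rather than prose.
# The given/when/then fields and eval-suite references are assumptions.
CRITERIA = [
    {
        "id": "AC-1",
        "given": "a signed-in user with items in the cart",
        "when": "they open the checkout page",
        "then": "the order total matches the cart total",
        "check": "eval-suite:checkout-total",  # points at an executable check
    },
    {
        "id": "AC-2",
        "given": "an empty cart",
        "when": "the user opens the checkout page",
        "then": "the checkout button is disabled",
        "check": "eval-suite:checkout-empty-cart",
    },
]
```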

For the full transformation story, see Product Manager: Roles in the AI Era.

Product Designer

Owns: UX quality, design system, and interaction patterns.

Responsibilities:

  • Maintains the design system that agents generate from (tokens, components, patterns)
  • Reviews agent-generated UI for UX quality and design consistency
  • Defines design tokens and component specifications as agent instructions
  • Focuses on design governance and quality auditing rather than pixel-level execution
  • Addresses the emerging discipline sometimes called Agent Experience (AX): designing for both human and agent actors

For the full transformation story, see Product Designer: Roles in the AI Era.

QA Engineer

Owns: Product-level quality from the user’s perspective. Distinct from the Evaluation Lead.

Responsibilities:

  • Translates acceptance criteria into testable assertions
  • Builds evaluation suites that validate product behavior, not just code correctness
  • Monitors quality drift from a user-facing perspective (UX regressions, accessibility, copy errors)
  • Works with the Evaluation Lead on comprehensive quality coverage
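
Continuing the acceptance-criteria sketch above, "translates acceptance criteria into testable assertions" might look like the Playwright-style test below; the route and selector are hypothetical stand-ins:

```python
# Hypothetical translation of criterion AC-2 above into a user-facing test.
# Playwright-style API via pytest-playwright; route and selector are stand-ins.
def test_checkout_button_disabled_for_empty_cart(page):
    page.goto("http://localhost:3000/checkout")  # assumed app URL, empty-cart session
    assert page.locator("button#checkout").is_disabled(), \
        "AC-2: an empty cart must not allow checkout"
```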

Boundary with Evaluation Lead: The Evaluation Lead owns agent output correctness (did the agent follow instructions? is the code well-structured? does it pass technical evaluation?). The QA Engineer owns product correctness (does the shipped feature meet the user’s need? does it work as specified? does it meet accessibility and UX standards?). One judges the agent. The other judges the product.

For the full transformation story — how each of these roles evolved from traditional job descriptions and what to screen for when hiring — see Roles in the AI Era.

Scaling Path

Start (7-8 people):

  • 1 AI Architect (leads)
  • 2-3 Product Engineers
  • 1 AI Reliability Engineer (may share duties with the Architect early on)
  • 1 Engineering Manager
  • 1 Product Manager (may be part-time)
  • 1 Product Designer (may be shared across pods early on)

Scale (12-16 people):

  • 1 AI Architect
  • 3-4 Product Engineers (some specializing in different surfaces)
  • 1-2 AI Reliability Engineers
  • 1-2 Evaluation Leads
  • 1 Platform Engineer
  • 1-2 Engineering Managers (one per pod at scale)
  • 1 Product Manager
  • 1-2 Product Designers
  • 1 QA Engineer

The key: roles exist with authority, not as hats stacked on other jobs. This applies equally to product roles. A PM who owns acceptance criteria must have the authority to reject agent output that doesn’t meet the bar.

Decision Rights Matrix

Ownership clarity alone isn’t enough. You also need clear decision rights: every critical decision has a single owner, not a consensus process.

| Decision | Owner | Consulted | Rationale |
| --- | --- | --- | --- |
| Model selection | AI Architect + Evaluation Lead | Product Engineer | Technical fit and eval data must align |
| Orchestration pattern | AI Architect (single owner) | Team | Architecture cascades into everything; needs one voice |
| Cost control | AI Reliability Engineer | AI Architect | Token spend, compute budgets, cost alerts |
| Eval thresholds (“can we ship?”) | Evaluation Lead | Product, Architect | Must be decided before emotional attachment to shipping |
| Feature prioritization | Product Manager + AI Architect | Team | Architect says what’s feasible; PM decides what matters |
| What to build (problem selection) | Product Manager | Architect, Team | PM owns problem definition; engineering owns solution |
| Acceptance criteria | Product Manager | Engineering, Design | Criteria must be agent-executable; PM defines, engineering validates feasibility |
| UX quality standards | Product Designer | PM, Engineering | Design system compliance, accessibility, interaction quality |
| Design system changes | Product Designer | PM, AI Architect | Components and tokens agents generate from |
| Product-level quality thresholds | QA Engineer | PM, Evaluation Lead | User-facing quality distinct from agent output correctness |
| Architecture decisions | AI Architect | Team | No agent makes architecture decisions |
| Security decisions | AI Architect + Reliability Eng | Team | Humans only; never delegated to agents |
| Release decisions | Product Manager + AI Architect | Reliability Eng | Human judgment on production readiness |

The most common mistake: involving too many people in every decision in the hope that consensus catches problems. The result is slow, uncertain decision-making where nobody feels accountable when things go wrong.


Maturity Model

The full five-level maturity model (dimension tables, assessment criteria, prerequisites, and failure modes for each level) lives in the Practitioner Guide. Both guides share this model so practitioners and leaders use a common vocabulary.

Level 1: Assisted

AI provides suggestions that developers accept, modify, or reject; the developer drives all decisions and execution. The shorthand: "We use Copilot for suggestions." See the full maturity model, with dimensions, assessment criteria, and failure modes per level, in the Practitioner Guide.

Critical reminder: A team’s effective level is set by its least-adopted member in a critical-path role — see the Maturity Model in the Practitioner Guide for the full rationale.

Use the maturity level to calibrate the Adoption Roadmap phases and to set KPI targets in the Measurement section below.


Adoption Roadmap

A four-phase rollout over 180 days. Each phase has defined goals, activities, exit criteria, and risk mitigations. The principle: staged adoption gives you speed without blind trust.

Phase 1 goal: Establish baseline metrics, validate feasibility, and build team confidence in a controlled environment.
Activities
  • Select one repository with moderate complexity
  • Define 3 repeatable task types (e.g., API endpoint, test generation, refactor)
  • Measure baseline metrics: PR cycle time, change failure rate, test coverage, bug rate
  • Set up basic guardrails: scope definition, CI as gate, senior review on all agent PRs
  • Every team member runs at least one agent-assisted task
  • Document what works, what fails, and what surprises

Product involvement: The PM participates in defining task types, the Designer reviews agent-generated UI, and baseline product metrics are recorded.

Exit Criteria
  • Baseline metrics recorded for comparison
  • At least 10 agent-assisted tasks completed and reviewed
  • No critical quality incidents from agent output
  • Team can articulate which tasks agents handle well and which they don't
  • Basic rules file created and shared across the team
Risk Mitigations
  • Senior review on 100% of agent PRs — no exceptions in Phase 1
  • Start with low-risk, well-bounded tasks only
  • If a quality incident occurs, pause and retrospect before continuing

Measurement and Failure Modes

KPI Dashboard

Track these metrics weekly. The goal isn’t to maximize agent usage. It’s to deliver faster without quality regression.

| Metric | Target | Definition |
| --- | --- | --- |
| Lead time | Decrease | Time from issue opened to code merged |
| PR review time | Decrease | Time from PR opened to approved |
| Change failure rate | Stable or decrease | % of deployments causing incidents or rollbacks |
| Rollback frequency | Stable or decrease | Number of rollbacks per deployment period |
| Escaped defects | Decrease | Bugs found in production per sprint |
| Test coverage delta | Increase | Change in test coverage over time |
| Deployment frequency | Increase | How often the team deploys to production |
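
As a small sketch, two of these metrics computed from deployment records; the record shape is an assumption, and in practice the data comes from your CI/CD and incident tooling:

```python
# Illustrative weekly computation of two dashboard metrics from deployment
# records. The record shape is an assumption; real data comes from CI/CD and
# incident tooling.
deploys = [
    {"id": "d1", "caused_incident": False, "rolled_back": False},
    {"id": "d2", "caused_incident": True,  "rolled_back": True},
    {"id": "d3", "caused_incident": False, "rolled_back": False},
]

change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)
rollback_frequency = sum(d["rolled_back"] for d in deploys) / len(deploys)
print(f"change failure rate: {change_failure_rate:.0%}")  # 33%
print(f"rollback frequency:  {rollback_frequency:.0%}")   # 33%
```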

The Success Formula
Lead time ↓ AND Deployment frequency ↑ AND Change failure rate stable or ↓ AND Escaped defects ↓. All four must hold together; speed bought with a quality regression fails the formula.

Six Failure Modes

Each failure mode includes the symptom (how you detect it), the root cause (why it happens), and the mitigation (how to fix it).

1. Busywork automation

Symptom: Tasks selected for agents are easy-to-automate busywork rather than genuine bottlenecks. The team optimizes for agent-friendly tasks rather than high-impact ones.

Mitigations:

  • Tie every agent workflow to a measurable delivery KPI
  • Require a "so what?" test: if automated, what bottleneck does it remove?
  • Review task selection criteria quarterly

2. Review becomes the bottleneck

Symptom: Agent output velocity exceeds the team's review capacity, often caused by large, unfocused agent PRs.

Mitigations:

  • Enforce smaller PR scope (one concern per PR, bounded by task template)
  • Tighten acceptance criteria so PRs are more focused
  • Scale review capacity: train more team members
  • Implement risk-based review: low-risk PRs get sampling-based review
  • Consider review automation for mechanical aspects
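
The risk-based review item above could be as simple as the sketch below; the path list, size cutoff, and 20% sampling rate are assumptions to tune against your own incident data:

```python
# Illustrative risk-based review router: high-risk agent PRs always get a human
# reviewer; low-risk PRs are sampled. The path list, size cutoff, and 20% rate
# are assumptions to tune against your own incident data.
import random

HIGH_RISK_PATHS = ("migrations/", "auth/", "payments/")  # assumed hot spots
SIZE_CUTOFF = 300   # lines changed above which a PR is never sampled (assumed)
SAMPLE_RATE = 0.20  # fraction of low-risk PRs still fully reviewed (assumed)

def needs_human_review(changed_files: list[str], lines_changed: int) -> bool:
    high_risk = lines_changed > SIZE_CUTOFF or any(
        path.startswith(HIGH_RISK_PATHS) for path in changed_files
    )
    return high_risk or random.random() < SAMPLE_RATE
```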

3. Incomplete verification

Symptom: Verification gates are incomplete. Agent-generated code passes CI but introduces subtle issues not covered by tests.

Mitigations:

  • Expand evaluation coverage beyond unit tests (integration tests, performance benchmarks, architecture fitness functions)
  • Track post-release defect rate specifically for agent-generated code
  • Implement regular "agent output audits"
  • Monitor change failure rate as an early warning signal

4. Knowledge silos

Symptom: Effective agent interaction patterns are not captured and shared; knowledge stays in individual heads.

Mitigations:

  • Convert individual prompts into shared task templates
  • Maintain team-level rules files (not personal ones)
  • Publish an internal "agentic SOP" with examples
  • Pair programming sessions where skilled users demonstrate approach
  • Make the Learn phase mandatory: every insight gets codified

5. Missing governance

Symptom: Governance infrastructure (Layer 5) was not built before scaling; the organization skipped standardization.

Mitigations:

  • Implement governance layer before cross-team scaling
  • Start with minimum viable governance: agent registry + cost tracking
  • Add access control and audit trails as usage grows
  • Assign a governance owner (AI Reliability or Platform Engineer)
  • Review governance completeness quarterly
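
"Minimum viable governance" can start smaller than it sounds. A sketch of a registry plus cost attribution, with all field names assumed:

```python
# Illustrative minimum viable governance: an agent registry plus cost
# attribution. All field names are assumptions; the point is that every agent
# has an accountable owner, a declared scope, and visible spend.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRecord:
    agent_id: str
    owner: str   # an accountable human, not a team alias
    scope: str   # what the agent is allowed to touch

REGISTRY: dict[str, AgentRecord] = {}
SPEND: dict[str, float] = defaultdict(float)  # USD, attributed per agent

def register(record: AgentRecord) -> None:
    REGISTRY[record.agent_id] = record

def record_cost(agent_id: str, usd: float) -> None:
    if agent_id not in REGISTRY:
        raise PermissionError(f"Unregistered agent {agent_id!r} may not run")
    SPEND[agent_id] += usd

register(AgentRecord("pr-bot-1", owner="alice", scope="repo:web, PRs only"))
record_cost("pr-bot-1", 0.42)
```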

6. Faster at building the wrong things

Symptom: Agent adoption accelerated delivery without improving problem selection; the team is building the wrong things faster.

Mitigations:

  • Tie agent task selection to product outcome metrics
  • Require PM sign-off on every task plan
  • Measure feature adoption and user satisfaction alongside delivery speed
  • Apply "redesign, don't automate" to product discovery, not just delivery
