
Leadership Guide

How we structure, adopt, govern, and measure agentic operations.

Organizational Model

The Structural Shift

Traditional software teams organize in horizontal layers: frontend team, backend team, QA team, DevOps team. Work moves laterally through handoffs. This model breaks under agentic workflows because a single agent action can span all layers simultaneously — modifying frontend components, backend APIs, database schemas, and infrastructure configs in one task.

What works instead: vertical cross-functional pods where small teams own features end-to-end, with AI agents acting as connective tissue between layers.

Before (horizontal silos): Product, Frontend, Backend, QA, and DevOps teams, with handoff delays at every boundary. After (vertical pods): Pods A, B, and C, each staffed with a PM, Designer, Architect, Engineers, and an Agent, and each owning one feature end-to-end.
The Six Organizational Shifts

1. Workflows: Design for AI-first, not AI-assisted

Stop asking "where can an agent help in this process?" Start asking "if we built this process from scratch with agents, what would it look like?" Redesign the workflow before automating it.

2. Leadership: From directing execution to defining constraints

Leaders stop specifying how work gets done and start defining what "good" looks like, what boundaries exist, and what must not happen. PMs write constraints and quality bars, not step-by-step specifications.

3. Talent: From specialists to T-shaped integrators

Individual contributors need breadth across the stack because agents blur layer boundaries. Hire and develop for judgment across domains, not just depth in one.

4. Culture: Build continuous reinvention

Tools, models, and patterns change quarterly. Build a culture where workflows are versioned and revisited, rules files are living documents, and "the way we do things" is explicitly up for revision.

5. Structure: From functional teams to outcome-oriented pods

Reorganize around outcomes (features, services, customer journeys) rather than functions (frontend, backend, QA). Each pod includes product roles alongside engineering roles.

6. People systems: Measure impact, not output volume

Agent-assisted teams will produce more PRs, more designs, more docs, more tests. None of these are meaningful measures of contribution. Measure product outcomes, quality, and decision quality.

Senior leads oversee pods delivering complete product slices. Agents handle bounded implementation within each pod. Handoffs between teams are replaced by direct ownership.

Roles with Explicit Authority

Nine roles that must carry real authority, not secondary responsibilities stacked on existing jobs. The distinction matters: when evaluation quality is “someone’s side project,” it doesn’t get done until an incident forces it.

Engineering Roles

AI Architect

Owns: End-to-end orchestration and structural decisions.

Responsibilities:

  • Selects models and defines which model handles which task
  • Designs data flow from input to output
  • Decides orchestration pattern (single agent, multi-agent, workflow)
  • Defines failure modes and recovery paths
  • Makes the structural decisions the rest of the team builds on
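
These decisions are concrete enough to write down. Below is a minimal sketch of an Architect-owned routing table, assuming a Python codebase; the task types, model names, and recovery policies are illustrative assumptions, not a recommended stack:

```python
# Hypothetical routing table an AI Architect might own. Model names, task
# types, and recovery policies are illustrative assumptions, not a stack
# recommendation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str           # which model handles this task type
    pattern: str         # "single_agent", "multi_agent", or "workflow"
    on_failure: str      # "retry", "fallback_model", or "escalate_to_human"
    fallback: str | None = None

ROUTES: dict[str, Route] = {
    "code_generation": Route("large-coding-model", "single_agent",
                             "fallback_model", fallback="small-coding-model"),
    "test_generation": Route("small-coding-model", "workflow", "retry"),
    "refactor":        Route("large-coding-model", "workflow", "escalate_to_human"),
}

def route(task_type: str) -> Route:
    """One reviewed source of truth for structural decisions."""
    if task_type not in ROUTES:
        # Unrouted work is itself a failure mode: never guess, always escalate.
        raise ValueError(f"No route defined for task type {task_type!r}")
    return ROUTES[task_type]
```

The value is less the code than the single source of truth: when routing lives in one reviewed file, the Architect's structural decisions are inspectable rather than folklore.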

For the full transformation story, see Staff / Principal Engineer: Roles in the AI Era.

AI Reliability Engineer

Owns: Observability, cost measurement, and failure recovery. The SRE equivalent for AI systems.

Responsibilities:

  • Defines what to measure to know the system works
  • Monitors cost per execution and flags unsustainable patterns
  • Manages failure detection and recovery mechanisms
  • Owns the guardrail stack implementation and enforcement
  • Runs incident response for agent-related failures
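
As one hedged sketch of the cost responsibility, assuming per-run token counts are already collected; the prices and alert threshold below are placeholders, not real rates:

```python
# Illustrative cost-per-execution check, assuming token counts per run are
# already collected. Prices and the alert threshold are placeholder
# assumptions, not real rates.
PRICE_PER_1K_INPUT = 0.003   # USD per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (assumed)
COST_ALERT_USD = 0.50        # flag any single execution above this (assumed)

def execution_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def check_run(run_id: str, input_tokens: int, output_tokens: int) -> None:
    cost = execution_cost(input_tokens, output_tokens)
    if cost > COST_ALERT_USD:
        # In practice this pages someone or opens a ticket, not just prints.
        print(f"[cost-alert] run {run_id}: ${cost:.2f} > ${COST_ALERT_USD:.2f}")

check_run("run-42", input_tokens=150_000, output_tokens=8_000)  # alerts at $0.57
```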

For the full transformation story, see SRE / DevOps Engineer: Roles in the AI Era.

Evaluation Lead

Owns: Test coverage and evaluation strategy. Not unit tests for code — evaluation coverage for agent outputs.

Responsibilities:

  • Defines “how do we know this is good enough to ship?”
  • Designs eval suites for agent behavior (beyond standard test suites)
  • Sets passing thresholds and quality bars
  • Ensures evaluation runs before every ship decision
  • Tracks quality metrics over time to detect drift
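
A minimal sketch of what the ship gate could look like in code, assuming eval cases and a checker already exist; the 90% threshold is an illustrative bar, not a recommendation:

```python
# Minimal shape of a ship gate: run eval cases over agent output and compare
# the pass rate to a threshold the Evaluation Lead owns. The cases, checker,
# and 90% bar are illustrative assumptions.
from typing import Callable

PASS_THRESHOLD = 0.90  # assumed quality bar; in practice set per task type

def ship_gate(cases: list[dict], check: Callable[[dict], bool]) -> bool:
    if not cases:
        return False  # no evidence is not the same as passing
    pass_rate = sum(1 for case in cases if check(case)) / len(cases)
    print(f"eval pass rate: {pass_rate:.1%} (threshold {PASS_THRESHOLD:.0%})")
    return pass_rate >= PASS_THRESHOLD
```

Recording the pass rate on every run is also what makes drift detectable over time.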

The Evaluation Lead emerges from the traditional QA role splitting in two. For the full transformation story, see QA Engineer / SDET: Roles in the AI Era.

Product Engineer

Owns: Feature velocity and integration.

Responsibilities:

  • Runs the agent execution loop for scoped delivery tasks
  • Creates and maintains task templates and agent instructions
  • Integrates agent-generated output into the product
  • Ensures agent output meets product requirements and UX standards
  • Manages the Plan-Execute-Verify-Ship-Learn loop (see Practitioner Guide)
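
One way to make task templates concrete is sketched below; the fields are assumptions about what a reusable template might capture, mapped onto the Plan-Execute-Verify-Ship-Learn loop:

```python
# Illustrative task template mapped onto the Plan-Execute-Verify-Ship-Learn
# loop. Field names are assumptions about what a reusable template captures.
from dataclasses import dataclass, field

@dataclass
class TaskTemplate:
    name: str                  # e.g. "add-api-endpoint"
    scope: str                 # Plan: what the agent may and may not touch
    instructions: str          # Execute: the reusable prompt or spec
    verify: list[str]          # Verify: checks that must pass before review
    ship_requires: list[str]   # Ship: human gates before merge
    learnings: list[str] = field(default_factory=list)  # Learn: codified insights

endpoint_task = TaskTemplate(
    name="add-api-endpoint",
    scope="Only files under api/ and tests/; no schema changes.",
    instructions="Implement the endpoint per the ticket's acceptance criteria.",
    verify=["pytest tests/", "lint", "eval-suite:api"],
    ship_requires=["senior review", "PM acceptance check"],
)
endpoint_task.learnings.append("Spell out the error-format convention up front.")
```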

For the full transformation story, see Software Engineer: Roles in the AI Era.

Platform Engineer

Owns: Infrastructure, model hosting, and inference serving. Optional early, required at scale.

Responsibilities:

  • Manages compute infrastructure for agent execution
  • Optimizes cost-efficiency at the infrastructure layer
  • Handles latency and reliability of model inference
  • Implements the governance layer (registry, access control, observability)
  • Manages secrets, API keys, and secure agent-to-system connectivity

For the full transformation story, see Platform Engineer: Roles in the AI Era.

Engineering Manager

Owns: Team capability, adoption equity, and outcome-based measurement. The person who executes the Six Organizational Shifts day-to-day.

Responsibilities:

  • Assesses and advances the team’s maturity level using the Maturity Model (Levels 1-5)
  • Ensures adoption equity: the team’s effective maturity level is set by its least-adopted member in a critical-path role
  • Defines and enforces the team’s operating rhythm around the Plan-Execute-Verify-Ship-Learn loop
  • Restructures team workflows from horizontal silos to vertical cross-functional pods
  • Establishes KPIs that measure product impact, not output volume
  • Coaches engineers on judgment, review quality, and context engineering
  • Manages the human side of AI adoption: resistance, identity shifts, and role redefinition

For the full transformation story, see Engineering Manager: Roles in the AI Era.

Product Roles

Product Manager

Owns: Problem definition, acceptance criteria, and product quality.

Responsibilities:

  • Defines what to build and why (the “Plan” phase of the operating loop)
  • Writes acceptance criteria that agents can execute against
  • Reviews agent output for product correctness (does it solve the user’s problem?)
  • Defines constraints and quality bars instead of writing detailed specifications
  • Tracks product outcome metrics alongside delivery metrics
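
What "acceptance criteria that agents can execute against" can look like is sketched below. The given/when/then structure and the eval-suite references are assumptions; the point is that every criterion ends in something checkable, not narrative prose:

```python
# Illustrative: acceptance criteria as checkable statements rather than prose.
# The given/when/then fields and eval-suite references are assumptions.
CRITERIA = [
    {
        "id": "AC-1",
        "given": "a signed-in user with items in the cart",
        "when": "they open the checkout page",
        "then": "the order total matches the cart total",
        "check": "eval-suite:checkout-total",  # points at an executable check
    },
    {
        "id": "AC-2",
        "given": "an empty cart",
        "when": "the user opens the checkout page",
        "then": "the checkout button is disabled",
        "check": "eval-suite:checkout-empty-cart",
    },
]
```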

For the full transformation story, see Product Manager: Roles in the AI Era.

Product Designer

Owns: UX quality, design system, and interaction patterns.

Responsibilities:

  • Maintains the design system that agents generate from (tokens, components, patterns)
  • Reviews agent-generated UI for UX quality and design consistency
  • Defines design tokens and component specifications as agent instructions
  • Focuses on design governance and quality auditing rather than pixel-level execution
  • Addresses the emerging discipline sometimes called Agent Experience (AX): designing for both human and agent actors

For the full transformation story, see Product Designer: Roles in the AI Era.

QA Engineer

Owns: Product-level quality from the user’s perspective. Distinct from the Evaluation Lead.

Responsibilities:

  • Translates acceptance criteria into testable assertions
  • Builds evaluation suites that validate product behavior, not just code correctness
  • Monitors quality drift from a user-facing perspective (UX regressions, accessibility, copy errors)
  • Works with the Evaluation Lead on comprehensive quality coverage
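
Continuing the acceptance-criteria sketch above, "translates acceptance criteria into testable assertions" might look like the Playwright-style test below; the route and selector are hypothetical stand-ins:

```python
# Hypothetical translation of criterion AC-2 above into a user-facing test.
# Playwright-style API via pytest-playwright; route and selector are stand-ins.
def test_checkout_button_disabled_for_empty_cart(page):
    page.goto("http://localhost:3000/checkout")  # assumed app URL, empty-cart session
    assert page.locator("button#checkout").is_disabled(), \
        "AC-2: an empty cart must not allow checkout"
```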

Boundary with Evaluation Lead: The Evaluation Lead owns agent output correctness (did the agent follow instructions? is the code well-structured? does it pass technical evaluation?). The QA Engineer owns product correctness (does the shipped feature meet the user’s need? does it work as specified? does it meet accessibility and UX standards?). One judges the agent. The other judges the product.

For the full transformation story — how each of these roles evolved from traditional job descriptions and what to screen for when hiring — see Roles in the AI Era.

Scaling Path

Start (7-8 people):

  • 1 AI Architect (leads)
  • 2-3 Product Engineers
  • 1 AI Reliability Engineer (may share duties with the Architect early on)
  • 1 Engineering Manager
  • 1 Product Manager (may be part-time)
  • 1 Product Designer (may be shared across pods early on)

Scale (12-16 people):

  • 1 AI Architect
  • 3-4 Product Engineers (some specializing in different surfaces)
  • 1-2 AI Reliability Engineers
  • 1-2 Evaluation Leads
  • 1 Platform Engineer
  • 1-2 Engineering Managers (one per pod at scale)
  • 1 Product Manager
  • 1-2 Product Designers
  • 1 QA Engineer

The key: roles exist with authority, not as hats stacked on other jobs. This applies equally to product roles. A PM who owns acceptance criteria must have the authority to reject agent output that doesn’t meet the bar.

Decision Rights Matrix

Ownership clarity alone isn’t enough. You also need clear decision rights: every critical decision has a single owner, not a consensus process.

| Decision | Owner | Consulted | Rationale |
| --- | --- | --- | --- |
| Model selection | AI Architect + Evaluation Lead | Product Engineer | Technical fit and eval data must align |
| Orchestration pattern | AI Architect (single owner) | Team | Architecture cascades into everything; needs one voice |
| Cost control | AI Reliability Engineer | AI Architect | Token spend, compute budgets, cost alerts |
| Eval thresholds (“can we ship?”) | Evaluation Lead | Product, Architect | Must be decided before emotional attachment to shipping |
| Feature prioritization | Product Manager + AI Architect | Team | Architect says what’s feasible; PM decides what matters |
| What to build (problem selection) | Product Manager | Architect, Team | PM owns problem definition; engineering owns solution |
| Acceptance criteria | Product Manager | Engineering, Design | Criteria must be agent-executable; PM defines, engineering validates feasibility |
| UX quality standards | Product Designer | PM, Engineering | Design system compliance, accessibility, interaction quality |
| Design system changes | Product Designer | PM, AI Architect | Components and tokens agents generate from |
| Product-level quality thresholds | QA Engineer | PM, Evaluation Lead | User-facing quality distinct from agent output correctness |
| Architecture decisions | AI Architect | Team | No agent makes architecture decisions |
| Security decisions | AI Architect + Reliability Eng | Team | Humans only; never delegated to agents |
| Release decisions | Product Manager + AI Architect | Reliability Eng | Human judgment on production readiness |

The most common mistake: involving too many people in every decision in the hope that consensus catches problems. The result is slow, uncertain decision-making where nobody feels accountable when things go wrong.


Maturity Model

The full five-level maturity model (dimension tables, assessment criteria, prerequisites, and failure modes for each level) lives in the Practitioner Guide. Both guides share this model so practitioners and leaders use a common vocabulary.

Level 1: Assisted

AI provides suggestions that developers accept, modify, or reject; the developer drives all decisions and execution. The shorthand: "We use Copilot for suggestions." See the full maturity model, with dimensions, assessment criteria, and failure modes per level, in the Practitioner Guide.

Critical reminder: A team’s effective level is set by its least-adopted member in a critical-path role — see the Maturity Model in the Practitioner Guide for the full rationale.

Use the maturity level to calibrate the Adoption Roadmap phases and to set KPI targets in the Measurement section below.


Adoption Roadmap

A four-phase rollout over 180 days. Each phase has defined goals, activities, exit criteria, and risk mitigations. The principle: staged adoption gives you speed without blind trust.

Phase 1 goal: Establish baseline metrics, validate feasibility, and build team confidence in a controlled environment.
Activities
  • Select one repository with moderate complexity
  • Define 3 repeatable task types (e.g., API endpoint, test generation, refactor)
  • Measure baseline metrics: PR cycle time, change failure rate, test coverage, bug rate
  • Set up basic guardrails: scope definition, CI as gate, senior review on all agent PRs
  • Every team member runs at least one agent-assisted task
  • Document what works, what fails, and what surprises

Product involvement: The PM participates in defining task types, the Designer reviews agent-generated UI, and baseline product metrics are recorded.

Exit Criteria
  • Baseline metrics recorded for comparison
  • At least 10 agent-assisted tasks completed and reviewed
  • No critical quality incidents from agent output
  • Team can articulate which tasks agents handle well and which they don't
  • Basic rules file created and shared across the team
Risk Mitigations
  • Senior review on 100% of agent PRs — no exceptions in Phase 1
  • Start with low-risk, well-bounded tasks only
  • If a quality incident occurs, pause and retrospect before continuing

Measurement and Failure Modes

KPI Dashboard

Track these metrics weekly. The goal isn’t to maximize agent usage. It’s to deliver faster without quality regression.

| Metric | Target | Definition |
| --- | --- | --- |
| Lead time | Decrease | Time from issue opened to code merged |
| PR review time | Decrease | Time from PR opened to approved |
| Change failure rate | Stable or decrease | % of deployments causing incidents or rollbacks |
| Rollback frequency | Stable or decrease | Number of rollbacks per deployment period |
| Escaped defects | Decrease | Bugs found in production per sprint |
| Test coverage delta | Increase | Change in test coverage over time |
| Deployment frequency | Increase | How often the team deploys to production |
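
As a small sketch, two of these metrics computed from deployment records; the record shape is an assumption, and in practice the data comes from your CI/CD and incident tooling:

```python
# Illustrative weekly computation of two dashboard metrics from deployment
# records. The record shape is an assumption; real data comes from CI/CD and
# incident tooling.
deploys = [
    {"id": "d1", "caused_incident": False, "rolled_back": False},
    {"id": "d2", "caused_incident": True,  "rolled_back": True},
    {"id": "d3", "caused_incident": False, "rolled_back": False},
]

change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)
rollback_frequency = sum(d["rolled_back"] for d in deploys) / len(deploys)
print(f"change failure rate: {change_failure_rate:.0%}")  # 33%
print(f"rollback frequency:  {rollback_frequency:.0%}")   # 33%
```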

The Success Formula
Lead time ↓ AND Deployment frequency ↑ AND Change failure rate stable or ↓ AND Escaped defects ↓. All four must hold together; speed bought with a quality regression fails the formula.

Six Failure Modes

Each failure mode includes the symptom (how you detect it), the root cause (why it happens), and the mitigation (how to fix it).

1. Busywork automation

Symptom: Tasks selected for agents are easy-to-automate busywork rather than genuine bottlenecks. The team optimizes for agent-friendly tasks rather than high-impact ones.

Mitigations:

  • Tie every agent workflow to a measurable delivery KPI
  • Require a "so what?" test: if automated, what bottleneck does it remove?
  • Review task selection criteria quarterly

2. Review becomes the bottleneck

Symptom: Agent output velocity exceeds the team's review capacity, often caused by large, unfocused agent PRs.

Mitigations:

  • Enforce smaller PR scope (one concern per PR, bounded by task template)
  • Tighten acceptance criteria so PRs are more focused
  • Scale review capacity: train more team members
  • Implement risk-based review: low-risk PRs get sampling-based review
  • Consider review automation for mechanical aspects
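
The risk-based review item above could be as simple as the sketch below; the path list, size cutoff, and 20% sampling rate are assumptions to tune against your own incident data:

```python
# Illustrative risk-based review router: high-risk agent PRs always get a human
# reviewer; low-risk PRs are sampled. The path list, size cutoff, and 20% rate
# are assumptions to tune against your own incident data.
import random

HIGH_RISK_PATHS = ("migrations/", "auth/", "payments/")  # assumed hot spots
SIZE_CUTOFF = 300   # lines changed above which a PR is never sampled (assumed)
SAMPLE_RATE = 0.20  # fraction of low-risk PRs still fully reviewed (assumed)

def needs_human_review(changed_files: list[str], lines_changed: int) -> bool:
    high_risk = lines_changed > SIZE_CUTOFF or any(
        path.startswith(HIGH_RISK_PATHS) for path in changed_files
    )
    return high_risk or random.random() < SAMPLE_RATE
```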

3. Incomplete verification

Symptom: Verification gates are incomplete. Agent-generated code passes CI but introduces subtle issues not covered by tests.

Mitigations:

  • Expand evaluation coverage beyond unit tests (integration tests, performance benchmarks, architecture fitness functions)
  • Track post-release defect rate specifically for agent-generated code
  • Implement regular "agent output audits"
  • Monitor change failure rate as an early warning signal

4. Knowledge silos

Symptom: Effective agent interaction patterns are not captured and shared; knowledge stays in individual heads.

Mitigations:

  • Convert individual prompts into shared task templates
  • Maintain team-level rules files (not personal ones)
  • Publish an internal "agentic SOP" with examples
  • Pair programming sessions where skilled users demonstrate approach
  • Make the Learn phase mandatory: every insight gets codified

5. Missing governance

Symptom: Governance infrastructure (Layer 5) was not built before scaling; the organization skipped standardization.

Mitigations:

  • Implement governance layer before cross-team scaling
  • Start with minimum viable governance: agent registry + cost tracking
  • Add access control and audit trails as usage grows
  • Assign a governance owner (AI Reliability or Platform Engineer)
  • Review governance completeness quarterly
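
"Minimum viable governance" can start smaller than it sounds. A sketch of a registry plus cost attribution, with all field names assumed:

```python
# Illustrative minimum viable governance: an agent registry plus cost
# attribution. All field names are assumptions; the point is that every agent
# has an accountable owner, a declared scope, and visible spend.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRecord:
    agent_id: str
    owner: str   # an accountable human, not a team alias
    scope: str   # what the agent is allowed to touch

REGISTRY: dict[str, AgentRecord] = {}
SPEND: dict[str, float] = defaultdict(float)  # USD, attributed per agent

def register(record: AgentRecord) -> None:
    REGISTRY[record.agent_id] = record

def record_cost(agent_id: str, usd: float) -> None:
    if agent_id not in REGISTRY:
        raise PermissionError(f"Unregistered agent {agent_id!r} may not run")
    SPEND[agent_id] += usd

register(AgentRecord("pr-bot-1", owner="alice", scope="repo:web, PRs only"))
record_cost("pr-bot-1", 0.42)
```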

6. Faster at building the wrong things

Symptom: Agent adoption accelerated delivery without improving problem selection; the team is building the wrong things faster.

Mitigations:

  • Tie agent task selection to product outcome metrics
  • Require PM sign-off on every task plan
  • Measure feature adoption and user satisfaction alongside delivery speed
  • Apply "redesign, don't automate" to product discovery, not just delivery
