
Practitioner Guide

How we build and operate with AI agents day-to-day.

Table of contents

Agent Architecture Patterns

Agent Anatomy

Every agent, regardless of framework or vendor, is composed of five core components.

| Component | What It Does | Example |
| --- | --- | --- |
| Model | The reasoning engine. Makes decisions, generates outputs, selects tools. | GPT-5.4, Claude Opus 4.6, Gemini 3 Pro |
| Tools | External functions, APIs, or systems the agent can invoke. | Database queries, web search, code execution, file I/O |
| Instructions | Explicit guidelines, scope constraints, and behavioral rules. | System prompts, AGENTS.md, rules files, policy documents |
| Memory | Context that persists across interactions — short-term (conversation) and long-term (vector stores, key-value). | Conversation history, session state, project context |
| Retrieval | Mechanisms to access external knowledge not in the model's training data. | RAG pipelines, document search, knowledge bases |

These five components are the atoms. Everything else — workflows, agents, multi-agent systems — is a molecule built from them.
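The anatomy above can be sketched as a single data structure. This is an illustrative Python sketch, not the API of any real framework; every name here (`Agent`, `remember`, `retriever`) is an assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative only: one field per core component of the anatomy.
@dataclass
class Agent:
    model: str                                                  # Model: the reasoning engine
    tools: dict[str, Callable] = field(default_factory=dict)    # Tools: invocable capabilities
    instructions: str = ""                                      # Instructions: system prompt, rules
    memory: list[str] = field(default_factory=list)             # Memory: short-term conversation state
    retriever: Optional[Callable[[str], list[str]]] = None      # Retrieval: external knowledge access

    def remember(self, message: str) -> None:
        self.memory.append(message)

agent = Agent(model="example-model", instructions="Answer concisely.")
agent.tools["search"] = lambda q: f"results for {q}"   # a stand-in data tool
agent.remember("user: hello")
```

Workflows and multi-agent systems are then compositions over instances like this one.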

Workflows vs. Agents

A critical architectural distinction, first articulated by Anthropic, governs every design decision downstream:

| Dimension | Workflows | Agents |
| --- | --- | --- |
| Control | Predefined code paths orchestrate the LLM | LLM dynamically directs its own processes |
| Predictability | High — you know the execution path | Lower — the model decides what to do next |
| Flexibility | Low — changes require code changes | High — adapts to novel inputs |
| Best for | Well-defined, repeatable tasks | Open-ended problems with unpredictable steps |
| Cost/latency | Lower — fewer LLM calls, fixed paths | Higher — more calls, dynamic routing |
| Error handling | Programmatic gates and checks | Agent must self-correct or escalate |

The decision rule: use workflows when you can define the task decomposition in advance; use agents when the decomposition depends on the input.
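The distinction is easiest to see in code. In this hypothetical sketch, `workflow` hard-codes the execution path while `agent_loop` lets a model call (stubbed as `llm`) choose each step; all names are illustrative, not a framework API:

```python
# Workflow: the code owns the control flow; the LLM only fills in each step.
def workflow(doc: str, llm) -> str:
    summary = llm(f"Summarize: {doc}")             # step 1 is always summarization
    return llm(f"Translate to French: {summary}")  # step 2 is always translation

# Agent: the LLM owns the control flow; the code just executes its choices.
def agent_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    state = task
    for _ in range(max_steps):                     # hard iteration cap (a guardrail)
        action = llm(f"Choose the next action for: {state}")
        if action == "done":                       # the model decides when to stop
            break
        state = tools[action](state)               # dynamic routing to a tool
    return state
```

Note that the workflow's cost is fixed (two calls), while the agent's cost depends on how many steps the model chooses to take.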

Tool Types

Tools fall into three types. This classification determines how you design, test, and permission agent capabilities:

Data tools — Retrieve context and information. Read-only, low-risk. Examples: Query databases, read documents, search the web, pull CRM records.

Action tools — Interact with systems to change state. Write operations, higher risk. Examples: Send emails, update records, create tickets, issue refunds, deploy code.

Orchestration tools — Other agents exposed as tools. Meta-level coordination. Examples: A “research agent” callable by a “manager agent,” a specialist agent invoked by a triage agent.

Risk rating should be assigned per tool: low (read-only), medium (reversible writes), high (irreversible writes, financial impact, external-facing). These ratings feed directly into the Guardrail Stack.
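One way to make per-tool risk ratings operational is to attach them at registration time and gate high-risk invocations on explicit human approval. A minimal sketch under assumed names (`register`, `invoke` are not from any real library):

```python
from enum import Enum
from typing import Callable

class Risk(Enum):
    LOW = "low"        # read-only data tools
    MEDIUM = "medium"  # reversible writes
    HIGH = "high"      # irreversible writes, financial impact, external-facing

REGISTRY: dict[str, tuple[Callable, Risk]] = {}

def register(name: str, fn: Callable, risk: Risk) -> None:
    REGISTRY[name] = (fn, risk)

def invoke(name: str, arg: str, approved: bool = False) -> str:
    fn, risk = REGISTRY[name]
    # High-risk tools require an explicit approval flag (fed by the Guardrail Stack).
    if risk is Risk.HIGH and not approved:
        raise PermissionError(f"{name} requires human approval")
    return fn(arg)

register("query_db", lambda q: f"rows for {q}", Risk.LOW)        # data tool
register("issue_refund", lambda o: f"refunded {o}", Risk.HIGH)   # action tool
```

Orchestration tools fit the same registry: a sub-agent is just a callable whose risk is the maximum risk of the tools it can reach.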

Composition Patterns

Eight patterns, ordered from simplest to most complex. The rule: always start at the top and move down only when the simpler pattern demonstrably fails.

Pattern 1

Prompt Chaining

Decompose a task into a fixed sequence of steps. Each LLM call processes the output of the previous one. Programmatic gates between steps validate intermediate results.

When to use

Task can be cleanly decomposed into fixed subtasks. You trade latency for accuracy by making each call simpler.

Example

Generate marketing copy, then translate it. Write an outline, validate it against criteria, then write the full document.
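A prompt chain with a programmatic gate between steps might look like the following; `llm` is a stand-in for any model call, and the gate here is deliberately trivial (a non-empty check) where a real chain would validate against concrete criteria:

```python
def chain(topic: str, llm) -> str:
    outline = llm(f"Write an outline for: {topic}")
    # Programmatic gate: validate the intermediate result before spending
    # another model call on it.
    if not outline.strip():
        raise ValueError("gate failed: empty outline")
    return llm(f"Write the full document from this outline: {outline}")
```

Each call handles a simpler subtask than a single monolithic prompt would, which is where the accuracy-for-latency trade comes from.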

Pattern Selection Guide

Always start at the top. Move down only when the simpler pattern demonstrably fails.

| Start with | Failure signal | Move to |
| --- | --- | --- |
| Single LLM call with good prompting | Output quality insufficient | Prompt Chaining or Evaluator-Optimizer |
| Prompt Chaining | Task decomposition isn't fixed | Single Agent Loop |
| Single Agent Loop | Too many tools (>15) or overlapping concerns | Manager or Orchestrator-Workers |
| Single Agent Loop | Distinct categories with different handling | Routing + specialized agents |
| Manager pattern | Central agent bottlenecks; specialists need full autonomy | Decentralized Handoff |

Universal rule: Maximize a single agent's capabilities before splitting into multiple agents.


The Guardrail Stack

Five layers of defense, grounded in Principle 4: Guardrails Are Non-Negotiable. Each layer addresses a different failure mode. All five must be in place before scaling agent operations.

Fleet-level controls for managing agents at organizational scale.

| Element | Description |
| --- | --- |
| Registry | Single source of truth tracking all agents, their capabilities, owners, and status |
| Access control | Role-based permissions determining which agents can access which systems and data |
| Observability | Unified monitoring across all agents — execution traces, cost tracking, error rates, latency |
| Interoperability | Standards for agents to work across platforms and teams (e.g., Model Context Protocol) |
| Audit trail | Complete record of agent actions, decisions, and outcomes for compliance and debugging |
| Cost budgeting | Per-agent and per-team token/compute budgets with alerts and hard limits |
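As an illustration of the cost-budgeting element, a per-agent token budget with a soft alert threshold and a hard limit could be as simple as this sketch (class name and the 80% alert ratio are assumptions, not a standard):

```python
class TokenBudget:
    """Per-agent token budget: soft alert at a ratio, hard refusal at the limit."""

    def __init__(self, limit: int, alert_ratio: float = 0.8):
        self.limit = limit
        self.alert_ratio = alert_ratio
        self.used = 0

    def spend(self, tokens: int) -> str:
        # Hard limit: refuse the call rather than exceed the budget.
        if self.used + tokens > self.limit:
            raise RuntimeError("budget exceeded: hard limit reached")
        self.used += tokens
        # Soft limit: surface an alert while still allowing the call.
        if self.used >= self.alert_ratio * self.limit:
            return "alert: approaching budget"
        return "ok"
```

In a real fleet, `spend` would be called by the orchestration layer before each model invocation, with the alert wired to observability.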

Ensure humans retain authority over decisions that agents must not make autonomously.

| Element | Description |
| --- | --- |
| Architecture | System design, technology choices, data model changes |
| Risk acceptance | Shipping known tradeoffs, accepting technical debt |
| Release timing | When code goes to production |
| Incident response | Rollback decisions, postmortem actions |
| Security-critical changes | Authentication, authorization, encryption |
| Cost commitments | Actions with financial impact above defined thresholds |

Enforce safety, compliance, and organizational rules that automated quality checks cannot catch.

| Element | Description | Failure prevented |
| --- | --- | --- |
| No secret exposure | Automated secret scanning in pre-commit and CI | Credentials leaking into repositories |
| PII filtering | LLM-based or regex-based PII detection on outputs | Privacy violations in generated content |
| Safety classification | Detect prompt injection, jailbreak attempts | System exploitation |
| Relevance classification | Flag off-topic or out-of-scope agent behavior | Scope drift and waste |
| Moderation | Content safety checks on agent outputs | Harmful or inappropriate generated content |
| Dependency policy | Block unsafe dependency upgrades or additions | Supply chain attacks |
| Branch policy | No direct pushes to main/protected branches | Unreviewed code reaching production |
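As a concrete (and deliberately minimal) illustration of regex-based PII filtering on agent outputs, the patterns below are simplified examples; production detection needs far broader coverage and usually an LLM-based second pass:

```python
import re

# Simplified example patterns; real PII detection needs much wider coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a labeled redaction marker."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text
```

A filter like this sits between the agent and anything user-facing, so a policy violation becomes a redaction rather than an incident.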

Enforce code and output correctness through automated checks before any human review.

| Element | Description |
| --- | --- |
| Formatting & linting | Enforce style consistency (Black, ESLint, Prettier, etc.) |
| Type checking | Static type verification (mypy, TypeScript strict mode) |
| Unit & integration tests | Existing test suite must pass; new code must include tests |
| Static analysis | Security scanning, dependency vulnerability checks |
| Coverage thresholds | No regressions in test coverage |
| Design system compliance | Agent-generated UI follows the component library and design tokens |
| Accessibility standards | WCAG compliance checks on generated interfaces |

Prevent agents from drifting beyond their assigned task — the most common failure mode in practice.

| Element | Description | Example |
| --- | --- | --- |
| Target | Specific files, directories, or systems the agent may touch | src/api/users/, payments_table |
| Non-goals | What the agent must NOT change | "Do not modify authentication logic" |
| Acceptance criteria | Concrete definition of "done" | "All tests pass, endpoint returns 200 with valid payload" |
| Allowed dependencies | What the agent may import or call | "No new external packages without approval" |
| Max iterations | Upper bound on agent execution cycles | 20 tool calls, 10 minutes wall time |
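A scope contract like this can be enforced mechanically before every write. The sketch below checks path targets, non-goals, and an iteration cap; all field and method names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ScopeContract:
    allowed_paths: tuple[str, ...]    # Target: what the agent may touch
    forbidden_paths: tuple[str, ...]  # Non-goals: what it must not change
    max_iterations: int               # upper bound on execution cycles
    iterations: int = 0

    def check_write(self, path: str) -> None:
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("scope violation: iteration budget exhausted")
        if any(path.startswith(p) for p in self.forbidden_paths):
            raise PermissionError(f"scope violation: {path} is a non-goal")
        if not any(path.startswith(p) for p in self.allowed_paths):
            raise PermissionError(f"scope violation: {path} outside target")
```

The point is that drift stops being a matter of prompt discipline: an out-of-scope write fails loudly instead of landing in the diff.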

The Operating Loop

The Plan-Execute-Verify-Ship-Learn Cycle

This is the day-to-day execution model. It replaces ad-hoc prompting with a loop you can actually repeat and improve.

Plan (Product + Engineering)

Before any agent touches code, define the contract that bounds execution.

  • Product defines: Goal, acceptance criteria, UX requirements
  • Engineering defines: Scope, non-goals, risk level, constraints, verification method
  • The plan is not a suggestion — it is the contract that bounds agent execution
  • Without it, you get creative drift

Task Classification Matrix

Not every task is equally suited for agent execution. Classify tasks along two dimensions: boundedness (how well-defined is the scope?) and risk (what is the blast radius if the agent gets it wrong?).

Risk axis: Low, Medium, High. Boundedness axis: Well-bounded, Semi-bounded, Open-ended.

Agent-driven: Well-bounded / Low Risk

Automated verification. Sampling review.

Engineering
  • API endpoints and CRUD features
  • Code formatting, linting, and style fixes
  • Documentation and changelog generation
Product
  • Copy and microcopy generation within brand guidelines
  • Test case generation from acceptance criteria
  • Competitive analysis summaries from public data
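The matrix can be encoded as a lookup with a conservative default. Only the well-bounded/low-risk cell is taken from the matrix above; the other two entries are illustrative assumptions, and everything unlisted falls back to human-led:

```python
# Hypothetical mapping of (boundedness, risk) to an execution mode.
# Only the well-bounded/low entry is from the matrix; others are assumed.
MODES = {
    ("well-bounded", "low"): "agent-driven",
    ("well-bounded", "medium"): "agent drafts, human verifies",
    ("semi-bounded", "low"): "agent drafts, human verifies",
}

def classify(boundedness: str, risk: str) -> str:
    # Conservative default: anything not explicitly delegated stays human-led.
    return MODES.get((boundedness, risk), "human-led")
```

Defaulting to human-led encodes the rule below: when you don't know what failure looks like, a human leads until you do.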

The Human-Agent Boundary

The boundary is defined by a simple question: “If the agent gets this wrong, what happens?”

  • If the answer is “CI catches it and the PR is rejected” -> agent can execute autonomously
  • If the answer is “a customer sees wrong data” -> agent drafts, human verifies
  • If the answer is “we have a security breach” -> human leads, agent assists at most
  • If the answer is “we don’t know” -> human leads until we do know

The boundary moves over time. As guardrails improve, evaluation coverage increases, and confidence grows, tasks shift from human-led to agent-assisted to agent-driven. But you earn that movement through demonstrated reliability. You don’t assume it.

For how each traditional role transforms under these boundaries — from Software Engineer to Product Designer — see Roles in the AI Era. The Competency Evolution Explorer maps which competencies carry over, which are new, and which to sunset per role.


Maturity Model

A five-level maturity model, drawing on Eledath's "Levels of Agentic Engineering" framework and extending it with team-level dimensions, product integration, and failure modes. Each level describes where a team is, what it can do, what it needs to move up, and what risks come with the territory.

This is the canonical maturity model for the entire HELM framework. The Leadership Guide references this model and provides a quick-reference table for leadership conversations.

Level 1

Assisted

AI provides suggestions that developers accept, modify, or reject. The developer drives all decisions and execution.

"We use Copilot for suggestions"
Dimensions
| Dimension | Level 1: Assisted |
| --- | --- |
| Capabilities | Tab completion, inline suggestions, single-turn Q&A, code explanation |
| Tools | GitHub Copilot, ChatGPT, basic AI-assisted IDE features |
| Human role | Full control. AI is a passive tool. |
| Agent autonomy | None. Every output requires explicit human action. |
| Product dimension | PMs and designers use AI for ad-hoc tasks (drafting docs, brainstorming). No integration with engineering workflows. |
| Risk profile | Low. Developer reviews every suggestion. |
Assessment Criteria
  • Developers use AI for suggestions but control all execution
  • No structured prompting or context engineering
  • No shared rules or templates for AI usage
  • AI usage is individual, not team-standardized
Failure mode: Over-trust of suggestions without review; cargo-cult coding.

Critical reminder: The team’s effective level equals the level of the lowest-capability person in a critical-path role, not the highest-capability individual. A team with one Level 4 engineer and a Level 1 gatekeeper operates at Level 1.


Further Reading