Agent Architecture Patterns
Agent Anatomy
Every agent, regardless of framework or vendor, is composed of five core components.
| Component | What It Does | Example |
|---|---|---|
| Model | The reasoning engine. Makes decisions, generates outputs, selects tools. | GPT-5.4, Claude Opus 4.6, Gemini 3 Pro |
| Tools | External functions, APIs, or systems the agent can invoke. | Database queries, web search, code execution, file I/O |
| Instructions | Explicit guidelines, scope constraints, and behavioral rules. | System prompts, AGENTS.md, rules files, policy documents |
| Memory | Context that persists across interactions — short-term (conversation) and long-term (vector stores, key-value). | Conversation history, session state, project context |
| Retrieval | Mechanisms to access external knowledge not in the model’s training data. | RAG pipelines, document search, knowledge bases |
These five components are the atoms. Everything else — workflows, agents, multi-agent systems — is a molecule built from them.
Workflows vs. Agents
A critical architectural distinction, first articulated by Anthropic, governs every design decision downstream:
| Dimension | Workflows | Agents |
|---|---|---|
| Control | Predefined code paths orchestrate the LLM | LLM dynamically directs its own processes |
| Predictability | High — you know the execution path | Lower — the model decides what to do next |
| Flexibility | Low — changes require code changes | High — adapts to novel inputs |
| Best for | Well-defined, repeatable tasks | Open-ended problems with unpredictable steps |
| Cost/latency | Lower — fewer LLM calls, fixed paths | Higher — more calls, dynamic routing |
| Error handling | Programmatic gates and checks | Agent must self-correct or escalate |
The decision rule: use workflows when you can define the task decomposition in advance; use agents when the decomposition depends on the input.
Tool Types
Tools fall into three types. This classification determines how you design, test, and permission agent capabilities:
Data tools — Retrieve context and information. Read-only, low-risk. Examples: Query databases, read documents, search the web, pull CRM records.
Action tools — Interact with systems to change state. Write operations, higher risk. Examples: Send emails, update records, create tickets, issue refunds, deploy code.
Orchestration tools — Other agents exposed as tools. Meta-level coordination. Examples: A “research agent” callable by a “manager agent,” a specialist agent invoked by a triage agent.
Risk rating should be assigned per tool: low (read-only), medium (reversible writes), high (irreversible writes, financial impact, external-facing). These ratings feed directly into the Guardrail Stack.
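These type and risk classifications lend themselves to a simple tool registry. A minimal Python sketch, where the tool names, the `Risk` enum, and the `requires_human_approval` helper are all illustrative assumptions rather than any framework's API:

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = "low"        # read-only data tools
    MEDIUM = "medium"  # reversible writes
    HIGH = "high"      # irreversible writes, financial impact, external-facing

@dataclass(frozen=True)
class Tool:
    name: str
    kind: str   # "data", "action", or "orchestration"
    risk: Risk

# Hypothetical registry entries, for illustration only.
REGISTRY = [
    Tool("query_crm", "data", Risk.LOW),
    Tool("update_ticket", "action", Risk.MEDIUM),
    Tool("issue_refund", "action", Risk.HIGH),
]

def requires_human_approval(tool: Tool) -> bool:
    # High-risk tools feed into the guardrail stack as human-approval gates.
    return tool.risk is Risk.HIGH
```

Assigning the rating at registration time, rather than at call time, is what lets the guardrail layers treat it as static policy.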
Composition Patterns
Eight patterns, ordered from simplest to most complex. The rule: always start at the top and move down only when the simpler pattern demonstrably fails.
Prompt Chaining
Decompose a task into a fixed sequence of steps. Each LLM call processes the output of the previous one. Programmatic gates between steps validate intermediate results.
When to use
Task can be cleanly decomposed into fixed subtasks. You trade latency for accuracy by making each call simpler.
Example
Generate marketing copy, then translate it. Write an outline, validate it against criteria, then write the full document.
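The pattern can be sketched in a few lines of Python; `call_llm` here is a canned stand-in for a real model call, and the gate simply rejects empty intermediate output:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text for illustration."""
    return f"output({prompt.split(':')[0]})"

def gate(text: str) -> str:
    # Programmatic gate between steps: reject bad intermediate results early
    # instead of letting errors propagate down the chain.
    if not text.strip():
        raise ValueError("empty intermediate result")
    return text

def outline_then_write(topic: str) -> str:
    # Fixed sequence: each call consumes the previous call's output.
    outline = gate(call_llm(f"outline: {topic}"))
    draft = gate(call_llm(f"draft: {outline}"))
    return call_llm(f"polish: {draft}")
```

The gates are where the latency-for-accuracy trade pays off: each step is simple enough to validate programmatically.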
Routing
Classify the input and direct it to a specialized handler. Each route has its own optimized prompt and tools.
When to use
Distinct categories that are better handled separately. Classification can be done accurately.
Example
Customer service — route general questions, refund requests, and technical support to different downstream processes.
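A sketch of the routing skeleton; the keyword classifier stands in for what would be an LLM classification call in practice, and the handler names are invented:

```python
def classify(query: str) -> str:
    # Stand-in classifier; a production router would make an LLM call here.
    q = query.lower()
    if "refund" in q:
        return "refunds"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

def handle_general(q: str) -> str:   return f"general:{q}"
def handle_refunds(q: str) -> str:   return f"refunds:{q}"
def handle_technical(q: str) -> str: return f"technical:{q}"

# Each route gets its own optimized prompt and tool set.
ROUTES = {
    "general": handle_general,
    "refunds": handle_refunds,
    "technical": handle_technical,
}

def route(query: str) -> str:
    return ROUTES[classify(query)](query)
```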
Parallelization
Run subtasks simultaneously and aggregate results. Two variants: Sectioning (independent subtasks) and Voting (same task, multiple perspectives).
When to use
Subtasks are independent (sectioning) or you need multiple perspectives (voting).
Example
One model processes the user query while another screens for safety. Multiple prompts review code for vulnerabilities; flag if any finds a problem.
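The voting variant can be sketched like this; the two reviewer functions are trivial stand-ins for distinct LLM reviewer prompts:

```python
from concurrent.futures import ThreadPoolExecutor

def review_for_vulnerabilities(code: str, reviewers) -> bool:
    """Voting variant: run the same task under multiple reviewer prompts in
    parallel and flag the code if ANY reviewer finds a problem."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        verdicts = list(pool.map(lambda review: review(code), reviewers))
    return any(verdicts)

# Stand-in reviewers; each would be a differently-prompted LLM in practice.
def sql_check(code: str) -> bool:  return "execute(" in code
def eval_check(code: str) -> bool: return "eval(" in code
```

Sectioning has the same shape, except each worker gets a different subtask and the results are concatenated rather than voted on.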
Evaluator-Optimizer
One LLM generates a response. Another evaluates it and provides feedback. Loop until quality criteria are met.
When to use
Clear evaluation criteria exist, and iterative refinement provides measurable improvement.
Example
Literary translation with a critic loop. Complex search tasks requiring multiple rounds of analysis.
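The loop reduces to a generator/critic pair with a round budget; a sketch, with `generate` and `evaluate` assumed to wrap LLM calls:

```python
def refine(task, generate, evaluate, max_rounds=3):
    """Generator/critic loop: stop when the evaluator accepts the draft,
    or when the round budget runs out (always bound the loop)."""
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        accepted, feedback = evaluate(draft)
        if accepted:
            break
        draft = generate(task, feedback=feedback)
    return draft
```

The `max_rounds` bound matters: without clear acceptance criteria the critic can withhold approval indefinitely.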
Single Agent Loop
A single LLM with tools operates in a loop until an exit condition is met (final output, no tool calls, error, or max iterations). The fundamental agent pattern.
When to use
Dynamic decision-making about which tools to call and in what order, but complexity does not warrant splitting across multiple agents.
Example
A coding agent that reads files, writes code, runs tests, and iterates until tests pass.
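A minimal sketch of the loop and its exit conditions; the `model` callable stands in for an LLM that returns either a tool call or a final answer:

```python
def agent_loop(task, model, tools, max_iterations=20):
    """The fundamental pattern: the model picks an action each turn; the loop
    exits on a final answer or when the iteration budget is exhausted."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        action = model(history)                 # stand-in for an LLM call
        if action["type"] == "final":
            return action["content"]
        result = tools[action["tool"]](*action.get("args", []))
        history.append({"role": "tool", "content": result})
    raise RuntimeError("max iterations reached without a final answer")
```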
Orchestrator-Workers
A central LLM dynamically breaks down tasks, delegates to worker LLMs, and synthesizes results. Unlike parallelization, subtasks are not pre-defined.
When to use
Complex tasks where you cannot predict the number or nature of subtasks in advance.
Example
A coding product that determines which files need changing and dispatches changes to workers.
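The control flow can be sketched as three injected callables; in practice `plan` is itself an LLM call, which is exactly what makes the subtasks dynamic rather than predefined:

```python
def orchestrate(task, plan, work, synthesize):
    """Unlike parallelization, the subtasks are not known in advance:
    the planner decides them at runtime based on the task."""
    subtasks = plan(task)
    results = [work(sub) for sub in subtasks]
    return synthesize(results)
```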
Manager (Agents-as-Tools)
A central "manager" agent calls specialized agents as tools. The manager retains control and context, synthesizing outputs into a unified interaction.
When to use
You want a single agent maintaining central control and user interaction while delegating specialized work.
Example
A manager agent that calls translator agents for Spanish, French, and Italian, synthesizing all results for the user.
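A sketch of the translation example; the specialist agents are reduced to plain functions to show the shape, since to the manager they are just tools:

```python
# Hypothetical specialist agents, each exposed to the manager as a plain tool.
def spanish_agent(text: str) -> str: return f"es:{text}"
def french_agent(text: str) -> str:  return f"fr:{text}"
def italian_agent(text: str) -> str: return f"it:{text}"

TRANSLATORS = {"es": spanish_agent, "fr": french_agent, "it": italian_agent}

def manager(text: str, languages) -> str:
    """The manager calls specialists as tools and synthesizes the results.
    Control and user context never leave the manager."""
    results = [TRANSLATORS[lang](text) for lang in languages]
    return "; ".join(results)
```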
Decentralized Handoff
Agents operate as peers, handing off full execution control to one another based on specialization. No central coordinator.
When to use
You don't need central control or synthesis. Each specialized agent can fully take over the interaction.
Example
A triage agent hands off entirely to technical support, sales, or order management. The receiving agent owns the conversation.
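The handoff loop can be sketched as follows; the `handoff` key signaling a control transfer is an illustrative convention, not any SDK's schema:

```python
def run_with_handoffs(agents, start_agent, message):
    """Peer agents pass full execution control via handoffs; there is no
    central coordinator, and the receiving agent owns the conversation."""
    current = start_agent
    while True:
        outcome = agents[current](message)
        if "handoff" in outcome:
            current = outcome["handoff"]   # control transfers entirely
        else:
            return current, outcome["reply"]
```

Contrast with the Manager pattern: here the triage agent disappears from the flow after the handoff instead of synthesizing the result.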
Pattern Selection Guide
Always start at the top. Move down only when the simpler pattern demonstrably fails.
Universal rule: Maximize a single agent's capabilities before splitting into multiple agents.
The Guardrail Stack
Five layers of defense, grounded in Principle 4: Guardrails Are Non-Negotiable. Each layer addresses a different failure mode. All five must be in place before scaling agent operations.
Governance
Fleet-level controls for managing agents at organizational scale.
| Element | Description |
|---|---|
| Registry | Single source of truth tracking all agents, their capabilities, owners, and status |
| Access control | Role-based permissions determining which agents can access which systems and data |
| Observability | Unified monitoring across all agents — execution traces, cost tracking, error rates, latency |
| Interoperability | Standards for agents to work across platforms and teams (e.g., Model Context Protocol) |
| Audit trail | Complete record of agent actions, decisions, and outcomes for compliance and debugging |
| Cost budgeting | Per-agent and per-team token/compute budgets with alerts and hard limits |
Human Oversight
Ensure humans retain authority over decisions that agents must not make autonomously.
| Element | Description |
|---|---|
| Architecture | System design, technology choices, data model changes |
| Risk acceptance | Shipping known tradeoffs, accepting technical debt |
| Release timing | When code goes to production |
| Incident response | Rollback decisions, postmortem actions |
| Security-critical changes | Authentication, authorization, encryption |
| Cost commitments | Actions with financial impact above defined thresholds |
Policy Guardrails
Enforce safety, compliance, and organizational rules that automated quality checks cannot catch.
| Element | Description | Example |
|---|---|---|
| No secret exposure | Automated secret scanning in pre-commit and CI | Credentials leaking into repositories |
| PII filtering | LLM-based or regex-based PII detection on outputs | Privacy violations in generated content |
| Safety classification | Detect prompt injection, jailbreak attempts | System exploitation |
| Relevance classification | Flag off-topic or out-of-scope agent behavior | Scope drift and waste |
| Moderation | Content safety checks on agent outputs | Harmful or inappropriate generated content |
| Dependency policy | Block unsafe dependency upgrades or additions | Supply chain attacks |
| Branch policy | No direct pushes to main/protected branches | Unreviewed code reaching production |
Quality Gates
Enforce code and output correctness through automated checks before any human review.
| Element | Description |
|---|---|
| Formatting & linting | Enforce style consistency (Black, ESLint, Prettier, etc.) |
| Type checking | Static type verification (mypy, TypeScript strict mode) |
| Unit & integration tests | Existing test suite must pass; new code must include tests |
| Static analysis | Security scanning, dependency vulnerability checks |
| Coverage thresholds | No regressions in test coverage |
| Design system compliance | Agent-generated UI follows the component library and design tokens |
| Accessibility standards | WCAG compliance checks on generated interfaces |
Scope Boundaries
Prevent agents from drifting beyond their assigned task — the most common failure mode in practice.
| Element | Description | Example |
|---|---|---|
| Target | Specific files, directories, or systems the agent may touch | src/api/users/, payments_table |
| Non-goals | What the agent must NOT change | "Do not modify authentication logic" |
| Acceptance criteria | Concrete definition of "done" | "All tests pass, endpoint returns 200 with valid payload" |
| Allowed dependencies | What the agent may import or call | "No new external packages without approval" |
| Max iterations | Upper bound on agent execution cycles | 20 tool calls, 10 minutes wall time |
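The five elements above can be expressed as an explicit, checkable contract object. A sketch with invented field names; real enforcement would hook `permits` into the agent's file-access layer:

```python
from dataclasses import dataclass, field

@dataclass
class TaskBoundary:
    """The five scope elements as an explicit contract. Field names are
    illustrative, not taken from any particular framework."""
    target: list                 # files/systems the agent may touch
    non_goals: list              # what it must NOT change
    acceptance_criteria: str     # concrete definition of "done"
    allowed_dependencies: list = field(default_factory=list)
    max_tool_calls: int = 20

    def permits(self, path: str) -> bool:
        # A touched path must fall inside the target and outside the non-goals.
        in_target = any(path.startswith(t) for t in self.target)
        forbidden = any(path.startswith(n) for n in self.non_goals)
        return in_target and not forbidden
```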
The Operating Loop
The Plan-Execute-Verify-Ship-Learn Cycle
This is the day-to-day execution model. It replaces ad-hoc prompting with a loop you can actually repeat and improve.
Plan
Before any agent touches code, define the contract that bounds execution.
- Product defines: Goal, acceptance criteria, UX requirements
- Engineering defines: Scope, non-goals, risk level, constraints, verification method
- The plan is not a suggestion — it is the contract that bounds agent execution
- Without it, you get creative drift
Execute
Agents work autonomously within the bounds set by the Plan phase.
- Generate code, tests, documentation, or refactors
- Call tools as needed (data retrieval, API interactions, code execution)
- Iterate within the defined scope (run tests, fix failures, retry)
- Operate within configured iteration limits
- The human's role during execution is monitoring, not directing
Verify
Automated and human checks before anything merges.
- Automated: CI pipeline, static analysis, security scanning, coverage thresholds, policy checks
- Human — Engineering: code review, architecture alignment, edge case consideration
- Human — Product: acceptance review, UX review, copy review, accessibility check
- Review depth scales with risk level
Ship
Merge and deploy with full auditability and rollback capability.
- Audit trail: what was generated, by which agent, reviewed by whom
- Rollback path: every deployment must be reversible within a defined SLA
- Post-merge monitoring: watch for anomalies in error rates and latency
- Diff review: ensure merged code matches what was reviewed
Learn
After shipping, codify the lessons so each cycle makes the next one better.
- Update rules files: add rules to prevent recurrence of manual corrections
- Refine task templates: tighten plans that were ambiguous
- Update evaluation criteria: add checks the verify step missed
- Share across the team: convert successful patterns into shared templates and SOPs
- This is "compounding engineering" — without Learn, you get repetition instead of improvement
Task Classification Matrix
Not every task is equally suited for agent execution. Classify tasks along two dimensions: boundedness (how well-defined is the scope?) and risk (what is the blast radius if the agent gets it wrong?).
Automated verification. Sampling review.
Engineering
- API endpoints and CRUD features
- Code formatting, linting, and style fixes
- Documentation and changelog generation
Product
- Copy and microcopy generation within brand guidelines
- Test case generation from acceptance criteria
- Competitive analysis summaries from public data
Automated + human verification.
Engineering
- Frontend component generation and cleanup
- Test generation for existing business logic
- Migration scripts and repetitive refactors
Product
- PRD drafts from user research notes
- User story decomposition from high-level requirements
- Design-to-code translation using design system components
Human-led with agent drafts. Full review.
Engineering
- Security-adjacent feature implementation
- Payment flow modifications
- Data model changes with migration
Product
- User-facing copy with legal/compliance implications
- Onboarding flow changes affecting activation metrics
With human plan review.
Engineering
- Exploratory refactors with clear goals
- Performance optimization within defined bounds
Product
- Feature spec elaboration from brief outline
- Design exploration within existing system
Human review mandatory.
Engineering
- Cross-service integration work
- Complex business logic implementation
Product
- Multi-step workflow redesign
- New feature prototyping within constraints
Agent may draft, human designs and reviews.
Engineering
- Auth system modifications
- Infrastructure security hardening
Product
- Pricing model implementation
- Compliance-critical workflow changes
With agent support for research/exploration.
Engineering
- Technology evaluation and prototyping
- Architecture documentation drafts
Product
- Market research synthesis
- Competitive landscape analysis
Agent provides options, human decides.
Engineering
- Novel architecture decisions
- Performance-sensitive distributed systems
Product
- Product strategy options analysis
- User research synthesis and insight generation
Agent excluded.
Engineering
- Incident response and production debugging
- Security breach investigation
Product
- Product strategy and roadmap prioritization
- Brand voice definition and tone calibration
- Pricing and packaging decisions
The Human-Agent Boundary
The boundary is defined by a simple question: “If the agent gets this wrong, what happens?”
- If the answer is “CI catches it and the PR is rejected” -> agent can execute autonomously
- If the answer is “a customer sees wrong data” -> agent drafts, human verifies
- If the answer is “we have a security breach” -> human leads, agent assists at most
- If the answer is “we don’t know” -> human leads until we do know
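This decision rule is mechanical enough to encode; the consequence labels and mode names below are illustrative assumptions, not canonical terminology:

```python
# Hypothetical mapping of "if the agent gets this wrong, what happens?"
# to an execution mode; labels are illustrative only.
AUTONOMY = {
    "ci_rejects_pr":    "agent_autonomous",
    "customer_sees_it": "agent_drafts_human_verifies",
    "security_breach":  "human_leads_agent_assists",
}

def execution_mode(failure_consequence: str) -> str:
    # Default to human-led whenever the blast radius is not understood:
    # "we don't know" means a human leads until we do know.
    return AUTONOMY.get(failure_consequence, "human_leads")
```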
The boundary moves over time. As guardrails improve, evaluation coverage increases, and confidence grows, tasks shift from human-led to agent-assisted to agent-driven. But you earn that movement through demonstrated reliability. You don’t assume it.
For how each traditional role transforms under these boundaries — from Software Engineer to Product Designer — see Roles in the AI Era. The Competency Evolution Explorer maps which competencies carry over, which are new, and which to sunset per role.
Maturity Model
A five-level maturity model, drawing on Eledath’s “Levels of Agentic Engineering” framework and extending it with team-level dimensions, product integration, and failure modes. Each level says where a team is, what it can do, what it needs to move up, and what risks come with the territory.
This is the canonical maturity model for the entire HELM framework. The Leadership Guide references this model and provides a quick-reference table for leadership conversations.
Level 1: Assisted
AI provides suggestions that developers accept, modify, or reject. The developer drives all decisions and execution.
Assessment Criteria
- Developers use AI for suggestions but control all execution
- No structured prompting or context engineering
- No shared rules or templates for AI usage
- AI usage is individual, not team-standardized
Level 2: Structured
AI operates within structured contexts. Teams use dedicated AI IDEs, maintain rules files, and follow defined prompting patterns.
Assessment Criteria
- Team uses AI IDEs with structured context (rules files, project-level instructions)
- Shared task templates exist for common operations
- All AI-generated output goes through standard code review
- Team has basic conventions for when and how to use AI tools
Level 3: Integrated
AI agents are integrated into the development lifecycle through automated feedback loops. CI serves as the verification layer.
Assessment Criteria
- Agents iterate based on CI/test feedback without human intervention in the loop
- Rules files and templates are updated after each cycle (compounding engineering)
- Evaluation coverage is explicitly tracked and improving
- Guardrail stack (all 5 layers) is operational
- Team measures agentic adoption KPIs
Level 4: Autonomous
Agents operate in the background, working on tasks asynchronously. Humans define tasks and review results.
Assessment Criteria
- Agents produce PRs asynchronously (not just in interactive sessions)
- Human review happens after completion, not during execution
- Cost tracking and budgeting is active per agent and per team
- Execution traces provide full visibility into agent reasoning and actions
- Incident response protocol exists for agent-caused failures
Level 5: Orchestrated
Multiple agents coordinate in parallel, managed by orchestration systems.
Assessment Criteria
- An orchestrator coordinates multiple agents working on related tasks
- Agents are shared and reused across teams (internal skills marketplace)
- Fleet-level observability tracks all agents, costs, and outcomes in one view
- Governance framework (registry, access control, audit) is fully operational
- The organization can articulate and enforce policies across the agent fleet
Critical reminder: The team’s effective level equals the level of the lowest-capability person in a critical-path role, not the highest-capability individual. A team with one Level 4 engineer and a Level 1 gatekeeper operates at Level 1.
Further Reading
- Building Effective Agents — Anthropic
- A Practical Guide to Building Agents — OpenAI
- AI Agents Whitepaper — Google
- Cloud Adoption Framework — Microsoft
- Agentic Engineering for Software Teams — vibecoding.app
- The 8 Levels of Agentic Engineering — Eledath
- Top Strategic Technology Trends 2026 — Gartner
- Five Stages of Agentic Evolution — Gartner