Specialist Course
AI Agents
& Automation
Design, evaluate, and deploy useful agents that survive first contact with real business constraints — without the hype.
Welcome
Who are you, and what
brought you here today?
Take 60 seconds. Share your name, your role, and the one thing you most want to learn about AI agents.
Round-table introductions help us calibrate examples and pair people for lab work. No wrong answers.
“
Learn to design, evaluate, and deploy useful agents that survive first contact with real business constraints.
The Course Promise
Overview
Today's
10 Modules
A full-day journey from foundations through deployment, ending with your own capstone agent.
- 01 Agentic AI & Business Value
- 02 Models, Tools & Protocols
- 03 Single-Agent Design Patterns
- 04 Knowledge Systems
- 05 Automation & Business Actions
- 06 Multi-Agent Orchestration
- 07 Evaluation & Observability
- 08 Security, Governance & AU Compliance
- 09 Deployment & Lifecycle
- 10 Capstone Studio
Audience
Who This Course Is For
- Knowledge workers & operators — looking to automate repetitive workflows and reclaim strategic time
- Consultants & agency owners — wanting to offer agentic AI services or build internal efficiencies
- Product managers & ops leads — evaluating where agents fit in their product or process roadmap
- Technical builders & developers — ready to move from prototypes to production-grade agents
- Team leaders & enterprise champions — building the business case for responsible AI adoption
Time Commitment
~30 hrs
of guided learning across ten modules, plus 10–15 hours of independent project work building your capstone agent.
Module 01
Agentic AI &
Business Value
Separate signal from noise. Learn when agents create real value — and when they are expensive distractions.
01
Agentic AI & Business Value
Definitions, the agent spectrum, ROI frameworks, and knowing when NOT to build.
Module 01 · Objectives
What You'll
Learn
By the end of this module you will be able to:
- Distinguish assistants from automations from agents — and explain why the difference matters
- Choose when a full agent is justified versus simpler alternatives
- Score use cases for frequency, consequence, and ROI using a structured framework
- Identify when NOT to use an agent — the most valuable skill in agent design
Module 01 · Core Concept
What Is an AI Agent?
- Perceives its environment — reads data, monitors inboxes, watches dashboards, ingests context
- Plans and reasons about tasks — breaks goals into steps, weighs alternatives, re-plans when blocked
- Takes actions using tools — calls APIs, searches the web, writes files, sends messages
- Operates with some autonomy — makes decisions within boundaries without human approval at every step
- Has feedback loops to self-correct — evaluates its own outputs, retries on failure, escalates when uncertain
Module 01 · Comparison
Agent vs Chatbot vs Automation
💬
Chatbot
- Responds to prompts
- No tool access
- Stateless between sessions
- Human drives every step
⚙️
Automation
- Rule-based triggers
- Deterministic flow
- If/then logic only
- Breaks on exceptions
🤖
Agent
- Reasons about goals
- Uses tools dynamically
- Handles ambiguity
- Self-corrects on errors
Module 01 · Framework
The Agent Spectrum
Not everything needs to be a full agent. Match the level of autonomy to the task.
- Level 0 Prompt — a single instruction, no memory, no tools, no loops
- Level 1 Assistant — multi-turn conversation with context window but still human-driven
- Level 2 Tool-calling — the model selects and invokes functions but a human approves each call
- Level 3 Single Agent — autonomous planning, tool use, retries, and self-evaluation within guardrails
- Level 4 Multi-Agent System — multiple specialised agents coordinated by an orchestrator
Module 01 · Industry Data
69%
of Australian organisations are already using agentic AI, according to Deloitte (2026).
But only 22% report advanced agent governance in place.
Module 01 · Reality Check
ROI vs Hype
- Gartner warns 40%+ of agentic AI projects will be cancelled by end of 2027 due to cost, unclear value, and risk-control failures
- "Agent sprawl" is a real risk — organisations deploying agents without governance create fragile, overlapping systems nobody can audit
- The biggest cost is not the API bill — it is lost trust when an agent makes a visible, embarrassing, or costly mistake
- This course teaches you to avoid these traps with structured evaluation, approval gates, and clear success metrics
Module 01 · Framework
Use-Case Scoring Framework
Score each candidate workflow on six dimensions (1–5 scale). High frequency + low consequence + clear ROI = best starting point.
Rule of thumb: If total score is below 18, start with simple automation instead of an agent.
- Frequency of task — how often does this run? Daily beats monthly.
- Consequence of error — what happens when it is wrong? Reversible beats catastrophic.
- Permission sensitivity — what access does it need? Read-only beats admin.
- Data sensitivity — PII, financial, health? Lower is simpler.
- Expected ROI — hours saved, revenue gained, errors prevented.
- Human judgment required — how much nuance does the decision need?
Module 01 · Anti-Patterns
When NOT to Use an Agent
- The task is simple and rule-based — a Zapier automation or Make scenario is cheaper, faster, and more predictable
- The cost of error is catastrophic — financial transfers, medical decisions, legal filings need human oversight
- The workflow changes constantly — agents struggle when the rules shift weekly; use assisted tools instead
- There is no clear success metric — if you cannot measure whether it worked, define the metric first
- The data is too sensitive for any AI — some data should never leave your perimeter, full stop
Module 01 · Activity
Build Your Opportunity Map
- Choose one real workflow from your current work — something you do at least weekly
- Map the workflow: trigger, inputs, outputs, decision points, approvals
- Score it using the six-dimension framework (frequency, consequence, permissions, data, ROI, judgment)
- Write one paragraph on the business case — or why it is NOT a good agent candidate
15
minutes — individual work, then pair share
Module 01 · Recap
Module 1 Key Takeaways
- An agent perceives, plans, acts, and self-corrects — it is fundamentally different from a chatbot or an automation
- The agent spectrum ranges from simple prompts to multi-agent systems — match autonomy to the task
- 69% of AU orgs are using agentic AI, but governance lags — be intentional, not reactive
- Use the scoring framework to evaluate use cases before building anything
- Knowing when NOT to use an agent is the single most valuable lesson in this module
Module 02
Models, Tools
& Protocols
GPT-5.5, Claude Opus 4.7, Gemini 3.1, Zapier, Make, Copilot Studio — and how to choose.
02
Models, Tools & Protocols
Compare platforms, understand MCP and function calling, and apply model routing for cost discipline.
Module 02 · Objectives
What You'll
Learn
By the end of this module you will be able to:
- Compare major agent platforms on capabilities, pricing, and enterprise controls
- Understand the Responses API, Agents SDK, and managed agent patterns
- Explain MCP (Model Context Protocol) and function calling to non-technical stakeholders
- Apply model routing and cost discipline to cut spend 60–80% without quality loss
Module 02 · Landscape
The Platform Landscape
OpenAI
- Responses API & Agents SDK
- Built-in tools & tracing
- Remote MCP support
Anthropic
- Opus 4.7 & Managed Agents
- MCP originator
- Responsible Scaling Policy
Google
- Gemini 3.1 & ADK
- Enterprise Agent Platform
- IAM & VPC controls
Zapier
- No-code agents
- 8,000+ app integrations
- MCP & SOC 2 certified
Make
- Visual orchestration
- AI Agents beta
- 3,000+ apps, on-prem option
Microsoft
- Copilot Studio
- Foundry Agent Service
- Entra auth & M365 native
Module 02 · Platform Deep-Dive
OpenAI — Responses API & Agents SDK
- Built-in tools — web search, file search, code interpreter, and computer use available natively
- Remote MCP support — connect to any MCP-compliant tool server alongside built-in tools
- Agents SDK — open-source Python framework with guardrails, handoffs, and tracing built in
- Tracing & sandboxes — full observability of every tool call, decision, and output
- GPT-5.5 pricing: US$5 input / US$30 output per 1M tokens
Module 02 · Platform Deep-Dive
Anthropic — Opus 4.7 & Managed Agents
- Tool use, web search, code execution — plus the Model Context Protocol (MCP) they created
- Claude Managed Agents — public beta for orchestrated multi-step workflows with built-in safety
- Responsible Scaling Policy v3.0 — industry-leading safety framework with AI Safety Level evaluations
- Extended thinking — expose the model's reasoning chain for debugging and trust-building
- Opus 4.7 pricing: US$5 input / US$25 output per 1M tokens
Module 02 · Platform Deep-Dive
Google — Gemini 3.1 & ADK
- Enterprise Agent Platform — build, govern, and optimise agents within Google Cloud
- Open-source Agent Development Kit (ADK) — modular, composable, deployment-flexible
- Enterprise controls — IAM, audit logs, VPC Service Controls, and data residency compliance
- Agent evaluation & traffic splitting — A/B test agent versions before full rollout
- Workspace integration — native access to Gmail, Drive, Calendar, Docs within agent flows
Module 02 · No-Code Options
No-Code Agent Platforms
Zapier Agents
- 8,000+ app connectors
- MCP server connectivity
- SOC 2 Type II certified
- Natural language instructions
Make AI Agents
- Visual scenario builder
- Credits-based pricing
- On-premise deployment option
- 3,000+ integrations
Copilot Studio
- M365 & Teams native
- Entra ID authentication
- AU$299.30/25k message credits
- DLP & audit logging
Module 02 · Protocol
MCP — Model Context Protocol
- Standard protocol for connecting AI models to external tools, data sources, and services
- Created by Anthropic, now adopted by OpenAI (remote MCP support), Zapier, Make, and others
- Think of it as "USB-C for AI tools" — one protocol that works across models and platforms
- Server/client architecture — tools expose capabilities via MCP servers; models connect as clients
- Why it matters: Build a tool integration once, use it with any MCP-compatible model
Module 02 · Mechanics
Function Calling & Built-in Tools
- Function calling — the model reads a schema, decides which function to call, and generates structured arguments
- Built-in tools — pre-built capabilities like web search, code interpreter, and file search that require no custom code
- Custom tools — your own APIs and services, described via JSON Schema or MCP, executed by your infrastructure
- Tool selection is a design decision — more tools increase flexibility but also increase reasoning complexity and error surface
Module 02 · Cost Strategy
Model Routing & Cost Discipline
- Use stronger models for planning and review — Opus 4.7 or GPT-5.5 for complex reasoning, evaluation, and quality gates
- Use cheaper models for routing and classification — Sonnet or GPT-5.5 mini for intent detection and simple extraction
- All providers expose tiered pricing — match model capability to task complexity at each step in the pipeline
- Smart routing can cut costs 60–80% with no measurable quality loss on production benchmarks
Example: Route 80% of queries to a fast, cheap model. Escalate the 20% that need deep reasoning to a frontier model. Measure both paths.
Module 02 · Activity
Write Your Platform Selection Memo
- Pick your use case from Module 1's opportunity map exercise
- Compare 2–3 platforms that could serve it (one code, one no-code minimum)
- Evaluate each on: capabilities, pricing, enterprise controls, team skills, MCP support
- Write a one-page decision memo with your recommendation and rationale
15
minutes — individual work, then group debrief
Module 02 · Recap
Module 2 Key Takeaways
- Six major platforms compete for agent workloads — no single winner; the right choice depends on your constraints
- MCP is becoming the universal connector — learn it once, use it everywhere
- Function calling is the core mechanism — the model selects tools, your code executes them
- Model routing is the highest-ROI cost optimisation — use expensive models only where they add measurable value
- No-code platforms (Zapier, Make, Copilot Studio) are production-ready for many agent use cases
Module 03
Single-Agent
Design Patterns
System instructions, tool selection, approvals, recovery paths, and memory boundaries.
03
Single-Agent Design Patterns
The practical patterns that make individual agents reliable, safe, and useful in production.
Module 03 · Objectives
What You'll
Learn
By the end of this module you will be able to:
- Write effective system instructions that constrain behaviour without crippling capability
- Design tool boundaries and human approval gates at the right granularity
- Build retry and recovery strategies that fail gracefully under real-world conditions
- Manage memory within agent scope — short-term, medium-term, and long-term
Module 03 · Pattern
System Instructions Design
- Define the agent's role and boundaries clearly — "You are a research assistant for the marketing team. You do NOT make purchasing decisions."
- Specify what it CAN and CANNOT do — explicit allowlists beat implicit denylists every time
- Include output format requirements — structured outputs reduce downstream parsing errors to near zero
- Set escalation triggers — define exactly when the agent should stop and ask a human
- Test instructions against edge cases — adversarial testing during design prevents production surprises
Module 03 · Pattern
Tool Selection & Boundaries
Choosing Tools
Every tool you give an agent is a capability AND a risk surface. Design with intention.
Principle: Start with the minimum viable toolset. Add tools only when you can measure the improvement.
- Start with minimum viable tools — resist the urge to connect everything
- Each tool adds complexity and failure surface area
- Prefer built-in tools over custom where possible
- Define explicit permission scopes per tool
- Test tool combinations for conflicts and unintended interactions
Module 03 · Pattern
Human Approval Gates
- Not every action needs approval — over-gating kills adoption faster than any bug
- High-risk actions ALWAYS need human sign-off — financial commits, external communications, data deletions
- Design three tiers: auto-approve (read-only, low-risk) / notify (reversible writes) / require approval (irreversible, external, financial)
- Log all decisions for audit — even auto-approved actions need a paper trail
- Make approval UX fast — if it takes 5 clicks to approve, people will bypass the system
Module 03 · Pattern
Retry Strategy & Recovery
- Define max retry attempts per tool — typically 2–3 for API calls, zero for destructive operations
- Set timeout thresholds — an agent waiting 60 seconds for a response is an agent wasting money
- Build graceful degradation paths — if the preferred tool fails, what is the fallback?
- Never retry destructive actions — sending an email twice is worse than not sending it at all
- Log failures for debugging — every error is training data for improving the system
- Always have a "give up gracefully" path — escalate to a human with full context, not a cryptic error
Module 03 · Pattern
Memory Boundaries
- Short-term memory — the current conversation context window; resets between sessions
- Medium-term memory — session state, user preferences, and task progress persisted across turns
- Long-term memory — persistent knowledge bases, vector stores, and interaction history
- Rule: Store only what demonstrably improves the quality of future decisions
- Privacy imperative: Never persist sensitive data unnecessarily — minimise what you store, encrypt what you must, delete what you no longer need
Module 03 · Reference Architecture
The Conservative Baseline Architecture
- One orchestrator model — a single capable model (Opus 4.7, GPT-5.5) that plans, delegates, and evaluates
- Bounded tool set — minimum viable tools, each with explicit permissions and rate limits
- Explicitly chosen knowledge sources — curated vector stores or search indices, not "search the whole internet"
- Optional specialist sub-agents — add only when a single model cannot handle domain complexity
- Visible evaluation and approval loops — every decision logged, high-risk actions gated, outputs scored
Start here. Add complexity only when measurements prove the simpler version is insufficient.
Module 03 · Anti-Patterns
Common Pitfalls
- Over-engineering — start simple, add complexity only when proven needed; multi-agent is rarely day one
- Ignoring failure modes — agents WILL fail; the question is whether they fail gracefully or catastrophically
- No evaluation framework — if you cannot measure whether it worked, you cannot improve it or justify its cost
- Unbounded autonomy — always set limits on actions, spend, and scope; trust is earned incrementally
- Skipping human review at launch — shadow mode first, then supervised, then semi-autonomous; never jump to full auto
Module 03 · Lab
Build Your First Agent
- Choose your platform: OpenAI Responses API, Anthropic tool use, or Zapier Agents
- Build a simple research agent that gathers evidence from multiple sources and synthesises a recommendation
- Write clear system instructions — define the role, boundaries, output format, and escalation triggers
- Give it 2–3 tools maximum — web search, a document reader, and optionally a note-taker
- Test it against at least 3 different queries — one easy, one ambiguous, one adversarial
20
minutes — hands-on lab, then demo two volunteers
Module 03 · Recap
Module 3 Key Takeaways
- System instructions are the most important design artifact — invest time getting them right
- Minimum viable tools — every additional tool increases both capability and risk surface
- Three-tier approval gates (auto / notify / require) balance safety with usability
- Plan for failure — retry strategies, graceful degradation, and human escalation paths are not optional
- The conservative baseline (one orchestrator, bounded tools, visible loops) is the right starting point for every agent
Module 04
Knowledge Systems
Long context vs RAG vs search — and how to test which one works.
What You'll Learn
Module 4 Objectives
Choose between long context, RAG, and search
Prepare and process knowledge corpora
Design chunking and retrieval strategies
Test grounding quality with evaluation sets
The Knowledge Decision
Three approaches to giving agents knowledge
Long Context
Feed everything into the prompt. Simple but limited by window size. Best for small corpora under ~50 pages.
RAG
Retrieve relevant chunks at query time. Scales to large corpora. Needs infrastructure — embedding, indexing, retrieval pipeline.
Web Search
Live data, no corpus needed. Less control over quality. Best when currency matters more than consistency.
When to Use What
A practical decision framework
<50 pages? → Long context
Stable large corpus? → RAG
Need current info? → Web search
Need citations? → RAG or search with source tracking
Mixed needs? → Combine approaches
Corpus Preparation
Your knowledge is only as good as your source material
Clean and normalise source documents
Remove duplicates and outdated content
Structure for consistent retrieval
Version your corpus — knowledge changes
Document your sources for audit compliance
Chunking Strategies
How you split your corpus determines retrieval quality
Chunk by semantic meaning, not arbitrary length
Overlap chunks for context continuity
Include metadata (source, date, section)
Test chunk sizes against your actual queries
Smaller chunks = more precision, larger = more context
Retrieval Testing
Measure before you ship
Build a test set of 20–30 real questions
Compare retrieval accuracy across strategies
Measure: relevance, completeness, citation fidelity
Track failure modes (wrong chunk, missing info, hallucination)
Iterate until quality meets your threshold
Grounding Quality
The #1 risk in RAG systems
Grounding = making sure answers come from your sources, not the model's imagination.
Does the answer cite the right source?
Does it invent facts not in the corpus?
Does it handle "I don't know" correctly?
Grounding failures are the #1 RAG risk
Lab: RAG & Policy-Compliance Agent
Hands-on with real regulatory data
Use OAIC privacy guidance and Australian Privacy Principles as public corpus
Ingest, chunk, and index the documents
Build a test set of privacy compliance questions
Compare: plain long-context vs RAG vs RAG + policy rules
Expected output: Q&A agent + evaluation set + findings memo
Activity: Build Your Knowledge Agent
20 minutes — hands-on
Ingest a corpus relevant to your work
Generate 20–30 test questions
Compare three grounding strategies
Write a recommendation: which approach wins for your use case?
Module 4 Recap
Knowledge Systems — Key Takeaways
Choose Wisely
Long context for small corpora, RAG for large stable sets, web search for live data. Combine when needed.
Prepare Rigorously
Clean, chunk, and version your corpus. Semantic chunking with metadata outperforms arbitrary splits.
Test Grounding
Build evaluation sets. Measure citation fidelity. Grounding failures are the biggest RAG risk.
Module 05
Automation & Business Actions
Zapier Agents, Make AI Agents, Copilot Studio — building approval-gated operational agents.
What You'll Learn
Module 5 Objectives
Build approval-gated business agents
Design trigger-action-approval flows
Compare no-code automation platforms
Implement escalation rules and safety nets
Zapier Agents
The fastest path for operators
8,000–9,000+ app integrations
MCP connectivity for agent-to-agent communication
SOC 2 compliant, SAML/SCIM, audit logs
Best for: sales, marketing, support, internal ops
Fastest path from idea to working automation
Make AI Agents
Visual orchestration for cross-app workflows
Visual orchestration (still beta)
MCP Server support, 3,000+ apps
Credit-based pricing (Core US$12/mo, Pro US$21, Teams US$38)
Best for: cross-app visual workflows, hybrid human+agent automations
On-prem option available for enterprise
Copilot Studio
Low-code agents for the Microsoft ecosystem
Low-code business agents for M365/Teams/Dynamics
Foundry Agent Service for code-first builds
Entra authentication, scoped autonomy, analytics
In Australia: AU$299.30 per 25,000 credits/month
Best for: Microsoft-heavy organisations
Triggers, Actions & Approvals
The anatomy of an operational agent
Trigger: what starts the agent (email, form, schedule, webhook)
Action: what the agent does (send, create, update, classify)
Approval: human checkpoint before high-risk actions
Escalation: what happens when the agent is uncertain
Logging: every decision should be auditable
Escalation Rules
Never let an agent silently fail
Define confidence thresholds (high / medium / low)
High confidence → auto-execute + log
Medium confidence → execute + notify human
Low confidence → pause + require approval
Unknown / error → escalate immediately
Approval-Gated Operations
Lab preview
Build an agent that triages inbound requests. Takes real action only when confidence, permission scope, and policy fit are acceptable.
Support triage — route by severity and type
Lead qualification — score and assign
Invoice exception routing — flag anomalies for review
Platform Comparison
Choose based on your existing stack
Zapier
Fastest setup. Broadest integrations (8,000+). Plan-based pricing. Best for teams that need speed.
Make
Most visual builder. Flexible routing logic. Credit-based pricing. Best for complex cross-app flows.
Copilot Studio
Deepest M365 integration. Enterprise controls. Credit-based pricing. Best for Microsoft-heavy orgs.
Activity: Build Your Approval-Gated Agent
20 minutes — hands-on
Use 50 synthetic requests + category rules
Build a classifier to route requests to buckets
Require approval for high-risk actions
Output: run log, routing dashboard, policy document
Module 5 Recap
Automation & Business Actions — Key Takeaways
Platform Choice
Zapier for speed, Make for visual complexity, Copilot Studio for Microsoft ecosystems. Match to your stack.
Approval Gates
High-risk actions need human checkpoints. Define confidence tiers and route accordingly.
Never Silent Failure
Every agent decision must be logged. Escalation rules are non-negotiable for production agents.
Module 06
Multi-Agent Orchestration
Planner/executor, specialist agents, routing, handoffs, and knowing when multi-agent is overkill.
What You'll Learn
Module 6 Objectives
Design multi-agent architectures
Implement routing and handoff patterns
Handle failure modes in multi-agent systems
Know when multi-agent adds value vs complexity
Planner / Executor Pattern
The most reliable multi-agent pattern for beginners
Separate the "thinking" from the "doing." A planner agent decomposes complex tasks into steps, while executor agents handle each step with focused tools and scope.
Planner breaks task into steps
Executors handle individual steps
Planner monitors and re-plans if needed
Clear separation of concerns
Most debuggable multi-agent pattern
Specialist Agents
Bounded tools, clear scope, independently testable
Research agent — gathers and synthesises evidence
Operations agent — takes business actions
QA / critic agent — reviews outputs for quality
Each specialist has bounded tools and clear scope
Specialists should be independently testable
Routing & Handoffs
Context transfer is where multi-agent systems break
Router decides which specialist handles each sub-task
Handoff protocols must include context transfer
Define what data passes between agents
Never pass raw user input between agents without sanitisation
Test handoff paths independently
Failure Modes
What goes wrong in multi-agent systems
Agent A fails → does Agent B know?
Circular routing — agents pass work back and forth forever
Context loss during handoffs
Cascading failures — one agent down = whole system down
Budget overruns from uncontrolled tool calling
Mitigate: circuit breakers, timeouts, fallback paths.
Connected Agents
The pattern is converging across platforms
Modern platforms now support agent-to-agent communication natively:
OpenAI Agents SDK handoffs
Anthropic tool loops with child agents
Google ADK connected agent patterns
Microsoft Foundry multi-agent routing
The Reference Architecture
End-to-end agentic system design
Input: User or business trigger
Orchestrator: Planner / router layer
Execution: Tool layer + Knowledge layer + Specialist agents
Governance: Human approval gate
Observability: Tracing & evaluation
Compliance: Logs & audit trail
Lab: Content Operations Pipeline
Multi-agent system in practice
Build a multi-agent content pipeline:
Research agent → Outline agent → Drafter
QA / critic → Final reviewer
Define which steps are deterministic vs agentic
Map where human sign-off occurs
Identify handoff data and failure points
Activity: Design Your Multi-Agent System
20 minutes — hands-on
Create an architecture diagram
Define agent roles and bounded tool sets
Map handoff protocols and context transfer
Identify failure modes and mitigation strategies
Build a prototype if time allows
Module 6 Recap
Multi-Agent Orchestration — Key Takeaways
Start Simple
Planner/executor is the most reliable pattern. Add specialists only when single-agent complexity becomes unmanageable.
Handoffs Matter
Context transfer is where multi-agent systems break. Define protocols, sanitise inputs, test paths independently.
Plan for Failure
Circuit breakers, timeouts, budget caps. Every agent needs a fallback. Never let the system silently fail.
Module 07
Evaluation & Observability
Test sets, traces, tool-call accuracy, dashboards, and regression checks.
What You'll Learn
Module 7 Objectives
Create evaluation datasets
Interpret traces and tool accuracy
Design trajectory vs final-response evaluations
Build dashboards and regression gates
Why Evaluation Matters
The stakes are real
40%+
of agentic AI projects will be cancelled by end of 2027 — Gartner
Without evaluation, you can't prove value, catch regressions, or justify costs. Evaluation is what separates pilots from production.
Test Sets & Evaluation Datasets
Your agent's "unit tests"
Build 20–50 test cases per agent
Include: input, expected output, acceptable variations
Cover: happy path, edge cases, adversarial inputs
Version your test sets — they evolve with the agent
Test sets are your agent's "unit tests"
Traces & Tool-Call Accuracy
See exactly what your agent did and why
A trace = complete record of an agent's reasoning
Tool-call accuracy = did it call the right tool with right args?
Trace inspection reveals: wrong tool selection, missing context, unnecessary loops
OpenAI tracing, Anthropic managed agent logs, Google agent evaluation all provide trace data
Every production agent needs tracing enabled from day one
Trajectory vs Final-Response Evaluation
Two lenses on agent quality
Trajectory Eval
Did the agent take the right steps?
Were tool calls appropriate?
Was reasoning sound?
Final-Response Eval
Is the final answer correct?
Does it meet quality criteria?
Would a human approve this output?
Dashboards & Regression Checks
Your agent's health monitor
Track: success rate, latency, cost per run, error rate
Set up regression alerts: "success rate dropped below 90%"
Compare performance across agent versions
Dashboard is your agent's "health monitor"
Review weekly, not just at launch
Debugging Agent Failures
A systematic approach
Read the trace end-to-end
Identify where the agent went wrong (planning? tool call? output?)
Check: was the test case fair? Was the instruction clear?
Common fix: better system instructions, not more tools
Document every failure pattern for future test cases
Evaluation Pack Design
What to ship alongside every agent
Your evaluation pack should include:
Test dataset (20–50 cases)
Pass/fail thresholds per metric
Regression baseline from last version
Trace samples (good and bad)
One-page summary of what changed and why
Activity: Build Your Evaluation Pack
20 minutes — hands-on
Create a test set for your agent (20–50 cases)
Run evaluations against your test set
Set pass/fail thresholds for each metric
Document results and identify improvement areas
Module 7 Recap
Evaluation & Observability — Key Takeaways
Test Everything
20–50 test cases per agent. Cover happy paths, edge cases, and adversarial inputs. Version your test sets.
Trace Everything
Enable tracing from day one. Inspect tool-call accuracy. Better instructions beat more tools.
Monitor Always
Dashboards for success rate, latency, cost. Regression alerts. Review weekly, not just at launch.
Module 08
Security, Governance & Australian Compliance
Privacy, permissions, cross-border, auditability, prompt injection, and change control.
08
What You'll Learn
- Apply governance controls for agent deployment
- Navigate Australian privacy obligations
- Design permission and audit frameworks
- Defend against prompt injection attacks
22%
of Australian organisations report advanced agent governance
Deloitte 2026
Despite 69% using agentic AI. The gap between adoption and governance is the #1 enterprise risk.
Privacy & Permissions
- Privacy Act applies to ALL uses of AI involving personal information (OAIC)
- Both inputs AND outputs can trigger privacy obligations
- Least-privilege principle: agents should only access what they need
- Document every data source and its sensitivity level
- Review permissions quarterly
OAIC Guidance & Privacy Act
- OAIC is explicit: Privacy Act covers AI systems processing personal information
- APP 8 (cross-border disclosure) — critical when using offshore model APIs
- APP 11 (security) — relevant for SaaS automation platforms
- Government teams: DTA responsible AI policy v2.0 now active
- Mandatory capability-building in the APS
Cross-Border Considerations
- Using OpenAI, Anthropic, or Google APIs = data crosses borders
- APP 8 requires you to ensure overseas recipients handle data per APPs
- SaaS platforms (Zapier, Make) may process data in multiple jurisdictions
- Document your data flows
- Consider: regional processing options (OpenAI), data residency (Google)
Prompt Injection & Safety
- Prompt injection = malicious input that hijacks agent behaviour
- Defence: input sanitisation, output validation, guardrails
- Never let user input become system instructions
- Use content guardrails for sensitive outputs
- Define escalation paths for suspicious behaviour
- Test with adversarial inputs
Change Control & Auditability
- Version every agent configuration change
- Maintain audit trail of all agent decisions
- Log: who deployed, what changed, when, why
- Define rollback procedures before deployment
- Auditability is not optional — it's a governance requirement
The 10-Point Release Gate Checklist
- 1. Value definition — workflow, success metric, cost target, owner documented
- 2. Human boundaries — which decisions agent can make alone vs require approval
- 3. Data map — sensitive data, PII, retention rules, source systems documented
- 4. Permission scope — tool access is least-privilege, action scopes allowlisted
- 5. Knowledge integrity — grounding sources current, versioned, permission-aware
- 6. Safety & misuse — prompt injection handling, content guardrails, escalation defined
- 7. Evaluation — test set exists, pass threshold set, regression check in place
- 8. Observability — traces, logs, analytics, error alerts available
- 9. Lifecycle control — versioning, rollback, retirement, ownership defined
- 10. Australian compliance — privacy review done, cross-border & security checked
Activity
Complete Your Deployment Readiness Checklist
- Work through all 10 release gates for your agent
- Document what passes and what needs work
- Identify your biggest governance gap
15 minutes
Module 08 Recap
Security, Governance & Australian Compliance
- Privacy Act applies to all AI processing personal information
- Cross-border data flows require APP 8 compliance
- Prompt injection is a real threat — defend in layers
- Change control and auditability are governance requirements
- The 10-point release gate checklist is your deployment standard
Module 09
Deployment & Lifecycle Management
Versioning, traffic splitting, staged rollout, rollback, incident response, and cost routing.
09
What You'll Learn
- Plan staged agent deployments
- Implement versioning and rollback
- Design incident response procedures
- Optimise cost through smart routing
Versioning & Revisions
- Every agent deployment is a versioned release
- Track: system instructions, tool config, knowledge sources, model version
- Never edit production agents directly
- Use revision history for audit and rollback
- Google ADK and Microsoft Foundry both support revisioning
Traffic Splitting & Staged Rollout
- Don't go from 0% to 100% in one step
- Start with 5–10% of traffic to the new version
- Monitor: error rate, latency, user satisfaction
- If metrics hold, gradually increase to 25% → 50% → 100%
- Google's agent platform supports traffic splitting natively
Rollback Planning
- Define rollback criteria BEFORE deployment
- "If error rate exceeds X%, roll back immediately"
- Keep previous version ready to re-activate
- Test rollback procedure in staging first
- Document who has authority to trigger rollback
- Rollback should take minutes, not hours
Incident Response
When an agent fails in production:
- 1. Detect — monitoring alerts
- 2. Contain — pause or throttle agent
- 3. Diagnose — read traces, identify root cause
- 4. Fix — update instructions/tools/knowledge
- 5. Verify — run evaluation pack
- 6. Deploy fix — staged rollout
- 7. Post-mortem — document and improve
Cost Routing & Optimisation
- Use premium models for planning and review
- Use cheaper models for routing and extraction
- Batch non-urgent tasks for lower pricing tiers
- Cache repeated queries when appropriate
- Set cost alerts and per-run budgets
- Track cost-per-successful-outcome not just cost-per-call
Production Deployment Plan
Your deployment plan should include:
- Version identifier and change summary
- Rollout schedule (% traffic by day)
- Success metrics and rollback triggers
- Monitoring dashboard setup
- On-call responsibility during rollout
- Post-deployment review date
Activity
Write Your Production Deployment Plan
- Define versioning approach for your agent
- Map rollout stages with traffic percentages
- Set rollback criteria and monitoring plan
- Assign on-call schedule
15 minutes
Module 09 Recap
Deployment & Lifecycle Management
- Version every deployment — never edit production directly
- Stage rollouts with traffic splitting
- Define rollback criteria before you deploy
- Follow the 7-step incident response process
- Route costs smartly — premium for planning, cheap for routing
Module 10
Capstone Studio
Build, test, present, and certify your production-ready agent.
10
What You'll Learn
- Build end-to-end production agent
- Create complete evidence and governance pack
- Present and defend your design decisions
- Meet certification criteria
Three Capstone Pathways
Internal Knowledge Agent
Knowledge-backed Q&A for your organisation
Operations Triage Agent
Request routing with approval gates
Content/Research Agent
Multi-step research and synthesis pipeline
Capstone Requirements
- Business case document
- Architecture diagram
- Data map and permission model
- Evaluation set with 20+ test cases
- Run results with pass/fail evidence
- Governance checklist (10-point)
- Rollback plan
- 5-minute demo
Build Sprint
Build Time
You have the rest of this session to build. Use everything you've learned. Ask for help. Start with the simplest version that works, then iterate.
Peer Review & Demo Prep
Pair up for peer review. Check:
- Does the agent actually work?
- Is the evaluation evidence convincing?
- Would you trust this in production?
- Are the governance controls genuine?
Prepare a 5-minute demo: problem → solution → evidence → limits.
Assessment Rubric
- Problem framing — 15%
- Architecture — 15%
- Tooling & action design — 15%
- Grounding & data handling — 10%
- Evaluation evidence — 15%
- Governance & security — 15%
- Reliability & operational readiness — 10%
- Communication — 5%
Pass: 75%+ overall, capstone 75%+, no red-flag safety failure.
Certification Criteria
- Overall score 75%+
- Capstone score 75%+
- No critical failure in privacy/permissions/approval design
- All labs submitted with no critical omissions
- Module quizzes average 80%+
Certification rewards useful work, not flashy demos.
Future-Proofing
- Schedule quarterly agent reviews
- Update platform matrix as vendors change
- Re-run evaluation packs after model updates
- Keep governance checklist current
- Build a team habit of agent maintenance
Module 10 Recap
Capstone Studio
- Three capstone pathways — knowledge, triage, or research
- Full evidence pack required: business case through rollback plan
- Peer review strengthens your design
- Assessment rewards rigour, not flash
- Future-proof by building maintenance habits
Your Immediate Next Steps
- Pick one workflow to automate this week
- Complete your platform selection memo
- Run through the 10-point release gate checklist
- Schedule your first quarterly agent review
- Share this framework with your team
Resources
- Course materials: www.rupertchesman.com
- AI Prompt Builder: www.rupertchesman.com/tools/prompt-builder
- Cheat sheets: www.rupertchesman.com/cheatsheets
- All resources: www.rupertchesman.com/resources
Recommended Next Courses
- Mastering AI Tools — deep dive into prompting and tool workflows
- AI for Corporate Teams — AI adoption strategy and governance
- AI Productivity Systems — personal AI workflows
- Vibe Coding — creating apps by describing what you want
- Visit www.rupertchesman.com for all courses
Certificate
AI Agents & Automation Certificate
Complete all modules + labs + capstone = AI Agents & Automation Certificate
Questions
What would you like to know more about?
Thank You
www.rupertchesman.com
© Rupert Chesman 2026