RC
rupertchesman.com
navigate   F facilitator notes
1 / 121
Specialist Course

AI Agents
& Automation

Design, evaluate, and deploy useful agents that survive first contact with real business constraints — without the hype.

30+
Hours
10
Modules
5
Labs
1
Capstone
Welcome

Who are you, and what
brought you here today?

Take 60 seconds. Share your name, your role, and the one thing you most want to learn about AI agents.

Round-table introductions help us calibrate examples and pair people for lab work. No wrong answers.

Learn to design, evaluate, and deploy useful agents that survive first contact with real business constraints.

The Course Promise

Overview

Today's
10 Modules

A full-day journey from foundations through deployment, ending with your own capstone agent.

  • 01  Agentic AI & Business Value
  • 02  Models, Tools & Protocols
  • 03  Single-Agent Design Patterns
  • 04  Knowledge Systems
  • 05  Automation & Business Actions
  • 06  Multi-Agent Orchestration
  • 07  Evaluation & Observability
  • 08  Security, Governance & AU Compliance
  • 09  Deployment & Lifecycle
  • 10  Capstone Studio
Audience

Who This Course Is For

  • Knowledge workers & operators — looking to automate repetitive workflows and reclaim strategic time
  • Consultants & agency owners — wanting to offer agentic AI services or build internal efficiencies
  • Product managers & ops leads — evaluating where agents fit in their product or process roadmap
  • Technical builders & developers — ready to move from prototypes to production-grade agents
  • Team leaders & enterprise champions — building the business case for responsible AI adoption
Time Commitment
~30 hrs

of guided learning across ten modules, plus 10–15 hours of independent project work building your capstone agent.

Module 01

Agentic AI &
Business Value

Separate signal from noise. Learn when agents create real value — and when they are expensive distractions.

01

Agentic AI & Business Value

Definitions, the agent spectrum, ROI frameworks, and knowing when NOT to build.

Module 01 · Objectives

What You'll
Learn

By the end of this module you will be able to:

  • Distinguish assistants from automations from agents — and explain why the difference matters
  • Choose when a full agent is justified versus simpler alternatives
  • Score use cases for frequency, consequence, and ROI using a structured framework
  • Identify when NOT to use an agent — the most valuable skill in agent design
Module 01 · Core Concept

What Is an AI Agent?

  • Perceives its environment — reads data, monitors inboxes, watches dashboards, ingests context
  • Plans and reasons about tasks — breaks goals into steps, weighs alternatives, re-plans when blocked
  • Takes actions using tools — calls APIs, searches the web, writes files, sends messages
  • Operates with some autonomy — makes decisions within boundaries without human approval at every step
  • Has feedback loops to self-correct — evaluates its own outputs, retries on failure, escalates when uncertain
Module 01 · Comparison

Agent vs Chatbot vs Automation

💬

Chatbot

  • Responds to prompts
  • No tool access
  • Stateless between sessions
  • Human drives every step
⚙️

Automation

  • Rule-based triggers
  • Deterministic flow
  • If/then logic only
  • Breaks on exceptions
🤖

Agent

  • Reasons about goals
  • Uses tools dynamically
  • Handles ambiguity
  • Self-corrects on errors
Module 01 · Framework

The Agent Spectrum

Not everything needs to be a full agent. Match the level of autonomy to the task.

  • Level 0Prompt — a single instruction, no memory, no tools, no loops
  • Level 1Assistant — multi-turn conversation with context window but still human-driven
  • Level 2Tool-calling — the model selects and invokes functions but a human approves each call
  • Level 3Single Agent — autonomous planning, tool use, retries, and self-evaluation within guardrails
  • Level 4Multi-Agent System — multiple specialised agents coordinated by an orchestrator
Module 01 · Industry Data
69%

of Australian organisations are already using agentic AI, according to Deloitte (2026).

But only 22% report advanced agent governance in place.

Module 01 · Reality Check

ROI vs Hype

  • Gartner warns 40%+ of agentic AI projects will be cancelled by end of 2027 due to cost, unclear value, and risk-control failures
  • "Agent sprawl" is a real risk — organisations deploying agents without governance create fragile, overlapping systems nobody can audit
  • The biggest cost is not the API bill — it is lost trust when an agent makes a visible, embarrassing, or costly mistake
  • This course teaches you to avoid these traps with structured evaluation, approval gates, and clear success metrics
Module 01 · Framework

Use-Case Scoring Framework

Score each candidate workflow on six dimensions (1–5 scale). High frequency + low consequence + clear ROI = best starting point.

Rule of thumb: If total score is below 18, start with simple automation instead of an agent.

  • Frequency of task — how often does this run? Daily beats monthly.
  • Consequence of error — what happens when it is wrong? Reversible beats catastrophic.
  • Permission sensitivity — what access does it need? Read-only beats admin.
  • Data sensitivity — PII, financial, health? Lower is simpler.
  • Expected ROI — hours saved, revenue gained, errors prevented.
  • Human judgment required — how much nuance does the decision need?
Module 01 · Anti-Patterns

When NOT to Use an Agent

  • The task is simple and rule-based — a Zapier automation or Make scenario is cheaper, faster, and more predictable
  • The cost of error is catastrophic — financial transfers, medical decisions, legal filings need human oversight
  • The workflow changes constantly — agents struggle when the rules shift weekly; use assisted tools instead
  • There is no clear success metric — if you cannot measure whether it worked, define the metric first
  • The data is too sensitive for any AI — some data should never leave your perimeter, full stop
Module 01 · Activity

Build Your Opportunity Map

  • Choose one real workflow from your current work — something you do at least weekly
  • Map the workflow: trigger, inputs, outputs, decision points, approvals
  • Score it using the six-dimension framework (frequency, consequence, permissions, data, ROI, judgment)
  • Write one paragraph on the business case — or why it is NOT a good agent candidate
15
minutes — individual work, then pair share
Module 01 · Recap

Module 1 Key Takeaways

  • An agent perceives, plans, acts, and self-corrects — it is fundamentally different from a chatbot or an automation
  • The agent spectrum ranges from simple prompts to multi-agent systems — match autonomy to the task
  • 69% of AU orgs are using agentic AI, but governance lags — be intentional, not reactive
  • Use the scoring framework to evaluate use cases before building anything
  • Knowing when NOT to use an agent is the single most valuable lesson in this module
Module 02

Models, Tools
& Protocols

GPT-5.5, Claude Opus 4.7, Gemini 3.1, Zapier, Make, Copilot Studio — and how to choose.

02

Models, Tools & Protocols

Compare platforms, understand MCP and function calling, and apply model routing for cost discipline.

Module 02 · Objectives

What You'll
Learn

By the end of this module you will be able to:

  • Compare major agent platforms on capabilities, pricing, and enterprise controls
  • Understand the Responses API, Agents SDK, and managed agent patterns
  • Explain MCP (Model Context Protocol) and function calling to non-technical stakeholders
  • Apply model routing and cost discipline to cut spend 60–80% without quality loss
Module 02 · Landscape

The Platform Landscape

OpenAI

  • Responses API & Agents SDK
  • Built-in tools & tracing
  • Remote MCP support

Anthropic

  • Opus 4.7 & Managed Agents
  • MCP originator
  • Responsible Scaling Policy

Google

  • Gemini 3.1 & ADK
  • Enterprise Agent Platform
  • IAM & VPC controls

Zapier

  • No-code agents
  • 8,000+ app integrations
  • MCP & SOC 2 certified

Make

  • Visual orchestration
  • AI Agents beta
  • 3,000+ apps, on-prem option

Microsoft

  • Copilot Studio
  • Foundry Agent Service
  • Entra auth & M365 native
Module 02 · Platform Deep-Dive

OpenAI — Responses API & Agents SDK

  • Built-in tools — web search, file search, code interpreter, and computer use available natively
  • Remote MCP support — connect to any MCP-compliant tool server alongside built-in tools
  • Agents SDK — open-source Python framework with guardrails, handoffs, and tracing built in
  • Tracing & sandboxes — full observability of every tool call, decision, and output
  • GPT-5.5 pricing: US$5 input / US$30 output per 1M tokens
Module 02 · Platform Deep-Dive

Anthropic — Opus 4.7 & Managed Agents

  • Tool use, web search, code execution — plus the Model Context Protocol (MCP) they created
  • Claude Managed Agents — public beta for orchestrated multi-step workflows with built-in safety
  • Responsible Scaling Policy v3.0 — industry-leading safety framework with AI Safety Level evaluations
  • Extended thinking — expose the model's reasoning chain for debugging and trust-building
  • Opus 4.7 pricing: US$5 input / US$25 output per 1M tokens
Module 02 · Platform Deep-Dive

Google — Gemini 3.1 & ADK

  • Enterprise Agent Platform — build, govern, and optimise agents within Google Cloud
  • Open-source Agent Development Kit (ADK) — modular, composable, deployment-flexible
  • Enterprise controls — IAM, audit logs, VPC Service Controls, and data residency compliance
  • Agent evaluation & traffic splitting — A/B test agent versions before full rollout
  • Workspace integration — native access to Gmail, Drive, Calendar, Docs within agent flows
Module 02 · No-Code Options

No-Code Agent Platforms

Zapier Agents

  • 8,000+ app connectors
  • MCP server connectivity
  • SOC 2 Type II certified
  • Natural language instructions

Make AI Agents

  • Visual scenario builder
  • Credits-based pricing
  • On-premise deployment option
  • 3,000+ integrations

Copilot Studio

  • M365 & Teams native
  • Entra ID authentication
  • AU$299.30/25k message credits
  • DLP & audit logging
Module 02 · Protocol

MCP — Model Context Protocol

  • Standard protocol for connecting AI models to external tools, data sources, and services
  • Created by Anthropic, now adopted by OpenAI (remote MCP support), Zapier, Make, and others
  • Think of it as "USB-C for AI tools" — one protocol that works across models and platforms
  • Server/client architecture — tools expose capabilities via MCP servers; models connect as clients
  • Why it matters: Build a tool integration once, use it with any MCP-compatible model
Module 02 · Mechanics

Function Calling & Built-in Tools

  • Function calling — the model reads a schema, decides which function to call, and generates structured arguments
  • Built-in tools — pre-built capabilities like web search, code interpreter, and file search that require no custom code
  • Custom tools — your own APIs and services, described via JSON Schema or MCP, executed by your infrastructure
  • Tool selection is a design decision — more tools increase flexibility but also increase reasoning complexity and error surface
Module 02 · Cost Strategy

Model Routing & Cost Discipline

  • Use stronger models for planning and review — Opus 4.7 or GPT-5.5 for complex reasoning, evaluation, and quality gates
  • Use cheaper models for routing and classification — Sonnet or GPT-5.5 mini for intent detection and simple extraction
  • All providers expose tiered pricing — match model capability to task complexity at each step in the pipeline
  • Smart routing can cut costs 60–80% with no measurable quality loss on production benchmarks

Example: Route 80% of queries to a fast, cheap model. Escalate the 20% that need deep reasoning to a frontier model. Measure both paths.

Module 02 · Activity

Write Your Platform Selection Memo

  • Pick your use case from Module 1's opportunity map exercise
  • Compare 2–3 platforms that could serve it (one code, one no-code minimum)
  • Evaluate each on: capabilities, pricing, enterprise controls, team skills, MCP support
  • Write a one-page decision memo with your recommendation and rationale
15
minutes — individual work, then group debrief
Module 02 · Recap

Module 2 Key Takeaways

  • Six major platforms compete for agent workloads — no single winner; the right choice depends on your constraints
  • MCP is becoming the universal connector — learn it once, use it everywhere
  • Function calling is the core mechanism — the model selects tools, your code executes them
  • Model routing is the highest-ROI cost optimisation — use expensive models only where they add measurable value
  • No-code platforms (Zapier, Make, Copilot Studio) are production-ready for many agent use cases
Module 03

Single-Agent
Design Patterns

System instructions, tool selection, approvals, recovery paths, and memory boundaries.

03

Single-Agent Design Patterns

The practical patterns that make individual agents reliable, safe, and useful in production.

Module 03 · Objectives

What You'll
Learn

By the end of this module you will be able to:

  • Write effective system instructions that constrain behaviour without crippling capability
  • Design tool boundaries and human approval gates at the right granularity
  • Build retry and recovery strategies that fail gracefully under real-world conditions
  • Manage memory within agent scope — short-term, medium-term, and long-term
Module 03 · Pattern

System Instructions Design

  • Define the agent's role and boundaries clearly — "You are a research assistant for the marketing team. You do NOT make purchasing decisions."
  • Specify what it CAN and CANNOT do — explicit allowlists beat implicit denylists every time
  • Include output format requirements — structured outputs reduce downstream parsing errors to near zero
  • Set escalation triggers — define exactly when the agent should stop and ask a human
  • Test instructions against edge cases — adversarial testing during design prevents production surprises
Module 03 · Pattern

Tool Selection & Boundaries

Choosing Tools

Every tool you give an agent is a capability AND a risk surface. Design with intention.

Principle: Start with the minimum viable toolset. Add tools only when you can measure the improvement.

  • Start with minimum viable tools — resist the urge to connect everything
  • Each tool adds complexity and failure surface area
  • Prefer built-in tools over custom where possible
  • Define explicit permission scopes per tool
  • Test tool combinations for conflicts and unintended interactions
Module 03 · Pattern

Human Approval Gates

  • Not every action needs approval — over-gating kills adoption faster than any bug
  • High-risk actions ALWAYS need human sign-off — financial commits, external communications, data deletions
  • Design three tiers: auto-approve (read-only, low-risk) / notify (reversible writes) / require approval (irreversible, external, financial)
  • Log all decisions for audit — even auto-approved actions need a paper trail
  • Make approval UX fast — if it takes 5 clicks to approve, people will bypass the system
Module 03 · Pattern

Retry Strategy & Recovery

  • Define max retry attempts per tool — typically 2–3 for API calls, zero for destructive operations
  • Set timeout thresholds — an agent waiting 60 seconds for a response is an agent wasting money
  • Build graceful degradation paths — if the preferred tool fails, what is the fallback?
  • Never retry destructive actions — sending an email twice is worse than not sending it at all
  • Log failures for debugging — every error is training data for improving the system
  • Always have a "give up gracefully" path — escalate to a human with full context, not a cryptic error
Module 03 · Pattern

Memory Boundaries

  • Short-term memory — the current conversation context window; resets between sessions
  • Medium-term memory — session state, user preferences, and task progress persisted across turns
  • Long-term memory — persistent knowledge bases, vector stores, and interaction history
  • Rule: Store only what demonstrably improves the quality of future decisions
  • Privacy imperative: Never persist sensitive data unnecessarily — minimise what you store, encrypt what you must, delete what you no longer need
Module 03 · Reference Architecture

The Conservative Baseline Architecture

  • One orchestrator model — a single capable model (Opus 4.7, GPT-5.5) that plans, delegates, and evaluates
  • Bounded tool set — minimum viable tools, each with explicit permissions and rate limits
  • Explicitly chosen knowledge sources — curated vector stores or search indices, not "search the whole internet"
  • Optional specialist sub-agents — add only when a single model cannot handle domain complexity
  • Visible evaluation and approval loops — every decision logged, high-risk actions gated, outputs scored

Start here. Add complexity only when measurements prove the simpler version is insufficient.

Module 03 · Anti-Patterns

Common Pitfalls

  • Over-engineering — start simple, add complexity only when proven needed; multi-agent is rarely day one
  • Ignoring failure modes — agents WILL fail; the question is whether they fail gracefully or catastrophically
  • No evaluation framework — if you cannot measure whether it worked, you cannot improve it or justify its cost
  • Unbounded autonomy — always set limits on actions, spend, and scope; trust is earned incrementally
  • Skipping human review at launch — shadow mode first, then supervised, then semi-autonomous; never jump to full auto
Module 03 · Lab

Build Your First Agent

  • Choose your platform: OpenAI Responses API, Anthropic tool use, or Zapier Agents
  • Build a simple research agent that gathers evidence from multiple sources and synthesises a recommendation
  • Write clear system instructions — define the role, boundaries, output format, and escalation triggers
  • Give it 2–3 tools maximum — web search, a document reader, and optionally a note-taker
  • Test it against at least 3 different queries — one easy, one ambiguous, one adversarial
20
minutes — hands-on lab, then demo two volunteers
Module 03 · Recap

Module 3 Key Takeaways

  • System instructions are the most important design artifact — invest time getting them right
  • Minimum viable tools — every additional tool increases both capability and risk surface
  • Three-tier approval gates (auto / notify / require) balance safety with usability
  • Plan for failure — retry strategies, graceful degradation, and human escalation paths are not optional
  • The conservative baseline (one orchestrator, bounded tools, visible loops) is the right starting point for every agent
Module 04

Knowledge Systems

Long context vs RAG vs search — and how to test which one works.

04

What You'll Learn

Module 4 Objectives

Choose between long context, RAG, and search

Prepare and process knowledge corpora

Design chunking and retrieval strategies

Test grounding quality with evaluation sets

The Knowledge Decision

Three approaches to giving agents knowledge

Long Context

Feed everything into the prompt. Simple but limited by window size. Best for small corpora under ~50 pages.

RAG

Retrieve relevant chunks at query time. Scales to large corpora. Needs infrastructure — embedding, indexing, retrieval pipeline.

Web Search

Live data, no corpus needed. Less control over quality. Best when currency matters more than consistency.

When to Use What

A practical decision framework

<50 pages? → Long context

Stable large corpus? → RAG

Need current info? → Web search

Need citations? → RAG or search with source tracking

Mixed needs? → Combine approaches

Corpus Preparation

Your knowledge is only as good as your source material

Clean and normalise source documents

Remove duplicates and outdated content

Structure for consistent retrieval

Version your corpus — knowledge changes

Document your sources for audit compliance

Chunking Strategies

How you split your corpus determines retrieval quality

Chunk by semantic meaning, not arbitrary length

Overlap chunks for context continuity

Include metadata (source, date, section)

Test chunk sizes against your actual queries

Smaller chunks = more precision, larger = more context

Retrieval Testing

Measure before you ship

Build a test set of 20–30 real questions

Compare retrieval accuracy across strategies

Measure: relevance, completeness, citation fidelity

Track failure modes (wrong chunk, missing info, hallucination)

Iterate until quality meets your threshold

Grounding Quality

The #1 risk in RAG systems

Grounding = making sure answers come from your sources, not the model's imagination.

Does the answer cite the right source?

Does it invent facts not in the corpus?

Does it handle "I don't know" correctly?

Grounding failures are the #1 RAG risk

Lab: RAG & Policy-Compliance Agent

Hands-on with real regulatory data

Use OAIC privacy guidance and Australian Privacy Principles as public corpus

Ingest, chunk, and index the documents

Build a test set of privacy compliance questions

Compare: plain long-context vs RAG vs RAG + policy rules

Expected output: Q&A agent + evaluation set + findings memo

Activity: Build Your Knowledge Agent

20 minutes — hands-on

Ingest a corpus relevant to your work

Generate 20–30 test questions

Compare three grounding strategies

Write a recommendation: which approach wins for your use case?

Module 4 Recap

Knowledge Systems — Key Takeaways

Choose Wisely

Long context for small corpora, RAG for large stable sets, web search for live data. Combine when needed.

Prepare Rigorously

Clean, chunk, and version your corpus. Semantic chunking with metadata outperforms arbitrary splits.

Test Grounding

Build evaluation sets. Measure citation fidelity. Grounding failures are the biggest RAG risk.

Module 05

Automation & Business Actions

Zapier Agents, Make AI Agents, Copilot Studio — building approval-gated operational agents.

05

What You'll Learn

Module 5 Objectives

Build approval-gated business agents

Design trigger-action-approval flows

Compare no-code automation platforms

Implement escalation rules and safety nets

Zapier Agents

The fastest path for operators

8,000–9,000+ app integrations

MCP connectivity for agent-to-agent communication

SOC 2 compliant, SAML/SCIM, audit logs

Best for: sales, marketing, support, internal ops

Fastest path from idea to working automation

Make AI Agents

Visual orchestration for cross-app workflows

Visual orchestration (still beta)

MCP Server support, 3,000+ apps

Credit-based pricing (Core US$12/mo, Pro US$21, Teams US$38)

Best for: cross-app visual workflows, hybrid human+agent automations

On-prem option available for enterprise

Copilot Studio

Low-code agents for the Microsoft ecosystem

Low-code business agents for M365/Teams/Dynamics

Foundry Agent Service for code-first builds

Entra authentication, scoped autonomy, analytics

In Australia: AU$299.30 per 25,000 credits/month

Best for: Microsoft-heavy organisations

Triggers, Actions & Approvals

The anatomy of an operational agent

Trigger: what starts the agent (email, form, schedule, webhook)

Action: what the agent does (send, create, update, classify)

Approval: human checkpoint before high-risk actions

Escalation: what happens when the agent is uncertain

Logging: every decision should be auditable

Escalation Rules

Never let an agent silently fail

Define confidence thresholds (high / medium / low)

High confidence → auto-execute + log

Medium confidence → execute + notify human

Low confidence → pause + require approval

Unknown / error → escalate immediately

Approval-Gated Operations

Lab preview

Build an agent that triages inbound requests. Takes real action only when confidence, permission scope, and policy fit are acceptable.

Support triage — route by severity and type

Lead qualification — score and assign

Invoice exception routing — flag anomalies for review

Platform Comparison

Choose based on your existing stack

Zapier

Fastest setup. Broadest integrations (8,000+). Plan-based pricing. Best for teams that need speed.

Make

Most visual builder. Flexible routing logic. Credit-based pricing. Best for complex cross-app flows.

Copilot Studio

Deepest M365 integration. Enterprise controls. Credit-based pricing. Best for Microsoft-heavy orgs.

Activity: Build Your Approval-Gated Agent

20 minutes — hands-on

Use 50 synthetic requests + category rules

Build a classifier to route requests to buckets

Require approval for high-risk actions

Output: run log, routing dashboard, policy document

Module 5 Recap

Automation & Business Actions — Key Takeaways

Platform Choice

Zapier for speed, Make for visual complexity, Copilot Studio for Microsoft ecosystems. Match to your stack.

Approval Gates

High-risk actions need human checkpoints. Define confidence tiers and route accordingly.

Never Silent Failure

Every agent decision must be logged. Escalation rules are non-negotiable for production agents.

Module 06

Multi-Agent Orchestration

Planner/executor, specialist agents, routing, handoffs, and knowing when multi-agent is overkill.

06

What You'll Learn

Module 6 Objectives

Design multi-agent architectures

Implement routing and handoff patterns

Handle failure modes in multi-agent systems

Know when multi-agent adds value vs complexity

Planner / Executor Pattern

The most reliable multi-agent pattern for beginners

Separate the "thinking" from the "doing." A planner agent decomposes complex tasks into steps, while executor agents handle each step with focused tools and scope.

Planner breaks task into steps

Executors handle individual steps

Planner monitors and re-plans if needed

Clear separation of concerns

Most debuggable multi-agent pattern

Specialist Agents

Bounded tools, clear scope, independently testable

Research agent — gathers and synthesises evidence

Operations agent — takes business actions

QA / critic agent — reviews outputs for quality

Each specialist has bounded tools and clear scope

Specialists should be independently testable

Routing & Handoffs

Context transfer is where multi-agent systems break

Router decides which specialist handles each sub-task

Handoff protocols must include context transfer

Define what data passes between agents

Never pass raw user input between agents without sanitisation

Test handoff paths independently

Failure Modes

What goes wrong in multi-agent systems

Agent A fails → does Agent B know?

Circular routing — agents pass work back and forth forever

Context loss during handoffs

Cascading failures — one agent down = whole system down

Budget overruns from uncontrolled tool calling

Mitigate: circuit breakers, timeouts, fallback paths.

Connected Agents

The pattern is converging across platforms

Modern platforms now support agent-to-agent communication natively:

OpenAI Agents SDK handoffs

Anthropic tool loops with child agents

Google ADK connected agent patterns

Microsoft Foundry multi-agent routing

The Reference Architecture

End-to-end agentic system design

Input: User or business trigger

Orchestrator: Planner / router layer

Execution: Tool layer + Knowledge layer + Specialist agents

Governance: Human approval gate

Observability: Tracing & evaluation

Compliance: Logs & audit trail

Lab: Content Operations Pipeline

Multi-agent system in practice

Build a multi-agent content pipeline:

Research agent → Outline agent → Drafter

QA / critic → Final reviewer

Define which steps are deterministic vs agentic

Map where human sign-off occurs

Identify handoff data and failure points

Activity: Design Your Multi-Agent System

20 minutes — hands-on

Create an architecture diagram

Define agent roles and bounded tool sets

Map handoff protocols and context transfer

Identify failure modes and mitigation strategies

Build a prototype if time allows

Module 6 Recap

Multi-Agent Orchestration — Key Takeaways

Start Simple

Planner/executor is the most reliable pattern. Add specialists only when single-agent complexity becomes unmanageable.

Handoffs Matter

Context transfer is where multi-agent systems break. Define protocols, sanitise inputs, test paths independently.

Plan for Failure

Circuit breakers, timeouts, budget caps. Every agent needs a fallback. Never let the system silently fail.

Module 07

Evaluation & Observability

Test sets, traces, tool-call accuracy, dashboards, and regression checks.

07

What You'll Learn

Module 7 Objectives

Create evaluation datasets

Interpret traces and tool accuracy

Design trajectory vs final-response evaluations

Build dashboards and regression gates

Why Evaluation Matters

The stakes are real

40%+ of agentic AI projects will be cancelled by end of 2027 — Gartner

Without evaluation, you can't prove value, catch regressions, or justify costs. Evaluation is what separates pilots from production.

Test Sets & Evaluation Datasets

Your agent's "unit tests"

Build 20–50 test cases per agent

Include: input, expected output, acceptable variations

Cover: happy path, edge cases, adversarial inputs

Version your test sets — they evolve with the agent

Test sets are your agent's "unit tests"

Traces & Tool-Call Accuracy

See exactly what your agent did and why

A trace = complete record of an agent's reasoning

Tool-call accuracy = did it call the right tool with right args?

Trace inspection reveals: wrong tool selection, missing context, unnecessary loops

OpenAI tracing, Anthropic managed agent logs, Google agent evaluation all provide trace data

Every production agent needs tracing enabled from day one

Trajectory vs Final-Response Evaluation

Two lenses on agent quality

Trajectory Eval

Did the agent take the right steps?

Were tool calls appropriate?

Was reasoning sound?

Final-Response Eval

Is the final answer correct?

Does it meet quality criteria?

Would a human approve this output?

Dashboards & Regression Checks

Your agent's health monitor

Track: success rate, latency, cost per run, error rate

Set up regression alerts: "success rate dropped below 90%"

Compare performance across agent versions

Dashboard is your agent's "health monitor"

Review weekly, not just at launch

Debugging Agent Failures

A systematic approach

Read the trace end-to-end

Identify where the agent went wrong (planning? tool call? output?)

Check: was the test case fair? Was the instruction clear?

Common fix: better system instructions, not more tools

Document every failure pattern for future test cases

Evaluation Pack Design

What to ship alongside every agent

Your evaluation pack should include:

Test dataset (20–50 cases)

Pass/fail thresholds per metric

Regression baseline from last version

Trace samples (good and bad)

One-page summary of what changed and why

Activity: Build Your Evaluation Pack

20 minutes — hands-on

Create a test set for your agent (20–50 cases)

Run evaluations against your test set

Set pass/fail thresholds for each metric

Document results and identify improvement areas

Module 7 Recap

Evaluation & Observability — Key Takeaways

Test Everything

20–50 test cases per agent. Cover happy paths, edge cases, and adversarial inputs. Version your test sets.

Trace Everything

Enable tracing from day one. Inspect tool-call accuracy. Better instructions beat more tools.

Monitor Always

Dashboards for success rate, latency, cost. Regression alerts. Review weekly, not just at launch.

Module 08

Security, Governance & Australian Compliance

Privacy, permissions, cross-border, auditability, prompt injection, and change control.

08

What You'll Learn

  • Apply governance controls for agent deployment
  • Navigate Australian privacy obligations
  • Design permission and audit frameworks
  • Defend against prompt injection attacks

22%

of Australian organisations report advanced agent governance

Deloitte 2026

Despite 69% using agentic AI. The gap between adoption and governance is the #1 enterprise risk.

Privacy & Permissions

  • Privacy Act applies to ALL uses of AI involving personal information (OAIC)
  • Both inputs AND outputs can trigger privacy obligations
  • Least-privilege principle: agents should only access what they need
  • Document every data source and its sensitivity level
  • Review permissions quarterly

OAIC Guidance & Privacy Act

Cross-Border Considerations

  • Using OpenAI, Anthropic, or Google APIs = data crosses borders
  • APP 8 requires you to ensure overseas recipients handle data per APPs
  • SaaS platforms (Zapier, Make) may process data in multiple jurisdictions
  • Document your data flows
  • Consider: regional processing options (OpenAI), data residency (Google)

Prompt Injection & Safety

  • Prompt injection = malicious input that hijacks agent behaviour
  • Defence: input sanitisation, output validation, guardrails
  • Never let user input become system instructions
  • Use content guardrails for sensitive outputs
  • Define escalation paths for suspicious behaviour
  • Test with adversarial inputs

Change Control & Auditability

The 10-Point Release Gate Checklist

  1. 1. Value definition — workflow, success metric, cost target, owner documented
  2. 2. Human boundaries — which decisions agent can make alone vs require approval
  3. 3. Data map — sensitive data, PII, retention rules, source systems documented
  4. 4. Permission scope — tool access is least-privilege, action scopes allowlisted
  5. 5. Knowledge integrity — grounding sources current, versioned, permission-aware
  6. 6. Safety & misuse — prompt injection handling, content guardrails, escalation defined
  7. 7. Evaluation — test set exists, pass threshold set, regression check in place
  8. 8. Observability — traces, logs, analytics, error alerts available
  9. 9. Lifecycle control — versioning, rollback, retirement, ownership defined
  10. 10. Australian compliance — privacy review done, cross-border & security checked
Activity

Complete Your Deployment Readiness Checklist

  • Work through all 10 release gates for your agent
  • Document what passes and what needs work
  • Identify your biggest governance gap

15 minutes

Module 08 Recap

Security, Governance & Australian Compliance

  • Privacy Act applies to all AI processing personal information
  • Cross-border data flows require APP 8 compliance
  • Prompt injection is a real threat — defend in layers
  • Change control and auditability are governance requirements
  • The 10-point release gate checklist is your deployment standard
Module 09

Deployment & Lifecycle Management

Versioning, traffic splitting, staged rollout, rollback, incident response, and cost routing.

09

What You'll Learn

  • Plan staged agent deployments
  • Implement versioning and rollback
  • Design incident response procedures
  • Optimise cost through smart routing

Versioning & Revisions

  • Every agent deployment is a versioned release
  • Track: system instructions, tool config, knowledge sources, model version
  • Never edit production agents directly
  • Use revision history for audit and rollback
  • Google ADK and Microsoft Foundry both support revisioning

Traffic Splitting & Staged Rollout

  • Don't go from 0% to 100% in one step
  • Start with 5–10% of traffic to the new version
  • Monitor: error rate, latency, user satisfaction
  • If metrics hold, gradually increase to 25% → 50% → 100%
  • Google's agent platform supports traffic splitting natively

Rollback Planning

Incident Response

When an agent fails in production:

  • 1. Detect — monitoring alerts
  • 2. Contain — pause or throttle agent
  • 3. Diagnose — read traces, identify root cause
  • 4. Fix — update instructions/tools/knowledge
  • 5. Verify — run evaluation pack
  • 6. Deploy fix — staged rollout
  • 7. Post-mortem — document and improve

Cost Routing & Optimisation

Production Deployment Plan

Your deployment plan should include:

  • Version identifier and change summary
  • Rollout schedule (% traffic by day)
  • Success metrics and rollback triggers
  • Monitoring dashboard setup
  • On-call responsibility during rollout
  • Post-deployment review date
Activity

Write Your Production Deployment Plan

  • Define versioning approach for your agent
  • Map rollout stages with traffic percentages
  • Set rollback criteria and monitoring plan
  • Assign on-call schedule

15 minutes

Module 09 Recap

Deployment & Lifecycle Management

  • Version every deployment — never edit production directly
  • Stage rollouts with traffic splitting
  • Define rollback criteria before you deploy
  • Follow the 7-step incident response process
  • Route costs smartly — premium for planning, cheap for routing
Module 10

Capstone Studio

Build, test, present, and certify your production-ready agent.

10

What You'll Learn

  • Build end-to-end production agent
  • Create complete evidence and governance pack
  • Present and defend your design decisions
  • Meet certification criteria

Three Capstone Pathways

Internal Knowledge Agent

Knowledge-backed Q&A for your organisation

Operations Triage Agent

Request routing with approval gates

Content/Research Agent

Multi-step research and synthesis pipeline

Capstone Requirements

  • Business case document
  • Architecture diagram
  • Data map and permission model
  • Evaluation set with 20+ test cases
  • Run results with pass/fail evidence
  • Governance checklist (10-point)
  • Rollback plan
  • 5-minute demo
Build Sprint

Build Time

You have the rest of this session to build. Use everything you've learned. Ask for help. Start with the simplest version that works, then iterate.

Peer Review & Demo Prep

Pair up for peer review. Check:

  • Does the agent actually work?
  • Is the evaluation evidence convincing?
  • Would you trust this in production?
  • Are the governance controls genuine?

Prepare a 5-minute demo: problem → solution → evidence → limits.

Assessment Rubric

  • Problem framing — 15%
  • Architecture — 15%
  • Tooling & action design — 15%
  • Grounding & data handling — 10%
  • Evaluation evidence — 15%
  • Governance & security — 15%
  • Reliability & operational readiness — 10%
  • Communication — 5%

Pass: 75%+ overall, capstone 75%+, no red-flag safety failure.

Certification Criteria

  • Overall score 75%+
  • Capstone score 75%+
  • No critical failure in privacy/permissions/approval design
  • All labs submitted with no critical omissions
  • Module quizzes average 80%+

Certification rewards useful work, not flashy demos.

Future-Proofing

Module 10 Recap

Capstone Studio

  • Three capstone pathways — knowledge, triage, or research
  • Full evidence pack required: business case through rollback plan
  • Peer review strengthens your design
  • Assessment rewards rigour, not flash
  • Future-proof by building maintenance habits

Your Immediate Next Steps

  • Pick one workflow to automate this week
  • Complete your platform selection memo
  • Run through the 10-point release gate checklist
  • Schedule your first quarterly agent review
  • Share this framework with your team

Resources

  • Course materials: www.rupertchesman.com
  • AI Prompt Builder: www.rupertchesman.com/tools/prompt-builder
  • Cheat sheets: www.rupertchesman.com/cheatsheets
  • All resources: www.rupertchesman.com/resources

Recommended Next Courses

  • Mastering AI Tools — deep dive into prompting and tool workflows
  • AI for Corporate Teams — AI adoption strategy and governance
  • AI Productivity Systems — personal AI workflows
  • Vibe Coding — creating apps by describing what you want
  • Visit www.rupertchesman.com for all courses
Certificate

AI Agents & Automation Certificate

Complete all modules + labs + capstone = AI Agents & Automation Certificate

Questions

What would you like to know more about?

Thank You

www.rupertchesman.com

© Rupert Chesman 2026