Specialist Course

AI Agents
& Automation

Design, evaluate, and deploy useful agents that survive first contact with real business constraints — without the hype.

30+

Hours

10

Modules

5

Labs

1

Capstone

Welcome

Who are you, and what
brought you here today?

Take 60 seconds. Share your name, your role, and the one thing you most want to learn about AI agents.

Round-table introductions help us calibrate examples and pair people for lab work. No wrong answers.

“

Learn to design, evaluate, and deploy useful agents that survive first contact with real business constraints.

The Course Promise

Overview

Today's
10 Modules

A full-day journey from foundations through deployment, ending with your own capstone agent.

01 Agentic AI & Business Value
02 Models, Tools & Protocols
03 Single-Agent Design Patterns
04 Knowledge Systems
05 Automation & Business Actions
06 Multi-Agent Orchestration
07 Evaluation & Observability
08 Security, Governance & AU Compliance
09 Deployment & Lifecycle
10 Capstone Studio

Audience

Who This Course Is For

Knowledge workers & operators — looking to automate repetitive workflows and reclaim strategic time
Consultants & agency owners — wanting to offer agentic AI services or build internal efficiencies
Product managers & ops leads — evaluating where agents fit in their product or process roadmap
Technical builders & developers — ready to move from prototypes to production-grade agents
Team leaders & enterprise champions — building the business case for responsible AI adoption

Time Commitment

~30 hrs

of guided learning across ten modules, plus 10–15 hours of independent project work building your capstone agent.

Module 01

Agentic AI &
Business Value

Separate signal from noise. Learn when agents create real value — and when they are expensive distractions.

01

Agentic AI & Business Value

Definitions, the agent spectrum, ROI frameworks, and knowing when NOT to build.

Module 01 · Objectives

What You'll
Learn

By the end of this module you will be able to:

Distinguish assistants from automations from agents — and explain why the difference matters
Choose when a full agent is justified versus simpler alternatives
Score use cases for frequency, consequence, and ROI using a structured framework
Identify when NOT to use an agent — the most valuable skill in agent design

Module 01 · Core Concept

What Is an AI Agent?

Perceives its environment — reads data, monitors inboxes, watches dashboards, ingests context
Plans and reasons about tasks — breaks goals into steps, weighs alternatives, re-plans when blocked
Takes actions using tools — calls APIs, searches the web, writes files, sends messages
Operates with some autonomy — makes decisions within boundaries without human approval at every step
Has feedback loops to self-correct — evaluates its own outputs, retries on failure, escalates when uncertain

Module 01 · Comparison

Agent vs Chatbot vs Automation

💬

Chatbot

Responds to prompts
No tool access
Stateless between sessions
Human drives every step

⚙️

Automation

Rule-based triggers
Deterministic flow
If/then logic only
Breaks on exceptions

🤖

Agent

Reasons about goals
Uses tools dynamically
Handles ambiguity
Self-corrects on errors

Module 01 · Framework

The Agent Spectrum

Not everything needs to be a full agent. Match the level of autonomy to the task.

Level 0 Prompt — a single instruction, no memory, no tools, no loops
Level 1 Assistant — multi-turn conversation with context window but still human-driven
Level 2 Tool-calling — the model selects and invokes functions but a human approves each call
Level 3 Single Agent — autonomous planning, tool use, retries, and self-evaluation within guardrails
Level 4 Multi-Agent System — multiple specialised agents coordinated by an orchestrator

Module 01 · Industry Data

69%

of Australian organisations are already using agentic AI, according to Deloitte (2026).

But only 22% report advanced agent governance in place.

Module 01 · Reality Check

ROI vs Hype

Gartner warns 40%+ of agentic AI projects will be cancelled by end of 2027 due to cost, unclear value, and risk-control failures
"Agent sprawl" is a real risk — organisations deploying agents without governance create fragile, overlapping systems nobody can audit
The biggest cost is not the API bill — it is lost trust when an agent makes a visible, embarrassing, or costly mistake
This course teaches you to avoid these traps with structured evaluation, approval gates, and clear success metrics

Module 01 · Framework

Use-Case Scoring Framework

Score each candidate workflow on six dimensions (1–5 scale). High frequency + low consequence + clear ROI = best starting point.

Rule of thumb: If total score is below 18, start with simple automation instead of an agent.

Frequency of task — how often does this run? Daily beats monthly.
Consequence of error — what happens when it is wrong? Reversible beats catastrophic.
Permission sensitivity — what access does it need? Read-only beats admin.
Data sensitivity — PII, financial, health? Lower is simpler.
Expected ROI — hours saved, revenue gained, errors prevented.
Human judgment required — how much nuance does the decision need?

Module 01 · Anti-Patterns

When NOT to Use an Agent

The task is simple and rule-based — a Zapier automation or Make scenario is cheaper, faster, and more predictable
The cost of error is catastrophic — financial transfers, medical decisions, legal filings need human oversight
The workflow changes constantly — agents struggle when the rules shift weekly; use assisted tools instead
There is no clear success metric — if you cannot measure whether it worked, define the metric first
The data is too sensitive for any AI — some data should never leave your perimeter, full stop

Module 01 · Activity

Build Your Opportunity Map

Choose one real workflow from your current work — something you do at least weekly
Map the workflow: trigger, inputs, outputs, decision points, approvals
Score it using the six-dimension framework (frequency, consequence, permissions, data, ROI, judgment)
Write one paragraph on the business case — or why it is NOT a good agent candidate

15

minutes — individual work, then pair share

Module 01 · Recap

Module 1 Key Takeaways

An agent perceives, plans, acts, and self-corrects — it is fundamentally different from a chatbot or an automation
The agent spectrum ranges from simple prompts to multi-agent systems — match autonomy to the task
69% of AU orgs are using agentic AI, but governance lags — be intentional, not reactive
Use the scoring framework to evaluate use cases before building anything
Knowing when NOT to use an agent is the single most valuable lesson in this module

Module 02

Models, Tools
& Protocols

GPT-5.5, Claude Opus 4.8, Gemini 3.5, Zapier, Make, Copilot Studio — and how to choose.

02

Models, Tools & Protocols

Compare platforms, understand MCP and function calling, and apply model routing for cost discipline.

Module 02 · Objectives

What You'll
Learn

By the end of this module you will be able to:

Compare major agent platforms on capabilities, pricing, and enterprise controls
Understand the Responses API, Agents SDK, and managed agent patterns
Explain MCP (Model Context Protocol) and function calling to non-technical stakeholders
Apply model routing and cost discipline to cut spend 60–80% without quality loss

Module 02 · Landscape

The Platform Landscape

OpenAI

Responses API & Agents SDK
Built-in tools & tracing
Remote MCP support

Anthropic

Opus 4.8 & Managed Agents
MCP originator
Responsible Scaling Policy

Google

Gemini 3.5 & ADK
Enterprise Agent Platform
IAM & VPC controls

Zapier

No-code agents
8,000+ app integrations
MCP & SOC 2 certified

Make

Visual orchestration
AI Agents beta
3,000+ apps, on-prem option

Microsoft

Copilot Studio
Foundry Agent Service
Entra auth & M365 native

Module 02 · Platform Deep-Dive

OpenAI — Responses API & Agents SDK

Built-in tools — web search, file search, code interpreter, and computer use available natively
Remote MCP support — connect to any MCP-compliant tool server alongside built-in tools
Agents SDK — open-source Python framework with guardrails, handoffs, and tracing built in
Tracing & sandboxes — full observability of every tool call, decision, and output
GPT-5.5 pricing: US$5 input / US$30 output per 1M tokens

Module 02 · Platform Deep-Dive

Anthropic — Opus 4.8 & Managed Agents

Tool use, web search, code execution — plus the Model Context Protocol (MCP) they created
Claude Managed Agents — public beta for orchestrated multi-step workflows with built-in safety
Responsible Scaling Policy v3.0 — industry-leading safety framework with AI Safety Level evaluations
Extended thinking — expose the model's reasoning chain for debugging and trust-building
Opus 4.8 pricing: US$5 input / US$25 output per 1M tokens

Module 02 · Platform Deep-Dive

Google — Gemini 3.5 & ADK

Enterprise Agent Platform — build, govern, and optimise agents within Google Cloud
Open-source Agent Development Kit (ADK) — modular, composable, deployment-flexible
Enterprise controls — IAM, audit logs, VPC Service Controls, and data residency compliance
Agent evaluation & traffic splitting — A/B test agent versions before full rollout
Workspace integration — native access to Gmail, Drive, Calendar, Docs within agent flows

Module 02 · No-Code Options

No-Code Agent Platforms

Zapier Agents

8,000+ app connectors
MCP server connectivity
SOC 2 Type II certified
Natural language instructions

Make AI Agents

Visual scenario builder
Credits-based pricing
On-premise deployment option
3,000+ integrations

Copilot Studio

M365 & Teams native
Entra ID authentication
AU$299.30/25k message credits
DLP & audit logging

Module 02 · Protocol

MCP — Model Context Protocol

Standard protocol for connecting AI models to external tools, data sources, and services
Created by Anthropic, now adopted by OpenAI (remote MCP support), Zapier, Make, and others
Think of it as "USB-C for AI tools" — one protocol that works across models and platforms
Server/client architecture — tools expose capabilities via MCP servers; models connect as clients
Why it matters: Build a tool integration once, use it with any MCP-compatible model

Module 02 · Mechanics

Function Calling & Built-in Tools

Function calling — the model reads a schema, decides which function to call, and generates structured arguments
Built-in tools — pre-built capabilities like web search, code interpreter, and file search that require no custom code
Custom tools — your own APIs and services, described via JSON Schema or MCP, executed by your infrastructure
Tool selection is a design decision — more tools increase flexibility but also increase reasoning complexity and error surface

Module 02 · Cost Strategy

Model Routing & Cost Discipline

Use stronger models for planning and review — Opus 4.8 or GPT-5.5 for complex reasoning, evaluation, and quality gates
Use cheaper models for routing and classification — Sonnet or GPT-5.5 mini for intent detection and simple extraction
All providers expose tiered pricing — match model capability to task complexity at each step in the pipeline
Smart routing can cut costs 60–80% with no measurable quality loss on production benchmarks

Example: Route 80% of queries to a fast, cheap model. Escalate the 20% that need deep reasoning to a frontier model. Measure both paths.

Module 02 · Activity

Write Your Platform Selection Memo

Pick your use case from Module 1's opportunity map exercise
Compare 2–3 platforms that could serve it (one code, one no-code minimum)
Evaluate each on: capabilities, pricing, enterprise controls, team skills, MCP support
Write a one-page decision memo with your recommendation and rationale

15

minutes — individual work, then group debrief

Module 02 · Recap

Module 2 Key Takeaways

Six major platforms compete for agent workloads — no single winner; the right choice depends on your constraints
MCP is becoming the universal connector — learn it once, use it everywhere
Function calling is the core mechanism — the model selects tools, your code executes them
Model routing is the highest-ROI cost optimisation — use expensive models only where they add measurable value
No-code platforms (Zapier, Make, Copilot Studio) are production-ready for many agent use cases

Module 03

Single-Agent
Design Patterns

System instructions, tool selection, approvals, recovery paths, and memory boundaries.

03

Single-Agent Design Patterns

The practical patterns that make individual agents reliable, safe, and useful in production.

Module 03 · Objectives

What You'll
Learn

By the end of this module you will be able to:

Write effective system instructions that constrain behaviour without crippling capability
Design tool boundaries and human approval gates at the right granularity
Build retry and recovery strategies that fail gracefully under real-world conditions
Manage memory within agent scope — short-term, medium-term, and long-term

Module 03 · Pattern

System Instructions Design

Define the agent's role and boundaries clearly — "You are a research assistant for the marketing team. You do NOT make purchasing decisions."
Specify what it CAN and CANNOT do — explicit allowlists beat implicit denylists every time
Include output format requirements — structured outputs reduce downstream parsing errors to near zero
Set escalation triggers — define exactly when the agent should stop and ask a human
Test instructions against edge cases — adversarial testing during design prevents production surprises

Module 03 · Pattern

Tool Selection & Boundaries

Choosing Tools

Every tool you give an agent is a capability AND a risk surface. Design with intention.

Principle: Start with the minimum viable toolset. Add tools only when you can measure the improvement.

Start with minimum viable tools — resist the urge to connect everything
Each tool adds complexity and failure surface area
Prefer built-in tools over custom where possible
Define explicit permission scopes per tool
Test tool combinations for conflicts and unintended interactions

Module 03 · Pattern

Human Approval Gates

Not every action needs approval — over-gating kills adoption faster than any bug
High-risk actions ALWAYS need human sign-off — financial commits, external communications, data deletions
Design three tiers: auto-approve (read-only, low-risk) / notify (reversible writes) / require approval (irreversible, external, financial)
Log all decisions for audit — even auto-approved actions need a paper trail
Make approval UX fast — if it takes 5 clicks to approve, people will bypass the system

Module 03 · Pattern

Retry Strategy & Recovery

Define max retry attempts per tool — typically 2–3 for API calls, zero for destructive operations
Set timeout thresholds — an agent waiting 60 seconds for a response is an agent wasting money
Build graceful degradation paths — if the preferred tool fails, what is the fallback?
Never retry destructive actions — sending an email twice is worse than not sending it at all
Log failures for debugging — every error is training data for improving the system
Always have a "give up gracefully" path — escalate to a human with full context, not a cryptic error

Module 03 · Pattern

Memory Boundaries

Short-term memory — the current conversation context window; resets between sessions
Medium-term memory — session state, user preferences, and task progress persisted across turns
Long-term memory — persistent knowledge bases, vector stores, and interaction history
Rule: Store only what demonstrably improves the quality of future decisions
Privacy imperative: Never persist sensitive data unnecessarily — minimise what you store, encrypt what you must, delete what you no longer need

Module 03 · Reference Architecture

The Conservative Baseline Architecture

One orchestrator model — a single capable model (Opus 4.8, GPT-5.5) that plans, delegates, and evaluates
Bounded tool set — minimum viable tools, each with explicit permissions and rate limits
Explicitly chosen knowledge sources — curated vector stores or search indices, not "search the whole internet"
Optional specialist sub-agents — add only when a single model cannot handle domain complexity
Visible evaluation and approval loops — every decision logged, high-risk actions gated, outputs scored

Start here. Add complexity only when measurements prove the simpler version is insufficient.

Module 03 · Anti-Patterns

Common Pitfalls

Over-engineering — start simple, add complexity only when proven needed; multi-agent is rarely day one
Ignoring failure modes — agents WILL fail; the question is whether they fail gracefully or catastrophically
No evaluation framework — if you cannot measure whether it worked, you cannot improve it or justify its cost
Unbounded autonomy — always set limits on actions, spend, and scope; trust is earned incrementally
Skipping human review at launch — shadow mode first, then supervised, then semi-autonomous; never jump to full auto

Module 03 · Lab

Build Your First Agent

Choose your platform: OpenAI Responses API, Anthropic tool use, or Zapier Agents
Build a simple research agent that gathers evidence from multiple sources and synthesises a recommendation
Write clear system instructions — define the role, boundaries, output format, and escalation triggers
Give it 2–3 tools maximum — web search, a document reader, and optionally a note-taker
Test it against at least 3 different queries — one easy, one ambiguous, one adversarial

20

minutes — hands-on lab, then demo two volunteers

Module 03 · Recap

Module 3 Key Takeaways

System instructions are the most important design artifact — invest time getting them right
Minimum viable tools — every additional tool increases both capability and risk surface
Three-tier approval gates (auto / notify / require) balance safety with usability
Plan for failure — retry strategies, graceful degradation, and human escalation paths are not optional
The conservative baseline (one orchestrator, bounded tools, visible loops) is the right starting point for every agent

Module 04

Knowledge Systems

Long context vs RAG vs search — and how to test which one works.

04

What You'll Learn

Module 4 Objectives

Choose between long context, RAG, and search

Prepare and process knowledge corpora

Design chunking and retrieval strategies

Test grounding quality with evaluation sets

The Knowledge Decision

Three approaches to giving agents knowledge

Long Context

Feed everything into the prompt. Simple but limited by window size. Best for small corpora under ~50 pages.

RAG

Retrieve relevant chunks at query time. Scales to large corpora. Needs infrastructure — embedding, indexing, retrieval pipeline.

Web Search

Live data, no corpus needed. Less control over quality. Best when currency matters more than consistency.

When to Use What

A practical decision framework

<50 pages? → Long context

Stable large corpus? → RAG

Need current info? → Web search

Need citations? → RAG or search with source tracking

Mixed needs? → Combine approaches

Corpus Preparation

Your knowledge is only as good as your source material

Clean and normalise source documents

Remove duplicates and outdated content

Structure for consistent retrieval

Version your corpus — knowledge changes

Document your sources for audit compliance

Chunking Strategies

How you split your corpus determines retrieval quality

Chunk by semantic meaning, not arbitrary length

Overlap chunks for context continuity

Include metadata (source, date, section)

Test chunk sizes against your actual queries

Smaller chunks = more precision, larger = more context

Retrieval Testing

Measure before you ship

Build a test set of 20–30 real questions

Compare retrieval accuracy across strategies

Measure: relevance, completeness, citation fidelity

Track failure modes (wrong chunk, missing info, hallucination)

Iterate until quality meets your threshold

Grounding Quality

The #1 risk in RAG systems

Grounding = making sure answers come from your sources, not the model's imagination.

Does the answer cite the right source?

Does it invent facts not in the corpus?

Does it handle "I don't know" correctly?

Grounding failures are the #1 RAG risk

Lab: RAG & Policy-Compliance Agent

Hands-on with real regulatory data

Use OAIC privacy guidance and Australian Privacy Principles as public corpus

Ingest, chunk, and index the documents

Build a test set of privacy compliance questions

Compare: plain long-context vs RAG vs RAG + policy rules

Expected output: Q&A agent + evaluation set + findings memo

Activity: Build Your Knowledge Agent

20 minutes — hands-on

Ingest a corpus relevant to your work

Generate 20–30 test questions

Compare three grounding strategies

Write a recommendation: which approach wins for your use case?

Module 4 Recap

Knowledge Systems — Key Takeaways

Choose Wisely

Long context for small corpora, RAG for large stable sets, web search for live data. Combine when needed.

Prepare Rigorously

Clean, chunk, and version your corpus. Semantic chunking with metadata outperforms arbitrary splits.

Test Grounding

Build evaluation sets. Measure citation fidelity. Grounding failures are the biggest RAG risk.

Module 05

Automation & Business Actions

Zapier Agents, Make AI Agents, Copilot Studio — building approval-gated operational agents.

05

What You'll Learn

Module 5 Objectives

Build approval-gated business agents

Design trigger-action-approval flows

Compare no-code automation platforms

Implement escalation rules and safety nets

Zapier Agents

The fastest path for operators

8,000–9,000+ app integrations

MCP connectivity for agent-to-agent communication

SOC 2 compliant, SAML/SCIM, audit logs

Best for: sales, marketing, support, internal ops

Fastest path from idea to working automation

Make AI Agents

Visual orchestration for cross-app workflows

Visual orchestration (still beta)

MCP Server support, 3,000+ apps

Credit-based pricing (Core US$12/mo, Pro US$21, Teams US$38)

Best for: cross-app visual workflows, hybrid human+agent automations

On-prem option available for enterprise

Copilot Studio

Low-code agents for the Microsoft ecosystem

Low-code business agents for M365/Teams/Dynamics

Foundry Agent Service for code-first builds

Entra authentication, scoped autonomy, analytics

In Australia: AU$299.30 per 25,000 credits/month

Best for: Microsoft-heavy organisations

Triggers, Actions & Approvals

The anatomy of an operational agent

Trigger: what starts the agent (email, form, schedule, webhook)

Action: what the agent does (send, create, update, classify)

Approval: human checkpoint before high-risk actions

Escalation: what happens when the agent is uncertain

Logging: every decision should be auditable

Escalation Rules

Never let an agent silently fail

Define confidence thresholds (high / medium / low)

High confidence → auto-execute + log

Medium confidence → execute + notify human

Low confidence → pause + require approval

Unknown / error → escalate immediately

Approval-Gated Operations

Lab preview

Build an agent that triages inbound requests. Takes real action only when confidence, permission scope, and policy fit are acceptable.

Support triage — route by severity and type

Lead qualification — score and assign

Invoice exception routing — flag anomalies for review

Platform Comparison

Choose based on your existing stack

Zapier

Fastest setup. Broadest integrations (8,000+). Plan-based pricing. Best for teams that need speed.

Make

Most visual builder. Flexible routing logic. Credit-based pricing. Best for complex cross-app flows.

Copilot Studio

Deepest M365 integration. Enterprise controls. Credit-based pricing. Best for Microsoft-heavy orgs.

Activity: Build Your Approval-Gated Agent

20 minutes — hands-on

Use 50 synthetic requests + category rules

Build a classifier to route requests to buckets

Require approval for high-risk actions

Output: run log, routing dashboard, policy document

Module 5 Recap

Automation & Business Actions — Key Takeaways

Platform Choice

Zapier for speed, Make for visual complexity, Copilot Studio for Microsoft ecosystems. Match to your stack.

Approval Gates

High-risk actions need human checkpoints. Define confidence tiers and route accordingly.

Never Silent Failure

Every agent decision must be logged. Escalation rules are non-negotiable for production agents.

Module 06

Multi-Agent Orchestration

Planner/executor, specialist agents, routing, handoffs, and knowing when multi-agent is overkill.

06

What You'll Learn

Module 6 Objectives

Design multi-agent architectures

Implement routing and handoff patterns

Handle failure modes in multi-agent systems

Know when multi-agent adds value vs complexity

Planner / Executor Pattern

The most reliable multi-agent pattern for beginners

Separate the "thinking" from the "doing." A planner agent decomposes complex tasks into steps, while executor agents handle each step with focused tools and scope.

Planner breaks task into steps

Executors handle individual steps

Planner monitors and re-plans if needed

Clear separation of concerns

Most debuggable multi-agent pattern

Specialist Agents

Bounded tools, clear scope, independently testable

Research agent — gathers and synthesises evidence

Operations agent — takes business actions

QA / critic agent — reviews outputs for quality

Each specialist has bounded tools and clear scope

Specialists should be independently testable

Routing & Handoffs

Context transfer is where multi-agent systems break

Router decides which specialist handles each sub-task

Handoff protocols must include context transfer

Define what data passes between agents

Never pass raw user input between agents without sanitisation

Test handoff paths independently

Failure Modes

What goes wrong in multi-agent systems

Agent A fails → does Agent B know?

Circular routing — agents pass work back and forth forever

Context loss during handoffs

Cascading failures — one agent down = whole system down

Budget overruns from uncontrolled tool calling

Mitigate: circuit breakers, timeouts, fallback paths.

Connected Agents

The pattern is converging across platforms

Modern platforms now support agent-to-agent communication natively:

OpenAI Agents SDK handoffs

Anthropic tool loops with child agents

Google ADK connected agent patterns

Microsoft Foundry multi-agent routing

The Reference Architecture

End-to-end agentic system design

Input: User or business trigger

Orchestrator: Planner / router layer

Execution: Tool layer + Knowledge layer + Specialist agents

Governance: Human approval gate

Observability: Tracing & evaluation

Compliance: Logs & audit trail

Lab: Content Operations Pipeline

Multi-agent system in practice

Build a multi-agent content pipeline:

Research agent → Outline agent → Drafter

QA / critic → Final reviewer

Define which steps are deterministic vs agentic

Map where human sign-off occurs

Identify handoff data and failure points

Activity: Design Your Multi-Agent System

20 minutes — hands-on

Create an architecture diagram

Define agent roles and bounded tool sets

Map handoff protocols and context transfer

Identify failure modes and mitigation strategies

Build a prototype if time allows

Module 6 Recap

Multi-Agent Orchestration — Key Takeaways

Start Simple

Planner/executor is the most reliable pattern. Add specialists only when single-agent complexity becomes unmanageable.

Handoffs Matter

Context transfer is where multi-agent systems break. Define protocols, sanitise inputs, test paths independently.

Plan for Failure

Circuit breakers, timeouts, budget caps. Every agent needs a fallback. Never let the system silently fail.

Module 07

Evaluation & Observability

Test sets, traces, tool-call accuracy, dashboards, and regression checks.

07

What You'll Learn

Module 7 Objectives

Create evaluation datasets

Interpret traces and tool accuracy

Design trajectory vs final-response evaluations

Build dashboards and regression gates

Why Evaluation Matters

The stakes are real

40%+ of agentic AI projects will be cancelled by end of 2027 — Gartner

Without evaluation, you can't prove value, catch regressions, or justify costs. Evaluation is what separates pilots from production.

Test Sets & Evaluation Datasets

Your agent's "unit tests"

Build 20–50 test cases per agent

Include: input, expected output, acceptable variations

Cover: happy path, edge cases, adversarial inputs

Version your test sets — they evolve with the agent

Test sets are your agent's "unit tests"

Traces & Tool-Call Accuracy

See exactly what your agent did and why

A trace = complete record of an agent's reasoning

Tool-call accuracy = did it call the right tool with right args?

Trace inspection reveals: wrong tool selection, missing context, unnecessary loops

OpenAI tracing, Anthropic managed agent logs, Google agent evaluation all provide trace data

Every production agent needs tracing enabled from day one

Trajectory vs Final-Response Evaluation

Two lenses on agent quality

Trajectory Eval

Did the agent take the right steps?

Were tool calls appropriate?

Was reasoning sound?

Final-Response Eval

Is the final answer correct?

Does it meet quality criteria?

Would a human approve this output?

Dashboards & Regression Checks

Your agent's health monitor

Track: success rate, latency, cost per run, error rate

Set up regression alerts: "success rate dropped below 90%"

Compare performance across agent versions

Dashboard is your agent's "health monitor"

Review weekly, not just at launch

Debugging Agent Failures

A systematic approach

Read the trace end-to-end

Identify where the agent went wrong (planning? tool call? output?)

Check: was the test case fair? Was the instruction clear?

Common fix: better system instructions, not more tools

Document every failure pattern for future test cases

Evaluation Pack Design

What to ship alongside every agent

Your evaluation pack should include:

Test dataset (20–50 cases)

Pass/fail thresholds per metric

Regression baseline from last version

Trace samples (good and bad)

One-page summary of what changed and why

Activity: Build Your Evaluation Pack

20 minutes — hands-on

Create a test set for your agent (20–50 cases)

Run evaluations against your test set

Set pass/fail thresholds for each metric

Document results and identify improvement areas

Module 7 Recap

Evaluation & Observability — Key Takeaways

Test Everything

20–50 test cases per agent. Cover happy paths, edge cases, and adversarial inputs. Version your test sets.

Trace Everything

Enable tracing from day one. Inspect tool-call accuracy. Better instructions beat more tools.

Monitor Always

Dashboards for success rate, latency, cost. Regression alerts. Review weekly, not just at launch.

Module 08

Security, Governance & Australian Compliance

Privacy, permissions, cross-border, auditability, prompt injection, and change control.

08

What You'll Learn

Apply governance controls for agent deployment
Navigate Australian privacy obligations
Design permission and audit frameworks
Defend against prompt injection attacks

22%

of Australian organisations report advanced agent governance

Deloitte 2026

Despite 69% using agentic AI. The gap between adoption and governance is the #1 enterprise risk.

Privacy & Permissions

Privacy Act applies to ALL uses of AI involving personal information (OAIC)
Both inputs AND outputs can trigger privacy obligations
Least-privilege principle: agents should only access what they need
Document every data source and its sensitivity level
Review permissions quarterly

OAIC Guidance & Privacy Act

OAIC is explicit: Privacy Act covers AI systems processing personal information
APP 8 (cross-border disclosure) — critical when using offshore model APIs
APP 11 (security) — relevant for SaaS automation platforms
Government teams: DTA responsible AI policy v2.0 now active
Mandatory capability-building in the APS

Cross-Border Considerations

Using OpenAI, Anthropic, or Google APIs = data crosses borders
APP 8 requires you to ensure overseas recipients handle data per APPs
SaaS platforms (Zapier, Make) may process data in multiple jurisdictions
Document your data flows
Consider: regional processing options (OpenAI), data residency (Google)

Prompt Injection & Safety

Prompt injection = malicious input that hijacks agent behaviour
Defence: input sanitisation, output validation, guardrails
Never let user input become system instructions
Use content guardrails for sensitive outputs
Define escalation paths for suspicious behaviour
Test with adversarial inputs

Change Control & Auditability

Version every agent configuration change
Maintain audit trail of all agent decisions
Log: who deployed, what changed, when, why
Define rollback procedures before deployment
Auditability is not optional — it's a governance requirement

The 10-Point Release Gate Checklist

1. Value definition — workflow, success metric, cost target, owner documented
2. Human boundaries — which decisions agent can make alone vs require approval
3. Data map — sensitive data, PII, retention rules, source systems documented
4. Permission scope — tool access is least-privilege, action scopes allowlisted
5. Knowledge integrity — grounding sources current, versioned, permission-aware
6. Safety & misuse — prompt injection handling, content guardrails, escalation defined
7. Evaluation — test set exists, pass threshold set, regression check in place
8. Observability — traces, logs, analytics, error alerts available
9. Lifecycle control — versioning, rollback, retirement, ownership defined
10. Australian compliance — privacy review done, cross-border & security checked

Activity

Complete Your Deployment Readiness Checklist

Work through all 10 release gates for your agent
Document what passes and what needs work
Identify your biggest governance gap

15 minutes

Module 08 Recap

Security, Governance & Australian Compliance

Privacy Act applies to all AI processing personal information
Cross-border data flows require APP 8 compliance
Prompt injection is a real threat — defend in layers
Change control and auditability are governance requirements
The 10-point release gate checklist is your deployment standard

Module 09

Deployment & Lifecycle Management

Versioning, traffic splitting, staged rollout, rollback, incident response, and cost routing.

09

What You'll Learn

Plan staged agent deployments
Implement versioning and rollback
Design incident response procedures
Optimise cost through smart routing

Versioning & Revisions

Every agent deployment is a versioned release
Track: system instructions, tool config, knowledge sources, model version
Never edit production agents directly
Use revision history for audit and rollback
Google ADK and Microsoft Foundry both support revisioning

Traffic Splitting & Staged Rollout

Don't go from 0% to 100% in one step
Start with 5–10% of traffic to the new version
Monitor: error rate, latency, user satisfaction
If metrics hold, gradually increase to 25% → 50% → 100%
Google's agent platform supports traffic splitting natively

Rollback Planning

Define rollback criteria BEFORE deployment
"If error rate exceeds X%, roll back immediately"
Keep previous version ready to re-activate
Test rollback procedure in staging first
Document who has authority to trigger rollback
Rollback should take minutes, not hours

Incident Response

When an agent fails in production:

1. Detect — monitoring alerts
2. Contain — pause or throttle agent
3. Diagnose — read traces, identify root cause
4. Fix — update instructions/tools/knowledge
5. Verify — run evaluation pack
6. Deploy fix — staged rollout
7. Post-mortem — document and improve

Cost Routing & Optimisation

Use premium models for planning and review
Use cheaper models for routing and extraction
Batch non-urgent tasks for lower pricing tiers
Cache repeated queries when appropriate
Set cost alerts and per-run budgets
Track cost-per-successful-outcome not just cost-per-call

Production Deployment Plan

Your deployment plan should include:

Version identifier and change summary
Rollout schedule (% traffic by day)
Success metrics and rollback triggers
Monitoring dashboard setup
On-call responsibility during rollout
Post-deployment review date

Activity

Write Your Production Deployment Plan

Define versioning approach for your agent
Map rollout stages with traffic percentages
Set rollback criteria and monitoring plan
Assign on-call schedule

15 minutes

Module 09 Recap

Deployment & Lifecycle Management

Version every deployment — never edit production directly
Stage rollouts with traffic splitting
Define rollback criteria before you deploy
Follow the 7-step incident response process
Route costs smartly — premium for planning, cheap for routing

Module 10

Capstone Studio

Build, test, present, and certify your production-ready agent.

10

What You'll Learn

Build end-to-end production agent
Create complete evidence and governance pack
Present and defend your design decisions
Meet certification criteria

Three Capstone Pathways

Internal Knowledge Agent

Knowledge-backed Q&A for your organisation

Operations Triage Agent

Request routing with approval gates

Content/Research Agent

Multi-step research and synthesis pipeline

Capstone Requirements

Business case document
Architecture diagram
Data map and permission model
Evaluation set with 20+ test cases
Run results with pass/fail evidence
Governance checklist (10-point)
Rollback plan
5-minute demo

Build Sprint

Build Time

You have the rest of this session to build. Use everything you've learned. Ask for help. Start with the simplest version that works, then iterate.

Peer Review & Demo Prep

Pair up for peer review. Check:

Does the agent actually work?
Is the evaluation evidence convincing?
Would you trust this in production?
Are the governance controls genuine?

Prepare a 5-minute demo: problem → solution → evidence → limits.

Assessment Rubric

Problem framing — 15%
Architecture — 15%
Tooling & action design — 15%
Grounding & data handling — 10%
Evaluation evidence — 15%
Governance & security — 15%
Reliability & operational readiness — 10%
Communication — 5%

Pass: 75%+ overall, capstone 75%+, no red-flag safety failure.

Certification Criteria

Overall score 75%+
Capstone score 75%+
No critical failure in privacy/permissions/approval design
All labs submitted with no critical omissions
Module quizzes average 80%+

Certification rewards useful work, not flashy demos.

Future-Proofing

Schedule quarterly agent reviews
Update platform matrix as vendors change
Re-run evaluation packs after model updates
Keep governance checklist current
Build a team habit of agent maintenance

Module 10 Recap

Capstone Studio

Three capstone pathways — knowledge, triage, or research
Full evidence pack required: business case through rollback plan
Peer review strengthens your design
Assessment rewards rigour, not flash
Future-proof by building maintenance habits

Your Immediate Next Steps

Pick one workflow to automate this week
Complete your platform selection memo
Run through the 10-point release gate checklist
Schedule your first quarterly agent review
Share this framework with your team

Resources

Course materials: www.rupertchesman.com
AI Prompt Builder: www.rupertchesman.com/tools/prompt-builder
Cheat sheets: www.rupertchesman.com/cheatsheets
All resources: www.rupertchesman.com/resources

Recommended Next Courses

Mastering AI Tools — deep dive into prompting and tool workflows
AI for Corporate Teams — AI adoption strategy and governance
AI Productivity Systems — personal AI workflows
Vibe Coding — creating apps by describing what you want
Visit www.rupertchesman.com for all courses

Certificate