# Course 04 Course Content Spec: AI Agent Build Lab
## 1. Title and Source Files Used
Course Title: AI Agent Build Lab
Owned Output File: 04-ai-agent-build-lab/04_ai_agent_build_lab_course_content.md
Primary source files used:
- 04-ai-agent-build-lab/curriculum.md
- 04-ai-agent-build-lab/website-prompt.md
Purpose of this document: This is the implementation-ready course-content specification for Course 04. It converts the source curriculum into a facilitation, delivery, assignment, and assessment plan that can be taught consistently by an instructor and evaluated consistently by humans and LLMs.
## 2. Design Decisions at the Top
- The curriculum is the source of truth; this document expands rather than reinterprets it.
The source curriculum defines the core philosophy, modules, example projects, and assessment priorities. This document preserves those and adds operational detail.
- The course is taught as a build lab, not a lecture series.
The dominant learning mode is making, testing, debugging, and shipping. Explanations are short and instrumental. Students should leave each session with visible progress.
- The central mental model is assistant vs. agent.
The course repeatedly reinforces the difference between asking for answers and delegating outcomes. This framing appears in instruction, exercises, demos, and assessment.
- No-code and low-code are the default, not a fallback.
The curriculum explicitly says no coding experience is required. Tool choices, assignments, and evaluation therefore reward clear workflow design and reliable execution more than technical sophistication.
- Working boring agents beat ambitious broken agents.
This is both a teaching principle and a grading principle. Scope control is treated as a skill, not a compromise.
- Students build on their own real problem space.
The curriculum emphasizes agent opportunities in students' actual lives. All major artifacts should stay anchored to a real workflow the student cares about.
- Evaluation must distinguish plausible output from successful execution.
Because agents can look impressive while failing silently, the assessment framework requires evidence of runs, test cases, failure analysis, and iteration.
- The website prompt informs tone and presentation language.
The website source adds a high-energy, futuristic framing: "A workforce, not a tool," "You delegate. It executes," and "architect vs. operator." Those phrases should shape facilitator language, slide copy, and demo framing, but not override the curriculum's practical focus.
- The source duration conflict is resolved explicitly.
The curriculum metadata says "5 sessions x 2 hours," while Modules 3-5 imply a longer build sprint with Days 3-4 at 4 hours each and Day 5 split into morning and afternoon blocks. This spec assumes a 5-day intensive with 14 contact hours total:
- Day 1: 2 hours
- Day 2: 2 hours
- Day 3: 4 hours
- Day 4: 4 hours
- Day 5: 2 hours
- Assessment uses human judgment supported by LLM evaluation, not replaced by it.
LLMs are used for first-pass scoring, rubric alignment, and formative feedback. Final high-stakes decisions should remain reviewable by a facilitator.
## 3. Delivery Model Assumptions
Target learner profile
- Ages 15-25
- Mixed academic backgrounds
- No coding experience assumed
- Comfortable with web apps and basic digital workflows
Cohort size
- Ideal: 15-24 students
- Minimum viable: 8 students
- Maximum without instructional support: 28 students
Facilitation staffing
- 1 lead facilitator
- 1 teaching assistant or floating technical coach for every 12-15 students during build days
Instructional format
- In-person preferred
- Can run hybrid if all tools are browser-based and students can share screens
- Projected live demos are required
Technology assumptions
- Every student has a laptop
- Stable internet access
- Access to at least one LLM interface such as ChatGPT or Claude
- Access to one workflow tool such as n8n, Zapier, Make, or equivalent
- Students can create free accounts if institutional accounts are not provided
Tool policy
- Default stack: ChatGPT or Claude plus n8n
- Simpler fallback: ChatGPT or Claude only, using structured multi-step prompting
- More technical extension path: LangFlow or similar visual builder
Data and privacy assumptions
- Students should avoid connecting sensitive personal data unless explicit safeguards are taught
- Demo tasks should use low-risk workflows, sample data, or sanitized personal data
- Email, calendar, and messaging automations should default to draft mode rather than send mode
Instructional pacing assumptions
- Day 1 and Day 2 are concept plus design heavy
- Day 3 and Day 4 are build heavy
- Day 5 is test, demo, and reflection heavy
Definition of success
- Every student finishes with a functioning agent or tightly scoped agent workflow prototype that completes a real task with evidence
- Every student can explain the goal, workflow, limitations, and next iteration
## 4. Detailed Course Content
Course Arc
The week progresses through five stages:
- See the paradigm shift
- Find a real problem worth delegating
- Design an agent workflow with prompts and guardrails
- Build and debug until the workflow works
- Evaluate, present, and plan the next version
Module 1 / Session 1
Session title: What Is an AI Agent?
Day and duration: Day 1, 2 hours
Session outcomes
- Students can explain the difference between an AI assistant and an AI agent
- Students can identify at least three possible agent opportunities from their own lives
- Students understand the five core components of an agent: goal, tools, memory, feedback loop, termination condition
Session agenda
0:00-0:10 | Opening framing
- Activity: Welcome, course challenge, and week outcome preview
- Facilitator moves:
- State the promise: by the end of the week each student ships one working agent
- Frame the course as delegation, not prompting trivia
- Use language from the source materials: "A workforce, not a tool" and "You are the architect, not the operator"
- Student outputs:
- Verbal check-in: one repetitive task they wish they could hand off
0:10-0:30 | Live contrast demo: smartest assistant vs. dumbest agent
- Activity: Demonstrate the same problem in two modes
- Mode A: single-turn assistant response
- Mode B: multi-step agent workflow with a defined goal and tool use
- Recommended demo task:
- "Prepare a one-page brief on a topic, with sources, summary, and next actions saved in a structured format"
- Facilitator moves:
- Narrate the difference between answering and executing
- Make the invisible visible: point out planning, tool use, verification, and completion criteria
- Ask students which version they would trust with repeated work
- Student outputs:
- Quick written comparison: three differences between assistant mode and agent mode
0:30-0:50 | Mini-lesson: anatomy of an agent
- Activity: Direct instruction with examples
- Content:
- Goal definition
- Tool access
- Memory/context
- Feedback loop
- Termination condition
- Facilitator moves:
- Show one bad example with a vague goal and one improved version
- Emphasize that most failures start with unclear success criteria
- Student outputs:
- Fill in a simple agent anatomy worksheet for a sample use case
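For facilitators who want a concrete handout, the five components above can be sketched as a minimal control loop. This is illustrative teaching scaffolding only, not a real agent framework; every name here (`run_agent`, `goal_check`, the toy tools) is a hypothetical example:

```python
# Minimal sketch of the five agent components: goal, tools, memory,
# feedback loop, termination condition. Hypothetical names throughout.

def run_agent(goal_check, tools, max_steps=10):
    """goal_check(memory) -> bool is the termination condition.
    tools maps step names to callables (tool access)."""
    memory = []  # memory/context: accumulated results
    for step in range(max_steps):  # hard cap guards against infinite loops
        if goal_check(memory):     # termination condition met
            return {"done": True, "steps": step, "memory": memory}
        # feedback loop: choose the next tool based on progress so far
        next_tool = "research" if not memory else "summarize"
        memory.append(tools[next_tool](memory))
    return {"done": False, "steps": max_steps, "memory": memory}

# Toy run: the goal is "one research note plus one summary exist"
tools = {
    "research": lambda mem: "note: three sources found",
    "summarize": lambda mem: "summary of " + mem[-1],
}
result = run_agent(lambda mem: len(mem) >= 2, tools)
print(result["done"], result["steps"])  # True 2
```

The point of the sketch is that a vague `goal_check` is exactly the "unclear success criteria" failure the mini-lesson warns about.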
0:50-1:15 | Exercise: Find the agent opportunities
- Activity: Students audit the last 24 hours of their life or school/workflow
- Prompt:
- What did you do repeatedly?
- What did you do by pattern?
- What consumed time but not deep judgment?
- What could be delegated if the delegate understood your instructions?
- Facilitator moves:
- Push students away from generic ideas toward concrete workflows
- Help students distinguish "hard because unclear" from "hard because deeply human"
- Encourage boring candidates such as organizing, summarizing, triaging, tracking, formatting, and routine research
- Student outputs:
- A labeled list of tasks under three columns:
- Manual only
- Delegatable to an AI agent
- Human judgment required
1:15-1:35 | Pair share and shortlist
- Activity: Students explain one top opportunity to a partner
- Facilitator moves:
- Instruct peers to challenge vagueness
- Require each student to turn a broad idea into a concrete workflow
- Ask, "What would count as done?"
- Student outputs:
- Top 3 candidate agent ideas ranked by feasibility and usefulness
1:35-1:55 | Tool overview and path selection
- Activity: Brief introduction to available build paths
- Tracks:
- Track A: ChatGPT or Claude only
- Track B: n8n plus LLM
- Track C: LangFlow or more technical visual builder
- Facilitator moves:
- Recommend the simplest tool path that can solve the student's problem
- Normalize choosing a low-complexity stack
- Student outputs:
- Selected build path
- One-sentence problem statement for tomorrow
1:55-2:00 | Exit ticket
- Student outputs:
- "My agent will help me ___ by ___"
- "The biggest risk is ___"
Artifacts produced
- Agent Opportunity Audit
- Initial problem statement
- Build-path choice
Facilitator prep
- Preload assistant-vs-agent demo
- Prepare worksheet for agent anatomy
- Prepare example workflows at three difficulty levels
Module 2 / Session 2
Session title: Designing the Agent
Day and duration: Day 2, 2 hours
Session outcomes
- Students can decompose a goal into agent-executable steps
- Students can draft a system prompt, task prompt, output format, and guardrails
- Students can define where human review belongs in their workflow
Session agenda
0:00-0:15 | Warm start and idea refinement
- Activity: Review yesterday's shortlisted agent ideas
- Facilitator moves:
- Have each student state their chosen workflow in one sentence
- Force specificity: input, transformation, output, and user value
- Student outputs:
- Final selected project idea
0:15-0:40 | Lesson: goal decomposition
- Activity: Direct instruction plus worked example
- Content:
- Task vs. goal
- Sequential steps
- Decision points
- What the human does vs. what the agent does
- Where failure is most likely
- Facilitator moves:
- Demonstrate a bad decomposition that skips inputs, checks, or outputs
- Model a good decomposition with explicit checkpoints
- Student outputs:
- Notes on decomposition pattern
0:40-1:05 | Exercise: Goal Autopsy
- Activity: Students map their selected problem into steps
- Required outputs in the map:
- Trigger/input
- Processing steps
- Tools required
- Human review points
- Final output
- Stop condition
- Facilitator moves:
- Ask "Can an AI actually do this step with the tools you chose?"
- Mark any steps that are too ambiguous or too broad
- Encourage scope cuts if a student has more than 5-7 major steps
- Student outputs:
- Agent workflow map
1:05-1:25 | Lesson: prompt design for agents
- Activity: Direct instruction and prompt teardown
- Content:
- Role definition
- Goal statement
- Constraints
- Output format
- Tool use instructions
- Error handling
- Escalation rules
- Facilitator moves:
- Show one weak prompt and one strong prompt
- Emphasize structure over clever wording
- Teach students to specify what the agent should do when uncertain
- Student outputs:
- Prompt template draft
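The prompt elements taught in this lesson can be packaged as a fill-in template. The template below is a sketch of one possible structure, not a canonical format; the placeholder names are assumptions for illustration:

```python
# Illustrative agent prompt template covering role, goal, constraints,
# output format, tool use, error handling, and escalation rules.
AGENT_PROMPT = """\
ROLE: You are a {role}.
GOAL: {goal}
CONSTRAINTS:
{constraints}
OUTPUT FORMAT: {output_format}
TOOL USE: {tool_rules}
IF UNCERTAIN: Stop and ask the user; do not guess.
IF A TOOL FAILS: Report the error and the last successful step.
"""

prompt = AGENT_PROMPT.format(
    role="research assistant that produces one-page briefs",
    goal="Produce a brief with sources, summary, and next actions.",
    constraints="- Use only the provided sources\n- Max 300 words",
    output_format="Markdown with headings: Sources, Summary, Next Actions",
    tool_rules="Use web search only for source verification.",
)
print("IF UNCERTAIN" in prompt)  # True
```

Note how the uncertainty and tool-failure lines are fixed text: students specify what the agent does when things go wrong, which is the point the lesson emphasizes.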
1:25-1:50 | Hands-on workshop: Agent Blueprint
- Activity: Students complete a build-ready blueprint
- Blueprint sections:
- Project name
- User problem
- Success definition
- Input format
- Step-by-step workflow
- Tools
- Prompt set
- Guardrails
- Failure modes
- Test cases
- Facilitator moves:
- Review for feasibility, not elegance
- Require at least two test cases and one likely failure case
- Student outputs:
- Version 1 Agent Blueprint
1:50-2:00 | Build readiness check
- Activity: Rapid desk checks or pair reviews
- Facilitator moves:
- Verify each student can answer:
- What starts the workflow?
- What tools are used?
- What does "done" look like?
- What could fail first?
- Student outputs:
- Build-ready approval or scoped-down revision
Artifacts produced
- Goal Autopsy map
- Agent Blueprint v1
- Prompt set v1
- Test case set
Facilitator prep
- Blueprint template
- Prompt template
- Decomposition examples for simple and complex projects
Module 3 / Session 3
Session title: Build Sprint Part 1
Day and duration: Day 3, 4 hours
Session outcomes
- Students configure their chosen tools
- Students build a first runnable version of their workflow
- Students complete at least one end-to-end test, even if partial or fragile
Session agenda
0:00-0:20 | Sprint kickoff
- Activity: Re-state build rules and ship criteria
- Facilitator moves:
- Set the rule: "By end of Day 4, your agent must work"
- State that broken complexity is not rewarded
- Ask students to define today's concrete milestone
- Student outputs:
- Sprint goal for the day
0:20-1:00 | Tool setup and environment readiness
- Activity: Account access, workflow tool setup, folder structure, test input preparation
- Facilitator moves:
- Use a setup checklist projected on screen
- Encourage pair troubleshooting before facilitator escalation
- Offer fallback tools if account or integration issues stall progress
- Student outputs:
- Working tool access
- Ready-to-use inputs and sample data
1:00-1:45 | Build block 1: trigger plus first action
- Activity: Students create the first executable step
- Facilitator moves:
- Require early testing rather than planning forever
- Tell students to start with the smallest useful slice
- Help students remove unnecessary automation branches
- Student outputs:
- First functioning trigger and action
1:45-2:00 | Stand-up checkpoint
- Activity: Fast progress round
- Prompt:
- What works?
- What breaks?
- What is your next smallest step?
- Facilitator moves:
- Identify common blockers for a mini-clinic
- Student outputs:
- Updated sprint plan
2:00-2:45 | Build block 2: core workflow path
- Activity: Students connect the main sequence of steps
- Facilitator moves:
- Push students to keep one path working before adding branches
- Ask for explicit output formatting
- Ensure the workflow produces an observable result
- Student outputs:
- Core workflow v1
2:45-3:15 | Mini-lesson: prompt iteration under failure
- Activity: Short intervention based on real blockers in the room
- Common topics:
- Vague prompts causing messy outputs
- Missing required fields
- Tool mismatch
- Overly large scope
- Facilitator moves:
- Use student examples with permission
- Model one prompt change and one workflow change
- Student outputs:
- Revised prompt set
3:15-3:50 | Build block 3: first real run
- Activity: Students execute one real workflow run against a real or sanitized task
- Facilitator moves:
- Require capture of evidence: screenshots, logs, outputs
- Ask students to annotate what failed and why
- Student outputs:
- First real run evidence
- Failure notes
3:50-4:00 | Close and next-step commitment
- Student outputs:
- Written plan for Day 4:
- one bug to fix
- one feature to cut or simplify
- one success criterion for tomorrow
Artifacts produced
- Workflow v1
- Run evidence set
- Failure log v1
- Revised prompt set
Facilitator prep
- Setup checklist
- Troubleshooting board for common errors
- Fast fallback exercise for students blocked by tool access
Module 3 / Session 4
Session title: Build Sprint Part 2
Day and duration: Day 4, 4 hours
Session outcomes
- Students improve reliability and handle at least one edge case
- Students test the workflow with multiple inputs
- Students reach a demonstrable working version
Session agenda
0:00-0:15 | Sprint reset
- Activity: Review Day 3 evidence and set shipping targets
- Facilitator moves:
- Have students define the minimum viable working agent
- Make students cut optional features before resuming build
- Student outputs:
- Final ship target
0:15-1:15 | Build block 1: stabilize the happy path
- Activity: Students fix the most important broken step
- Facilitator moves:
- Keep them on the main path until it succeeds consistently
- Ban feature creep during this block
- Student outputs:
- Stable happy-path run
1:15-1:35 | Lesson: common agent failure modes
- Activity: Targeted instruction
- Failure modes:
- Goal drift
- Infinite loops
- Hallucinated tools
- Premature termination
- Output not matching user need
- Facilitator moves:
- Tie each failure mode to what students are seeing in the room
- Give one concrete fix pattern for each
- Student outputs:
- Self-diagnosis of likely failure mode
1:35-2:20 | Build block 2: add one control mechanism
- Activity: Students improve reliability by adding one of:
- Validation step
- Human review checkpoint
- Explicit stop condition
- Required output schema
- Retry or fallback instruction
- Facilitator moves:
- Require students to articulate why the control matters
- Favor simple validation over complex autonomy
- Student outputs:
- Workflow v2 with one reliability feature
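Two of the control options above, a required output schema and a retry-then-fallback rule, can be shown in one short sketch. The field names and the `run_with_retry` helper are hypothetical examples of the pattern, not part of any student's actual stack:

```python
# Hedged sketch of one control mechanism: required output schema
# plus a single retry, then a hand-back to a human.
REQUIRED_FIELDS = {"summary", "sources", "next_actions"}

def validate(output: dict):
    missing = REQUIRED_FIELDS - output.keys()
    return (len(missing) == 0, missing)

def run_with_retry(step, max_attempts=2):
    """step() stands in for whatever produces the agent's output."""
    missing = REQUIRED_FIELDS
    for attempt in range(max_attempts):
        output = step()
        ok, missing = validate(output)
        if ok:
            return output
    # fallback: escalate to a human instead of looping forever
    raise RuntimeError(f"Validation failed after {max_attempts} attempts: missing {missing}")

# Toy step that omits fields on the first try, succeeds on the second
attempts = iter([
    {"summary": "s"},
    {"summary": "s", "sources": [], "next_actions": []},
])
result = run_with_retry(lambda: next(attempts))
print(sorted(result))  # ['next_actions', 'sources', 'summary']
```

This favors simple validation over complex autonomy, which is exactly what the facilitator moves above ask for.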
2:20-2:35 | Break and peer debug
- Activity: Students pair up and test each other's agent
- Facilitator moves:
- Tell peers to break the workflow, not praise it
- Require specific bug reports
- Student outputs:
- Peer bug report
2:35-3:20 | Build block 3: edge case plus second test
- Activity: Students run at least two test cases and one edge case
- Facilitator moves:
- Require evidence for each run
- Push students to test the exact scenario they fear most
- Student outputs:
- Test log with results for:
- Test Case 1
- Test Case 2
- Edge Case
3:20-3:45 | Documentation block
- Activity: Students prepare a compact project record
- Required contents:
- What the agent does
- Input
- Output
- Tools used
- Prompt summary
- Known limitations
- What still fails
- Facilitator moves:
- Stress honesty over polish
- Explain that documentation is part of real agent design
- Student outputs:
- Agent project sheet
3:45-4:00 | Ship review
- Activity: Instructor checks if the agent is demo-ready
- Facilitator moves:
- Sort students into:
- working
- almost working
- needs rescue scope cut
- Student outputs:
- Demo readiness status
Artifacts produced
- Workflow v2 or working agent
- Reliability enhancement
- Test log with multiple runs
- Peer bug report
- Agent project sheet
Facilitator prep
- Failure mode examples
- Peer debug template
- Demo readiness checklist
Module 4 and Module 5 / Session 5
Session title: Evaluate, Demo, and Roadmap
Day and duration: Day 5, 2 hours
Session outcomes
- Students can evaluate whether their agent actually solves the intended problem
- Students can explain failure modes and future improvements
- Students present a working or near-working agent with evidence
Session agenda
0:00-0:20 | Lesson: the evaluation problem
- Activity: Short instruction
- Content:
- Why plausible outputs are not enough
- What counts as evidence of success
- When to use human review
- How to detect goal drift and premature completion
- Facilitator moves:
- Contrast "looks smart" with "completed the job"
- Show one example of output that is polished but wrong
- Student outputs:
- Personal checklist: how I know my agent works
0:20-0:40 | Exercise: Agent Surgery
- Activity: Diagnose a broken workflow
- Facilitator moves:
- Give students a pre-made failing agent example
- Ask them to name the failure mode, likely cause, and fix
- Student outputs:
- Short diagnostic response
0:40-1:30 | Demo day
- Activity: 5-minute presentations with evidence
- Student demo structure:
- Problem
- Workflow
- Live run or recorded run evidence
- What failed and what changed
- What comes next
- Facilitator moves:
- Keep time aggressively
- Require honesty about limitations
- Ask one question about evaluation or reliability for each presenter
- Student outputs:
- Demo presentation
1:30-1:45 | Peer feedback
- Activity: One specific strength and one specific improvement per presenter
- Facilitator moves:
- Ban empty praise
- Encourage comments on scope, clarity, and reliability
- Student outputs:
- Written peer feedback
1:45-2:00 | Closing: the agent operating system
- Activity: Reflection and next-step roadmap
- Facilitator moves:
- Frame each shipped agent as a reusable capability
- Encourage weekly agent sprints and monthly tool reviews
- Name the next frontier: multi-agent systems, agent-to-agent calls, feedback-driven improvement
- Student outputs:
- Next-version roadmap
- Reflection on architect vs. operator mindset
Artifacts produced
- Evaluation checklist
- Agent Surgery response
- Demo recording or live presentation
- Peer feedback set
- Next-version roadmap
Facilitator prep
- Broken-agent exercise
- Demo timer and order
- Closing reflection prompt
## 5. Assignments and Artifacts
Assignment 1: Agent Opportunity Audit
When: Day 1
Purpose: Identify viable agent opportunities from the student's real life
Submission requirements
- Minimum 10 tasks from the student's recent routine
- Each labeled as manual, delegatable, or human-judgment-heavy
- Top 3 agent opportunities ranked with a brief reason
Artifact produced
- Opportunity audit sheet
Assignment 2: Goal Autopsy and Agent Blueprint
When: Day 2
Purpose: Turn one selected opportunity into a buildable workflow
Submission requirements
- One clear problem statement
- Success definition
- Stepwise workflow map
- Tool list
- Prompt set
- Guardrails
- At least 2 normal test cases and 1 failure or edge case
Artifact produced
- Agent Blueprint v1
Assignment 3: Build Sprint Log
When: Days 3-4
Purpose: Capture implementation progress, test evidence, and iteration decisions
Submission requirements
- Date and version markers
- Screenshots or logs from at least 3 runs
- Notes on prompt or workflow changes
- At least 2 failures documented with cause and attempted fix
- Clear statement of what was cut or simplified
Artifact produced
- Build Sprint Log
Assignment 4: Working Agent and Project Sheet
When: Day 4 end
Purpose: Produce a usable agent with enough documentation for evaluation and demo
Submission requirements
- The actual workflow or reproducible setup
- Input example
- Output example
- Tool stack used
- Known limitations
- Instructions for how to run it
Artifact produced
- Working agent or runnable workflow
- Agent project sheet
Assignment 5: Demo and Reflection
When: Day 5
Purpose: Explain what was built, demonstrate reliability, and articulate next steps
Submission requirements
- 5-minute demo
- Evidence of at least one successful run
- One major failure mode encountered
- One improvement planned
- Reflection on what changed in the student's understanding of AI agents
Artifact produced
- Demo deck or live walkthrough
- Reflection note
Required end-of-course artifact bundle
Each student should leave with:
- Agent Opportunity Audit
- Agent Blueprint
- Prompt set
- Build Sprint Log
- Working agent or runnable workflow
- Test log with multiple runs
- Agent project sheet
- Demo artifact
- Reflection and next-version roadmap
## 6. AI/LLM Grading and Assessment Framework
Assessment philosophy
The source curriculum is explicit: execution matters more than polish. Therefore the grading system must reward:
- Real utility
- Clear design thinking
- Evidence of testing
- Honest debugging
- Practical scope choices
It must not over-reward:
- Fancy language
- Complex tooling without reliable outcomes
- Ambition without execution
Recommended grading weights
- Working Agent: 40%
- Agent Blueprint and Goal Decomposition: 20%
- Testing and Evaluation Evidence: 20%
- Demo and Explanation: 10%
- Peer Feedback Participation: 10%
This slightly expands the source assessment framework by splitting "Working Agent" from "Testing and Evaluation Evidence." That change is intentional: a workflow that merely looks functional but lacks credible test evidence should not receive top marks.
What LLMs should grade directly
LLMs are well suited for:
- Checking completeness of written artifacts
- Evaluating clarity and specificity of goals and prompts
- Assessing whether workflow descriptions are coherent
- Comparing student work against rubric descriptors
- Producing formative feedback aligned to evidence
What LLMs should not decide alone
Human review is required or strongly recommended for:
- Whether the agent actually ran successfully when evidence is ambiguous
- Whether screenshots or logs are authentic
- Safety concerns or inappropriate automation choices
- Final grade overrides in borderline cases
Submission package for LLM evaluation
To evaluate consistently, the evaluator should receive:
- Student identifier
- Agent Opportunity Audit
- Agent Blueprint
- Prompt set
- Build Sprint Log
- Test log
- Agent project sheet
- Demo summary or transcript
- Run evidence excerpts
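A first-pass completeness check on the bundle can be automated before any LLM scoring runs. The artifact keys below are illustrative labels for the items listed above, not a prescribed file format:

```python
# Sketch: automated completeness check on a submission bundle.
# Key names are hypothetical labels for the required artifacts.
REQUIRED_ARTIFACTS = [
    "opportunity_audit", "agent_blueprint", "prompt_set",
    "build_sprint_log", "test_log", "project_sheet",
    "demo_summary", "run_evidence",
]

def missing_artifacts(bundle: dict) -> list:
    """bundle maps artifact name -> content; empty or absent counts as missing."""
    return [a for a in REQUIRED_ARTIFACTS if not bundle.get(a)]

bundle = {"opportunity_audit": "...", "agent_blueprint": "...", "prompt_set": "..."}
print(missing_artifacts(bundle))
```

Gating evaluation on a complete bundle keeps the LLM from scoring (and inventing evidence for) artifacts that were never submitted.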
Core LLM evaluation heuristics
The evaluator should inspect the following:
1. Problem clarity
- Is the problem concrete, real, and narrow enough to build in one week?
- Does the submission name a user, input, process, and output?
2. Agent suitability
- Is the workflow actually a candidate for delegation?
- Does the student distinguish agent work from human judgment?
3. Workflow coherence
- Do the steps logically connect?
- Are tools matched to tasks?
- Is there a clear start and stop condition?
4. Prompt quality
- Does the prompt specify role, goal, constraints, and output format?
- Does it say what to do when uncertain or when a tool fails?
5. Evidence of execution
- Is there proof of actual runs?
- Are outputs tied to inputs?
- Is at least one result demonstrably successful?
6. Reliability and testing
- Did the student test more than once?
- Did they include an edge case?
- Did they use validation, guardrails, or human review appropriately?
7. Debugging quality
- Did the student identify real failure modes?
- Did they respond with specific changes rather than vague complaints?
8. Reflection and transfer
- Can the student explain what they learned about agent design?
- Can they name a credible next iteration?
Concrete assessment heuristics for LLM scoring
The LLM should apply these rules:
- Score down if the "agent" is just a one-off chat answer with no multi-step workflow, no defined outcome, and no repeatable process.
- Score down if the project goal remains broad, such as "help me with school," without a bounded workflow.
- Score down if the artifact bundle lacks evidence of more than one test.
- Score down if the student cannot state what success looks like.
- Score down if tools are named but not actually used in the described workflow.
- Score down if the output is polished but the student provides no failure analysis.
- Score up if the student made smart scope cuts that increased reliability.
- Score up if guardrails, validation, or review checkpoints are intentionally placed.
- Score up if the student demonstrates awareness of where the agent should stop and hand back to a human.
- Score up if the workflow is simple, repeatable, and clearly useful.
Pass threshold guidance
Pass / proficient baseline
- The student built a repeatable workflow that solves one real task at least once with evidence
- The student can explain its logic and limitations
- The student has documented at least one iteration based on failure
Strong pass / distinction
- The workflow succeeds across multiple runs
- The project is well scoped and clearly useful
- The student shows thoughtful evaluation and reliability improvements
Needs revision
- The workflow is mostly conceptual or incomplete
- Evidence is missing or weak
- The student confuses an assistant output with an agent workflow
## 7. Rubrics, Scoring Criteria, and Evaluator Prompt Guidance
Rubric overview
Use a 4-point scale for each criterion:
- 4 = Exceeds
- 3 = Meets
- 2 = Approaching
- 1 = Not yet
Criterion A: Problem Selection and Scope
4
- Problem is real, specific, valuable, and appropriately scoped for one week
- Student clearly identified what the agent should and should not do
3
- Problem is relevant and mostly well scoped
- Minor ambiguity remains, but build target is clear
2
- Problem is somewhat vague or slightly too broad
- Scope required facilitator intervention to become buildable
1
- Problem remains abstract, unrealistic, or not suited to agent delegation
Criterion B: Workflow and Goal Decomposition
4
- Workflow is explicit, stepwise, and feasible
- Human vs. agent responsibilities are clearly separated
- Stop condition is defined
3
- Workflow is coherent with only minor gaps
- Most steps are feasible and connected
2
- Workflow has important missing transitions, unclear steps, or tool mismatches
1
- Workflow is fragmented, implausible, or cannot be followed
Criterion C: Prompt and Guardrail Design
4
- Prompts are structured, precise, and include role, goal, constraints, outputs, and error handling
- Guardrails meaningfully reduce failure risk
3
- Prompts are generally strong but miss one important element
- Guardrails exist but may be light
2
- Prompts are partially useful but vague, underspecified, or inconsistent
1
- Prompts are generic, incomplete, or unusable for reliable execution
Criterion D: Working Agent Execution
4
- Agent completes the intended task reliably across multiple runs
- Evidence clearly links inputs, process, and outputs
3
- Agent completes the intended task at least once and mostly works
- Some fragility remains
2
- Agent partially works or only works with heavy intervention
1
- Agent does not demonstrate successful task completion
Criterion E: Testing and Evaluation
4
- Student ran multiple tests, included an edge case, documented results, and added reliability controls
3
- Student completed at least two tests and documented basic outcomes
2
- Student tested minimally or incompletely
1
- Testing evidence is missing or superficial
Criterion F: Debugging and Iteration
4
- Student identified specific failure modes and made targeted, justified fixes
3
- Student documented at least one real issue and one sensible adjustment
2
- Student noticed problems but responses were vague or ineffective
1
- Student provides little evidence of iteration or learning from failure
Criterion G: Demo and Explanation
4
- Presentation is clear, concrete, honest about limits, and grounded in evidence
3
- Presentation explains the project competently with minor gaps
2
- Presentation is understandable but vague, overly polished, or missing key evidence
1
- Presentation does not clearly explain the project or its outcome
Criterion H: Peer Feedback
4
- Feedback is specific, actionable, and grounded in the peer's demonstrated workflow
3
- Feedback is constructive and relevant
2
- Feedback is generic or only partially useful
1
- Feedback is missing, superficial, or non-constructive
Suggested scoring model
Recommended weights by criterion:
- A: 10%
- B: 15%
- C: 10%
- D: 25%
- E: 15%
- F: 10%
- G: 10%
- H: 5%
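The weighted total works out as a simple dot product of weights and 4-point scores. A minimal sketch, assuming scores are reported as a percentage of the maximum possible (the function name and output scale are choices made here, not mandated by the spec):

```python
# Sketch of the weighted scoring model: 4-point rubric scores -> 0-100 total.
WEIGHTS = {"A": 0.10, "B": 0.15, "C": 0.10, "D": 0.25,
           "E": 0.15, "F": 0.10, "G": 0.10, "H": 0.05}

def weighted_total(scores: dict) -> float:
    """scores maps criterion -> 1..4. Returns percent of the max score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    raw = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)  # between 1.0 and 4.0
    return round(raw / 4 * 100, 1)

# Example: strong execution (D=4) but weak testing evidence (E=2)
scores = {"A": 3, "B": 3, "C": 3, "D": 4, "E": 2, "F": 3, "G": 3, "H": 3}
print(weighted_total(scores))  # 77.5
```

Note how the D=25% weight keeps a working agent decisive while the E=15% weight still pulls the total down when testing evidence is thin.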
Evaluator prompt guidance for LLM use
Use the following operating rules when prompting an LLM evaluator:
- Require the evaluator to cite evidence from the submission bundle for every score
- Instruct it not to assume unprovided success evidence
- Tell it to reward appropriate scoping and reliability over sophistication
- Tell it to separate "good idea" from "working implementation"
- Require it to flag uncertainty when evidence is incomplete
### Recommended system prompt for an LLM evaluator
```
You are evaluating a student project for the course "AI Agent Build Lab."
Your job is to score the work against the provided rubric using only evidence present in the submission. Do not assume facts not in evidence. Do not reward polish, ambition, or advanced tooling unless the workflow actually works and the student shows proof.
Prioritize:
1. Whether the student defined a concrete problem.
2. Whether the workflow is a real agent-like delegation workflow rather than a one-off assistant answer.
3. Whether there is evidence of successful execution.
4. Whether the student tested, debugged, and improved the workflow.
5. Whether the student can explain limitations honestly.
For each criterion:
- Assign a score from 1 to 4.
- Quote or paraphrase the exact evidence that supports the score.
- State one reason the score is not higher.
Then provide:
- A weighted total score.
- A 3-5 sentence summary.
- 3 prioritized improvement actions.
If evidence is missing, say so explicitly and lower the score accordingly.
```
### Recommended user prompt template for an LLM evaluator
Evaluate the following student submission for the course "AI Agent Build Lab."
Course expectations:
- Students should build a working AI agent or agent workflow that solves a real problem.
- Execution matters more than polish.
- Boring and working beats ambitious and broken.
- Evidence of multiple test runs and at least one iteration is important.
Rubric: [paste rubric criteria and weights]
Student submission: [paste or attach artifact bundle]
Output format:
- Criterion-by-criterion scores with evidence
- Weighted total
- Strengths
- Risks or gaps
- Improvement actions
- Confidence level: High / Medium / Low
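When the evaluator is driven programmatically, the output format above can be checked structurally before a grade is accepted. A minimal sketch, assuming the evaluator echoes the section labels verbatim (that assumption, and the function name, are this sketch's, not the course's):

```python
# Structural check on an LLM evaluator's response: verify that the
# required sections appear and that the confidence level is one of the
# three allowed values from the user prompt template.

REQUIRED_SECTIONS = ["Weighted total", "Strengths", "Risks or gaps",
                     "Improvement actions", "Confidence level"]
CONFIDENCE_LEVELS = ("High", "Medium", "Low")

def check_evaluator_output(text: str) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS
                if s not in text]
    if not any(f"Confidence level: {lvl}" in text for lvl in CONFIDENCE_LEVELS):
        problems.append("confidence level is not one of High / Medium / Low")
    return problems
```

A check like this catches the common failure where an evaluator returns a fluent summary but silently drops the weighted total or the confidence declaration.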
### Calibration guidance for evaluators
Before grading a cohort, evaluators should review three anchor examples:
- A clearly excellent simple project
- A competent but fragile project
- A polished but mostly nonfunctional project
This reduces the common LLM error of over-scoring polished language and undervaluing simple, reliable systems.
## 8. Feedback Strategy: What Strong, Average, and Weak Responses Look Like and How an LLM Should Respond
### Feedback principles
Feedback should be:
- Evidence-based
- Specific
- Actionable
- Honest about what works and what does not
- Focused on the next best improvement, not generic encouragement
LLM feedback should avoid:
- Overpraising vague work
- Inventing success that is not shown
- Giving ten suggestions at once
- Criticizing ambition without offering scope-control guidance
### Strong response profile
**What strong work looks like**
- The student chose a concrete problem with real utility
- The workflow is clear and repeatable
- The prompt design is structured and constrained
- There is evidence from multiple runs
- The student can explain one or more failure modes and what changed
**How an LLM should respond**
- Acknowledge the specific strengths with evidence
- Preserve what is already working
- Suggest one or two high-leverage improvements, such as stronger validation or broader test coverage
**Example feedback pattern**
- "Your project is strong because the workflow is narrow, repeatable, and backed by multiple test runs. The clearest evidence is your documented input-output sequence and the edge-case test. The next improvement is to add a validation check before final output so the agent can catch incomplete results."
### Average response profile
**What average work looks like**
- The idea is good and mostly scoped
- The workflow makes sense, but evidence is limited or reliability is shaky
- The prompt or guardrails are only partially specified
- The student shows some iteration but not enough testing
**How an LLM should respond**
- Confirm what is promising
- Identify the main missing element
- Recommend a concrete next step that is feasible within a short iteration
**Example feedback pattern**
- "The project is promising because the problem is real and the workflow is understandable. The main limitation is that the evidence shows only one successful run, so reliability is still unclear. Your next step should be to run two additional tests, including one edge case, and document what fails or changes."
### Weak response profile
**What weak work looks like**
- The problem is vague or too large
- The project is mostly conceptual
- The student presents an assistant-style answer as if it were an agent
- There is little or no test evidence
- Reflection is generic and not tied to the actual build
**How an LLM should respond**
- State plainly what is missing
- Avoid demoralizing language
- Recommend a scope cut and a minimum viable version
- Give a short path back to passing work
**Example feedback pattern**
- "This submission does not yet demonstrate a working agent workflow. The biggest gap is that the artifacts describe what the system should do, but do not show a repeatable multi-step run with evidence. To reach a passing standard, narrow the project to one task, define the exact input and output, run it twice, and document one change you made after a failure."
### LLM response structure for formative feedback
For each student, the LLM should respond in this order:
1. **Current status**
- One sentence: strong, developing, or not yet meeting expectations
2. **What is working**
- Two or three evidence-based observations
3. **What is limiting performance**
- One to three specific gaps
4. **Best next move**
- The single highest-leverage improvement
5. **If there is time for one more improvement**
- One optional secondary action
### Tone guidance for LLM feedback
The tone should be:
- Direct
- Specific
- Non-patronizing
- Grounded in evidence
The tone should not be:
- Hype-heavy
- Vague
- Overly harsh
- Generic praise followed by generic critique
### Instructor use of LLM feedback
Facilitators should use LLM feedback as:
- A first-pass evaluation aid
- A consistency tool across multiple student projects
- A way to generate draft written comments quickly
Facilitators should still review:
- Borderline grades
- Cases with unclear evidence
- Cases involving safety or privacy concerns
## 9. Recommended Implementation Notes
- Build templates should be prepared before the course starts: opportunity audit, blueprint, build log, test log, peer bug report, project sheet, and demo prompt.
- Facilitators should maintain a visible "common failure modes" board during Days 3-4.
- Every student should be pushed to save evidence as they go; otherwise demo day becomes storytelling rather than proof.
- If a student is far behind by Day 4, the required intervention is scope reduction, not motivational coaching.
- The instructor demo should use a real problem and include at least one visible failure plus iteration, so students see debugging as normal.