
Evaluating Agent Performance

Metrics that matter for agents

Four metrics matter most: task success rate (did the agent achieve the goal?), average steps or iterations to completion, cost per task, and failure mode frequency (which specific steps tend to fail). Track all four, not just whether the agent "worked once".
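As a rough illustration, the four metrics can be computed from a list of logged runs. The sketch below assumes a hypothetical RunRecord structure with made-up field names and sample values; adapt it to whatever your agent actually logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """One logged agent run. All field names here are illustrative assumptions."""
    task_id: str
    succeeded: bool
    steps: int
    cost_usd: float
    failed_step: Optional[str] = None  # step where the run broke, if known


def summarize(runs):
    """Return success rate, average steps, cost per task, and failure mode counts."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        # Count how often each step is the point of failure.
        "failure_modes": Counter(
            r.failed_step for r in runs if not r.succeeded and r.failed_step
        ),
    }


# Made-up sample data, for illustration only.
runs = [
    RunRecord("t1", True, 4, 0.02),
    RunRecord("t2", False, 9, 0.05, failed_step="search_tool"),
    RunRecord("t3", True, 5, 0.03),
    RunRecord("t4", False, 7, 0.04, failed_step="search_tool"),
]
summary = summarize(runs)
print(summary)
```

Keeping failure modes as a counter rather than a single number makes it easy to see which step fails most often.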

Building a small test suite

Create 10-20 representative test cases covering typical tasks and known edge cases. Run your agent against this suite whenever you change the prompt or tools, to catch regressions before they reach production.
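One possible shape for such a suite is sketched below. The AgentCase class, run_suite, find_regressions, and stub_agent are all hypothetical names invented for this example; stub_agent stands in for your real agent, and the three cases are placeholders for your own 10-20.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentCase:
    """A single test case: an input plus a function that judges the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(agent, cases):
    """Run every case and return {case name: passed}."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            # A crash counts as a failure rather than stopping the suite.
            results[case.name] = False
    return results


def find_regressions(baseline, current):
    """Cases that passed in the baseline run but fail now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


def stub_agent(prompt):
    # Stand-in for a real agent call (assumption for this sketch).
    return "4" if "2 + 2" in prompt else "I don't know"


cases = [
    AgentCase("simple_sum", "What is 2 + 2?", lambda out: "4" in out),
    AgentCase("empty_input", "", lambda out: "don't know" in out),
    AgentCase("capital", "What is the capital of France?", lambda out: "Paris" in out),
]

results = run_suite(stub_agent, cases)
# Pretend every case passed before the latest prompt change.
baseline = {"simple_sum": True, "empty_input": True, "capital": True}
regressions = find_regressions(baseline, results)
print(results, regressions)
```

Saving the results of each run as the next baseline lets you compare before and after every prompt or tool change.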

Iterating based on failures

When a test case fails, read the actual trace and ask what caused it: a bad tool description, an ambiguous system prompt, or a genuinely hard edge case? Fix that specific root cause rather than broadly rewriting the whole prompt.
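A lightweight way to keep this diagnosis honest is to label each failed case with its root cause after reading the trace, then group the labels. The sketch below is only an illustration: the three category names mirror the causes listed above, and the case names are invented.

```python
from collections import defaultdict

# Categories taken from the three causes named in this lesson.
ROOT_CAUSES = {"tool_description", "system_prompt", "hard_edge_case"}


def group_by_root_cause(failures):
    """Group (case name, root cause) pairs so each cause can be fixed on its own."""
    grouped = defaultdict(list)
    for case_name, cause in failures:
        if cause not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {cause!r}")
        grouped[cause].append(case_name)
    return dict(grouped)


# Labels assigned by hand after reading each trace (invented examples).
failures = [
    ("refund_lookup", "tool_description"),
    ("multi_city_trip", "hard_edge_case"),
    ("order_status", "tool_description"),
]
grouped = group_by_root_cause(failures)
print(grouped)
```

If most failures land under one cause, that is where a targeted fix is likely to pay off first.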

Key Takeaways

  • Track success rate, steps to completion, cost per task, and failure modes.
  • Build a small, representative test suite of 10-20 cases.
  • Re-run the test suite whenever you change prompts or tools.
  • Diagnose the specific root cause of each failure rather than broadly rewriting the prompt.

Build a mini test suite

Write 5 representative test cases for an agent you've built, run them, and log the success rate and any failure patterns you notice.