Evaluating Agent Performance
Metrics that matter for agents
Four metrics matter: task success rate (did the agent achieve the goal?), average steps or iterations to completion, cost per task, and failure mode frequency (which specific steps tend to fail). Track all four, not just whether the agent "worked once".
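As an illustration, the four metrics can be computed from a list of per-run records. This is a minimal sketch, not a prescribed format: the `RunRecord` fields, the `summarize` helper, and the sample values are all hypothetical and would need to match whatever your agent actually logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    """One agent run on one task (hypothetical log format)."""
    task_id: str
    succeeded: bool
    steps: int
    cost_usd: float
    failed_step: Optional[str] = None  # step that failed, if known


def summarize(runs: List[RunRecord]) -> dict:
    """Aggregate success rate, average steps, cost per task, and failure modes."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "success_rate": sum(r.succeeded for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "cost_per_task": sum(r.cost_usd for r in runs) / n,
        # Count how often each step is the point of failure.
        "failure_modes": Counter(
            r.failed_step for r in runs if not r.succeeded and r.failed_step
        ),
    }


# Made-up sample data, for illustration only.
runs = [
    RunRecord("t1", True, 4, 0.02),
    RunRecord("t2", False, 9, 0.05, failed_step="search_tool"),
    RunRecord("t3", True, 5, 0.03),
    RunRecord("t4", False, 7, 0.06, failed_step="search_tool"),
]
summary = summarize(runs)
print(summary)
```

Keeping the failure counts next to the success rate makes it easier to see whether one step accounts for most of the failures.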
Building a small test suite
Create 10-20 representative test cases covering typical tasks and known edge cases. Run your agent against this suite whenever you change the prompt or tools, to catch regressions before they reach production.
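A suite like this can be a list of cases plus a small runner. The sketch below assumes the agent is exposed as a function that takes a prompt string and returns a string. `AgentTestCase`, `run_suite`, `find_regressions`, and `toy_agent` are hypothetical names for illustration, and the toy agent stands in for a real one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentTestCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def run_suite(agent: Callable[[str], str], cases: List[AgentTestCase]) -> Dict[str, bool]:
    """Run every case and record pass/fail; an exception counts as a failure."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            results[case.name] = False
    return results


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Names of cases that passed in the baseline run but do not pass now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call (assumption: str in, str out).
    return prompt.upper()


cases = [
    AgentTestCase("typical_greeting", "hello", lambda out: out == "HELLO"),
    AgentTestCase("edge_empty_input", "", lambda out: out == ""),
    AgentTestCase("edge_needs_digits", "abc", lambda out: any(c.isdigit() for c in out)),
]

results = run_suite(toy_agent, cases)
baseline = {case.name: True for case in cases}  # pretend an earlier run passed everything
regressions = find_regressions(baseline, results)
print(results, regressions)
```

Saving the results of each run lets you compare the next run against them after a prompt or tool change, which is what surfaces regressions.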
Iterating based on failures
When a test case fails, read the actual trace and ask what caused the failure: a bad tool description, an ambiguous system prompt, or a genuinely hard edge case? Fix that specific root cause rather than broadly rewriting the whole prompt.
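One way to keep this diagnosis organized is to label each failed case with the root cause you found in its trace, then group the labels. This is only a sketch: the `RootCause` categories mirror the three causes named above, the labels are assigned by hand after reading the trace, and the case names are invented.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List, Tuple


class RootCause(Enum):
    TOOL_DESCRIPTION = "bad tool description"
    SYSTEM_PROMPT = "ambiguous system prompt"
    HARD_EDGE_CASE = "genuinely hard edge case"


def group_by_root_cause(
    diagnoses: List[Tuple[str, RootCause]]
) -> Dict[RootCause, List[str]]:
    """Map each root cause to the failed test cases attributed to it."""
    grouped = defaultdict(list)
    for case_name, cause in diagnoses:
        grouped[cause].append(case_name)
    return dict(grouped)


# Hand-assigned labels from reading traces (invented examples).
diagnoses = [
    ("refund_lookup", RootCause.TOOL_DESCRIPTION),
    ("multi_currency", RootCause.HARD_EDGE_CASE),
    ("order_status", RootCause.TOOL_DESCRIPTION),
]

grouped = group_by_root_cause(diagnoses)
# List the most common cause first so the fix targets it directly.
for cause, names in sorted(grouped.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"{cause.value}: {names}")
```

If most failures cluster under one cause, a targeted fix there is likely to help more than a rewrite of the whole prompt.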
Key Takeaways
- Track success rate, steps to completion, cost per task, and failure modes.
- Build a small, representative test suite of 10-20 cases.
- Re-run the test suite whenever you change prompts or tools.
- Diagnose and fix the specific root cause of each failure rather than broadly rewriting the prompt.
Build a mini test suite
Write 5 representative test cases for an agent you've built, run them, and log the success rate and any failure patterns you notice.