Pattern
Failure Buckets
An eval rubric needs categorical failure modes, not just a single score.
When to use
When you need to diagnose what's failing, not just whether it's failing. Bucket failures into specific categories (e.g. "wrong tool selected", "tool output ignored", "fabricated citation"). A single score tells you the system is broken; buckets tell you what to fix.
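A minimal sketch of what bucketing could look like in practice, assuming each eval case has already been graded and, on failure, labeled with one category. The names `FailureBucket`, `EvalResult`, and `summarize` are hypothetical and chosen for illustration; the three named buckets are the examples above, and `OTHER` is an assumed catch-all for failures left unlabeled.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureBucket(Enum):
    # Categories taken from the examples in the text; OTHER is an assumed catch-all.
    WRONG_TOOL_SELECTED = "wrong tool selected"
    TOOL_OUTPUT_IGNORED = "tool output ignored"
    FABRICATED_CITATION = "fabricated citation"
    OTHER = "other"


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    bucket: Optional[FailureBucket] = None  # only meaningful when passed is False


def summarize(results):
    """Return the single score alongside a per-bucket failure count."""
    total = len(results)
    passed = sum(r.passed for r in results)
    # Failures with no label fall into OTHER so that nothing is silently dropped.
    buckets = Counter(
        r.bucket or FailureBucket.OTHER for r in results if not r.passed
    )
    return {
        "pass_rate": passed / total if total else 0.0,
        "failures_by_bucket": {b.value: n for b, n in buckets.most_common()},
    }


results = [
    EvalResult("c1", True),
    EvalResult("c2", False, FailureBucket.WRONG_TOOL_SELECTED),
    EvalResult("c3", False, FailureBucket.WRONG_TOOL_SELECTED),
    EvalResult("c4", False, FailureBucket.FABRICATED_CITATION),
]
summary = summarize(results)
print(summary)
```

In this made-up run the pass rate alone (0.25) only says the system is broken; the bucket counts point at tool selection as the first thing to fix.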
When not to use
When the eval is purely a pass/fail gate with no diagnostic intent. (But that's almost never the right eval.)