Pattern

Failure Buckets

An eval rubric needs categorical failure modes, not just a single score.

When to use

When you need to diagnose what's failing, not just whether it's failing. Bucket failures into specific categories (e.g. "wrong tool selected", "tool output ignored", "fabricated citation"). A single score tells you the system is broken; buckets tell you what to fix.
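
As a minimal sketch of the idea, the Python below tags each graded case with either no failure or one bucket, then reports both the single score and the per-bucket counts. The bucket names come from the examples above. Everything else is an assumption made for illustration: the `FailureBucket` enum, the `results` list, and the helpers `pass_rate` and `bucket_counts` are hypothetical, and a real rubric might allow several buckets per case.

```python
from collections import Counter
from enum import Enum
from typing import List, Optional


class FailureBucket(Enum):
    """Illustrative failure categories, taken from the examples in the text."""
    WRONG_TOOL_SELECTED = "wrong tool selected"
    TOOL_OUTPUT_IGNORED = "tool output ignored"
    FABRICATED_CITATION = "fabricated citation"


# Hypothetical graded results: None means the case passed,
# otherwise the grader assigned exactly one failure bucket.
results: List[Optional[FailureBucket]] = [
    None,
    FailureBucket.WRONG_TOOL_SELECTED,
    FailureBucket.FABRICATED_CITATION,
    FailureBucket.WRONG_TOOL_SELECTED,
    None,
]


def pass_rate(results: List[Optional[FailureBucket]]) -> float:
    """The single score: fraction of cases with no failure."""
    return sum(r is None for r in results) / len(results)


def bucket_counts(results: List[Optional[FailureBucket]]) -> Counter:
    """The diagnostic view: how many failures landed in each bucket."""
    return Counter(r for r in results if r is not None)


if __name__ == "__main__":
    print(f"pass rate: {pass_rate(results):.0%}")
    # Most frequent bucket first, so the biggest problem is at the top.
    for bucket, n in bucket_counts(results).most_common():
        print(f"{bucket.value}: {n}")
```

The pass rate alone says that three of five cases failed. The counts say that two of those three were wrong tool selections, which points at tool routing as the first thing to fix.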

When not to use

When the eval is purely a pass/fail gate with no diagnostic intent. (But that's almost never the right eval.)