Field Notes
Twice-weekly observations on building agents that survive contact with reality.
Thursdays · Sundays
FN-002 · 2026-05-17
Your eval rubric needs failure buckets, not just scores
Two agents can score identically on a benchmark and fail differently in production. The aggregate is not a rubric. It is what falls out of one.
7 min read →
FN-001 · 2026-05-14
The multi-agent papers nobody is citing
Three studies published since March 2025 put multi-agent failure rates between 41 and 86.7 percent, with error amplification of up to 17.2x over single-agent baselines. The vendor decks have not caught up.
8 min read →