Field Notes — Agentic AI for Serious Engineers

FN-003 · 2026-05-26

Anthropic just told you the harness is the product

Three Anthropic moves in April and May 2026, an independent essay from January, and a late-2025 preprint converge on the same point. The unit of evaluation is harness plus model plus task, not the model alone.

8 min read →

FN-002 · 2026-05-17

Your eval rubric needs failure buckets, not just scores

Two agents can score identically on a benchmark and fail differently in production. The aggregate is not a rubric. It is what falls out of one.

7 min read →

FN-001 · 2026-05-14

The multi-agent papers nobody is citing

Three studies published since March 2025 put multi-agent failure rates between 41 and 86.7 percent and error amplification up to 17.2x over single-agent baselines. The vendor decks have not caught up.

8 min read →