
The Harness Is the Platform

The model did not change. The product did.

For several weeks earlier this year, Sonnet 4.5 users on every Claude-adjacent channel were reporting the same set of symptoms. Responses wrapping up earlier than they should have. Tool calls dropping mid-sequence. Long contexts going stale ten thousand tokens before the window was full. The community pattern-matched the obvious explanation: a silent model swap, a quantization, a quiet quality cut.

Anthropic’s post-mortem pointed somewhere else. The weights had not moved. What had moved was the harness. The runtime wrapper around the model. The context management policy, the system instructions, the tool wiring, the operating defaults. Same model, different harness, different product.

That is unusually clean evidence for an industry that has spent two years arguing about model regressions without a clear causal lever. The model is a component. The thing the user experiences is the harness with the model loaded into it. When the harness shifts, the product shifts, and no benchmark on any leaderboard captures the move because the leaderboard does not know the harness exists.

The same week Anthropic published the post-mortem, it shipped Managed Agents with a phrase that did not exist in the discourse three months earlier: harnesses encode assumptions that go stale as models improve. The vendor that trains the model is saying the wrapper around the model is where the assumptions live. Where the staleness lives. Where the product lives.

This is not a small framing shift. It rearranges what an agent vendor is selling, what an enterprise is buying, what an evaluator is measuring, and what a regulator should be auditing. The harness is the platform, and the model is the part you stopped paying for.

Why “harness” became a load-bearing word in 2026

Managed Agents shipped in April with the framing language baked into the launch material. The platform docs repeat the point. Anthropic is no longer describing itself as a model provider with some agent scaffolding on top. It is describing itself as a harness vendor whose harness happens to ship with a first-party model. The product surface is the harness. The model is interchangeable inside it, and Anthropic says so out loud.

The degradation story is the second beat. The same vendor that named the harness then traced its own quality regression to harness changes rather than weights. The vendor closest to the model is telling buyers that the model is not where the variance lives. VentureBeat picked it up and framed the next enterprise contest as the agent control plane. The same week, Phil Schmid published “The importance of Agent Harness in 2026” from outside the Anthropic orbit. Independent voice, same term, same argument. The word has escaped vendor vocabulary.

The point of the section is not that the harness is new. Anyone who has shipped an agent system has been building one, usually without calling it that. The point is the naming. Until a layer is named, it is not budgeted, not benchmarked, not bought, not regulated. The harness has been the platform for at least two years. Only now does it have a name.

It is worth being precise about the term, because adjacent terms collide. An agent framework is a developer-facing library. An orchestrator is a workflow runner above the agent loop. A control plane is the management surface across many agents. The harness is the runtime around a single agent: the layer that decides what the model sees, what tools it can call, what hooks fire, what sub-agents it can spawn, what it remembers, and what it is allowed to touch. Frameworks, orchestrators, and control planes all sit outside the harness. The harness is where the model meets the world.

What naming changes is procurement. A buyer cannot put a line item against an unnamed surface. A regulator cannot audit one. A research team cannot compare two of them. The same effect played out a decade ago when the industry stopped calling cloud orchestration “the stuff between the VMs” and started calling it a control plane. Budgets followed the term. Benchmarks followed the budgets. The agent harness is sitting at the same point in the cycle now, with the same vendors and the same analysts on the way to noticing they have been buying it without a label for two years.

The anatomy of a harness

[Figure: The harness anatomy. The model sits at the center; the harness owns the perimeter: context budget, tool dispatch, hooks, sub-agents, memory, permissions.]

A harness is a runtime that wraps a model with six components. Each component is a policy surface. Each component is a variance source. Each is something a vendor can change without touching the weights, shifting the product underneath you.

Context budget. The harness decides what enters the window, what gets compacted, what gets retrieved on demand, and what gets dropped. Anthropic’s recent additions, including what its team has called context anxiety mitigation, sit here. Two harnesses with the same model and different compaction strategies produce two different products. Most public benchmarks do not disclose the context policy under which the score was produced, which means most public benchmarks are not reproducible at the harness level.
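To make the compaction point concrete, here is a minimal sketch of one possible context policy: keep the system message, then keep the newest turns that fit. Everything in it is an assumption for illustration. The Message type, the fit_to_budget function, and the word-count tokenizer are hypothetical and do not describe how any vendor's harness manages its window.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    text: str


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(messages: list[Message], budget: int) -> list[Message]:
    """Keep every system message, then the newest other messages that fit."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    remaining = budget - sum(count_tokens(m.text) for m in system)

    kept: list[Message] = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(m.text)
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(m)
        remaining -= cost
    return system + list(reversed(kept))


history = [
    Message("system", "You are a careful coding agent."),
    Message("user", "Please refactor the parser module."),
    Message("assistant", "Reading parser.py now."),
    Message("user", "Also add tests."),
]

# Same history, two budgets: the model would see two different conversations.
roomy = fit_to_budget(history, budget=12)
tight = fit_to_budget(history, budget=9)
```

With the roomier budget the original refactoring request is already gone; with the tighter one the assistant's own last step disappears as well. Nothing about the model changed between the two calls.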

Tool dispatch. Which tools the model can see, in which order, with which schemas, with which parallelism, and with which error semantics on failure. The harness decides. Promptfoo’s evaluation guidance for coding agents names the problem directly: when you measure an agent on a task, you are measuring the system, not the model. Tool dispatch is the most visible part of that system, and the part that shifts most often between releases.

Hooks. Pre-tool and post-tool callbacks for policy, observability, redaction, and audit. Claude Code’s hook system is the reference shape, and it is the part of the harness that turns it from a developer convenience into something a regulated environment can sign off on. Hooks are where logging, approval gates, and runtime guardrails attach. A harness without hooks is not a harness an enterprise can deploy. It is a demo.
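The shape of a hook layer can be sketched in a few lines. This is an illustrative toy, not Claude Code's hook API: HookedToolRunner, the hook signatures, and the sample tools are all hypothetical names. Pre-tool hooks run before the call and may raise to block it; post-tool hooks may rewrite the result before the model sees it.

```python
from typing import Any, Callable

PreHook = Callable[[str, dict], None]       # may raise to block the call
PostHook = Callable[[str, dict, Any], Any]  # may rewrite the result


class HookedToolRunner:
    def __init__(self) -> None:
        self.tools: dict[str, Callable[..., Any]] = {}
        self.pre_hooks: list[PreHook] = []
        self.post_hooks: list[PostHook] = []

    def call(self, name: str, args: dict) -> Any:
        for hook in self.pre_hooks:
            hook(name, args)  # audit, approval gates, guardrails
        result = self.tools[name](**args)
        for hook in self.post_hooks:
            result = hook(name, args, result)  # redaction, observability
        return result


audit_log: list[str] = []


def audit(name: str, args: dict) -> None:
    audit_log.append(f"call {name} {sorted(args)}")


def approval_gate(name: str, args: dict) -> None:
    if name == "delete_file":
        raise PermissionError("delete_file requires human approval")


def redact_secrets(name: str, args: dict, result: Any) -> Any:
    if isinstance(result, str):
        return result.replace("sk-test-123", "[REDACTED]")
    return result


runner = HookedToolRunner()
runner.tools["read_file"] = lambda path: f"contents of {path}: token=sk-test-123"
runner.tools["delete_file"] = lambda path: f"deleted {path}"
runner.pre_hooks += [audit, approval_gate]
runner.post_hooks.append(redact_secrets)

output = runner.call("read_file", {"path": "config.env"})
```

The ordering is a design choice: audit runs before the approval gate, so a blocked call still leaves a record.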

Sub-agents. Scoped delegation. A planner spawns a researcher, a researcher spawns a verifier, a verifier returns a finding. Each sub-agent runs inside its own harness mini-instance with its own tool surface and its own context budget. The sub-agent layer is where most production agent systems silently rebuild a workflow engine, badly, because they did not realize they were doing so. The harness that names sub-agents as a first-class primitive removes that hidden reimplementation.

Memory. Persistent state across sessions. Anthropic’s dreaming announcement in May is the latest move here, but the field has been converging on persistent memory for some time. MemGPT, Letta, and Mem0 are all prior art the harness layer is now absorbing. Memory is the component that turns an agent from a stateless function into a long-lived collaborator, and the component most likely to produce surprising behavior six months after deployment when the harness has been quietly accumulating opinions the user never inspected.

Permissions. What the agent is allowed to touch. Which files, which network destinations, which credentials, which downstream services, with what scope, under what conditions, and with what audit. This is the most under-specified component in most public harnesses. Tool dispatch decides what tools exist. Permissions decide which calls are allowed to leave the runtime. The two are often conflated. They should not be. Tool dispatch is a developer experience question. Permissions are a risk question, and the harness is where the risk decision is enforced.
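The distinction between dispatch and permissions is easier to see when the two checks are kept apart in code. The sketch below is hypothetical throughout: REGISTERED_TOOLS, PermissionPolicy, and authorize are invented names, and the glob and host rules are only one of many ways a policy could be expressed.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch

# Dispatch question: which tools exist at all.
REGISTERED_TOOLS = {"read_file", "http_get", "run_shell"}


@dataclass
class PermissionPolicy:
    # Risk question: which concrete calls may leave the runtime.
    allowed_paths: list[str] = field(default_factory=list)
    allowed_hosts: list[str] = field(default_factory=list)

    def permits(self, tool: str, args: dict) -> bool:
        if tool == "read_file":
            return any(fnmatch(args["path"], pat) for pat in self.allowed_paths)
        if tool == "http_get":
            return args["host"] in self.allowed_hosts
        return False  # default deny for tools the policy does not cover


def authorize(policy: PermissionPolicy, tool: str, args: dict) -> str:
    if tool not in REGISTERED_TOOLS:
        return "unknown tool"  # dispatch failure
    if not policy.permits(tool, args):
        return "denied"        # permission failure
    return "allowed"


policy = PermissionPolicy(
    allowed_paths=["/workspace/*"],
    allowed_hosts=["api.example.com"],
)
```

Here run_shell is a registered tool that the policy never permits: it exists for dispatch purposes and is still denied at the permission boundary.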

Six components. Six policy surfaces. Six places the product can shift without the model shifting. This is the layer the industry just named, and it is the layer the buyer is actually buying. A vendor that cannot map a release note to one of those six is a vendor whose release notes are marketing copy.

The harness is also the policy boundary

Naming the harness exposes a gap. The six components are policy surfaces, but most public harnesses ship without a policy primitive worthy of the surface. Hooks fire arbitrary code. Sub-agents inherit ambient credentials. Permissions are configured by file, not signed. Memory has no provenance. The harness is the layer where policy lives, and most harnesses have not yet noticed.

This is the bridge to AIP, the Agent Identity Protocol. AIP names what the harness needs and most harnesses do not yet have.

Hooks map to policy enforcement. A hook can call out to a policy decision point on every tool invocation. AIP Gateway is one shape of that decision point, sitting between the harness and the downstream service, inspecting the token and the requested operation and returning a verdict. Without that decision point, the hook is logging-only theater.

Sub-agents map to scoped delegation. When a planner spawns a researcher, the researcher should not inherit the planner’s full authority. It should receive a derived, signed identity that names which subset of capabilities it carries and which it does not. AIP’s signed identity per sub-agent is what the harness needs once it admits that sub-agents are first-class.

Permissions map to attenuation. Tokens that can be cryptographically constrained at the point of delegation, so a sub-agent cannot widen its own authority and a downstream service can verify the constraint without trusting the caller. Every harness eventually grows this primitive, usually after a sub-agent does something it should not have been able to do and the incident review finds that the harness was trusting a caller’s word on its own scope.
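One well-known way to get tokens that can be narrowed but not widened is a chained HMAC, in the style of macaroons. The sketch below illustrates that general technique and is not AIP's token format. The Token type, the "tool=" caveat syntax, and the function names are assumptions, and in this variant the verifier must share the root key with the issuer.

```python
import hashlib
import hmac
from dataclasses import dataclass


def _mac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


@dataclass(frozen=True)
class Token:
    identifier: str
    caveats: tuple[str, ...]
    signature: bytes


def mint(root_key: bytes, identifier: str) -> Token:
    return Token(identifier, (), _mac(root_key, identifier))


def attenuate(token: Token, caveat: str) -> Token:
    # Needs only the current token, not the root key, so any holder can
    # narrow a token. The old signature becomes the key for the new one.
    new_sig = _mac(token.signature, caveat)
    return Token(token.identifier, token.caveats + (caveat,), new_sig)


def verify(root_key: bytes, token: Token, requested_tool: str) -> bool:
    # Recompute the signature chain from the root key.
    sig = _mac(root_key, token.identifier)
    for caveat in token.caveats:
        sig = _mac(sig, caveat)
    if not hmac.compare_digest(sig, token.signature):
        return False
    # Every caveat has the hypothetical form "tool=a,b,c" and all must allow the call.
    for caveat in token.caveats:
        key, _, value = caveat.partition("=")
        if key != "tool" or requested_tool not in value.split(","):
            return False
    return True


ROOT_KEY = b"demo-root-key-not-for-production"
planner = attenuate(mint(ROOT_KEY, "planner-1"), "tool=search,read_file,write_file")
researcher = attenuate(planner, "tool=search,read_file")
```

A sub-agent holding the researcher token cannot strip the second caveat, because recovering the earlier signature would require inverting the HMAC. The verifier never has to take the caller's word on its own scope.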

Memory maps to identity-bound provenance. Who said this, when, under what authority, with what scope. A memory record without identity is a rumor the agent will eventually believe. A memory record with signed identity is evidence. The harness that treats memory as evidence rather than as accumulated opinion is the harness that will survive an audit.
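A rough sketch of what identity-bound provenance could look like at the record level: each memory entry carries who wrote it, when, and under what scope, plus a tag that breaks if any of those fields change. The field names, sign_record, and is_authentic are hypothetical, and a shared-secret HMAC stands in here for whatever signature scheme a real harness would use.

```python
import hashlib
import hmac
import json


def _tag(key: bytes, record: dict) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sign_record(key: bytes, record: dict) -> dict:
    return {**record, "sig": _tag(key, record)}


def is_authentic(key: bytes, signed: dict) -> bool:
    record = {k: v for k, v in signed.items() if k != "sig"}
    return hmac.compare_digest(_tag(key, record), signed.get("sig", ""))


AGENT_KEY = b"demo-agent-key-not-for-production"

memory = sign_record(AGENT_KEY, {
    "content": "Customer prefers invoices in EUR.",
    "author": "agent:billing-researcher",
    "authority": "delegated-by:planner-1",
    "scope": "account:1234",
    "recorded_at": "2026-05-01T12:00:00Z",
})

# A later edit to the content, without re-signing, no longer verifies.
tampered = {**memory, "content": "Customer prefers invoices in USD."}
```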

The point is not that AIP is the answer. The point is that the harness is now the layer where the identity and policy question lives. Whether AIP becomes the way that question is answered, or whether something else does, the question itself is no longer optional for any harness aiming at enterprise deployment.

The buyer’s question for 2026 and 2027

If the harness is the platform, the buyer’s diligence changes. Three questions belong in every vendor conversation where the word “agentic” appears in the deck.

First. Which harness owns the operating surface? Your harness, Anthropic’s Managed Agents, a third party’s, or a custom build? This is the single question that determines who the buyer’s real platform vendor is. A startup whose product is a thin layer over Managed Agents has a different risk profile than one with a harness of its own. Neither is wrong. The buyer should know which is which, and the vendor should be able to answer without flinching.

Second. What does the harness do that the model alone does not? Be specific about hooks, permissions, memory, and sub-agents. If the answer is a prompt template and some retry logic, the vendor is reselling a model with markup. If the answer names the six components with concrete behavior in each, the vendor has built a platform. Most decks live somewhere between the two, and the question forces the position to be made explicit.

Third. Where do identity and policy live? If the answer is “the model decides,” that is a tell. The model is a probabilistic component. Identity and policy are not probabilistic concerns. They belong in the harness layer, at hook boundaries, with audit. A vendor that cannot point to where authority is verified and where actions are constrained has not built a system a regulated buyer can deploy. The vendor may not yet know this. The buyer can tell them.

The stochasticity result from late last year, “Stochasticity in Agentic Evaluations,” put numbers on what every practitioner already suspected. Intraclass correlation coefficients (ICC) between 0.30 and 0.77 on agentic tasks mean that most published deltas between models are inside the noise floor of harness, seed, and configuration. The published research is a preprint, not a settled finding, but the implication for buyers is direct. A benchmark number without a harness disclosure is a sales asset. A buyer who treats it as evidence is treating sales material as evidence.
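For readers who have not met the statistic, the sketch below computes one common form of the intraclass correlation, the one-way random-effects ICC(1,1), over repeated runs of the same tasks. Which ICC variant the preprint uses is not stated here, so the choice of formula is an assumption, and the scores are made-up illustration data rather than results from any benchmark.

```python
def icc_one_way(scores: list[list[float]]) -> float:
    """ICC(1,1): rows are tasks, columns are repeated runs of the same task."""
    n = len(scores)     # number of tasks
    k = len(scores[0])  # runs per task
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 tasks and 2 runs, same runs per task")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    task_means = [sum(row) / k for row in scores]

    # Between-task mean square: how much tasks differ from each other.
    ms_between = k * sum((m - grand_mean) ** 2 for m in task_means) / (n - 1)
    # Within-task mean square: how much repeated runs of one task disagree.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, task_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all scores identical; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Pass/fail outcomes for 4 tasks, 3 runs each (illustrative data only).
stable_runs = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
noisy_runs = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]

stable_icc = icc_one_way(stable_runs)
noisy_icc = icc_one_way(noisy_runs)
```

An ICC near 1 means reruns of a task agree with each other and differences between tasks are real. An ICC near 0 means rerun noise is as large as the differences being measured, which is the situation in which a small delta between two models tells a buyer very little.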

None of this points to a vendor pick. The harness war has no winner yet, and anyone predicting one is selling something. The buyer’s job for the next two budget cycles is not to back the right horse. It is to ask better questions until the horses are forced to describe themselves accurately. The vendor that answers crisply on harness ownership, on what the harness does beyond the model, and on where identity and policy live is the vendor whose claims can be checked. The vendors who cannot answer are still selling models with markup: the invoice line item says model, and the capability they are charging for lives somewhere else.


AIP: sunilprakash.com/aip/. AIP Gateway: aip-gateway. Managed Agents: Anthropic engineering blog.