The Agent Demo Gap

May 2026

Multi-agent demos have a strange property. The Twitter clip of three AI agents debating each other gets fifty thousand impressions. The same architecture in a Fortune 500 production environment gets a six-month security review and a budget freeze.

Both reactions are correct. The demo is real engineering, sometimes excellent engineering. But the gap between "this works on my machine for sixty seconds" and "this runs as a regulated workflow with auditable delegation" is the entire field. Closing that gap, in my opinion, is what 2026 is about.

What demos optimize for

A multi-agent demo is good when it produces a screenshot. The path from "blank terminal" to "thing my friends will retweet" needs to fit inside a coffee break. The agents need to do something legible. The output needs to be visible without me explaining it. There should be no account creation, no infrastructure setup, no API key forms beyond pasting a single token. If a viewer cannot run the same thing on their laptop within five minutes of seeing the clip, the clip does not convert.

Every design choice in a good demo follows from that constraint. Templates, not blank canvases. Local dashboards, not hosted ones. Plain Python or TypeScript, not custom DSLs. Pre-wired identities, not enrollment flows. The demo's job is to bypass everything the user does not want to think about so they can think about the one thing that will surprise them. If the surprise is good, they share. If they share, the cycle repeats.

The mistake is treating this list as the full specification of what an agent system should do. It isn't. It's the specification for what a demo of an agent system should do. The two are not the same.

What production actually needs

A production multi-agent workflow has different ground truth. The agent's identity needs to be verifiable, not declared. When agent A delegates to agent B, the delegation needs to narrow scope, not pass a blanket bearer token. Every tool call needs to land in an audit trail that survives a legal review. Failure modes need to be safe by default, not interesting. The system needs to be explainable to someone whose first question is "what could go wrong" and whose second question is "show me where you stopped it from going wrong."
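To make the audit-trail requirement concrete, here is a minimal sketch of one common way to build a tamper-evident log: each entry stores the hash of the entry before it, so editing or deleting an earlier event breaks every later link. This is an illustration under stated assumptions, not how clawjam or the Agent Identity Protocol records events. The names `append_event` and `verify_log`, the field layout, and the all-zero genesis value are hypothetical.

```python
import hashlib
import json

# Hypothetical starting value for the first entry's "prev" field.
GENESIS = "0" * 64
FIELDS = ("agent", "action", "detail", "prev")


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(log: list, agent: str, action: str, detail: dict) -> dict:
    """Append an event whose hash covers its content and the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"agent": agent, "action": action, "detail": detail, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_log(log: list) -> bool:
    """Return True only if every entry is intact and linked to its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


audit_log = []
append_event(audit_log, "manager", "token_issued", {"to": "researcher"})
append_event(audit_log, "researcher", "tool_call", {"tool": "web_search"})
print(verify_log(audit_log))
```

A real audit trail would also need signed entries and durable storage. The sketch only shows why events recorded from the first run are easier to defend than events reconstructed later.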

None of those properties are visible in a thirty-second clip. They are not what the demo is selling. They are what the security review is asking for, and the gap between the two has been the practical reason most enterprise multi-agent rollouts have stalled for the last eighteen months. The pattern goes: a developer ships a working demo, an executive funds a pilot, the pilot reaches a security review, and the review surfaces that none of the production primitives are present. The team then either bolts on identity, audit, and scope as a six-month retrofit, or quietly retires the project.

The retrofits do not work well. Identity bolted on after the fact looks like a service mesh sidecar that all the agents have to be rewritten to call. Audit trails added later are missing the events from before the audit existed. Scope narrowing added once the agents are already integrated is a permission rebuild that the agents resist because their working code expected blanket access. Most teams that try this end up with a demo-shaped artifact and a pile of compliance attachments stapled to it.

The pattern most stacks get wrong

The diagnosis I keep landing on is that the multi-agent ecosystem has tried to serve two audiences with the same code. The frameworks (CrewAI, LangChain, AutoGen, Google ADK) optimize for the developer experience that powers the demo loop. They do this well. The protocols and identity layers (MCP, A2A, OAuth, SPIFFE-style mesh identity) optimize for the production properties that the security review is asking about. They do this in their own way. What is missing is the connector that makes those two surfaces visible in the same place at the same time.

Every developer building a CrewAI demo right now has cryptographic identity available to them as an import. Almost none of them use it. Not because they object to the idea. Because the developer experience of adding identity is worse than the developer experience of not adding it. Adding identity means reading a spec, generating keys, configuring a verifier, structuring tokens, and explaining to the team's reviewer why this is necessary for a demo. Skipping identity means the demo runs sixty seconds faster.

This is the trap. The protocols and primitives that win do not win because they have the strongest security argument. They win because the developer who picks up the SDK on a Friday afternoon ends up with a production-shaped pattern by Monday morning, without ever feeling the friction of compliance. SSL won because curl made it easy. JWTs won because they are short and base64url-encoded. Production-grade primitives are adopted on the same gradient as anything else: the path with the lowest friction wins, and the path with the friction is the path nobody takes.

The trick worth running

The interesting design move is to ship the demo loop with the production primitives baked in, not bolted on. Make the boring primitive part of the fun loop. Identity as the visible thing in the dashboard, not the invisible thing in the audit log. Scoped delegation as the cool visual that goes in the screenshot, not the compliance step that nobody screenshots. The demo's job stays the same: get the surprise into the viewer's head. The change is which surprise is on offer.

If the surprise is "watch an agent delegate to an agent that delegates to a tool, with each hop's scope narrowing visible on screen," the screenshot now contains a production primitive. The viewer who shares it is not sharing a demo. They are sharing a working trust topology dressed up as the cool thing. The next person who picks up the same scaffolder gets the same trust topology by default, in the same way they get an HTTP client by default. By the time they wonder how to trust the agent in production, the answer is already in their dependencies.

This is the bet I am making with clawjam, a small open-source scaffolder I shipped this week. Four templates that produce working multi-agent CrewAI projects in sixty seconds, with cryptographic identity per agent and scope-narrowing delegation visible on a local dashboard. One template specifically exists to show off a three-hop delegation chain with each hop's scope narrowing rendered as a connected node graph. That template is the screenshot. The other three are the retention.

None of the users running it have to learn the underlying protocol. None of them have to call themselves "into" identity or "out of" anything. Every project they scaffold ends up with the protocol's reference implementation in their dependencies file. If a small fraction of them, when they ask "wait, how do we trust this in production," find the answer is already wired up, the protocol gets adoption as a side effect of the scaffolder being fun. If that fraction is too small, the project joins the long list of cryptographically correct primitives that nobody uses. I will know in ninety days.

[Figure: clawjam's local dashboard rendering three agents in a delegation chain. The Timeline panel shows identity_created, token_issued, and scope_narrowed events. The AIP Token Chain panel shows three connected nodes representing manager, researcher, and junior_researcher, with each hop's scope narrowing visible. The Tools panel lists tool calls with timing. The Output panel is empty for this run.]

Caption: A scoped delegation chain rendered as the visible thing in the dashboard. Three agents, each hop narrowing scope: `*` → `tools:*` → `tools:web_search`. The screenshot a viewer shares is the production primitive.
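The narrowing rule in that chain can be sketched in a few lines. The scope strings and agent names come from the dashboard described above. The matching semantics are an assumption made for illustration (a bare `*` covers everything, `prefix:*` covers anything under that prefix), and `covers` and `first_widening_hop` are hypothetical helpers, not part of clawjam or the Agent Identity Protocol.

```python
from typing import Optional


def covers(parent: str, child: str) -> bool:
    """True if `child` asks for no more than `parent` already grants.

    Assumed semantics: "*" grants everything, "ns:*" grants anything
    that starts with "ns:", and any other scope grants only itself.
    """
    if parent == "*" or parent == child:
        return True
    if parent.endswith(":*"):
        return child.startswith(parent[:-1])  # keep the trailing ":"
    return False


def first_widening_hop(chain: list) -> Optional[int]:
    """Return the index of the first agent whose scope exceeds its delegator's.

    `chain` is a list of (agent_name, scope) pairs ordered from delegator
    to delegate. Returns None when every hop narrows or keeps the scope.
    """
    for i in range(1, len(chain)):
        _, parent_scope = chain[i - 1]
        _, child_scope = chain[i]
        if not covers(parent_scope, child_scope):
            return i
    return None


chain = [
    ("manager", "*"),
    ("researcher", "tools:*"),
    ("junior_researcher", "tools:web_search"),
]
print(first_widening_hop(chain))
```

For the chain shown, the function returns `None` because every hop narrows. Swapping the last scope for something outside `tools:` would make it return the index of the offending hop, which is the kind of check a verifier could run before honoring a delegated token.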

What this means for engineering leaders

The vendor question that matters in 2026 is not "can it scale" or "is it secure." Both answers are always yes. The vendor question that matters is whether the developer who picks up the SDK on a Friday afternoon ends up with a production-shaped pattern or a demo-shaped pattern by Monday morning. The second framing is uncomfortable because it forces vendors to admit that their production claims and their demo experience are running on different code paths. The first framing lets everyone agree to disagree.

Press the second framing. Ask the vendor for the scaffold-to-running-thing time, then ask which of the production primitives are present in that scaffold. If identity, scope, and audit are not in the dashboard the developer sees on the first run, they will not be in the dashboard the auditor sees on the hundredth run. The retrofit window has been roughly six months for the teams I have watched this happen to, and most of them do not reach the other side of it.

The teams that will reach it are the ones whose first-day developer experience already includes the things their hundredth-day compliance review is going to ask about. The friction of getting there is the entire competitive moat. It is also the entire opportunity for the protocol layer underneath: every developer experience improvement that lowers the friction of running on top of a production-grade primitive is, mechanically, a unit of adoption for that primitive. The fun loop carries the boring primitive. That is the design pattern. The next eighteen months will tell us how robust it is.

Sunil Prakash works on the Agent Identity Protocol, an attempt to put scoped agent identity in the architecture instead of the prompt. clawjam is the scaffolder this piece references; the dev-side write-up of the build, including the empty PyPI wheel that almost stopped it at week six, is on dev.to.