Why Multi-Agent AI Systems Need Identity-Aware Routing
Multi-agent systems are the next frontier of production AI. The idea is straightforward: instead of one monolithic model doing everything, you have specialized agents that delegate tasks to each other. A coding agent hands off a math problem to a reasoning agent. A triage agent routes customer queries to the right specialist. The promise is efficiency, modularity, and better results.
The problem is that the protocols these agents use to communicate — Google's A2A, Anthropic's MCP — treat every model as a black box. They expose skill names and descriptions. They don't expose what actually matters for delegation: what kind of model is behind the agent, how fast it is, how much it costs, whether it reasons well or just pattern-matches quickly.
This is the gap that motivated LDP (LLM Delegate Protocol) — an identity-aware protocol designed for delegation between LLM-based agents. The core thesis: if you're routing tasks between models, you need to know more about those models than their skill labels.
The Routing Problem
Consider a system with three delegates: an 8B parameter reasoning model, a 7B coding specialist, and a lightweight 3B classifier. A user submits a straightforward sentiment classification task. Under A2A, the router sees three agents advertising overlapping skills. It picks one based on skill-name matching — and might route a trivial classification to the 8B reasoning model. The task completes in 35 seconds instead of 3.
This isn't hypothetical. In our experiments, identity-aware routing achieved roughly 12x lower latency on easy tasks by sending them to the lightweight model — because the protocol knew it was lightweight, fast, and good enough for classification. A2A's skill-matching couldn't make that distinction.
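To make the routing decision concrete, here is a minimal sketch of identity-aware selection under stated assumptions: the delegate records, field names (`quality`, `latency_hint_ms`), and threshold are illustrative, not LDP's actual wire format.

```python
# Illustrative sketch of identity-aware routing. Delegate records and
# field names are hypothetical, not the LDP schema.

DELEGATES = [
    {"name": "reasoner-8b",   "quality": {"classification": 0.95}, "latency_hint_ms": 35000},
    {"name": "coder-7b",      "quality": {"classification": 0.70}, "latency_hint_ms": 20000},
    {"name": "classifier-3b", "quality": {"classification": 0.85}, "latency_hint_ms": 3000},
]

def route(task_capability, min_quality=0.8):
    """Pick the fastest delegate whose quality hint clears the bar."""
    good_enough = [d for d in DELEGATES
                   if d["quality"].get(task_capability, 0.0) >= min_quality]
    return min(good_enough, key=lambda d: d["latency_hint_ms"])

chosen = route("classification")  # the lightweight 3B model wins
```

A skill-matching router sees three delegates that all advertise "classification"; the quality and latency hints are what break the tie in the lightweight model's favor.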
The latency difference sounds like an optimization problem, but at scale it becomes an economics problem. Every unnecessary second on a heavy model is wasted compute, wasted tokens, wasted cost. Multiply that across thousands of delegations per hour and the protocol's inability to distinguish models becomes a material expense.
What LDP Adds
LDP introduces five mechanisms that existing protocols lack. Each addresses a specific limitation we observed in production-style multi-agent setups.
1. Delegate Identity Cards
A2A's Agent Card has 7 fields: name, description, version, URL, skills, authentication, and capabilities. LDP's Delegate Identity Card has 20+ fields organized into core identity, trust and security, capabilities, and behavioral profiles.
The critical additions are quality hints (a continuous 0–1 score per capability), reasoning profiles (qualitative: "deep-analytical" vs. "fast-practical"), cost profiles, and latency hints. These are the fields that enable a router to make intelligent delegation decisions rather than guessing from skill names.
A2A's Agent Card vs. LDP's Delegate Identity Card
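A sketch of what a Delegate Identity Card might carry, grouped as the text describes. The field names are illustrative; only the four groupings and the 0–1 quality-hint convention come from the protocol description.

```python
# Sketch of a Delegate Identity Card: core identity, trust/security,
# capabilities, behavioral profiles. Field names are illustrative,
# not the normative LDP schema.

identity_card = {
    "core": {
        "name": "reasoner-8b",
        "parameters": "8B",
        "version": "1.2.0",
    },
    "trust": {
        "trust_domain": "internal.research",
        "signing_key_id": "key-2024-07",   # hypothetical field
    },
    "capabilities": {
        # continuous 0-1 quality hint per capability
        "reasoning": {"quality_hint": 0.92},
        "classification": {"quality_hint": 0.95},
    },
    "behavior": {
        "reasoning_profile": "deep-analytical",
        "cost_profile": "high",
        "latency_hint_ms": 35000,
    },
}
```

The behavioral block is what an A2A Agent Card has no place for, and it is exactly the block a router needs.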
2. Progressive Payload Modes
Not all communication between agents needs to be verbose natural language. LDP defines six payload modes of increasing efficiency, from plain text (Mode 0) to semantic frames (Mode 1, structured JSON with typed fields) to more compact representations.
In practice, semantic frames reduced token consumption by 37% compared to plain text — a statistically significant result (p=0.031) — without any quality degradation. The structured format helps models focus. A2A's JSON wrapping, by contrast, saves only about 7% because it wraps verbose text in a JSON envelope rather than restructuring the communication itself.
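To illustrate the difference, here is a plain-text request next to the same request as a semantic frame. The frame schema shown is an assumption for illustration, not LDP's actual Mode 1 format.

```python
# Mode 0: plain text -- the instruction is spelled out in prose.
mode0 = ("Please classify the sentiment of the following review as positive, "
         "negative, or neutral, and reply with just the label: "
         "'The battery life is outstanding.'")

# Mode 1: semantic frame -- structured JSON with typed fields.
# Schema is illustrative; LDP's actual frame format may differ.
mode1 = {
    "frame": "classify",
    "input": "The battery life is outstanding.",
    "labels": ["positive", "negative", "neutral"],
    "output": {"type": "label"},
}
```

The frame drops the prose scaffolding ("Please…", "reply with just the label") and leaves only typed fields, which is where the token savings come from.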
Delegates negotiate the richest mutually supported mode during session establishment. If a higher mode fails mid-exchange — schema validation error, codec incompatibility — the protocol falls back automatically, from Mode N to Mode N−1 and ultimately to Mode 0. Every delegate must support plain text, so communication never fails entirely. In our simulated failure tests, LDP achieved 100% task completion across all failure types, compared to 35% for A2A.
Six progressive payload modes with automatic fallback. Mode 1 (semantic frames) was empirically validated.
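The fallback chain can be sketched as a loop that steps down one mode at a time. The codec stand-in below is hypothetical; only the fallback order and the Mode 0 guarantee come from the protocol description.

```python
# Sketch of progressive fallback: try the richest negotiated mode, step
# down one mode per failure, and never fail entirely because Mode 0
# (plain text) is mandatory. The encoder is a stand-in.

class ModeError(Exception):
    """Schema validation error or codec incompatibility at a given mode."""

def encode(payload, mode):
    # Stand-in for real per-mode codecs; here only Modes 0 and 1 "work".
    if mode > 1:
        raise ModeError(f"mode {mode} codec unavailable")
    return {"mode": mode, "body": payload}

def send_with_fallback(payload, negotiated_mode):
    for mode in range(negotiated_mode, -1, -1):
        try:
            return encode(payload, mode)
        except ModeError:
            continue  # fall back to the next lower mode
    raise RuntimeError("unreachable: Mode 0 must always succeed")

msg = send_with_fallback("classify this", negotiated_mode=5)
```

Because the loop bottoms out at Mode 0, a delegation can degrade in efficiency but never in availability.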
3. Governed Sessions
A2A is stateless by design. Each request is independent. This means that in a 10-round conversation between agents, the entire context must be re-transmitted with every message. By round 10, 39% of the tokens are pure overhead — context the receiving agent has already seen.
LDP introduces persistent sessions with server-side context. A five-step handshake (HELLO, CAPABILITY_MANIFEST, SESSION_PROPOSE, SESSION_ACCEPT, then task exchange) establishes the session once. After that, context is maintained server-side, eliminating the re-transmission tax.
LDP governed session: setup once, then exchange tasks without re-transmitting context
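The five-step establishment can be sketched as a message sequence. Only the step names come from the protocol description; the payloads and session-ID convention are illustrative.

```python
# Sketch of LDP's five-step session establishment. Message payloads are
# illustrative; only the step names come from the protocol description.

def handshake(client_modes, server_modes):
    """Simulate session setup and return the transcript and chosen mode."""
    transcript = [("HELLO", {"protocol": "ldp"})]
    transcript.append(("CAPABILITY_MANIFEST", {"payload_modes": server_modes}))
    # Propose the richest mutually supported payload mode.
    mode = max(set(client_modes) & set(server_modes))
    transcript.append(("SESSION_PROPOSE", {"payload_mode": mode}))
    transcript.append(("SESSION_ACCEPT", {"session_id": "sess-001"}))
    # From here on, tasks flow without re-transmitting context.
    transcript.append(("TASK", {"payload": "first task", "mode": mode}))
    return transcript, mode

transcript, mode = handshake(client_modes=[0, 1, 2], server_modes=[0, 1])
```

Note that mode negotiation rides on the handshake for free: the capability manifest is exchanged once, so the session's payload mode is settled before the first task is sent.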
The savings are modest at 3 rounds (about 10%) but grow linearly. At 10 rounds, LDP used 12,990 tokens versus A2A's 16,010. For long-running agent collaborations — research tasks, multi-step planning, iterative code review — this overhead compounds fast.
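The shape of the re-transmission tax can be seen in a toy cost model. The token counts below are made up for illustration and do not reproduce the paper's measurements; only the comparison's structure matters.

```python
# Illustrative model of the stateless re-transmission tax.
# Per-round and handshake token counts are invented, not measured.

def stateless_tokens(rounds, per_round=400):
    # Each round re-sends all prior context plus the new message.
    return sum(r * per_round for r in range(1, rounds + 1))

def session_tokens(rounds, per_round=400, handshake=800):
    # Pay the handshake once; then send only the new content each round.
    return handshake + rounds * per_round

three = (stateless_tokens(3), session_tokens(3))    # close at few rounds
ten = (stateless_tokens(10), session_tokens(10))    # gap widens with rounds
```

At few rounds the handshake cost nearly cancels the savings; as rounds accumulate, the stateless side keeps paying for context the session side sent once.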
4. Structured Provenance
When a downstream agent receives results from three upstream delegates and must synthesize a final answer, it helps to know which delegate produced what, with what confidence, and whether that confidence was verified.
This led to one of the more surprising findings in our research: the provenance paradox. Accurate provenance didn't significantly improve synthesis quality over no provenance at all (p=0.47). But noisy provenance — unverified self-reported confidence — actively harmed quality, doubling the variance of output scores. When one delegate's confidence was artificially inflated to 0.99 and marked as verified, the synthesizer over-weighted its output and produced worse decisions.
The design implication is clear: a protocol that exposes confidence without verification may be worse than one that exposes no confidence at all. This is why LDP's provenance structure includes explicit verification.performed and verification.status fields. A2A provides no provenance beyond task completion status.
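A provenance record with explicit verification might look like the sketch below. Only `verification.performed` and `verification.status` are named by the protocol description; the surrounding structure and the `method` field are illustrative.

```python
# Sketch of a provenance record with explicit verification fields.
# Structure is illustrative; only verification.performed and
# verification.status are named in the protocol description.

provenance = {
    "delegate": "classifier-3b",
    "task_id": "task-042",
    "confidence": 0.88,
    "verification": {
        "performed": True,
        "status": "passed",            # e.g. passed | failed | skipped
        "method": "held-out-check",    # hypothetical field
    },
}

def usable_confidence(record):
    """Only trust confidence that was actually verified."""
    v = record["verification"]
    if v["performed"] and v["status"] == "passed":
        return record["confidence"]
    return None  # unverified confidence is treated as absent
```

Discarding unverified confidence, as `usable_confidence` does, is the synthesizer-side defense the provenance-paradox result argues for: better to treat a self-reported 0.99 as absent than to over-weight it.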
5. Trust Domains
A2A relies on transport-level authentication — bearer tokens. This covers "is this request authenticated?" but cannot answer "is this agent allowed to perform this specific action?" or "has this exact message been seen before?"
LDP introduces trust domains — security boundaries within which identity, policy, and transport guarantees are enforced at three levels: per-message signatures with replay protection, session-level trust domain compatibility checks, and a policy engine that validates each task against configurable rules.
In simulated security analysis, LDP detected 96% of attack attempts (untrusted domain joins, capability escalation, replay attacks, cross-domain access) compared to 6% for bearer token authentication. This is a protocol-design evaluation, not an empirical penetration test — the detection rates follow from the presence or absence of the relevant protocol fields. But that's precisely the point: A2A's protocol design doesn't have the fields needed to detect these attack categories.
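The per-message layer can be sketched with an HMAC signature plus a nonce cache for replay rejection. This is an assumed construction for illustration; LDP's actual signature algorithm and key management are not specified here.

```python
import hashlib
import hmac

# Sketch of per-message signing with replay protection: sign each message
# with a shared trust-domain key and reject any nonce seen before.
# The HMAC construction is an illustrative assumption.

KEY = b"shared-trust-domain-key"
seen_nonces = set()

def sign(body: bytes, nonce: str) -> str:
    return hmac.new(KEY, nonce.encode() + body, hashlib.sha256).hexdigest()

def verify(body: bytes, nonce: str, signature: str) -> bool:
    if nonce in seen_nonces:
        return False  # replay: this exact nonce was seen before
    if not hmac.compare_digest(sign(body, nonce), signature):
        return False  # tampered body or wrong trust-domain key
    seen_nonces.add(nonce)
    return True

sig = sign(b"run task", "nonce-1")
first = verify(b"run task", "nonce-1", sig)   # accepted
second = verify(b"run task", "nonce-1", sig)  # replay rejected
```

A bearer token cannot express either check: it authenticates the channel, not the message, which is why replayed requests sail through under A2A.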
What Didn't Work
The honest finding: identity-aware routing did not improve aggregate quality over skill-matching. Across 30 tasks at three difficulty levels, A2A's skill-matching scored 7.43 versus LDP's 6.80. The difference wasn't statistically significant (p=0.56), but the direction was opposite to our hypothesis.
Why? Partly because our delegate pool was small — three models. With only three options, random selection gives you a 1-in-3 chance of picking the optimal delegate. The routing advantage of knowing model properties is expected to emerge with larger, more heterogeneous pools where the cost of misrouting increases.
Partly because the quality benefits of identity-enriched prompts — injecting delegate metadata into the system prompt — showed only modest, difficulty-dependent effects. On hard tasks, identity prompts scored 4.81 versus 3.80 for generic prompts, but the difference didn't reach significance at n=10. The sample sizes were too small to detect what may be real but moderate effects.
We report these null results because they inform where LDP's value actually lies. It's not in making individual responses better. It's in making the system faster, cheaper, and more governable — routing efficiency, token reduction, session management, and security boundaries.
Practical Adoption
LDP doesn't require all-or-nothing adoption. We propose three interoperability profiles:
Profile A (Basic): Identity cards + text payloads + signed messages. This captures the routing benefit — 12x latency reduction on easy tasks — with minimal integration overhead. Any system that can attach metadata to agent descriptions can implement this.
Profile B (Enterprise): Adds provenance tracking with verification fields and policy enforcement. This is for regulated environments where you need to know which model produced which output, and whether those confidence scores were verified.
Profile C (High-Performance): Payload mode negotiation and governed sessions. This captures the 37% token reduction and eliminates session overhead. Worth the complexity for high-volume systems where token costs are a line item.
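The three profiles can be summarized as feature sets. The feature labels are illustrative shorthand for the mechanisms described above, and the exact composition of each profile is a sketch.

```python
# Sketch of the three interoperability profiles as feature sets.
# Feature names are illustrative labels, not normative identifiers.

PROFILES = {
    "A": {"identity_cards", "text_payloads", "signed_messages"},
    "B": {"identity_cards", "text_payloads", "signed_messages",
          "provenance_verification", "policy_enforcement"},
    "C": {"identity_cards", "signed_messages",
          "payload_mode_negotiation", "governed_sessions"},
}

def supports(profile: str, feature: str) -> bool:
    return feature in PROFILES[profile]
```

A deployment can check a single predicate like `supports("A", "governed_sessions")` to decide whether a peer's profile covers a feature before relying on it.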
A natural question is whether A2A could simply be extended with custom metadata fields. In principle, yes — you could add model properties as custom fields. But without negotiation semantics, fallback mechanisms, session lifecycle, or policy enforcement built into the protocol, those extensions remain fragile. A custom field that nobody validates, nobody negotiates, and nobody falls back from is a comment, not a protocol primitive.
Where This Goes
This is initial evidence from a controlled setting — three local models, 30 tasks per condition, a single LLM judge. The results that reached significance (payload efficiency, session overhead) are robust. The results that didn't (routing quality, provenance value) suggest real effects that need larger-scale validation.
The open questions are practical: Should identity fields be self-declared by model providers, measured by benchmarks, or certified by external parties? How do the routing benefits scale at 50 or 500 delegates instead of 3? Do the higher payload modes (embedding hints, latent capsules) deliver on their theoretical promise?
What we can say with confidence is that treating all agents as interchangeable black boxes — the current default — leaves efficiency and governance on the table. The protocol layer is where those properties should live.
The full paper is available at arXiv:2603.08852. LDP is implemented as a plugin for the JamJet agent runtime. The protocol specification, implementation, and experiment code are open-source: ldp-protocol and ldp-research.