Interactive Paper · arXiv:2603.08852
LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
Current agent protocols — Google's A2A, Anthropic's MCP — treat every model as a black box. They expose skill labels but nothing about the model behind them: how fast it is, how much it costs, whether it reasons well or just pattern-matches quickly. LDP extends these protocols with rich identity, structured communication, provenance tracking, and trust boundaries. This page walks through the four key findings from our experiments, following the paper's structure.
Experiment Setup (Section 5)
All experiments used three local Ollama models as delegates: qwen3:8b (high quality, q=0.85, reasoning specialist, 5s median latency), qwen2.5-coder:7b (medium quality, q=0.80, code specialist, 4s latency), and llama3.2:3b (lower quality, q=0.55, fast classification, 1s latency). All ran on a single Apple Silicon machine (36GB RAM) at zero API cost. Task outputs were evaluated by Gemini 2.5 Flash as an LLM-as-judge, scoring quality (30%), correctness (40%), and completeness (30%) on a 1–10 scale. Statistical significance assessed via Mann-Whitney U test (p<0.05).
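The judge's aggregate can be sketched as a simple weighted sum. This is a minimal illustration of the stated 30/40/30 weighting, not the evaluation harness itself; the function name is ours.

```python
def judge_score(quality: float, correctness: float, completeness: float) -> float:
    """Combine the three judge dimensions (each on a 1-10 scale)
    using the stated weights: quality 30%, correctness 40%, completeness 30%."""
    return 0.30 * quality + 0.40 * correctness + 0.30 * completeness

print(judge_score(10, 10, 10))  # a perfect output scores 10.0
print(judge_score(8, 6, 7))     # correctness dominates: ~6.9
```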
Identity-Aware Routing
RQ1: Does AI-native identity improve delegation routing quality compared to skill-matching and random selection?
Consider a system with three delegates: an 8B-parameter reasoning model, a 7B coding specialist, and a lightweight 3B classifier. A user submits a straightforward sentiment classification task. Under A2A, the router sees three agents advertising overlapping skills. It picks one based on skill-name matching — and might route a trivial classification to the 8B reasoning model. The task completes in 35 seconds instead of 3.
LDP's Delegate Identity Cards expose what A2A's Agent Cards don't: model family, parameter count, quality hints (continuous 0–1 per capability), reasoning profiles, cost profiles, and latency estimates. A2A's Agent Card has 7 fields. LDP's has 20+, organized into core identity, trust & security, capabilities, and behavioral profiles. A router with this information can match task difficulty to model capability.
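A difficulty-aware router over identity cards can be sketched in a few lines. This is an illustrative reduction of the idea, not the paper's implementation: the field subset, quality floors, and latency-first tiebreak are our assumptions, while the delegate names, quality hints, and median latencies come from the experiment setup above.

```python
from dataclasses import dataclass

@dataclass
class IdentityCard:
    # Illustrative subset of a Delegate Identity Card's 20+ fields.
    name: str
    model_family: str
    quality_hint: float      # continuous 0-1, per the paper
    median_latency_s: float

DELEGATES = [
    IdentityCard("qwen3:8b", "qwen3", 0.85, 5.0),
    IdentityCard("qwen2.5-coder:7b", "qwen2.5", 0.80, 4.0),
    IdentityCard("llama3.2:3b", "llama3.2", 0.55, 1.0),
]

def route(difficulty: str) -> IdentityCard:
    """Match task difficulty to model capability: pick the fastest delegate
    that clears a difficulty-dependent quality floor (floors are illustrative)."""
    floor = {"easy": 0.5, "medium": 0.75, "hard": 0.85}[difficulty]
    eligible = [d for d in DELEGATES if d.quality_hint >= floor]
    return min(eligible, key=lambda d: d.median_latency_s)

print(route("easy").name)  # the lightweight 3B model wins on latency
print(route("hard").name)  # only the reasoning specialist clears the bar
```

A skill-name matcher has no `quality_hint` or `median_latency_s` to consult, which is exactly why it can send a trivial classification to the 8B reasoning model.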
The experiment: 30 tasks across three difficulty levels (easy, medium, hard; 10 each) spanning classification, reasoning, analysis, coding, and math. Three routing conditions: LDP (metadata-aware routing with identity-enriched prompts), A2A (skill-name matching with generic prompts), and Random (uniform selection with generic prompts). Same delegate pool, same judge for all conditions.
The surprise: Overall quality did not improve. A2A achieved the highest overall quality (7.43 ± 3.49), followed by Random (6.95 ± 3.22) and LDP (6.80 ± 3.60). No pairwise differences were statistically significant (p=0.56). This is an important honest result — identity-aware routing doesn't automatically improve output quality in a small-pool setting.
Where LDP adds clear value: latency. The primary benefit is specialization-based latency reduction. LDP's easy-task latency was 2.9s vs. 34.8s for A2A (~12× faster), because LDP routes easy tasks to the lightweight llama3.2:3b model while A2A's skill-matching selects heavier models. Quality was preserved: LDP scored 9.4 versus A2A's 9.6 on easy tasks.
Ablation study: To separate routing from prompting effects, we ran a 2×2 factorial design (120 runs total) crossing routing policy (A2A vs LDP) with prompt conditioning (generic vs identity-enriched). Result: routing drives the latency benefit (easy tasks: 1.7–2.9s with LDP routing vs 38.9–43.7s with A2A routing, regardless of prompt type). Prompting had a small, difficulty-dependent effect on quality that didn't reach significance.
Semantic Frames
RQ2: Do semantic frame payloads reduce communication cost while preserving quality?
Not all communication between agents needs to be verbose natural language. LDP defines six progressive payload modes of increasing efficiency: Mode 0 (plain text), Mode 1 (semantic frames), Mode 2 (embedding hints), Mode 3 (semantic graphs), Mode 4 (latent capsules), and Mode 5 (cache slices). Delegates negotiate the richest mutually supported mode during session establishment. If a higher mode fails mid-exchange (schema validation error, codec incompatibility), the protocol automatically falls back: Mode N → N−1 → … → Mode 0. Every delegate must support Mode 0, so communication never fails entirely.
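Because Mode 0 is mandatory, mode negotiation reduces to taking the richest mode in the intersection of the two delegates' supported sets. A minimal sketch (function name ours):

```python
def negotiate_mode(caller_modes: set[int], callee_modes: set[int]) -> int:
    """Pick the richest payload mode both delegates support.
    Mode 0 (plain text) is mandatory, so the intersection is never empty."""
    common = (caller_modes | {0}) & (callee_modes | {0})
    return max(common)

print(negotiate_mode({0, 1, 2}, {0, 1, 3}))  # -> 1 (semantic frames)
print(negotiate_mode({0, 4}, {0}))           # -> 0 (plain-text floor)
```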
The experiment: 20 tasks per condition spanning three categories: reasoning handoffs, context transfers, and verification tasks. Each task is encoded in three formats: plain text (Mode 0), semantic frames (Mode 1), and A2A's JSON envelope. The same delegate processes each version, and the same judge evaluates the output. The key question: can you reduce token count without losing information?
Mode 0: Plain Text
"Please analyze the following customer complaint. The customer John Smith purchased order #4521 on March 3rd. He reports that the package arrived damaged with a torn box and the item inside was broken..."
1,215 tokens
Mode 1: Semantic Frame
{"task": "complaint_analysis",
"customer": "John Smith",
"order": "#4521",
"issue": "damaged_package",
"details": ["torn_box",
"broken_item"]}
765 tokens −37%
Semantic frames use typed fields (task_type, instruction, expected_output_format) that eliminate verbose natural-language phrasing; A2A's JSON merely wraps the same verbose text in an envelope. Semantic frames reduced token count by 37% compared to raw text (765 vs 1,215 tokens, p=0.031, Cohen's d=−0.7, large effect). A2A's JSON wrapping saved only 7% (1,128 tokens) because it lacks the structural compactness of typed fields. Latency followed token count: semantic frames were 42% faster (14.0s vs 24.1s). Quality was comparable or slightly better (5.70 vs 5.54, p=0.96, n.s.), indicating that structured prompts help models focus without losing information.
Provenance Value
RQ3: Does structured provenance improve downstream decision quality in multi-source synthesis?
Every LDP task result carries structured provenance metadata: which delegate produced it, the model version, payload mode used, a confidence score with method (self-report vs calibrated), and verification status (verification.performed, verification.status). The hypothesis: a downstream synthesizer combining opinions from multiple delegates will make better decisions if it knows each source's reliability.
The experiment: 15 multi-source synthesis tasks where a synthesizer must combine opinions from three delegates. Three conditions: (1) no provenance — the synthesizer sees only delegate outputs, no metadata about who produced what; (2) accurate provenance — each output carries verified confidence scores matching the delegate's actual calibration; (3) noisy provenance — one delegate's confidence is artificially inflated to 0.99 and marked as verified, simulating an agent that lies about its own reliability.
Accurate provenance and no provenance produced similar quality (7.65 vs 7.85, p=0.47, n.s.) — the synthesizer performed comparably whether it knew delegate confidence levels or treated all sources equally. But noisy provenance degraded quality significantly: 6.85 ± 2.66, with more than double the standard deviation of accurate provenance (±1.21). When one delegate's confidence was inflated to 0.99, the synthesizer over-weighted its output and produced worse decisions than having no provenance at all.
This finding argues for LDP's structured verification fields: provenance is valuable only when trustworthy. Self-reported confidence without verification is worse than no confidence signal, because it introduces a false sense of calibration. Any protocol that exposes confidence or quality metadata should include verification mechanisms.
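The rule this finding argues for can be sketched as a weighting policy: take a confidence score at face value only when it was actually verified, and otherwise fall back to a neutral weight. Field and function names here are illustrative (loosely following the provenance fields above), not the protocol's schema.

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    # Illustrative subset of LDP's provenance metadata.
    delegate: str
    confidence: float           # 0-1
    confidence_method: str      # "self-report" or "calibrated"
    verification_performed: bool
    verification_status: str    # e.g. "verified", "unverified"

def synthesis_weight(p: Provenance) -> float:
    """Weight a source's opinion in multi-source synthesis. Unverified
    confidence carries no signal, so it gets a neutral weight rather
    than being taken at face value."""
    if p.verification_performed and p.verification_status == "verified":
        return p.confidence
    return 0.5  # neutral default for unverified self-reports

honest = Provenance("qwen3:8b", 0.85, "calibrated", True, "verified")
rogue = Provenance("rogue", 0.99, "self-report", False, "unverified")
print(synthesis_weight(honest), synthesis_weight(rogue))  # 0.85 0.5
```

Note that the noisy-provenance condition above also forged the verified flag, so a weighting rule like this must ultimately rest on verification the synthesizer can trust, not on the sender's own claims.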
Session Efficiency
RQ4: Do governed sessions reduce token overhead compared to stateless re-invocation in multi-round delegation?
A2A is stateless by design: each task submission is independent. For multi-round delegation — iterative refinement, progressive analysis, verification chains — this means the entire conversation context must be re-transmitted with every request. By round 10, most of the tokens are pure overhead: context the receiving agent has already seen.
LDP introduces governed sessions — persistent contexts established through a five-step handshake: HELLO (caller announces identity), CAPABILITY_MANIFEST (callee responds with supported modes), SESSION_PROPOSE (caller proposes parameters: payload mode, latency target, cost budget), SESSION_ACCEPT (callee confirms), then TASK_SUBMIT/TASK_RESULT exchanges within the session. Context is maintained server-side, eliminating re-transmission.
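The five-step handshake can be traced end to end with a toy helper. The message constructors below are stand-ins; the real protocol defines wire formats and error paths this sketch omits.

```python
def establish_session(caller: str, callee_modes: set[int], proposal: dict) -> list[str]:
    """Trace the five-step LDP handshake from the caller's perspective:
    HELLO -> CAPABILITY_MANIFEST -> SESSION_PROPOSE -> SESSION_ACCEPT,
    then the TASK_SUBMIT/TASK_RESULT loop within the session."""
    return [
        f"HELLO from={caller}",
        f"CAPABILITY_MANIFEST modes={sorted(callee_modes)}",
        f"SESSION_PROPOSE {proposal}",
        "SESSION_ACCEPT",
        "TASK_SUBMIT / TASK_RESULT exchanges (context held server-side)",
    ]

for step in establish_session("orchestrator", {0, 1}, {"mode": 1, "latency_target_s": 5}):
    print(step)
```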
The experiment: 10 multi-round delegation scenarios tested at 3, 5, and 10 rounds each, under both session-based (LDP) and stateless (A2A) conditions (n=60 total runs).
At 3 rounds, the protocols are comparable (~3,850 tokens each). At 5 rounds, A2A uses 7% more tokens (6,798 vs 6,379), with 20% being re-transmitted context overhead. At 10 rounds, the gap widens: A2A uses 23% more tokens (16,010 vs 12,990), with 39% being pure overhead. The per-round overhead grows linearly with round number because each request must re-send all prior context, so the cumulative gap widens the longer the conversation runs.
In production systems with thousands of multi-round delegations, this compounds directly into cost and latency savings. The savings are modest for short conversations but material for long-running agent collaborations — research tasks, multi-step planning, iterative code review.
Trust Domains
RQ5: Do trust domains detect unauthorized delegation attempts that bearer-token authentication misses?
A2A relies on transport-level authentication — bearer tokens. This answers "is this request authenticated?" but not "is this agent allowed to perform this specific action?" or "does this agent belong to a trusted organizational boundary?" These are different questions, and the gap between them is where attacks live.
LDP defines trust domains as security boundaries with enforcement at three levels: (1) message level — per-message signatures, nonces, and replay protection; (2) session level — trust domain compatibility checks during establishment; (3) policy level — a policy engine validates each task against configurable rules (capability scope, jurisdiction compliance, cost limits).
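The three enforcement levels compose naturally into a single authorization check. This is a hypothetical sketch — rule names, fields, and ordering are ours, and a real policy engine would also cover jurisdiction and cost rules.

```python
def authorize(task: dict, agent: dict, session: dict, seen_nonces: set) -> bool:
    """Layered trust-domain check: message level (replay protection),
    session level (trust-domain membership), policy level (capability scope)."""
    # Message level: reject replayed nonces.
    if task["nonce"] in seen_nonces:
        return False
    seen_nonces.add(task["nonce"])
    # Session level: caller must belong to the session's trust domain.
    if agent["trust_domain"] != session["trust_domain"]:
        return False
    # Policy level: requested capability must appear on the identity card.
    return task["capability"] in agent["capabilities"]

agent = {"trust_domain": "finance", "capabilities": {"classify", "summarize"}}
session = {"trust_domain": "finance"}
nonces = set()
print(authorize({"nonce": "n1", "capability": "classify"}, agent, session, nonces))  # True
print(authorize({"nonce": "n1", "capability": "classify"}, agent, session, nonces))  # False (replay)
print(authorize({"nonce": "n2", "capability": "deploy"}, agent, session, nonces))    # False (escalation)
```

A bearer token answers none of these three questions, which is why replay, escalation, and cross-domain attacks pass straight through it.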
The experiment: 100 simulated security scenarios across four attack types: (1) untrusted domain join — an agent from an external domain tries to participate in a finance-domain conversation; (2) capability escalation — an agent requests capabilities beyond its identity card; (3) replay attack — a previously valid message is re-sent after the session has moved on; (4) cross-domain access — an agent in one trust domain tries to access resources in another.
LDP detected 96% of attack attempts compared to 6% for bearer token authentication — a 16× improvement. Both maintained 0% false positives. A2A's 6% comes from cases where bearer tokens happen to be revoked — it cannot detect capability escalation, replay attacks, or cross-domain access because these concepts are absent from its protocol model.
Fallback Reliability
RQ6: Does payload mode fallback improve task completion under communication failures?
In real systems, communication failures are inevitable: schema mismatches when delegates upgrade independently, codec incompatibilities across model families, version mismatches, and timeout degradation under load. The question is whether the protocol can recover gracefully.
LDP's progressive payload modes include an automatic fallback chain: if a higher mode fails mid-exchange, the protocol falls back Mode N → N−1 → … → Mode 0 (plain text). Since every delegate must support Mode 0, communication never fails entirely — at worst it degrades to verbose but functional text.
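The fallback chain itself is a simple descending loop. In this sketch, `try_send` is a stand-in for the real transport (ours, not the protocol's API); the guarantee rests on Mode 0 being mandatory.

```python
def send_with_fallback(payload: str, mode: int, try_send) -> tuple:
    """Walk the fallback chain Mode N -> N-1 -> ... -> Mode 0.
    `try_send(payload, mode)` returns True on successful delivery.
    Because every delegate must support Mode 0, the loop always ends
    with a delivered message -- at worst as verbose plain text."""
    for m in range(mode, -1, -1):
        if try_send(payload, m):
            return m, True
    return 0, False  # unreachable if Mode 0 truly always succeeds

# Simulate a codec incompatibility: the peer rejects every mode above 1.
delivered_mode, ok = send_with_fallback("task", 3, lambda p, m: m <= 1)
print(delivered_mode, ok)  # 1 True
```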
The experiment: 40 simulated communication failures across four failure types: (1) schema mismatch — the semantic frame schema is incompatible; (2) codec incompatibility — an embedding-based mode uses an unsupported codec; (3) version mismatch — delegates run different protocol versions; (4) timeout degradation — network latency causes higher modes to time out.
LDP's fallback chain achieved 100% task completion across all four failure types with minor quality degradation (0.16 on a 0–1 scale) and average recovery latency of 112ms. When a schema mismatch occurs, LDP falls back from semantic frames to text (50ms). When a codec fails, it falls back from embedding hints to semantic frames (80ms).
Without fallback, A2A completed only 35% of tasks under the same failure conditions. Schema mismatches succeed accidentally 33% of the time; version mismatches succeed 50% if versions happen to be compatible. Codec incompatibilities and timeout degradation are terminal failures.
Citation & Links
This explainer covers all six research questions from the paper. The full paper includes additional detail on methodology, the 2×2 ablation study, implementation architecture, and the multi-party room model (specified but not evaluated).
BibTeX
@article{prakash2026ldp,
  title={LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems},
  author={Prakash, Sunil},
  journal={arXiv preprint arXiv:2603.08852},
  year={2026}
}