Interactive Paper · arXiv:2603.08852
LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
Current agent protocols — Google's A2A, Anthropic's MCP — treat every model as a black box. They expose skill labels but nothing about the model behind them: how fast it is, how much it costs, whether it reasons well or just pattern-matches quickly. LDP extends these protocols with rich identity, structured communication, provenance tracking, and trust boundaries. This page walks through the four key findings from our experiments, following the paper's structure.
Experiment Setup (Section 5)
All experiments used three local Ollama models as delegates: qwen3:8b (high quality, q=0.85, reasoning specialist, 5s median latency), qwen2.5-coder:7b (medium quality, q=0.80, code specialist, 4s latency), and llama3.2:3b (lower quality, q=0.55, fast classification, 1s latency). All ran on a single Apple Silicon machine (36GB RAM) at zero API cost. Task outputs were evaluated by Gemini 2.5 Flash as an LLM-as-judge, scoring quality (30%), correctness (40%), and completeness (30%) on a 1–10 scale. Statistical significance assessed via Mann-Whitney U test (p<0.05).
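The judge's aggregate can be sketched as a simple weighted sum. This is a minimal illustration of the stated 30/40/30 weighting, not the evaluation harness itself; the function name is ours.

```python
def judge_score(quality: float, correctness: float, completeness: float) -> float:
    """Combine the three judge dimensions (each on a 1-10 scale)
    using the stated weights: quality 30%, correctness 40%, completeness 30%."""
    return 0.30 * quality + 0.40 * correctness + 0.30 * completeness

print(judge_score(10, 10, 10))  # a perfect output scores 10.0
print(judge_score(8, 6, 7))     # correctness dominates: ~6.9
```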
Identity-Aware Routing
RQ1: Does AI-native identity improve delegation routing quality compared to skill-matching and random selection?
Consider a system with three delegates: an 8B-parameter reasoning model, a 7B coding specialist, and a lightweight 3B classifier. A user submits a straightforward sentiment classification task. Under A2A, the router sees three agents advertising overlapping skills. It picks one based on skill-name matching — and might route a trivial classification to the 8B reasoning model. The task completes in 35 seconds instead of 3.
LDP's Delegate Identity Cards expose what A2A's Agent Cards don't: model family, parameter count, quality hints (continuous 0–1 per capability), reasoning profiles, cost profiles, and latency estimates. A2A's Agent Card has 7 fields. LDP's has 20+, organized into core identity, trust & security, capabilities, and behavioral profiles. A router with this information can match task difficulty to model capability.
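A difficulty-aware router over identity cards can be sketched in a few lines. This is an illustrative reduction of the idea, not the paper's implementation: the field subset, quality floors, and latency-first tiebreak are our assumptions, while the delegate names, quality hints, and median latencies come from the experiment setup above.

```python
from dataclasses import dataclass

@dataclass
class IdentityCard:
    # Illustrative subset of a Delegate Identity Card's 20+ fields.
    name: str
    model_family: str
    quality_hint: float      # continuous 0-1, per the paper
    median_latency_s: float

DELEGATES = [
    IdentityCard("qwen3:8b", "qwen3", 0.85, 5.0),
    IdentityCard("qwen2.5-coder:7b", "qwen2.5", 0.80, 4.0),
    IdentityCard("llama3.2:3b", "llama3.2", 0.55, 1.0),
]

def route(difficulty: str) -> IdentityCard:
    """Match task difficulty to model capability: pick the fastest delegate
    that clears a difficulty-dependent quality floor (floors are illustrative)."""
    floor = {"easy": 0.5, "medium": 0.75, "hard": 0.85}[difficulty]
    eligible = [d for d in DELEGATES if d.quality_hint >= floor]
    return min(eligible, key=lambda d: d.median_latency_s)

print(route("easy").name)  # the lightweight 3B model wins on latency
print(route("hard").name)  # only the reasoning specialist clears the bar
```

A skill-name matcher has no `quality_hint` or `median_latency_s` to consult, which is exactly why it can send a trivial classification to the 8B reasoning model.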
The experiment: 30 tasks across three difficulty levels (easy, medium, hard; 10 each) spanning classification, reasoning, analysis, coding, and math. Three routing conditions: LDP (metadata-aware routing with identity-enriched prompts), A2A (skill-name matching with generic prompts), and Random (uniform selection with generic prompts). Same delegate pool, same judge for all conditions.
The surprise: Overall quality did not improve. A2A achieved the highest overall quality (7.43 ± 3.49), followed by Random (6.95 ± 3.22) and LDP (6.80 ± 3.60). No pairwise differences were statistically significant (p=0.56). This is an important honest result — identity-aware routing doesn't automatically improve output quality in a small-pool setting.
Where LDP adds clear value: latency. The primary benefit is specialization-based latency reduction. LDP's easy-task latency was 2.9s vs. 34.8s for A2A (~12× faster), because LDP routes easy tasks to the lightweight llama3.2:3b model while A2A's skill-matching selects heavier models. Quality was preserved: LDP scored 9.4 versus A2A's 9.6 on easy tasks.
Ablation study: To separate routing from prompting effects, we ran a 2×2 factorial design (120 runs total) crossing routing policy (A2A vs LDP) with prompt conditioning (generic vs identity-enriched). Result: routing drives the latency benefit (easy tasks: 1.7–2.9s with LDP routing vs 38.9–43.7s with A2A routing, regardless of prompt type). Prompting had a small, difficulty-dependent effect on quality that didn't reach significance.
Semantic Frames
RQ2: Do semantic frame payloads reduce communication cost while preserving quality?
Not all communication between agents needs to be verbose natural language. LDP defines six progressive payload modes of increasing efficiency: Mode 0 (plain text), Mode 1 (semantic frames), Mode 2 (embedding hints), Mode 3 (semantic graphs), Mode 4 (latent capsules), and Mode 5 (cache slices). Delegates negotiate the richest mutually supported mode during session establishment. If a higher mode fails mid-exchange (schema validation error, codec incompatibility), the protocol automatically falls back: Mode N → N−1 → … → Mode 0. Every delegate must support Mode 0, so communication never fails entirely.
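Because Mode 0 is mandatory, mode negotiation reduces to taking the richest mode in the intersection of the two delegates' supported sets. A minimal sketch (function name ours):

```python
def negotiate_mode(caller_modes: set[int], callee_modes: set[int]) -> int:
    """Pick the richest payload mode both delegates support.
    Mode 0 (plain text) is mandatory, so the intersection is never empty."""
    common = (caller_modes | {0}) & (callee_modes | {0})
    return max(common)

print(negotiate_mode({0, 1, 2}, {0, 1, 3}))  # -> 1 (semantic frames)
print(negotiate_mode({0, 4}, {0}))           # -> 0 (plain-text floor)
```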
The experiment: 20 tasks per condition spanning three categories: reasoning handoffs, context transfers, and verification tasks. Each task is encoded in three formats: plain text (Mode 0), semantic frames (Mode 1), and A2A's JSON envelope. The same delegate processes each version, and the same judge evaluates the output. The key question: can you reduce token count without losing information?
Mode 0: Plain Text
"Please analyze the following customer complaint. The customer John Smith purchased order #4521 on March 3rd. He reports that the package arrived damaged with a torn box and the item inside was broken..."
1,215 tokens
Mode 1: Semantic Frame
{"task": "complaint_analysis",
"customer": "John Smith",
"order": "#4521",
"issue": "damaged_package",
"details": ["torn_box",
"broken_item"]}
765 tokens −37%
Semantic frames use typed fields (task_type, instruction, expected_output_format) that eliminate verbose natural-language phrasing; A2A's JSON merely wraps the same verbose text in an envelope. Semantic frames reduced token count by 37% compared to raw text (765 vs 1,215 tokens, p=0.031, Cohen's d=−0.7, large effect). A2A's JSON wrapping saved only 7% (1,128 tokens) because it lacks the structural compactness of typed fields. Latency followed token count: semantic frames were 42% faster (14.0s vs 24.1s). Quality was comparable or slightly better (5.70 vs 5.54, p=0.96, n.s.), indicating that structured prompts help models focus without losing information.
Provenance Value
RQ3: Does structured provenance improve downstream decision quality in multi-source synthesis?
Every LDP task result carries structured provenance metadata: which delegate produced it, the model version, payload mode used, a confidence score with method (self-report vs calibrated), and verification status (verification.performed, verification.status). The hypothesis: a downstream synthesizer combining opinions from multiple delegates will make better decisions if it knows each source's reliability.
The experiment: 15 multi-source synthesis tasks where a synthesizer must combine opinions from three delegates. Three conditions: (1) no provenance — the synthesizer sees only delegate outputs, no metadata about who produced what; (2) accurate provenance — each output carries verified confidence scores matching the delegate's actual calibration; (3) noisy provenance — one delegate's confidence is artificially inflated to 0.99 and marked as verified, simulating an agent that lies about its own reliability.
Accurate provenance and no provenance produced similar quality (7.65 vs 7.85, p=0.47, n.s.) — the synthesizer performed comparably whether it knew delegate confidence levels or treated all sources equally. But noisy provenance degraded quality significantly: 6.85 ± 2.66, with more than double the standard deviation of accurate provenance (±1.21). When one delegate's confidence was inflated to 0.99, the synthesizer over-weighted its output and produced worse decisions than having no provenance at all.
This finding argues for LDP's structured verification fields: provenance is valuable only when trustworthy. Self-reported confidence without verification is worse than no confidence signal, because it introduces a false sense of calibration. Any protocol that exposes confidence or quality metadata should include verification mechanisms.
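The rule this finding argues for can be sketched as a weighting policy: take a confidence score at face value only when it was actually verified, and otherwise fall back to a neutral weight. Field and function names here are illustrative (loosely following the provenance fields above), not the protocol's schema.

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    # Illustrative subset of LDP's provenance metadata.
    delegate: str
    confidence: float           # 0-1
    confidence_method: str      # "self-report" or "calibrated"
    verification_performed: bool
    verification_status: str    # e.g. "verified", "unverified"

def synthesis_weight(p: Provenance) -> float:
    """Weight a source's opinion in multi-source synthesis. Unverified
    confidence carries no signal, so it gets a neutral weight rather
    than being taken at face value."""
    if p.verification_performed and p.verification_status == "verified":
        return p.confidence
    return 0.5  # neutral default for unverified self-reports

honest = Provenance("qwen3:8b", 0.85, "calibrated", True, "verified")
rogue = Provenance("rogue", 0.99, "self-report", False, "unverified")
print(synthesis_weight(honest), synthesis_weight(rogue))  # 0.85 0.5
```

Note that the noisy-provenance condition above also forged the verified flag, so a weighting rule like this must ultimately rest on verification the synthesizer can trust, not on the sender's own claims.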
Session Efficiency
RQ4: Do governed sessions reduce token overhead compared to stateless re-invocation in multi-round delegation?
A2A is stateless by design: each task submission is independent. For multi-round delegation — iterative refinement, progressive analysis, verification chains — this means the entire conversation context must be re-transmitted with every request. By round 10, most of the tokens are pure overhead: context the receiving agent has already seen.
LDP introduces governed sessions — persistent contexts established through a five-step handshake: HELLO (caller announces identity), CAPABILITY_MANIFEST (callee responds with supported modes), SESSION_PROPOSE (caller proposes parameters: payload mode, latency target, cost budget), SESSION_ACCEPT (callee confirms), then TASK_SUBMIT/TASK_RESULT exchanges within the session. Context is maintained server-side, eliminating re-transmission.
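The five-step handshake can be traced end to end with a toy helper. The message constructors below are stand-ins; the real protocol defines wire formats and error paths this sketch omits.

```python
def establish_session(caller: str, callee_modes: set[int], proposal: dict) -> list[str]:
    """Trace the five-step LDP handshake from the caller's perspective:
    HELLO -> CAPABILITY_MANIFEST -> SESSION_PROPOSE -> SESSION_ACCEPT,
    then the TASK_SUBMIT/TASK_RESULT loop within the session."""
    return [
        f"HELLO from={caller}",
        f"CAPABILITY_MANIFEST modes={sorted(callee_modes)}",
        f"SESSION_PROPOSE {proposal}",
        "SESSION_ACCEPT",
        "TASK_SUBMIT / TASK_RESULT exchanges (context held server-side)",
    ]

for step in establish_session("orchestrator", {0, 1}, {"mode": 1, "latency_target_s": 5}):
    print(step)
```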
The experiment: 10 multi-round delegation scenarios tested at 3, 5, and 10 rounds each, under both session-based (LDP) and stateless (A2A) conditions (n=60 total runs).
At 3 rounds, the protocols are comparable (~3,850 tokens each). At 5 rounds, A2A uses 7% more tokens (6,798 vs 6,379), with 20% being re-transmitted context overhead. At 10 rounds, the gap widens: A2A uses 23% more tokens (16,010 vs 12,990), with 39% being pure overhead. The per-round overhead grows linearly with round number because each request must re-send all prior context, so the cumulative gap widens the longer the conversation runs.
In production systems with thousands of multi-round delegations, this compounds directly into cost and latency savings. The savings are modest for short conversations but material for long-running agent collaborations — research tasks, multi-step planning, iterative code review.
Trust Domains
RQ5: Do trust domains detect unauthorized delegation attempts that bearer-token authentication misses?
A2A relies on transport-level authentication — bearer tokens. This answers "is this request authenticated?" but not "is this agent allowed to perform this specific action?" or "does this agent belong to a trusted organizational boundary?" These are different questions, and the gap between them is where attacks live.
LDP defines trust domains as security boundaries with enforcement at three levels: (1) message level — per-message signatures, nonces, and replay protection; (2) session level — trust domain compatibility checks during establishment; (3) policy level — a policy engine validates each task against configurable rules (capability scope, jurisdiction compliance, cost limits).
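The three enforcement levels compose naturally into a single authorization check. This is a hypothetical sketch — rule names, fields, and ordering are ours, and a real policy engine would also cover jurisdiction and cost rules.

```python
def authorize(task: dict, agent: dict, session: dict, seen_nonces: set) -> bool:
    """Layered trust-domain check: message level (replay protection),
    session level (trust-domain membership), policy level (capability scope)."""
    # Message level: reject replayed nonces.
    if task["nonce"] in seen_nonces:
        return False
    seen_nonces.add(task["nonce"])
    # Session level: caller must belong to the session's trust domain.
    if agent["trust_domain"] != session["trust_domain"]:
        return False
    # Policy level: requested capability must appear on the identity card.
    return task["capability"] in agent["capabilities"]

agent = {"trust_domain": "finance", "capabilities": {"classify", "summarize"}}
session = {"trust_domain": "finance"}
nonces = set()
print(authorize({"nonce": "n1", "capability": "classify"}, agent, session, nonces))  # True
print(authorize({"nonce": "n1", "capability": "classify"}, agent, session, nonces))  # False (replay)
print(authorize({"nonce": "n2", "capability": "deploy"}, agent, session, nonces))    # False (escalation)
```

A bearer token answers none of these three questions, which is why replay, escalation, and cross-domain attacks pass straight through it.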
The experiment: 100 simulated security scenarios across four attack types: (1) untrusted domain join — an agent from an external domain tries to participate in a finance-domain conversation; (2) capability escalation — an agent requests capabilities beyond its identity card; (3) replay attack — a previously valid message is re-sent after the session has moved on; (4) cross-domain access — an agent in one trust domain tries to access resources in another.
LDP detected 96% of attack attempts compared to 6% for bearer token authentication — a 16× improvement. Both maintained 0% false positives. A2A's 6% comes from cases where bearer tokens happen to be revoked — it cannot detect capability escalation, replay attacks, or cross-domain access because these concepts are absent from its protocol model.
Fallback Reliability
RQ6: Does payload mode fallback improve task completion under communication failures?
In real systems, communication failures are inevitable: schema mismatches when delegates upgrade independently, codec incompatibilities across model families, version mismatches, and timeout degradation under load. The question is whether the protocol can recover gracefully.
LDP's progressive payload modes include an automatic fallback chain: if a higher mode fails mid-exchange, the protocol falls back Mode N → N−1 → … → Mode 0 (plain text). Since every delegate must support Mode 0, communication never fails entirely — at worst it degrades to verbose but functional text.
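The fallback chain itself is a simple descending loop. In this sketch, `try_send` is a stand-in for the real transport (ours, not the protocol's API); the guarantee rests on Mode 0 being mandatory.

```python
def send_with_fallback(payload: str, mode: int, try_send) -> tuple:
    """Walk the fallback chain Mode N -> N-1 -> ... -> Mode 0.
    `try_send(payload, mode)` returns True on successful delivery.
    Because every delegate must support Mode 0, the loop always ends
    with a delivered message -- at worst as verbose plain text."""
    for m in range(mode, -1, -1):
        if try_send(payload, m):
            return m, True
    return 0, False  # unreachable if Mode 0 truly always succeeds

# Simulate a codec incompatibility: the peer rejects every mode above 1.
delivered_mode, ok = send_with_fallback("task", 3, lambda p, m: m <= 1)
print(delivered_mode, ok)  # 1 True
```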
The experiment: 40 simulated communication failures across four failure types: (1) schema mismatch — the semantic frame schema is incompatible; (2) codec incompatibility — an embedding-based mode uses an unsupported codec; (3) version mismatch — delegates run different protocol versions; (4) timeout degradation — network latency causes higher modes to time out.
LDP's fallback chain achieved 100% task completion across all four failure types with minor quality degradation (0.16 on a 0–1 scale) and average recovery latency of 112ms. When a schema mismatch occurs, LDP falls back from semantic frames to text (50ms). When a codec fails, it falls back from embedding hints to semantic frames (80ms).
Without fallback, A2A completed only 35% of tasks under the same failure conditions. Schema mismatches succeed accidentally 33% of the time; version mismatches succeed 50% if versions happen to be compatible. Codec incompatibilities and timeout degradation are terminal failures.
Citation & Links
This explainer covers all six research questions from the paper. The full paper includes additional detail on methodology, the 2×2 ablation study, implementation architecture, and the multi-party room model (specified but not evaluated).
BibTeX
@article{prakash2026ldp,
  title={LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems},
  author={Prakash, Sunil},
  journal={arXiv preprint arXiv:2603.08852},
  year={2026}
}