PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the leakage of sensitive information through shared context in multi-agent large language model (LLM) pipelines, where existing defenses act after generation, rely mainly on surface patterns, or add substantial latency. The authors model credential leakage as sequential risk accumulation during generation and observe that credential reproduction is often preceded by measurable shifts in generation dynamics, such as entropy collapse and increasing logit concentration. PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioral, and contextual features into a calibrated risk score at each decoding step and intervenes per token through green, yellow, and red risk zones. On a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a four-agent pipeline, it achieves F1 = 0.832 (precision 1.000, recall 0.712), a 0.0% task-level leak rate, and output utility of 0.893. The strongest baseline, Span Tagger, reaches F1 = 0.719 with a 15.0% task-level leak rate.
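The reported F1 can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name `f1_score` is ours for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for PRISM in the summary above.
prism_f1 = f1_score(precision=1.000, recall=0.712)
print(round(prism_f1, 3))  # 2 * 0.712 / 1.712, rounded to three decimals
```

With precision fixed at 1.000, F1 is bounded by recall alone, so the gap between 0.832 and a perfect score comes entirely from missed detections rather than false alarms.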

📝 Abstract

Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
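To make the idea of generation-time signals concrete, here is a minimal sketch of two of the dynamics the abstract names, entropy collapse and logit concentration, folded into a running risk value and mapped to green, yellow, and red zones. It is an illustration under stated assumptions, not the authors' implementation: the equal weights, the decay factor, the zone thresholds, and all function names are hypothetical, and only 2 of the 16 signals are modelled.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step_risk(logits: List[float], w_entropy: float = 0.5, w_conc: float = 0.5) -> float:
    """Hypothetical per-step risk in [0, 1]: low entropy and a dominant
    top-1 probability both push the value up. Weights are assumptions."""
    probs = softmax(logits)
    max_entropy = math.log(len(probs))
    norm_entropy = entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return w_entropy * (1.0 - norm_entropy) + w_conc * max(probs)


def zone(score: float, yellow: float = 0.5, red: float = 0.8) -> str:
    """Map a risk value to a triage zone; thresholds are assumptions."""
    if score >= red:
        return "red"
    if score >= yellow:
        return "yellow"
    return "green"


def triage_sequence(step_logits: List[List[float]], decay: float = 0.6) -> List[str]:
    """Accumulate risk across decoding steps with an exponential moving
    average (an assumed stand-in for sequential risk accumulation)."""
    acc = 0.0
    zones = []
    for logits in step_logits:
        acc = decay * acc + (1.0 - decay) * step_risk(logits)
        zones.append(zone(acc))
    return zones


if __name__ == "__main__":
    flat = [0.0, 0.0, 0.0, 0.0]      # high-entropy step: model is undecided
    peaked = [20.0, 0.0, 0.0, 0.0]   # collapsed entropy: one token dominates
    print(triage_sequence([flat, flat, peaked, peaked, peaked, peaked]))
```

In this toy setup a single confident step does not trigger the red zone; the accumulated value has to stay high over several consecutive steps, which loosely mirrors the idea of warning before a secret is fully reconstructed. A real system would calibrate the score and combine it with structural cues such as identifier-pattern detection, as the abstract describes.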

Problem

Research questions and friction points this paper is trying to address.

secret leakage

multi-agent LLM

propagation amplification

generation-time security

credential exposure

Innovation

Methods, ideas, or system contributions that make the work stand out.

propagation amplification

real-time leakage detection

risk-aware decoding