🤖 AI Summary
This study addresses a “monitoring-control gap” in retrieval-augmented generation (RAG) systems during multi-turn interactions: models acknowledge contradictory evidence, yet that awareness fails to constrain their final recommendations. The work shows that single-turn evaluations systematically overestimate RAG safety. Using a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, supported by targeted human validation, the authors find that contradiction acknowledgement is uncorrelated with safe resolution. Hidden-state probing, attention analysis, and a response-strategy taxonomy point to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The findings indicate that RAG systems cannot yet be trusted in high-stakes settings and that no universal prompt fix exists.
📝 Abstract
Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations; detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution (a pattern corroborated by targeted human validation), and that no universal prompt fix exists. Converging mechanism evidence (hidden-state probing, attention analysis, and response-strategy taxonomy) points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.
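
To make the idea of a multi-turn document accumulation protocol concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the protocol's details, so every name here (`TurnResult`, `run_accumulation`, `acknowledged_but_unsafe_rate`, `toy_generate`) is hypothetical, the keyword checks stand in for whatever response classifiers the paper actually uses, and the summary statistic is only one simple way to express "acknowledges the conflict but still recommends unsafely".

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TurnResult:
    turn: int
    acknowledged_conflict: bool  # did the response mention conflicting evidence?
    safe_recommendation: bool    # did the final recommendation stay safe?


def run_accumulation(
    question: str,
    documents: Sequence[str],
    generate: Callable[[str, List[str]], str],
    acknowledges: Callable[[str], bool],
    is_safe: Callable[[str], bool],
) -> List[TurnResult]:
    """Add one retrieved document per turn and score every response."""
    context: List[str] = []
    results: List[TurnResult] = []
    for turn, doc in enumerate(documents, start=1):
        context.append(doc)  # evidence accumulates across turns
        response = generate(question, list(context))
        results.append(TurnResult(turn, acknowledges(response), is_safe(response)))
    return results


def acknowledged_but_unsafe_rate(results: Sequence[TurnResult]) -> float:
    """Share of conflict-acknowledging turns whose recommendation is unsafe."""
    acknowledged = [r for r in results if r.acknowledged_conflict]
    if not acknowledged:
        return 0.0
    return sum(not r.safe_recommendation for r in acknowledged) / len(acknowledged)


def toy_generate(question: str, context: List[str]) -> str:
    # Hypothetical stand-in for a model: it notices the warning document
    # but never changes its recommendation.
    sees_warning = any("WARNING" in d for d in context)
    prefix = "Sources conflict. " if sees_warning else ""
    return prefix + "Recommendation: proceed."


docs = [
    "Dose X is standard.",
    "WARNING: dose X is contraindicated.",
    "Dose X is common.",
]
results = run_accumulation(
    "Should the patient take dose X?",
    docs,
    toy_generate,
    acknowledges=lambda r: "conflict" in r.lower(),  # assumed keyword check
    is_safe=lambda r: "proceed" not in r.lower(),    # assumed keyword check
)
print(acknowledged_but_unsafe_rate(results))
```

In this toy run the stand-in model flags the conflict from the second turn onward but keeps recommending the same action, which is the pattern the abstract describes; a single-turn check on the first document alone would never surface it.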