🤖 AI Summary
This work addresses the critical safety risks posed by tool-output contamination in multi-turn LLM agents operating in high-stakes domains, a hazard largely undetectable by conventional ranking-based evaluation metrics. The authors propose a paired-trajectory protocol that compares agent behavior under clean versus contaminated tool conditions in real-world financial dialogues, thereby identifying and quantifying, for the first time, the phenomenon of "information-channel-dominated recommendation drift." Through replay experiments across models ranging from 7B to state-of-the-art, decomposition of divergence into information and memory channels, and a newly introduced safety-aware metric (sNDCG), the study exposes a blind spot in standard evaluations regarding safety failures. Results show that 65%–93% of 1,563 contaminated interactions yield unsafe recommendations, with no agent questioning tool reliability; sNDCG further reveals a utility preservation ratio of only 0.51–0.74, starkly highlighting the safety gap.
📝 Abstract
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65–93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51–0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
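The abstract does not give the exact form of sNDCG, but the idea of a safety-penalized NDCG can be sketched as follows: compute standard NDCG over a ranked list of relevance gains, but scale down (or zero out) the gain of any item flagged as risk-inappropriate before discounting. The function names, the multiplicative `penalty` parameter, and the choice to normalize against the ideal DCG of the *unpenalized* gains are illustrative assumptions, not the paper's definition.

```python
import math


def dcg(gains):
    # Discounted cumulative gain: position i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(relevances):
    # Standard NDCG: DCG of the ranking divided by DCG of the ideal ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def sndcg(relevances, unsafe, penalty=1.0):
    # Hypothetical safety-penalized NDCG (sketch, not the paper's formula):
    # items flagged unsafe lose `penalty` fraction of their gain, so a
    # recommendation list that ranks risk-inappropriate products well
    # scores lower even when its plain NDCG is unchanged.
    penalized = [r * (1.0 - penalty) if u else r
                 for r, u in zip(relevances, unsafe)]
    # Normalize against the ideal DCG of the unpenalized gains, so safety
    # violations strictly reduce the score rather than being renormalized away.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(penalized) / ideal if ideal > 0 else 0.0


# Toy paired-trajectory comparison: same relevance profile, but the
# contaminated run ranks one risk-inappropriate product.
clean = ndcg([3, 2, 1])                                  # no unsafe items
contaminated = sndcg([3, 2, 1], [False, True, False])    # middle item unsafe
preservation_ratio = contaminated / clean
```

Under this sketch, plain NDCG would report a preservation ratio near 1.0 for both runs, while the safety-penalized variant drops the ratio below 1.0 whenever unsafe items appear, mirroring the 0.51–0.74 range the abstract reports.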