🤖 AI Summary
This work addresses the critical safety risks posed by tool-output contamination in multi-turn LLM agents operating in high-stakes domains, a hazard largely undetectable by conventional ranking-based evaluation metrics. The authors propose a paired-trajectory protocol that compares agent behavior under clean versus contaminated tool conditions in real-world financial dialogues, thereby identifying and quantifying, for the first time, the phenomenon of "information-channel-dominated recommendation drift." Through replay experiments across models ranging from 7B to state-of-the-art, decomposition of divergence into information and memory channels, and a newly introduced safety-aware metric (sNDCG), the study exposes a blind spot in standard evaluations regarding safety failures. Results show that 65%–93% of 1,563 contaminated interactions yield unsafe recommendations, with no agent questioning tool reliability; sNDCG further reveals a utility preservation ratio of only 0.51–0.74, starkly highlighting the safety gap.
📝 Abstract
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65–93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51–0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
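The abstract does not give the exact form of sNDCG, but the idea of a safety-penalized NDCG can be sketched as follows: compute standard NDCG over a ranked list of relevance gains, but scale down (or zero out) the gain of any item flagged as risk-inappropriate before discounting. The function names, the multiplicative `penalty` parameter, and the choice to normalize against the ideal DCG of the *unpenalized* gains are illustrative assumptions, not the paper's definition.

```python
import math


def dcg(gains):
    # Discounted cumulative gain: position i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(relevances):
    # Standard NDCG: DCG of the ranking divided by DCG of the ideal ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def sndcg(relevances, unsafe, penalty=1.0):
    # Hypothetical safety-penalized NDCG (sketch, not the paper's formula):
    # items flagged unsafe lose `penalty` fraction of their gain, so a
    # recommendation list that ranks risk-inappropriate products well
    # scores lower even when its plain NDCG is unchanged.
    penalized = [r * (1.0 - penalty) if u else r
                 for r, u in zip(relevances, unsafe)]
    # Normalize against the ideal DCG of the unpenalized gains, so safety
    # violations strictly reduce the score rather than being renormalized away.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(penalized) / ideal if ideal > 0 else 0.0


# Toy paired-trajectory comparison: same relevance profile, but the
# contaminated run ranks one risk-inappropriate product.
clean = ndcg([3, 2, 1])                                  # no unsafe items
contaminated = sndcg([3, 2, 1], [False, True, False])    # middle item unsafe
preservation_ratio = contaminated / clean
```

Under this sketch, plain NDCG would report a preservation ratio near 1.0 for both runs, while the safety-penalized variant drops the ratio below 1.0 whenever unsafe items appear, mirroring the 0.51–0.74 range the abstract reports.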