🤖 AI Summary
This study addresses a safety issue in large language model (LLM) agents: safety-critical policy instructions can be dropped, weakened, or rebound during context assembly operations such as truncation, summarization, reordering, or rewriting, so an agent can violate its constraints without any prompt override, model change, or persistent-memory compromise. The work identifies and quantifies this failure mode and evaluates SafeContext, a lightweight control layer that leaves model weights unchanged. SafeContext improves policy retention by pinning control state, reusing retained control prefixes, and optionally injecting reminders under pressure. Experiments on local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B, with a larger-model extension to Qwen 14B and Llama 70B, combine judged exact constraint respect with direct audits of assembled-state visibility. Unmitigated risk is systematic, although absolute exact respect remains low. SafeContext yields small gains against truncation; against a strong structured-compaction policy most of the aggregate lift disappears, and the residual benefit is concentrated in overflow eviction and selected aliasing slices.
📝 Abstract
LM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive-bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent-memory compromise. We study this failure mode over local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled-state visibility. We evaluate SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact respect remains low. Against truncation, SafeContext yields small gains; against a strong structured-compaction policy, most aggregate lift disappears, leaving residual benefit mainly in overflow eviction and selected aliasing slices. Replay-only does not explain the effect. A larger-model extension on Qwen 14B and Llama 70B shows the same failure object under larger models, although sign and magnitude remain policy-conditional. Decision-time context assembly is therefore a measurable part of the control path that can be partially hardened.
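To make the failure mode concrete, the following is a minimal Python sketch, not the authors' implementation. It contrasts plain tail truncation with an assembly step that pins directive-bearing messages before filling the remaining budget, and adds a simple visibility check in the spirit of the assembled-state audits. The `Message` type, the `is_directive` flag, the word-count cost model, and all function names are assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    text: str
    is_directive: bool = False  # hypothetical flag marking policy-bearing state


def cost(msg: Message) -> int:
    # Crude stand-in for a token count: whitespace-separated words (assumption).
    return len(msg.text.split())


def truncate_tail(history: list[Message], budget: int) -> list[Message]:
    """Baseline assembly: keep the most recent messages that fit the budget."""
    kept: list[Message] = []
    used = 0
    for msg in reversed(history):
        if used + cost(msg) > budget:
            break
        kept.append(msg)
        used += cost(msg)
    return list(reversed(kept))


def assemble_pinned(history: list[Message], budget: int) -> list[Message]:
    """Pin directives first, then spend what is left of the budget on recent turns."""
    pinned = [m for m in history if m.is_directive]
    remaining = max(budget - sum(cost(m) for m in pinned), 0)
    others = [m for m in history if not m.is_directive]
    return pinned + truncate_tail(others, remaining)


def directive_visible(state: list[Message]) -> bool:
    """Visibility audit: does any directive survive in the assembled state?"""
    return any(m.is_directive for m in state)
```

With an early directive followed by enough filler turns to exceed the budget, `truncate_tail` evicts the directive while `assemble_pinned` retains it, at the price of fewer recent turns. This sketch covers only the pinning idea under truncation; it does not model summarization, reordering, prefix reuse, or reminder injection.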