🤖 AI Summary
Large language models (LLMs) risk leaking sensitive information when persistent memory is surfaced in mismatched contexts. Method: We propose CIMemories, the first benchmark for evaluating memory controllability through contextual integrity assessment, built on high-dimensional synthetic user profiles (>100 attributes per user) and multi-task scenarios; attribute-level violation detection quantifies the trade-off between information leakage and task utility. Contribution/Results: Experiments reveal fundamental contextual integrity deficiencies in state-of-the-art LLMs, with attribute-level leakage rates of up to 69%. GPT-5 exhibits a 9.6% violation rate across 40 tasks, rising to 25.1% after five repeated executions of the same prompt. This work provides the first systematic evidence that memory leakage accumulates with task scale and execution frequency, exposing critical gaps in fine-grained, context-aware privacy control in current LLMs.
📝 Abstract
Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory introduces critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. CIMemories uses synthetic user profiles with over 100 attributes per user, paired with diverse task contexts in which each attribute may be essential for some tasks but inappropriate for others. Our evaluation reveals that frontier models exhibit up to 69% attribute-level violations (leaking information inappropriately), with lower violation rates often coming at the cost of task utility. Violations accumulate across both tasks and runs: as usage increases from 1 to 40 tasks, GPT-5's violations rise from 0.1% to 9.6%, reaching 25.1% when the same prompt is executed five times, revealing arbitrary and unstable behavior in which models leak different attributes for identical prompts. Privacy-conscious prompting does not solve this: models overgeneralize, sharing everything or nothing rather than making nuanced, context-dependent decisions. These findings reveal fundamental limitations that require contextually aware reasoning capabilities, not just better prompting or scaling.
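To make the accumulation effect concrete, the sketch below shows one plausible way to compute an attribute-level violation rate and its union across repeated runs. This is an illustrative assumption, not the CIMemories implementation: the function names, the set-based representation of leaked attributes, and the example attribute names are all hypothetical. It captures why repeated executions inflate the rate: an attribute counts as violated if it leaks in any run, so unstable models that leak different attributes each time accumulate violations.

```python
# Hypothetical sketch (NOT the CIMemories implementation): an attribute-level
# violation metric over sets of leaked attributes, and its cumulative union
# across repeated runs of the same prompt.

def violation_rate(leaked: set, inappropriate: set) -> float:
    """Fraction of context-inappropriate attributes that leaked in one run."""
    if not inappropriate:
        return 0.0
    return len(leaked & inappropriate) / len(inappropriate)

def cumulative_violation_rate(runs: list, inappropriate: set) -> float:
    """An attribute is violated if it leaked in ANY run; this union is why
    repeated executions of an identical prompt raise the measured rate."""
    leaked_any = set().union(*runs) if runs else set()
    return violation_rate(leaked_any, inappropriate)

# Example: identical prompt run 3 times leaks a different attribute each time
# (attribute names here are made up for illustration).
inappropriate = {"ssn", "diagnosis", "salary", "address"}
runs = [{"ssn"}, {"diagnosis"}, {"salary"}]
print(violation_rate(runs[0], inappropriate))          # 0.25 for a single run
print(cumulative_violation_rate(runs, inappropriate))  # 0.75 cumulatively
```

Under this reading, per-run rates can look modest while the cumulative rate grows with each execution, matching the paper's observation that GPT-5's 9.6% violation rate climbs to 25.1% over five repeated executions.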