🤖 AI Summary
Large language models (LLMs) risk leaking sensitive information when persistent memory is surfaced in mismatched contexts. Method: We propose CIMemories, the first benchmark for evaluating memory controllability through contextual integrity assessment, built on high-dimensional synthetic user profiles (>100 attributes per user) and multi-task scenarios; attribute-level violation detection quantifies the trade-off between information leakage and task utility. Contribution/Results: Experiments reveal fundamental contextual integrity deficiencies in state-of-the-art LLMs, with attribute-level leakage rates of up to 69%. GPT-5 exhibits a 9.6% violation rate across 40 tasks, rising to 25.1% after five repeated executions of the same prompt. This work provides the first systematic evidence that memory leakage accumulates with task scale and execution frequency, exposing critical gaps in fine-grained, context-aware privacy control in current LLMs.
📝 Abstract
Large Language Models (LLMs) increasingly use persistent memory from past interactions to enhance personalization and task performance. However, this memory introduces critical risks when sensitive information is revealed in inappropriate contexts. We present CIMemories, a benchmark for evaluating whether LLMs appropriately control information flow from memory based on task context. CIMemories uses synthetic user profiles with over 100 attributes per user, paired with diverse task contexts in which each attribute may be essential for some tasks but inappropriate for others. Our evaluation reveals that frontier models exhibit up to 69% attribute-level violations (leaking information inappropriately), with lower violation rates often coming at the cost of task utility. Violations accumulate across both tasks and runs: as usage increases from 1 to 40 tasks, GPT-5's violations rise from 0.1% to 9.6%, reaching 25.1% when the same prompt is executed five times, revealing arbitrary and unstable behavior in which models leak different attributes for identical prompts. Privacy-conscious prompting does not solve this: models overgeneralize, sharing everything or nothing rather than making nuanced, context-dependent decisions. These findings reveal fundamental limitations that require contextually aware reasoning capabilities, not just better prompting or scaling.
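To make the accumulation effect concrete, the sketch below shows one plausible way to compute an attribute-level violation rate and its union across repeated runs. This is an illustrative assumption, not the CIMemories implementation: the function names, the set-based representation of leaked attributes, and the example attribute names are all hypothetical. It captures why repeated executions inflate the rate: an attribute counts as violated if it leaks in any run, so unstable models that leak different attributes each time accumulate violations.

```python
# Hypothetical sketch (NOT the CIMemories implementation): an attribute-level
# violation metric over sets of leaked attributes, and its cumulative union
# across repeated runs of the same prompt.

def violation_rate(leaked: set, inappropriate: set) -> float:
    """Fraction of context-inappropriate attributes that leaked in one run."""
    if not inappropriate:
        return 0.0
    return len(leaked & inappropriate) / len(inappropriate)

def cumulative_violation_rate(runs: list, inappropriate: set) -> float:
    """An attribute is violated if it leaked in ANY run; this union is why
    repeated executions of an identical prompt raise the measured rate."""
    leaked_any = set().union(*runs) if runs else set()
    return violation_rate(leaked_any, inappropriate)

# Example: identical prompt run 3 times leaks a different attribute each time
# (attribute names here are made up for illustration).
inappropriate = {"ssn", "diagnosis", "salary", "address"}
runs = [{"ssn"}, {"diagnosis"}, {"salary"}]
print(violation_rate(runs[0], inappropriate))          # 0.25 for a single run
print(cumulative_violation_rate(runs, inappropriate))  # 0.75 cumulatively
```

Under this reading, per-run rates can look modest while the cumulative rate grows with each execution, matching the paper's observation that GPT-5's 9.6% violation rate climbs to 25.1% over five repeated executions.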