🤖 AI Summary
This work addresses the degradation in response quality and the increased latency that large language models (LLMs) experience in long conversations due to context inflation. Existing context management approaches often suffer from inefficiency or compromise dialogue coherence. To overcome these limitations, we propose DyCP, a lightweight dynamic context pruning method that, for the first time, dynamically segments and retrieves relevant memory at query time based on the current user input, without requiring predefined topic boundaries or additional LLM invocations. DyCP effectively preserves temporal structure and conversational coherence while significantly improving response quality and reducing latency across multiple LLMs, as demonstrated on three long-context dialogue benchmarks: LoCoMo, MT-Bench+, and SCM4LLMs.
📝 Abstract
Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, dialogue history must still be managed efficiently in practice due to inference cost and latency constraints. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context while preserving the sequential nature of dialogue, without predefined topic boundaries, enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.
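To make the idea of query-time context selection concrete, here is a minimal sketch of retrieving relevant dialogue turns conditioned on the current query while preserving their original order. This is an illustrative toy, not DyCP's actual algorithm: the bag-of-words cosine scoring, the `prune_context` name, and the fixed turn budget are all assumptions introduced for the example (the paper's segmentation and relevance scoring are not specified here).

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def prune_context(history: list[str], query: str, budget: int) -> list[str]:
    """Toy query-time pruning: score each past turn against the current
    query, keep the `budget` most relevant turns, and re-emit them in
    their original order so the dialogue's sequential structure survives."""
    q_vec = Counter(query.lower().split())
    scored = [(i, cosine(Counter(turn.lower().split()), q_vec))
              for i, turn in enumerate(history)]
    # Pick the top-`budget` turns by score, then restore chronological order.
    keep = sorted(sorted(scored, key=lambda s: s[1], reverse=True)[:budget])
    return [history[i] for i, _ in keep]


history = [
    "we talked about python decorators",
    "my cat likes tuna",
    "decorators wrap functions",
    "the weather is nice today",
]
pruned = prune_context(history, "how do decorators work", budget=2)
# Keeps the two decorator-related turns, in their original order.
```

In a real system the scorer would be a learned or embedding-based relevance function, but the key property illustrated, selecting a query-dependent subset at inference time while keeping temporal order, is the same.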