🤖 AI Summary
In long multi-session dialogues, growing histories and accumulated noise impede existing long-context models' ability to accurately identify temporally pertinent information, severely degrading reasoning performance. To address this, the authors propose Memory-T1, a framework that learns a time-aware memory selection policy with reinforcement learning: coarse-grained temporal and relevance filters first prune the dialogue history into a candidate set, and an RL agent then selects the precise evidence sessions. Training is guided by a multi-level reward that jointly optimizes answer accuracy, evidence grounding, and temporal consistency, the latter providing a dense signal at both the session level and the utterance level to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, the 7B model reaches an overall score of 67.0%, a new state of the art for open-source models, outperforming a 14B baseline by 10.2%. Memory-T1 remains robust on dialogue histories up to 128k tokens, where baseline models collapse, demonstrating strong resilience to noise.
📝 Abstract
Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, prior work and our pilot study show that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy: the dialogue history is first pruned into a candidate set using temporal and relevance filters, and an RL agent then selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query's time scope at both the session level (chronological proximity) and the utterance level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state of the art for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% performance gain. Moreover, Memory-T1 remains robust up to 128k tokens, where baseline models collapse, demonstrating its effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
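To make the three-part reward concrete, here is a minimal illustrative sketch of how such a multi-level reward could be combined. All function and parameter names (and the F1-based grounding term and equal session/utterance weighting) are our assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of a multi-level reward combining (i) answer accuracy,
# (ii) evidence grounding, and (iii) temporal consistency. Names, weights,
# and the F1 grounding metric are illustrative assumptions, not Memory-T1's code.
def multi_level_reward(answer_correct, selected_sessions, gold_sessions,
                       session_time_score, utterance_time_score,
                       w_ans=1.0, w_evid=0.5, w_time=0.5):
    # (i) Answer accuracy: sparse binary signal from the final answer.
    r_answer = 1.0 if answer_correct else 0.0

    # (ii) Evidence grounding: F1 overlap between selected and gold sessions.
    selected, gold = set(selected_sessions), set(gold_sessions)
    overlap = len(selected & gold)
    if selected and gold and overlap:
        precision = overlap / len(selected)
        recall = overlap / len(gold)
        r_evidence = 2 * precision * recall / (precision + recall)
    else:
        r_evidence = 0.0

    # (iii) Temporal consistency: dense signal averaging a session-level score
    # (chronological proximity to the query's time scope) and an
    # utterance-level score (chronological fidelity within sessions).
    r_time = 0.5 * session_time_score + 0.5 * utterance_time_score

    return w_ans * r_answer + w_evid * r_evidence + w_time * r_time
```

The dense temporal term rewards partially correct selections even when the final answer is wrong, which is what lets the RL agent learn from chronological near-misses rather than only from sparse answer correctness.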