Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In long multi-session dialogues, growing histories and accumulated noise impede existing long-context models' ability to identify temporally pertinent information, severely degrading reasoning performance. To address this, the authors propose Memory-T1, a framework that learns a time-aware memory selection policy with reinforcement learning: coarse-grained temporal and relevance filters first prune the history to a candidate set, and an RL agent then selects the precise evidence sessions. A multi-level dense reward jointly optimizes answer accuracy, evidence grounding, and temporal consistency, evaluated at both the session level and the utterance level, helping the agent resolve subtle chronological ambiguities. On the Time-Dialog benchmark, the 7B model reaches an overall score of 67.0%, a new state of the art among open-source models, outperforming a 14B baseline by 10.2%. Memory-T1 remains robust on dialogue histories up to 128K tokens and shows markedly better noise robustness than prior approaches.

📝 Abstract
Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, existing works and our pilot study have shown that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set using temporal and relevance filters, followed by an RL agent that selects the precise evidence sessions. The RL training is guided by a multi-level reward function optimizing (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query time scope at both the session-level (chronological proximity) and the utterance-level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show temporal consistency and evidence grounding rewards jointly contribute to a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, proving effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
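The multi-level reward described above can be sketched as a weighted sum of three signals: a binary answer-accuracy term, an evidence-grounding term (overlap between selected and gold evidence sessions), and a temporal-consistency term (alignment of selected sessions with the query's time scope). The function names, weights, and the use of F1 for grounding below are illustrative assumptions, not the paper's exact formulation:

```python
def temporal_consistency(selected_times, query_start, query_end):
    """Session-level consistency: fraction of selected sessions whose
    timestamp falls inside the query's time scope (assumed metric)."""
    if not selected_times:
        return 0.0
    inside = sum(query_start <= t <= query_end for t in selected_times)
    return inside / len(selected_times)

def evidence_grounding(selected_ids, gold_ids):
    """Grounding reward as F1 between selected and gold evidence sessions
    (F1 is an assumption; the paper only names the reward component)."""
    selected, gold = set(selected_ids), set(gold_ids)
    if not selected or not gold:
        return 0.0
    tp = len(selected & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(selected), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def multi_level_reward(answer_correct, selected_ids, gold_ids,
                       selected_times, query_start, query_end,
                       w_ans=1.0, w_ev=0.5, w_time=0.5):
    """Combine the three reward components; weights are hypothetical."""
    r_ans = 1.0 if answer_correct else 0.0
    r_ev = evidence_grounding(selected_ids, gold_ids)
    r_time = temporal_consistency(selected_times, query_start, query_end)
    return w_ans * r_ans + w_ev * r_ev + w_time * r_time
```

Because the grounding and temporal terms are fractional rather than binary, the reward stays dense: the agent receives partial credit for selecting mostly correct, mostly in-scope sessions even when the final answer is wrong.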
Problem

Research questions and friction points this paper is trying to address.

Temporal reasoning degrades in multi-session conversational agents as dialogue histories grow.
Long, noisy dialogue histories prevent long-context models from identifying temporally pertinent information.
How can reinforcement learning jointly improve answer accuracy, evidence grounding, and temporal consistency?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for time-aware memory selection policy
Coarse-to-fine strategy with temporal and relevance filters
Multi-level reward optimizing accuracy, grounding, and temporal consistency
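The coarse stage of the coarse-to-fine strategy can be illustrated as a cheap pre-filter that prunes the history before the RL agent runs: keep only sessions overlapping the query's time scope, then rank survivors by a relevance proxy. The session schema, keyword-overlap scoring, and `top_k` cutoff here are assumed for illustration; the paper does not specify these details:

```python
def coarse_filter(sessions, query_start, query_end, query_terms, top_k=10):
    """Coarse pruning sketch: a temporal filter followed by a relevance
    filter. Each session is a dict with a "time" stamp and "text" body
    (hypothetical schema); relevance is scored by keyword overlap."""
    # Temporal filter: drop sessions outside the query's time scope.
    in_scope = [s for s in sessions if query_start <= s["time"] <= query_end]

    # Relevance filter: rank remaining sessions by query-term overlap.
    def overlap(session):
        words = set(session["text"].lower().split())
        return len(words & query_terms)

    return sorted(in_scope, key=overlap, reverse=True)[:top_k]
```

The resulting candidate set is what the fine-grained RL policy would select evidence sessions from, keeping the policy's action space small even for 128K-token histories.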