Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

📅 2026-01-03

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty large language models face in temporal knowledge graph question answering (TKGQA): at each hop the retrieved subgraph contains many temporally similar and semantically complex relations, which leads to suboptimal decisions and error propagation during multi-hop reasoning. The authors propose the multi-hop reasoning enhanced (MRE) framework, which uses prompt engineering to generate diverse reasoning trajectories, selects the valid ones for supervised fine-tuning as a cold start, and then applies Tree-Group Relative Policy Optimization (T-GRPO). T-GRPO is a recursive, tree-structured learning-by-exploration approach: exploration at each hop depends causally on the previous hop, while evaluation draws on multi-path feedback from subsequent hops, which helps identify globally optimal reasoning trajectories. On two TKGQA benchmarks, the MRE-based model consistently surpasses state-of-the-art methods on complex multi-hop queries, and further analysis indicates improved interpretability and robustness to noisy temporal annotations.

📝 Abstract

Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in the knowledge graph to answer a given question. However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories. Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for a given question. Valid reasoning trajectories are then selected for supervised fine-tuning, serving as a cold-start strategy. Finally, we introduce Tree-Group Relative Policy Optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experimental results on two TKGQA benchmarks indicate that the proposed MRE-based model consistently surpasses state-of-the-art (SOTA) approaches in handling complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.
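
The tree-structured evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the reward function or how feedback is aggregated, so the sketch assumes that a node's value is the mean of its children's values (backward feedback from subsequent hops) and that advantages are normalized among siblings that share a parent (the group-relative step conditioned on the previous hop). The names `Node`, `backup_value`, `sibling_advantages`, and `example_tree` are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Optional


@dataclass
class Node:
    """One candidate reasoning step (hop) in an exploration tree."""
    name: str
    reward: Optional[float] = None  # terminal reward; set on leaves only
    children: List["Node"] = field(default_factory=list)


def backup_value(node: Node) -> float:
    """Backward feedback (assumed rule): a leaf returns its reward,
    an inner node returns the mean value of its children."""
    if not node.children:
        if node.reward is None:
            raise ValueError(f"leaf {node.name!r} has no reward")
        return node.reward
    return mean([backup_value(child) for child in node.children])


def sibling_advantages(
    node: Node, eps: float = 1e-6, out: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Group-relative advantage computed recursively per sibling group.

    Each group shares the same parent, so every candidate in the group
    is conditioned on the same previous hop."""
    if out is None:
        out = {}
    if not node.children:
        return out
    values = [backup_value(child) for child in node.children]
    mu, sigma = mean(values), pstdev(values)
    for child, value in zip(node.children, values):
        out[child.name] = (value - mu) / (sigma + eps)
        sibling_advantages(child, eps, out)
    return out


def example_tree() -> Node:
    """Two-hop toy tree: hop A always reaches a correct answer,
    hop B reaches one only half of the time."""
    return Node("question", children=[
        Node("A", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)]),
        Node("B", children=[Node("b1", reward=0.0), Node("b2", reward=1.0)]),
    ])
```

Calling `sibling_advantages(example_tree())` gives hop A a positive advantage and hop B a negative one, because A's subtree yields better outcomes on average. In a training loop these per-node advantages would weight the policy-gradient update for the tokens of each hop. That step is omitted here.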

Problem

Research questions and friction points this paper is trying to address.

Temporal Knowledge Graph Question Answering

Multi-hop Reasoning

Error Propagation

Suboptimal Decisions

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-hop Reasoning

Temporal Knowledge Graph QA

Reinforcement Learning