Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the instability and error accumulation that Q-learning suffers on long-horizon tasks, where estimation errors at later states propagate backward through temporal-difference (TD) updates and compound. To mitigate this, the paper proposes Long-Horizon Q-Learning (LQL), which builds on a previously observed optimality lower bound implied by n-step action sequences: following any realized action sequence for n steps and then acting optimally cannot be better, in expectation, than acting optimally from the start. LQL turns this inequality into a stabilization mechanism by using a hinge loss to penalize Q-value estimates that violate it, all within the standard Q-learning framework and without auxiliary networks or additional forward passes. Combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step and n-step TD learning at similar runtime.
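The lower bound mentioned in the summary can be written out as follows. This is a sketch in standard notation, not a formula quoted from the paper: the discount factor \gamma, rewards r_{t+k}, parameters \theta, target parameters \bar{\theta}, and the exact shape of the penalty are assumptions made here for illustration.

```latex
% n-step optimality lower bound: the expectation is over trajectories that
% start from (s_t, a_t) and follow the observed behavior for n steps.
Q^*(s_t, a_t) \;\ge\; \mathbb{E}\!\left[\, \sum_{k=0}^{n-1} \gamma^{k} r_{t+k}
    \;+\; \gamma^{n} \max_{a'} Q^*(s_{t+n}, a') \right]

% One plausible hinge penalty on a learned Q_\theta (assumed form):
% it is zero when the bound holds and grows linearly with the violation.
\mathcal{L}_{\text{hinge}}(\theta) \;=\; \max\!\left(0,\;
    \sum_{k=0}^{n-1} \gamma^{k} r_{t+k}
    + \gamma^{n} \max_{a'} Q_{\bar{\theta}}(s_{t+n}, a')
    - Q_\theta(s_t, a_t) \right)
```

For n = 1 the inequality holds with equality and reduces to the Bellman optimality equation, which is why the bound only adds information for n > 1.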

📝 Abstract

Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
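To make the mechanism described in the abstract concrete, here is a minimal NumPy sketch of a 1-step TD loss combined with a hinge penalty on violations of the n-step lower bound. It is an illustration under stated assumptions, not the authors' implementation: the function name `lql_style_loss`, the `weight` coefficient, the unsquared hinge, and the way bootstrap values are passed in are all hypothetical, and terminal-state handling is omitted.

```python
import numpy as np

def lql_style_loss(q_sa, rewards, q_next_max, q_boot_max, gamma=0.99, weight=1.0):
    """Hypothetical sketch: 1-step TD loss plus a hinge penalty on n-step bounds.

    q_sa        (B,)   Q(s_t, a_t) from the online network
    rewards     (B, n) observed rewards r_t ... r_{t+n-1}
    q_next_max  (B,)   max_a' Q_target(s_{t+1}, a'), used by the TD target
    q_boot_max  (B,)   max_a' Q_target(s_{t+n}, a'), used by the lower bound
    """
    q_sa = np.asarray(q_sa, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    q_next_max = np.asarray(q_next_max, dtype=float)
    q_boot_max = np.asarray(q_boot_max, dtype=float)
    n = rewards.shape[1]

    # Standard 1-step TD regression toward r_t + gamma * max_a' Q(s_{t+1}, a').
    td_target = rewards[:, 0] + gamma * q_next_max
    td_loss = np.mean((q_sa - td_target) ** 2)

    # n-step lower bound: discounted observed rewards plus a bootstrapped tail.
    discounts = gamma ** np.arange(n)
    lower_bound = rewards @ discounts + gamma ** n * q_boot_max

    # Hinge: only penalize estimates that fall below the bound.
    violation = np.maximum(0.0, lower_bound - q_sa)
    hinge_loss = np.mean(violation)

    return td_loss + weight * hinge_loss, td_loss, hinge_loss

# Two transitions with n = 2; only the first one violates its bound.
total, td, hinge = lql_style_loss(
    q_sa=[1.0, 5.0],
    rewards=[[1.0, 1.0], [0.0, 1.0]],
    q_next_max=[0.0, 0.0],
    q_boot_max=[2.0, 2.0],
    gamma=0.5,
)
print(total, td, hinge)
```

In a deep-learning framework the targets and bounds would normally be treated as constants (no gradient), so the penalty only pushes `q_sa` upward when it sits below the bound. Unlike an n-step TD target, the bound is one-sided: estimates above it are left to the TD term.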

Problem

Research questions and friction points this paper is trying to address.

off-policy reinforcement learning

long-horizon learning

temporal-difference error

error compounding

Q-learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-Horizon Q-Learning

n-Step Inequalities

Optimality Tightening