Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

📅 2026-05-27
🤖 AI Summary
Existing return-conditioned sequential models treat return-to-go (RTG) as a generic numerical input, lacking guarantees of consistency between the specified RTG and the actual performance of the resulting policy. This work proposes Q-ALIGN DT, a novel framework that introduces Q-value alignment into return-conditioned supervised learning for the first time. By leveraging a Q-function to provide dense guidance and incorporating RTG perturbation-based fine-tuning, the method enforces alignment between the Q-values of the policy's actions and the input RTG, ensuring that higher RTG values correspond to higher expected returns. This approach yields a structured family of policies that significantly improves the consistency, controllability, and generalization of RTG-conditioned behavior on the D4RL benchmark, and successfully extends to challenging tasks such as velocity tracking, where prior methods fail.
📝 Abstract
Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
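To make the notion of alignment in the abstract concrete, the following is a minimal sketch and not the paper's implementation. It computes returns-to-go from a reward sequence, then shows two illustrative checks: a squared gap between Q estimates and the conditioning RTGs, and a monotonicity test on whether higher RTGs map to higher achieved returns. The function names `returns_to_go`, `alignment_gap`, and `is_monotone`, as well as the squared-error form of the gap, are assumptions made for illustration; the abstract does not specify the actual objective.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: the (optionally discounted) sum of future rewards."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def alignment_gap(q_values, target_rtgs):
    """Hypothetical alignment measure: mean squared gap between Q estimates
    of the chosen actions and the RTGs the model was conditioned on."""
    if len(q_values) != len(target_rtgs):
        raise ValueError("q_values and target_rtgs must have the same length")
    return sum((q - g) ** 2 for q, g in zip(q_values, target_rtgs)) / len(q_values)


def is_monotone(rtgs, achieved_returns):
    """Check that a higher conditioning RTG never yields a lower achieved return."""
    pairs = sorted(zip(rtgs, achieved_returns))
    return all(a1 <= a2 for (_, a1), (_, a2) in zip(pairs, pairs[1:]))
```

A small gap and a passing monotonicity check would be consistent with the controllability the abstract describes, though the evaluation protocol used in the paper may differ.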
Problem

Research questions and friction points this paper is trying to address.

Return-to-Go
Conditioned Sequence Models
Policy Alignment
Q-value Consistency
Supervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-guided alignment
Return-to-Go
Conditioned Sequence Models
Policy controllability
RTG-perturbation