🤖 AI Summary
Offline goal-conditioned reinforcement learning trains a single generalist policy over many goals, yet each test episode queries it with only one goal, leaving the pre-trained policy under-specialized at evaluation time. This paper proposes Goal-Conditioned Test-Time Training (GC-TTT), which adapts the policy at inference: a self-supervised criterion selects transitions from the offline dataset that are relevant to the current state and high-quality with respect to the evaluation goal, and the policy is fine-tuned on this data for a few gradient steps. The routine is applied in a receding-horizon fashion as the trajectory is rolled out, requiring neither online interaction nor a larger model. Across high-dimensional loco-navigation and manipulation tasks, GC-TTT delivers significant gains over standard offline pre-training, and at comparable inference compute it outperforms baselines that only scale model capacity.
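To make the select-then-fine-tune routine concrete, here is a minimal sketch. Everything in it is illustrative rather than the authors' implementation: the function names (`select_relevant_transitions`, `test_time_finetune`, `value_fn`, `policy`), the nearest-neighbor proxy for relevance to the current state, the value-quantile proxy for quality with respect to the evaluation goal, and the behavior-cloning objective are all assumptions made for exposition.

```python
# Hedged sketch of GC-TTT-style data selection and test-time fine-tuning.
# Assumes a pre-trained goal-conditioned value estimate value_fn(states, goals)
# and a deterministic goal-conditioned policy(states, goals) -> actions.
import torch


def select_relevant_transitions(dataset, current_state, eval_goal, value_fn,
                                relevance_k=256, quality_quantile=0.5):
    """Keep transitions that are (i) close to the current state and
    (ii) estimated to be high-quality for reaching the evaluation goal."""
    states = dataset["state"]                      # (N, state_dim)
    k = min(relevance_k, states.shape[0])

    # Relevance: nearest states to the current rollout state (a simple
    # stand-in for the paper's relevance criterion).
    dists = torch.linalg.norm(states - current_state, dim=-1)
    rel_idx = torch.topk(-dists, k=k).indices

    # Quality: among relevant transitions, keep those whose estimated value
    # toward the evaluation goal exceeds a quantile threshold.
    goals = eval_goal.expand(k, -1)
    values = value_fn(states[rel_idx], goals).reshape(-1)
    keep = values >= torch.quantile(values, quality_quantile)
    return {key: v[rel_idx][keep] for key, v in dataset.items()}


def test_time_finetune(policy, optimizer, batch, eval_goal, num_steps=3):
    """Fine-tune the policy on the selected data for a few gradient steps
    (a behavior-cloning-style objective is used here as a placeholder)."""
    goals = eval_goal.expand(batch["state"].shape[0], -1)
    for _ in range(num_steps):
        pred = policy(batch["state"], goals)
        loss = torch.nn.functional.mse_loss(pred, batch["action"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```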
📝 Abstract
Foundation models compress a large amount of information in a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goal-conditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We find similarly that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at minimal compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.
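The receding-horizon application during evaluation could look roughly like the loop below. The environment interface (`env.reset`, `env.step`), the adaptation interval, and the helpers from the previous sketch are hypothetical placeholders, and the policy is assumed to start from the offline pre-trained checkpoint for each evaluation goal.

```python
# Hedged sketch of a receding-horizon evaluation loop: the policy is re-adapted
# to the current state every `adapt_every` environment steps as the episode is
# rolled out. Interfaces are assumed, not taken from the paper's code.
def evaluate_with_gc_ttt(env, policy, optimizer, dataset, value_fn,
                         eval_goal, max_steps=500, adapt_every=50):
    state = env.reset(goal=eval_goal)
    done, t = False, 0
    while not done and t < max_steps:
        if t % adapt_every == 0:
            # Re-select data relevant to where the rollout currently is,
            # then fine-tune for a few gradient steps before continuing.
            batch = select_relevant_transitions(dataset, state, eval_goal, value_fn)
            test_time_finetune(policy, optimizer, batch, eval_goal)
        action = policy(state.unsqueeze(0), eval_goal.unsqueeze(0)).squeeze(0)
        state, reward, done, _ = env.step(action)
        t += 1
    return done  # whether the goal was reached within the step budget
```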