Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes In-Context Preference-based Reinforcement Learning (ICPRL), a novel paradigm that enables in-context reinforcement learning without any explicit reward supervision, a longstanding challenge in settings where rewards are ambiguous or costly to obtain. ICPRL is studied at two feedback granularities, Immediate Preference-based RL (I-PRL) with per-step preferences and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons, each integrated with a Transformer architecture and a preference-native training strategy. By leveraging only stepwise or trajectory-level preference feedback, the method supports both pretraining and deployment without access to numerical rewards. Empirical evaluations across dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL achieves in-context generalization on par with fully reward-supervised approaches, despite operating entirely without reward signals.

๐Ÿ“ Abstract
In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals or optimal action labels. Experiments on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
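Stripped of the transformer, the core idea of learning from trajectory-level comparisons reduces to fitting a score function from pairwise preference data, commonly done with a Bradley-Terry model. The sketch below is illustrative only, not the paper's ICPRL method: it fits a linear per-step score from synthetic trajectory comparisons, and all function and variable names (`traj_score`, `bt_loss_and_grad`, `true_w`) are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def traj_score(theta, traj):
    """Trajectory score: sum of linear per-step scores theta @ x over steps."""
    return sum(float(theta @ step) for step in traj)

def bt_loss_and_grad(theta, traj_a, traj_b, pref):
    """Bradley-Terry loss and gradient for one comparison.

    pref = 1 means trajectory A was preferred over B, and the model
    posits P(A > B) = sigmoid(score(A) - score(B)).
    """
    diff = traj_score(theta, traj_a) - traj_score(theta, traj_b)
    p = 1.0 / (1.0 + np.exp(-diff))
    loss = -(pref * np.log(p + 1e-12) + (1 - pref) * np.log(1 - p + 1e-12))
    # Gradient of the logistic loss w.r.t. theta.
    feat_diff = sum(traj_a) - sum(traj_b)
    grad = (p - pref) * feat_diff
    return loss, grad

# Synthetic data: a hidden direction true_w generates preference labels,
# and SGD on the Bradley-Terry loss should recover it up to scale.
dim, steps = 4, 5
true_w = rng.normal(size=dim)
theta = np.zeros(dim)
for _ in range(2000):
    ta = [rng.normal(size=dim) for _ in range(steps)]
    tb = [rng.normal(size=dim) for _ in range(steps)]
    pref = 1 if traj_score(true_w, ta) > traj_score(true_w, tb) else 0
    _, g = bt_loss_and_grad(theta, ta, tb, pref)
    theta -= 0.05 * g

# Cosine similarity between the learned and hidden directions.
cos = theta @ true_w / (np.linalg.norm(theta) * np.linalg.norm(true_w))
print(round(float(cos), 2))
```

The per-step (I-PRL-style) variant would apply the same comparison loss to individual step features rather than trajectory sums; the paper's contribution is doing this inside a transformer policy trained in context, which this toy omits.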
Problem

Research questions and friction points this paper is trying to address.

in-context reinforcement learning
reward-free learning
preference-based feedback
transformer models
sequential decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning
Preference-Based Learning
Reward-Free RL
Transformer Policy
Preference-Native Framework