🤖 AI Summary
In multi-stage tasks, preference-based reinforcement learning (PbRL) suffers from sparse preference signals and inefficient policy learning due to temporal misalignment—e.g., comparing trajectory segments across distinct stages. This work is the first to formally identify, characterize, and empirically validate this stage misalignment problem, and it proposes a time-aligned preference learning framework. Methodologically, the authors design a prior-free contrastive learning mechanism that dynamically partitions tasks into stages based on temporal distance and prioritizes intra-stage preference comparisons; this mechanism integrates seamlessly into standard PbRL pipelines. Experiments demonstrate substantial improvements over state-of-the-art methods on multi-stage benchmarks, while maintaining competitive performance on single-stage tasks. Human evaluation further confirms that the automatically discovered stages align with cognitive stage boundaries. The core contributions are: (i) exposing stage misalignment as a fundamental bottleneck in PbRL; (ii) introducing the first end-to-end temporally aligned preference learning paradigm; and (iii) enabling fully unsupervised, adaptive stage discovery without manual annotation.
📝 Abstract
Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: comparing segments from mismatched stages, such as movement versus manipulation, yields uninformative feedback and thus hinders policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address it, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance and then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages without predefined task knowledge and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR's superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.
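The two-step idea in the abstract—approximate stages from a learned temporal distance, then restrict preference comparisons to segments within the same stage—can be sketched as below. This is a minimal illustrative sketch, not STAIR's actual method: `assign_stages`, the distance threshold, and the simple consecutive-distance grouping stand in for the paper's contrastive temporal-distance objective, and all names are hypothetical.

```python
# Illustrative sketch of stage-aligned preference comparison (assumed names;
# the real method learns temporal distance with a contrastive objective).
import numpy as np

def assign_stages(embeddings, threshold=1.0):
    """Group consecutive states into stages: start a new stage whenever the
    embedding-space (temporal) distance to the previous state exceeds the
    threshold."""
    stages = [0]
    for prev, cur in zip(embeddings[:-1], embeddings[1:]):
        if np.linalg.norm(cur - prev) > threshold:
            stages.append(stages[-1] + 1)  # large jump: new stage begins
        else:
            stages.append(stages[-1])      # temporally close: same stage
    return np.array(stages)

def intra_stage_pairs(segment_stages):
    """Keep only preference-comparison pairs whose segments share a stage,
    so annotators never compare, e.g., navigation against manipulation."""
    idx = range(len(segment_stages))
    return [(i, j) for i in idx for j in idx
            if i < j and segment_stages[i] == segment_stages[j]]

# Toy trajectory: two clusters of temporally close states.
emb = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
stages = assign_stages(emb, threshold=1.0)
print(stages)                      # [0 0 0 1 1]
print(intra_stage_pairs(stages))   # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

In a full PbRL pipeline, the surviving pairs would be the ones queried for human preferences and used to train the reward model; cross-stage pairs are filtered out (or down-weighted) before querying.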