🤖 AI Summary
In multi-stage tasks, preference-based reinforcement learning (PbRL) suffers from sparse preference signals and inefficient policy learning due to temporal misalignment—e.g., comparing trajectory segments across distinct stages. This work is the first to formally identify, characterize, and empirically validate this stage misalignment problem, and it proposes a time-aligned preference learning framework. Methodologically, the authors design a prior-free contrastive learning mechanism that dynamically partitions tasks into stages based on temporal distance and prioritizes intra-stage preference comparisons; this mechanism integrates seamlessly into standard PbRL pipelines. Experiments demonstrate substantial improvements over state-of-the-art methods on multi-stage benchmarks, while maintaining competitive performance on single-stage tasks. Human evaluation further confirms that the automatically discovered stages align with cognitive stage boundaries. The core contributions are: (i) exposing stage misalignment as a fundamental bottleneck in PbRL; (ii) introducing the first end-to-end temporally aligned preference learning paradigm; and (iii) enabling fully unsupervised, adaptive stage discovery without manual annotation.
📝 Abstract
Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: comparing segments from mismatched stages, such as movement versus manipulation, yields uninformative feedback and thus hinders policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address it, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance and then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages without predefined task knowledge and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR's superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.
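The two-step idea in the abstract—approximate stages from a learned temporal distance, then restrict preference comparisons to segments within the same stage—can be sketched as below. This is a minimal illustrative sketch, not STAIR's actual method: `assign_stages`, the distance threshold, and the simple consecutive-distance grouping stand in for the paper's contrastive temporal-distance objective, and all names are hypothetical.

```python
# Illustrative sketch of stage-aligned preference comparison (assumed names;
# the real method learns temporal distance with a contrastive objective).
import numpy as np

def assign_stages(embeddings, threshold=1.0):
    """Group consecutive states into stages: start a new stage whenever the
    embedding-space (temporal) distance to the previous state exceeds the
    threshold."""
    stages = [0]
    for prev, cur in zip(embeddings[:-1], embeddings[1:]):
        if np.linalg.norm(cur - prev) > threshold:
            stages.append(stages[-1] + 1)  # large jump: new stage begins
        else:
            stages.append(stages[-1])      # temporally close: same stage
    return np.array(stages)

def intra_stage_pairs(segment_stages):
    """Keep only preference-comparison pairs whose segments share a stage,
    so annotators never compare, e.g., navigation against manipulation."""
    idx = range(len(segment_stages))
    return [(i, j) for i in idx for j in idx
            if i < j and segment_stages[i] == segment_stages[j]]

# Toy trajectory: two clusters of temporally close states.
emb = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
stages = assign_stages(emb, threshold=1.0)
print(stages)                      # [0 0 0 1 1]
print(intra_stage_pairs(stages))   # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

In a full PbRL pipeline, the surviving pairs would be the ones queried for human preferences and used to train the reward model; cross-stage pairs are filtered out (or down-weighted) before querying.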