Human-in-the-loop Online Rejection Sampling for Robotic Manipulation

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the instability of reinforcement learning (RL) in robotic manipulation—caused by sparse rewards and inaccurate value estimation—and the poor generalization and limited robustness of imitation learning (IL) under offline paradigms, this paper proposes Hi-ORS: a novel fine-tuning framework for Vision-Language-Action (VLA) models that integrates online rejection sampling, reward-weighted supervision, and human-robot asynchronous co-training. Its core innovation lies in dynamically filtering low-reward trajectories during online execution, explicitly modeling error recovery behaviors, and enabling real-time human intervention for corrective feedback. Evaluated on three contact-intensive real-world robotic tasks, Hi-ORS achieves superior performance over pure RL and IL baselines after only 1.5 hours of physical training. It demonstrates exceptional training stability, robustness to distributional shifts, and, notably, autonomous recovery from complex execution failures—without requiring task-specific reward engineering or large-scale expert demonstrations.

📝 Abstract
Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
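The two core ingredients described above, filtering out negatively rewarded samples and weighting the supervised objective by reward, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation; the `Trajectory` class, the reward threshold, and `nll_fn` are all assumptions introduced here for clarity.

```python
class Trajectory:
    """Toy container for one episode (illustrative, not from the paper)."""
    def __init__(self, steps, reward):
        self.steps = steps      # list of (observation, action) pairs
        self.reward = reward    # scalar episode reward, e.g. 1.0 success / 0.0 fail

def rejection_sample(trajectories, threshold=0.0):
    """Online rejection sampling: keep only trajectories rewarded above threshold."""
    return [t for t in trajectories if t.reward > threshold]

def reward_weighted_loss(trajectories, nll_fn):
    """Mean per-step negative log-likelihood, weighted by episode reward.

    nll_fn(obs, action) -> float is assumed to be the policy's NLL for the pair;
    weighting every intermediate step by the episode reward is what provides the
    dense intermediate-step supervision the abstract refers to.
    """
    total, count = 0.0, 0
    for traj in trajectories:
        for obs, act in traj.steps:
            total += traj.reward * nll_fn(obs, act)
            count += 1
    return total / max(count, 1)
```

In this reading, rejection sampling stabilizes training by never backpropagating through failed rollouts, while the reward weighting reduces to plain behavior cloning on the surviving (successful or human-corrected) trajectories.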
Problem

Research questions and friction points this paper is trying to address.

Stabilizes reinforcement learning for robotic manipulation policies
Enhances training with online human-in-the-loop corrections
Improves error-recovery behaviors in vision-language-action models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online rejection sampling filters negatively rewarded samples
Reward-weighted supervised training provides dense supervision
Asynchronous framework enables human-in-the-loop error correction
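The asynchronous inference-training split with human-in-the-loop corrections can be sketched as two decoupled loops connected by queues. Everything below (the queue names, the toy reward rule, the episode lengths) is a hypothetical illustration under stated assumptions, not the paper's actual system.

```python
import queue
import threading

replay = queue.Queue()             # trajectories flow from the robot to the learner
human_corrections = queue.Queue()  # operator injects corrective actions here

def actor(num_episodes, steps_per_episode):
    """Robot-side loop: prefer a queued human correction over the policy action."""
    for ep in range(num_episodes):
        steps = []
        for t in range(steps_per_episode):
            try:
                action, source = human_corrections.get_nowait(), "human"
            except queue.Empty:
                action, source = f"policy_act_{ep}_{t}", "policy"
            steps.append((source, action))
        # Toy sparse reward (assumption): corrected episodes count as successes.
        reward = 1.0 if any(src == "human" for src, _ in steps) else 0.0
        replay.put((steps, reward))

def learner(num_episodes, kept):
    """Trainer-side loop: consume trajectories, rejecting negatively rewarded ones."""
    for _ in range(num_episodes):
        steps, reward = replay.get()
        if reward > 0:                 # online rejection sampling
            kept.append((steps, reward))  # would feed reward-weighted supervision
```

A usage example: inject one correction, run both loops concurrently, and observe that only the corrected episode survives the filter.

```python
human_corrections.put("nudge_gripper_left")
kept = []
a = threading.Thread(target=actor, args=(2, 2))
l = threading.Thread(target=learner, args=(2, kept))
a.start(); l.start(); a.join(); l.join()
# kept now holds only the episode containing the human correction
```

Because inference and training never block each other, the operator can intervene in real time, and those interventions enter the buffer as positively rewarded demonstrations of error recovery.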
Guanxing Lu
Tsinghua University
VLA · RL · Robotics · 3D Vision
Rui Zhao
Tencent Robotics X
Haitao Lin
Tencent Robotics X
He Zhang
Tencent Robotics X
Yansong Tang
Tsinghua Shenzhen International Graduate School