🤖 AI Summary
This work addresses the challenges of deploying vision-language-action (VLA) models in high-dimensional dexterous manipulation, where compounding execution errors, sample-inefficient exploration, and hardware risks limit real-world applicability. The authors propose BORA, an offline-to-online RL post-training framework that combines offline value guidance with online human-in-the-loop residual adaptation. In the offline phase, an action-conditioned critic that takes both the VLM's cognition tokens and action chunks as input provides value-based supervision. During online execution, the VLA backbone is frozen and a lightweight chunk-wise residual mechanism corrects execution deviations. This preserves the pretrained policy as a stable prior while handling real-world execution discrepancies and supporting generalization to unseen objects. Experiments across five real-world dexterous tasks show that BORA raises the average success rate by 33% in absolute terms under standard settings and achieves up to a 43% improvement in unseen-object generalization, outperforming pure imitation learning and conventional decoupled reinforcement learning baselines.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding vision-language understanding in real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies because of high-dimensional hand control and compounding execution errors. Real-world RL post-training is therefore essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. Yet high-dimensional dexterous exploration in the real world often leads to temporal inconsistency, sample inefficiency, and hardware risks. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions rather than visual context alone. In the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism that mitigates real-world execution errors and further corrects the offline-learned intents in the physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA corrects execution discrepancies and adapts to real-world physical variations while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen-object generalization.
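The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not BORA's implementation: the linear critic, the mean-pooling of tokens, the `tanh` bound on the residual, and the names `critic_value`, `apply_chunk_residual`, `scale`, and `limit` are all assumptions chosen for illustration. The sketch only shows what it means for a critic to score cognition tokens together with an action chunk, and for a bounded residual to be added to each action in a chunk produced by a frozen base policy.

```python
import math


def critic_value(cognition_tokens, action_chunk, weights, bias=0.0):
    """Toy action-conditioned critic (hypothetical, linear).

    Scores the concatenation of mean-pooled cognition tokens and the
    flattened action chunk, so the value depends on the actions as well
    as on the visual-language context.
    """
    dim = len(cognition_tokens[0])
    pooled = [
        sum(token[i] for token in cognition_tokens) / len(cognition_tokens)
        for i in range(dim)
    ]
    flat_actions = [x for action in action_chunk for x in action]
    features = pooled + flat_actions
    if len(features) != len(weights):
        raise ValueError("weights must match pooled tokens + flattened chunk")
    return bias + sum(w * f for w, f in zip(weights, features))


def apply_chunk_residual(base_chunk, residual_chunk, scale=0.1, limit=1.0):
    """Add a bounded residual to every action of a frozen-policy chunk.

    Each residual component is squashed with tanh and scaled, so the
    correction never exceeds `scale` per dimension (an assumed design),
    and the result is clipped to the assumed action range [-limit, limit].
    """
    if len(base_chunk) != len(residual_chunk):
        raise ValueError("base and residual chunks must have equal length")
    corrected = []
    for base_action, residual_action in zip(base_chunk, residual_chunk):
        step = [
            max(-limit, min(limit, a + scale * math.tanh(r)))
            for a, r in zip(base_action, residual_action)
        ]
        corrected.append(step)
    return corrected


# Usage: the same context scored with two different action chunks.
tokens = [[1.0, 3.0], [3.0, 5.0]]
weights = [1.0, 0.0, 1.0, 1.0, 0.0, 2.0]
q_a = critic_value(tokens, [[0.5, -0.5], [1.0, 0.0]], weights)
q_b = critic_value(tokens, [[1.0, -0.5], [1.0, 0.0]], weights)
corrected = apply_chunk_residual([[0.0, 0.95]], [[0.0, 100.0]])
```

Because the critic's input includes the action chunk, `q_a` and `q_b` differ even though the tokens are identical, which is the property the abstract calls action-conditioned value guidance. Keeping the residual small and bounded is one plausible way to leave the frozen base policy as the dominant prior.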