🤖 AI Summary
This work addresses the modality imbalance in existing vision-language-action (VLA) models, which often over-rely on proprioceptive signals and consequently suffer from the "false completion" problem: judging a task successful even when the executed action visibly fails. To mitigate this, the authors propose ReViP, a framework that introduces the first vision-proprioception rebalancing mechanism: an external vision-language model acts as a task-stage observer that extracts real-time visual semantic cues, and feature-wise linear modulation dynamically adjusts the coupling strength between visual and proprioceptive inputs. They further establish the first benchmark specifically designed to evaluate false completion, incorporating controllable perturbations such as object dropping. Experiments demonstrate that ReViP significantly reduces false-completion rates and improves task success rates on this benchmark, as well as on LIBERO, RoboTwin 2.0, and real-world robotic platforms, outperforming strong baselines.
📝 Abstract
Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in a state-dominant bias and false completions, in which the policy reports success despite visible execution failures. We attribute this to modality imbalance: policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance that enhances visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors that adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations; these cues drive a Vision-Proprioception Feature-wise Linear Modulation that enhances environmental awareness and reduces state-driven errors. Moreover, to evaluate false completion, we propose the first False-Completion Benchmark Suite, built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
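The rebalancing mechanism described above can be sketched with standard feature-wise linear modulation (FiLM): a cue embedding conditions per-channel scale and shift parameters applied to the proprioceptive features. This is a minimal illustrative sketch, not the paper's implementation; the module name, dimensions, and fusion layout are all assumptions, since the abstract does not specify architectural details.

```python
# Hedged sketch of Vision-Proprioception Feature-wise Linear Modulation.
# Assumption: the VLM's task-stage cue is already encoded as a fixed-size
# embedding, and proprioception is already encoded as a feature vector.
import torch
import torch.nn as nn


class VisionProprioFiLM(nn.Module):
    def __init__(self, cue_dim: int, proprio_dim: int):
        super().__init__()
        # Map the visual task-stage cue to per-channel scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(cue_dim, 2 * proprio_dim)

    def forward(self, proprio_feat: torch.Tensor, visual_cue: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(visual_cue).chunk(2, dim=-1)
        # FiLM: modulate proprioceptive features by the visual cue, so the
        # policy can adaptively strengthen or damp the state channel.
        return (1.0 + gamma) * proprio_feat + beta


# Usage with toy dimensions (values are arbitrary placeholders)
film = VisionProprioFiLM(cue_dim=512, proprio_dim=64)
cue = torch.randn(1, 512)    # embedding of the VLM-extracted task-stage cue
state = torch.randn(1, 64)   # encoded proprioceptive features
modulated = film(state, cue)
print(modulated.shape)       # torch.Size([1, 64])
```

The `1.0 + gamma` form is a common FiLM initialization trick: with small initial weights the modulation starts near identity, so conditioning is learned gradually rather than disrupting the proprioceptive pathway at the start of training.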