ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance

📅 2026-01-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses modality imbalance in existing vision-language-action (VLA) models, which often over-rely on proprioceptive signals and consequently suffer from the "false completion" problem: executing failed actions while incorrectly judging them as successful. To mitigate this, the authors propose ReViP, a framework that introduces a vision-proprioception rebalancing mechanism: an external vision-language model acts as a task-stage observer that extracts real-time visual semantic cues, and feature-wise linear modulation dynamically adjusts the coupling strength between visual and proprioceptive inputs. They further establish the first benchmark specifically designed to evaluate false completion, incorporating controllable perturbations such as object dropping. Experiments demonstrate that ReViP significantly reduces false-completion rates and improves task success on this benchmark, LIBERO, RoboTwin 2.0, and real-world robotic platforms, outperforming strong baselines.

📝 Abstract
Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We attribute this to modality imbalance, where policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations, which drive a Vision-Proprioception Feature-wise Linear Modulation to enhance environmental awareness and reduce state-driven errors. Moreover, to evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
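The Vision-Proprioception Feature-wise Linear Modulation described in the abstract can be sketched as standard FiLM conditioning: the task-stage cue embedding predicts a per-channel scale and shift that rescale the proprioceptive features before they are fused with the vision-language features. This is a minimal illustration only; the dimensions, weight matrices, and function names below are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(proprio_feat, cue_embed, W_gamma, W_beta):
    """FiLM conditioning: the cue embedding (here standing in for the
    external VLM's task-stage cue) predicts per-channel scale gamma and
    shift beta applied to the proprioceptive features before fusion."""
    gamma = cue_embed @ W_gamma  # shape (d_proprio,)
    beta = cue_embed @ W_beta    # shape (d_proprio,)
    return gamma * proprio_feat + beta

# Hypothetical dimensions: 16-d proprioceptive feature, 32-d cue embedding.
d_p, d_c = 16, 32
W_gamma = rng.standard_normal((d_c, d_p)) * 0.01  # illustrative small init
W_beta = rng.standard_normal((d_c, d_p)) * 0.01

proprio = rng.standard_normal(d_p)  # stand-in robot state features
cue = rng.standard_normal(d_c)      # stand-in VLM task-stage cue

fused_input = film_modulate(proprio, cue, W_gamma, W_beta)
print(fused_input.shape)  # (16,)
```

With this formulation, a cue signaling "object dropped" can drive gamma toward suppressing the proprioceptive channels, forcing the policy to weight visual evidence more heavily, which is the rebalancing intuition the abstract describes.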
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
false completion
modality imbalance
proprioception
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
modality imbalance
false completion
feature-wise linear modulation
visual grounding
👥 Authors
Zhuohao Li (Sun Yat-sen University; Shenzhen Loop Area Institute)
Yinghao Li (Applied Scientist, AWS; NLP)
Jian-Jian Jiang (Sun Yat-sen University; Robotics)
Lang Zhou (Sun Yat-sen University; Shenzhen Loop Area Institute)
Tianyu Zhang (Beijing Institute of Technology; Shenzhen Loop Area Institute)
Wei-Shi Zheng (Professor, Sun Yat-sen University; Computer Vision, Pattern Recognition, Machine Learning)