SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying semantic-driven active perception with robust, viewpoint-invariant manipulation. To this end, we propose SaPaVe, an end-to-end framework that decouples camera control from manipulation actions and employs a bottom-up training strategy to jointly optimize perception and action. Our key contributions include ActiveManip-Bench, the first benchmark for active manipulation, and ActiveViewPose-200K, a large-scale dataset of semantic camera motions. The framework integrates semantic camera pretraining, 3D geometric awareness, and joint vision-language-action modeling. Experiments demonstrate that SaPaVe significantly outperforms baselines such as GR00T N1 in both simulation and real-world settings, achieving up to a 31.25% improvement in task success rate on physical robotic tasks.
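
The decoupled action design and the bottom-up training schedule can be summarized in a short sketch. The PyTorch code below is a minimal illustration under assumptions: the module names, feature dimensions, action dimensionalities, and optimizer setup are ours for illustration, not details taken from the paper.

```python
# Minimal sketch of decoupled camera/manipulation heads over a shared backbone.
# All names and dimensions are hypothetical; the paper's actual architecture
# and losses are not specified on this page.
import torch
import torch.nn as nn

class DecoupledVLAPolicy(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Shared vision-language backbone (hypothetical stand-in).
        self.backbone = nn.Linear(feat_dim, feat_dim)
        # Separate heads: camera control is NOT folded into the arm's action space.
        self.camera_head = nn.Linear(feat_dim, 6)   # e.g. 6-DoF camera motion
        self.manip_head = nn.Linear(feat_dim, 7)    # e.g. 7-DoF arm action

    def forward(self, obs_feat: torch.Tensor):
        h = torch.relu(self.backbone(obs_feat))
        return self.camera_head(h), self.manip_head(h)

policy = DecoupledVLAPolicy()

# Bottom-up training, stage 1: optimize semantic camera control alone on
# large-scale camera-motion data (ActiveViewPose-200K-style pairs).
cam_params = list(policy.backbone.parameters()) + list(policy.camera_head.parameters())
stage1_opt = torch.optim.Adam(cam_params, lr=1e-4)

# Stage 2: jointly optimize both action types on hybrid data.
stage2_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
```

One plausible reading of this design is that keeping camera actions out of the manipulation head's output space prevents viewpoint exploration from distorting the arm's action distribution, while the shared backbone still lets the two capabilities inform each other.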

📝 Abstract
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200K image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and π₀, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
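
To make the dataset description concrete, the sketch below shows what a single image-language-camera movement pair might look like. The field names and the 6-DoF delta-pose encoding are assumptions for illustration; the released ActiveViewPose-200K schema may differ.

```python
# Hypothetical record layout for one image-language-camera movement pair.
# The actual schema of ActiveViewPose-200K is not documented on this page.
from dataclasses import dataclass

@dataclass
class ActiveViewPoseRecord:
    image_path: str      # RGB observation from the current viewpoint
    instruction: str     # language goal describing the desired view
    camera_motion: list  # assumed 6-DoF delta pose [dx, dy, dz, droll, dpitch, dyaw]

sample = ActiveViewPoseRecord(
    image_path="frames/000001.png",
    instruction="move the camera to see the occluded handle",
    camera_motion=[0.05, 0.0, 0.02, 0.0, 0.1, 0.0],
)
```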
Problem

Research questions and friction points this paper is trying to address.

active perception
vision-language-action models
robotic manipulation
viewpoint invariance
semantic-driven perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

active perception
vision-language-action models
decoupled action learning
3D geometry-aware module
active manipulation benchmark