AI Summary
To address the challenge of semantic understanding for robots in partially observable complex environments, this paper proposes a closed-loop decision-making framework integrating vision-language models (VLMs) with active perception. Methodologically, we introduce Qwen-VL into an end-to-end active perception loop for the first time, enabling direct mapping from high-level semantic queries (e.g., "Where is the red screw?") to low-level camera pose control; we further propose an observation strategy search algorithm based on 3D scene voxel grids and differentiable orientation optimization. Our key contribution is the first VLM-driven differentiable active perception paradigm, supporting real-time closed-loop interaction. Evaluated on Franka Panda and UR5 robotic platforms, our approach achieves a 37.2% accuracy improvement over passive baselines and outperforms TGCSR by 21.5%; under severe occlusion, object localization accuracy reaches 92.4%.
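The "differentiable orientation optimization" mentioned above can be illustrated with a minimal sketch: gradient descent on a camera-alignment objective. This is not the authors' implementation; the yaw/pitch parameterization, the alignment loss, and the finite-difference update are all illustrative assumptions.

```python
# Hedged sketch of differentiable orientation optimization: descend on an
# alignment loss so the camera's forward vector points along target_dir.
# (Assumption: the paper's objective and parameterization differ; this is
# only a minimal stand-in showing the optimization pattern.)
import math

def forward_vector(yaw, pitch):
    """Camera forward direction from yaw/pitch angles (radians)."""
    return (math.cos(pitch) * math.cos(yaw),
            math.cos(pitch) * math.sin(yaw),
            math.sin(pitch))

def alignment_loss(yaw, pitch, target_dir):
    """1 - cosine alignment between camera forward and target direction."""
    f = forward_vector(yaw, pitch)
    return 1.0 - sum(a * b for a, b in zip(f, target_dir))

def optimize_orientation(target_dir, steps=200, lr=0.5, eps=1e-5):
    """Plain gradient descent with central finite-difference gradients."""
    yaw, pitch = 0.0, 0.0
    for _ in range(steps):
        g_yaw = (alignment_loss(yaw + eps, pitch, target_dir)
                 - alignment_loss(yaw - eps, pitch, target_dir)) / (2 * eps)
        g_pitch = (alignment_loss(yaw, pitch + eps, target_dir)
                   - alignment_loss(yaw, pitch - eps, target_dir)) / (2 * eps)
        yaw -= lr * g_yaw
        pitch -= lr * g_pitch
    return yaw, pitch
```

For a target direction along +y, the loop converges to yaw ≈ π/2 with pitch ≈ 0; in a real system the loss would instead be backpropagated through a differentiable renderer or scoring model.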
Abstract
Active perception enables robots to dynamically gather information by adjusting their viewpoints, a crucial capability for interacting with complex, partially observable environments. In this paper, we present AP-VLM, a novel framework that combines active perception with a Vision-Language Model (VLM) to guide robotic exploration and answer semantic queries. Using a 3D virtual grid overlaid on the scene and orientation adjustments, AP-VLM allows a robotic manipulator to intelligently select optimal viewpoints and orientations to resolve challenging tasks, such as identifying objects in occluded or inclined positions. We evaluate our system on two robotic platforms: a 7-DOF Franka Panda and a 6-DOF UR5, across various scenes with differing object configurations. Our results demonstrate that AP-VLM significantly outperforms passive perception methods and baseline models, including Toward Grounded Common Sense Reasoning (TGCSR), particularly in scenarios where fixed camera views are inadequate. The adaptability of AP-VLM in real-world settings shows promise for enhancing robotic systems' understanding of complex environments, bridging the gap between high-level semantic reasoning and low-level control.
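The viewpoint-selection idea in the abstract — overlay a 3D virtual grid on the scene and pick the candidate pose the VLM scores highest — can be sketched as follows. The `vlm_confidence` function is a hypothetical stand-in: in AP-VLM an image captured at the candidate pose would be sent to the VLM, whereas here a toy distance heuristic with an assumed target location substitutes for that call.

```python
# Minimal sketch of grid-based active viewpoint selection.
# Assumption: `vlm_confidence` stands in for a real VLM query; the toy
# heuristic below simply favours poses near an assumed target point.
import itertools
import math

ASSUMED_TARGET = (0.2, 0.0, 0.4)  # illustrative only

def vlm_confidence(pose, query):
    """Stand-in score for how well the view at `pose` answers `query`.
    The real framework would capture an image here and prompt the VLM."""
    return 1.0 / (1.0 + math.dist(pose, ASSUMED_TARGET))

def select_viewpoint(query, grid_size=3, spacing=0.2):
    """Overlay a grid_size^3 virtual grid on the workspace and return the
    candidate camera position with the highest VLM confidence."""
    candidates = [
        (i * spacing, j * spacing, k * spacing)
        for i, j, k in itertools.product(range(grid_size), repeat=3)
    ]
    return max(candidates, key=lambda p: vlm_confidence(p, query))

best = select_viewpoint("Where is the red screw?")
```

In the full system this greedy pick would be one step of the closed loop: move the manipulator to `best`, re-query the VLM, and repeat until the semantic query is resolved.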