AI Summary
This work addresses robotic systems' limited capacity for human-like spatial visualization and mental rotation. We propose a method for real-time optimal viewpoint prediction from a single 2D image, enabling active perception-driven scene understanding. Our core contribution is a differentiable, lightweight 3D Viewpoint Quality Field (VQF) that jointly models three geometric-semantic metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. Leveraging a pre-trained image encoder for feature extraction and an end-to-end differentiable decoder network, the approach achieves cross-category zero-shot generalization. Running at 72 FPS on a single GPU, our method significantly improves object recognition accuracy. Extensive experiments on real robotic platforms demonstrate both the effectiveness and robustness of the viewpoint optimization strategy. This work establishes a new paradigm for autonomous perception and motion planning by tightly integrating active sensing with geometric-semantic reasoning.
Abstract
When observing objects, humans benefit from their spatial visualization and mental rotation abilities to envision potential optimal viewpoints based on the current observation. This capability is crucial for enabling robots to achieve efficient and robust scene perception during operation, as optimal viewpoints provide essential and informative features for accurately representing scenes in 2D images, thereby enhancing downstream tasks. To endow robots with this human-like active viewpoint optimization capability, we propose ViewActive, a modernized machine learning approach drawing inspiration from aspect graphs, which provides viewpoint optimization guidance based solely on the current 2D image input. Specifically, we introduce the 3D Viewpoint Quality Field (VQF), a compact and consistent representation of viewpoint quality distribution similar to an aspect graph, composed of three general-purpose viewpoint quality metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. We utilize pre-trained image encoders to extract robust visual and semantic features, which are then decoded into the 3D VQF, allowing our model to generalize effectively across diverse objects, including unseen categories. The lightweight ViewActive network (72 FPS on a single GPU) significantly enhances the performance of state-of-the-art object recognition pipelines and can be integrated into real-time motion planning for robotic applications. Our code and dataset are available at https://github.com/jiayi-wu-umd/ViewActive.
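To make the visual entropy metric concrete: one common way to quantify it is as the Shannon entropy of an image's intensity histogram, where views that reveal richer appearance variation score higher. The abstract does not spell out the exact formulation, so the sketch below is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def visual_entropy(gray: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of a grayscale image's intensity histogram.

    NOTE: illustrative definition only; ViewActive's exact visual entropy
    formulation may differ.
    """
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    p = hist / hist.sum()   # normalize counts to a probability distribution
    p = p[p > 0]            # drop empty bins (convention: 0 * log 0 = 0)
    return float(-np.sum(p * np.log2(p)))

# A uniform-noise image has near-maximal entropy; a constant image has zero.
rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(64, 64))
flat = np.full((64, 64), 128)
print(visual_entropy(noisy))  # close to 8 bits for 256 bins
print(visual_entropy(flat))   # 0.0
```

Under this reading, a viewpoint maximizing visual entropy is one whose rendered image carries the most diverse intensity content, which aligns with the abstract's claim that optimal viewpoints yield more informative 2D features.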