🤖 AI Summary
Point cloud completion suffers from severe geometric incompleteness due to self-occlusion and sensor limitations, degrading downstream robotic tasks such as grasping and obstacle avoidance. To address this, we propose a single-view RGB-guided cross-modal point cloud completion framework. Our method introduces a hierarchical graph attention encoder to jointly model local and global structural continuity in point clouds. An attention-driven multi-scale cross-modal fusion module enables fine-grained alignment between RGB image priors and geometric features. Additionally, a contrastive loss is employed to enhance semantic consistency across modalities. Evaluated on ShapeNet-ViPC and YCB-Complete benchmarks, our approach achieves state-of-the-art performance in both quantitative metrics and qualitative reconstruction fidelity. Crucially, it demonstrates superior generalization and reconstruction accuracy in real-world robotic manipulation scenarios, validating its practical applicability for embodied perception systems.
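The hierarchical graph attention encoder described above selects informative local points via attention over a neighborhood graph. The paper does not specify the exact formulation, so the following is a minimal, hypothetical sketch of one plausible variant: score each point by single-head attention over its k-nearest-neighbor graph and keep the top-scoring fraction. All names (`graph_attention_downsample`, the query/key projections) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def graph_attention_downsample(points, feats, k=4, ratio=0.5, seed=0):
    """Hypothetical sketch of graph attention-based downsampling.

    points: (N, 3) coordinates; feats: (N, C) per-point features.
    Scores each point by attention over its k-NN graph and keeps the
    top `ratio` fraction. Not the paper's actual HGA encoder.
    """
    rng = np.random.default_rng(seed)
    N, C = feats.shape
    # Pairwise squared distances -> k nearest neighbors (excluding self).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nbr = np.argsort(d2, axis=1)[:, :k]                      # (N, k)
    # Random query/key projections stand in for learned weights.
    Wq = rng.standard_normal((C, C)) / np.sqrt(C)
    Wk = rng.standard_normal((C, C)) / np.sqrt(C)
    q = feats @ Wq
    kf = feats @ Wk
    logits = (q[:, None, :] * kf[nbr]).sum(-1) / np.sqrt(C)  # (N, k)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # A point's importance: the strongest attention it receives
    # from any neighbor in the graph.
    score = np.zeros(N)
    np.maximum.at(score, nbr.ravel(), attn.ravel())
    keep = np.argsort(-score)[: max(1, int(N * ratio))]
    return points[keep], feats[keep]
```

Stacking several such layers, each operating on the points kept by the previous one, yields the progressively coarser hierarchy of geometric features the summary refers to.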
📝 Abstract
Point cloud completion is essential for robotic perception and object reconstruction, supporting downstream tasks like grasp planning, obstacle avoidance, and manipulation. However, incomplete geometry caused by self-occlusion and sensor limitations can significantly degrade downstream reasoning and interaction. To address these challenges, we propose HGACNet, a novel framework that reconstructs complete point clouds of individual objects by hierarchically encoding 3D geometric features and fusing them with image-guided priors from a single-view RGB image. At the core of our approach, the Hierarchical Graph Attention (HGA) encoder adaptively selects critical local points through graph attention-based downsampling and progressively refines hierarchical geometric features to better capture structural continuity and spatial relationships. To strengthen cross-modal interaction, we further design a Multi-Scale Cross-Modal Fusion (MSCF) module that performs attention-based feature alignment between hierarchical geometric features and structured visual representations, enabling fine-grained semantic guidance for completion. In addition, we propose a contrastive loss (C-Loss) to explicitly align the feature distributions across modalities, improving completion fidelity under modality discrepancy. Finally, extensive experiments conducted on both the ShapeNet-ViPC benchmark and the YCB-Complete dataset confirm the effectiveness of HGACNet, demonstrating state-of-the-art performance as well as strong applicability in real-world robotic manipulation tasks.
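The abstract's C-Loss aligns image and geometry feature distributions but is not spelled out here. As a hedged illustration of the general idea, the sketch below uses a symmetric InfoNCE objective over paired embeddings, where matched image/geometry pairs sit on the diagonal of a cosine-similarity matrix. The function name, temperature, and formulation are assumptions, not the paper's definition.

```python
import numpy as np

def contrastive_loss(img_feats, geo_feats, tau=0.1):
    """Symmetric InfoNCE over paired image/geometry embeddings.

    A generic stand-in for the paper's C-Loss (exact form unspecified).
    img_feats, geo_feats: (B, D); row i of each forms a positive pair.
    """
    # L2-normalize so dot products are cosine similarities.
    a = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    b = geo_feats / np.linalg.norm(geo_feats, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (B, B) similarity matrix
    B = logits.shape[0]

    def xent(l):
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(B), np.arange(B)].mean()

    # Average image->geometry and geometry->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each object's image embedding toward its own geometric embedding and pushes it away from other objects in the batch, which is one standard way to reduce the modality discrepancy the abstract mentions.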