🤖 AI Summary
This work addresses the limitations in sketch-based 3D shape retrieval, where existing multi-view feature aggregation methods often neglect geometric relationships and multi-level details, and exhibit insufficient zero-shot generalization. To overcome these challenges, the authors propose a hierarchical multi-view graph neural network that constructs a view-level graph structure to model inter-view geometric dependencies through local graph convolutions and global attention mechanisms. A novel view selector is introduced to enable hierarchical graph coarsening, progressively expanding the receptive field while suppressing redundant information. Furthermore, for the first time, CLIP text embeddings are leveraged as semantic prototypes to align sketch and 3D features into a shared semantic space, facilitating category-agnostic matching and zero-shot generalization. Experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches on two public benchmarks under both category-level and zero-shot evaluation settings.
📝 Abstract
Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that belong to the same category as an input hand-drawn sketch. The task poses two core challenges. First, existing methods typically apply simplified aggregation strategies to independently encoded 3D multi-view features, ignoring the geometric relationships between views and multi-level details, which results in weak 3D representations. Second, traditional SBSR methods are limited to the categories seen during training and therefore perform poorly in zero-shot scenarios. To address these challenges, we propose the Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph and capture adjacent geometric dependencies and cross-view message passing via local graph convolution and global attention. A view selector is further introduced to perform hierarchical graph coarsening, which progressively enlarges the receptive field of graph convolution and mitigates interference from redundant views, leading to more discriminative hierarchical 3D representations. To enable category-agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. We use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval under the same model architecture. Extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods under both category-level and zero-shot settings.
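To make the pipeline in the abstract more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes a k-nearest-neighbor graph over camera positions, a parameter-free mean aggregation in place of learned graph convolution, a linear scoring vector as a stand-in view selector, and plain vectors standing in for CLIP text embeddings. The global attention branch and all training logic are omitted, and every function name here (`knn_graph`, `local_graph_conv`, `select_views`, `hierarchical_embed`, `classify`) is hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def knn_graph(positions, k):
    """Assumed graph construction: link each view to its k nearest cameras."""
    neighbors = []
    for i, p in enumerate(positions):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(positions) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def local_graph_conv(feats, neighbors):
    """Mean over each view and its neighbors (no learned weights)."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [feats[i]] + [feats[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def select_views(feats, positions, score_vec, keep):
    """Stand-in view selector: score each view linearly, keep the top `keep`."""
    scores = [sum(f * w for f, w in zip(feat, score_vec)) for feat in feats]
    ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original view order
    return [feats[i] for i in kept], [positions[i] for i in kept]


def hierarchical_embed(feats, positions, score_vec, k=2, levels=(6, 3)):
    """Alternate convolution and coarsening, summing a readout per level."""
    readouts = []
    for keep in levels:
        # The graph is rebuilt on the surviving views, so neighbors lie
        # farther apart at each level and the receptive field grows.
        neighbors = knn_graph(positions, min(k, len(positions) - 1))
        feats = local_graph_conv(feats, neighbors)
        feats, positions = select_views(feats, positions, score_vec, keep)
        readouts.append([max(col) for col in zip(*feats)])  # max-pool readout
    return [sum(col) for col in zip(*readouts)]


def classify(embedding, prototypes):
    """Pick the class whose prototype is most cosine-similar to the embedding."""
    sims = {name: cosine(embedding, proto) for name, proto in prototypes.items()}
    return max(sims, key=sims.get), sims


if __name__ == "__main__":
    random.seed(0)
    dim, n_views = 8, 12
    # Random features stand in for per-view CNN features of one 3D shape.
    views = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_views)]
    # Cameras placed evenly on a circle around the shape (an assumption).
    cams = [
        (math.cos(2 * math.pi * i / n_views), math.sin(2 * math.pi * i / n_views))
        for i in range(n_views)
    ]
    # Random vectors stand in for CLIP text embeddings of class names.
    prototypes = {
        name: [random.gauss(0, 1) for _ in range(dim)]
        for name in ("chair", "table", "airplane")
    }
    shape_emb = hierarchical_embed(views, cams, [1.0] * dim)
    label, _ = classify(shape_emb, prototypes)
    print("nearest prototype:", label)
```

Because sketches and 3D shapes would both be compared against the same text-derived prototypes, a class unseen during training only needs a prototype vector to take part in matching, which is the intuition behind the zero-shot setting described above.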