🤖 AI Summary
This work addresses the limited generalization of zero-shot 3D anomaly detection to unseen categories by proposing the first CLIP-based unified framework. Methodologically, it introduces point-pixel joint modeling to fuse rendering and geometric semantics; designs explicit and implicit dual-path anomaly representations; employs hierarchical text prompts, covering both rendering-aware and geometry-aware prompts, together with a cross-hierarchy contrastive alignment mechanism; and incorporates G-aggregation to enhance geometric awareness. The framework supports plug-and-play integration of the RGB modality. Evaluated on unseen objects with highly diverse class semantics, it significantly improves both anomaly detection and segmentation, achieving state-of-the-art results across multiple benchmarks. Notably, it is the first method to enable simultaneous fine-grained spatial localization and holistic anomaly understanding within a generalizable zero-shot 3D anomaly modeling paradigm.
📝 Abstract
In this paper, we aim to transfer CLIP's robust 2D generalization capability to identify 3D anomalies across unseen objects with highly diverse class semantics. To this end, we propose a unified framework that comprehensively detects and segments 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering-pixel representations. We refer to this approach as implicit 3D representation, as it focuses solely on rendering-pixel anomalies and neglects the inherent spatial relationships within point clouds. We then propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, which emphasizes spatial abnormality to uncover abnormal spatial relationships. Accordingly, we propose G-aggregation, which incorporates geometric information to make the aggregated point representations spatially aware. To capture rendering and spatial abnormality simultaneously, PointAD+ introduces hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates the anomaly semantics from both layers to capture generalized anomaly semantics. At test time, PointAD+ can integrate RGB information in a plug-and-play manner, further improving its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.
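The high-level recipe in the abstract, scoring rendered pixels against normal/abnormal text prompts (implicit representation) and then smoothing scores over spatial neighbors (in the spirit of G-aggregation), can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the tensor shapes, the softmax-over-cosine-similarity scoring, the random stand-in features, and the k-NN averaging used here as a stand-in for geometry-aware aggregation are all assumptions.

```python
# Illustrative sketch of zero-shot 3D anomaly scoring via point-pixel
# correspondence. Random vectors stand in for CLIP features; the k-NN
# smoothing is only a simplified stand-in for G-aggregation.
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two feature matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def pixel_anomaly_scores(pixel_feats, text_feats):
    """Score each rendered pixel against [normal, abnormal] text embeddings."""
    sims = cosine_sim(pixel_feats, text_feats)                 # (N, 2)
    probs = np.exp(sims) / np.exp(sims).sum(-1, keepdims=True)  # softmax
    return probs[:, 1]                                          # P(abnormal)

def g_aggregate(points, scores, k=4):
    """Geometry-aware smoothing: average each point's score over its
    k nearest spatial neighbours (simplified stand-in for G-aggregation)."""
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)  # pairwise dists
    nn = np.argsort(d, axis=1)[:, :k]                            # k-NN indices
    return scores[nn].mean(axis=1)

rng = np.random.default_rng(0)
N, D = 64, 32
points = rng.normal(size=(N, 3))        # 3D point cloud
pixel_feats = rng.normal(size=(N, D))   # stand-in CLIP features of rendered pixels
text_feats = rng.normal(size=(2, D))    # stand-in "normal"/"anomaly" prompt embeddings

implicit = pixel_anomaly_scores(pixel_feats, text_feats)  # rendering-level scores
final = g_aggregate(points, implicit)                     # geometry-smoothed scores
print(final.shape)
```

In the actual method, the pixel features would come from CLIP applied to multi-view renderings, mapped back to points via the point-pixel correspondence, and the rendering and geometry layers would be fused rather than chained as here.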