🤖 AI Summary
Existing 3D anomaly detection methods struggle to effectively model the relationships between multi-view and multi-modal features, limiting their anomaly localization accuracy. To address this challenge, this work proposes ModMap, a novel framework that, for the first time, jointly models multi-view and multi-modal information by integrating cross-modal feature mapping with a view-aware feature modulation mechanism. The approach further introduces an omnidirectional ensemble training strategy to generate comprehensive multi-view anomaly scores. Additionally, the project releases a foundational depth encoder tailored to high-resolution industrial 3D data. Evaluated on the SiM3D benchmark, ModMap significantly outperforms current state-of-the-art methods, achieving the best reported performance in both 3D anomaly detection and segmentation.
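The view-aware feature modulation mechanism is only described at a high level here. As a rough, hypothetical illustration (not the authors' code), the sketch below shows one way a cross-modal feature-mapping network could be conditioned on a (source view, target view) pair via FiLM-style per-channel scale and shift; the class name, dimensions, and conditioning scheme are assumptions.

```python
# Hypothetical sketch: a cross-modal feature-mapping MLP whose hidden activations
# are modulated FiLM-style by an embedding of the (source view, target view) pair,
# as one plausible way to model view-dependent relationships.
import torch
import torch.nn as nn

class ViewModulatedMapper(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=1024, num_views=8, view_dim=64):
        super().__init__()
        self.view_emb = nn.Embedding(num_views, view_dim)
        # FiLM generator: produces per-channel scale (gamma) and shift (beta)
        self.film = nn.Linear(2 * view_dim, 2 * hidden_dim)
        self.fc_in = nn.Linear(feat_dim, hidden_dim)    # e.g. RGB patch features in
        self.fc_out = nn.Linear(hidden_dim, feat_dim)   # e.g. depth patch features out

    def forward(self, src_feats, src_view, tgt_view):
        # src_feats: (B, N, feat_dim) patch features from the source modality/view
        cond = torch.cat([self.view_emb(src_view), self.view_emb(tgt_view)], dim=-1)
        gamma, beta = self.film(cond).chunk(2, dim=-1)   # (B, hidden_dim) each
        h = torch.relu(self.fc_in(src_feats))
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)   # feature-wise modulation
        return self.fc_out(h)                            # predicted target features

mapper = ViewModulatedMapper()
rgb_feats = torch.randn(2, 196, 768)                     # dummy patch features
src, tgt = torch.tensor([0, 3]), torch.tensor([5, 1])
pred_depth_feats = mapper(rgb_feats, src, tgt)           # (2, 196, 768)
```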
📝 Abstract
We present ModMap, a natively multi-view and multi-modal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the cross-modal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multi-view ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multi-view and multi-modal setup for 3D anomaly detection and segmentation, demonstrate that ModMap achieves state-of-the-art performance, surpassing previous methods by wide margins.
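The abstract states that anomaly scores are obtained through multi-view ensembling and aggregation, without giving the exact procedure. The snippet below is a minimal, hypothetical sketch of that idea: per-view anomaly maps are computed as a discrepancy between mapped and observed features (in the spirit of the cross-modal feature mapping paradigm) and then ensembled across source views. The function names and the mean/max reductions are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of multi-view ensembling of anomaly scores.
# Per-view anomaly maps are assumed to come from a feature-mapping discrepancy
# (cosine distance between predicted and observed target features); the
# mean/max reductions are illustrative choices, not the paper's.
import numpy as np

def anomaly_map(pred_feats, obs_feats):
    """Per-patch cosine distance between mapped and observed features."""
    pred = pred_feats / np.linalg.norm(pred_feats, axis=-1, keepdims=True)
    obs = obs_feats / np.linalg.norm(obs_feats, axis=-1, keepdims=True)
    return 1.0 - (pred * obs).sum(-1)             # (N_patches,)

def ensemble_scores(maps_per_source):
    """Aggregate maps predicted for one target view from several source views."""
    stacked = np.stack(maps_per_source)            # (num_sources, N_patches)
    seg_map = stacked.mean(0)                      # patch-level segmentation score
    det_score = seg_map.max()                      # object-level detection score
    return seg_map, det_score

# Dummy example: 3 source views mapped to one target view, 196 patches, 768-dim features
rng = np.random.default_rng(0)
maps = [anomaly_map(rng.normal(size=(196, 768)), rng.normal(size=(196, 768)))
        for _ in range(3)]
seg_map, det_score = ensemble_scores(maps)
print(seg_map.shape, float(det_score))
```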