🤖 AI Summary
This work addresses odd-one-out anomaly detection in multi-object scenes—identifying visually salient outlier objects relative to their contextual surroundings. The task demands cross-view spatial reasoning, context-aware relational modeling, and strong generalization across object categories and spatial layouts. To this end, we propose a lightweight, efficient architecture grounded in DINO features, incorporating multi-view feature fusion and structured relational modeling; it reduces parameter count and training time by approximately two-thirds while matching state-of-the-art detection accuracy and significantly improving inference efficiency. Furthermore, we establish the first systematic multimodal large language model (MLLM) baseline for this task, empirically revealing its fundamental limitations in structured visual reasoning. Our contributions include a novel, efficient paradigm for vision-based anomaly detection and a rigorous empirical benchmark that advances both methodological design and evaluation standards for generalizable, real-time visual anomaly identification.
📝 Abstract
The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/