🤖 AI Summary
Multimodal out-of-distribution (OOD) detection faces challenges including diverse distribution shifts, difficulty in predicting model performance under unsupervised conditions, and the absence of a universally optimal detector.
Method: This paper proposes the first meta-learning-based framework for automatic selection of multimodal OOD detectors. It integrates multimodal embeddings with hand-crafted distributional statistics and cross-modal meta-features to construct a lightweight historical performance predictor, enabling rapid adaptation to unseen distribution shifts and recommending the optimal detector.
Contribution/Results: Evaluated across 12 heterogeneous test scenarios, our method significantly outperforms ten baseline approaches, achieving substantial gains in recommendation accuracy while maintaining minimal computational overhead. It establishes a novel, generalizable, and efficient paradigm for detector selection in multimodal OOD detection.
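To make the "hand-crafted distributional statistics and cross-modal meta-features" concrete, here is a minimal sketch of how such features could be computed from two modality embedding matrices. The function name, the specific statistics, and the two-modality setup are illustrative assumptions, not the paper's exact feature set.

```python
# Hypothetical meta-feature extractor in the spirit of M3OOD's dataset
# representation; the exact statistics used by the paper are assumptions.
import numpy as np

def meta_features(video_emb: np.ndarray, audio_emb: np.ndarray) -> np.ndarray:
    """Summarize a dataset given per-sample embeddings (n_samples x dim each)."""
    feats = []
    for emb in (video_emb, audio_emb):
        # Simple per-modality distributional statistics.
        feats += [float(emb.mean()), float(emb.std()), float(np.abs(emb).max())]
    # One cross-modal statistic: mean cosine similarity of paired embeddings.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    feats.append(float((v * a).sum(axis=1).mean()))
    return np.asarray(feats)

rng = np.random.default_rng(1)
f = meta_features(rng.normal(size=(100, 64)), rng.normal(size=(100, 64)))
# f has 7 entries: 3 distributional stats per modality + 1 cross-modal stat
```

In practice the embeddings would come from pretrained per-modality encoders, and the real framework would combine many more statistics; this sketch only shows the shape of the pipeline.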
📝 Abstract
Out-of-distribution (OOD) robustness is a critical challenge for modern machine learning systems, particularly as they increasingly operate in multimodal settings involving inputs such as video, audio, and sensor data. Many OOD detection methods have been proposed, each designed to target different distribution shifts. No single OOD detector prevails across all scenarios, which raises the question: how can we automatically select a suitable OOD detection model for a given distribution shift? Because OOD detection is inherently unsupervised, it is difficult to predict model performance and identify a universally best model, and systematically comparing models on new, unseen data is costly or even impractical. To address this challenge, we introduce M3OOD, a meta-learning-based framework for OOD detector selection in multimodal settings. Meta-learning learns from historical model behaviors, enabling rapid adaptation to new distribution shifts with minimal supervision. Our approach combines multimodal embeddings with hand-crafted meta-features that capture distributional and cross-modal characteristics to represent datasets. By leveraging historical performance across diverse multimodal benchmarks, M3OOD can recommend suitable detectors for a new data distribution shift. Experimental evaluation demonstrates that M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.
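The selection step described above can be sketched as a lightweight performance predictor trained on historical results: a regressor maps a dataset's meta-feature vector to predicted scores for each candidate detector, and the top-scoring detector is recommended. The model choice (a random-forest regressor), data shapes, and synthetic inputs below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of meta-learning-based detector selection; synthetic data
# stands in for real meta-features and historical detector performance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_hist, n_detectors, n_meta = 40, 10, 16  # historical datasets, detectors, meta-feature dim

# Meta-features of historical datasets (assumed precomputed, e.g. pooled
# multimodal embeddings plus hand-crafted cross-modal statistics).
meta_X = rng.normal(size=(n_hist, n_meta))
# Historical performance matrix, e.g. AUROC of each detector on each dataset.
perf_Y = rng.uniform(0.5, 1.0, size=(n_hist, n_detectors))

# Lightweight predictor: meta-features -> per-detector performance estimates.
predictor = RandomForestRegressor(n_estimators=100, random_state=0)
predictor.fit(meta_X, perf_Y)

def recommend(meta_features: np.ndarray) -> int:
    """Return the index of the detector predicted to perform best."""
    scores = predictor.predict(meta_features.reshape(1, -1))[0]
    return int(np.argmax(scores))

best = recommend(rng.normal(size=n_meta))  # detector index for a new dataset
```

Because only meta-features need to be computed for a new distribution shift (no detector is run exhaustively), this kind of selector keeps the overhead at test time small, consistent with the efficiency claim in the abstract.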