Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

📅 2026-01-20
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing out-of-distribution (OOD) detection methods, which overly rely on textual features and struggle with distributional shifts in the visual domain—particularly under both near- and far-OOD scenarios. To overcome this, the authors propose MM-OOD, a novel framework that systematically leverages the multimodal reasoning and generative capabilities of multimodal large language models (MLLMs). For near-OOD samples, MM-OOD performs zero-shot inference by jointly utilizing image and text prompts. For far-OOD cases, it introduces a three-stage "sketch–generate–elaborate" pipeline that enhances multimodal prompting through generated visual exemplars. By moving beyond conventional unimodal or text-only paradigms, MM-OOD achieves state-of-the-art performance on multimodal benchmarks such as Food-101 and demonstrates strong scalability on ImageNet-1K.

📝 Abstract
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
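The two-branch flow described in the abstract can be outlined as a minimal Python sketch. This is purely illustrative structure under stated assumptions: every function name, the prompt wording, and the stubbed MLLM/generator calls are hypothetical placeholders, not the authors' actual implementation or API.

```python
# Illustrative sketch of the MM-OOD two-branch pipeline from the abstract.
# All names here are hypothetical; the MLLM and the visual generator are
# stubbed out so only the control flow is shown.

def query_mllm(image, text_prompt):
    """Stub for a multimodal LLM call; returns an ID-vs-OOD verdict.

    A real system would send the image plus prompt to an MLLM. Here the
    verdict is faked from the image's label for structural illustration.
    """
    return "ID" if image.get("label") in text_prompt else "OOD"

def near_ood_detect(image, id_class_names):
    # Near-OOD branch: feed the ID image and class-name text prompt
    # directly to the MLLM for zero-shot inference.
    prompt = f"Is this image one of: {', '.join(id_class_names)}?"
    return query_mllm(image, prompt)

def far_ood_detect(image, id_class_names):
    # Far-OOD branch: the sketch-generate-elaborate pipeline.
    # 1) Sketch: describe candidate outlier concepts in text.
    outlier_concepts = [f"not-{c}" for c in id_class_names]  # placeholder
    # 2) Generate: produce visual OOD exemplars for those concepts (stubbed).
    exemplars = [{"label": c} for c in outlier_concepts]
    # 3) Elaborate: combine ID classes and exemplars into a multimodal prompt.
    prompt = (f"ID classes: {', '.join(id_class_names)}. "
              f"Outlier exemplars: {', '.join(e['label'] for e in exemplars)}.")
    return query_mllm(image, prompt)
```

Swapping the stubs for real MLLM and image-generation calls would turn this outline into a working detector; the outline only conveys how the two branches differ in what they feed the model.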
Problem

Research questions and friction points this paper is trying to address.

Out-of-Distribution detection
multimodal learning
image space
zero-shot learning
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Out-of-Distribution Detection
Sketch-Generate-Elaborate Framework
Zero-shot OOD
Multimodal Reasoning
Haoran Xu
Zhejiang University
Embodied AI · Robotics · Computer Vision · 3D Vision
Yanlin Liu
Tsinghua University
Zizhao Tong
University of Chinese Academy of Sciences
Jiaze Li
Zhejiang University
MLLM · Federated Learning
Kexue Fu
City University of Hong Kong
HCI · Storytelling · Creativity · Cognition · Human-AI Collaboration
Yuyang Zhang
Graduate Student, Harvard University
Reinforcement Learning · Control Theory
Longxiang Gao
Professor, Qilu University of Technology; Adjunct Professor, University of Southern Queensland
Edge AI · Federated Learning · Machine Learning · Quantum Computing
Shuaiguang Li
Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250013, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, 250014, China
Xingyu Li
Mohamed bin Zayed University of Artificial Intelligence
Yanran Xu
RWTH Aachen University
Changwei Wang
Shandong Computer Science Center
Multimodal Learning · Embodied AI · Edge Intelligent Computing · AI for Healthcare · Safety Alignment