🤖 AI Summary
Large multimodal models (LMMs) have long underperformed specialized detectors in object detection, primarily due to their lack of explicit localization capability and unsuitable training paradigms. This paper introduces LMM-Det, the first systematic demonstration that a pure LMM—without auxiliary detection heads—can perform end-to-end object detection. Our approach comprises three key innovations: (1) instruction dialogue reformulation, unifying detection as structured vision-language question answering; (2) data distribution reorganization, enhancing diversity and spatial consistency in bounding box descriptions; and (3) context-based inference optimization to improve localization recall. On standard benchmarks including COCO and LVIS, LMM-Det achieves substantial gains over baseline LMMs (+12.3 AP@0.5) and approaches the performance of lightweight task-specific detectors. To foster reproducibility and further research, we release our code, models, and curated datasets.
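The first innovation, recasting detection as structured vision-language question answering, can be sketched as a prompt-and-parse round trip. The template and box schema below are illustrative assumptions, not the exact format used by LMM-Det (which is defined in the released code):

```python
import re

# Hypothetical sketch of detection-as-instruction-dialogue.
# Assumed response schema: one object per line, "label [x1, y1, x2, y2]"
# with coordinates normalized to [0, 1]. The real LMM-Det template may differ.

def build_detection_prompt(categories):
    """Build an instruction asking the LMM to detect the given categories."""
    return (
        "Detect all instances of the following categories and reply with "
        "one line per object, formatted as 'label [x1, y1, x2, y2]' with "
        f"coordinates normalized to [0, 1]: {', '.join(categories)}."
    )

# Matches lines like: person [0.10, 0.20, 0.40, 0.90]
_LINE = re.compile(
    r"(?P<label>[\w ]+?)\s*\[\s*(?P<coords>[\d.]+(?:\s*,\s*[\d.]+){3})\s*\]"
)

def parse_detections(response):
    """Parse 'label [x1, y1, x2, y2]' lines from a model response string."""
    boxes = []
    for m in _LINE.finditer(response):
        coords = [float(c) for c in m.group("coords").split(",")]
        boxes.append({"label": m.group("label").strip(), "box": coords})
    return boxes
```

For example, a response of `"person [0.10, 0.20, 0.40, 0.90]\ndog [0.55, 0.60, 0.80, 0.95]"` parses into two labeled boxes, giving the model a purely textual output format that needs no detection head.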
📝 Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capabilities in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results on multimodal tasks like image captioning, visual question answering, and visual grounding, their object detection capabilities exhibit a significant gap compared to specialist detectors. To bridge this gap, we depart from the conventional approach of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a Large Multimodal Model for vanilla object Detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model is applied to object detection, revealing that its recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate through data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra detection modules. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. The datasets, models, and code are available at https://github.com/360CVGroup/LMM-Det.
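The recall gap the abstract describes is measured against ground-truth boxes at a fixed IoU threshold. As a reminder of the metric (standard detection recall, not a method from the paper), a minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def recall_at_iou(preds, gts, thr=0.5):
    """Fraction of ground-truth boxes matched by any prediction at IoU >= thr."""
    matched = sum(any(iou(p, g) >= thr for p in preds) for g in gts)
    return matched / len(gts) if gts else 1.0
```

If an LMM emits too few boxes (e.g. one prediction against two ground-truth objects), `recall_at_iou` drops accordingly, which is the failure mode the paper's data reorganization and inference optimization target.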