🤖 AI Summary
Current Omni Language Models (OLMs) exhibit limited fine-grained audio-visual perception, and pushing them toward more detailed descriptions tends to amplify hallucination, an inherent "co-growth" between detail and hallucination that we identify; systematic benchmarks for such detailed perception are also absent. To address these issues, we propose Omni-Detective, a tool-augmented, agentic data generation pipeline that combines hallucination-suppressing tool invocation, multi-stage data cleaning, and fine-grained prompt engineering to construct a high-quality audio-visual captioning dataset. Leveraging this dataset, we train two captioning models: Audio-Captioner for audio-only detailed perception and Omni-Captioner for audio-visual detailed perception. We further introduce Omni-Cloze, a cloze-style (fill-in-the-blank) benchmark for stable, efficient, and reliable evaluation of detailed audio, visual, and audio-visual descriptions. Under the cascade evaluation protocol, Audio-Captioner outperforms Gemini 2.5 Flash on MMAU and MMAR, achieving the best results among open-source models, and performs comparably to Gemini 2.5 Pro; Omni-Captioner sets a new state of the art on VDC and attains the best detail-hallucination trade-off on the video-SALMONN 2 test set, significantly advancing open-source capabilities in omni detailed perception.
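To make the pipeline concrete, here is a minimal sketch of what a tool-augmented, agentic captioning loop of this kind could look like. The helpers `call_tool` and `call_llm`, the tool names, and the prompts are illustrative assumptions, not the paper's actual implementation; the multi-stage cleaning is simplified to a single verification pass.

```python
# Minimal sketch of an Omni-Detective-style agentic captioning loop.
# `call_tool` and `call_llm` are hypothetical placeholders; the actual
# tool set, prompts, and cleaning stages in the paper may differ.

def call_tool(name: str, clip: str) -> str:
    """Invoke a specialist tool (e.g., ASR, OCR, sound-event tagging)
    to collect grounded, fine-grained observations from the clip."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Query the agent's underlying language model."""
    raise NotImplementedError

def generate_caption(clip: str, tools=("asr", "ocr", "events")) -> str:
    # 1) Gather evidence with specialist tools so details are grounded
    #    in the signal rather than imagined (hallucination suppression).
    evidence = "\n".join(f"[{name}] {call_tool(name, clip)}" for name in tools)

    # 2) Draft a caption constrained to the collected observations.
    draft = call_llm(
        "Write a detailed audio-visual caption using ONLY these "
        "observations:\n" + evidence
    )

    # 3) Cleaning pass (multi-stage cleaning, simplified to one stage):
    #    strike any claim not supported by the evidence.
    return call_llm(
        "Remove any sentence not supported by the observations.\n"
        "Observations:\n" + evidence + "\nCaption:\n" + draft
    )
```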
📝 Abstract
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline that integrates tool-calling to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
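To illustrate why a cloze-style evaluation can be more stable than free-form judge ratings, here is a minimal scoring sketch in the spirit of Omni-Cloze. The item format and the `judge` helper are assumptions for illustration; the benchmark's actual protocol and prompts may differ.

```python
# Minimal sketch of cloze-style caption scoring in the spirit of
# Omni-Cloze. The item schema and `judge` helper are hypothetical.

def judge(caption: str, passage: str, options: list[str]) -> int:
    """Placeholder: return the index of the option best supported by
    the caption (e.g., via an LLM judge or an entailment model)."""
    raise NotImplementedError

def score_cloze(caption: str, items: list[dict]) -> float:
    """Score a candidate caption against fill-in-the-blank items.

    Each item holds a `passage` with one blank, candidate `options`,
    and the index of the `correct` option. Accuracy over blanks yields
    a discrete, reproducible score rather than a free-form rating.
    """
    hits = 0
    for item in items:
        pick = judge(caption, item["passage"], item["options"])
        hits += int(pick == item["correct"])
    return hits / len(items)

# Example item (hypothetical):
# {"passage": "The speaker's tone is ____.",
#  "options": ["calm", "angry", "excited"], "correct": 0}
```

Because each blank is a constrained multiple-choice decision, repeated runs agree far more often than open-ended judge scores, which is what makes the assessment stable and efficient.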