🤖 AI Summary
Current Omni Language Models (OLMs) exhibit limited fine-grained audio-visual perception, and pushing them toward more detailed descriptions tends to amplify hallucination, an inherent "co-growth" between detail and hallucination that we identify; systematic benchmarks for such detailed perception are also absent. To address these issues, we propose Omni-Detective, a tool-augmented, agentic data generation pipeline that combines hallucination-suppressing tool invocation, multi-stage data cleaning, and fine-grained prompt engineering to construct a high-quality audio-visual captioning dataset. Leveraging this dataset, we train two captioning models: Audio-Captioner for audio-only detailed perception and Omni-Captioner for audio-visual detailed perception. We further introduce Omni-Cloze, a cloze-style (fill-in-the-blank) benchmark for stable, efficient, and reliable evaluation of detailed audio, visual, and audio-visual descriptions. Under the cascade evaluation protocol, Audio-Captioner outperforms Gemini 2.5 Flash on MMAU and MMAR, achieving the best results among open-source models, and performs comparably to Gemini 2.5 Pro; Omni-Captioner sets a new state of the art on VDC and attains the best detail-hallucination trade-off on the video-SALMONN 2 test set, significantly advancing open-source capabilities in omni detailed perception.
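To make the pipeline concrete, here is a minimal sketch of what a tool-augmented, agentic captioning loop of this kind could look like. The helpers `call_tool` and `call_llm`, the tool names, and the prompts are illustrative assumptions, not the paper's actual implementation; the multi-stage cleaning is simplified to a single verification pass.

```python
# Minimal sketch of an Omni-Detective-style agentic captioning loop.
# `call_tool` and `call_llm` are hypothetical placeholders; the actual
# tool set, prompts, and cleaning stages in the paper may differ.

def call_tool(name: str, clip: str) -> str:
    """Invoke a specialist tool (e.g., ASR, OCR, sound-event tagging)
    to collect grounded, fine-grained observations from the clip."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Query the agent's underlying language model."""
    raise NotImplementedError

def generate_caption(clip: str, tools=("asr", "ocr", "events")) -> str:
    # 1) Gather evidence with specialist tools so details are grounded
    #    in the signal rather than imagined (hallucination suppression).
    evidence = "\n".join(f"[{name}] {call_tool(name, clip)}" for name in tools)

    # 2) Draft a caption constrained to the collected observations.
    draft = call_llm(
        "Write a detailed audio-visual caption using ONLY these "
        "observations:\n" + evidence
    )

    # 3) Cleaning pass (multi-stage cleaning, simplified to one stage):
    #    strike any claim not supported by the evidence.
    return call_llm(
        "Remove any sentence not supported by the observations.\n"
        "Observations:\n" + evidence + "\nCaption:\n" + draft
    )
```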
📝 Abstract
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline that integrates tool-calling to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
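To illustrate why a cloze-style evaluation can be more stable than free-form judge ratings, here is a minimal scoring sketch in the spirit of Omni-Cloze. The item format and the `judge` helper are assumptions for illustration; the benchmark's actual protocol and prompts may differ.

```python
# Minimal sketch of cloze-style caption scoring in the spirit of
# Omni-Cloze. The item schema and `judge` helper are hypothetical.

def judge(caption: str, passage: str, options: list[str]) -> int:
    """Placeholder: return the index of the option best supported by
    the caption (e.g., via an LLM judge or an entailment model)."""
    raise NotImplementedError

def score_cloze(caption: str, items: list[dict]) -> float:
    """Score a candidate caption against fill-in-the-blank items.

    Each item holds a `passage` with one blank, candidate `options`,
    and the index of the `correct` option. Accuracy over blanks yields
    a discrete, reproducible score rather than a free-form rating.
    """
    hits = 0
    for item in items:
        pick = judge(caption, item["passage"], item["options"])
        hits += int(pick == item["correct"])
    return hits / len(items)

# Example item (hypothetical):
# {"passage": "The speaker's tone is ____.",
#  "options": ["calm", "angry", "excited"], "correct": 0}
```

Because each blank is a constrained multiple-choice decision, repeated runs agree far more often than open-ended judge scores, which is what makes the assessment stable and efficient.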