Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Omni Language Models (OLMs) exhibit an inherent "co-growth" between detail and hallucination: pushing for finer-grained audio-visual description also increases hallucinated content, and no dedicated benchmark exists to measure this trade-off. To address this, the authors propose Omni-Detective, an agentic, tool-calling data generation pipeline that autonomously produces highly detailed yet minimally hallucinatory multimodal captions. Training on this data yields two models: Audio-Captioner for audio-only detailed perception and Omni-Captioner for audio-visual detailed perception. The authors further introduce Omni-Cloze, a cloze-style benchmark for stable, efficient, and reliable evaluation of detailed audio, visual, and audio-visual captions. Under a cascade evaluation protocol, Audio-Captioner achieves the best performance among open-source models on MMAU and MMAR, surpassing Gemini 2.5 Flash and matching Gemini 2.5 Pro; Omni-Captioner sets a new state of the art on VDC and achieves the best detail-hallucination trade-off on the video-SALMONN 2 test set.

📝 Abstract
Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
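
The exact item format and scoring rule of Omni-Cloze are not given on this page; as a rough illustration of why cloze-style scoring is stable and cheap compared to free-form caption judging, here is a minimal Python sketch. It assumes each test item pairs blanked ground-truth spans with the option a model selected for each blank, and reports pooled fill-in accuracy. ClozeItem and score_cloze are hypothetical names, not the benchmark's API.

```python
# Minimal sketch of a cloze-style captioning metric in the spirit of
# Omni-Cloze. The real benchmark's item format and scoring rules are not
# specified on this page; here we assume each item provides blanked spans
# with a gold answer, and the model (or a judge reading the model's
# detailed caption) selects one option per blank. All names below are
# hypothetical.
from dataclasses import dataclass

@dataclass
class ClozeItem:
    blanks: list[str]       # gold answer per blank, e.g. ["piano", "applause"]
    predictions: list[str]  # option chosen for each blank from the model's caption

def score_cloze(items: list[ClozeItem]) -> float:
    """Fraction of blanks filled with the gold option, pooled over all items."""
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for item in items
        for p, g in zip(item.predictions, item.blanks)
    )
    total = sum(len(item.blanks) for item in items)
    return correct / total if total else 0.0

if __name__ == "__main__":
    demo = [ClozeItem(blanks=["piano", "applause"], predictions=["piano", "laughter"])]
    print(f"cloze accuracy: {score_cloze(demo):.2f}")  # -> 0.50
```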
Problem

Research questions and friction points this paper is trying to address.

Limited capacity of Omni Language Models to capture fine-grained multimodal details
Inherent co-growth between detail and hallucination in current audio-visual models
Absence of a dedicated benchmark for evaluating omni detailed perception capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni-Detective, an agentic, tool-calling data pipeline that generates detailed yet minimally hallucinatory multimodal captions (see the sketch after this list)
Two captioning models trained on the generated data: Audio-Captioner for audio-only and Omni-Captioner for audio-visual detailed perception
Omni-Cloze, a novel cloze-style benchmark for stable and reliable evaluation of detailed captions
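
To make the pipeline idea concrete, below is a minimal sketch of an agentic, tool-calling captioning loop, assuming a draft-then-verify design in which claims that no tool can support are dropped as likely hallucinations. This is an illustration only: the paper's actual tools, prompts, multi-stage cleaning, and agent logic are not described on this page, and every name in the sketch (detective_caption, draft_caption, the tool registry) is hypothetical.

```python
# Minimal sketch of an agentic, tool-calling captioning loop in the spirit
# of Omni-Detective. draft_caption stands in for the OLM that proposes
# claim sentences about the media; each tool stands in for a verifier
# (e.g. an ASR system or object detector) that vouches for a claim.
from typing import Callable

Tool = Callable[[str], bool]  # returns True if the claim is supported by the media

def detective_caption(
    media: str,
    draft_caption: Callable[[str], list[str]],  # OLM: media -> list of claim sentences
    tools: dict[str, Tool],                     # e.g. {"asr": ..., "detector": ...}
    max_rounds: int = 3,
) -> str:
    """Iteratively draft claims and keep only those some tool can verify."""
    kept: list[str] = []
    for _ in range(max_rounds):
        for claim in draft_caption(media):
            if claim in kept:
                continue
            # A claim survives if at least one tool vouches for it;
            # unverifiable claims are treated as potential hallucinations.
            if any(tool(claim) for tool in tools.values()):
                kept.append(claim)
    return " ".join(kept)

if __name__ == "__main__":
    caption = detective_caption(
        "clip_001.mp4",
        draft_caption=lambda m: ["a dog barks", "a man sings opera"],
        tools={"stub": lambda claim: "dog" in claim},  # toy verifier
    )
    print(caption)  # -> "a dog barks"
```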
👥 Authors
Ziyang Ma (Shanghai Jiao Tong University)
Ruiyang Xu (Shanghai Jiao Tong University)
Zhenghao Xing (The Chinese University of Hong Kong)
Yunfei Chu (Alibaba Group)
Yuxuan Wang (Alibaba Group)
Jinzheng He (Alibaba Qwen Team, Zhejiang University)
Jin Xu (Alibaba Group)
Pheng-Ann Heng (The Chinese University of Hong Kong)
Kai Yu (Shanghai Jiao Tong University)
Junyang Lin (Qwen Team, Alibaba Group & Peking University)
Eng Siong Chng (Nanyang Technological University)
Xie Chen (Shanghai Jiao Tong University, Shanghai Innovation Institution)