Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

πŸ“… 2025-05-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current multimodal large language models (MLLMs) perform well on unimodal benchmarks but exhibit limited capability in audio-visual joint reasoning and fine-grained temporal alignment, with no systematic evaluation available. Method: We introduce Daily-Omni, the first audio-visual synchronized reasoning benchmark for everyday scenarios, comprising 684 multi-source videos and 1,197 multiple-choice QA items covering six cross-modal understanding tasks. We propose a novel frame-level timestamp-based evaluation paradigm for audio-visual temporal alignment and design an automated QA generation pipeline. Additionally, we build Daily-Omni-Agentβ€”a training-free agent integrating open-source vision-language models (VLMs), audio-language models (ALMs), and automatic speech recognition (ASR) systems. Contribution/Results: Experiments reveal significant deficiencies in existing MLLMs for audio-visual collaborative reasoning. We demonstrate that lightweight temporal alignment enables substantial performance gains from VLM+ALM fusion, establishing a new reproducible baseline and evaluation framework for multimodal temporal understanding.

πŸ“ Abstract
Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Question Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1,197 multiple-choice QA pairs across 6 major tasks; 2) the Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation, and QA optimization, and significantly improves the efficiency of human evaluation and the scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent utilizing an open-source Visual Language Model (VLM), Audio Language Model (ALM), and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Code and benchmark are available at https://github.com/Lliar-liar/Daily-Omni.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' ability to process synchronized audio-visual data
Creating a scalable benchmark for audio-visual QA tasks
Improving audio-visual integration using temporal alignment techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual QA benchmark with diverse daily scenarios
Automatic annotation pipeline for efficient QA generation
Training-free agent combining VLM, ALM, and ASR
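The agent's "simple temporal alignment" idea can be sketched as interleaving timestamped outputs from the separate models into one chronological context for a language model to reason over. The following is a minimal illustrative sketch only; the function name, event format, and all data are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch: merge timestamped visual captions (VLM), audio event
# descriptions (ALM), and speech transcripts (ASR) into a single timeline
# string that a downstream LLM could use to answer cross-modal questions.

def align_modalities(visual, audio, speech):
    """Merge (start_sec, text) events from each modality into one timeline."""
    events = (
        [(t, "VISUAL", txt) for t, txt in visual]
        + [(t, "AUDIO", txt) for t, txt in audio]
        + [(t, "SPEECH", txt) for t, txt in speech]
    )
    events.sort(key=lambda e: e[0])  # chronological order across modalities
    return "\n".join(f"[{t:05.1f}s] {mod}: {txt}" for t, mod, txt in events)

# Toy example: a dog barks just after it appears on screen, so a question like
# "what made the sound?" becomes answerable from adjacent timeline entries.
timeline = align_modalities(
    visual=[(0.0, "a dog enters the yard"), (4.0, "the dog jumps at the gate")],
    audio=[(4.5, "loud barking")],
    speech=[(6.0, '"Quiet, Rex!"')],
)
print(timeline)
```

Sorting by timestamp is the entire alignment step here, which matches the paper's claim that even lightweight temporal alignment yields substantial gains when fusing VLM and ALM outputs.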
Ziwei Zhou
Undergraduate Student, Fudan University (Artificial Intelligence)
Rui Wang
Computation and Artificial Intelligence Innovative College, Fudan University
Zuxuan Wu
Fudan University