🤖 AI Summary
Current video understanding models struggle to effectively fuse and interpret temporally aligned visual, audio, and speech modalities, limiting precise localization and description of fine-grained audio-visual events (e.g., speaker articulation, background music, and audience reactions occurring together). To address this, we design a tri-modal large language model architecture and propose a query-driven, modality-adaptive weighting connector that enables robust inference when modalities are missing. We also introduce TriSense-2M, the first large-scale tri-modal alignment dataset with 2 million instances, covering long-form videos and sparse modality combinations, built through LLM-guided synthetic data generation and cross-modal alignment representation learning. Our method achieves significant improvements over state-of-the-art methods across multiple benchmarks; notably, it retains over 92% of full-modality performance when audio is absent. Code and the TriSense-2M dataset will be publicly released.
📝 Abstract
Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene such as "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
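The abstract's core idea, query-dependent reweighting of modality features that degrades gracefully when a modality is absent, can be illustrated with a minimal sketch. This is not the paper's actual Query-Based Connector (which is presumably a learned module inside the LLM); the dot-product scoring, the function name, and the feature dimensions below are all illustrative assumptions. Masking missing modalities with `-inf` before the softmax sends their weights to exactly zero, which is one simple way to realize the "flexible combinations of available inputs" behavior described above.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax; -inf entries get weight exactly 0.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def query_based_connector(query, modality_feats, available):
    """Toy query-conditioned fusion (illustrative, not the paper's module).

    query:          (d,) query embedding
    modality_feats: dict of modality name -> (d,) feature vector
    available:      set of modality names present for this input
    """
    names = sorted(modality_feats)
    # Score each modality against the query; mask out missing ones.
    scores = np.array([
        query @ modality_feats[n] if n in available else -np.inf
        for n in names
    ])
    weights = softmax(scores)
    fused = sum(w * modality_feats[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))

# Example: fuse visual + speech while audio is dropped.
rng = np.random.default_rng(0)
d = 8
feats = {m: rng.normal(size=d) for m in ("visual", "audio", "speech")}
query = rng.normal(size=d)
fused, w = query_based_connector(query, feats, available={"visual", "speech"})
```

Here the missing audio stream contributes nothing, and the remaining weights renormalize over the available modalities, mirroring the modality-dropout robustness the abstract claims.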