WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited capability of multimodal large language models (MLLMs) in representing dynamic modalities such as audio and video, this paper introduces WAVE, the first MLLM-based universal audio-visual embedding framework, which establishes a unified text-audio-video tri-modal embedding space. Methodologically, WAVE combines hierarchical cross-modal feature fusion, prompt-aware embedding generation, and joint multi-modal, multi-task training to achieve fine-grained semantic alignment and any-to-any cross-modal retrieval. Its core contributions are: (1) the first LLM-driven, prompt-controllable, modality-agnostic unified embedding representation; and (2) state-of-the-art performance on the MMEB-v2 video understanding benchmark, along with significant improvements over prior methods in audio and video-to-audio retrieval and multimodal question answering.

📝 Abstract
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
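
Once all modalities live in the same embedding space, the any-to-any retrieval described in the abstract reduces at query time to nearest-neighbour search over normalized vectors. The minimal sketch below is not the authors' released code; the embedding dimension and the random placeholder vectors stand in for outputs of the WAVE encoder, which is not shown here.

```python
import numpy as np

def retrieve(query_emb, candidate_embs, k=5):
    """Rank candidates by cosine similarity to a single query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    idx = np.argsort(-scores)[:k]       # indices of the k most similar candidates
    return idx, scores[idx]

# Placeholder embeddings standing in for WAVE outputs (dimension 1024 is an assumption).
audio_query = np.random.randn(1024).astype(np.float32)        # one embedded audio clip
video_corpus = np.random.randn(500, 1024).astype(np.float32)  # 500 embedded candidate videos

top_idx, top_scores = retrieve(audio_query, video_corpus, k=5)
print(top_idx, top_scores)
```

Because the space is shared, the same routine serves text-to-video, audio-to-video, or any other direction simply by swapping which modality supplies the query and which supplies the candidates.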
Problem

Research questions and friction points this paper is trying to address.

Creating unified embeddings for text, audio, and video
Enabling any-to-any cross-modal retrieval between modalities
Generating prompt-aware embeddings for user instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified text-audio-video embedding via multimodal LLM
Hierarchical feature fusion for cross-modal retrieval (sketched after this list)
Joint multi-modal training for prompt-aware embeddings
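
The abstract and the bullets above name hierarchical feature fusion but do not spell out the operator. One plausible reading, sketched below purely as an illustration (the layer count, dimensions, masked mean pooling, and projection size are assumptions, not the paper's confirmed design), is a learned weighted mix of hidden states taken from several LLM layers, pooled into a single normalized embedding.

```python
import torch
import torch.nn as nn

class HierarchicalFusionHead(nn.Module):
    """Illustrative sketch: fuse hidden states from several LLM layers into one embedding.
    Layer choice, pooling, and projection size are assumptions, not the paper's exact design."""
    def __init__(self, hidden_dim=4096, embed_dim=1024, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.ones(num_layers))  # learned per-layer mixing weights
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, layer_hidden_states, attention_mask):
        # layer_hidden_states: list of [batch, seq, hidden] tensors from selected LLM layers
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, layer_hidden_states))       # weighted layer mix
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (fused * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # masked mean pooling
        return nn.functional.normalize(self.proj(pooled), dim=-1)          # unit-norm embedding

# Illustrative usage with dummy tensors:
head = HierarchicalFusionHead()
hidden = [torch.randn(2, 16, 4096) for _ in range(4)]  # 4 assumed layers, batch=2, seq=16
mask = torch.ones(2, 16)
emb = head(hidden, mask)                                # -> [2, 1024] unit-norm embeddings
```

Prompt-awareness would then come from prepending the user instruction to the token sequence before the hidden states are pooled, so that the same clip yields different embeddings under different prompts.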
Authors
Changli Tang (Tsinghua University): Automatic Speech Recognition, Video Understanding
Qinfan Xiao (Tsinghua University)
Ke Mei (Tencent WeChat): deep learning, computer vision
Tianyi Wang (WeChat Vision, Tencent Inc.)
Fengyun Rao (WeChat Vision, Tencent Inc.)
Chao Zhang (Tsinghua University)