WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited capability of multimodal large language models (MLLMs) in representing dynamic modalities such as audio and video, this paper introduces WAVE, the first MLLM-based universal audio-visual embedding framework, which establishes a unified text-audio-video tri-modal embedding space. Methodologically, WAVE combines hierarchical cross-modal feature fusion, prompt-aware embedding generation, and joint multi-modal, multi-task training to achieve fine-grained semantic alignment and any-to-any cross-modal retrieval. Its core contributions are: (1) the first LLM-driven, prompt-controllable, modality-agnostic unified embedding representation; and (2) state-of-the-art performance on the MMEB-v2 video understanding benchmark, along with significant improvements over prior methods in audio and video-to-audio retrieval and multimodal question answering.

📝 Abstract
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
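
Once all modalities live in the same embedding space, the any-to-any retrieval described in the abstract reduces at query time to nearest-neighbour search over normalized vectors. The minimal sketch below is not the authors' released code; the embedding dimension and the random placeholder vectors stand in for outputs of the WAVE encoder, which is not shown here.

```python
import numpy as np

def retrieve(query_emb, candidate_embs, k=5):
    """Rank candidates by cosine similarity to a single query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    idx = np.argsort(-scores)[:k]       # indices of the k most similar candidates
    return idx, scores[idx]

# Placeholder embeddings standing in for WAVE outputs (dimension 1024 is an assumption).
audio_query = np.random.randn(1024).astype(np.float32)        # one embedded audio clip
video_corpus = np.random.randn(500, 1024).astype(np.float32)  # 500 embedded candidate videos

top_idx, top_scores = retrieve(audio_query, video_corpus, k=5)
print(top_idx, top_scores)
```

Because the space is shared, the same routine serves text-to-video, audio-to-video, or any other direction simply by swapping which modality supplies the query and which supplies the candidates.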
Problem

Research questions and friction points this paper is trying to address.

Creating unified embeddings for text, audio, and video
Enabling any-to-any cross-modal retrieval between modalities
Generating prompt-aware embeddings for user instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified text-audio-video embedding via multimodal LLM
Hierarchical feature fusion for cross-modal retrieval (sketched after this list)
Joint multi-modal training for prompt-aware embeddings
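
The abstract and the bullets above name hierarchical feature fusion but do not spell out the operator. One plausible reading, sketched below purely as an illustration (the layer count, dimensions, masked mean pooling, and projection size are assumptions, not the paper's confirmed design), is a learned weighted mix of hidden states taken from several LLM layers, pooled into a single normalized embedding.

```python
import torch
import torch.nn as nn

class HierarchicalFusionHead(nn.Module):
    """Illustrative sketch: fuse hidden states from several LLM layers into one embedding.
    Layer choice, pooling, and projection size are assumptions, not the paper's exact design."""
    def __init__(self, hidden_dim=4096, embed_dim=1024, num_layers=4):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.ones(num_layers))  # learned per-layer mixing weights
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, layer_hidden_states, attention_mask):
        # layer_hidden_states: list of [batch, seq, hidden] tensors from selected LLM layers
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, layer_hidden_states))       # weighted layer mix
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (fused * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # masked mean pooling
        return nn.functional.normalize(self.proj(pooled), dim=-1)          # unit-norm embedding

# Illustrative usage with dummy tensors:
head = HierarchicalFusionHead()
hidden = [torch.randn(2, 16, 4096) for _ in range(4)]  # 4 assumed layers, batch=2, seq=16
mask = torch.ones(2, 16)
emb = head(hidden, mask)                                # -> [2, 1024] unit-norm embeddings
```

Prompt-awareness would then come from prepending the user instruction to the token sequence before the hidden states are pooled, so that the same clip yields different embeddings under different prompts.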
Authors
Changli Tang (Tsinghua University): Automatic Speech Recognition, Video Understanding
Qinfan Xiao (Tsinghua University)
Ke Mei (Tencent WeChat): deep learning, computer vision
Tianyi Wang (WeChat Vision, Tencent Inc.)
Fengyun Rao (WeChat Vision, Tencent Inc.)
Chao Zhang (Tsinghua University)