🤖 AI Summary
To address the limited ability of multimodal large language models (MLLMs) to represent dynamic modalities such as audio and video, this paper introduces WAVE—the first MLLM-based universal audio-visual embedding framework—establishing a unified text-audio-video tri-modal embedding space. Methodologically, WAVE combines hierarchical cross-modal feature fusion with joint multi-modal, multi-task training to achieve fine-grained semantic alignment, any-to-any cross-modal retrieval, and prompt-aware embedding generation tailored to user instructions. Its core contributions are: (1) the first LLM-driven, prompt-controllable, modality-agnostic unified embedding representation; and (2) state-of-the-art performance on the MMEB-v2 video benchmark, along with significant improvements over prior methods in audio and video-to-audio retrieval and multimodal question answering.
📝 Abstract
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
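The any-to-any retrieval the abstract describes reduces, at inference time, to nearest-neighbor search in the shared embedding space: embed a query from any modality, then rank candidates of any other modality by similarity. Below is a minimal, hypothetical sketch of that retrieval step with cosine similarity over toy vectors; the function names and the 4-dimensional "embeddings" are illustrative stand-ins, not WAVE's actual encoder or API.

```python
import numpy as np

def cosine_scores(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a matrix of candidates.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

def retrieve(query_emb: np.ndarray, bank: dict, top_k: int = 2):
    """Rank a bank of embeddings (any modality) against a query embedding."""
    names = list(bank)
    scores = cosine_scores(query_emb, np.stack([bank[n] for n in names]))
    order = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i])) for i in order]

# Toy vectors standing in for embeddings from a shared tri-modal space.
rng = np.random.default_rng(0)
bank = {
    "video_clip_1": rng.normal(size=4),
    "audio_track_1": rng.normal(size=4),
    "text_caption_1": rng.normal(size=4),
}
# A query that is a slight perturbation of one stored item should rank it first.
query = bank["video_clip_1"] + 0.01 * rng.normal(size=4)
print(retrieve(query, bank))
```

Because text, audio, and video all land in the same space, the same `retrieve` call serves text-to-video, video-to-audio, or any other direction without modality-specific ranking logic.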