Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in acoustic representation learning and cross-modal alignment, hindering text-to-audio retrieval performance. To address this, we propose Vela: the first framework that integrates a voice large language model (Voice LLM) with customized prompt engineering, in-context learning, and text-pair distillation—enabling general-purpose multimodal embedding learning using *text-only supervision*, without any audio annotations. This unimodal training paradigm effectively bridges the modality gap. Additionally, we introduce the first benchmark specifically designed for audio–text retrieval under long-text inputs, fine-grained semantics, and compositional queries—exposing critical limitations of existing models like CLAP. Extensive experiments demonstrate that Vela significantly outperforms CLAP on both standard and newly proposed benchmarks, with substantial gains in robustness and zero-shot generalization.
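For intuition, here is a minimal sketch of the kind of prompt-based embedding extraction the summary describes: the input is wrapped in a task prompt (optionally preceded by in-context examples) and the hidden state of the final token is pooled into a single vector. The checkpoint name, prompt wording, and last-token pooling are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of prompt-based embedding extraction from a voice LLM.
# The model name, prompt text, and last-token pooling are assumptions for
# illustration; the paper's exact prompts and backbone are not given here.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "some/voice-llm"  # hypothetical checkpoint, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

PROMPT = (
    # one assumed in-context example followed by the actual query
    "Example: a dog barking in the distance -> <embedding>\n"
    "Compress the following input into one token for retrieval: {x}"
)

@torch.no_grad()
def embed_text(text: str) -> torch.Tensor:
    """Wrap the input in the task prompt and pool the final hidden state."""
    inputs = tokenizer(PROMPT.format(x=text), return_tensors="pt")
    hidden = model(**inputs).last_hidden_state         # (1, seq_len, dim)
    emb = hidden[:, -1, :]                              # last-token pooling (assumed)
    return torch.nn.functional.normalize(emb, dim=-1)   # unit norm for cosine retrieval
```

Audio inputs would be routed through the voice LLM's audio front-end in the same way, so that text and audio land in a shared embedding space.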

📝 Abstract
Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap across various modalities. We then propose a single-modality training approach, where the model is trained exclusively on text pairs. Our experiments show that Vela outperforms traditional CLAP models in standard text-audio retrieval tasks. Furthermore, we introduce new benchmarks that expose CLAP models' limitations in handling long texts and complex retrieval tasks. In contrast, Vela, by harnessing the capabilities of MLLMs, demonstrates robust performance in these scenarios. Our code will soon be available.
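As a rough illustration of the single-modality training idea described above, the sketch below trains on text pairs with a symmetric InfoNCE-style contrastive loss. The loss form, temperature, and pairing strategy are assumptions; the abstract does not specify them.

```python
# A minimal sketch of text-only (single-modality) contrastive training on
# text pairs, assuming a symmetric InfoNCE objective.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of paired texts (e.g. caption / paraphrase)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                      # scaled cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)    # matching pairs lie on the diagonal
    # symmetric loss: each text should retrieve its partner, and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Because only text pairs are needed, no audio annotations enter the training loop; the modality gap is bridged by the voice LLM's shared representation space rather than by paired audio-text supervision.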
Problem

Research questions and friction points this paper is trying to address.

Adapting MLLMs to generate universal multimodal embeddings
Bridging the modality gap across diverse modalities effectively
Outperforming CLAP in text-audio retrieval tasks (a recall@k scoring sketch follows this list)
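As a reference for the retrieval comparison in the last item, here is a minimal recall@k scorer for text-to-audio retrieval over precomputed, unit-normalized embeddings. The benchmarks and exact metrics reported in the paper may differ.

```python
# Hedged sketch of how text-to-audio retrieval is commonly scored with recall@k.
import torch

def recall_at_k(text_emb: torch.Tensor, audio_emb: torch.Tensor, k: int = 10) -> float:
    """Row i of text_emb is the query whose ground-truth match is row i of audio_emb."""
    sims = text_emb @ audio_emb.t()                        # (num_texts, num_audios) cosine scores
    topk = sims.topk(k, dim=-1).indices                    # indices of the k best audios per query
    targets = torch.arange(text_emb.size(0), device=text_emb.device).unsqueeze(1)
    hits = (topk == targets).any(dim=-1).float()           # 1 if the true audio is in the top k
    return hits.mean().item()
```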
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts MLLMs for universal multimodal embeddings
Uses prompts and in-context learning examples
Trains exclusively on text pairs
👥 Authors
Ruofan Hu (Zhejiang University, China)
Yan Xia (Zhejiang University, China)
Minjie Hong (Zhejiang University, China)
Jieming Zhu (Huawei Noah’s Ark Lab, China)
Bo Chen (Huawei Noah’s Ark Lab, China)
Xiaoda Yang (Zhejiang University, China)
Minghui Fang (Zhejiang University, China)
Tao Jin (Zhejiang University, China)