AI Summary
Existing audio-language models are typically task-specific, exhibiting limited generalization capability and poor zero-shot instruction-following performance. To address this, we propose the first frozen multimodal alignment framework that couples frozen speech and text foundation models via learnable lightweight adapters, augmented by Continuous Latent-space Instruction-tuning (CLIMT) for cross-task and cross-domain generalization. Our method integrates multi-task learning, curriculum learning, and latent-space alignment, without fine-tuning the backbone models. Evaluated on 11 speech understanding tasks, our approach outperforms specialized models on 9 tasks. It significantly enhances zero-shot performance, enabling instruction-driven inference on unseen tasks, novel prompts, and out-of-domain data. This work establishes a scalable, efficient, and highly generalizable unified paradigm for general-purpose audio-language understanding.
Abstract
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction-finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model's performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model even outperforms conventional task-specific baselines on 9 of the 11 tasks.