VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses catastrophic forgetting—significant degradation in text capabilities—when integrating speech functionality into large language models (LLMs). We propose the first single-stage, joint speech-text supervised fine-tuning (SFT) paradigm. Methodologically, we employ LoRA for parameter-efficient adaptation, design a unified speech-text tokenization scheme, and jointly train on heterogeneous multimodal supervision—including ASR, speech-to-text translation, spoken question answering, and textual SFT—under a unified objective. Evaluated on a 3B-parameter model, our approach surpasses 7B- and 13B-scale baselines on speech tasks while fully preserving original text-generation performance. It notably enhances multi-turn and mixed-modality (speech + text) interaction capabilities, demonstrating strong cross-modal generalization and zero-shot task adaptability. Our method achieves state-of-the-art results across multiple speech benchmarks.

📝 Abstract
Recent studies have augmented large language models (LLMs) with speech capabilities, leading to the development of speech language models (SpeechLMs). Earlier SpeechLMs focused on single-turn speech-based question answering (QA), where user input comprised a speech context and a text question. More recent studies have extended this to multi-turn conversations, though they often require complex, multi-stage supervised fine-tuning (SFT) with diverse data. Another critical challenge with SpeechLMs is catastrophic forgetting, where models optimized for speech tasks suffer significant degradation in text-only performance. To mitigate these issues, we propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the LLM backbone. Our joint SFT combines text-only SFT data with three types of speech-related data: speech recognition and translation, speech-based QA, and mixed-modal SFT. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks while preserving the original capabilities on text-only tasks. Furthermore, our model shows emergent abilities of effectively handling previously unseen prompts and tasks, including multi-turn, mixed-modal inputs.
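The single-stage recipe above folds four supervision types (text-only SFT, speech recognition and translation, speech-based QA, and mixed-modal SFT) into one training stream. A minimal sketch of that mixing step, with source names and sampling weights that are illustrative rather than taken from the paper:

```python
import random

# Hypothetical joint-SFT mixture. Source names mirror the four data types in
# the abstract; the pool sizes (and hence sampling weights) are made up.
SOURCES = {
    "text_sft":        [{"modality": "text",   "task": "sft"}] * 4,
    "asr_and_st":      [{"modality": "speech", "task": "recognition/translation"}] * 3,
    "speech_qa":       [{"modality": "speech", "task": "qa"}] * 2,
    "mixed_modal_sft": [{"modality": "mixed",  "task": "sft"}] * 1,
}

def sample_batch(batch_size, rng=None):
    """Draw one joint batch, weighting each source by its pool size,
    so every optimization step sees a blend of all four data types."""
    rng = rng or random.Random(0)
    names = list(SOURCES)
    weights = [len(SOURCES[n]) for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        example = rng.choice(SOURCES[name])
        batch.append({"source": name, **example})
    return batch

batch = sample_batch(8)
assert len(batch) == 8
assert all(ex["source"] in SOURCES for ex in batch)
```

Training on such blended batches in a single stage, rather than sequential stages per data type, is what the paper credits for avoiding catastrophic forgetting of text-only skills.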
Problem

Research questions and friction points this paper is trying to address.

Enhance LLMs with speech capabilities efficiently.
Address catastrophic forgetting in SpeechLMs.
Improve multi-turn, mixed-modal conversation handling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage joint speech-text fine-tuning.
Low-rank adaptation (LoRA) of the LLM backbone.
Combines diverse speech and text data.
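The LoRA idea named above keeps the backbone weights frozen and trains only a low-rank update. A numeric sketch of a single LoRA-adapted linear layer (dimensions, rank, and the `lora_linear` helper are illustrative, not from the paper):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16):
    """Frozen base projection W plus a trainable low-rank update B @ A,
    scaled by alpha / rank as in standard LoRA."""
    r = A.shape[0]
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
W = rng.normal(size=(d_out, d_in))   # frozen backbone weight
A = rng.normal(size=(r, d_in))       # trainable low-rank factor
B = np.zeros((d_out, r))             # zero-initialized, so the update starts at 0

x = rng.normal(size=(1, d_in))

# At initialization the adapted layer matches the frozen backbone exactly,
# and the trainable parameter count is a fraction of the full weight's.
assert np.allclose(lora_linear(x, W, A, B), x @ W.T)
assert A.size + B.size < W.size
```

Because only `A` and `B` receive gradients, the text capabilities stored in the frozen backbone are largely insulated from the speech-focused fine-tuning, which is the parameter-efficiency argument the summary makes.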