SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

📅 2026-01-14
🏛️ IEEE Journal on Selected Topics in Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited support for audio modalities (speech, general audio, and music) in existing open-source multimodal large language model (MLLM) frameworks, a gap that has hindered the advancement of audio-language models. To overcome this limitation, the authors propose a modular, customizable open-source MLLM framework that systematically integrates diverse encoders, projection layers, large language models, and parameter-efficient fine-tuning techniques tailored to audio modalities. The framework provides end-to-end training and inference recipes for mainstream tasks, including automatic speech recognition and audio/music captioning, and releases high-performance checkpoints that achieve state-of-the-art or competitive results on multiple benchmarks, significantly lowering the barrier to entry for researchers and fostering community collaboration. Several of the underlying techniques have already been adopted in peer-reviewed academic publications.

📝 Abstract
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend considerable effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have reached or are nearing state-of-the-art performance, and some of the underlying techniques have been adopted in published academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
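The abstract describes a modular composition of swappable encoders, projectors, and LLMs. The sketch below illustrates that architecture pattern in minimal Python; all class, method, and function names here are illustrative placeholders invented for this example, not SLAM-LLM's actual API.

```python
# Minimal sketch of the encoder -> projector -> LLM pipeline pattern an
# audio MLLM framework exposes. Every name here is a stand-in, not real API.

class ToyAudioEncoder:
    """Toy encoder: turns raw samples into fixed-width feature vectors."""
    def __init__(self, feat_dim: int):
        self.feat_dim = feat_dim

    def encode(self, audio: list[float]) -> list[list[float]]:
        # One "frame" per 4 samples; each frame becomes a feat_dim vector.
        frames = [audio[i:i + 4] for i in range(0, len(audio), 4)]
        return [[sum(f)] * self.feat_dim for f in frames]

class ToyProjector:
    """Maps encoder features into the LLM's embedding dimension."""
    def __init__(self, in_dim: int, out_dim: int):
        self.in_dim, self.out_dim = in_dim, out_dim

    def project(self, feats: list[list[float]]) -> list[list[float]]:
        # Pad or truncate each frame (stand-in for a learned linear layer).
        return [(f + [0.0] * self.out_dim)[: self.out_dim] for f in feats]

class ToyLLM:
    """Toy decoder that consumes projected audio embeddings plus a prompt."""
    def __init__(self, embed_dim: int):
        self.embed_dim = embed_dim

    def generate(self, embeds: list[list[float]], prompt: str) -> str:
        assert all(len(e) == self.embed_dim for e in embeds)
        return f"{prompt}: {len(embeds)} audio frames attended"

def build_pipeline(encoder, projector, llm):
    """Compose the three modules, checking that their dimensions agree."""
    assert projector.in_dim == encoder.feat_dim
    assert projector.out_dim == llm.embed_dim
    def run(audio: list[float], prompt: str) -> str:
        return llm.generate(projector.project(encoder.encode(audio)), prompt)
    return run

# Swapping any component (e.g. a different encoder) only requires that the
# adjacent dimensions still match -- the essence of a modular MLLM framework.
pipeline = build_pipeline(ToyAudioEncoder(8), ToyProjector(8, 16), ToyLLM(16))
print(pipeline([0.1] * 16, "Transcribe"))  # -> "Transcribe: 4 audio frames attended"
```

The dimension assertions in `build_pipeline` capture the one hard constraint in such a design: components are freely interchangeable as long as the projector bridges the encoder's output width to the LLM's embedding width.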
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Model
Speech Processing
Audio Processing
Music Processing
Open-Source Framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Audio-Language Processing
Modular Framework
Parameter-Efficient Fine-Tuning
Open-Source
Ziyang Ma
Shanghai Jiao Tong University
Speech and Language Processing, Textless NLP, Self-supervised Learning, Multimedia
Guanrou Yang
Shanghai Jiao Tong University
Wenxi Chen
Shanghai Jiao Tong University
Self-Supervised Learning, Deep Learning, Audio, Speech
Zhifu Gao
Tongyi Lab, Alibaba Group, Hangzhou, China
Yexing Du
Harbin Institute of Technology
Xiquan Li
Shanghai Jiao Tong University
Audio Understanding, Audio Generation, Large Language Models
Zhisheng Zheng
The University of Texas at Austin
Speech and Language Processing, Natural Language Processing, Multimodal Learning
Haina Zhu
Shanghai Jiao Tong University
Music Generation, Self-Supervised Learning, Deep Reinforcement Learning
Jianheng Zhuo
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Zheshu Song
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Ruiyang Xu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Tianrui Wang
Tianjin University, Tianjin, China
Yifan Yang
Shanghai Jiao Tong University, Tencent, Microsoft, Xiaomi
Spoken Language Processing
Yanqiao Zhu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis
Liumeng Xue
Hong Kong University of Science and Technology
Audio, Speech and Language Processing, Speech Generation
Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing
Rui Yuan
Unknown affiliation
Machine Learning, Deep Learning, Reinforcement Learning, Optimization
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information RetrievalMultimedia SystemsVisual Search
Kai Yu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
E. Chng
Nanyang Technological University, Singapore
Xie Chen
Shanghai Jiao Tong University (previously Microsoft; Cambridge University)
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing