SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

📅 2026-01-14
🏛️ IEEE Journal on Selected Topics in Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited support for audio modalities (speech, general audio, and music) in existing open-source multimodal large language model (MLLM) frameworks, a gap that has hindered the advancement of audio-language models. To overcome this limitation, the authors propose a modular, customizable open-source MLLM framework that systematically integrates diverse encoders, projection layers, large language models, and parameter-efficient fine-tuning techniques tailored to audio modalities. The framework provides end-to-end training and inference recipes for mainstream tasks, including automatic speech recognition and audio/music captioning, and releases high-performance checkpoints that achieve state-of-the-art or competitive results on multiple benchmarks, significantly lowering the barrier to entry for researchers and fostering community collaboration. Several of the underlying techniques have already been adopted in peer-reviewed academic publications.

📝 Abstract
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend considerable effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have reached or are nearing state-of-the-art performance, and some of the underlying techniques have been adopted in published academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
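The abstract describes a modular composition of swappable encoders, projectors, and LLMs. The sketch below illustrates that architecture pattern in minimal Python; all class, method, and function names here are illustrative placeholders invented for this example, not SLAM-LLM's actual API.

```python
# Minimal sketch of the encoder -> projector -> LLM pipeline pattern an
# audio MLLM framework exposes. Every name here is a stand-in, not real API.

class ToyAudioEncoder:
    """Toy encoder: turns raw samples into fixed-width feature vectors."""
    def __init__(self, feat_dim: int):
        self.feat_dim = feat_dim

    def encode(self, audio: list[float]) -> list[list[float]]:
        # One "frame" per 4 samples; each frame becomes a feat_dim vector.
        frames = [audio[i:i + 4] for i in range(0, len(audio), 4)]
        return [[sum(f)] * self.feat_dim for f in frames]

class ToyProjector:
    """Maps encoder features into the LLM's embedding dimension."""
    def __init__(self, in_dim: int, out_dim: int):
        self.in_dim, self.out_dim = in_dim, out_dim

    def project(self, feats: list[list[float]]) -> list[list[float]]:
        # Pad or truncate each frame (stand-in for a learned linear layer).
        return [(f + [0.0] * self.out_dim)[: self.out_dim] for f in feats]

class ToyLLM:
    """Toy decoder that consumes projected audio embeddings plus a prompt."""
    def __init__(self, embed_dim: int):
        self.embed_dim = embed_dim

    def generate(self, embeds: list[list[float]], prompt: str) -> str:
        assert all(len(e) == self.embed_dim for e in embeds)
        return f"{prompt}: {len(embeds)} audio frames attended"

def build_pipeline(encoder, projector, llm):
    """Compose the three modules, checking that their dimensions agree."""
    assert projector.in_dim == encoder.feat_dim
    assert projector.out_dim == llm.embed_dim
    def run(audio: list[float], prompt: str) -> str:
        return llm.generate(projector.project(encoder.encode(audio)), prompt)
    return run

# Swapping any component (e.g. a different encoder) only requires that the
# adjacent dimensions still match -- the essence of a modular MLLM framework.
pipeline = build_pipeline(ToyAudioEncoder(8), ToyProjector(8, 16), ToyLLM(16))
print(pipeline([0.1] * 16, "Transcribe"))  # -> "Transcribe: 4 audio frames attended"
```

The dimension assertions in `build_pipeline` capture the one hard constraint in such a design: components are freely interchangeable as long as the projector bridges the encoder's output width to the LLM's embedding width.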
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Model
Speech Processing
Audio Processing
Music Processing
Open-Source Framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Audio-Language Processing
Modular Framework
Parameter-Efficient Fine-Tuning
Open-Source
Ziyang Ma
Shanghai Jiao Tong University
Speech and Language Processing, Textless NLP, Self-supervised Learning, Multimedia
Guanrou Yang
Shanghai Jiao Tong University
Wenxi Chen
Shanghai Jiao Tong University
Self-Supervised Learning, Deep Learning, Audio, Speech
Zhifu Gao
Tongyi Lab, Alibaba Group, Hangzhou, China
Yexing Du
Harbin Institute of Technology
Xiquan Li
Shanghai Jiao Tong University
Audio Understanding, Audio Generation, Large Language Models
Zhisheng Zheng
The University of Texas at Austin
Speech and Language Processing, Natural Language Processing, Multimodal Learning
Haina Zhu
Shanghai Jiao Tong University
Music Generation, Self-Supervised Learning, Deep Reinforcement Learning
Jianheng Zhuo
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Zheshu Song
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Ruiyang Xu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Tianrui Wang
Tianjin University, Tianjin, China
Yifan Yang
Shanghai Jiao Tong University, Tencent, Microsoft, Xiaomi
Spoken Language Processing
Yanqiao Zhu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
Zhikang Niu
Shanghai Jiao Tong University
Speech Synthesis
Liumeng Xue
Hong Kong University of Science and Technology
Audio, Speech and Language Processing, Speech Generation
Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing
Rui Yuan
Unknown affiliation
Machine Learning, Deep Learning, Reinforcement Learning, Optimization
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information RetrievalMultimedia SystemsVisual Search
Kai Yu
X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence Shanghai Jiao Tong University, Shanghai, China
E. Chng
Nanyang Technological University, Singapore
Xie Chen
Shanghai Jiao Tong University (previously Microsoft; Cambridge University)
Machine Learning, Speech Recognition, Speech Synthesis, Speech & Audio Processing