DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing music-oriented large language models (LLMs) are predominantly limited to text-audio bimodality, overlooking the potential of visual modalities, such as images and videos, to enhance music understanding. Method: The authors propose DeepResonance, a music-centric multimodal LLM that integrates music, text, image, and video signals. The approach comprises: (1) constructing Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way aligned music training and evaluation datasets; (2) designing a pre-alignment Transformer that fuses modalities before they enter the text LLM; (3) incorporating multi-sampled ImageBind embeddings for more robust cross-modal representations; and (4) fine-tuning via a multi-way instruction-tuning paradigm that jointly exploits visual content and textual music features. Contribution/Results: DeepResonance achieves state-of-the-art performance across six music understanding tasks. The results indicate that structured multimodal alignment, particularly the integration of visual modalities, yields significant gains in music semantic modeling, supporting both the architectural design and the data curation strategy.
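As a rough illustration of the architecture sketched above, here is one plausible form of a pre-alignment Transformer that fuses multi-sampled ImageBind embeddings into soft prompts for a text LLM. This is a minimal sketch: all class names, dimensions, and layer counts are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PreAlignmentTransformer(nn.Module):
    """Hypothetical pre-alignment module: lets multi-sampled ImageBind
    embeddings from different modalities attend to each other before
    being projected into the text LLM's embedding space."""

    def __init__(self, embed_dim=1024, llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learned tokens marking each modality (music, image, video, text features).
        self.modality_embed = nn.Embedding(4, embed_dim)
        # Project fused vectors into the LLM's hidden size (assumed 4096 here).
        self.proj = nn.Linear(embed_dim, llm_dim)

    def forward(self, embeddings, modality_ids):
        # embeddings:   (batch, num_samples, embed_dim) ImageBind vectors
        # modality_ids: (batch, num_samples) integer modality labels in [0, 3]
        x = embeddings + self.modality_embed(modality_ids)
        x = self.encoder(x)   # cross-modal fusion before the LLM
        return self.proj(x)   # soft prompts prepended to the LLM's text input
```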

📝 Abstract
Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements have primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos, and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-alignment Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performance across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets.
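A concrete way to read "4-way aligned": each training example pairs one piece of music with visual context, textual music features, and an instruction-response pair. The record below is a hypothetical illustration of such a sample; all field names and values are assumptions, not the released dataset schema.

```python
# Hypothetical record layout for a Music4way-style multi-way instruction sample.
# Field names and values are illustrative; the released datasets may differ.
sample = {
    "task": "MI2T",  # music + image -> text (MV2T and Any2T are analogous)
    "music": "clips/track_0421.wav",
    "image": "frames/track_0421.jpg",
    "video": None,  # unused for MI2T; populated for MV2T/Any2T
    "music_features": "tempo: 120 bpm; key: A minor; mood: melancholic",
    "instruction": "Describe how the image's scene matches the music's mood.",
    "response": "The rainy street scene mirrors the slow, minor-key piano line...",
}
```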
Problem

Research questions and friction points this paper is trying to address.

Enhance multimodal music understanding
Incorporate images, videos, and textual music features
Improve music LLMs with multi-way instruction tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Music-centric multimodal LLM spanning music, text, image, and video
Multi-way instruction tuning on 4-way aligned data
Multi-sampled ImageBind embeddings and a pre-alignment Transformer for modality fusion (see the sketch below)
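For the ImageBind piece, here is a minimal sketch of what multi-sampled embedding extraction could look like, assuming the public facebookresearch/ImageBind interface: several clips of the same track are encoded separately and all vectors are kept, rather than pooling them into a single embedding. The clip filenames are placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

# "Multi-sampled": encode several segments of one track instead of one clip.
audio_clips = ["song_0-10s.wav", "song_10-20s.wav", "song_20-30s.wav"]
with torch.no_grad():
    embeds = model({
        ModalityType.AUDIO: data.load_and_transform_audio_data(audio_clips, device)
    })[ModalityType.AUDIO]  # shape: (num_clips, 1024)
# The per-clip vectors form the token sequence fed to the pre-alignment Transformer.
```

Keeping per-sample vectors gives the fusion module a sequence to attend over, which is presumably what makes the pre-alignment step useful compared with a single pooled embedding.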