MokA: Multimodal Low-Rank Adaptation for MLLMs

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing efficient multimodal fine-tuning methods largely adapt techniques designed for large language models (LLMs), neglecting modality heterogeneity and cross-modal interaction requirements, which leads to suboptimal modality utilization. To address this, the paper proposes MokA, a low-rank adaptation framework that jointly handles unimodal adaptation and cross-modal alignment. Specifically: (1) modality-specific low-rank parameters compress unimodal information; (2) an explicit cross-modal interaction mechanism improves the consistency of cross-modal representations. The method is fully compatible with the LoRA paradigm and supports mainstream multimodal LLM backbones, including LLaMA2/3, Qwen2, and Qwen2.5-VL. Extensive experiments across audio-visual-text, visual-text, and speech-text tasks demonstrate consistent and significant improvements over fine-tuning strategies borrowed directly from LLMs.
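To make the adaptation scheme concrete, below is a minimal PyTorch sketch of what a MokA-style layer could look like. The class name `MokALayer`, the choice of text-query attention as the cross-modal interaction step, and all shapes are illustrative assumptions, not the authors' released implementation; only the overall recipe (modality-specific A matrices, an explicit interaction step, a shared B matrix) follows the summary above.

```python
# Minimal sketch of a MokA-style adapter layer (illustrative, not official code).
from typing import Dict

import torch
import torch.nn as nn


class MokALayer(nn.Module):
    """LoRA-style update with modality-specific down-projections (A_m),
    an explicit cross-modal interaction step, and a shared up-projection (B)."""

    def __init__(self, d_model: int, rank: int,
                 modalities=("audio", "visual", "text")):
        super().__init__()
        # One low-rank A matrix per modality: compresses unimodal tokens.
        self.A = nn.ModuleDict({
            m: nn.Linear(d_model, rank, bias=False) for m in modalities
        })
        # Cross-modal interaction in the low-rank space; sketched here as
        # text-query attention over all modality tokens (one plausible choice).
        self.cross_attn = nn.MultiheadAttention(rank, num_heads=1,
                                                batch_first=True)
        # Shared B matrix, zero-initialized so training starts from the
        # frozen backbone's behavior, as in standard LoRA.
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, tokens: Dict[str, torch.Tensor]) -> torch.Tensor:
        # tokens: modality name -> (batch, seq_m, d_model); must include "text".
        # 1) Unimodal compression with modality-specific A matrices.
        low = {m: self.A[m](x) for m, x in tokens.items()}
        # 2) Explicit cross-modal interaction: text tokens attend to the
        #    concatenation of all compressed modality tokens.
        ctx = torch.cat(list(low.values()), dim=1)
        fused, _ = self.cross_attn(low["text"], ctx, ctx)
        # 3) Shared up-projection; the result is added to the frozen
        #    layer's output for the text positions.
        return self.B(fused + low["text"])
```

Zero-initializing B mirrors standard LoRA practice: at the start of fine-tuning the adapted model exactly reproduces the frozen backbone, and only the low-rank path learns a deviation.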

📝 Abstract
In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.
Problem

Research questions and friction points this paper is trying to address.

Current efficient multimodal fine-tuning methods are borrowed directly from LLMs and neglect the intrinsic differences of multimodal scenarios
Both unimodal adaptation and cross-modal adaptation are essential for effective MLLM fine-tuning, yet existing methods do not address them jointly
How to ensure both forms of adaptation with modality-specific parameters is the question MokA is designed to answer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-specific parameters compress unimodal information
Explicitly enhances cross-modal interaction
Multimodal-aware efficient fine-tuning strategy, fully compatible with the LoRA paradigm (see the usage sketch after this list)
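A hypothetical usage sketch, reusing the `MokALayer` defined under the AI Summary above, showing how the low-rank update would compose with a frozen backbone layer in LoRA fashion. The hidden size, sequence lengths, and the frozen-output stand-in are all illustrative assumptions.

```python
# Hypothetical usage: add the MokA-style update on top of a frozen projection,
# mirroring how LoRA composes W x + B A x. All shapes are illustrative.
import torch

layer = MokALayer(d_model=4096, rank=8)
tokens = {
    "audio": torch.randn(2, 50, 4096),
    "visual": torch.randn(2, 196, 4096),
    "text": torch.randn(2, 32, 4096),
}
frozen_out = tokens["text"]           # stand-in for the frozen layer's output
adapted = frozen_out + layer(tokens)  # only A_m, cross-attn, and B are trained
print(adapted.shape)                  # torch.Size([2, 32, 4096])
```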
👥 Authors
Yake Wei (Renmin University of China): multimodal learning
Yu Miao (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
Dongzhan Zhou (Researcher at Shanghai AI Lab): AI4Science, computer vision, deep learning
Di Hu (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)