MokA: Multimodal Low-Rank Adaptation for MLLMs

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing efficient multimodal fine-tuning methods largely adapt techniques designed for large language models (LLMs), neglecting modality heterogeneity and cross-modal interaction requirements, which leads to suboptimal modality utilization. To address this, the paper proposes MokA, a low-rank adaptation framework that jointly handles unimodal adaptation and cross-modal alignment. Specifically: (1) modality-specific low-rank parameters compress unimodal information; (2) an explicit cross-modal interaction mechanism improves the consistency of cross-modal representations. The method is fully compatible with the LoRA paradigm and supports mainstream multimodal LLM backbones, including LLaMA2/3, Qwen2, and Qwen2.5-VL. Extensive experiments across audio-visual-text, visual-text, and speech-text tasks demonstrate consistent and significant improvements over fine-tuning strategies borrowed directly from LLMs.
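To make the adaptation scheme concrete, below is a minimal PyTorch sketch of what a MokA-style layer could look like. The class name `MokALayer`, the choice of text-query attention as the cross-modal interaction step, and all shapes are illustrative assumptions, not the authors' released implementation; only the overall recipe (modality-specific A matrices, an explicit interaction step, a shared B matrix) follows the summary above.

```python
# Minimal sketch of a MokA-style adapter layer (illustrative, not official code).
from typing import Dict

import torch
import torch.nn as nn


class MokALayer(nn.Module):
    """LoRA-style update with modality-specific down-projections (A_m),
    an explicit cross-modal interaction step, and a shared up-projection (B)."""

    def __init__(self, d_model: int, rank: int,
                 modalities=("audio", "visual", "text")):
        super().__init__()
        # One low-rank A matrix per modality: compresses unimodal tokens.
        self.A = nn.ModuleDict({
            m: nn.Linear(d_model, rank, bias=False) for m in modalities
        })
        # Cross-modal interaction in the low-rank space; sketched here as
        # text-query attention over all modality tokens (one plausible choice).
        self.cross_attn = nn.MultiheadAttention(rank, num_heads=1,
                                                batch_first=True)
        # Shared B matrix, zero-initialized so training starts from the
        # frozen backbone's behavior, as in standard LoRA.
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, tokens: Dict[str, torch.Tensor]) -> torch.Tensor:
        # tokens: modality name -> (batch, seq_m, d_model); must include "text".
        # 1) Unimodal compression with modality-specific A matrices.
        low = {m: self.A[m](x) for m, x in tokens.items()}
        # 2) Explicit cross-modal interaction: text tokens attend to the
        #    concatenation of all compressed modality tokens.
        ctx = torch.cat(list(low.values()), dim=1)
        fused, _ = self.cross_attn(low["text"], ctx, ctx)
        # 3) Shared up-projection; the result is added to the frozen
        #    layer's output for the text positions.
        return self.B(fused + low["text"])
```

Zero-initializing B mirrors standard LoRA practice: at the start of fine-tuning the adapted model exactly reproduces the frozen backbone, and only the low-rank path learns a deviation.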

📝 Abstract
In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at https://gewu-lab.github.io/MokA.
Problem

Research questions and friction points this paper is trying to address.

Current efficient multimodal fine-tuning methods are borrowed directly from LLMs and neglect the intrinsic differences of multimodal scenarios
Both unimodal adaptation and cross-modal adaptation are essential for effective MLLM fine-tuning, yet existing methods do not address them jointly
How to ensure both forms of adaptation with modality-specific parameters is the question MokA is designed to answer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-specific parameters compress unimodal information
Explicitly enhances cross-modal interaction
Multimodal-aware efficient fine-tuning strategy, fully compatible with the LoRA paradigm (see the usage sketch after this list)
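A hypothetical usage sketch, reusing the `MokALayer` defined under the AI Summary above, showing how the low-rank update would compose with a frozen backbone layer in LoRA fashion. The hidden size, sequence lengths, and the frozen-output stand-in are all illustrative assumptions.

```python
# Hypothetical usage: add the MokA-style update on top of a frozen projection,
# mirroring how LoRA composes W x + B A x. All shapes are illustrative.
import torch

layer = MokALayer(d_model=4096, rank=8)
tokens = {
    "audio": torch.randn(2, 50, 4096),
    "visual": torch.randn(2, 196, 4096),
    "text": torch.randn(2, 32, 4096),
}
frozen_out = tokens["text"]           # stand-in for the frozen layer's output
adapted = frozen_out + layer(tokens)  # only A_m, cross-attn, and B are trained
print(adapted.shape)                  # torch.Size([2, 32, 4096])
```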
👥 Authors
Yake Wei (Renmin University of China): multimodal learning
Yu Miao (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)
Dongzhan Zhou (Researcher at Shanghai AI Lab): AI4Science, computer vision, deep learning
Di Hu (Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE)