🤖 AI Summary
Existing fine-tuning methods for multimodal large language models (MLLMs) neglect modality heterogeneity, which hinders effective cross-modal representation alignment. To address this, we propose MuNG, a noise-injection fine-tuning framework based on variational inference. MuNG introduces a lightweight multimodal noise generator that dynamically models image-text relationships and injects task-adaptive noise while keeping the backbone model frozen. It adds only 1–2% extra parameters yet consistently outperforms full fine-tuning across diverse benchmarks. By reformulating MLLM inference through probabilistic modeling, MuNG explicitly mitigates cross-modal heterogeneity. Extensive experiments on Qwen-VL and LLaVA demonstrate that MuNG achieves state-of-the-art performance on multiple downstream tasks, including visual question answering, captioning, and reasoning, while maintaining minimal parameter overhead. Our approach establishes a new paradigm for efficient and robust multimodal fine-tuning.
📝 Abstract
Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their full potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning while adding minimal parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, Qwen-VL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches while requiring only about $1\sim2\%$ additional parameters. The relevant code is provided in the supplementary material.
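The general idea of a learned noise generator feeding a frozen model can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the generator architecture, where the noise is injected, or the training objective. The sketch assumes a Gaussian noise distribution sampled with the reparameterization trick, additive injection into a visual feature vector, and a KL regularizer toward a standard normal, all of which are common choices in variational inference. The names `NoiseGenerator`, `inject`, and `kl_to_standard_normal` are hypothetical, and only the Python standard library is used.

```python
import math
import random


def init_linear(n_in, n_out, rng, scale=0.1):
    """Create a small random weight matrix (list of rows) and a zero bias."""
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(x, w, b):
    """Compute y = W x + b for a weight matrix stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class NoiseGenerator:
    """Hypothetical trainable module (an assumption, not the paper's design).

    Maps a concatenated image/text feature vector to the mean and
    log-variance of a diagonal Gaussian, then samples noise from it.
    """

    def __init__(self, dim, rng):
        self.rng = rng
        self.mu_w, self.mu_b = init_linear(2 * dim, dim, rng)
        self.logvar_w, self.logvar_b = init_linear(2 * dim, dim, rng)

    def __call__(self, image_feat, text_feat):
        joint = image_feat + text_feat  # list concatenation of both modalities
        mu = linear(joint, self.mu_w, self.mu_b)
        logvar = linear(joint, self.logvar_w, self.logvar_b)
        # Reparameterization trick: noise = mu + sigma * eps, eps ~ N(0, 1).
        noise = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                 for m, lv in zip(mu, logvar)]
        return noise, mu, logvar

    def num_params(self):
        """Count the generator's weights and biases."""
        weights = sum(len(row) for row in self.mu_w + self.logvar_w)
        return weights + len(self.mu_b) + len(self.logvar_b)


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ) summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def inject(frozen_feat, noise):
    """Add noise to a feature of the frozen model (assumed additive injection)."""
    return [h + n for h, n in zip(frozen_feat, noise)]


rng = random.Random(0)
dim = 4
gen = NoiseGenerator(dim, rng)
image_feat = [0.5, -0.2, 0.1, 0.9]   # toy stand-in for a frozen visual feature
text_feat = [0.3, 0.0, -0.4, 0.2]    # toy stand-in for a frozen text feature
noise, mu, logvar = gen(image_feat, text_feat)
noisy_feat = inject(image_feat, noise)
kl = kl_to_standard_normal(mu, logvar)
print(len(noisy_feat), gen.num_params())
```

In this sketch only the generator's weights would be trained, while the features come from a model whose parameters stay fixed. That is the sense in which a noise-injection approach can keep the trainable parameter count small relative to the backbone.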