Explore How to Inject Beneficial Noise in MLLMs

📅 2025-11-16

🤖 AI Summary

Existing fine-tuning methods for multimodal large language models (MLLMs) neglect modality heterogeneity, which hinders effective cross-modal representation alignment. To address this, the authors propose MuNG, a noise-injection fine-tuning framework based on variational inference. MuNG introduces a lightweight multimodal noise generator that dynamically models image-text relationships and injects task-adaptive noise while the backbone model stays frozen. It adds only about 1–2% extra parameters, yet the authors report that it outperforms full fine-tuning. By reformulating MLLM inference from a probabilistic perspective, MuNG explicitly targets cross-modal heterogeneity. Experiments on Qwen-VL and LLaVA are reported to show state-of-the-art performance on multiple downstream tasks, including visual question answering, captioning, and reasoning, with minimal parameter overhead. The authors present the approach as a new paradigm for robust, parameter-efficient multimodal fine-tuning.

📝 Abstract

Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their full potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, Qwen-VL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches while tuning only about $1\sim2\%$ additional parameters. The code is provided in the supplementary material.
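The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of the general idea rather than the authors' implementation. It assumes a small generator that conditions on both image and text features, predicts a per-token Gaussian mean and log-variance, samples noise with the reparameterization trick, and adds that noise to the image tokens of a frozen backbone. The class name `NoiseGenerator`, the function `inject`, the layer sizes, and the choice of additive injection on image tokens are all hypothetical.

```python
import numpy as np


class NoiseGenerator:
    """Hypothetical lightweight noise generator (not the paper's architecture).

    Maps each image token, paired with a pooled text context, to the mean and
    log-variance of a Gaussian from which the injected noise is sampled.
    """

    def __init__(self, dim, hidden, rng):
        # These three small matrices are the only trainable parameters;
        # the backbone that produces the tokens is assumed to stay frozen.
        self.w_in = rng.normal(0.0, 0.02, size=(2 * dim, hidden))
        self.w_mu = rng.normal(0.0, 0.02, size=(hidden, dim))
        self.w_logvar = rng.normal(0.0, 0.02, size=(hidden, dim))

    def num_params(self):
        return self.w_in.size + self.w_mu.size + self.w_logvar.size

    def __call__(self, image_tokens, text_tokens, rng):
        # image_tokens: (n_img, dim), text_tokens: (n_txt, dim)
        text_ctx = text_tokens.mean(axis=0)  # pooled text context, shape (dim,)
        paired = np.concatenate(
            [image_tokens, np.broadcast_to(text_ctx, image_tokens.shape)],
            axis=1,
        )  # (n_img, 2 * dim): each image token sees the text context
        h = np.tanh(paired @ self.w_in)
        mu = h @ self.w_mu
        logvar = h @ self.w_logvar
        # Reparameterization trick: noise = mu + sigma * eps, eps ~ N(0, I)
        eps = rng.standard_normal(mu.shape)
        noise = mu + np.exp(0.5 * logvar) * eps
        return noise, mu, logvar


def inject(image_tokens, noise):
    """Additive injection into frozen-backbone features (an assumption)."""
    return image_tokens + noise


# Usage with random stand-ins for frozen encoder outputs.
rng = np.random.default_rng(0)
generator = NoiseGenerator(dim=16, hidden=4, rng=rng)
image_tokens = rng.standard_normal((5, 16))
text_tokens = rng.standard_normal((3, 16))
noise, mu, logvar = generator(image_tokens, text_tokens, rng)
noisy_tokens = inject(image_tokens, noise)
print(noisy_tokens.shape, generator.num_params())
```

In a real setup only the generator's weights would receive gradients from the downstream task loss, which is what keeps the trainable parameter count small relative to the frozen model.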

Problem

Research questions and friction points this paper is trying to address.

Improving multimodal fine-tuning efficiency by injecting beneficial noise

Addressing cross-modal heterogeneity in MLLMs through noise injection

Enhancing representation alignment with minimal additional parameter adjustments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting beneficial random noise for fine-tuning

Using a Multimodal Noise Generator (MuNG) for cross-modal alignment

Dynamically analyzing image-text pairs for task-adaptive noise
