Explore How to Inject Beneficial Noise in MLLMs

📅 2025-11-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing fine-tuning methods for multimodal large language models (MLLMs) neglect modality heterogeneity, which hinders cross-modal representation alignment. To address this, we propose MuNG, a noise-injection fine-tuning framework based on variational inference. MuNG introduces a lightweight multimodal noise generator that dynamically models image-text relationships and injects task-adaptive noise while the backbone model stays frozen. It adds only about 1–2% extra parameters, yet outperforms full fine-tuning. By reformulating MLLM inference through probabilistic modeling, MuNG targets cross-modal heterogeneity explicitly. Experiments on Qwen-VL and LLaVA show that MuNG surpasses full-parameter fine-tuning and other existing fine-tuning approaches on downstream tasks, including visual question answering, captioning, and reasoning, while keeping parameter overhead minimal. Our approach offers a robust, parameter-efficient route to multimodal fine-tuning.
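To make the "1–2% extra parameters" figure concrete, the short sketch below computes the overhead ratio for a frozen backbone plus a small trainable generator. The 7-billion-parameter backbone and the generator sizes are hypothetical round numbers chosen for illustration, not values reported for MuNG, and measuring the overhead relative to the backbone size is also an assumption.

```python
def extra_param_fraction(backbone_params: int, generator_params: int) -> float:
    """Return the added (trainable) parameters as a fraction of the frozen backbone."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return generator_params / backbone_params


# Hypothetical sizes for illustration only (not taken from the paper).
HYPOTHETICAL_BACKBONE = 7_000_000_000

for generator_size in (70_000_000, 140_000_000):
    fraction = extra_param_fraction(HYPOTHETICAL_BACKBONE, generator_size)
    print(f"{generator_size:,} extra parameters -> {fraction:.1%} of the backbone")
```

Under these assumed sizes, a generator in the range of 70 to 140 million parameters would correspond to the quoted 1–2% overhead.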

📝 Abstract
Multimodal Large Language Models (MLLMs) play an increasingly important role in multimodal intelligence. However, existing fine-tuning methods often ignore cross-modal heterogeneity, which limits their full potential. In this work, we propose a novel fine-tuning strategy that injects beneficial random noise; it outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, Qwen-VL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about $1\sim2\%$ additional parameters. The code is provided in the supplementary material.
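The abstract does not spell out the generator's architecture or training objective, so the following is only a minimal sketch of the general idea under stated assumptions: a small generator looks at an image feature and a text feature, predicts the mean and log-variance of a Gaussian, samples noise with the reparameterization trick, and adds that noise to a feature produced by a frozen backbone. The class `ToyNoiseGenerator`, the helper names, the feature vectors, and the KL regularizer toward a standard normal are all hypothetical illustrations of a common variational setup, not the MuNG implementation.

```python
import math
import random


def linear(x, weights, bias):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyNoiseGenerator:
    """Hypothetical generator mapping an image/text feature pair to Gaussian noise."""

    def __init__(self, dim, rng):
        self.dim = dim
        self.rng = rng
        in_dim = 2 * dim  # image and text features are concatenated

        def init_matrix():
            return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(dim)]

        # These small matrices stand in for the only trainable parameters.
        self.w_mu, self.b_mu = init_matrix(), [0.0] * dim
        self.w_logvar, self.b_logvar = init_matrix(), [0.0] * dim

    def __call__(self, image_feat, text_feat):
        fused = image_feat + text_feat  # list concatenation
        mu = linear(fused, self.w_mu, self.b_mu)
        logvar = linear(fused, self.w_logvar, self.b_logvar)
        # Reparameterization trick: noise = mu + sigma * eps, with eps ~ N(0, 1).
        eps = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        noise = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
        return noise, mu, logvar


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def inject(frozen_feat, noise):
    """Add noise to a frozen-backbone feature without modifying the original list."""
    return [f + n for f, n in zip(frozen_feat, noise)]


rng = random.Random(0)
generator = ToyNoiseGenerator(dim=4, rng=rng)
image_feat = [0.2, -0.1, 0.5, 0.0]  # made-up stand-in for a frozen visual feature
text_feat = [0.1, 0.3, -0.2, 0.4]   # made-up stand-in for a frozen text feature

noise, mu, logvar = generator(image_feat, text_feat)
noisy_image_feat = inject(image_feat, noise)
kl_penalty = kl_to_standard_normal(mu, logvar)
```

In a real training loop, the backbone weights would stay fixed and only the generator's parameters would receive gradients from the task loss; whether a KL term of this form is part of the paper's objective is not stated in the abstract.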
Problem

Research questions and friction points this paper is trying to address.

Injecting beneficial noise improves multimodal fine-tuning efficiency
Addressing cross-modal heterogeneity in MLLMs through noise injection
Enhancing representation alignment with minimal additional parameter adjustments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting beneficial random noise for fine-tuning
Using Multimodal Noise Generator for cross-modal alignment
Dynamically analyzing image-text pairs for task-adaptive noise