Directional Reasoning Injection for Fine-Tuning MLLMs

📅 2025-10-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit notably weaker reasoning capabilities compared to their unimodal text-only counterparts, and existing enhancement methods rely on large-scale multimodal reasoning datasets or computationally expensive reinforcement learning. Method: We propose DRIFT, a low-resource, data-free reasoning enhancement framework that transfers reasoning knowledge from pretrained text-only LMs to MLLMs. DRIFT models such knowledge as a directional prior in parameter space and guides gradient updates during supervised fine-tuning of MLLMs, enabling stable reasoning capability transfer while preserving modality alignment. It integrates model merging principles with lightweight fine-tuning, requiring only precomputed reasoning priors. Contribution/Results: On benchmarks including MathVista and MathVerse, DRIFT significantly outperforms naive model merging and standard supervised fine-tuning, matching the performance of high-cost training methods while incurring minimal memory and computational overhead.

πŸ“ Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lag behind text-only models in reasoning capability
Existing reasoning enhancement methods are resource-intensive and unstable
Naive model merging causes performance degradation in certain architectures
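The naive model merging the paper critiques amounts to linearly interpolating the weights of the reasoning-tuned LM and the multimodal variant. A minimal sketch of that baseline, where the function name, the dict-of-arrays parameter representation, and the default merge weight `alpha=0.5` are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

def naive_merge(mllm_params, reasoning_params, alpha=0.5):
    # Linear parameter interpolation between the MLLM backbone and the
    # reasoning-tuned text LM. `alpha` is an illustrative merge weight.
    return {
        k: (1.0 - alpha) * mllm_params[k] + alpha * reasoning_params[k]
        for k in mllm_params
    }

# Toy parameters standing in for two model checkpoints.
mllm = {"w": np.array([0.0, 2.0])}
reasoning = {"w": np.array([2.0, 0.0])}
merged = naive_merge(mllm, reasoning, alpha=0.5)  # midpoint of the two
```

Because the interpolation ignores how the multimodal alignment was learned, it can land in a poor region of parameter space for some families (e.g., Qwen), which is the degradation the paper observes.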
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transfers reasoning knowledge in gradient space
Uses parameter difference as reasoning prior
Biases gradients during multimodal fine-tuning
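The three steps above can be sketched as a single toy update. The dict-of-arrays representation, the linear mixing weight `lam`, and all function names are illustrative assumptions; the paper's exact biasing rule may differ:

```python
import numpy as np

def compute_reasoning_prior(reasoning_params, mllm_params):
    # Reasoning prior: parameter-space difference between the
    # reasoning-tuned text LM and the MLLM backbone.
    return {k: reasoning_params[k] - mllm_params[k] for k in mllm_params}

def drift_step(params, grads, prior, lr=0.1, lam=0.2):
    # One SFT update whose gradient is biased toward the reasoning prior.
    # Subtracting `lam * prior` from the gradient means the descent step
    # moves the parameters toward the reasoning-tuned variant.
    new_params = {}
    for k, theta in params.items():
        biased_grad = (1.0 - lam) * grads[k] - lam * prior[k]
        new_params[k] = theta - lr * biased_grad
    return new_params

# Toy checkpoints standing in for the two models.
theta_mm = {"w": np.array([0.0, 0.0])}      # multimodal backbone
theta_reason = {"w": np.array([1.0, 1.0])}  # reasoning-tuned text LM
prior = compute_reasoning_prior(theta_reason, theta_mm)

grads = {"w": np.array([0.5, -0.5])}        # gradient from a multimodal SFT batch
theta_new = drift_step(theta_mm, grads, prior, lr=1.0, lam=0.2)
```

In this toy run the biased step lands closer to the reasoning-tuned weights than an unbiased SGD step would, while still descending on the multimodal batch gradient, which mirrors the claimed effect: reasoning transfer without abandoning the multimodal fine-tuning signal.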