🤖 AI Summary
Multimodal large language models (MLLMs) exhibit notably weaker reasoning capabilities than their unimodal text-only counterparts, and existing enhancement methods rely on large-scale multimodal reasoning datasets or computationally expensive reinforcement learning. Method: We propose DRIFT, a low-resource, data-free framework that transfers reasoning knowledge from pretrained text-only LLMs to MLLMs. DRIFT models this knowledge as a directional prior in parameter space and uses it to guide gradient updates during supervised fine-tuning of MLLMs, enabling stable reasoning transfer while preserving modality alignment. It combines model-merging principles with lightweight fine-tuning and requires only a precomputed reasoning prior. Contribution/Results: On benchmarks including MathVista and MathVerse, DRIFT significantly outperforms naive model merging and standard supervised fine-tuning, matching the performance of high-cost training methods while incurring minimal memory and computational overhead.
📝 Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
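The core mechanism, a parameter-space difference used as a directional prior that biases gradients during fine-tuning, can be sketched in a few lines. This is a minimal toy illustration under simplifying assumptions (plain SGD, a single flattened parameter vector, a fixed bias strength `beta`); the function names, coefficient, and toy values are illustrative and not taken from the paper.

```python
import numpy as np

def reasoning_prior(theta_reasoning, theta_multimodal):
    """Directional prior: parameter-space difference between the
    reasoning-enhanced text-only model and the multimodal variant."""
    return theta_reasoning - theta_multimodal

def drift_update(theta, grad, prior, lr=0.1, beta=0.5):
    """One SGD step with the task gradient biased toward the reasoning
    direction; beta (illustrative) controls the strength of the prior."""
    biased_grad = grad - beta * prior  # subtracting the prior nudges the step toward theta_reasoning
    return theta - lr * biased_grad

# Toy 4-parameter example (values are made up).
theta_mm = np.array([0.0, 1.0, 2.0, 3.0])   # multimodal model weights
theta_rs = np.array([1.0, 1.0, 1.0, 4.0])   # reasoning-model weights
prior = reasoning_prior(theta_rs, theta_mm)

grad = np.array([0.2, -0.1, 0.0, 0.3])      # gradient from a multimodal SFT batch
theta_new = drift_update(theta_mm, grad, prior, lr=0.1, beta=0.5)
```

Because the prior is precomputed once from two existing checkpoints, this adds only an elementwise operation per step to a standard supervised fine-tuning loop, which is consistent with the low-overhead claim above.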