🤖 AI Summary
This work proposes a subspace-level fine-grained fusion mechanism to better integrate large reasoning models with vision-language models, addressing a limitation of coarse-grained approaches: they struggle to inject strong reasoning capabilities while preserving the original visual perception abilities. By applying singular value decomposition to task vectors extracted from the reasoning model, the method learns adaptive scaling coefficients that dynamically modulate the contribution of each subspace. A label-free self-distillation strategy additionally enables dual-objective optimization. This approach avoids the performance trade-offs inherent in conventional layer-wise fusion schemes, significantly enhancing reasoning capacity without compromising foundational visual understanding and achieving state-of-the-art results across multiple visual reasoning benchmarks.
📝 Abstract
Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often forces a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and learns an adaptive scaling coefficient for each subspace, realizing fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization that uses common vision-language perception datasets. Extensive experiments demonstrate that FRISM improves reasoning capabilities without compromising the model's original visual capabilities, consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
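The core mechanics described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a task vector is the weight delta between the LRM and the shared base model, decomposes it with SVD into rank-1 subspaces, and rescales each subspace by a coefficient `lam[i]` (fixed here for demonstration; in FRISM these coefficients would be learned via the self-distillation objective).

```python
import numpy as np

# Stand-in weight matrices for one layer (toy sizes, random values).
rng = np.random.default_rng(0)
d = 8
W_base = rng.standard_normal((d, d))                 # shared base-model weights
W_lrm = W_base + 0.1 * rng.standard_normal((d, d))   # stand-in LRM weights

# Task vector: the fine-tuning delta encoding the LRM's reasoning ability.
tau = W_lrm - W_base

# SVD: each triple (U[:, i], S[i], Vt[i, :]) spans one rank-1 subspace.
U, S, Vt = np.linalg.svd(tau)

# Per-subspace scaling coefficients lambda_i in [0, 1] (hypothetical values;
# FRISM would optimize these rather than fix them).
lam = np.linspace(1.0, 0.0, num=len(S))

# Subspace-level merge: recombine the rank-1 components, each modulated
# by its coefficient, and add the result back onto the base weights.
tau_scaled = (U * (lam * S)) @ Vt
W_merged = W_base + tau_scaled

# Sanity check: with all lambda_i = 1 the full task vector is recovered.
assert np.allclose(W_base + (U * S) @ Vt, W_lrm)
```

Setting `lam[i] = 1` everywhere reduces this to ordinary task-vector addition; the fine-grained control comes from letting each subspace contribute a different amount.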