🤖 AI Summary
This work proposes a subspace-level fine-grained fusion mechanism to better integrate large reasoning models with vision-language models, addressing a limitation of coarse-grained approaches: they struggle to inject strong reasoning capabilities while preserving the original visual perception abilities. By applying singular value decomposition to task vectors extracted from the reasoning model, the method learns adaptive scaling coefficients that dynamically modulate the contribution of each subspace. A label-free self-distillation strategy additionally enables dual-objective optimization. This approach avoids the performance trade-offs inherent in conventional layer-wise fusion schemes, significantly enhancing reasoning capacity without compromising foundational visual understanding and achieving state-of-the-art results across multiple visual reasoning benchmarks.
📝 Abstract
Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often forces a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and learns an adaptive scaling coefficient for each subspace, realizing fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization that uses common vision-language perception datasets. Extensive experiments demonstrate that FRISM improves reasoning capabilities without compromising the model's original visual capabilities, consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
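The core mechanics described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a task vector is the weight delta between the LRM and the shared base model, decomposes it with SVD into rank-1 subspaces, and rescales each subspace by a coefficient `lam[i]` (fixed here for demonstration; in FRISM these coefficients would be learned via the self-distillation objective).

```python
import numpy as np

# Stand-in weight matrices for one layer (toy sizes, random values).
rng = np.random.default_rng(0)
d = 8
W_base = rng.standard_normal((d, d))                 # shared base-model weights
W_lrm = W_base + 0.1 * rng.standard_normal((d, d))   # stand-in LRM weights

# Task vector: the fine-tuning delta encoding the LRM's reasoning ability.
tau = W_lrm - W_base

# SVD: each triple (U[:, i], S[i], Vt[i, :]) spans one rank-1 subspace.
U, S, Vt = np.linalg.svd(tau)

# Per-subspace scaling coefficients lambda_i in [0, 1] (hypothetical values;
# FRISM would optimize these rather than fix them).
lam = np.linspace(1.0, 0.0, num=len(S))

# Subspace-level merge: recombine the rank-1 components, each modulated
# by its coefficient, and add the result back onto the base weights.
tau_scaled = (U * (lam * S)) @ Vt
W_merged = W_base + tau_scaled

# Sanity check: with all lambda_i = 1 the full task vector is recovered.
assert np.allclose(W_base + (U * S) @ Vt, W_lrm)
```

Setting `lam[i] = 1` everywhere reduces this to ordinary task-vector addition; the fine-grained control comes from letting each subspace contribute a different amount.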