FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

📅 2026-01-29
🤖 AI Summary
This work proposes a subspace-level, fine-grained merging mechanism for integrating large reasoning models into vision-language models. It targets a limitation of coarse-grained, layer-level approaches, which struggle to inject strong reasoning capabilities while preserving the original visual perception abilities. The method decomposes the reasoning model's task vectors with singular value decomposition and learns adaptive scaling coefficients that modulate the contribution of each subspace. Additionally, a label-free self-distillation strategy with dual-objective optimization is introduced. According to the authors, this avoids the performance trade-off of layer-level merging and achieves state-of-the-art results across multiple visual reasoning benchmarks, improving reasoning capacity without compromising foundational visual understanding.

📝 Abstract
Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and learns adaptive scaling coefficients for each subspace to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization, using common vision-language perception datasets. Extensive experiments demonstrate that FRISM improves reasoning capabilities without compromising the model's original visual capabilities, consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
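The subspace-level idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the common task-arithmetic definition of a task vector (LRM weights minus the weights of a shared base model), and the function name `subspace_merge` and its parameters are hypothetical. In FRISM the per-subspace coefficients are learned; here they are simply passed in as fixed values.

```python
import numpy as np

def subspace_merge(w_vlm, w_lrm, w_base, coeffs=None):
    """Merge one weight matrix by rescaling each SVD subspace of a task vector.

    Assumption: the task vector is w_lrm - w_base (standard task arithmetic).
    coeffs holds one scaling coefficient per singular direction; FRISM learns
    these, whereas this sketch takes them as given.
    """
    tau = w_lrm - w_base                                # LRM task vector
    u, s, vt = np.linalg.svd(tau, full_matrices=False)  # tau = u @ diag(s) @ vt
    if coeffs is None:
        coeffs = np.ones_like(s)                        # all ones: plain task-vector addition
    # Rescale each rank-1 subspace s[i] * outer(u[:, i], vt[i]) by coeffs[i].
    delta = (u * (coeffs * s)) @ vt
    return w_vlm + delta

# Toy usage with random matrices standing in for one layer's weights.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(6, 4))
w_lrm = w_base + 0.1 * rng.normal(size=(6, 4))
w_vlm = w_base + 0.1 * rng.normal(size=(6, 4))

# Keep only the leading subspace of the reasoning task vector.
merged = subspace_merge(w_vlm, w_lrm, w_base, coeffs=np.array([1.0, 0.0, 0.0, 0.0]))
```

Setting every coefficient to 1 recovers ordinary task-vector addition, and setting every coefficient to 0 leaves the VLM weights unchanged, so a single layer-level scaling factor is the special case where all coefficients are equal.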
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Reasoning Injection
Model Merging
Visual Reasoning
Fine-Grained Fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subspace-Level Merging
Fine-Grained Reasoning Injection
Singular Value Decomposition
Self-Distillation
Vision-Language Models
Chenyu Huang
Fudan University
Deep Learning, Computer Vision, Model Merging, Model Compression
Peng Ye
Shanghai Artificial Intelligence Laboratory, China; The Chinese University of Hong Kong, China
Xudong Tan
College of Future Information Technology, Fudan University, Shanghai, China
Jinhan Mu
College of Future Information Technology, Fudan University, Shanghai, China
Shenghe Zheng
Harbin Institute of Technology
Large Language Model, Efficient AI, Neural Architecture Search
Li Shen
Associate Professor, Sun Yat-sen University
Machine Learning, Optimization
Tao Chen
Fudan University
Deep Learning, Medical Image Segmentation