🤖 AI Summary
This work addresses a weakness of current video-language models: visual alignment degrades the temporal reasoning ability they inherit from text-only pretraining. The authors propose MERIT, a training-free, task-driven model merging framework. It searches over layer-wise recipes for merging self-attention between a video-language model and its paired text-only backbone, using an objective that rewards better temporal reasoning and penalizes any loss of temporal perception. Interventional masking and frame-level attribution indicate that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. Across three representative video-language models and multiple video benchmarks, MERIT consistently improves temporal reasoning, preserves or improves temporal perception, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection.
📝 Abstract
Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
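The layer-wise merging idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the merging rule, the search procedure, or the exact objective, so the code below assumes linear interpolation of self-attention weights, a greedy forward search over layers, and a score of the form (TR gain) minus `lam` times (TP drop). The names `merge_self_attention`, `greedy_layer_search`, `alpha`, `lam`, and the `layers.<i>.self_attn.*` parameter naming are all hypothetical, and weights are plain lists of floats standing in for tensors.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Toy stand-in for a state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def layer_index(name: str) -> Optional[int]:
    """Parse the layer number from a name such as 'layers.3.self_attn.q_proj'
    (hypothetical naming scheme, assumed for illustration)."""
    parts = name.split(".")
    for i, part in enumerate(parts[:-1]):
        if part == "layers" and parts[i + 1].isdigit():
            return int(parts[i + 1])
    return None


def merge_self_attention(vlm: Weights, text: Weights,
                         layers: FrozenSet[int], alpha: float = 0.5) -> Weights:
    """Interpolate self-attention weights toward the text backbone in the
    selected layers only; every other parameter is copied from the VLM."""
    merged: Weights = {}
    for name, values in vlm.items():
        selected = (".self_attn." in name
                    and layer_index(name) in layers
                    and name in text)
        if selected:
            merged[name] = [(1 - alpha) * v + alpha * t
                            for v, t in zip(values, text[name])]
        else:
            merged[name] = list(values)
    return merged


def objective(tr: float, tp: float, tr_base: float, tp_base: float,
              lam: float = 1.0) -> float:
    """Assumed score: reward the TR gain, penalize only a drop in TP."""
    return (tr - tr_base) - lam * max(0.0, tp_base - tp)


def greedy_layer_search(vlm: Weights, text: Weights, num_layers: int,
                        eval_tr: Callable[[Weights], float],
                        eval_tp: Callable[[Weights], float],
                        alpha: float = 0.5,
                        lam: float = 1.0) -> Tuple[FrozenSet[int], float]:
    """Greedy forward selection (an assumed search strategy): repeatedly add
    the single layer that most improves the objective, stop when none does."""
    tr_base, tp_base = eval_tr(vlm), eval_tp(vlm)
    chosen: FrozenSet[int] = frozenset()
    best = 0.0  # score of the unmerged VLM under this objective
    while True:
        candidates = []
        for layer in range(num_layers):
            if layer in chosen:
                continue
            trial = chosen | {layer}
            weights = merge_self_attention(vlm, text, trial, alpha)
            score = objective(eval_tr(weights), eval_tp(weights),
                              tr_base, tp_base, lam)
            candidates.append((score, trial))
        if not candidates:
            break
        top_score, top_set = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break
        chosen, best = top_set, top_score
    return chosen, best
```

In practice `eval_tr` and `eval_tp` would run the merged model on a temporal reasoning search set and a temporal perception set, respectively. The penalty term is what keeps the search from selecting layers that help reasoning at the cost of perception.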