Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a side effect of visual alignment in video-language models: it degrades the temporal reasoning (TR) ability that the model inherited from text-only pretraining. The authors propose MERIT, a training-free model merging framework that searches for a layer-wise self-attention merging recipe between the video-language model and its paired text-only backbone. The search objective rewards TR gains and penalizes losses in temporal perception (TP). Across three video-language models and multiple video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes to four benchmarks outside the search set. It also outperforms uniform full-model merging and random layer selection, which indicates that recovery depends on choosing the right layers. Interventional masking and frame-level attribution further indicate that the selected layers are disproportionately important for reasoning.

📝 Abstract

Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
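The abstract describes two mechanisms: merging self-attention on selected layers only, and scoring a recipe by TR gain with a penalty for TP loss. The following is a minimal sketch of both ideas, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the parameter naming scheme (`layers.<i>.self_attn.*`), the use of linear interpolation with a single coefficient `alpha`, the flat-list weight representation, and the exact form of `recipe_score`. The abstract does not specify any of these.

```python
import re
from typing import Dict, List, Set

# Toy weight container: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]

# Hypothetical naming scheme, e.g. "layers.3.self_attn.q_proj".
_ATTN_PATTERN = re.compile(r"^layers\.(\d+)\.self_attn\.")


def merge_selected_layers(
    vlm: Weights, text: Weights, layers: Set[int], alpha: float = 0.5
) -> Weights:
    """Interpolate self-attention weights toward the text backbone,
    but only on the selected layers. All other weights are copied
    from the video-language model unchanged.

    Linear interpolation is an assumption of this sketch.
    """
    merged: Weights = {}
    for name, w_vlm in vlm.items():
        match = _ATTN_PATTERN.match(name)
        if match and int(match.group(1)) in layers and name in text:
            w_txt = text[name]
            merged[name] = [
                (1 - alpha) * a + alpha * b for a, b in zip(w_vlm, w_txt)
            ]
        else:
            merged[name] = list(w_vlm)
    return merged


def recipe_score(tr_gain: float, tp_change: float, penalty: float = 1.0) -> float:
    """Hypothetical search objective: reward the TR gain and subtract
    a penalty only when TP drops (tp_change < 0)."""
    return tr_gain - penalty * max(0.0, -tp_change)
```

Under these assumptions, a search would call `merge_selected_layers` for each candidate layer set, evaluate the merged model on TR and TP probes, and keep the set with the highest `recipe_score`. Uniform full-model merging corresponds to selecting every layer, which is one of the baselines the abstract compares against.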

Problem

Research questions and friction points this paper is trying to address.

temporal reasoning

video-language models

multimodal adaptation

reasoning degradation

visual alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

layer-selective merging

temporal reasoning

video-language models