Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from "visual forgetting" in long-chain visual reasoning (e.g., geometric math problems): attention to the image decays progressively as inference proceeds, leading to over-reliance on textual cues. To address this, the paper proposes Take-along Visual Conditioning (TVC), a mechanism that preserves and continually re-supplies critical visual cues throughout reasoning. TVC (1) re-injects the image at critical reasoning stages, reloading the original visual representations when attention to them has decayed, and (2) dynamically prunes redundant visual tokens while retaining discriminative image regions. Evaluated on five mathematical reasoning benchmarks, TVC improves average accuracy by 3.4%, achieving state-of-the-art performance.

📝 Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information; that is, their attention to visual information gradually declines as reasoning progresses, producing text-over-reliant outputs. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning with the input image removed. We observe only a ~2% accuracy drop on MathVista's test-hard subset, revealing that the model's own textual outputs dominate the subsequent reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs. the previous SOTA), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.
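The image-ablation probe described in the abstract (truncate the reasoning chain midway, then re-complete it with the image removed) can be sketched as below. The `model.generate` interface is an assumption for illustration, not the authors' actual code.

```python
# Hypothetical sketch of the paper's ablation probe: generate a full
# reasoning chain with the image, cut it at the midpoint, then let the
# model finish from that prefix WITHOUT the image. Comparing accuracy of
# the re-completed answers against the original run measures how much
# the later reasoning actually depends on the visual input.

def ablate_image_midway(model, image, question, max_steps=512):
    # Step 1: full multimodal generation to obtain a complete chain.
    full_chain = model.generate(image=image, prompt=question,
                                max_new_tokens=max_steps)

    # Step 2: truncate the chain at its midpoint.
    half = full_chain[: len(full_chain) // 2]

    # Step 3: re-complete from the truncated prefix with the image removed.
    completion = model.generate(image=None, prompt=question + half,
                                max_new_tokens=max_steps)
    return half + completion
```

In the paper's experiment, the small (~2%) accuracy drop from this procedure is what reveals that the textual prefix, not the image, dominates the remaining reasoning.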
Problem

Research questions and friction points this paper is trying to address.

Addresses visual forgetting, where MLLMs lose attention to visual input during long-chain reasoning.
Proposes Take-along Visual Conditioning to maintain visual focus throughout reasoning.
Improves accuracy on mathematical reasoning benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Take-along Visual Conditioning (TVC)
Shifts image input to critical reasoning stages
Compresses redundant visual tokens via dynamic pruning
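The token-compression idea above can be sketched as attention-guided top-k selection. This is a minimal illustration, assuming access to per-token text-to-image attention scores; the selection rule here is illustrative, not the paper's exact pruning criterion.

```python
# Minimal sketch of dynamic visual token pruning: keep only the visual
# tokens that currently receive the most attention from the text tokens,
# discarding the rest to compress the visual context before re-injection.
import numpy as np

def prune_visual_tokens(visual_tokens, attn_scores, keep_ratio=0.25):
    """Keep the most-attended visual tokens.

    visual_tokens: (N, D) array of image token embeddings.
    attn_scores:   (N,) mean attention each visual token receives from
                   the text tokens at the current reasoning step.
    keep_ratio:    fraction of tokens to retain (illustrative default).
    """
    n_keep = max(1, int(len(visual_tokens) * keep_ratio))
    # Top-k most-attended token indices, restored to original order so
    # the pruned sequence preserves spatial/positional structure.
    keep = np.sort(np.argsort(attn_scores)[-n_keep:])
    return visual_tokens[keep], keep
```

The design choice of re-sorting the kept indices matters: positional order is preserved so the pruned tokens can be re-injected as a coherent (if sparser) view of the image.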