🤖 AI Summary
This study investigates how vision-language models (VLMs) integrate visual and textual information during chain-of-thought (CoT) reasoning and how susceptible they are to misleading textual cues. By tracking confidence over the course of CoT, applying controlled interventions that introduce misleading text, and comparing 18 models, the work identifies an “answer inertia” phenomenon: models tend to reinforce an early commitment to a prediction rather than revise it during reasoning. Reasoning-focused training strengthens corrective behavior, but the gains depend on modality conditions, and models remain influenced by misleading textual cues even when the visual evidence is sufficient. Reasoning-trained models refer to the cues more explicitly, yet their longer, fluent CoTs can appear visually grounded while actually following the text; instruction-tuned models cite the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Overall, CoT offers only a partial view of how each modality drives a model's decision.
📝 Abstract
Recent vision-language models (VLMs) offer reasoning capabilities, yet how this reasoning unfolds and integrates visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced rather than revised during reasoning. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, which range from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and we assess whether this influence is recoverable from the CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to refer to the cues explicitly, but their longer, fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
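To make the idea of tracking confidence over a CoT concrete, the following is a minimal Python sketch of how answer inertia and corrective behavior might be summarized from a confidence trajectory. It is an illustration under stated assumptions, not the paper's actual metric: the input format (one probability vector over answer options per reasoning step) and the names `summarize_trajectory` and `inertia_rate` are hypothetical.

```python
from typing import Dict, List, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the most probable answer option."""
    return max(range(len(probs)), key=lambda i: probs[i])


def summarize_trajectory(trajectory: List[Sequence[float]], gold: int) -> Dict:
    """Summarize one example.

    Assumption: trajectory[t] holds the model's probabilities over the answer
    options after t reasoning steps, with t = 0 being before any CoT.
    """
    first = argmax(trajectory[0])
    last = argmax(trajectory[-1])
    return {
        "initially_wrong": first != gold,
        # The final answer is the same as the early commitment.
        "kept_initial": first == last,
        # Reasoning moved the model from a wrong answer to the gold answer.
        "corrected": first != gold and last == gold,
        # Change in confidence in the initial answer over the whole CoT.
        "confidence_gain": trajectory[-1][first] - trajectory[0][first],
    }


def inertia_rate(summaries: List[Dict]) -> float:
    """Fraction of initially wrong examples whose early answer was never revised."""
    wrong = [s for s in summaries if s["initially_wrong"]]
    if not wrong:
        return 0.0
    return sum(s["kept_initial"] for s in wrong) / len(wrong)


# Toy trajectories (made-up numbers, three answer options, gold answer = 1).
stuck = summarize_trajectory([[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.8, 0.1, 0.1]], gold=1)
fixed = summarize_trajectory([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]], gold=1)
rate = inertia_rate([stuck, fixed])
```

In this toy setup, the first trajectory starts on a wrong option and grows more confident in it, while the second is revised to the gold answer, so half of the initially wrong examples show inertia. How per-step probabilities would be obtained from a real VLM is left open here.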