Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the root cause of performance disparities in vision-language models (VLMs) on analogous vision and language tasks, such as counting, where visual and textual inputs demand functionally equivalent reasoning. Identifying the task-specific computational sub-graphs (circuits) in each modality, the authors find that the circuits are largely disjoint yet implement similar functions: the differences lie in how modality-specific data positions are processed, with visual representations aligning to their higher-performing textual counterparts only in later layers, too late to influence subsequent positions. To address this, they propose a training-free intervention that patches visual-token representations from later layers back into earlier layers. Validated across multiple VLMs (e.g., LLaVA, Qwen-VL) and tasks (e.g., visual question answering, counting), the intervention closes about a third of the vision-language performance gap on average, offering a lightweight, generalizable way to synchronize visual and linguistic representations.

📝 Abstract
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs (e.g., counting objects in an image), yet demonstrate higher accuracies when performing an analogous task on text (e.g., counting words in a text). We investigate this accuracy gap by identifying and comparing the *circuits* - the task-specific computational sub-graphs - in different modalities. We show that while circuits are largely disjoint between modalities, they implement relatively similar functionalities: the differences lie primarily in processing modality-specific data positions (an image or a text sequence). Zooming in on the image data representations, we observe they become aligned with the higher-performing analogous textual representations only towards later layers, too late in processing to effectively influence subsequent positions. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers. In experiments with multiple tasks and models, this simple intervention closes a third of the performance gap between the modalities, on average. Our analysis sheds light on the multi-modal performance gap in VLMs and suggests a training-free approach for reducing it.
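The back-patching intervention the abstract describes can be sketched in a few lines. The toy position-wise model, layer count, patch layer indices, and image-token positions below are illustrative assumptions, not the paper's implementation; the point is only the mechanic of caching a clean pass and reinserting later-layer visual-token states at an earlier layer.

```python
import numpy as np

def run_layers(hidden, layers, patch=None):
    """Run a stack of layer functions over hidden states [seq_len, dim].

    `patch` optionally maps a layer index to (positions, replacement
    states): before that layer runs, the hidden states at those token
    positions are overwritten -- the back-patching intervention.
    """
    cache = []
    for i, layer in enumerate(layers):
        if patch and i in patch:
            pos, repl = patch[i]
            hidden = hidden.copy()   # don't mutate the caller's array
            hidden[pos] = repl
        hidden = layer(hidden)
        cache.append(hidden.copy())
    return hidden, cache

# Toy stand-in for a transformer: 8 position-wise linear layers.
# (A real VLM mixes positions via attention, so the patch would also
# change what later text positions read from the visual tokens.)
rng = np.random.default_rng(0)
dim, seq_len = 4, 6
layers = [(lambda W: lambda h: h @ W)(rng.normal(size=(dim, dim)) * 0.5)
          for _ in range(8)]
visual_pos = [0, 1, 2]            # hypothetical image-token positions
hidden0 = rng.normal(size=(seq_len, dim))

# 1) Clean forward pass, caching every layer's output.
clean_out, cache = run_layers(hidden0, layers)

# 2) Back-patch: reinsert the layer-6 visual-token states before layer 2,
#    then rerun the model with the intervention in place.
patch = {2: (visual_pos, cache[6][visual_pos])}
patched_out, _ = run_layers(hidden0, layers, patch=patch)
```

In a real model the same pattern is typically implemented with forward hooks that overwrite the residual-stream activations at the chosen layer and token positions.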
Problem

Research questions and friction points this paper is trying to address.

Investigates the accuracy gap between analogous visual and textual tasks in VLMs
Compares modality-specific circuits and their functional similarities
Proposes a method to align visual and textual representations and improve performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies and compares modality-specific computational circuits
Patches visual-token representations from later layers back into earlier layers
Offers a training-free approach that reduces the performance gap