Mechanisms of Object Localization in Vision-Language Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines the poorly understood internal mechanisms behind object localization in vision-language models (VLMs), a gap that limits both interpretability and efforts to improve performance. Using token ablation, attention knockout, and causal mediation analysis on LLaVA-1.5 and InternVL-3.5, it gives what the authors describe as the first layer- and head-level account of localization. Two findings stand out. First, localization follows a "containerization" mechanism: object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens inside those boundaries is largely irrelevant to the predicted box. Second, only a very small set of attention heads mediates the causal effect. Localization and classification share some early processing but ultimately depend on largely distinct specialized heads, which concentrate in early-to-mid layers for LLaVA and mid-to-late layers for InternVL.
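To make "attention knockout" concrete, here is a minimal sketch of the general idea on a toy single-head attention step. It assumes the common formulation in which selected query-to-key scores are set to negative infinity before the softmax. The function names (`softmax`, `attend`), the `blocked` parameter, and the toy vectors are all hypothetical and are not taken from the paper's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax; a score of -inf receives exactly zero weight."""
    finite = [s for s in scores if s != -math.inf]
    if not finite:
        raise ValueError("all key positions are blocked")
    m = max(finite)
    exps = [0.0 if s == -math.inf else math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, blocked=frozenset()):
    """Scaled dot-product attention for one query position (toy, single head).

    blocked: hypothetical set of key indices the query may not read from;
    their scores are set to -inf, which is the knockout.
    """
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(keys):
        if i in blocked:
            scores.append(-math.inf)
        else:
            scores.append(sum(q * k for q, k in zip(query, key)) / scale)
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy setup: key positions 1 and 2 stand in for object-aligned image tokens.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

clean_w, clean_out = attend(query, keys, values)
ko_w, ko_out = attend(query, keys, values, blocked={1, 2})

# Size of the change in the head's output caused by the knockout.
effect = math.dist(clean_out, ko_out)
```

In a real VLM the same masking would be applied inside chosen layers and heads, and the effect would be measured on the predicted box or class rather than on a raw head output.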

📝 Abstract

Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.
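The claim that only a small set of heads mediates the causal effect can be illustrated with a toy activation-patching sketch. It assumes a deliberately simplified "model" whose output is a weighted sum of per-head activations, so it only shows the bookkeeping of head-level causal mediation, not the paper's actual procedure. The names `run` and `indirect_effects`, the weights, and the inputs are hypothetical.

```python
def run(head_inputs, head_weights, patch=None):
    """Toy model: the output is the weighted sum of per-head activations.

    patch: optional dict {head_index: activation} that overrides a head's
    activation, mimicking activation patching.
    """
    acts = list(head_inputs)
    if patch:
        for head, activation in patch.items():
            acts[head] = activation
    output = sum(w * a for w, a in zip(head_weights, acts))
    return output, acts

def indirect_effects(clean_inputs, corrupt_inputs, head_weights):
    """Patch each head's clean activation into the corrupted run, one head
    at a time, and return the fraction of the clean-vs-corrupt output gap
    that each single patch restores."""
    clean_out, clean_acts = run(clean_inputs, head_weights)
    corrupt_out, _ = run(corrupt_inputs, head_weights)
    gap = clean_out - corrupt_out
    effects = []
    for head in range(len(head_weights)):
        patched_out, _ = run(corrupt_inputs, head_weights,
                             patch={head: clean_acts[head]})
        effects.append((patched_out - corrupt_out) / gap if gap else 0.0)
    return effects

# Hypothetical setup: 4 heads, only head 1 matters much for the output.
head_weights = [0.0, 4.0, 0.5, 0.0]
clean = [1.0, 1.0, 1.0, 1.0]
corrupt = [0.0, 0.0, 0.0, 0.0]

effects = indirect_effects(clean, corrupt, head_weights)
top_head = max(range(len(effects)), key=lambda h: effects[h])
```

Here a single head restores most of the output gap, which is the kind of concentrated pattern the abstract describes. In an actual transformer the heads interact nonlinearly, so per-head effects need not sum to one as they do in this linear toy.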

Problem

Research questions and friction points this paper is trying to address.

object localization

vision-language models

mechanistic interpretability

attention mechanisms

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

object localization

vision-language models

mechanistic interpretability