VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-performing vision-language models still exhibit poorly understood failure modes. This work proposes an interactive framework for interpretable visual concept analysis during inference: it employs sparse autoencoders to extract human-understandable concepts from the visual encoder, links these concepts to text tokens via text-to-image attention, and performs causal analysis through token-latent attribution heatmaps and concept ablation. Through case studies, the approach uncovers three underexplored failure modes—limited cross-modal alignment, misleading visual concepts, and unused hidden cues—thereby offering a practical tool for model debugging and targeted improvement.

📝 Abstract
High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging, we introduce VisualScratchpad, an interactive interface for visual concept analysis during inference. We apply sparse autoencoders to the vision encoder and link the resulting visual concepts to text tokens via text-to-image attention, allowing us to examine which visual concepts are both captured by the vision encoder and utilized by the language model. VisualScratchpad also provides a token-latent heatmap view that suggests a sufficient set of latents for effective concept ablation in causal analysis. Through case studies, we reveal three underexplored failure modes: limited cross-modal alignment, misleading visual concepts, and unused hidden cues. Project page: https://hyesulim.github.io/visual_scratchpad_projectpage/
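The pipeline in the abstract—sparse-autoencoder latents over vision-encoder patch features, linked to text tokens through text-to-image attention, then ablated for causal analysis—can be illustrated with a minimal numpy sketch. All dimensions, the random weights (standing in for a trained SAE), and the random attention matrix are toy assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper)
d_model, n_latents, n_patches, n_text = 16, 64, 8, 4

# Randomly initialized SAE weights; a trained SAE would be used in practice
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)

def sae_encode(x):
    """Map activations to sparse, non-negative latent codes via ReLU."""
    return np.maximum(x @ W_enc, 0.0)

def sae_decode(z):
    """Reconstruct activations from latent codes."""
    return z @ W_dec

# Stand-in for vision-encoder patch activations
patch_acts = rng.normal(size=(n_patches, d_model))
z = sae_encode(patch_acts)                      # (n_patches, n_latents)

# Stand-in text-to-image attention weights; rows sum to 1
attn = rng.random(size=(n_text, n_patches))
attn /= attn.sum(axis=1, keepdims=True)

# Token-latent heatmap: how strongly each text token attends to each concept
token_latent = attn @ z                         # (n_text, n_latents)

# Concept ablation: zero the latent most attended by token 0, then decode
top_latent = int(token_latent[0].argmax())
z_ablated = z.copy()
z_ablated[:, top_latent] = 0.0
patch_acts_ablated = sae_decode(z_ablated)      # (n_patches, d_model)
```

In the actual tool, comparing the model's answer before and after such an ablation is what turns the heatmap from a descriptive view into a causal probe.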
Problem

Research questions and friction points this paper is trying to address.

vision language models
failure modes
model interpretability
visual concepts
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

VisualScratchpad
sparse autoencoders
visual concept analysis
cross-modal alignment
causal ablation