TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

πŸ“… 2024-12-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language models (VLMs) employ multi-head attention, which renders the attention process opaque and difficult to intervene in. This work proposes the Transformer Attention Bottleneck (TAB) layer, a lightweight, editable bottleneck module inserted after multi-head self-attention. TAB adopts a single-head design whose total attention is constrained to [0, 1], explicitly regulating visual information flow: when total attention is zero, the model defaults to a generic, image-agnostic response, and users can directly edit the attention map to steer outputs. To our knowledge, TAB is the first method to unify change localization, zero-change detection, and interpretable debugging within a single framework. On three standard benchmarks, TAB preserves the original vision-language generation performance while significantly improving change-localization accuracy (+12.3% mAP) and achieving precise zero-change identification. Human-in-the-loop intervention yields outputs aligned with user intent, empirically demonstrating that controllability and interpretability can be attained together.

πŸ“ Abstract
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to $\in [0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) defaults to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning, but the bottleneck is superior at localizing changes and at identifying when no changes occur. TAB is the first architecture that enables users to intervene by editing attention, which often steers VLMs to produce the expected outputs.
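The core mechanism, a single attention head whose total attention mass over patches lies in [0, 1], can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the learnable "sink" logit (an extra softmax slot that absorbs attention mass so patch weights need not sum to 1) and the `attn_edit` override argument are assumptions introduced here to mirror the bottleneck and user-intervention behavior described in the abstract.

```python
import torch
import torch.nn as nn


class TABLayer(nn.Module):
    """Hypothetical sketch of a single-head Transformer Attention Bottleneck.

    An extra "sink" logit is appended to the softmax, so the weights over
    the N visual patches sum to <= 1 and can collapse toward 0, in which
    case (almost) no visual information is propagated further.
    """

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Learnable "attend to nothing" logit (an assumption of this sketch).
        self.sink = nn.Parameter(torch.zeros(1))
        self.scale = dim ** -0.5

    def forward(self, query, patches, attn_edit=None):
        # query: (B, 1, D) query token; patches: (B, N, D) visual tokens.
        scores = (self.q(query) @ self.k(patches).transpose(-2, -1)) * self.scale
        sink = self.sink.expand(scores.shape[0], 1, 1)           # (B, 1, 1)
        weights = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        attn = weights[..., :-1]  # patch weights; total mass is in [0, 1]
        if attn_edit is not None:
            # User intervention: directly override the attention map.
            attn = attn_edit
        return attn @ self.v(patches), attn
```

Because the single head's attention map is the only path for visual information, a user can inspect it directly, or pass `attn_edit=torch.zeros_like(attn)` to suppress the image entirely and elicit an image-independent response.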
Problem

Research questions and friction points this paper is trying to address.

Multi-head Attention
Vision-Language Models
Attention Allocation Control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-Head Transformer Attention Bottleneck (TAB)
Attention Control
Image Difference Captioning
πŸ”Ž Similar Papers
No similar papers found.