🤖 AI Summary
This study investigates how Vision Transformers (ViTs) achieve object binding, specifically whether they rely on the Gestalt principle of continuity over and above similarity and proximity. Using synthetic image datasets, binding probes, attention-head analysis, cross-dataset generalization tests, and ablation studies, the work presents evidence that particular attention heads in pretrained ViTs track continuity and often contribute to representations that encode object binding. Binding probes are sensitive to continuity across a wide range of pretrained ViTs, and the continuity-tracking heads generalize across datasets.
📝 Abstract
Object binding is a foundational process in visual cognition, during which low-level perceptual features are joined into object representations. Binding has been considered a fundamental challenge for neural networks, and a major milestone on the way to artificial models with flexible visual intelligence. Recently, several investigations have demonstrated evidence that binding mechanisms emerge in pretrained vision models, enabling them to associate portions of an image that contain an object. The question remains: how are these models binding objects together? In this work, we investigate whether vision models rely on the principle of Gestalt continuity to perform object binding, over and above other principles like similarity and proximity. Using synthetic datasets, we demonstrate that binding probes are sensitive to continuity across a wide range of pretrained vision transformers. Next, we uncover particular attention heads that track continuity, and show that these heads generalize across datasets. Finally, we ablate these attention heads, and show that they often contribute to producing representations that encode object binding.
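As a rough illustration of what ablating attention heads can mean in practice, the sketch below replaces the output of selected heads with zeros (or with a supplied per-head mean vector) before the per-head outputs are concatenated. This is a minimal, framework-free sketch and not the authors' implementation: the function name `ablate_heads`, the nested-list data layout, and the choice of zero or mean ablation are all assumptions made for illustration, since the abstract does not specify the ablation procedure.

```python
from typing import List, Optional, Sequence, Set

Vector = List[float]


def ablate_heads(
    head_outputs: Sequence[Sequence[Vector]],
    heads_to_ablate: Set[int],
    mean_outputs: Optional[Sequence[Vector]] = None,
) -> List[Vector]:
    """Concatenate per-head outputs per token, replacing ablated heads.

    head_outputs[h][t] is the (hypothetical) output vector of head h at token t.
    An ablated head contributes zeros, or mean_outputs[h] if mean_outputs is given.
    """
    n_heads = len(head_outputs)
    n_tokens = len(head_outputs[0])
    result: List[Vector] = []
    for t in range(n_tokens):
        token_vec: Vector = []
        for h in range(n_heads):
            vec = list(head_outputs[h][t])
            if h in heads_to_ablate:
                # Assumption: zero ablation by default, mean ablation if provided.
                if mean_outputs is not None:
                    vec = list(mean_outputs[h])
                else:
                    vec = [0.0] * len(vec)
            token_vec.extend(vec)
        result.append(token_vec)
    return result


# Toy example: 2 heads, 2 tokens, head dimension 2; ablate head 1.
toy_outputs = [
    [[1.0, 2.0], [3.0, 4.0]],  # head 0
    [[5.0, 6.0], [7.0, 8.0]],  # head 1
]
ablated = ablate_heads(toy_outputs, heads_to_ablate={1})
```

In an experiment of this kind, one would then compare a binding probe's accuracy on representations computed with and without the ablation. A drop in accuracy after removing a head suggests that the head contributes to the binding signal.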