From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

📅 2025-05-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates whether self-supervised vision models spontaneously develop human-like Gestalt perception—specifically illusory contour completion, convexity preference, and dynamic figure-ground segregation—and examines whether modeling global spatial structure is necessary for it. We introduce DiSRT (Distorted Spatial Relationship Testbench), the first diagnostic benchmark to systematically evaluate model sensitivity to core Gestalt principles, including closure, proximity, and figure-ground assignment. Our experiments reveal that self-supervised pretraining (e.g., MAE) induces robust Gestalt perception, whereas subsequent supervised fine-tuning degrades it; introducing a Top-K sparse activation mechanism effectively restores global spatial sensitivity. Notably, self-supervised ViT and ConvNeXt models evaluated on DiSRT outperform supervised baselines, with some metrics exceeding human performance. These results demonstrate that Gestalt organization does not require attention mechanisms per se and is instead modulated by training paradigms—highlighting the critical role of objective design in shaping perceptual priors.

📝 Abstract
Human vision organizes local cues into coherent global forms using Gestalt principles like closure, proximity, and figure-ground assignment -- functions reliant on global spatial structure. We investigate whether modern vision models show similar behaviors, and under what training conditions these emerge. We find that Vision Transformers (ViTs) trained with Masked Autoencoding (MAE) exhibit activation patterns consistent with Gestalt laws, including illusory contour completion, convexity preference, and dynamic figure-ground segregation. To probe the computational basis, we hypothesize that modeling global dependencies is necessary for Gestalt-like organization. We introduce the Distorted Spatial Relationship Testbench (DiSRT), which evaluates sensitivity to global spatial perturbations while preserving local textures. Using DiSRT, we show that self-supervised models (e.g., MAE, CLIP) outperform supervised baselines and sometimes even exceed human performance. ConvNeXt models trained with MAE also exhibit Gestalt-compatible representations, suggesting such sensitivity can arise without attention architectures. However, classification finetuning degrades this ability. Inspired by biological vision, we show that a Top-K activation sparsity mechanism can restore global sensitivity. Our findings identify training conditions that promote or suppress Gestalt-like perception and establish DiSRT as a diagnostic for global structure sensitivity across models.
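The core idea behind DiSRT—perturbing global spatial arrangement while preserving local texture statistics—can be illustrated with a simple patch-shuffling transform. This is a hypothetical sketch of that style of probe, not the benchmark's actual code; the patch size and permutation scheme here are illustrative assumptions:

```python
import numpy as np

def shuffle_patches(img, patch=8, seed=0):
    """DiSRT-style perturbation sketch: permute non-overlapping patches so
    local texture is preserved while global spatial structure is destroyed.
    A model sensitive to global arrangement should respond very differently
    to the shuffled image; a purely local texture model should not."""
    h, w = img.shape[:2]
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # Split the image into a (gh*gw, patch, patch, C) stack of patches.
    patches = (img.reshape(gh, patch, gw, patch, -1)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(gh * gw, patch, patch, -1))
    # Randomly permute the patch order (global structure destroyed).
    rng = np.random.default_rng(seed)
    patches = patches[rng.permutation(gh * gw)]
    # Reassemble the shuffled grid back into an image of the same shape.
    out = (patches.reshape(gh, gw, patch, patch, -1)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(h, w, -1))
    return out
```

Comparing a model's representations of the original and shuffled images then gives a score for global-structure sensitivity, which is the quantity DiSRT measures.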
Problem

Research questions and friction points this paper is trying to address.

Investigates Gestalt principles in vision models
Tests global spatial sensitivity via DiSRT
Explores training impact on Gestalt perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViTs with MAE show Gestalt-like activation patterns
DiSRT tests global spatial sensitivity in models
Top-K sparsity restores global sensitivity post-finetuning
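The Top-K sparsity idea above can be sketched as an activation filter that keeps only the k strongest responses per sample and zeroes the rest. This is a minimal illustration under stated assumptions—the paper's exact choice of k and of where the mechanism sits in the network is not reproduced here:

```python
import numpy as np

def topk_sparsify(x, k):
    """Keep the k largest activations in each row of x, zero everything else.
    A minimal sketch of a Top-K activation sparsity mechanism (illustrative;
    not the paper's exact implementation)."""
    # Indices of the k largest entries in each row (unordered among themselves).
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    # Build a boolean mask that is True only at those positions.
    mask = np.zeros_like(x, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    # Zero out all non-top-k activations.
    return np.where(mask, x, 0.0)
```

Applied after a layer's activations, such a filter forces the representation to rely on a few strong, spatially informative units, which is the mechanism the paper credits with restoring global sensitivity after classification finetuning.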