Perceptual Inductive Bias Is What You Need Before Contrastive Learning

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing contrastive representation learning methods neglect the multi-stage nature of human visual perception, leading to insufficient inductive bias, slow convergence, and texture bias. Method: This paper introduces the first systematic integration of Marr's vision theory into self-supervised pretraining, proposing a "boundary → surface → semantics" three-stage perceptual prior. We design a two-stage training paradigm: an initial perception-driven pretraining phase, in which the early ResNet-18 layers are jointly optimized to encode edge and surface features, followed by standard contrastive learning. Contribution/Results: Our approach doubles convergence speed, consistently improves performance on semantic segmentation, depth estimation, and object recognition, and significantly enhances out-of-distribution robustness. Crucially, it explicitly incorporates low-level visual priors to mitigate learning shortcuts and texture bias, establishing a cognitively inspired paradigm for visual representation learning.
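Below is a minimal sketch of what the first, perception-driven phase could look like, assuming PyTorch and a torchvision ResNet-18. The `sobel_edges` and `surface_target` helpers are illustrative stand-ins for the paper's boundary and surface supervision, and the choice of which layers count as "early" is an assumption, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# Phase 1: optimize the early ResNet-18 layers against low-level
# perceptual targets (boundaries, surfaces) before any contrastive
# objective touches the network.
backbone = resnet18(weights=None)
early = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                      backbone.maxpool, backbone.layer1)  # "early layers": an assumption
edge_head = nn.Conv2d(64, 1, kernel_size=1)     # predicts a boundary map
surface_head = nn.Conv2d(64, 3, kernel_size=1)  # predicts a surface/shading map

def sobel_edges(x):
    """Illustrative boundary target: Sobel gradient magnitude of the grayscale image."""
    gray = x.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, kx.transpose(2, 3), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def surface_target(x):
    """Illustrative surface target: a heavily smoothed, edge-free version of the image."""
    return F.avg_pool2d(x, kernel_size=9, stride=1, padding=4)

params = (list(early.parameters()) + list(edge_head.parameters())
          + list(surface_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

def perceptual_pretrain_step(images):
    feats = early(images)                    # (B, 64, H/4, W/4)
    size = feats.shape[-2:]
    edge_t = F.interpolate(sobel_edges(images), size=size, mode='bilinear')
    surf_t = F.interpolate(surface_target(images), size=size, mode='bilinear')
    loss = (F.l1_loss(edge_head(feats), edge_t)
            + F.l1_loss(surface_head(feats), surf_t))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Once this phase converges, the prediction heads are discarded and the backbone carries its perceptually warmed-up early layers into the standard contrastive stage.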

📝 Abstract
David Marr's seminal theory of human perception stipulates that visual processing is a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of human vision, leading to slower convergence and learning shortcuts that manifest as texture bias. In this work, we demonstrate that leveraging Marr's multi-stage theory, by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics, leads to 2x faster convergence on ResNet-18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. In summary, we propose a pretraining stage that precedes general contrastive representation pretraining, improving final representation quality and reducing overall convergence time via inductive biases from the human visual system.
Problem

Research questions and friction points this paper is trying to address.

Enhancing contrastive learning with perceptual inductive bias
Improving convergence speed and representation quality
Reducing texture bias via multi-stage visual processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage Marr's multi-stage visual processing theory
Construct boundary and surface-level representations first
Enhance contrastive learning with perceptual inductive bias (see the contrastive-stage sketch below)
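For concreteness, here is a sketch of the second phase under the assumption of a SimCLR-style NT-Xent objective. The paper specifies only "standard contrastive learning", so the particular loss and projection head are illustrative choices; the backbone is the ResNet-18 whose early layers were warmed up in phase 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# Phase 2: standard contrastive learning on the perceptually
# warmed-up backbone (NT-Xent shown as one concrete choice).
backbone = resnet18(weights=None)  # in practice: carries phase-1 early-layer weights
backbone.fc = nn.Identity()        # expose the 512-d pooled features
proj = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.t() / tau                               # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # exclude self-similarity
    n = z.shape[0]
    targets = torch.arange(n, device=z.device).roll(n // 2)  # row i's positive is i +/- B
    return F.cross_entropy(sim, targets)

opt = torch.optim.Adam(list(backbone.parameters()) + list(proj.parameters()), lr=1e-3)

def contrastive_step(view1, view2):
    # view1/view2: two random augmentations of the same image batch
    z1 = proj(backbone(view1))
    z2 = proj(backbone(view2))
    loss = nt_xent(z1, z2)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Nothing in this loop is specific to the method: the reported gains come entirely from starting it from perceptually pretrained early layers rather than from random initialization.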
Tianqin Li
Computer Science Ph.D. student, Carnegie Mellon University
Deep Learning · Computational Neuroscience · Brain-inspired AI
Junru Zhao
Carnegie Mellon University
Dunhan Jiang
Carnegie Mellon University
Shenghao Wu
Carnegie Mellon University
Alan Ramirez
Carnegie Mellon University
Tai Sing Lee
Professor of Computer Science, Carnegie Mellon University
Computational Neuroscience · Computer Vision