Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

📅 2026-05-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study asks whether human visual representations align more closely with discriminative or generative learning objectives, and how the choice of objective affects alignment with human perception. The authors use Joint Energy-Based Models (JEMs) to interpolate continuously between the two paradigms by adjusting a single mixing coefficient while holding architecture, model scale, and training data constant. They evaluate the resulting models on six human-alignment benchmarks: perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across these benchmarks, alignment with human vision peaks at an intermediate training objective rather than at the purely discriminative or purely generative extreme, which challenges the conventional dichotomy. The authors attribute this to hybrid JEMs combining the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning.
📝 Abstract
A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
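To make the "single mixing coefficient" concrete, here is a minimal sketch of how a JEM-style hybrid objective could be assembled from classifier logits. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact parameterization, so the name `mix`, the linear weighting `(1 - mix) * discriminative + mix * generative`, and the contrastive energy-gap surrogate for the generative term are all hypothetical. A real JEM would draw the negative samples with a sampler such as SGLD, which is omitted here; `logits_sampled` simply stands in for the logits of such samples.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(row))) for one list of logits."""
    m = max(row)
    return m + math.log(sum(math.exp(v - m) for v in row))


def cross_entropy(logits, labels):
    """Discriminative term: mean of -log p(y|x) under a softmax over logits."""
    losses = [logsumexp(row) - row[y] for row, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def mean_energy(logits):
    """JEM-style energy E(x) = -logsumexp_y f(x)[y], averaged over a batch."""
    return sum(-logsumexp(row) for row in logits) / len(logits)


def hybrid_loss(logits_data, labels, logits_sampled, mix):
    """Assumed hybrid objective: (1 - mix) * discriminative + mix * generative.

    mix = 0.0 is purely discriminative and mix = 1.0 is purely generative.
    The generative term is a contrastive surrogate: it lowers the energy of
    real data and raises the energy of model samples.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    discriminative = cross_entropy(logits_data, labels)
    generative = mean_energy(logits_data) - mean_energy(logits_sampled)
    return (1.0 - mix) * discriminative + mix * generative


if __name__ == "__main__":
    # Toy logits: two real inputs with labels, one stand-in "sampled" input.
    logits_data = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    labels = [0, 1]
    logits_sampled = [[0.0, 0.0, 0.0]]
    for mix in (0.0, 0.25, 0.5, 0.75, 1.0):
        loss = hybrid_loss(logits_data, labels, logits_sampled, mix)
        print(f"mix={mix:.2f}  loss={loss:.4f}")
```

Sweeping `mix` over a grid like this, with everything else fixed, mirrors the experimental design the abstract describes: only the objective changes between models, so differences in human alignment can be attributed to it.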
Problem

Research questions and friction points this paper is trying to address.

human alignment
discriminative learning
generative learning
visual representation
learning objective
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Energy-Based Models
human alignment
generative-discriminative continuum
visual representation learning
learning objective interpolation