Towards aligned body representations in vision models

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether vision models trained for semantic segmentation spontaneously develop human-like coarse-grained body representations that support physical reasoning. Adapting psychophysical experimental paradigms, the authors conduct representational analyses across a family of segmentation networks of varying size. Results reveal that smaller models form abstract volumetric approximations aligned with human intuitive physical predictions, whereas larger models, favoring overly fine-grained encoding, show diminished structural coherence in such representations. The work provides empirical evidence of an inverse relationship between model capacity and the coarseness of physically grounded representations, and positions machine vision models as scalable, experimentally tractable computational platforms for reverse-engineering the representational mechanisms underlying human physical reasoning. These findings open new avenues for embodied cognition modeling and neurosymbolic AI.

📝 Abstract
Human physical reasoning relies on internal "body" representations: coarse, volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grained encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.
Problem

Research questions and friction points this paper is trying to address.

Tests whether vision models develop human-like body representations
Compares coarse versus fine-grained encodings in segmentation networks
Explores how limited computational resources shape physical-reasoning representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Smaller models naturally form coarse, human-like body representations
Larger models develop overly detailed, fine-grained encodings
Coarse representations emerge under limited computational resources