Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how closely deep neural network (DNN) representations of objects correspond to human object representations, at both fine-grained (object-level) and coarse-grained (category-level) scales. Prior work left unclear whether model-human similarity holds mainly at the coarse category level or extends to finer details. To address this, the authors apply an unsupervised alignment method based on Gromov–Wasserstein optimal transport, which estimates a mapping between individual objects in the human representation (behavioral similarity judgments of 1,854 objects from the THINGS dataset) and in vision models (including CLIP and self-supervised models). CLIP-trained models consistently achieve strong matching with human representations at both scales. Self-supervised models show limited matching at both scales, yet still form object clusters that reflect human coarse category structure. These findings suggest that linguistic information plays a role in acquiring precise object representations, and that self-supervised learning can capture coarse categorical structure.

📝 Abstract

The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional representational similarity analysis is that it estimates optimal fine-grained mappings between the representation of each object in human and model representations. We used this unsupervised alignment method to assess the extent to which the representation of each object in humans is correctly mapped to the corresponding representation of the same object in models. Using human similarity judgments of 1,854 objects from the THINGS dataset, we find that models trained with CLIP consistently achieve strong fine- and coarse-grained matching with human object representations. In contrast, self-supervised models showed limited matching at both fine- and coarse-grained levels, but still formed object clusters that reflected human coarse category structure. Our results offer new insights into the role of linguistic information in acquiring precise object representations and the potential of self-supervised learning to capture coarse categorical structures.
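For readers unfamiliar with the method named in the abstract, the textbook form of the Gromov-Wasserstein problem is sketched below. The abstract does not state which variant the authors use (for example, whether an entropic regularization term is added), so this is the standard unregularized objective, not necessarily the exact one in the paper. The symbols are introduced here for illustration: D is the human dissimilarity matrix over n objects, D' is the model dissimilarity matrix over m objects, and p and q are the marginal weights on each side (often uniform).

```latex
\Gamma^{*} = \operatorname*{arg\,min}_{\Gamma \in \Pi(p, q)}
  \sum_{i,j,k,l} \left( D_{ij} - D'_{kl} \right)^{2} \, \Gamma_{ik} \, \Gamma_{jl},
\qquad
\Pi(p, q) = \left\{ \Gamma \in \mathbb{R}_{\ge 0}^{n \times m} :
  \Gamma \mathbf{1}_{m} = p, \;
  \Gamma^{\top} \mathbf{1}_{n} = q \right\}
```

The optimization uses only the within-domain dissimilarities D and D', never the object labels, which is what makes the alignment unsupervised. The entry Γ*_ik can then be read as how strongly human object i is mapped to model object k.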

Problem

Research questions and friction points this paper is trying to address.

Compare human and DNN object representations at fine and coarse levels

Assess unsupervised alignment for mapping human-model object representations

Evaluate CLIP and self-supervised models against human similarity judgments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised alignment using Gromov-Wasserstein Optimal Transport

Comparing human and model representations at multiple levels

CLIP models achieve strong fine- and coarse-grained matching
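The fine- and coarse-grained matching described above can be illustrated with a minimal sketch. It assumes a transport plan has already been estimated by some Gromov-Wasserstein solver, and that human object i and model object i refer to the same object. The function names, the toy plan, and the category labels are hypothetical and chosen for illustration; the paper's actual evaluation metric may differ in detail (for example, it may use top-k rather than top-1 matching).

```python
from typing import List, Sequence

Matrix = List[List[float]]


def _best_match(row: Sequence[float]) -> int:
    """Index of the model object receiving the most transport mass in this row."""
    return max(range(len(row)), key=row.__getitem__)


def top1_matching_rate(plan: Matrix) -> float:
    """Fine-grained score: fraction of human objects whose strongest
    match is the same object (same index) on the model side."""
    hits = sum(1 for i, row in enumerate(plan) if _best_match(row) == i)
    return hits / len(plan)


def category_matching_rate(plan: Matrix, categories: Sequence[str]) -> float:
    """Coarse-grained score: fraction of human objects whose strongest
    match belongs to the same category, even if it is a different object."""
    hits = sum(
        1
        for i, row in enumerate(plan)
        if categories[_best_match(row)] == categories[i]
    )
    return hits / len(plan)


# Toy plan over 4 objects with uniform marginals (each row and column sums to 0.25).
# Objects 0 and 1 are matched correctly; objects 2 and 3 are swapped.
example_plan = [
    [0.25, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.25],
    [0.00, 0.00, 0.25, 0.00],
]
# Hypothetical category labels, one per object.
example_categories = ["animal", "animal", "tool", "tool"]

fine = top1_matching_rate(example_plan)                            # 0.5
coarse = category_matching_rate(example_plan, example_categories)  # 1.0
```

In this toy case the swapped objects share a category, so the coarse-grained rate is perfect while the fine-grained rate is only 0.5. That gap is the kind of difference between object-level and category-level correspondence the two scales are meant to separate.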

🔎 Similar Papers

Dimensions underlying the representational alignment of deep neural networks with humans