🤖 AI Summary
This study investigates whether embodied vision-language model (VLM) agents show a functional analogue of animal mirror self-recognition: the capacity to infer their own physical attributes from a mirror reflection and to distinguish self from others. To this end, we introduce a controlled 3D simulation benchmark in which a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, under perturbations (mirror removal, misleading cues, and occluded reflections) designed to rule out shortcut strategies. We also analyze the decision process along four behavioral dimensions (mirror seeking, temporal ordering, self-attribution, and consistency between reasoning and action), using the mirror test as a diagnostic tool. Our findings show that mirror-based self-identification emerges mainly in stronger VLMs, which use reflected evidence to guide action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Tests with conflicting language and vision cues further indicate that self-referential language alone is not evidence of grounded self-identification.
📝 Abstract
In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.