🤖 AI Summary
This study investigates whether multimodal large language models (MLLMs) can spontaneously develop bodily self-awareness solely through embodied sensorimotor interaction. We embed an MLLM in an autonomous mobile robot that explores its environment and learns closed-loop behaviors exclusively from real-time multimodal sensory inputs—vision, touch, proprioception, and vestibular signals—without any explicit supervision or predefined self-models. We systematically evaluate the model’s capabilities in environmental recognition, self-discrimination, and motor prediction. Our key contributions are threefold: (1) First empirical evidence that MLLMs develop hierarchically emergent bodily self-awareness in a fully unsupervised, embodied setting; (2) Causal insights—derived via structural equation modeling and sensory ablation experiments—into how multisensory integration, temporal memory, and hierarchical internal representations jointly enable self-awareness; and (3) Demonstration that structured and episodic memory are essential for coherent self-referential reasoning, along with identification of critical sensory modalities and their functional redundancy relationships.
📝 Abstract
Self-awareness, the ability to distinguish oneself from the surrounding environment, underpins intelligent, autonomous behavior. Recent AI systems, particularly large language models, achieve human-like performance on tasks that integrate multimodal information, raising interest in embodying AI agents on nonhuman platforms such as robots. Here, we explore whether multimodal LLMs can develop self-awareness solely through sensorimotor experience. By integrating a multimodal LLM into an autonomous mobile robot, we test whether it can acquire this capacity. We find that the system exhibits robust environmental awareness, self-recognition, and predictive awareness, allowing it to infer its robotic nature and motion characteristics. Structural equation modeling reveals how sensory integration influences distinct dimensions of self-awareness and their coordination with past-present memory, as well as the hierarchical internal associations that drive self-identification. Ablation tests of sensory inputs identify the critical modalities for each dimension, demonstrate compensatory interactions among sensors, and confirm the essential role of structured and episodic memory in coherent reasoning. These findings demonstrate that, given appropriate sensory information about the world and itself, multimodal LLMs exhibit emergent self-awareness, opening the door to artificial embodied cognitive systems.