MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current benchmarks lack a systematic evaluation of the self-awareness capabilities of multimodal large language models (MLLMs) in embodied settings. Inspired by the mirror self-recognition (MSR) test from psychology, this work proposes MirrorBench, a tiered simulation benchmark that introduces the MSR paradigm into MLLM assessment for the first time. MirrorBench constructs progressively challenging self-referential tasks, spanning from basic visual perception to high-level self-representation. By integrating multimodal perception, linguistic reasoning, and embodied interaction, the framework enables a systematic evaluation of self-centric intelligence in MLLMs. Experiments show that even state-of-the-art MLLMs perform substantially worse than humans at the lowest task level, exposing fundamental limitations in their self-referential understanding. The benchmark thereby addresses a key gap in evaluating self-awareness within embodied intelligence.

📝 Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self-centric intelligence. To address this, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror-bench-page/.

Problem

Research questions and friction points this paper is trying to address.

self-centric intelligence

Multimodal Large Language Models

Mirror Self-Recognition

embodied intelligence

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

MirrorBench

self-centric intelligence

Mirror Self-Recognition

embodied MLLMs

self-representation

🔎 Similar Papers

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

2024-06-12

Citations: 0