Reflection-Bench: probing AI intelligence with reflection

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
This study addresses the absence of cognitive agency—specifically, the capacity for reflective belief construction, dynamic updating, and self-monitoring—in large language models (LLMs). To this end, we introduce Reflection-Bench, the first comprehensive benchmark explicitly designed to evaluate reflection across seven cognitive dimensions: perception, memory, belief updating, decision-making, prediction, counterfactual reasoning, and meta-reflection. We systematically assess 13 state-of-the-art LLMs. Grounded in cognitive science, our work provides the first formal definition and quantification of AI reflection capability, featuring multi-level interactive tasks that uniquely address critical gaps in belief updating and metacognitive evaluation. Experimental results reveal that even top-tier models—including GPT-4, Claude 3.5, and o1—exhibit error rates exceeding 60% on belief updating and counterfactual reasoning tasks, underscoring their fundamental lack of closed-loop cognitive regulation mechanisms.

📝 Abstract
Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, it serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating epistemic agency in large language models
Assessing belief construction and adaptation in dynamic environments
Benchmarking cognitive functions like meta-reflection and belief updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive-psychology-inspired benchmark for LLMs
Seven-task evaluation of epistemic agency
Three-tier performance hierarchy analysis
Lingyu Li
Shanghai Jiao Tong University
Active inference, Artificial Intelligence, philosophy
Yixu Wang
Shanghai Artificial Intelligence Laboratory
Haiquan Zhao
Alibaba Group
LLM Safety
Shuqi Kong
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Shanghai Mental Health Center
Yan Teng
Shanghai Artificial Intelligence Laboratory
Chunbo Li
Shanghai Mental Health Center
Yingchun Wang
Shanghai Artificial Intelligence Laboratory