Reflection-Bench: probing AI intelligence with reflection

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
This study addresses the absence of cognitive agency—specifically, the capacity for reflective belief construction, dynamic updating, and self-monitoring—in large language models (LLMs). To this end, we introduce Reflection-Bench, the first comprehensive benchmark explicitly designed to evaluate reflection across seven cognitive dimensions: perception, memory, belief updating, decision-making, prediction, counterfactual reasoning, and meta-reflection. We systematically assess 13 state-of-the-art LLMs. Grounded in cognitive science, our work provides the first formal definition and quantification of AI reflection capability, featuring multi-level interactive tasks that uniquely address critical gaps in belief updating and metacognitive evaluation. Experimental results reveal that even top-tier models—including GPT-4, Claude 3.5, and o1—exhibit error rates exceeding 60% on belief updating and counterfactual reasoning tasks, underscoring their fundamental lack of closed-loop cognitive regulation mechanisms.

📝 Abstract
Reflection, the ability to adapt beliefs or behaviors in response to unexpected outcomes, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, it serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection: perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performance of 13 prominent LLMs, including OpenAI o1, GPT-4, and Claude 3.5 Sonnet. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating epistemic agency in large language models
Assessing belief construction and adaptation in dynamic environments
Benchmarking cognitive functions like meta-reflection and belief updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive-psychology-inspired benchmark for LLMs
Seven-task evaluation of epistemic agency
Three-tier performance hierarchy analysis
Lingyu Li
Shanghai Jiao Tong University
Active inference, Artificial Intelligence, philosophy
Yixu Wang
Shanghai Artificial Intelligence Laboratory
Haiquan Zhao
Alibaba Group
LLM Safety
Shuqi Kong
Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, Shanghai Mental Health Center
Yan Teng
Shanghai Artificial Intelligence Laboratory
Chunbo Li
Shanghai Mental Health Center
Yingchun Wang
Shanghai Artificial Intelligence Laboratory