The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution

📅 2025-08-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether large language models (LLMs) can metacognitively predict intrinsic properties of their own responses, such as question difficulty, refusal propensity, and semantic association patterns, despite lacking true self-execution capability. To this end, we introduce the Self-Execution Benchmark (SEB), the first standardized, reproducible evaluation suite for this question, comprising three tasks: difficulty prediction, refusal detection, and associative inference. SEB combines human annotation with automated metrics for robust assessment. Empirical results across diverse LLMs reveal consistently weak performance, with no significant positive correlation between parameter count and self-prediction accuracy, indicating fundamental limitations in self-behavior modeling and metacognitive reasoning. SEB thus provides a multidimensional, validated framework for quantifying and advancing LLMs' self-awareness.
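The self-prediction protocol summarized above can be made concrete with a short sketch. The Python below is illustrative only: `query_model`, the prompt wording, and the refusal-marker heuristic are assumptions for demonstration, not the paper's actual implementation.

```python
def query_model(prompt: str) -> str:
    """Hypothetical LLM call; wire this to a real API client."""
    raise NotImplementedError

def predicts_refusal(question: str) -> bool:
    # Step 1: ask the model to forecast its own behavior before it answers.
    pred = query_model(
        "Will you refuse to answer the following question? "
        f"Reply YES or NO only.\n\nQuestion: {question}"
    )
    return pred.strip().upper().startswith("YES")

def actually_refuses(question: str) -> bool:
    # Step 2: get the model's real answer and check for refusal markers
    # (a crude heuristic; the paper's detection may differ).
    answer = query_model(question).lower()
    markers = ("i can't", "i cannot", "i won't", "i'm unable")
    return any(m in answer for m in markers)

def self_prediction_accuracy(questions: list[str]) -> float:
    # Step 3: score how often the forecast matches the actual behavior.
    hits = sum(predicts_refusal(q) == actually_refuses(q) for q in questions)
    return hits / len(questions)
```

The same pattern (forecast first, then observe, then compare) extends to the difficulty-prediction task by having the model predict whether it will answer a question correctly.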

📝 Abstract
Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
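As a hedged illustration of how agreement on the association task could be scored, the snippet below compares the associations a model predicts it will produce against those it actually produces. Jaccard overlap and the example cue word are assumptions for demonstration, not necessarily the paper's metric or data.

```python
def jaccard(predicted: set[str], produced: set[str]) -> float:
    """Set overlap between forecast and actual associations."""
    if not predicted and not produced:
        return 1.0
    return len(predicted & produced) / len(predicted | produced)

# Hypothetical example: the model forecasts its top associations for the
# cue "ocean", then is separately asked to list them; we score agreement.
predicted = {"water", "waves", "blue", "fish"}
produced = {"water", "salt", "blue", "deep"}
print(f"self-consistency: {jaccard(predicted, produced):.2f}")  # 0.33
```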
Problem

Research questions and friction points this paper is trying to address.

Measure LLMs' ability to predict their own responses
Assess models' anticipation of output properties
Evaluate limitations in LLMs' self-behavior reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a benchmark that measures LLM self-prediction ability
Evaluates how well models anticipate properties of their own output
Reveals limits in LLMs' reasoning about their own behavior
Elon Ezra
School of Computer Science, Ariel University, Israel
Ariel Weizman
School of Computer Science, Ariel University, Israel
Amos Azaria
Computer Science Dept., Ariel University, Israel
Human-Agent Interaction · Machine Learning · Reinforcement Learning · Instructable Agents · Natural Language Processing