🤖 AI Summary
This study addresses the under-evaluation of metacognition, such as self-monitoring and belief revision, in current AI benchmarks, particularly under genuine disagreement between models. The authors introduce MEDLEY-BENCH, a benchmark of 130 ambiguous instances across five domains that separates independent reasoning, private self-revision, and socially influenced revision. They evaluate 35 models from 12 families and report two complementary scores: MMS, a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and MAS, derived from four metacognitive sub-abilities, alongside within-model relative (ipsative) profiling. Evaluation ability increases with model size within families, whereas control does not, and evaluation is the weakest relative ability in every model, which the authors read as a systematic knowing/doing gap. Smaller and cheaper models often match or outperform larger ones, and a follow-up adversarial analysis of 11 models identifies two belief-revision profiles: revising in response to argument quality versus tracking consensus statistics. Together, these results indicate that metacognitive competence is not simply a function of scale.
📝 Abstract
Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.
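The within-model relative profiling (ipsative scoring) mentioned above can be illustrated with a minimal sketch. It assumes the textbook form of ipsative scoring, in which each ability score is centred on that model's own mean across abilities; the benchmark's actual normalisation may differ. Only `evaluation` and `control` are named in the abstract, so `ability_3` and `ability_4` are hypothetical placeholders, and all numbers are invented for illustration rather than taken from the paper.

```python
from statistics import mean


def ipsative_profile(scores: dict[str, float]) -> dict[str, float]:
    """Centre one model's ability scores on that model's own mean.

    Positive values mark relative strengths and negative values mark
    relative weaknesses within the same model; the result says nothing
    about how the model compares with other models.
    """
    own_mean = mean(scores.values())
    return {ability: value - own_mean for ability, value in scores.items()}


def weakest_relative_ability(scores: dict[str, float]) -> str:
    """Return the ability with the lowest ipsative (within-model) score."""
    profile = ipsative_profile(scores)
    return min(profile, key=profile.get)


# Invented scores for one hypothetical model (not results from the paper).
# "ability_3" and "ability_4" stand in for sub-abilities the abstract
# does not name.
example_scores = {
    "evaluation": 0.40,
    "control": 0.55,
    "ability_3": 0.60,
    "ability_4": 0.65,
}

profile = ipsative_profile(example_scores)
print(profile)
print(weakest_relative_ability(example_scores))
```

Because every profile is centred on its own mean, the values sum to zero for each model. A model with uniformly high raw scores can therefore still show evaluation as its weakest relative ability, which is the sense in which a result can hold "in all 35 models" regardless of overall performance.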