🤖 AI Summary
When AI capabilities surpass human cognitive thresholds, alignment methods relying on direct human evaluation—such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)—become infeasible.
Method: We propose a recursive self-critique framework that, for the first time, extends the "verification is easier than generation" principle to critique tasks. By decomposing evaluation into multi-layer critiques (e.g., "critique of critique"), it reduces each supervision judgment to a simpler one, without requiring direct human assessment of the original output.
Contribution: We formalize the core hypothesis that critique is recursively simplifiable, and we design controlled comparative experiments across Human-Human, Human-AI, and AI-AI settings, integrating multi-step automated critique chains with task-adaptive modeling. Empirical evaluation across diverse complex tasks shows significant improvements in supervision feasibility and consistency. This work establishes the first empirically validated, scalable supervision paradigm for aligning superhuman AI systems.
📝 Abstract
As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become untenable when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) critique of critique can be easier than critique itself, extending the widely accepted observation that verification is easier than generation to the critique domain, since critique itself is a specialized form of generation; (2) this difficulty relationship holds recursively, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. To examine these hypotheses, we perform Human-Human, Human-AI, and AI-AI experiments across multiple tasks. Our results provide encouraging evidence supporting these hypotheses and suggest that recursive self-critiquing is a promising direction for scalable oversight.
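The recursive scheme can be sketched in a few lines. This is a minimal, hypothetical illustration only: `query_model` is a stand-in for an actual critic-model call, and the chain-building loop is an assumption about the general shape of higher-order critiquing, not the paper's implementation.

```python
def query_model(prompt: str) -> str:
    """Stub standing in for a call to a critic model (hypothetical)."""
    return f"[critique of: {prompt}]"

def recursive_critique(answer: str, depth: int) -> list[str]:
    """Build a chain: answer, critique, critique-of-critique, ...

    Each level critiques the output of the previous level, so the
    judgment a human (or weaker model) must ultimately make applies
    to a progressively simpler artifact.
    """
    chain = [answer]
    for level in range(depth):
        prompt = f"Level-{level + 1} critique of: {chain[-1]}"
        chain.append(query_model(prompt))
    return chain

# A depth-3 chain ends in a third-order critique (critique of
# critique of critique), which the overseer evaluates directly.
chain = recursive_critique("Proof sketch: ...", depth=3)
```

Under the paper's hypotheses, an overseer who cannot reliably judge `chain[0]` may still reliably judge `chain[-1]`, since each level of critique is posited to be easier to verify than the one below it.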