🤖 AI Summary
This work addresses the limitations of current large language models in complex reasoning: they typically rely on expert annotations and external validation, while self-evolution approaches are prone to collective hallucinations and biased priors, which hinder precise identification of effective learning zones. To overcome these challenges, we propose AERO, a framework grounded in the Zone of Proximal Development theory that enables autonomous reasoning evolution through an endogenous dual-loop mechanism integrating self-questioning, self-answering, and self-critique. AERO employs entropy-based localization to identify the “solvability gap,” couples it with independent counterfactual correction for robust verification, and introduces an interleaved training strategy that co-evolves the capabilities of each reasoning role while preventing curriculum collapse. Evaluated across nine benchmarks in three domains, AERO achieves significant improvements, boosting the Qwen3-4B and Qwen3-8B base models by 4.57% and 5.10% on average, respectively.
📝 Abstract
Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose **A**utonomous **E**volutionary **R**easoning **O**ptimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the *Zone of Proximal Development (ZPD)* theory, AERO utilizes entropy-based positioning to target the “solvability gap” and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at https://github.com/mira-ai-lab/AERO.
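To make the entropy-based positioning idea concrete, here is a minimal sketch of how a "solvability gap" filter could work: sample several self-answers per self-generated question, measure the Shannon entropy of the final answers, and keep only questions that are neither trivially solved (near-zero entropy) nor hopeless (near-maximal entropy). The function names and the entropy thresholds below are illustrative assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the model's sampled final answers.

    A question the model always answers the same way has entropy 0;
    maximally inconsistent answers give high entropy.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )


def in_solvability_gap(answers, low=0.5, high=1.5):
    """Flag a self-generated question as a useful training target.

    The thresholds `low` and `high` are hypothetical: the intent is to
    keep questions in the intermediate-entropy band, where the model
    sometimes succeeds and sometimes fails (the ZPD-style sweet spot).
    """
    h = answer_entropy(answers)
    return low <= h <= high
```

In a full self-evolution loop, a proposer would generate candidate questions, a solver would sample several answers per question, and only questions passing a filter like `in_solvability_gap` would be kept for training.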