AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of current large language models in complex reasoning: they typically depend on expert annotations and external validation, while existing self-evolution approaches are prone to collective hallucinations and biased priors, which prevents precise identification of effective learning zones. To overcome these challenges, we propose AERO, a framework grounded in Zone of Proximal Development theory that enables autonomous reasoning evolution through an endogenous dual-loop mechanism integrating self-questioning, self-answering, and self-critique. AERO employs entropy-based localization to identify the "solvability gap," couples it with independent counterfactual correction for robust verification, and introduces an interleaved training strategy that co-evolves the capabilities of each reasoning role while preventing curriculum collapse. Evaluated across nine benchmarks in three domains, AERO achieves significant improvements, boosting the Qwen3-4B and Qwen3-8B base models by 4.57% and 5.10% on average, respectively.

📝 Abstract
Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose Autonomous Evolutionary Reasoning Optimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the Zone of Proximal Development (ZPD) theory, AERO utilizes entropy-based positioning to target the "solvability gap" and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at https://github.com/mira-ai-lab/AERO.
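The abstract does not spell out the entropy criterion, but the ZPD framing suggests one plausible reading: a self-generated question sits in the "solvability gap" when the model's sampled answers are neither unanimous (already mastered) nor near-uniform (out of reach). A minimal sketch under that assumption, with illustrative function names and thresholds not taken from the paper:

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def in_solvability_gap(answers, low=0.5, high=1.5):
    """Keep a self-generated question only when answer entropy is moderate:
    near 0 the model already solves it reliably; near log2(len(answers))
    it is effectively guessing. Band edges here are illustrative."""
    return low <= answer_entropy(answers) <= high

# Toy usage: 8 sampled self-answers per question.
too_easy = ["42"] * 8                        # unanimous -> entropy 0
in_gap   = ["42"] * 5 + ["41"] * 3           # split vote -> moderate entropy
too_hard = ["1", "2", "3", "4", "5", "6", "7", "8"]  # uniform -> entropy 3
```

Under this reading, only the `in_gap` question would be retained for training, matching the abstract's goal of targeting material just beyond current capability.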
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
self-evolution
reasoning optimization
hallucination
learning zone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous Evolutionary Reasoning
Dual-Loop Feedback
Zone of Proximal Development
Counterfactual Correction
Staggered Training
Zhitao Gao
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China
Jie Ma
School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China
Xuhong Li
Baidu Inc
Explainable AI, Transfer Learning
Pengyu Li
Duke University
Machine Learning, Matrix Analysis, Representation Learning
Ning Qu
NIO USA, Waymo, Baidu USA, Google, Carnegie Mellon University, Peking University
Operating System, Security
Yaqiang Wu
Lenovo
Hui Liu
Amazon
Natural Language Processing, Large Language Models, Artificial Intelligence
Jun Liu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China; MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, 710049, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, Xi’an Jiaotong University, Xi’an, 710049, China