🤖 AI Summary
This work investigates whether large language models (LLMs) can spontaneously acquire human-like System 2 reasoning—characterized by deliberate, sequential, and reflective inference—through evolutionary mechanisms, rather than merely improving task-specific performance.
Method: We propose Evolutionary Reasoning Optimization (ERO), a framework that maintains an LLM population, employs a quantitative reasoning score as the fitness metric, and iteratively optimizes model parameters via evolutionary operators (selection, mutation, crossover).
Contribution/Results: Experiments reveal that mainstream LLMs exhibit weak System 2 reasoning capabilities; remarkably, weaker models (e.g., Qwen-7B) rapidly develop strong reasoning ability after only a few ERO cycles. On multiple standard reasoning benchmarks (e.g., GSM8K, MMLU, ARC), ERO enables low-parameter models to achieve high reasoning performance, a leapfrog improvement over their baselines. Crucially, this is the first empirical validation of an unsupervised, task-agnostic pathway for evolving reasoning capability, establishing a novel paradigm for studying the emergence of general intelligence.
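The ERO loop described above (population initialization, fitness scoring, then selection, crossover, and mutation) can be sketched as a standard evolutionary algorithm. This is a minimal toy illustration, not the paper's implementation: individuals are plain parameter vectors rather than LLMs, and `reasoning_score` is a hypothetical stand-in fitness; all names here are illustrative assumptions.

```python
import random

random.seed(0)
DIM = 8
TARGET = [0.5] * DIM  # toy stand-in for "parameters that reason well"

def reasoning_score(ind):
    # Placeholder fitness: higher when closer to TARGET.
    # In ERO this would be a quantified reasoning score on a benchmark.
    return -sum((a - b) ** 2 for a, b in zip(ind, TARGET))

def mutate(ind, sigma=0.1):
    # Mutation: perturb each parameter with Gaussian noise.
    return [x + random.gauss(0, sigma) for x in ind]

def crossover(p1, p2):
    # Uniform crossover: each gene comes from either parent.
    return [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]

def ero_loop(pop_size=20, generations=50):
    # Initialize a population of candidate parameter vectors.
    pop = [[random.uniform(-1, 1) for _ in range(DIM)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=reasoning_score, reverse=True)
        parents = pop[: pop_size // 2]  # selection: keep the fittest half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children        # elitism: parents survive unchanged
    return max(pop, key=reasoning_score)

best = ero_loop()
```

After a few dozen generations the best individual's fitness approaches the optimum, mirroring how ERO iteratively maximizes the reasoning score of the best model in the population.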
📝 Abstract
Machine intelligence embodies the long-standing dream of making machines as intelligent as human beings. While recent progress in Large Language Models (LLMs) has yielded substantial task-specific skills across a wide array of downstream tasks, LLMs still fall short of general intelligence. Motivated by the correlation between intelligence and System 2 reasoning (slow thinking), this paper addresses a worthwhile research question: can machine intelligence such as LLMs evolve to acquire reasoning ability (rather than a specific skill), just as humans do? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for individuals with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population; an evolutionary strategy then evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical findings: i) the latest LLMs, such as GPT-5, still show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project is available at https://github.com/MetaEvo/ERO for reproduction.