Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies the "metaproductivity–performance mismatch" in self-improving coding agents: high scores on software engineering benchmarks do not reliably predict an agent's capacity for effective self-modification. To address this, the authors propose CMP, a metric that aggregates the benchmark performances of an agent's descendants as an indicator of its self-improvement potential, and show that access to the true CMP suffices to simulate Gödel Machine–style self-referential optimization under certain assumptions. Their method, the Huxley–Gödel Machine (HGM), grows a tree of self-modifications and searches it with estimated CMP as guidance, using GPT-5-mini as the underlying large language model and SWE-bench and Polyglot as benchmarks. Experiments demonstrate that HGM surpasses prior self-improving coding agent methods on SWE-bench Verified and Polyglot while using less wall-clock time, and achieves human-level performance on SWE-bench Lite, validating the efficacy of a general-purpose self-improvement mechanism.

📝 Abstract
Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
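The abstract's core mechanics — a $\mathrm{CMP}$ metric that aggregates the benchmark scores of a node's descendants, and a search that expands the node with the highest estimated $\mathrm{CMP}$ — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the aggregation (a mean), the leaf fallback, and all names (`AgentNode`, `cmp_metric`, `hgm_search`) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    """One agent variant in the tree of self-modifications (illustrative)."""
    score: float                              # benchmark score of this variant
    children: list["AgentNode"] = field(default_factory=list)

def descendants(node: AgentNode):
    """Yield all strict descendants of `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)

def cmp_metric(node: AgentNode) -> float:
    """CMP: aggregate benchmark performance of a node's descendants.
    The mean is an assumed aggregation; leaves fall back to their own score."""
    scores = [d.score for d in descendants(node)]
    return sum(scores) / len(scores) if scores else node.score

def hgm_search(root: AgentNode, self_modify, evaluate, budget: int) -> AgentNode:
    """Hypothetical outer loop: expand the node with the highest estimated CMP,
    benchmark the new child, and repeat until the budget is spent."""
    nodes = [root]
    for _ in range(budget):
        parent = max(nodes, key=cmp_metric)   # CMP-guided node selection
        child = self_modify(parent)           # agent edits its own codebase
        child.score = evaluate(child)         # e.g. a SWE-bench-style score
        parent.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda n: n.score)  # best agent found
```

The key contrast with prior expansion strategies is the selection key: nodes are chosen by their clade's aggregate descendant performance rather than by their own benchmark score.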
Problem

Research questions and friction points this paper is trying to address.

Addresses mismatch between coding performance and self-improvement potential
Proposes metric to estimate agent's potential for self-modifications
Develops self-improving coding agent achieving human-level performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates agent potential using descendant performance metric
Guides self-modification search with CMP approximation
Achieves human-level coding with efficient tree exploration