🤖 AI Summary
This work investigates whether large language models (LLMs) can autonomously design neural architectures that are simultaneously syntactically valid, high-performing, and structurally novel. A code-oriented LLM is embedded in a closed-loop synthesis framework, initialized with the LEMUR dataset and refined through 22 rounds of supervised fine-tuning using low-fidelity evaluation (single-epoch training accuracy), MinHash-Jaccard structural deduplication, and parameter-efficient LoRA adaptation. The study demonstrates for the first time that LLMs can internalize execution feedback as non-textual experiential rewards, moving beyond mere memorization of training data. Experimental results show a stable valid generation rate of 50.6% (peaking at 74.5%), with average first-epoch accuracy improving from 28.06% to 50.99%. Notably, the proportion of candidate architectures exceeding 40% accuracy surged from 2.04% to 96.81%, yielding 455 high-performing, structurally novel architectures absent from the original corpus.
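The closed-loop cycle described above can be sketched in miniature. Everything below is a hypothetical stand-in, not the paper's implementation: `generate_candidate` replaces the code LLM, `evaluate_one_epoch` replaces single-epoch PyTorch training, `is_novel` replaces the MinHash-Jaccard filter, and the 0.40 accuracy threshold is taken from the reported "exceeding 40% accuracy" statistic.

```python
import random

def generate_candidate(rng):
    """Stand-in for the code LLM: emit a toy 'architecture' string."""
    depth = rng.randint(1, 6)
    return "->".join(f"conv{rng.choice([3, 5])}" for _ in range(depth))

def validate(code):
    """Stand-in for the syntactic/instantiation check on generated code."""
    return bool(code)

def evaluate_one_epoch(code, rng):
    """Stand-in for low-fidelity evaluation: a noisy single-epoch accuracy."""
    return min(0.9, 0.1 * code.count("conv") + rng.uniform(0.0, 0.2))

def is_novel(code, corpus):
    """Stand-in for the MinHash-Jaccard structural-novelty filter."""
    return code not in corpus

def run_cycle(corpus, dataset, rng, n_candidates=50, threshold=0.40):
    """One fine-tuning cycle: generate, filter, collect prompt-code pairs.

    In the real system, the collected pairs would drive a LoRA update of
    the generator before the next cycle.
    """
    for _ in range(n_candidates):
        code = generate_candidate(rng)
        if not validate(code):
            continue
        if evaluate_one_epoch(code, rng) >= threshold and is_novel(code, corpus):
            corpus.add(code)
            dataset.append(("design a CNN", code))  # prompt-code pair
    return corpus, dataset
```

Running a few cycles with a seeded `random.Random` shows the corpus growing only with candidates that pass all three filters, mirroring how the paper accumulates training pairs across its 22 rounds.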
📝 Abstract
Large language models (LLMs) excel at program synthesis, yet their ability to autonomously navigate neural architecture design (balancing syntactic reliability, performance, and structural novelty) remains underexplored. We address this by placing a code-oriented LLM within a closed-loop synthesis framework and analyzing its evolution over 22 supervised fine-tuning cycles. The model synthesizes PyTorch convolutional networks, which are validated, evaluated via low-fidelity performance signals (single-epoch accuracy), and filtered with a MinHash-Jaccard criterion to prevent structural redundancy. The loop is initialized from the LEMUR dataset; high-performing, novel architectures are converted into prompt-code pairs for iterative, parameter-efficient LoRA fine-tuning. Across cycles, the LLM internalizes empirical architectural priors and becomes a robust generator. The valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.06% to 50.99%, and the fraction of candidates exceeding 40% accuracy grows from 2.04% to 96.81%. Analyses confirm the model moves beyond replicating existing motifs, synthesizing 455 high-performing architectures absent from the original corpus. By grounding code synthesis in execution feedback, this work provides a scalable blueprint for transforming stochastic generators into autonomous, performance-driven neural designers, and establishes that LLMs can internalize empirical, non-textual rewards to transcend their training data.
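The MinHash-Jaccard redundancy criterion can be sketched as follows. This is a minimal illustration, not the paper's code: the token 3-gram shingling, 64 hash families, and 0.8 duplicate threshold are all assumptions chosen for the sketch, and a production system would use a locality-sensitive-hashing index rather than a linear scan over signatures.

```python
import hashlib
import re

def shingles(code: str, k: int = 3) -> set:
    """Token k-grams of the source code, a crude proxy for structure."""
    toks = re.findall(r"\w+", code)
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def minhash_signature(sh: set, num_perm: int = 64) -> list:
    """One seeded hash family per 'permutation'; keep the minimum per family."""
    if not sh:
        return [0] * num_perm  # degenerate input: too few tokens to shingle
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}|{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching minima estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_structural_duplicate(candidate: str, corpus_sigs: list,
                            threshold: float = 0.8) -> bool:
    """Reject a candidate whose estimated similarity to any accepted
    architecture meets the (assumed) threshold."""
    sig = minhash_signature(shingles(candidate))
    return any(estimated_jaccard(sig, s) >= threshold for s in corpus_sigs)
```

An identical pair of code strings yields an estimated similarity of 1.0, while structurally unrelated code (disjoint shingle sets) estimates near 0, so only genuinely novel candidates survive the filter.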