🤖 AI Summary
This work addresses the inefficiency of traditional neural architecture search (NAS), which relies heavily on manual design or brute-force trial and error. We propose, for the first time, using large language models (LLMs) as end-to-end neural architecture designers that generate executable image-captioning models under strict Net API constraints. Methodologically, we build a prompt-driven NAS pipeline based on DeepSeek-R1-0528-Qwen3-8B that jointly synthesizes CNN encoders and LSTM/GRU/Transformer decoders, including their hyperparameters and training strategies, while integrating automated BLEU-4 evaluation and code correction. Our contributions are: (1) the first LLM-driven, API-compliant NAS paradigm; (2) an open-sourced extension of the LEMUR dataset; and (3) dozens of generated models, over 50% of which train successfully, reaching a peak BLEU-4 score of 32.7 and demonstrating the critical role of API constraints in ensuring generation quality.
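To make the Net API idea concrete, below is a minimal sketch of what an LLM-generated, API-compliant captioner could look like in PyTorch. The class name `Net`, its constructor signature, and the ResNet-18 backbone are assumptions for illustration; the paper's actual API contract and generated architectures may differ.

```python
# Hypothetical sketch of an LLM-generated, API-compliant captioning model.
# `Net` and its signature are illustrative assumptions, not the paper's
# actual Net API contract.
import torch
import torch.nn as nn
import torchvision.models as models

class Net(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # CNN encoder: a classification backbone with its head removed,
        # standing in for a LEMUR-style backbone.
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)
        # Sequence decoder: one of the LSTM/GRU/Transformer options.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images).flatten(1)     # (B, C)
        feats = self.project(feats).unsqueeze(1)    # (B, 1, E)
        tokens = self.embed(captions)               # (B, T, E)
        # Image feature is fed as the first "token" of the sequence.
        inputs = torch.cat([feats, tokens], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.head(hidden)                    # (B, T+1, vocab)
```

With a fixed interface of this shape, the pipeline only needs to instantiate `Net` and run a forward pass to check that a generated model is executable before committing to full training.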
📝 Abstract
Neural architecture search (NAS) traditionally requires significant human expertise or automated trial and error to design deep learning models. We present NN-Caption, an LLM-guided NAS pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR's classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we describe the prompt template and give examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, over half of which trained successfully and produced meaningful captions. We analyse the effect of the number of input model snippets in the prompt (5 vs. 10), finding a slight drop in success rate when more candidate components are provided. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations and API compliance issues) and detail how prompt rules and iterative code fixes addressed them. Altogether, the pipeline integrates prompt-based code generation with automatic evaluation and adds dozens of novel captioning models to the open LEMUR dataset, facilitating reproducible benchmarking and downstream AutoML research.
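As a rough illustration of the generate, train, evaluate, and repair loop the abstract describes, the sketch below wires an LLM code generator to automated BLEU-4 scoring. The callables `llm_generate`, `llm_repair`, and `train_and_caption` are hypothetical placeholders passed in by the caller, and NLTK's `corpus_bleu` (whose default weights yield BLEU-4) stands in for whatever scorer the pipeline actually uses.

```python
# Hypothetical sketch of the LLM-guided loop: generate model code, try to
# train it, score captions with BLEU-4, and ask the LLM to repair failures.
# The three callables are placeholders for the pipeline's real components.
from nltk.translate.bleu_score import corpus_bleu

def search(prompt, llm_generate, llm_repair, train_and_caption, max_fix_rounds=3):
    code = llm_generate(prompt)  # LLM proposes runnable captioner code
    for _ in range(max_fix_rounds):
        try:
            # Train on MS COCO; return tokenized hypotheses plus lists of
            # tokenized reference captions per image.
            hypotheses, references = train_and_caption(code)
            # corpus_bleu's default weights (0.25,)*4 correspond to BLEU-4.
            return code, corpus_bleu(references, hypotheses)
        except Exception as err:  # e.g. code hallucination, API violation
            code = llm_repair(code, error=str(err))  # iterative code fix
    return code, None  # repair budget exhausted; model counted as failed
```

Bounding the repair rounds keeps the search budget predictable: a model that still fails after a few fixes is recorded as unsuccessful, which is how a below-100% training success rate like the one reported here would arise.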