🤖 AI Summary
This study investigates whether large language models (LLMs) can infer the relative performance of a neural network across image classification datasets from its source code alone, without training and evaluating the network. Building on the NNGPT framework and using the standardized PyTorch implementations and performance metrics of the LEMUR dataset, the authors fine-tune DeepSeek-Coder-7B-Instruct with LoRA and evaluate three prompt configurations: a normalized-accuracy baseline, a metadata-enriched prompt, and a code-only prompt. The code-only prompt reaches 80% peak accuracy, compared with 70% for the metadata prompt, which suggests that neural network source code carries a richer discriminative signal for cross-dataset performance inference than dataset metadata alone. A comparison with the smaller DeepSeek-Coder-1.3B indicates that model capacity affects this reasoning capability.
📝 Abstract
Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts on which of two image classification datasets a given neural network architecture achieves higher accuracy. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: the metadata prompt excels for datasets with distinctive properties (CelebAGender at 90.9%) but degrades when dataset characteristics overlap, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.
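To make the task setup concrete, the following is a minimal sketch of how one code-only pairwise example and the per-dataset accuracy breakdown could be computed. It is not the NNGPT or LEMUR implementation: the function names, the prompt wording, the placeholder dataset `DatasetB`, and the toy accuracy values are all assumptions introduced for illustration.

```python
from collections import defaultdict


def make_example(arch_code, acc_by_dataset, ds_a, ds_b):
    """Build one code-only pairwise example (hypothetical format).

    acc_by_dataset maps a dataset name to the accuracy this architecture
    reached on it. The label is the dataset with the higher accuracy;
    ties carry no signal and are skipped by returning None.
    """
    acc_a, acc_b = acc_by_dataset[ds_a], acc_by_dataset[ds_b]
    if acc_a == acc_b:
        return None
    label = ds_a if acc_a > acc_b else ds_b
    # Code-only prompt: source code and dataset names, no accuracies or metadata.
    prompt = (
        "Neural network source code:\n"
        f"{arch_code}\n\n"
        f"Candidate datasets: {ds_a}, {ds_b}\n"
        "On which dataset does this network reach higher accuracy? "
        "Answer with the dataset name only."
    )
    return {"prompt": prompt, "label": label}


def per_dataset_accuracy(labels, predictions):
    """Fraction of correct predictions, grouped by the true dataset label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for true_name, predicted in zip(labels, predictions):
        totals[true_name] += 1
        hits[true_name] += int(predicted.strip() == true_name)
    return {name: hits[name] / totals[name] for name in totals}


# Toy usage with made-up accuracies (not results from the paper).
example = make_example(
    "class Net(nn.Module): ...",
    {"CelebAGender": 0.9, "DatasetB": 0.5},
    "CelebAGender",
    "DatasetB",
)
```

Grouping accuracy by the true label mirrors the per-dataset analysis described above, where a single aggregate score would hide the fact that one prompt type can be strong on some datasets and weak on others.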