🤖 AI Summary
This work addresses the lack of systematic evaluation methodologies for small language models (SLMs) on code-to-natural-language description tasks. Methodologically, we propose a standardized, reproducible evaluation framework that integrates multiple mainstream Transformer architectures (e.g., Qwen, Phi, LLaMA), employs structured prompt templates, and incorporates an iterative refinement mechanism. We further introduce dual-dimensional automated metrics—semantic fidelity and conciseness—to quantitatively assess description quality. Our key finding is that advanced prompt engineering significantly narrows the performance gap between SLMs and large language models (LLMs): several optimized SLMs achieve description accuracy and conciseness comparable to, or even exceeding, those of substantially larger models. This study advances an efficient, cost-effective, and deployment-oriented evaluation paradigm for code language modeling.
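The iterative refinement mechanism mentioned above can be sketched as a simple feedback loop. This is an illustrative assumption of how such a loop might look, not the paper's exact procedure: `generate` stands in for any SLM call, and the quality threshold and revision prompt are hypothetical.

```python
def refine(generate, prompt: str, quality, threshold: float = 0.8, max_rounds: int = 3) -> str:
    """Ask the model to revise its own description until a quality check passes.

    generate: callable taking a prompt string and returning a model output string.
    quality:  callable scoring an output in [0, 1].
    """
    output = generate(prompt)
    for _ in range(max_rounds - 1):
        if quality(output) >= threshold:
            break  # description already meets the bar; stop refining
        # Feed the previous attempt back to the model with a revision instruction
        prompt = (
            f"{prompt}\n\nPrevious description:\n{output}\n"
            "Revise it to be more accurate and concise."
        )
        output = generate(prompt)
    return output
```

In this sketch, the loop terminates early once the scored output clears the threshold, bounding the number of model calls at `max_rounds`.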
📝 Abstract
Natural Language Description (NLD) is a Natural Language Processing (NLP) task that requires models to generate structured and meaningful natural-language outputs; here, the target outputs are descriptions of source code. In this work, we propose NLD-LLM, a systematic NLP framework for evaluating the ability of language models to generate accurate and concise source code descriptions. The framework incorporates a diverse set of transformer models, including Qwen, DeepSeek, Phi, LLaMA, and Mistral, spanning various sizes, architectures, and training approaches. Central to NLD-LLM is a comprehensive prompt design strategy that includes standardized formatting, clear task guidance, and NLD prompting, ensuring fair and consistent evaluation. Additionally, we apply an iterative refinement process to improve output quality and assess model adaptability. Using semantic and structural metrics, our analysis demonstrates that prompt engineering significantly impacts model effectiveness, with smaller models often performing competitively when supported by well-crafted prompts.
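The semantic and structural scoring described above could be realized in many ways; the following is a minimal sketch under stated assumptions. The function names, the Jaccard-overlap proxy for semantic fidelity, the length-ratio proxy for conciseness, and the weighting are all illustrative choices, not the paper's actual metrics.

```python
def semantic_fidelity(reference: str, candidate: str) -> float:
    """Token-overlap proxy for semantic similarity (Jaccard over word sets)."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref or not cand:
        return 0.0
    return len(ref & cand) / len(ref | cand)

def conciseness(reference: str, candidate: str) -> float:
    """Penalize descriptions longer than the reference; 1.0 means at or under its length."""
    ratio = len(candidate.split()) / max(len(reference.split()), 1)
    return min(1.0, 1.0 / ratio) if ratio > 0 else 0.0

def dual_score(reference: str, candidate: str, alpha: float = 0.5) -> float:
    """Weighted combination of the two dimensions (alpha weights fidelity)."""
    return alpha * semantic_fidelity(reference, candidate) + (1 - alpha) * conciseness(reference, candidate)
```

A production version would likely replace the token-overlap proxy with embedding-based similarity, but the two-dimensional structure of the score is the same.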