🤖 AI Summary
This work addresses the lack of systematic evaluation methodologies for small language models (SLMs) on code-to-natural-language description tasks. Methodologically, we propose a standardized, reproducible evaluation framework that integrates multiple mainstream Transformer architectures (e.g., Qwen, Phi, LLaMA), employs structured prompt templates, and incorporates an iterative refinement mechanism. We further introduce dual-dimensional automated metrics—semantic fidelity and conciseness—to quantitatively assess description quality. Our key finding is that advanced prompt engineering significantly narrows the performance gap between SLMs and large language models (LLMs): several optimized SLMs achieve description accuracy and conciseness comparable to, or even exceeding, those of substantially larger models. This study advances an efficient, cost-effective, and deployment-oriented evaluation paradigm for code language modeling.
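The iterative refinement mechanism mentioned above can be sketched as a simple feedback loop. This is an illustrative assumption of how such a loop might look, not the paper's exact procedure: `generate` stands in for any SLM call, and the quality threshold and revision prompt are hypothetical.

```python
def refine(generate, prompt: str, quality, threshold: float = 0.8, max_rounds: int = 3) -> str:
    """Ask the model to revise its own description until a quality check passes.

    generate: callable taking a prompt string and returning a model output string.
    quality:  callable scoring an output in [0, 1].
    """
    output = generate(prompt)
    for _ in range(max_rounds - 1):
        if quality(output) >= threshold:
            break  # description already meets the bar; stop refining
        # Feed the previous attempt back to the model with a revision instruction
        prompt = (
            f"{prompt}\n\nPrevious description:\n{output}\n"
            "Revise it to be more accurate and concise."
        )
        output = generate(prompt)
    return output
```

In this sketch, the loop terminates early once the scored output clears the threshold, bounding the number of model calls at `max_rounds`.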
📝 Abstract
Natural Language Description (NLD) is a Natural Language Processing (NLP) task that requires models to generate structured and meaningful natural-language outputs; here, the target outputs are descriptions of source code. In this work, we propose NLD-LLM, a systematic NLP framework for evaluating the ability of language models to generate accurate and concise source code descriptions. The framework incorporates a diverse set of transformer models, including Qwen, DeepSeek, Phi, LLaMA, and Mistral, spanning various sizes, architectures, and training approaches. Central to NLD-LLM is a comprehensive prompt design strategy that includes standardized formatting, clear task guidance, and NLD prompting, ensuring fair and consistent evaluation. Additionally, we apply an iterative refinement process to improve output quality and assess model adaptability. Using semantic and structural metrics, our analysis demonstrates that prompt engineering significantly impacts model effectiveness, with smaller models often performing competitively when supported by well-crafted prompts.
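The semantic and structural scoring described above could be realized in many ways; the following is a minimal sketch under stated assumptions. The function names, the Jaccard-overlap proxy for semantic fidelity, the length-ratio proxy for conciseness, and the weighting are all illustrative choices, not the paper's actual metrics.

```python
def semantic_fidelity(reference: str, candidate: str) -> float:
    """Token-overlap proxy for semantic similarity (Jaccard over word sets)."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref or not cand:
        return 0.0
    return len(ref & cand) / len(ref | cand)

def conciseness(reference: str, candidate: str) -> float:
    """Penalize descriptions longer than the reference; 1.0 means at or under its length."""
    ratio = len(candidate.split()) / max(len(reference.split()), 1)
    return min(1.0, 1.0 / ratio) if ratio > 0 else 0.0

def dual_score(reference: str, candidate: str, alpha: float = 0.5) -> float:
    """Weighted combination of the two dimensions (alpha weights fidelity)."""
    return alpha * semantic_fidelity(reference, candidate) + (1 - alpha) * conciseness(reference, candidate)
```

A production version would likely replace the token-overlap proxy with embedding-based similarity, but the two-dimensional structure of the score is the same.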