🤖 AI Summary
Prior research lacks systematic evaluation of large language models (LLMs) on non-code software engineering (SE) tasks. Method: We introduce SELU, the first comprehensive benchmark for SE text understanding, covering 17 real-world SE tasks—including requirements classification, effort estimation, named entity recognition (NER), and masked language modeling (MLM)—and evaluating 22 open-source LLMs alongside two commercial models. SELU features a novel multi-task, multi-source, multi-objective design and employs rigorous evaluation metrics (e.g., macro-F1, sMAPE) alongside Bayesian signed-rank tests. Contribution/Results: Experiments reveal that code-specific pretraining yields limited gains for non-code SE tasks; decoder-only models of moderate scale (e.g., Phi-3, Qwen2) achieve top overall performance and the strongest cross-task robustness; and the efficacy of fine-tuning and prompt engineering varies significantly across task types. This work provides the first empirical foundation for LLM selection and adaptation in SE contexts.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation; however, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. We present the first comprehensive benchmark, which we name "Software Engineering Language Understanding" (SELU), for evaluating LLMs on 17 non-code tasks, spanning from identifying whether a requirement is functional or non-functional to estimating the effort and complexity of backlog items. SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources such as code repositories, issue tracking systems, and developer forums. We fine-tune 22 open-source LLMs, prompt two proprietary alternatives, and train two baselines. Performance is measured using metrics such as macro-F1, sMAPE, micro-F1, and accuracy, and compared via the Bayesian signed-rank test. Our results show that moderate-scale decoder-only models consistently form a top tier, exhibiting high mean performance and low across-task variance, while domain adaptation via code-focused pre-training might yield only modest improvements. These insights guide model selection for non-code SE workflows and highlight directions for expanding SELU to generative and design-oriented scenarios.
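The two headline metrics above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: macro-F1 is the standard unweighted mean of per-class F1 scores, and sMAPE is shown in one common formulation (the paper's exact variant is not specified here).

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    One common definition: mean of |t - p| / ((|t| + |p|) / 2) * 100.
    The benchmark's exact variant may differ (assumption).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100.0)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because macro-F1 averages over classes rather than instances, it rewards models that handle minority requirement categories well, which is one reason benchmarks with imbalanced SE labels prefer it to accuracy.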