Enhancement Report Approval Prediction: A Comparative Study of Large Language Models

📅 2025-06-18
🤖 AI Summary
To address the inefficiency and oversight of high-value enhancement requests (ERs) in manual software maintenance approval, this paper proposes ER Approval Prediction (ERAP), an automated approach. We first systematically evaluate 18 large language models (LLMs) on ERAP, revealing that integrating developer profiling significantly improves zero-shot decoder accuracy. Building on this insight, we introduce a LoRA-finetuned Llama 3.1 8B Instruct model, evaluated alongside multiple encoder architectures (BERT, RoBERTa, DeBERTa-v3, ELECTRA, XLNet) and decoder backbones (GPT-series, DeepSeek-V3), all validated under strict time-series cross-validation to prevent data leakage. Our best model achieves 79.0% overall accuracy and 76.1% recall for the "approve" class, outperforming LSTM-GloVe by 12.0 percentage points in static evaluation and maintaining a 5.0-point advantage under temporal validation. These results demonstrate strong practical viability for deployment in industrial ER triaging systems.
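The strict time-series cross-validation mentioned above can be sketched as an expanding-window chronological split in which every training report predates every test report, so no future information leaks into training. This is a minimal illustration, not the paper's actual pipeline; the `created` field and fold counts are hypothetical:

```python
from datetime import date

def chronological_folds(reports, n_folds=3):
    """Expanding-window time-series CV: train on all reports created
    before the fold boundary, test on the next chronological slice."""
    ordered = sorted(reports, key=lambda r: r["created"])
    fold_size = len(ordered) // (n_folds + 1)
    folds = []
    for i in range(1, n_folds + 1):
        train = ordered[: i * fold_size]
        test = ordered[i * fold_size : (i + 1) * fold_size]
        folds.append((train, test))
    return folds

# toy ERs with synthetic creation dates
reports = [{"id": k, "created": date(2023, 1, 1 + k)} for k in range(8)]
for train, test in chronological_folds(reports, n_folds=3):
    # leakage check: every training report predates every test report
    assert max(r["created"] for r in train) < min(r["created"] for r in test)
```

A random shuffle split would mix later reports into the training set, inflating accuracy estimates; the chronological split mirrors how a deployed ERAP system would actually see data.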

📝 Abstract
Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement. However, manually processing these reports is resource-intensive, leading to delays and potential loss of valuable insights. To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus, leveraging machine learning techniques to automate decision-making. While traditional approaches have employed feature-based classifiers and deep learning models, recent advancements in large language models (LLMs) present new opportunities for improving prediction accuracy. This study systematically evaluates 18 LLM variants (BERT, RoBERTa, DeBERTa-v3, ELECTRA, and XLNet for encoder models; GPT-3.5-turbo, GPT-4o-mini, Llama 3.1 8B, Llama 3.1 8B Instruct, and DeepSeek-V3 for decoder models) against traditional methods (CNN/LSTM-BERT/GloVe). Our experiments reveal two key insights: (1) incorporating creator profiles increases non-fine-tuned decoder-only models' overall accuracy by 10.8 percent, though it may introduce bias; (2) LoRA-fine-tuned Llama 3.1 8B Instruct further improves performance, reaching 79 percent accuracy and significantly enhancing recall for approved reports (76.1 percent vs. LSTM-GloVe's 64.1 percent), outperforming traditional methods by 5 percentage points under strict chronological evaluation and effectively addressing class imbalance. These findings establish LLMs as a superior solution for ERAP, demonstrating their potential to streamline software maintenance workflows and improve decision-making in real-world development environments. We also investigate and summarize the ER cases where the large models underperformed, providing valuable directions for future research.
Problem

Research questions and friction points this paper is trying to address.

Predicting enhancement report approval using large language models
Automating resource-intensive manual processing of software improvement suggestions
Addressing class imbalance and bias in enhancement report evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 18 LLM variants for approval prediction
Incorporates creator profiles to improve accuracy
Employs a LoRA-fine-tuned Llama 3.1 8B Instruct model
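The LoRA fine-tuning highlighted above freezes the pretrained weights and learns only a low-rank update, W' = W + (alpha/r)·BA. The sketch below shows that mechanism in plain NumPy; it is an illustration of the technique, not the paper's training code, and the dimensions, rank, and scaling values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # illustrative sizes, rank, and scale

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    # base path plus the scaled low-rank adapter path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B zero-initialized, the adapter is a no-op at the start of training
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B are trained (r·(d_in + d_out) parameters versus d_in·d_out for full fine-tuning), which is what makes adapting an 8B-parameter model tractable on modest hardware.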