Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

📅 2024-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
For non-generative software engineering tasks such as fault localization and code clone detection, pre-trained language models (PLMs) are held back by the scarcity of high-quality labeled data, while fine-tuning large language models (LLMs) for these tasks is too time-consuming and costly for end users and small organizations. To address both problems, the paper proposes a generative experience-distillation framework: LLMs such as CodeLlama and GPT act as "experience distillers" that automatically synthesize high-quality, task-aligned, domain-specific labeled data, and lightweight PLMs, including BERT and CodeBERT, then learn from that synthetic data. The generated corpus supplies the supervision that real-world data scarcity withholds, while sidestepping the cost of fine-tuning the LLMs themselves. Experiments pairing different generator LLMs with different learner PLMs show substantial gains over the same PLMs without the generated data: up to 58.36% for fault localization and up to 6.09% for clone detection.
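
The generation phase can be pictured as prompting the distiller LLM to emit labeled examples of the target task. Below is a minimal sketch for clone detection, assuming an OpenAI-style chat API; the model name, prompt wording, and JSON schema are illustrative stand-ins, not the paper's actual prompts or distiller setup.

```python
# Generation phase (sketch): ask an LLM to synthesize labeled
# code-clone pairs. Model name, prompt, and schema are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write two Java methods that implement the same functionality with "
    "different identifiers and control flow (a Type-3 clone pair). "
    'Reply as JSON: {"code1": "...", "code2": "...", "label": 1}'
)

def generate_clone_pairs(n_samples: int = 100) -> list[dict]:
    """Collect n_samples synthetic labeled examples from the LLM."""
    samples = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for the paper's CodeLlama/GPT distillers
            messages=[{"role": "user", "content": PROMPT}],
            response_format={"type": "json_object"},
        )
        samples.append(json.loads(resp.choices[0].message.content))
    return samples

if __name__ == "__main__":
    with open("synthetic_clones.jsonl", "w") as f:
        for row in generate_clone_pairs(10):
            f.write(json.dumps(row) + "\n")
```

A real pipeline would also prompt for non-clone (label 0) pairs and filter malformed generations before training on the output.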

📝 Abstract
Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub), these models can understand the patterns in source code and use these patterns to predict code properties. However, LLMs under few-shot learning perform poorly on non-generative tasks (e.g., fault localization and vulnerability localization), and fine-tuning LLMs is time-consuming and costly for end users and small organizations. Furthermore, while fine-tuned LMs perform impressively on non-generative tasks, their performance heavily depends on the amount and quality of training data. As a result, the current lack of data, and the high cost of collecting it in real-world scenarios, further limits the applicability of LMs. In this paper, we leverage the powerful generation capabilities of LLMs to enhance pre-trained LMs. Specifically, we use LLMs to generate domain-specific data, thereby improving the performance of pre-trained LMs on the target tasks. We conduct experiments by combining different LLMs in our generation phase and introducing various LMs to learn from the LLM-generated data. We then compare the performance of these LMs before and after learning from the generated data. We find that LLM-generated data significantly enhances the performance of LMs, with improvements of up to 58.36% for fault localization and up to 6.09% for clone detection.
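
The learning phase then treats the synthesized pairs as ordinary supervised data for a compact PLM. A minimal sketch with Hugging Face Transformers follows, framing clone detection as binary sequence-pair classification on CodeBERT; the hyperparameters and the synthetic_clones.jsonl input (produced by the generation sketch above) are assumptions rather than the paper's exact setup.

```python
# Learning phase (sketch): fine-tune CodeBERT on LLM-generated clone
# pairs. Hyperparameters and file name are illustrative assumptions.
import json

import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

class ClonePairDataset(Dataset):
    """Encodes (code1, code2) rows as sequence-pair inputs with labels."""

    def __init__(self, path: str):
        with open(path) as f:
            self.rows = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows[i]
        enc = tokenizer(row["code1"], row["code2"], truncation=True,
                        max_length=512, padding="max_length")
        item = {k: torch.tensor(v) for k, v in enc.items()}
        item["labels"] = torch.tensor(int(row["label"]))
        return item

args = TrainingArguments(
    output_dir="codebert-clone",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)
Trainer(model=model, args=args,
        train_dataset=ClonePairDataset("synthetic_clones.jsonl")).train()
```
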
Problem

Research questions and friction points this paper is trying to address.

Pre-trained Language Models
Code Error Detection
Security Vulnerability Identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Domain-specific Data Generation
Performance Enhancement