🤖 AI Summary
This work investigates whether LLM-generated data can substitute for human-annotated data in software analytics when models face adversarial attacks. We systematically evaluate six state-of-the-art pre-trained models across three core tasks (code clone detection, code summarization, and code review sentiment analysis) under nine SOTA text-based adversarial attacks (e.g., TextFooler, BERT-Attack). To enable fine-grained analysis, we introduce a suite of eleven semantic and syntactic similarity metrics for multi-faceted comparison of the generated adversarial examples. Results show that models fine-tuned on LLM-synthesized data match those trained on human-annotated data under standard evaluation, yet degrade substantially under adversarial perturbations, with an average performance drop of 23.7%. This study reveals an intrinsic vulnerability of current LLM-generated training data in security-critical software analytics and provides empirical evidence and a standardized evaluation framework for building trustworthy AI-powered software analytics systems.
📝 Abstract
Large Language Model (LLM)-generated data is increasingly used in software analytics, but it remains unclear how such data compares to human-written data, particularly when models face adversarial scenarios. Adversarial attacks can compromise the reliability and security of software systems; since human-written data serves as the benchmark for model performance, understanding how LLM-generated data holds up under attack reveals whether it offers comparable robustness and effectiveness. To address this gap, we systematically evaluate and compare the quality of human-written and LLM-generated data for fine-tuning robust pre-trained models (PTMs) under adversarial attacks. We assess the robustness of six widely used PTMs, fine-tuned on human-written and LLM-generated data, before and after attack, applying nine state-of-the-art (SOTA) adversarial attack techniques across three popular software analytics tasks: clone detection, code summarization, and sentiment analysis in code review discussions. Additionally, we analyze the quality of the generated adversarial examples using eleven similarity metrics. Our findings reveal that while PTMs fine-tuned on LLM-generated data perform competitively with those fine-tuned on human-written data, they are less robust against adversarial attacks. Our study underscores the need to further improve the quality of LLM-generated training data so that models are both high-performing and capable of withstanding adversarial attacks in software analytics.
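The before/after-attack robustness comparison described above can be sketched as a simple accuracy measurement on clean versus perturbed inputs. This is a minimal illustration, not the paper's actual pipeline: the model, dataset, and attack below are toy placeholders standing in for a fine-tuned PTM, a software analytics benchmark, and an attack such as TextFooler.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (input text, gold label)

def accuracy(model: Callable[[str], int], examples: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)

def robustness_drop(model: Callable[[str], int],
                    clean: List[Example],
                    attack: Callable[[str], str]) -> float:
    """Relative accuracy drop (%) after perturbing each input with `attack`."""
    clean_acc = accuracy(model, clean)
    adversarial = [(attack(text), label) for text, label in clean]
    adv_acc = accuracy(model, adversarial)
    return 100.0 * (clean_acc - adv_acc) / clean_acc

# Toy stand-ins: a keyword sentiment classifier and a synonym-swap
# perturbation (the word-substitution idea behind TextFooler-style attacks).
def toy_model(text: str) -> int:
    return 1 if "good" in text else 0

def toy_attack(text: str) -> str:
    return text.replace("good", "decent")

data = [("good patch", 1), ("bad patch", 0), ("good fix", 1), ("bad fix", 0)]
drop = robustness_drop(toy_model, data, toy_attack)
print(f"robustness drop: {drop:.1f}%")  # clean acc 1.0 -> adversarial acc 0.5
```

The same drop metric, computed once for a model fine-tuned on human-written data and once for its LLM-data counterpart, is what makes the two training sources directly comparable under attack.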