Retrieval-Augmented Test Generation: How Far Are We?

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 10
Influential: 0
🤖 AI Summary
This work investigates the effectiveness of Retrieval-Augmented Generation (RAG) for automated unit test generation targeting machine learning (ML) APIs. To address the challenge of generating semantically relevant and executable tests, we propose an API-level RAG strategy—refining retrieval granularity to individual API interfaces—and compare it against zero-shot prompting and baseline RAG. We systematically evaluate three knowledge sources—API documentation, GitHub Issues, and Stack Overflow Q&A—across 188 high-frequency APIs from five major ML libraries (e.g., TensorFlow), using GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B. Results show that API-level RAG improves average test pass rate by 27%; Stack Overflow yields superior coverage of edge cases and best cost-effectiveness. This is the first study to conduct a systematic, multi-source, domain-specific comparison of knowledge sources in RAG-based test generation, significantly enhancing test relevance and executability.
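The core idea of API-level RAG, narrowing the retrieval index from whole documents to per-API snippets, can be illustrated with a minimal sketch. All function names and the toy knowledge base below are hypothetical, not the paper's implementation; a real pipeline would use embedding-based retrieval and an LLM call rather than word overlap and a string template:

```python
def retrieve_for_api(api_name, knowledge_base, query, top_k=2):
    """Rank snippets indexed under a single API by naive word overlap with the query.

    API-level granularity: only snippets stored under `api_name` are candidates,
    so retrieved context is always scoped to the API under test.
    """
    snippets = knowledge_base.get(api_name, [])
    query_words = set(query.lower().split())
    scored = sorted(
        snippets,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def build_test_generation_prompt(api_name, knowledge_base):
    """Assemble a unit-test-generation prompt from API-scoped context."""
    context = retrieve_for_api(
        api_name, knowledge_base, query=f"usage examples for {api_name}"
    )
    joined = "\n".join(f"- {c}" for c in context)
    return (
        f"Using the context below, write an executable unit test for {api_name}.\n"
        f"Context:\n{joined}"
    )


# Toy knowledge base mixing the paper's three source types for one API.
kb = {
    "tf.math.add": [
        "API docs: tf.math.add returns the element-wise sum of two tensors.",
        "StackOverflow: broadcasting rules apply when input shapes differ.",
    ]
}
prompt = build_test_generation_prompt("tf.math.add", kb)
```

The prompt would then be sent to one of the evaluated LLMs (e.g., GPT-4o), and the returned test executed to measure pass rate and coverage.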

📝 Abstract
Retrieval-Augmented Generation (RAG) has shown notable advancements in software engineering tasks. Despite its potential, RAG's application to unit test generation remains under-explored. To bridge this gap, we investigate the efficacy of RAG-based LLMs in test generation. Since RAG can leverage various knowledge sources to enhance performance, we also explore the impact of different knowledge-base sources on unit test generation to provide insights into their practical benefits and limitations. Specifically, we examine RAG built upon three types of domain knowledge: 1) API documentation, 2) GitHub issues, and 3) StackOverflow Q&As. Each source offers essential knowledge for creating tests from a different perspective: API documentation provides official usage guidelines, GitHub issues offer library developers' resolutions of API-related problems, and StackOverflow Q&As present community-driven solutions and best practices. For our experiments, we focus on five widely used Python-based machine learning (ML) projects, i.e., TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, which support building, training, and deploying complex neural networks efficiently. We conducted experiments on the top 10% most widely used APIs across these projects, 188 APIs in total. We investigate the effectiveness of four state-of-the-art LLMs (open- and closed-source), i.e., GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B. Additionally, we compare three prompting strategies for generating unit test cases for the experimental APIs, i.e., zero-shot, a Basic RAG, and an API-level RAG over the three external sources. Finally, we compare the cost of the different knowledge sources used for the RAG.
Problem

Research questions and friction points this paper is trying to address.

Investigating RAG effectiveness for unit test generation of ML/DL APIs
Analyzing impact of different knowledge sources on test coverage improvement
Evaluating RAG's potential in detecting bugs through generated unit tests
Innovation

Methods, ideas, or system contributions that make the work stand out.

RAG strategies built on three knowledge sources: API documentation, GitHub issues, and StackOverflow Q&As
Unit tests generated for five ML libraries using four state-of-the-art LLMs
GitHub issues improve line coverage by providing edge cases