exLong: Generating Exceptional Behavior Tests with Large Language Models

📅 2024-05-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the low coverage and poor quality of exceptional behavior tests (EBTs) produced by existing tools, this paper proposes the first LLM-based framework for EBT generation. The authors perform fine-grained instruction tuning on CodeLlama, integrating program execution trace analysis, guard condition identification, and joint reasoning with non-exceptional test cases. The key contribution is a co-modeling mechanism that jointly captures exception-triggering paths and their associated guard conditions, enabling end-to-end, interpretable EBT generation. Evaluated on multiple benchmarks, the approach significantly outperforms CAT-LM, GPT-4o, Randoop, and EvoSuite. Moreover, 23 of the generated EBTs have been accepted and merged into open-source projects, with the corresponding pull requests publicly available. This work bridges a gap in LLM-augmented exception testing and establishes a new paradigm for robustness verification.

📝 Abstract
Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their efforts on "happy paths", e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction fine-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.
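For concreteness, the kind of test exLong targets can be sketched as follows. This is a minimal, hand-written illustration, not output of exLong: the `Account` class, its `withdraw` method, and the guard condition `amount < 0` are hypothetical, chosen only to show a conditional expression guarding a throw statement and the EBT that exercises it.

```java
// Hypothetical class under test: withdraw guards against negative amounts.
class Account {
    private int balance = 100;

    int withdraw(int amount) {
        if (amount < 0) { // guard condition protecting the throw statement
            throw new IllegalArgumentException("amount must be non-negative");
        }
        balance -= amount;
        return balance;
    }
}

public class EbtExample {
    public static void main(String[] args) {
        Account account = new Account();
        boolean thrown = false;
        try {
            account.withdraw(-5); // unwanted event: illegal argument value
        } catch (IllegalArgumentException e) {
            thrown = true; // exceptional path taken, as the test expects
        }
        if (!thrown) {
            throw new AssertionError("expected IllegalArgumentException");
        }
        System.out.println("EBT passed: exception thrown as expected");
    }
}
```

A "happy path" test, by contrast, would call `withdraw` with a valid amount and assert on the resulting balance; the point of an EBT is to cover the guarded throw instead.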
Problem

Research questions and friction points this paper is trying to address.

Automatic Generation
Exception Behavior Testing
Programming Languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

exLong
exceptional behavior tests
language model fine-tuning
Jiyang Zhang, The University of Texas at Austin, USA
Yu Liu, The University of Texas at Austin, USA
Pengyu Nie, University of Waterloo (Software Engineering, Natural Language Processing, Programming Languages)
Junyi Jessy Li, Associate Professor, The University of Texas at Austin (Computational Linguistics, Natural Language Processing)
Miloš Gligorić, The University of Texas at Austin, USA