Understanding the Characteristics of LLM-Generated Property-Based Tests in Exploring Edge Cases

📅 2025-10-29

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Example-based testing (EBT) often misses edge cases in LLM-generated code: defects that occur at boundary values, special input patterns, or extreme conditions. Method: This study compares property-based testing (PBT) and EBT, both generated automatically by Claude-4-sonnet, on 16 HumanEval problems whose standard solutions failed on extended test cases. Contribution/Results: The two methods are complementary. Used alone, each detects bugs in 68.75% of the problems (11 of 16), whereas their combination reaches 81.25% (13 of 16). PBT is effective at detecting performance issues and edge cases through extensive input space exploration, while EBT is effective at detecting specific boundary conditions and special patterns. Based on these findings, the study suggests a hybrid approach that uses both PBT and EBT to improve the reliability of LLM-generated code, and it offers guidance for test generation strategies in LLM-based code generation.
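The reported percentages can be converted back into problem counts. The sketch below does that arithmetic. One assumption is not stated in the source: that the combined rate counts every problem where at least one of the two methods detects the bug. Under that assumption, the overlap and the per-method unique detections follow by inclusion-exclusion. All names here are illustrative.

```python
from fractions import Fraction

TOTAL_PROBLEMS = 16  # HumanEval problems analyzed in the study


def detected_count(rate_percent: str, total: int = TOTAL_PROBLEMS) -> int:
    """Convert a reported percentage into a whole number of problems."""
    count = Fraction(rate_percent) / 100 * total
    if count.denominator != 1:
        raise ValueError(f"{rate_percent}% of {total} is not a whole number")
    return int(count)


pbt = detected_count("68.75")       # problems where PBT alone detects the bug
ebt = detected_count("68.75")       # problems where EBT alone detects the bug
combined = detected_count("81.25")  # problems where at least one method detects it

# Assumption: "combined" is the union of the two detection sets,
# so inclusion-exclusion gives the overlap.
overlap = pbt + ebt - combined
only_pbt = pbt - overlap
only_ebt = ebt - overlap
missed = TOTAL_PROBLEMS - combined

print(pbt, ebt, combined, overlap, only_pbt, only_ebt, missed)
```

Read this way, the gain from combining the methods comes from a small number of problems that only one method catches, which is the sense in which the two are complementary.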

📝 Abstract

As Large Language Models (LLMs) increasingly generate code in software development, ensuring the quality of LLM-generated code has become important. Traditional testing approaches using Example-based Testing (EBT) often miss edge cases -- defects that occur at boundary values, special input patterns, or extreme conditions. This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions failed on extended test cases, generating both PBT and EBT test codes using Claude-4-sonnet. Our experimental results reveal that while each method individually achieved a 68.75% bug detection rate, combining both approaches improved detection to 81.25%. The analysis demonstrates complementary characteristics: PBT effectively detects performance issues and edge cases through extensive input space exploration, while EBT effectively detects specific boundary conditions and special patterns. These findings suggest that a hybrid approach leveraging both testing methods can improve the reliability of LLM-generated code, providing guidance for test generation strategies in LLM-based code generation.
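To make the contrast between the two test styles concrete, here is a minimal sketch using only the Python standard library. It is not taken from the paper. `buggy_max` is a hypothetical defective solution, and the hand-rolled random loop stands in for a real PBT library such as Hypothesis, which adds input strategies and shrinking. The example shows only one direction of the complementarity (PBT catching what a small example set misses). The abstract reports that the reverse also occurs.

```python
import random


def buggy_max(xs):
    """Hypothetical defective solution: starts from 0, so all-negative lists fail."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def example_based_test(fn):
    """EBT: a few hand-picked input/expected-output pairs."""
    cases = [([1, 2, 3], 3), ([5], 5), ([2, 9, 4], 9)]
    return all(fn(xs) == expected for xs, expected in cases)


def property_based_test(fn, trials=200, seed=0):
    """PBT: check general properties on random inputs.

    Returns the first counterexample found, or None if all trials pass.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 5))]
        result = fn(xs)
        # Properties of a maximum: it is an element of the list,
        # and no element is larger than it.
        if result not in xs or any(result < x for x in xs):
            return xs
    return None


ebt_passed = example_based_test(buggy_max)
counterexample = property_based_test(buggy_max)
print("EBT passed:", ebt_passed)
print("PBT counterexample:", counterexample)
```

The example cases contain only positive numbers, so the defect goes unnoticed, while the random search reaches an all-negative list. An EBT suite that included a case such as `[-3, -1]` would catch the same bug directly, which is why the quality of the chosen examples matters as much as the choice of method.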

Problem

Research questions and friction points this paper is trying to address.

Investigating how effectively LLM-generated property-based tests (PBT) detect edge cases

Comparing PBT with example-based testing (EBT) in identifying boundary-value defects

Proposing a hybrid testing approach to improve the reliability of LLM-generated code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines LLM-generated property-based and example-based tests

Covers edge cases by sampling a wide range of the input space

Uses hybrid testing to raise the bug detection rate from 68.75% to 81.25%

🔎 Similar Papers

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation