Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs

📅 2025-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM-based test generation approaches are largely confined to single-language, single-attempt settings, missing the chance to exploit LLM diversity for more robust, self-consistent testing. This paper introduces PolyTest, a framework that unifies test generation across five languages (Java, C, Python, JavaScript, and a CSV-based format) with temperature-controlled sampling: it generates high-precision base tests cross-lingually at temperature 0, adds diversity through multiple same-language samples at temperature 1, and, rather than relying on on-the-fly execution feedback, resolves conflicting tests during unification via self-consistency. Evaluated with Llama3-70B, GPT-4o, and GPT-3.5 on EvalPlus, PolyTest significantly improves test quality across all five languages -- achieving up to 9.01% higher statement/branch coverage and up to 11.23% higher mutation score -- and outperforms Pynguin in test generation, passing rate, and mutation score.

📝 Abstract
Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs' ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests -- same input, different expected outputs -- across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics -- number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
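The key insight above -- that LLMs generate diverse yet contradicting tests (same input, different expected outputs) across languages and samples -- suggests a simple consensus step at unification time. The sketch below illustrates one way such conflict resolution could work via majority voting; the function name, test representation, and tie-breaking rule are assumptions for illustration, not the paper's actual implementation:

```python
from collections import Counter, defaultdict


def unify_test_sets(test_sets):
    """Unify multiple generated test sets into one self-consistent set.

    Each test is a (test_input, expected_output) pair of strings.
    Tests that share an input but disagree on the expected output are
    resolved by majority vote; exact ties are dropped as contradicting.
    """
    votes = defaultdict(Counter)
    for test_set in test_sets:
        for test_input, expected in test_set:
            votes[test_input][expected] += 1

    unified = []
    for test_input, counter in votes.items():
        ranked = counter.most_common()
        top_output, top_count = ranked[0]
        if len(ranked) > 1 and ranked[1][1] == top_count:
            continue  # tie between expected outputs: no consensus, drop
        unified.append((test_input, top_output))
    return unified
```

For example, if two of three generated test sets expect `add(-1, 1)` to return `"0"` and one expects `"2"`, the unified set keeps the majority expectation `"0"`, mitigating the inconsistency without executing any code.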
Problem

Research questions and friction points this paper is trying to address.

Most LLM-based test generation evaluates a single attempt in a single language, leaving LLM diversity untapped.
LLMs frequently generate contradicting tests -- same input, different expected outputs -- across languages and sampled generations.
Execution feedback is often unavailable or costly, especially for weaker-performing languages.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual test generation: tests are produced in multiple languages at temperature 0 and then unified.
Diverse test sampling: multiple test sets are generated within the same language at temperature 1.
Unifying test sets resolves contradictions, fostering self-consistency without on-the-fly execution.
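The two complementary strategies above can be sketched as a single orchestration loop. Here `generate_tests(spec, language, temperature)` is a hypothetical placeholder for an LLM call returning (input, expected output) pairs; the interface, language list, and sample count are assumptions, not the paper's API:

```python
def polytest_pipeline(generate_tests, program_spec,
                      languages=("python", "java", "c", "javascript", "csv"),
                      n_samples=5):
    """Collect test sets via PolyTest's two generation strategies.

    (1) Cross-lingual: one deterministic test set per language at T=0.
    (2) Diverse sampling: several test sets in one language at T=1.
    The returned sets would then be unified into a self-consistent set.
    """
    test_sets = []
    for lang in languages:  # (1) zero-temperature, one set per language
        test_sets.append(generate_tests(program_spec, lang, temperature=0.0))
    for _ in range(n_samples):  # (2) high-temperature, same-language samples
        test_sets.append(generate_tests(program_spec, languages[0], temperature=1.0))
    return test_sets
```

Passing the collected sets through a unification step that votes on expected outputs is what lets the approach improve test quality without executing the generated tests.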