Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

📅 2025-03-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This paper addresses the challenges of testing software built on large language models (LLMs) and multi-agent LLMs. These challenges stem from the systems' inherent non-determinism, which is most visible when inputs and outputs are ambiguous and outputs cannot be reproduced. To model test variability systematically, the authors propose a taxonomy with four facets: *target*, *system under test*, *input*, and *output*. They define two types of test oracles: *atomic* (fine-grained, single-step) and *aggregated* (coarse-grained, multi-step). Through systematic literature analysis, industrial surveys, and empirical evaluation of open-source tools, they find that existing tools leave fundamental gaps in their coverage of variability sources. The work establishes a structured, extensible taxonomy for LLM testing and explicitly articulates six open research challenges, with the aim of improving the reliability and reproducibility of LLM testing in both theory and practice.

📝 Abstract

Unlike traditional or machine learning software, Large Language Models (LLMs) and Multi-Agent LLMs (MALLMs) introduce non-determinism, requiring new approaches to verifying correctness beyond simple output comparisons or statistical accuracy over test datasets. This paper presents a taxonomy for LLM test case design, informed by the research literature, our experience, and open-source tools that represent the state of practice. We identify key variation points that impact test correctness and highlight open challenges that the research, industry, and open-source communities must address as LLMs become integral to software systems. Our taxonomy defines four facets of LLM test case design, addressing ambiguity in both inputs and outputs while establishing best practices. It distinguishes variability in goals, the system under test, and inputs, and introduces two key oracle types: atomic and aggregated. Our mapping indicates that current tools insufficiently account for these variability points, highlighting the need for closer collaboration between academia and practitioners to improve the reliability and reproducibility of LLM testing.
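The distinction between atomic and aggregated oracles can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how either oracle is computed, so this sketch assumes that an atomic oracle judges one output and that an aggregated oracle combines many atomic verdicts into a pass rate compared against a threshold. The names `fake_llm`, `atomic_oracle`, `aggregated_oracle`, and `run_test`, and the `min_pass_rate` parameter, are all hypothetical.

```python
import random
from typing import List


def fake_llm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a non-deterministic LLM call."""
    return rng.choice(["Paris", "Paris.", "The capital is Paris", "Lyon"])


def atomic_oracle(output: str) -> bool:
    """Judge a single output (assumed reading of 'atomic').

    Uses a tolerant containment check instead of exact string equality,
    since equivalent answers may be phrased differently.
    """
    return "paris" in output.lower()


def aggregated_oracle(verdicts: List[bool], min_pass_rate: float) -> bool:
    """Combine many atomic verdicts (assumed reading of 'aggregated').

    Passes when the fraction of passing verdicts reaches min_pass_rate.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(verdicts) / len(verdicts) >= min_pass_rate


def run_test(prompt: str, runs: int, min_pass_rate: float, seed: int) -> bool:
    """Sample the system several times and apply both oracle levels."""
    rng = random.Random(seed)  # seeded so the sketch itself is reproducible
    verdicts = [atomic_oracle(fake_llm(prompt, rng)) for _ in range(runs)]
    return aggregated_oracle(verdicts, min_pass_rate)


if __name__ == "__main__":
    passed = run_test("What is the capital of France?", runs=20,
                      min_pass_rate=0.6, seed=0)
    print("test passed" if passed else "test failed")
```

Under these assumptions, a single failing sample does not fail the test; the verdict depends on the pass rate across runs, which is one way a test can tolerate irreproducible outputs.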

Problem

Research questions and friction points this paper is trying to address.

Addressing non-determinism in LLM and MALLM testing

Developing a taxonomy for LLM test case design

Identifying key variation points impacting test correctness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Taxonomy for LLM test case design

Four facets addressing input-output ambiguity

Two oracle types: atomic and aggregated

🔎 Similar Papers

Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation