AI Summary
This work addresses a safety risk in large language model (LLM)-based car manual retrieval systems: answers to user queries that omit warnings contained in the manual. To evaluate this risk systematically, the authors organized the first edition of an LLM Testing competition, in which the system under test is an LLM-based automotive assistant. Competing tools generate test inputs automatically and are scored on two criteria: how effectively they expose failures, and how diverse their failure-revealing tests are. The report compares four testing tools on these criteria and describes the experimental methodology, the competitors, and the results, which can serve as a baseline for future work on testing LLM applications in safety-critical domains such as automotive assistance.
Abstract
This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
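The two evaluation criteria named above, failure exposure and diversity of failure-revealing tests, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the competition's actual oracle or metric: the keyword-based `is_failure` check, the token-level Jaccard distance, and the `score` helper are all hypothetical names and choices.

```python
from itertools import combinations


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately crude stand-in)."""
    return set(text.lower().split())


def is_failure(response, required_warning_terms):
    """Hypothetical oracle: the response fails if any required warning term is absent."""
    lowered = response.lower()
    return any(term.lower() not in lowered for term in required_warning_terms)


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the token sets of two inputs."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def mean_pairwise_distance(inputs):
    """Assumed diversity proxy: average Jaccard distance over all input pairs."""
    pairs = list(combinations(inputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def score(test_results):
    """test_results: list of (user_input, response, required_warning_terms)."""
    failing = [inp for inp, resp, terms in test_results if is_failure(resp, terms)]
    total = len(test_results)
    return {
        "failures": len(failing),
        "failure_rate": len(failing) / total if total else 0.0,
        "diversity": mean_pairwise_distance(failing),
    }


# Made-up example data, not taken from the competition.
results = [
    ("how do I jump start the battery", "Connect the red cable first.", ["warning"]),
    ("how do I change a tire", "Warning: park on level ground first.", ["warning"]),
    ("can I tow a trailer uphill", "Yes, use low gear.", ["warning"]),
]
summary = score(results)
```

A tool that finds many failures with near-identical inputs would score high on `failures` but low on `diversity` here, which is why the two criteria are reported separately.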