DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

📅 2026-04-14

🤖 AI Summary
This work addresses a safety-critical problem: large language model (LLM)-driven automotive manual retrieval systems frequently omit essential safety warnings when answering user queries. To evaluate this risk systematically, the authors organize the first benchmarking competition designed specifically for LLM-based automotive assistants. The competition pipeline spans automated test generation, LLM interaction, failure detection, and diversity analysis, and its evaluation framework centers on two criteria: failure-exposure capability and test diversity. The study assesses how effectively four testing tools uncover safety-critical flaws, and it provides a reproducible experimental setup, baseline results, and insights into the safety assurance of LLM applications in high-stakes domains such as automotive assistance.

📝 Abstract
This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
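The two evaluation criteria named above, failure exposure and diversity of failure-revealing tests, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the keyword-based oracle `omits_warning`, the Jaccard-based `mean_pairwise_diversity`, the keyword list, and the sample queries and responses are hypothetical and are not the competition's actual oracle, metric, or data.

```python
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,!?;:").lower() for w in text.split()} - {""}


def omits_warning(response: str, warning_keywords: list[str]) -> bool:
    """Hypothetical oracle: the test fails if no warning keyword appears."""
    words = tokens(response)
    return not any(k.lower() in words for k in warning_keywords)


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def mean_pairwise_diversity(inputs: list[str]) -> float:
    """Average Jaccard distance over all pairs (0.0 for fewer than two inputs)."""
    pairs = list(combinations(inputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Illustrative (query, response) pairs; not taken from the competition.
results = [
    ("How do I jump start the car?",
     "Connect the red cable to the positive terminal."),
    ("How do I jump start my car?",
     "Warning: risk of explosion. Connect the red cable first."),
    ("Can I tow a trailer uphill?",
     "Yes, select a low gear."),
]
keywords = ["warning", "caution", "danger"]  # assumed keyword list

failing = [q for q, r in results if omits_warning(r, keywords)]
diversity = mean_pairwise_diversity(failing)
print(len(failing), round(diversity, 3))
```

A lexical oracle of this kind is only a stand-in. A real setup would more plausibly compare each response against the specific warnings in the manual passage relevant to the query, and might measure diversity on embeddings or failure categories rather than token overlap.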
Problem

Research questions and friction points this paper is trying to address.

LLM-based automotive assistant
failure detection
warning omission
test diversity
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM testing
automotive assistant
failure-revealing tests
benchmarking
safety warnings