DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

📅 2026-04-14

🤖 AI Summary

This work addresses the risk that large language model (LLM)-driven car manual retrieval systems omit essential safety warnings when answering user queries. To evaluate this risk systematically, the authors organize the first edition of a benchmarking competition for LLM-based automotive assistants. Each competing tool generates test inputs automatically, submits them to the LLM-based system, and detects failing responses; tools are then scored on two criteria: how effectively they expose failures and how diverse their failure-revealing tests are. The study assesses four testing tools on their ability to uncover safety-critical flaws, and describes the experimental setup and results, with insights into the safety assurance of LLM applications in high-stakes domains such as automotive assistance.
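
To make the notion of a warning omission concrete, the following is a minimal sketch of a failure-detection check. It is not the competition's oracle: the token-overlap heuristic, the `omits_warning` function, the `threshold` parameter, and the example manual text are all hypothetical choices made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def omits_warning(response: str, warning: str, threshold: float = 0.5) -> bool:
    """Return True when the response covers less than `threshold` of the
    warning's tokens. This overlap heuristic is an assumption, not the
    competition's actual failure criterion."""
    warning_tokens = tokens(warning)
    if not warning_tokens:
        return False  # nothing to omit
    covered = len(warning_tokens & tokens(response)) / len(warning_tokens)
    return covered < threshold


# Made-up manual warning and assistant responses, for illustration only.
warning = "Never jump-start a frozen battery"
ok = "Connect the cables, but never jump-start a frozen battery."
bad = "Connect the red cable first, then the black cable."

print(omits_warning(ok, warning))
print(omits_warning(bad, warning))
```

A real oracle would likely need semantic matching rather than token overlap, since a response can paraphrase a warning without sharing its wording.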

📝 Abstract

This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
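
The two evaluation criteria named above can be sketched as follows. The abstract does not specify the metrics, so this is an illustration under stated assumptions: effectiveness is taken to be the number of failure-revealing tests, and diversity is taken to be the mean pairwise Jaccard distance between the word sets of those tests. The `score` and `jaccard_distance` functions and the sample inputs are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the word sets of two test inputs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def score(results: list[tuple[str, bool]]) -> tuple[int, float]:
    """Return (number of failure-revealing tests, mean pairwise distance).

    `results` pairs each test input with a flag saying whether it exposed
    a failure. Both metrics are assumptions made for illustration.
    """
    failing = [text for text, failed in results if failed]
    pairs = list(combinations(failing, 2))
    if not pairs:
        return len(failing), 0.0
    diversity = sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
    return len(failing), diversity


# Made-up test inputs, for illustration only.
results = [
    ("how do i jump start the car", True),
    ("how do i change a tire", True),
    ("what is the tire pressure", False),
]
failures, diversity = score(results)
print(failures, round(diversity, 2))
```

Scoring diversity only over failing tests mirrors the abstract's wording, which refers to the diversity of the discovered failure-revealing tests rather than of all generated inputs.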

Problem

Research questions and friction points this paper is trying to address.

LLM-based automotive assistant

failure detection

warning omission

test diversity

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM testing

automotive assistant

failure-revealing tests