🤖 AI Summary
To address the fragmentation, low reusability, and development inefficiency of test assets in automotive Hardware-in-the-Loop (HIL) testing, this paper proposes HIL-GPT: a domain-adapted, lightweight large language model (LLM) system that integrates Retrieval-Augmented Generation (RAG) with a fine-tuned semantic embedding model to enable traceable, bidirectional retrieval between requirements and test cases. Methodologically, we introduce a data curation pipeline that combines heuristic mining with LLM-based synthesis to construct a high-quality, domain-specific dataset. We empirically show that compact embedding models achieve a superior trade-off among accuracy, inference latency, and deployment cost, challenging the prevailing "bigger is better" assumption. A/B experiments demonstrate that HIL-GPT significantly outperforms general-purpose LLMs in practical utility, result reliability, and user satisfaction.
📝 Abstract
Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning on a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable retrieval of test cases and requirements. Experiments show that fine-tuned compact models, such as `bge-base-en-v1.5`, achieve a superior trade-off among accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.
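To make the retrieval step concrete, the following is a minimal sketch of embedding-based requirement retrieval over a small in-memory vector index. It is illustrative only: the requirement strings are invented examples, and simple normalized token-count vectors stand in for the learned embeddings a fine-tuned model such as `bge-base-en-v1.5` would produce (a real system would call the model's encoder instead of `embed`).

```python
import numpy as np
from collections import Counter

# Hypothetical requirement corpus (invented examples, not from the paper).
requirements = [
    "REQ-101: brake pedal signal shall trigger HIL fault injection",
    "REQ-102: battery voltage shall stay within 11 to 14 volts",
    "REQ-103: CAN bus timeout shall raise a diagnostic trouble code",
]

# Vocabulary built from the corpus; query words outside it are ignored.
vocab = sorted({w for r in requirements for w in r.lower().split()})

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: L2-normalized token counts over the vocabulary."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# "Vector index": one embedding row per requirement.
index = np.stack([embed(r) for r in requirements])

def retrieve(query: str, k: int = 1):
    """Return the top-k requirements by cosine similarity to the query."""
    q = embed(query)
    scores = index @ q  # dot product == cosine similarity (rows are unit norm)
    top = np.argsort(scores)[::-1][:k]
    return [(requirements[i], float(scores[i])) for i in top]

print(retrieve("test case for brake pedal fault injection"))
```

Bidirectional traceability follows from the same mechanism: indexing test cases instead of requirements lets a requirement text serve as the query, so links can be recovered in either direction.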