🤖 AI Summary
LLM inference engines are highly resource-intensive and platform-dependent, making them prone to defects; however, real-world bugs in such engines remain understudied. Method: We conduct the first empirical study on five mainstream open-source LLM inference engines, mining and validating 929 real bugs from GitHub Issues and PRs to construct the first dedicated bug benchmark dataset. Using open coding and qualitative root-cause analysis, we propose a fine-grained taxonomy—six symptom categories (e.g., out-of-memory, CUDA context errors, KV cache inconsistency) and 28 root-cause classes—and identify cross-platform resource scheduling and memory management as the most vulnerable components. Contribution/Results: We derive actionable debugging strategies and concrete architectural improvement recommendations, providing both theoretical foundations and practical guidance for developers, vendors, and researchers in building robust LLM inference systems.
📝 Abstract
Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine the official repositories of five widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, and commonalities. Our findings reveal six major bug symptoms and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and localization within LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers.