🤖 AI Summary
Deep learning (DL) frameworks face severe scalability, stability, and efficiency challenges when supporting large language models (LLMs); existing frameworks exhibit critical shortcomings in usability, functional completeness, and error handling, impeding development velocity and wasting resources. Method: We conduct the first systematic analysis of thousands of issue reports across three mainstream DL frameworks and eight LLM toolkits, augmented by in-depth interviews with 19 practitioners (11 LLM users and eight framework developers). We propose the first LLM-centric tri-dimensional taxonomy covering defects, requirements, and questions, comprising 18 high-level themes and 69 fine-grained categories. Using topic modeling and empirical induction, we quantify practitioners' priorities. Contribution/Results: We derive five actionable recommendations for framework improvement. Our work establishes the first empirically grounded engineering guide and theoretical foundation for designing LLM-specialized frameworks.
📝 Abstract
Large language models (LLMs) drive significant advances in real-world industrial applications. LLMs rely on deep learning (DL) frameworks for efficient model construction, distributed execution, and optimized deployment. Their large parameter scale and long execution cycles place extreme demands on DL frameworks in terms of scalability, stability, and efficiency. Poor usability, limited functionality, and subtle bugs in DL frameworks can therefore hinder development efficiency and cause severe failures or resource waste. However, a fundamental question remains underinvestigated: what challenges do DL frameworks face in supporting LLMs? To answer it, we investigate these challenges through a large-scale analysis of issue reports from three major DL frameworks (MindSpore, PyTorch, TensorFlow) and eight associated LLM toolkits (e.g., Megatron). We construct a taxonomy of LLM-centric bugs, requirements, and user questions and enrich it through interviews with 11 LLM users and eight DL framework developers, uncovering key technical challenges and misalignments between user needs and developer priorities. Our contributions are threefold: (1) we develop a comprehensive taxonomy comprising four question themes (nine sub-themes), four requirement themes (15 sub-themes), and ten bug themes (45 sub-themes); (2) we assess the perceived importance and priority of these challenges based on practitioner insights; and (3) we identify five key findings across the LLM development lifecycle and propose five actionable recommendations to improve the reliability, usability, and testability of DL frameworks. Our results highlight critical limitations in current DL frameworks and offer concrete guidance for advancing their support for the next generation of LLMs.