Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs

📅 2026-01-20
📈 Citations: 0
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
This study addresses the growing challenge of runtime failures in locally deployed and fine-tuned open-source large language models (LLMs), which increasingly stem not from algorithmic flaws but from systemic fragility in the deployment stack. Through a large-scale empirical analysis of 705 real-world failure reports from the DeepSeek, Llama, and Qwen ecosystems, the authors identify and formally characterize three recurring phenomena: diagnostic divergence, systemic homogeneity, and lifecycle escalation. By integrating fault log analysis, root cause categorization, and cross-ecosystem comparison, they construct the first reliability analysis framework for open-source LLM deployment, establishing the deployment stack as the primary source of failures. They further release the first public dataset of such failures, offering actionable insights for enhancing deployment reliability.

๐Ÿ“ Abstract
The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure, but it also exposes them to a "First Mile" deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from algorithmic defects in the model to systemic fragility in the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature of internal tokenizer defects. (2) Systemic Homogeneity: root causes converge across divergent model series, confirming that reliability barriers are inherent to the shared ecosystem rather than to specific architectures. (3) Lifecycle Escalation: barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the LLM landscape.
Problem

Research questions and friction points this paper is trying to address.

LLM reliability
deployment failures
open-source LLMs
systemic fragility
user-reported failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

empirical study
LLM deployment failures
systemic fragility
diagnostic divergence
open-source LLMs