An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Public large language model (LLM) services suffer from uncharacterized failures and inconsistent recovery behaviors, undermining their reliability as critical infrastructure. Method: We conduct a longitudinal, empirical study of eight major LLM platforms via continuous multi-source health probing, temporal anomaly detection, statistical modeling, and co-occurrence analysis—building the first multi-vendor, long-duration dataset on LLM service reliability. Results: We uncover previously undocumented weekly and monthly failure periodicities; quantify significant cross-vendor disparities in recovery latency and fault isolation efficacy (e.g., ChatGPT exhibits low failure frequency but slow recovery, whereas Claude suffers frequent failures with poor isolation); and derive over ten reproducible findings. We release a FAIR-compliant dataset and open-source analytical tools, establishing the first empirical foundation for resilience-aware LLM system design and operational optimization.

📝 Abstract
People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this problem, in this work we conduct an empirical characterization of outages and failure-recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of failure-recovery statistical properties, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
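The kind of failure-recovery analysis the abstract describes can be illustrated with a small sketch: computing per-service mean time to recovery (MTTR) and a weekday histogram of incident starts (to surface weekly periodicity) from status-page incident records. The field names and sample records below are hypothetical, not drawn from the released dataset.

```python
from collections import Counter
from datetime import datetime

# Illustrative incident records; real data would come from provider status pages.
incidents = [
    {"service": "ChatGPT", "start": "2024-03-04T10:00", "end": "2024-03-04T12:30"},
    {"service": "ChatGPT", "start": "2024-03-11T09:00", "end": "2024-03-11T10:00"},
    {"service": "Claude",  "start": "2024-03-05T14:00", "end": "2024-03-05T14:20"},
    {"service": "Claude",  "start": "2024-03-06T08:00", "end": "2024-03-06T08:30"},
    {"service": "Claude",  "start": "2024-03-12T16:00", "end": "2024-03-12T16:40"},
]

def mttr_minutes(records, service):
    """Mean time to recovery, in minutes, for one service."""
    durations = [
        (datetime.fromisoformat(r["end"]) - datetime.fromisoformat(r["start"])).total_seconds() / 60
        for r in records
        if r["service"] == service
    ]
    return sum(durations) / len(durations)

def weekday_histogram(records):
    """Count incident starts per weekday (0 = Monday) to expose weekly periodicity."""
    return Counter(datetime.fromisoformat(r["start"]).weekday() for r in records)
```

On this toy data, ChatGPT shows fewer but longer incidents than Claude, mirroring the paper's finding (1); a real analysis would add statistical modeling and co-occurrence tests over the full multi-month dataset.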
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Public Service
Fault Recovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Model Resilience
Chatbot Fault Analysis
Service Stability Insights
Xiaoyu Chu
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Sacheendra Talluri
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Qingxian Lu
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Alexandru Iosup
Professor of Computer Science, Vrije Universiteit Amsterdam
Distributed Systems · Performance Engineering · Cloud Computing · Big Data · Computer Ecosystems