🤖 AI Summary
Public large language model (LLM) services suffer from uncharacterized failures and inconsistent recovery behaviors, undermining their reliability as critical infrastructure.
Method: We conduct a longitudinal, empirical study of eight major LLM platforms via continuous multi-source health probing, temporal anomaly detection, statistical modeling, and co-occurrence analysis—building the first multi-vendor, long-duration dataset on LLM service reliability.
Results: We uncover previously undocumented weekly and monthly failure periodicities; quantify significant cross-vendor disparities in recovery latency and fault isolation efficacy (e.g., ChatGPT exhibits low failure frequency but slow recovery, whereas Claude suffers frequent failures with poor isolation); and derive over ten reproducible findings. We release a FAIR-compliant dataset and open-source analytical tools, establishing the first empirical foundation for resilience-aware LLM system design and operational optimization.
📝 Abstract
People and businesses increasingly rely on public LLM services such as ChatGPT, DALL·E, and Claude. Understanding their outages, and in particular measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this gap, we conduct an empirical characterization of outages and failure recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We analyze in detail the statistical properties of failure recovery, its temporal patterns, failure co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
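The first observation, fewer but longer-lasting outages for ChatGPT versus Claude, rests on two per-service metrics: failure frequency and mean time to recovery (MTTR). The sketch below shows how such metrics could be computed from incident records; it is an illustrative assumption, not the paper's actual analysis code, and the incident data are made up.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (service, start, end), mimicking the kind of
# status-page incident data the study collects. Values are illustrative only.
incidents = [
    ("ChatGPT", datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 13, 30)),
    ("ChatGPT", datetime(2024, 3, 18, 9, 0), datetime(2024, 3, 18, 12, 0)),
    ("Claude",  datetime(2024, 3, 5, 11, 0), datetime(2024, 3, 5, 11, 40)),
    ("Claude",  datetime(2024, 3, 6, 15, 0), datetime(2024, 3, 6, 15, 50)),
    ("Claude",  datetime(2024, 3, 20, 8, 0), datetime(2024, 3, 20, 9, 0)),
]

def recovery_stats(records):
    """Return {service: (incident count, mean time-to-recovery in hours)}."""
    per_service = {}
    for service, start, end in records:
        hours = (end - start).total_seconds() / 3600
        per_service.setdefault(service, []).append(hours)
    return {s: (len(d), mean(d)) for s, d in per_service.items()}

stats = recovery_stats(incidents)
# In this toy data, ChatGPT has fewer incidents but a longer MTTR than Claude,
# matching the direction of the paper's observation (1).
```

The same grouping step generalizes to the paper's other analyses, e.g., bucketing incident start times by weekday to look for the weekly periodicity reported in observation (2).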