Explaining GitHub Actions Failures with Large Language Models: Challenges, Insights, and Limitations

📅 2025-01-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
GitHub Actions failure logs are verbose and hard to interpret, significantly impeding fault localization and remediation. Method: This paper presents the first systematic evaluation of large language models (LLMs) for interpreting real-world CI/CD logs, introducing an experience-aware explanation generation paradigm tailored to developers' expertise levels. We adapt mainstream LLMs (e.g., GPT, Claude) via prompt engineering and conduct empirical user studies, expert evaluations, and qualitative analysis. Contribution/Results: Over 80% of developers deem LLM-generated explanations accurate and actionable for common errors. Novices prefer step-by-step remediation guidance, whereas experienced developers prioritize concise root-cause identification. LLMs perform robustly on single-step failures but struggle with multi-stage, dependency-heavy pipelines due to limited causal reasoning. Our findings establish a methodological foundation and empirical evidence for integrating LLMs into intelligent DevOps diagnostics.
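The paper's actual prompts are not reproduced in this summary. As a rough sketch of what experience-aware prompt engineering could look like, the following assumes an OpenAI-compatible Python client; the model name, persona wording, and log truncation limit are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of experience-aware prompt engineering for GA failure logs.
# Model name, persona wording, and the 8000-char truncation are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONAS = {
    # Novices in the study preferred step-by-step remediation guidance.
    "novice": (
        "Explain this GitHub Actions failure to a developer new to CI/CD. "
        "Give a numbered, step-by-step remediation plan."
    ),
    # Experienced developers preferred concise root-cause identification.
    "expert": (
        "State the root cause of this GitHub Actions failure in at most "
        "three sentences. Omit background explanations."
    ),
}

def explain_failure(log_text: str, experience: str = "novice") -> str:
    """Ask the LLM for an explanation tailored to the developer's experience."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper evaluates several LLMs
        messages=[
            {"role": "system", "content": PERSONAS[experience]},
            # Truncate very long logs up front; verbosity is the core problem.
            {"role": "user", "content": f"Failure log:\n{log_text[:8000]}"},
        ],
    )
    return response.choices[0].message.content
```

Selecting the persona from the developer's self-reported experience level is one simple way to realize the tailoring the paper describes; the study itself evaluates how well such explanations land with each group.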

πŸ“ Abstract
GitHub Actions (GA) has become the de facto tool that developers use to automate software workflows, seamlessly building, testing, and deploying code. Yet when GA fails, it disrupts development, causing delays and driving up costs. Diagnosing failures becomes especially challenging because error logs are often long, complex, and unstructured. Given these difficulties, this study explores the potential of large language models (LLMs) to generate correct, clear, concise, and actionable contextual descriptions (or summaries) for GA failures, focusing on developers' perceptions of their feasibility and usefulness. Our results show that over 80% of developers rated LLM explanations positively in terms of correctness for simpler/small logs. Overall, our findings suggest that LLMs can feasibly assist developers in understanding common GA errors, thus potentially reducing manual analysis. However, we also found that improved reasoning abilities are needed to support more complex CI/CD scenarios. Additionally, less experienced developers tend to be more positive about the described context, while seasoned developers prefer concise summaries. In sum, our work offers key insights for researchers enhancing LLM reasoning, particularly in adapting explanations to user expertise.
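As background for the kind of input the study deals with, GA failure logs can be retrieved programmatically. The sketch below uses two documented GitHub REST API endpoints; the helper name and token handling are assumptions for illustration, not artifacts of the study.

```python
# Hypothetical helper: fetch the log archive of the most recent failed
# workflow run via the GitHub REST API.
import io
import os
import zipfile

import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def latest_failed_run_logs(owner: str, repo: str) -> dict[str, str]:
    """Return {log filename: text} for the newest failed run of a repository."""
    runs = requests.get(
        f"{API}/repos/{owner}/{repo}/actions/runs",
        headers=HEADERS,
        params={"status": "failure", "per_page": 1},
        timeout=30,
    ).json()["workflow_runs"]
    if not runs:
        return {}
    # The logs endpoint redirects to a zip archive with one file per job step.
    archive = requests.get(
        f"{API}/repos/{owner}/{repo}/actions/runs/{runs[0]['id']}/logs",
        headers=HEADERS,
        timeout=30,
    )
    with zipfile.ZipFile(io.BytesIO(archive.content)) as zf:
        return {
            name: zf.read(name).decode("utf-8", "replace")
            for name in zf.namelist()
        }
```

Each downloaded archive holds one plain-text log per job step: the long, unstructured input whose interpretation the abstract identifies as the core difficulty.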
Problem

Research questions and friction points this paper is trying to address.

GitHub Actions
Error Interpretation
Development Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
GitHub Actions Error Interpretation
Developer Efficiency Improvement
Pablo Valenzuela-Toledo
Software Engineering Group, University of Bern, Bern, Switzerland; Universidad de La Frontera, Temuco, Chile
Chuyue Wu
Software Engineering Group, University of Bern, Bern, Switzerland
Sandro Hernandez
Software Engineering Group, University of Bern, Bern, Switzerland
Alexander Boll
Ph.D. student, Software Engineering Group, University of Bern
Automatic Programming, Open Science
Roman Machacek
Software Engineering Group, University of Bern, Bern, Switzerland
Sebastiano Panichella
Senior Computer Science Researcher at the University of Bern
Software Engineering (SE), Cloud Computing (CC), and Data Science (DS)
Timo Kehrer
University of Bern
Software Engineering