🤖 AI Summary
This study addresses the challenge of program termination analysis, a classically undecidable problem that remains difficult in practical software verification, by presenting the first systematic evaluation of large language models (LLMs) on predicting the termination of C programs. Using the Termination category of the SV-Comp 2025 benchmark and test-time scaling techniques, we assess state-of-the-art models including GPT-5, Claude Sonnet-4.5, and Code World Model (CWM). Our results show that GPT-5 and Claude Sonnet-4.5 achieve performance comparable to the competition's top-performing tool, with CWM closely following the runner-up. However, the models struggle to generate valid termination witnesses, and their prediction accuracy degrades significantly as program length increases. This work thus reveals both the promise and the current limitations of LLMs in formal verification tasks.
📝 Abstract
Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable: no algorithm can determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination and sometimes fail to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination: with test-time scaling, GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as proof. Moreover, LLMs' performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.