🤖 AI Summary
This work addresses the challenge of detecting silent failures, i.e., subtle semantic violations that raise no explicit error, in complex software systems, where manually writing runtime checkers is expensive and does not scale. The authors propose an approach that combines large language model synthesis, static program analysis, and dynamic validation to automatically derive stateful runtime checkers from existing test cases. These checkers generalize the properties encoded in a test to arbitrary executions by monitoring specific method calls and asserting properties that should hold when they are called. Evaluated on four widely used complex systems, the method inferred 334 checkers from 400 tests, of which 300 were found correct via cross-validation, and detected 5.2 times more errors than a state-of-the-art approach.
📝 Abstract
Complex software systems often suffer from silent failures, i.e., violations of the intended semantics that do not cause explicit errors. A promising approach to detect such errors is to use system-specific runtime checkers that monitor the execution of a system and check for violations of the intended semantics. However, writing such checkers for a given software system is challenging and time-consuming, and hence, rarely done in practice. This work presents FlyCatcher, an automated approach to derive runtime checkers from existing tests, i.e., from a resource available for most software systems. The critical challenge of such an approach is to generalize the behavioral properties encoded in a test case to arbitrary executions of a system. FlyCatcher addresses this challenge through a combination of LLM-based synthesis, static analysis, and dynamic validation, which infers a checker that monitors specific method calls and asserts properties that should hold when they are called. The inferred checkers are stateful, i.e., they reason about the system's behavior by maintaining a shadow state that abstracts the actual system state as needed by the checker. Our evaluation applies FlyCatcher to 400 tests from four widely used, complex software systems. The approach infers 334 checkers, out of which 300 are found to be correct via cross-validation. Compared with a state-of-the-art approach, our approach infers 2.6x more correct checkers, which enables it to detect 5.2x more errors. By contributing to the automated inference of runtime checkers from tests, this work enables the broader adoption of runtime checking as a practical approach to detect silent failures in complex software systems.
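The abstract does not show a concrete checker, so the following is only a minimal sketch of what a stateful checker with a shadow state might look like. Everything in it is an assumption made for illustration: the toy `KVStore`, the `ShadowStateChecker` and `Monitored` classes, and the injected `buggy_delete` fault are hypothetical and are not taken from FlyCatcher or from the systems in its evaluation. The checker here is written by hand, whereas the paper infers checkers automatically from tests.

```python
class KVStore:
    """Toy system under monitoring (hypothetical, not from the paper)."""

    def __init__(self, buggy_delete=False):
        self._data = {}
        self._buggy_delete = buggy_delete

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        # Injected silent failure: the call returns normally but removes nothing.
        if not self._buggy_delete:
            self._data.pop(key, None)


class ShadowStateChecker:
    """Stateful checker: keeps a shadow state abstracting the store's contents."""

    def __init__(self):
        self.shadow = {}      # abstraction of the real system state
        self.violations = []  # recorded semantic violations

    def after_call(self, method, args, result):
        if method == "put":
            key, value = args
            self.shadow[key] = value
        elif method == "delete":
            self.shadow.pop(args[0], None)
        elif method == "get":
            # Property: a read must agree with what the shadow state predicts.
            expected = self.shadow.get(args[0])
            if result != expected:
                self.violations.append(
                    f"get({args[0]!r}) returned {result!r}, expected {expected!r}"
                )


class Monitored:
    """Forwards calls to the target and reports each call to the checker."""

    def __init__(self, target, checker):
        self._target = target
        self._checker = checker

    def __getattr__(self, name):
        method = getattr(self._target, name)

        def wrapper(*args):
            result = method(*args)
            self._checker.after_call(name, args, result)
            return result

        return wrapper


def run_workload(store):
    """An arbitrary execution, not the test a checker would be derived from."""
    store.put("a", 1)
    store.put("b", 2)
    store.delete("a")
    store.get("a")
    store.get("b")


def check(buggy_delete):
    """Run the workload under monitoring and return the recorded violations."""
    checker = ShadowStateChecker()
    run_workload(Monitored(KVStore(buggy_delete=buggy_delete), checker))
    return checker.violations
```

Calling `check(False)` yields no violations, while `check(True)` records one for the read of the key whose deletion silently failed, even though no exception is ever raised. The design point is that the checker never inspects the store's internals: it observes only method calls and results, and tracks just enough state to judge them.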