🤖 AI Summary
Verifying the satisfiability of string constraints specified in natural language (NL) remains challenging: traditional SMT solvers face theoretical limitations and require labor-intensive formalization, while the efficacy of large language models (LLMs) for this task has not been systematically investigated.
Method: We propose an LLM-driven generate-and-check approach. State-of-the-art LLMs (e.g., GPT-4, Claude) both generate a candidate string satisfying the NL requirements (when deemed satisfiable) and, in parallel, synthesize two validators: a declarative SMT-LIB formula and an imperative Python checker. Validator feedback iteratively refines generation, yielding an end-to-end, falsifiable pipeline: NL requirement → concrete string instance → machine-checkable validation.
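To make the imperative-validator side concrete, here is a minimal sketch for a hypothetical toy requirement (not taken from the paper): "the string starts with 'ab', has length between 5 and 10, and contains no digits". A Python checker synthesized for such a requirement might look like:

```python
def check(s: str) -> bool:
    # Imperative checker for the toy requirement:
    # starts with "ab", length in [5, 10], no digit characters.
    return (
        s.startswith("ab")
        and 5 <= len(s) <= 10
        and not any(c.isdigit() for c in s)
    )

# Candidate strings proposed by the LLM are accepted or rejected:
print(check("abcde"))  # True: all three constraints hold
print(check("ab1de"))  # False: contains a digit
print(check("abc"))    # False: too short
```

In the pipeline, a rejection like `check("ab1de") == False` is fed back to the LLM to refine the next candidate.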
Results: The Python validators achieve 100% testing accuracy; integrating validation more than doubles the consistent string generation success rate and F1-score in some cases compared to baselines without checkers, substantially improving correctness and reliability.
📝 Abstract
Requirements over strings, commonly represented using natural language (NL), are particularly relevant for software systems due to their heavy reliance on string data manipulation. While individual requirements can usually be analyzed manually, verifying properties (e.g., satisfiability) over sets of NL requirements is especially challenging. Formal approaches (e.g., SMT solvers) may efficiently verify such properties, but are known to have theoretical limitations. Additionally, the translation of NL requirements into formal constraints typically requires significant manual effort. Recently, large language models (LLMs) have emerged as an alternative approach for formal reasoning tasks, but their effectiveness in verifying requirements over strings is less studied. In this paper, we introduce a hybrid approach that verifies the satisfiability of NL requirements over strings by using LLMs (1) to derive a satisfiability outcome (and a consistent string, if possible), and (2) to generate declarative (i.e., SMT) and imperative (i.e., Python) checkers, used to validate the correctness of (1). In our experiments, we assess the performance of four LLMs. Results show that LLMs effectively translate natural language into checkers, even achieving perfect testing accuracy for Python-based checkers. These checkers substantially help LLMs in generating a consistent string and accurately identifying unsatisfiable requirements, leading to more than doubled generation success rate and F1-score in certain cases compared to baselines without generated checkers.
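For the declarative modality, a synthesized checker takes the form of an SMT-LIB script over the theory of strings. Below is a hedged sketch (not from the paper) for a hypothetical toy requirement, "the string starts with 'ab', has length between 5 and 10, and contains no digits", which a string-capable solver such as Z3 or cvc5 could discharge:

```smt2
; Toy requirement: starts with "ab", length in [5, 10], no digit characters.
(declare-const s String)
(assert (str.prefixof "ab" s))
(assert (and (>= (str.len s) 5) (<= (str.len s) 10)))
; "No digits": s belongs to (any non-digit character)*.
(assert (str.in_re s (re.* (re.diff re.allchar (re.range "0" "9")))))
(check-sat)
(get-model)
```

A `sat` answer with a model for `s` yields a consistent string witness; `unsat` would mark the requirement set as unsatisfiable.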