🤖 AI Summary
Formal modeling of cyber-physical systems (e.g., robotics, autonomous vehicles) relies heavily on human experts, impeding automated, reliable reasoning. Method: This paper investigates whether large language models (LLMs) can automatically translate natural-language kinematics problems into formal differential game logic (dGL) models. We introduce a novel natural-language-to-formal benchmark of 20 problems targeting hybrid game logic with continuous dynamics, and propose a two-stage evaluation framework that combines syntactic validation with symbolic-execution-based semantic verification. Our approach uses LLMs to generate dGL models, iteratively refines them using parser feedback, and verifies semantic consistency via symbolic execution. Contribution/Results: On undergraduate-level kinematics problems, our method achieves up to a 70% success rate with best-of-five sampling. This work establishes the first quantifiable, reproducible benchmark and verification paradigm for LLM-driven autoformalization of physical systems.
📝 Abstract
Autonomous cyber-physical systems like robots and self-driving cars could greatly benefit from using formal methods to reason reliably about their control decisions. However, before a problem can be solved, it needs to be stated. This requires writing a formal physics model of the cyber-physical system, a complex task that traditionally requires human expertise and becomes a bottleneck.
This paper experimentally studies whether Large Language Models (LLMs) can automate the formalization process. A 20-problem benchmark suite is designed, drawing from undergraduate-level physics kinematics problems. In each problem, the LLM is given a natural-language description of the objects' motion and must produce a model in differential game logic (dGL). The model is (1) syntax-checked and iteratively refined based on parser feedback, and (2) semantically evaluated by checking whether symbolically executing the dGL formula recovers the solution to the original physics problem. A success rate of 70% (best over 5 samples) is achieved. We analyze failure cases, identifying directions for future improvement. This provides a first quantitative baseline for LLM-based autoformalization from natural language into a hybrid game logic with continuous dynamics.
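The two-stage loop described in the abstract can be sketched as follows. This is a minimal illustration only: the helper names (`llm_generate_dgl`, `parse_dgl`, `symbolic_check`) and their toy bodies are hypothetical placeholders, not the paper's actual implementation, which would call a real LLM, the dGL parser, and a symbolic-execution backend.

```python
# Sketch of the pipeline: (1) generate a candidate dGL model and refine it
# on parser feedback, then (2) verify semantics by symbolic execution.
# Every helper below is a hypothetical stub standing in for real components.

MAX_REFINEMENTS = 3  # assumed retry budget; the paper's setting may differ


def llm_generate_dgl(problem, feedback=None):
    """Stub for an LLM call producing a candidate dGL model.

    A real system would prompt the LLM with the problem text and, on
    retries, append the parser's error message as feedback.
    """
    if feedback is None:
        return "x>=0 -> [{x'=v&true} x>=0"   # first attempt: missing ']'
    return "x>=0 -> [{x'=v&true}] x>=0"      # "corrected" after feedback


def parse_dgl(model):
    """Stub syntax check; a real one would invoke the dGL parser."""
    ok = model.count("[") == model.count("]")  # toy well-formedness test
    return ok, ("" if ok else "unbalanced brackets")


def symbolic_check(model, expected):
    """Stub semantic check; the paper uses symbolic execution instead."""
    return model.replace(" ", "") == expected.replace(" ", "")


def formalize(problem, expected_solution):
    """Run the generate -> syntax-check -> refine -> verify loop."""
    feedback = None
    for _ in range(MAX_REFINEMENTS):
        model = llm_generate_dgl(problem, feedback)
        ok, err = parse_dgl(model)
        if not ok:
            feedback = err  # stage 1: iterate on parser feedback
            continue
        return symbolic_check(model, expected_solution)  # stage 2
    return False  # syntax never validated within the retry budget
```

In this toy run, the first candidate fails the syntax check, the refined candidate parses, and the pipeline then reports whether symbolic checking matches the expected solution.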