AI Summary
This work addresses the challenge of verifying that natural language plans generated by large language models (LLMs) are consistent with their intended behavior. It introduces a novel automated framework that combines LLM-based semantic parsing with formal verification: advanced LLMs (e.g., GPT-5) translate natural language plans into Kripke structures and linear temporal logic (LTL) formulas, which are then validated by model checking. The key contribution is an end-to-end, fully automated translation from natural language to formally verifiable models. Evaluated on a simplified version of the PlanBench plan verification benchmark, the approach achieves a 96.3% F1 score, substantially outperforming existing baselines. The generated formal representations are almost always syntactically correct, although producing semantically perfect models remains open.
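To make the model-checking step concrete, here is a minimal sketch of checking LTL formulas against a plan. It is not the paper's implementation. It assumes a plan executes as a single linear sequence of states whose final state repeats forever, so the Kripke structure is a simple path with a self-loop at the end. The tuple encoding of formulas, the function name `holds`, and the blocks-world proposition names are all hypothetical choices made for illustration.

```python
# Formulas are nested tuples, e.g. ("G", ("ap", "safe")).
# Supported operators: ap, not, and, or, X (next), F (eventually),
# G (always), U (until). This encoding is an illustrative assumption.

def holds(formula, trace, i=0):
    """Evaluate an LTL formula at position i of a linear trace.

    `trace` is a list of sets of atomic propositions. The last state is
    assumed to repeat forever, so every position past the end behaves
    exactly like the last one.
    """
    last = len(trace) - 1
    i = min(i, last)
    op = formula[0]
    if op == "ap":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "X":
        return holds(formula[1], trace, i + 1)
    if op == "F":
        return any(holds(formula[1], trace, j) for j in range(i, last + 1))
    if op == "G":
        return all(holds(formula[1], trace, j) for j in range(i, last + 1))
    if op == "U":
        # phi U psi: psi must eventually hold, with phi holding until then.
        for j in range(i, last + 1):
            if holds(formula[2], trace, j):
                return True
            if not holds(formula[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")


# Hypothetical blocks-world plan: pick up block a, then stack it on b.
plan_trace = [
    {"hand_empty", "a_on_table"},
    {"holding_a"},
    {"hand_empty", "a_on_b"},
]

goal = ("F", ("ap", "a_on_b"))  # the goal is eventually reached
safety = ("G", ("not", ("and", ("ap", "holding_a"), ("ap", "a_on_b"))))

plan_is_valid = holds(goal, plan_trace) and holds(safety, plan_trace)
```

A plan is classified as valid when every formula holds on its trace. A general model checker also has to handle branching structures and arbitrary cycles, which this sketch deliberately leaves out.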
Abstract
We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report Accuracy, Precision, Recall, and F1 score. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.
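For reference, the four reported metrics can be computed from the confusion counts of a binary valid/invalid plan classifier. The sketch below uses the standard definitions. The function name and the counts are illustrative assumptions and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Zero denominators yield 0.0 instead of an error.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=3, fp=1, fn=1, tn=5)
```

F1 is informative here because it penalizes a verifier that accepts nearly every plan (high recall, low precision) as well as one that rejects nearly every plan.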