🤖 AI Summary
Formal verification languages (e.g., F*, Verus) suffer from low adoption rates and a lack of empirically grounded design principles for AI-assisted tools, largely due to steep learning curves and opaque expert practices. To address this, we conduct the first systematic study of expert proof development in F* and Verus, combining fine-grained code telemetry with qualitative user studies. Our analysis uncovers three core proof strategies and several predictive engineering practices—including specification-first development and explicit subgoal decomposition. Leveraging these human-centered insights, we formulate design principles for AI-powered proof assistants and build an evidence-driven F* proof agent prototype. Empirical evaluation demonstrates that our agent significantly outperforms baseline large language models in both task completion rate and efficiency, validating the critical role of cognitive and behavioral insights in advancing AI-augmented formal verification.
📝 Abstract
Proof-oriented programming languages (POPLs) empower developers to write code alongside formal correctness proofs, providing formal guarantees that the code adheres to specified requirements. Despite their powerful capabilities, POPLs present a steep learning curve and have not yet been adopted by the broader software community. The lack of understanding about the proof-development process and how expert proof developers interact with POPLs has hindered the advancement of effective proof engineering and the development of proof-synthesis models/tools.
In this work, we conduct a user study involving the collection and analysis of fine-grained source code telemetry from eight experts working with two languages, F* and Verus. The results reveal notable trends and patterns in how experts reason about proofs, along with key challenges encountered during the proof development process. We identify three distinct strategies and multiple informal practices that are not captured in final code snapshots, yet are predictive of task outcomes. We translate these findings into concrete design guidance for AI proof assistants: bias toward early specification drafting, explicit sub-goal decomposition, bounded active errors, and disciplined verifier interaction. We also present a case study of an F* proof agent grounded in these recommendations, and demonstrate improved performance over baseline LLMs.