Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

📅 2025-11-16

🤖 AI Summary

This work investigates how robust large language models (LLMs) are at autoformalizing mathematical statements, specifically how stable their outputs remain under semantically equivalent but differently worded natural language inputs. Method: The authors generate semantics-preserving paraphrases of problem statements from the MiniF2F benchmark and the Lean 4 version of ProofNet, then have two modern LLMs formalize them, cross-evaluating each model on the paraphrases and scoring the outputs for semantic and compilation validity. Contribution/Results: Even when the paraphrases stay semantically close to the originals, performance varies across them, indicating that minor changes in wording can significantly affect the formalized output. The findings point to a robustness gap in current autoformalization systems and provide an evaluation setup for measuring it.

📝 Abstract

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when a high degree of semantic fidelity is preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs that generate formal proofs from semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), together with two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
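
The cross-evaluation described in the abstract can be pictured as bookkeeping over individual formalization attempts. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Result` record, its field names, and both helper functions are hypothetical, and the compile and semantic verdicts are assumed to come from an external Lean 4 build and an external semantic check that are not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One formalization attempt. All field names are hypothetical."""
    problem_id: str
    paraphraser: str   # model that wrote the paraphrase, or "original" for the unmodified statement
    formalizer: str    # model that produced the Lean 4 output
    compiles: bool     # verdict of an external compilation check (assumed)
    semantic_ok: bool  # verdict of an external semantic-validity check (assumed)


def pass_rates(results):
    """Return {(paraphraser, formalizer): (compile_rate, semantic_rate)}."""
    groups = defaultdict(list)
    for r in results:
        groups[(r.paraphraser, r.formalizer)].append(r)
    rates = {}
    for key, rs in groups.items():
        compile_rate = sum(r.compiles for r in rs) / len(rs)
        semantic_rate = sum(r.semantic_ok for r in rs) / len(rs)
        rates[key] = (compile_rate, semantic_rate)
    return rates


def flipped_problems(results, formalizer):
    """Problems whose compile outcome differs across input variants for one formalizer."""
    outcomes = defaultdict(set)
    for r in results:
        if r.formalizer == formalizer:
            outcomes[r.problem_id].add(r.compiles)
    return sorted(pid for pid, seen in outcomes.items() if len(seen) > 1)
```

Grouping by the (paraphraser, formalizer) pair keeps the "cross" in cross-evaluation visible: each model's output can be compared on the original statements, on its own paraphrases, and on the other model's paraphrases. The list returned by `flipped_problems` is one simple way to surface the per-problem variability the abstract refers to.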

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM robustness in autoformalization using paraphrased inputs

Testing semantic and compilation validity of generated formal proofs

Measuring performance variability across semantically similar statements

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLM robustness via semantic paraphrasing

Cross-validating paraphrased statements across models

Measuring semantic and compilation validity metrics
