TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

📅 2025-10-13
🤖 AI Summary
Current large language models exhibit limited performance in autoformalisation—the translation of informal mathematical statements into formal proof languages—primarily due to the scarcity of high-quality informal–formal paired data and the substantial structural and syntactic gap between general-purpose code and formal mathematics. This work introduces TopoAlign, a framework that constructs structurally analogous training data for Math LLMs from general-purpose code repositories without human annotation. TopoAlign employs topological decomposition to disentangle code into docstrings, main functions, and dependency functions, then reassembles these components into representations that structurally mirror formal statements. Evaluated on the minif2f, Putnam, and ProofNet benchmarks, TopoAlign substantially improves DeepSeek-Math's performance: BEq@10 by 17.77% and typecheck@10 by 68.82%. The approach establishes a scalable, structure-aware training paradigm for mathematical foundation models.

📝 Abstract
Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with autoformalisation: translating informal mathematical statements into formal ones
Structural and syntactic differences between code and formal mathematics limit transfer learning
Large-scale corpora of paired informal–formal statements are scarce
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes code into docstrings, main functions, and dependency functions
Reassembles these components into analogues that structurally mirror formal statements
Produces aligned training data from code repositories without human annotation
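The decomposition-and-reassembly idea can be illustrated with a minimal sketch. This is not the authors' actual pipeline; it is a hypothetical Python example using the standard `ast` module, where the example functions (`gcd`, `is_coprime`) and the reassembled layout (dependencies first, like imported lemmas; then the docstring as the informal statement; then the main function as the formal analogue) are illustrative assumptions:

```python
import ast
import textwrap

# Hypothetical source code standing in for a repository file.
SOURCE = textwrap.dedent('''
    def gcd(a, b):
        """Return the greatest common divisor of a and b."""
        while b:
            a, b = b, a % b
        return a

    def is_coprime(a, b):
        """Return True if a and b share no common factor > 1."""
        return gcd(a, b) == 1
''')

def decompose(source, main_name):
    """Split a module into (docstring, main function, dependency functions)
    for one chosen main function -- the three components TopoAlign names."""
    tree = ast.parse(source)
    funcs = {f.name: f for f in tree.body if isinstance(f, ast.FunctionDef)}
    main = funcs[main_name]
    doc = ast.get_docstring(main) or ""
    # Names called inside the main function that resolve to sibling functions.
    called = {
        node.func.id
        for node in ast.walk(main)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    deps = [funcs[n] for n in sorted(called) if n in funcs and n != main_name]
    return doc, main, deps

def reassemble(doc, main, deps):
    """Reassemble the pieces into a statement-like analogue: dependencies
    first (akin to imported lemmas), then the docstring as the informal
    statement, then the main function as the formal counterpart."""
    parts = [ast.unparse(d) for d in deps]
    parts.append(f"-- informal: {doc}" if doc else "-- informal: (none)")
    parts.append(ast.unparse(main))
    return "\n\n".join(parts)

doc, main, deps = decompose(SOURCE, "is_coprime")
print(reassemble(doc, main, deps))
```

The printed analogue pairs the informal docstring with the main function while keeping its dependency (`gcd`) visible, mimicking how a formal statement sits alongside the lemmas it references.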