🤖 AI Summary
This work demonstrates that intensive domain-specific supervised fine-tuning can severely degrade a foundation model's general-purpose tool-calling capabilities: Goedel-Prover-V2, heavily specialized for formal mathematics, almost entirely loses the ability to produce valid function calls. To address this, the authors apply a lightweight fine-tuning strategy to the specialized model using only 100 agent trajectories collected in the Lean environment, where the model issues natural-language queries to search the Mathlib library for relevant theorems and lemmas. This small intervention restores cross-domain tool-calling proficiency, indicating that such capabilities are suppressed rather than irreversibly lost during heavy domain adaptation, and that the recovered skill transfers well beyond the recovery domain. Empirical results show substantial improvements: accuracy on the Berkeley Function Calling Leaderboard rises from near 0% to 83.8%, and ProofNet pass@32 increases from 21.51% to 25.81%.
📝 Abstract
Heavy supervised fine-tuning on a target domain can strongly suppress capabilities that were present in the base model. We study this phenomenon in formal mathematics using Goedel-Prover-V2, an open-source model heavily trained on 1.8 million formal-math examples. After domain specialization, the model almost completely loses its ability to produce valid tool calls, even when explicitly instructed to use tools, dropping from 89.4% function-calling accuracy in the base model to nearly 0%. We ask whether this agentic collapse is permanent or instead reversible. To answer this question, we fine-tune the specialized model on a small amount of Lean-specific tool-use data. Remarkably, as few as 100 agentic traces are sufficient to restore strong tool-calling behavior. Importantly, this recovery is not the result of reward hacking or benchmark-specific optimization: the recovery data is entirely drawn from the Lean setting, where the model uses natural-language queries to search the Mathlib library for relevant theorems and lemmas, yet the regained capability transfers well beyond that domain. In particular, these same 100 Lean-specific traces improve performance on the Berkeley Function Calling Leaderboard from near zero to 83.8%, approaching the base model's 89.4% despite the mismatch in task distribution and protocol. The recovered capability is also practically useful in-domain. On ProofNet, pass@32 improves from 21.51% to 25.81%. Together, these results show that heavy domain supervised fine-tuning can suppress general tool-use ability without permanently erasing it, and that a small amount of domain-specific agentic data can awaken dormant tool-use capabilities.
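The abstract describes a simple agentic protocol: the model emits a tool call whose argument is a natural-language query against Mathlib. The sketch below illustrates what measuring "valid tool-call" rates could look like in that setting. The tool name, schema, and validator are illustrative assumptions, not the paper's published interface:

```python
import json

# Hypothetical tool schema (illustrative; the paper does not publish its
# exact interface): one Mathlib search tool taking a natural-language query,
# matching the protocol described in the abstract.
MATHLIB_SEARCH_TOOL = {
    "name": "search_mathlib",
    "description": "Search Mathlib for theorems and lemmas matching "
                   "a natural-language query.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language description of the needed lemma.",
            }
        },
        "required": ["query"],
    },
}

def is_valid_tool_call(raw: str) -> bool:
    """Return True if a model emission is a well-formed call to the tool.

    A domain-collapsed model typically emits free-form prose here instead
    of structured JSON, which this validator rejects.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == MATHLIB_SEARCH_TOOL["name"]
        and isinstance(call.get("arguments"), dict)
        and "query" in call["arguments"]
    )

# A structured emission vs. the prose a collapsed model tends to produce.
good = ('{"name": "search_mathlib", '
        '"arguments": {"query": "sum of two even numbers is even"}}')
bad = "Let me think about which lemma applies here..."
print(is_valid_tool_call(good))  # True
print(is_valid_tool_call(bad))   # False
```

Scoring each emission this way yields the function-calling accuracy numbers the paper reports (89.4% for the base model, near 0% after specialization).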