ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

📅 2025-06-04
🤖 AI Summary
Large language models (LLMs) exhibit weak generalization in mathematical reasoning, and conventional token-level autoregressive modeling fails to capture the structured, stepwise nature of human mathematical problem-solving. Method: We propose ClozeMath, an equation-level cloze fine-tuning paradigm that predicts masked key equations within solution derivations, aligning more closely with human-like structured reasoning. The approach combines LLM-based text-infilling fine-tuning, chain-of-thought (CoT) prompting, and beam search decoding, complemented by a systematic ablation framework. Contribution/Results: Evaluated on GSM8K, MATH, and GSM-Symbolic, ClozeMath consistently outperforms the Masked Thought baseline across all metrics. The results indicate that equation-level modeling improves both accuracy and robustness in mathematical reasoning, establishing a training paradigm that enhances LLMs' mathematical capabilities through structural, derivation-aware supervision.
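To make the cloze-style training setup concrete, the sketch below builds an equation-infilling example from an annotated solution. This is a minimal illustration, not the paper's implementation: it assumes equations are delimited with GSM8K-style `<<...>>` calculator annotations, and the `<MASK>` sentinel token is a hypothetical placeholder name.

```python
import re

MASK = "<MASK>"  # hypothetical sentinel token for masked equations

def make_cloze_example(solution: str):
    """Turn an annotated solution into a cloze training pair:
    the input with each equation replaced by a mask token, and
    the list of masked equations as prediction targets."""
    targets = re.findall(r"<<([^>]*)>>", solution)          # extract equations
    masked = re.sub(r"<<[^>]*>>", MASK, solution)           # mask them in place
    return masked, targets

solution = "She earns 12/60 = <<12/60=0.2>>0.2 per minute, so 0.2*50 = <<0.2*50=10>>10 total."
masked_input, equation_targets = make_cloze_example(solution)
```

During fine-tuning, the model would be trained to reconstruct `equation_targets` from `masked_input`, analogous to a human completing a cloze exercise.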

๐Ÿ“ Abstract
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' mathematical reasoning via equation infilling
Improving model robustness and performance on math tasks
Analyzing architectural choices for better mathematical generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-infilling task for equation prediction
Fine-tuning LLMs with cloze exercises
Beam Search and Chain-of-Thought decoding
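The beam search decoding listed above can be sketched with a toy, self-contained implementation. This is not the paper's decoder: it assumes a hypothetical `score_fn(prefix, token)` returning a log-probability, and keeps only the `beam_width` best partial sequences at each step.

```python
def beam_search(score_fn, vocab, max_len=3, beam_width=2):
    """Greedy-beyond-greedy decoding: expand every beam with every
    token, then keep the beam_width highest-scoring sequences."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok in vocab:
                candidates.append((seq + [tok], logp + score_fn(seq, tok)))
        # prune to the top beam_width candidates by score
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best full sequence

# Toy scorer that always prefers token "a"
prefer_a = lambda seq, tok: 0.0 if tok == "a" else -1.0
best = beam_search(prefer_a, vocab=["a", "b"], max_len=2)
```

With a real model, `score_fn` would be the model's next-token log-probability; increasing `beam_width` trades compute for a broader search, which is the test-time scaling effect the summary refers to.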