ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

📅 2025-06-04
🤖 AI Summary
Large language models (LLMs) exhibit weak generalization in mathematical reasoning, and conventional token-level autoregressive modeling fails to capture the structured, stepwise nature of human mathematical problem-solving. Method: We propose ClozeMath, an equation-level cloze fine-tuning paradigm that predicts masked key equations within solution derivations, aligning more closely with human-like structured reasoning. The approach combines LLM-based text-infilling fine-tuning, chain-of-thought (CoT) prompting, and beam search decoding, complemented by a systematic ablation framework. Contribution/Results: Evaluated on GSM8K, MATH, and GSM-Symbolic, ClozeMath consistently outperforms the Masked Thought baseline across all metrics. The results indicate that equation-level modeling improves both accuracy and robustness in mathematical reasoning, establishing a training paradigm that enhances LLMs' mathematical capabilities through structural, derivation-aware supervision.
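To make the cloze-style training setup concrete, the sketch below builds an equation-infilling example from an annotated solution. This is a minimal illustration, not the paper's implementation: it assumes equations are delimited with GSM8K-style `<<...>>` calculator annotations, and the `<MASK>` sentinel token is a hypothetical placeholder name.

```python
import re

MASK = "<MASK>"  # hypothetical sentinel token for masked equations

def make_cloze_example(solution: str):
    """Turn an annotated solution into a cloze training pair:
    the input with each equation replaced by a mask token, and
    the list of masked equations as prediction targets."""
    targets = re.findall(r"<<([^>]*)>>", solution)          # extract equations
    masked = re.sub(r"<<[^>]*>>", MASK, solution)           # mask them in place
    return masked, targets

solution = "She earns 12/60 = <<12/60=0.2>>0.2 per minute, so 0.2*50 = <<0.2*50=10>>10 total."
masked_input, equation_targets = make_cloze_example(solution)
```

During fine-tuning, the model would be trained to reconstruct `equation_targets` from `masked_input`, analogous to a human completing a cloze exercise.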

๐Ÿ“ Abstract
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' mathematical reasoning via equation infilling
Improving model robustness and performance on math tasks
Analyzing architectural choices for better mathematical generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-infilling task for equation prediction
Fine-tuning LLMs with cloze exercises
Beam Search and Chain-of-Thought decoding
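The beam search decoding listed above can be sketched with a toy, self-contained implementation. This is not the paper's decoder: it assumes a hypothetical `score_fn(prefix, token)` returning a log-probability, and keeps only the `beam_width` best partial sequences at each step.

```python
def beam_search(score_fn, vocab, max_len=3, beam_width=2):
    """Greedy-beyond-greedy decoding: expand every beam with every
    token, then keep the beam_width highest-scoring sequences."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok in vocab:
                candidates.append((seq + [tok], logp + score_fn(seq, tok)))
        # prune to the top beam_width candidates by score
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best full sequence

# Toy scorer that always prefers token "a"
prefer_a = lambda seq, tok: 0.0 if tok == "a" else -1.0
best = beam_search(prefer_a, vocab=["a", "b"], max_len=2)
```

With a real model, `score_fn` would be the model's next-token log-probability; increasing `beam_width` trades compute for a broader search, which is the test-time scaling effect the summary refers to.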