🤖 AI Summary
This study investigates whether mathematical problem-solving capabilities can generalize to general-purpose reasoning tasks. Method: We systematically compare continual pretraining on mathematical text, instruction fine-tuning, and rule-based reinforcement learning (RL), specifically contrasting short-chain versus long-chain chain-of-thought (CoT) reasoning samples, augmented with self-reflection mechanisms. Results: Continual pretraining on mathematics alone yields only limited cross-task transfer; in contrast, rule-based RL incorporating long-chain CoT reasoning and structured self-reflection substantially improves zero-shot generalization across five mathematical and eight general reasoning benchmarks, whereas short-chain approaches show negligible or even negative transfer. Contribution: We propose a paradigm of long-chain reasoning combined with rule-based self-reflection, and provide the first systematic empirical validation of its strong cross-domain generalization to general reasoning. This framework offers a reproducible, interpretable, and scalable training pathway toward building transferable reasoning models.
📝 Abstract
There has been growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, including continual pretraining, instruction tuning, and rule-based reinforcement learning, across various data sources comprising both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model's reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.