Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

📅 2025-07-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether gains in large language models' (LLMs) mathematical reasoning transfer to other tasks. Method: The authors evaluate more than 20 open-weight reasoning-tuned LLMs on math, scientific question answering, agent planning, coding, and instruction following, then run controlled experiments on Qwen3-14B models trained on math-only data with either supervised fine-tuning (SFT) or reinforcement learning (RL). Latent-space representation analysis and token-space distribution shift analysis are used to diagnose how each tuning method changes the model. Contribution/Results: Stronger mathematical reasoning does not automatically improve general-purpose capabilities, and most models that succeed in math fail to transfer those gains to other domains. SFT-tuned models often forget general capabilities, whereas RL-tuned models generalize well across domains. SFT induces substantial representation and output drift, while RL preserves general-domain structure. The results indicate that the tuning method strongly influences capability transfer and suggest rethinking post-training recipes that rely on SFT-distilled data.

📝 Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
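The token-space distribution shift mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which divergence the authors use, so the version below assumes KL divergence between the next-token distributions of a tuned model and its base model, averaged over positions. The function names `kl_divergence` and `mean_token_shift` and the toy distributions are hypothetical and exist only for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Clamp q to avoid log(0) when the reference assigns zero mass.
            total += pi * math.log(pi / max(qi, eps))
    return total


def mean_token_shift(base_dists, tuned_dists):
    """Average per-position KL(tuned || base) over a sequence of positions.

    Assumption: each element is a next-token probability vector obtained by
    running the base and tuned models on the same prompt prefix.
    """
    if not base_dists or len(base_dists) != len(tuned_dists):
        raise ValueError("need the same non-zero number of positions")
    shifts = [kl_divergence(t, b) for b, t in zip(base_dists, tuned_dists)]
    return sum(shifts) / len(shifts)


# Toy three-token vocabulary; the numbers are invented, not results from the paper.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
small_drift = [[0.68, 0.21, 0.11], [0.5, 0.3, 0.2]]
large_drift = [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]]

print(mean_token_shift(base, small_drift))
print(mean_token_shift(base, large_drift))
```

Under this assumed metric, a model whose outputs stay close to the base model yields a value near zero, which is the pattern the abstract attributes to RL, while larger values correspond to the output drift it attributes to SFT.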

Problem

Research questions and friction points this paper is trying to address.

Does math reasoning improve general problem-solving in LLMs?

Do math gains transfer to other domains like science and coding?

How do tuning methods affect generalization across domains?
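One way to make the transfer question concrete is to compare the relative gain on non-math tasks with the relative gain on math. The sketch below is an illustrative assumption, not the metric used in the paper: `relative_gain`, `transfer_ratio`, the task names, and all scores are hypothetical.

```python
def relative_gain(base, tuned):
    """Relative change of a benchmark score after tuning."""
    if base <= 0:
        raise ValueError("base score must be positive")
    return (tuned - base) / base


def transfer_ratio(base_scores, tuned_scores, source="math"):
    """Mean relative gain on non-source tasks divided by the gain on the source task.

    A value near 1 means other domains improved about as much as math,
    a value near 0 means no transfer, and a negative value means that
    general capabilities degraded while math improved.
    """
    source_gain = relative_gain(base_scores[source], tuned_scores[source])
    if source_gain <= 0:
        raise ValueError("ratio is only defined when the source task improved")
    others = [
        relative_gain(base_scores[task], tuned_scores[task])
        for task in base_scores
        if task != source
    ]
    if not others:
        raise ValueError("need at least one non-source task")
    return (sum(others) / len(others)) / source_gain


# Invented scores for illustration only.
base_scores = {"math": 50.0, "coding": 40.0, "science_qa": 60.0}
tuned_scores = {"math": 60.0, "coding": 44.0, "science_qa": 57.0}

print(transfer_ratio(base_scores, tuned_scores))
```

Normalizing by the math gain keeps the comparison fair between a model that improved math slightly and one that improved it a lot.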

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating transferability of math reasoning across domains

Comparing RL-tuned vs SFT-tuned generalization capabilities

Analyzing latent-space shifts from different tuning methods
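A latent-space shift can be approximated in several ways. As a deliberately simple stand-in for the representation analysis listed above, the sketch below measures how far the centroid of hidden-state vectors moves between a base model and a tuned model on the same inputs. This is an assumption for illustration, not the authors' procedure, and `centroid` and `centroid_shift` are hypothetical helper names.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_shift(base_hidden, tuned_hidden):
    """Euclidean distance between the centroids of two sets of hidden states.

    Assumption: both sets come from the same layer and the same prompts,
    so the distance reflects the effect of tuning rather than of the inputs.
    """
    cb = centroid(base_hidden)
    ct = centroid(tuned_hidden)
    if len(cb) != len(ct):
        raise ValueError("hidden states must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cb, ct)))


# Toy two-dimensional hidden states; invented values, not data from the paper.
base_hidden = [[0.0, 0.0], [2.0, 0.0]]
tuned_hidden = [[0.0, 3.0], [2.0, 5.0]]

print(centroid_shift(base_hidden, tuned_hidden))
```

Computing this separately for math prompts and general-domain prompts would show whether tuning moved representations only where it was trained or across the board.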
