Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) incur high computational overhead during mathematical reasoning because of long autoregressive generation, and existing efficient inference methods, though they preserve general language capabilities, significantly degrade math reasoning performance. Method: Caprese is a low-overhead, low-rank distillation framework targeting feed-forward network (FFN) blocks. It introduces lightweight, modular low-rank adapters without modifying the original model weights, adding only ~1% extra parameters and training on just 20K synthetically generated samples. Contribution/Results: By combining low-rank matrix decomposition, intra-layer modular integration, and an inference-efficient design, Caprese recovers most or all of the lost mathematical reasoning capability with no degradation on standard language tasks. It cuts active parameters by ~2 billion on Gemma 2 9B, reduces latency by >11% for 2048-token generation on Qwen 2.5 14B, and encourages more concise responses.
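The adapter scheme summarized above — small low-rank modules attached to FFN blocks while the base weights stay frozen — can be sketched as below. This is a minimal NumPy toy, not the paper's actual architecture: the dimensions, the ReLU FFN, and the zero-initialized residual path are illustrative assumptions.

```python
import numpy as np

# Toy dimensions (real LLM FFN blocks are far larger); rank r is small.
d_model, d_ffn, rank = 512, 2048, 8

rng = np.random.default_rng(0)

# Frozen original FFN weights: left unperturbed, as the paper claims.
W_up = rng.standard_normal((d_model, d_ffn)) * 0.02
W_down = rng.standard_normal((d_ffn, d_model)) * 0.02

# Low-rank adapter: two thin trainable matrices alongside the FFN.
A = rng.standard_normal((d_model, rank)) * 0.02  # down-projection to rank r
B = np.zeros((rank, d_model))                    # up-projection, zero-init

def ffn_with_adapter(x):
    """FFN output plus a rank-r residual correction; base weights untouched."""
    base = np.maximum(x @ W_up, 0.0) @ W_down    # illustrative ReLU FFN
    return base + (x @ A) @ B                    # adapter adds a rank-r update

x = rng.standard_normal((2, d_model))
y = ffn_with_adapter(x)

ffn_params = W_up.size + W_down.size
adapter_params = A.size + B.size
print(y.shape, adapter_params / ffn_params)      # adapter is <1% of FFN params
```

Because `B` starts at zero, the adapter initially leaves the model's outputs unchanged; distillation then trains only `A` and `B`, which keeps the added parameter count around the ~1% the summary reports.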

📝 Abstract
Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a low-cost distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily on feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost from efficient inference for thinking LLMs, and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>11% reduction to generate 2048 tokens with Qwen 2.5 14B) while encouraging response brevity.
Problem

Research questions and friction points this paper is trying to address.

Recover math capabilities lost from efficient inference methods
Reduce computational resources and time for LLM math reasoning
Maintain language task performance while restoring math reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank distillation for math reasoning recovery
Minimal additional parameters and synthetic training
Reduces active parameters and generation latency
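The core idea behind low-rank distillation — fit a small rank-r correction so the efficient model's outputs match the original model's on a modest synthetic set — has a simple closed-form analogue. The sketch below is a toy under assumed dimensions, not the paper's training procedure (which presumably uses gradient-based distillation): it fits a full linear correction to the teacher-minus-student residuals by least squares, then truncates it to rank r via SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 200, 64, 4  # toy: n synthetic samples, width d, target rank r

X = rng.standard_normal((n, d))  # synthetic distillation inputs

# Pretend the teacher-student gap is exactly a rank-r linear map (toy setup).
W_gap = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
R = X @ W_gap                    # residuals: teacher output minus student output

# Fit a full correction by least squares, then truncate to rank r with an SVD.
M, *_ = np.linalg.lstsq(X, R, rcond=None)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :r] * s[:r]             # d x r factor
B = Vt[:r]                       # r x d factor

# Relative error of the rank-r correction on the distillation set.
err = np.linalg.norm(X @ A @ B - R) / np.linalg.norm(R)
print(round(err, 6))
```

When the gap really is low-rank, the truncated fit recovers it almost exactly; in practice the residual of a nonlinear FFN is only approximately low-rank, which is why a small distillation set (the paper uses 20K samples) suffices to train the adapters.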