When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing compression methods for large reasoning models (LRMs) lack a systematic evaluation of their effect on reasoning capability. Method: the work conducts a unified assessment of quantization (down to 1.58 bits via dynamic quantization), pruning (SparseGPT at several sparsity levels), and knowledge distillation (LLaMA- and Qwen-based students) applied to DeepSeek-R1, evaluated on four challenging reasoning benchmarks: AIME 2024 (mathematical reasoning), FOLIO (first-order logic), Temporal Sequences from BIG-Bench Hard, and MuSiQue (multi-hop reasoning). Contribution/Results: parameter reduction primarily impairs knowledge retention rather than core reasoning ability; shorter reasoning chains generally correlate with higher accuracy, challenging the "longer thinking = stronger reasoning" assumption; and test-time token consumption is negatively associated with performance on several benchmarks. The paper provides the first reasoning-centric evaluation of compressed LRMs and empirically characterizes the trade-off between compression ratio and reasoning fidelity.
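To give a rough sense of what those bit-widths mean in practice, the sketch below estimates the weight-only memory footprint, assuming DeepSeek-R1's roughly 671B parameters and ignoring KV cache, activations, and quantization metadata overhead, so real checkpoint sizes will differ:

```python
# Back-of-the-envelope weight-memory estimate for the quantized R1 variants.
# Assumes ~671B parameters (DeepSeek-R1) and ignores KV cache, activations,
# and per-tensor quantization metadata, so actual checkpoints will be larger.
PARAMS = 671e9

for bits in (16, 2.51, 1.73, 1.58):
    gib = PARAMS * bits / 8 / 2**30
    print(f"{bits:>5} bits/weight ≈ {gib:,.0f} GiB of weights")
```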

Technology Category

Model compression (dynamic quantization, SparseGPT pruning, knowledge distillation)

Application Category

Complex reasoning evaluation (mathematical, logical, temporal, and multi-hop benchmarks)
📝 Abstract
Recent open-source large reasoning models (LRMs) exhibit strong performance on complex reasoning tasks, but their large parameter count makes them prohibitively expensive for individuals. The compression of large language models (LLMs) offers an effective solution for reducing the cost of computational resources. However, systematic studies on the performance of compressed LLMs in complex reasoning tasks, especially for LRMs, are lacking. Most works on quantization and pruning focus on preserving language modeling performance, while existing distillation works do not comprehensively benchmark student models based on reasoning difficulty or compression impact on knowledge and reasoning. In this paper, we benchmark compressed DeepSeek-R1 models on four different reasoning datasets (AIME 2024, FOLIO, Temporal Sequences of BIG-Bench Hard, and MuSiQue), ranging from mathematical to multi-hop reasoning, using quantization, distillation, and pruning methods. We benchmark 2.51-, 1.73-, and 1.58-bit R1 models that adopt dynamic quantization. We also benchmark distilled R1 models that are based on LLaMA or Qwen and run SparseGPT on them to obtain various sparsity levels. Studying the performance and behavior of compressed LRMs, we report their performance scores and test-time compute (number of tokens spent on each question). Notably, using MuSiQue, we find that parameter count has a much greater impact on LRMs' knowledge memorization than on their reasoning capability, which can inform the choice of compression techniques. Through our empirical analysis of test-time compute, we find that shorter model outputs generally achieve better performance than longer ones across several benchmarks for both R1 and its compressed variants, highlighting the need for more concise reasoning chains.
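A minimal sketch of the test-time-compute measurement described in the abstract (counting the tokens a model spends on each question) for one of the distilled students; the model ID, question, and generation settings are illustrative stand-ins, not the paper's exact evaluation setup:

```python
# Count how many tokens a distilled R1 student spends answering one question.
# Illustrative only: the paper's prompts, decoding settings, and datasets differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # one distilled student
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "If 3x + 5 = 20, what is x?"  # stand-in for an AIME/FOLIO/MuSiQue item
inputs = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=4096, do_sample=False)
generated = out[0, inputs.shape[-1]:]  # tokens produced after the prompt
print("answer:", tok.decode(generated, skip_special_tokens=True))
print("test-time compute:", generated.shape[0], "tokens")
```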
Problem

Research questions and friction points this paper is trying to address.

Evaluating compressed large reasoning models on complex reasoning tasks
Assessing how compression affects knowledge memorization versus reasoning capability
Comparing quantization, pruning, and distillation on reasoning benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

First unified benchmark of quantized (2.51-, 1.73-, and 1.58-bit dynamic quantization), SparseGPT-pruned, and distilled (LLaMA/Qwen-based) DeepSeek-R1 variants on complex reasoning tasks (an illustrative, simplified pruning sketch follows this list)
Evidence from MuSiQue that parameter reduction hurts knowledge memorization more than reasoning capability
Test-time compute analysis showing that shorter reasoning chains generally outperform longer ones
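The paper prunes the distilled students with SparseGPT; the sketch below is a deliberately simplified stand-in (plain unstructured magnitude pruning, not the SparseGPT reconstruction algorithm) that zeroes the smallest-magnitude weights of each linear layer to reach a target sparsity and reports the sparsity actually achieved:

```python
# Illustrative unstructured magnitude pruning (NOT SparseGPT, which solves a
# layer-wise weight-reconstruction problem): zero the smallest-magnitude
# weights of every nn.Linear until roughly a target fraction is zero.
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, sparsity: float = 0.5) -> float:
    zero, total = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)  # number of weights to drop
            if k > 0:
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).to(w.dtype))
            zero += (w == 0).sum().item()
            total += w.numel()
    return zero / total  # achieved overall sparsity

# Tiny toy model just to show the call; a compressed LRM would be loaded instead.
toy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
print(f"achieved sparsity: {magnitude_prune_(toy, 0.5):.2%}")
```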