AI Summary
To address the low inference efficiency and high computational overhead of large language models on complex tasks such as mathematical reasoning, this paper proposes a multi-branch parallel decoding method that operates within a single output sequence. The core innovation is a custom-designed sparse attention mask that enables simultaneous token generation across multiple independent reasoning paths within one sequence, explicitly modeling parallel reasoning trajectories. Crucially, the method requires no architectural modifications or retraining, only decoding-time changes. Experiments on mainstream mathematical reasoning benchmarks demonstrate over 100% decoding speedup with negligible accuracy degradation. To our knowledge, this is the first work to achieve multi-path parallel reasoning under a single-sequence decoding framework, establishing a new paradigm for long-chain symbolic reasoning that jointly optimizes efficiency and accuracy.
Abstract
Recent advances in reasoning models have demonstrated significant accuracy improvements, particularly on complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask, processing them within a single sequence. Experimental results show that our method achieves over 100% speedup in decoding time while largely preserving accuracy.
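To make the masking idea concrete, here is a minimal NumPy sketch of what such a branch-aware sparse attention mask could look like. This is an illustration under assumed conventions, not the paper's implementation: it supposes the sequence is laid out as a shared prefix followed by the branches back to back, and the names `prefix_len` and `branch_lens` are hypothetical. Each branch token attends causally to the shared prefix and to earlier tokens of its own branch, but never to sibling branches, which is what allows the branches to be decoded as if they were independent sequences.

```python
import numpy as np

def parallel_branch_mask(prefix_len, branch_lens):
    """Boolean attention mask (True = may attend) for one sequence laid
    out as [shared prefix | branch 0 | branch 1 | ...].

    Prefix tokens use ordinary causal attention; each branch token sees
    the full prefix plus earlier tokens of its own branch only.
    """
    total = prefix_len + sum(branch_lens)
    mask = np.zeros((total, total), dtype=bool)

    # Ordinary causal attention over the shared prefix.
    for i in range(prefix_len):
        mask[i, : i + 1] = True

    # Each branch: attend to the prefix and causally within itself.
    start = prefix_len
    for blen in branch_lens:
        for i in range(start, start + blen):
            mask[i, :prefix_len] = True    # shared prefix is visible
            mask[i, start : i + 1] = True  # causal within own branch
        start += blen
    return mask

# Prefix of 2 tokens, two branches of 2 tokens each (positions 2-3 and 4-5).
m = parallel_branch_mask(2, [2, 2])
print(m.astype(int))
```

With a mask like this, one forward pass can score the next token of every branch at once, since no branch's logits depend on the other branches' tokens.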