Parallel Thinking, Sequential Answering: Bridging NAR and AR for Efficient Reasoning

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive (AR) models suffer from high latency in long-chain reasoning tasks (e.g., mathematical reasoning and code generation), whereas non-autoregressive (NAR) models yield inferior output quality due to their limited sequential modeling capacity. Method: This paper proposes an AR–NAR collaborative inference paradigm: a NAR model, specifically a discrete diffusion model, generates high-quality, structured chain-of-thought (CoT) intermediate steps in parallel; an AR model then produces the final answer conditioned on this precomputed reasoning path. This decouples reasoning-trace generation from answer generation, combining the parallel efficiency of NAR decoding with the sequential modeling fidelity of AR decoding. Contribution/Results: Evaluated on multiple challenging reasoning benchmarks, the method significantly reduces inference latency while improving overall performance by 26% over strong baselines. It demonstrates both computational efficiency and generalization without compromising output quality, validating the collaborative architecture for complex reasoning tasks.
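The two-stage pipeline described above can be sketched in heavily simplified form. This is a toy illustration of the control flow only, not the paper's implementation: `nar_generate_trace`, `ar_generate_answer`, and `hybrid_infer` are hypothetical stand-ins for a discrete diffusion model and an AR language model.

```python
# Hypothetical sketch of AR-NAR collaborative inference.
# Real systems would replace these stubs with a discrete diffusion
# model (NAR stage) and an autoregressive LM (AR stage).

def nar_generate_trace(question: str, length: int = 4, steps: int = 2) -> list[str]:
    """NAR stage stub: all positions of the chain-of-thought are
    refined in parallel over a few denoising steps, so latency
    scales with the number of steps, not with trace length."""
    trace = ["[MASK]"] * length
    for step in range(steps):
        # Parallel refinement: every position is updated in the same pass.
        trace = [f"step{i}:refine{step}" for i in range(length)]
    return trace

def ar_generate_answer(question: str, trace: list[str]) -> str:
    """AR stage stub: conditions on the precomputed reasoning trace
    and sequentially decodes only the short final answer."""
    context = question + " | " + " ".join(trace)  # trace prepended as context
    return f"answer(conditioned on {len(trace)} trace steps)"

def hybrid_infer(question: str) -> str:
    trace = nar_generate_trace(question)        # parallel: cheap, long CoT
    return ar_generate_answer(question, trace)  # sequential: short, precise
```

The key point the sketch captures is the division of labor: the long reasoning trace, which dominates AR latency, is produced in parallel, while sequential decoding is reserved for the short answer where coherence matters most.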

📝 Abstract
We study reasoning tasks through a framework that integrates autoregressive (AR) and non-autoregressive (NAR) language models. AR models, which generate text sequentially, excel at producing coherent outputs but often suffer from slow inference, particularly in reasoning-intensive domains such as mathematics and code, where lengthy chains of thought are required. In contrast, NAR models, such as discrete diffusion models, allow parallel generation and offer substantial speedups, though typically at the cost of reduced output quality. To address these limitations, we introduce a new paradigm in which an NAR model efficiently produces intermediate reasoning traces, which subsequently guide an AR model to deliver precise final answers. Experiments demonstrate that our approach yields a significant 26% improvement over strong baselines while substantially reducing inference cost.
Problem

Research questions and friction points this paper is trying to address.

Slow inference speed of autoregressive models in reasoning tasks
Reduced output quality of non-autoregressive parallel generation models
Bridging AR and NAR models for efficient high-quality reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

NAR models generate parallel reasoning traces
AR models produce sequential final answers
Hybrid approach combines efficiency with accuracy