Scaling Reasoning without Attention

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from inefficiency in complex reasoning—due to the quadratic complexity of Transformer attention—and lack structured guidance for fine-tuning on high-difficulty domains. Method: We propose an attention-free, efficient reasoning model built on the Mamba-2 state-space architecture (SSD layers), enabling constant-time and fixed-memory inference; complemented by a two-stage PromptCoT curriculum fine-tuning paradigm that integrates abstract concept selection, rationale-guided generation, and structured data synthesis to enhance interpretability and generalization. Results: Our 7B-parameter model significantly outperforms comparably sized Transformers and even the 27B Gemma-3 on AIME 2024/2025 and LiveCodeBench, achieving up to a 3.0% absolute improvement—marking the first empirical validation of attention-free architectures’ superiority and scalability in high-difficulty symbolic reasoning tasks.

📝 Abstract
Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
Problem

Research questions and friction points this paper is trying to address.

Quadratic-cost self-attention makes Transformer inference inefficient for long reasoning chains
Lack of structured fine-tuning data for high-difficulty reasoning domains
Need for fixed-memory, constant-time inference without self-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-free model built on Mamba-2 state space dual (SSD) layers
Two-phase curriculum fine-tuning strategy
Fixed-memory constant-time inference
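The fixed-memory, constant-time decoding property above can be sketched with a toy diagonal state-space recurrence. This is a hedged illustration only: `ssm_step` and the matrices `A`, `B`, `C` are invented stand-ins, not the paper's actual Mamba-2 SSD kernels.

```python
import numpy as np

# Toy diagonal state-space recurrence: each decoding step updates a
# fixed-size hidden state, so memory does not grow with sequence length
# (unlike a Transformer's key-value cache).

def ssm_step(state, x_t, A, B, C):
    """One decoding step: update the fixed-size state, emit one scalar output."""
    state = A * state + B * x_t   # elementwise recurrent update, O(d_state)
    y_t = C @ state               # readout
    return state, y_t

d_state = 4
A = np.full(d_state, 0.9)   # toy diagonal transition (decay)
B = np.ones(d_state)
C = np.ones(d_state)

state = np.zeros(d_state)
for x_t in [1.0, 0.5, -0.2]:  # stream tokens one at a time
    state, y_t = ssm_step(state, x_t, A, B, C)

# Working set stays at d_state floats per layer, independent of length.
```

Contrast with attention: a Transformer decoder must retain keys and values for every past token, so per-step cost and memory grow with context length, while the recurrence above touches only a constant-size state.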
🔎 Similar Papers
2024-02-26 · Annual Meeting of the Association for Computational Linguistics · Citations: 97