Scaling Reasoning without Attention

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from inefficiency in complex reasoning—due to the quadratic complexity of Transformer attention—and lack structured guidance for fine-tuning on high-difficulty domains. Method: We propose an attention-free, efficient reasoning model built on the Mamba-2 state-space architecture (SSD layers), enabling constant-time and fixed-memory inference; complemented by a two-stage PromptCoT curriculum fine-tuning paradigm that integrates abstract concept selection, rationale-guided generation, and structured data synthesis to enhance interpretability and generalization. Results: Our 7B-parameter model significantly outperforms comparably sized Transformers and even the 27B Gemma-3 on AIME 2024/2025 and LiveCodeBench, achieving up to a 3.0% absolute improvement—marking the first empirical validation of attention-free architectures’ superiority and scalability in high-difficulty symbolic reasoning tasks.

📝 Abstract
Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
Problem

Research questions and friction points this paper is trying to address.

Quadratic-cost self-attention makes Transformer inference inefficient for long reasoning chains
Lack of structured fine-tuning data for high-difficulty reasoning domains
Need for fixed-memory, constant-time inference without self-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-free model built on Mamba-2 state space dual (SSD) layers
Two-phase curriculum fine-tuning strategy
Fixed-memory constant-time inference
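The fixed-memory, constant-time decoding property above can be sketched with a toy diagonal state-space recurrence. This is a hedged illustration only: `ssm_step` and the matrices `A`, `B`, `C` are invented stand-ins, not the paper's actual Mamba-2 SSD kernels.

```python
import numpy as np

# Toy diagonal state-space recurrence: each decoding step updates a
# fixed-size hidden state, so memory does not grow with sequence length
# (unlike a Transformer's key-value cache).

def ssm_step(state, x_t, A, B, C):
    """One decoding step: update the fixed-size state, emit one scalar output."""
    state = A * state + B * x_t   # elementwise recurrent update, O(d_state)
    y_t = C @ state               # readout
    return state, y_t

d_state = 4
A = np.full(d_state, 0.9)   # toy diagonal transition (decay)
B = np.ones(d_state)
C = np.ones(d_state)

state = np.zeros(d_state)
for x_t in [1.0, 0.5, -0.2]:  # stream tokens one at a time
    state, y_t = ssm_step(state, x_t, A, B, C)

# Working set stays at d_state floats per layer, independent of length.
```

Contrast with attention: a Transformer decoder must retain keys and values for every past token, so per-step cost and memory grow with context length, while the recurrence above touches only a constant-size state.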
🔎 Similar Papers
2024-02-26 · Annual Meeting of the Association for Computational Linguistics · Citations: 97