Some Attention is All You Need for Retrieval

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Hybrid architectures combining self-attention and State Space Model (SSM) layers are commonly assumed to exhibit functional redundancy or complementarity, particularly for information retrieval. Method: We conduct systematic ablation and sparsification experiments across model scales, evaluating retrieval performance under attention removal and progressive head pruning during both the prefilling and decoding phases. Contribution/Results: We demonstrate strict functional segregation: self-attention layers exclusively enable retrieval; removing them collapses retrieval accuracy to zero, with no compensation from SSM layers. Critically, retaining only ~15% of attention heads preserves >99% retrieval accuracy while maintaining 84% MMLU generalization performance. This is the first empirical evidence that hybrid architectures operate via a modular division of labor: self-attention specializes in long-range dependency modeling and retrieval, while SSMs handle other sequential modeling tasks. These findings challenge prevailing assumptions and establish a new paradigm for efficient architecture design and inference.
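The head-pruning setup described above can be sketched as masking individual attention heads and checking that the remaining heads still produce output. This is a minimal, illustrative sketch, not the paper's implementation: the toy multi-head attention, head count, and mask are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads, head_mask=None):
    """Toy multi-head attention; pruned heads (mask False) contribute zeros."""
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.zeros_like(q)
    for h in range(num_heads):
        if head_mask is not None and not head_mask[h]:
            continue  # this head is pruned: its output slice stays zero
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        out[:, sl] = softmax(scores) @ v[:, sl]
    return out

# Keep roughly 15% of heads (here 1 of 8), mirroring the sparsification regime
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 64))
mask = np.zeros(8, dtype=bool)
mask[0] = True
pruned = multi_head_attention(q, k, v, num_heads=8, head_mask=mask)
```

In the paper's experiments the surviving heads are chosen by importance rather than arbitrarily as here; the point of the sketch is only that masking is a drop-in operation on the per-head output slices.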

📝 Abstract
We demonstrate complete functional segregation in hybrid SSM-Transformer architectures: retrieval depends exclusively on self-attention layers. Across RecurrentGemma-2B/9B and Jamba-Mini-1.6, attention ablation causes catastrophic retrieval failure (0% accuracy), while SSM layers show no compensatory mechanisms even with improved prompting. Conversely, sparsifying attention to just 15% of heads maintains near-perfect retrieval while preserving 84% MMLU performance, suggesting self-attention specializes primarily for retrieval tasks. We identify precise mechanistic requirements for retrieval: needle tokens must be exposed during generation and sufficient context must be available during prefill or generation. This strict functional specialization challenges assumptions about redundancy in hybrid architectures and suggests these models operate as specialized modules rather than integrated systems, with immediate implications for architecture optimization and interpretability.
Problem

Research questions and friction points this paper is trying to address.

Self-attention layers exclusively handle retrieval in hybrid architectures
Attention ablation causes catastrophic retrieval failure without compensation
Retrieval requires needle token exposure and sufficient context availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-attention layers exclusively handle retrieval tasks
Sparsifying attention heads maintains retrieval with minimal performance loss
Retrieval requires needle token exposure and sufficient context
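The retrieval condition listed above (needle tokens exposed, sufficient context available) is typically measured with a needle-in-a-haystack harness. A minimal sketch follows; the prompt template, the `generate` stub, and the magic-number needle are illustrative assumptions, not the paper's actual evaluation code.

```python
def build_prompt(needle, filler_sentences, position):
    """Insert the needle at a given depth inside filler context."""
    filler = list(filler_sentences)
    filler.insert(position, needle)
    context = " ".join(filler)
    return f"{context}\nQuestion: What is the magic number? Answer:"

def retrieval_accuracy(generate, needle_answer, prompts):
    """Fraction of prompts whose generation contains the needle answer."""
    hits = sum(needle_answer in generate(p) for p in prompts)
    return hits / len(prompts)

# Stand-in "model" that echoes its input; a real run would call an LM here.
stub = lambda prompt: prompt
prompts = [
    build_prompt("The magic number is 7304.", ["Filler sentence."] * 10, pos)
    for pos in (0, 5, 10)  # needle at the start, middle, and end of context
]
acc = retrieval_accuracy(stub, "7304", prompts)
```

Sweeping the needle position and context length, then re-running under attention ablation or head masking, reproduces the kind of measurement the summary describes.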