Understanding Differential Transformer Unchains Pretrained Self-Attentions

📅 2025-05-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Differential Transformers exhibit strong performance but suffer from opaque mechanisms and reliance on costly from-scratch pretraining, hindering reuse of existing pretrained weights. This work presents the first systematic dissection of differential attention's effectiveness, identifying three key mechanisms: (i) negative attention enhances representational capacity, (ii) it reduces multi-head redundancy, and (iii) it improves learning dynamics. Building on these insights, we propose DEX, a plug-and-play, lightweight adaptation framework that integrates differential attention into arbitrary pretrained language models without full retraining. DEX achieves efficient synergy by reusing softmax attention scores and applying a differential operation to the value matrix. Theoretical analysis and extensive experiments demonstrate that DEX consistently boosts performance across diverse benchmarks, requiring less than 0.01% task-specific adaptation data and incurring negligible training and inference overhead.

๐Ÿ“ Abstract
Differential Transformer has recently gained significant attention for its impressive empirical performance, often attributed to its ability to perform noise-canceled attention. However, precisely how differential attention achieves its empirical benefits remains poorly understood. Moreover, the Differential Transformer architecture demands large-scale training from scratch, hindering the use of open pretrained weights. In this work, we conduct an in-depth investigation of Differential Transformer, uncovering three key factors behind its success: (1) enhanced expressivity via negative attention, (2) reduced redundancy among attention heads, and (3) improved learning dynamics. Based on these findings, we propose DEX, a novel method to efficiently integrate the advantages of differential attention into pretrained language models. By reusing the softmax attention scores and adding a lightweight differential operation on the output value matrix, DEX incorporates the key advantages of differential attention while remaining lightweight in both training and inference. Evaluations confirm that DEX substantially improves pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (<0.01%).
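The abstract describes DEX as reusing a pretrained head's softmax attention scores and adding a lightweight differential operation on the value path. A minimal NumPy sketch of that idea follows; the exact DEX parameterization is not given here, so the learned value transform `W_diff` and the mixing weight `lam` are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dex_attention(Q, K, V, W_diff, lam=0.5):
    """Sketch of a DEX-style head: the softmax scores A come from the
    frozen pretrained attention; only the value path gets a lightweight
    differential term, output = A @ (V - lam * V @ W_diff).
    W_diff and lam are hypothetical stand-ins for DEX's learned parameters."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # reused pretrained attention scores
    V_diff = V @ W_diff                # small learned transform (assumed)
    return A @ (V - lam * V_diff)

# toy usage: one head, sequence length 4, head dim 8
rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
W_diff = rng.standard_normal((d, d)) * 0.01
out = dex_attention(Q, K, V, W_diff)
print(out.shape)  # (4, 8)
```

With `W_diff` initialized near zero the head reduces to the original softmax attention, which is one way such an adapter could be added without disturbing pretrained behavior at the start of adaptation.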
Problem

Research questions and friction points this paper is trying to address.

Understanding how Differential Transformer achieves noise-canceled attention
Enabling use of pretrained weights in Differential Transformer architecture
Integrating differential attention benefits into pretrained models efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates differential attention into pretrained models without from-scratch retraining
Uses lightweight operation on value matrix
Enhances performance with minimal adaptation data