Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

πŸ“… 2025-08-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work investigates the causes of example ineffectiveness in in-context learning (ICL), revealing that examples contribute negligibly when their information is already internalized by the model or irrelevant to the query; moreover, multi-layer attention progressively amplifies the advantage of effective examples. Based on these findings, we propose a novel perspective: example effectiveness is modulated by the model’s prior knowledge. To formalize this insight, we develop a linear self-attention theoretical framework grounded in gradient flow analysis and design GradSβ€”a method that selects examples based on the gradient flow strength from each example to the query. Extensive experiments across four state-of-the-art large language models and five benchmark datasets demonstrate that GradS achieves an average relative improvement of 6.8% over the strongest baseline. Both theoretical analysis and empirical results exhibit strong consistency, validating our framework and method.

πŸ“ Abstract
Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) to inspire the design of related methods. However, existing work predominantly assumes that the demonstrations provided within ICL are effective, while much research indicates that not all demonstrations are, some failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either already been learned by the model or is irrelevant to the user query. Furthermore, we show that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on relevance to the user query while overlooking the information the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow from a demonstration to a given user query as the selection criterion, thereby ensuring the effectiveness of the chosen demonstrations. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations grows with model depth, substantiating our derivations. Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.
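The paper's exact scoring rule is not reproduced on this page, but the selection idea can be illustrated with a toy sketch: under a one-layer linear self-attention assumption, score each candidate demonstration by the magnitude of its attention contribution to the query's representation, then keep the top-scoring ones. All matrices and the `grad_flow_score` helper below are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Toy embeddings: 5 candidate demonstrations and one user query.
demos = rng.normal(size=(5, d))
query = rng.normal(size=(d,))

# Random projections standing in for trained key/query/value matrices.
W_K, W_Q, W_V = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))

def grad_flow_score(x, q):
    """Magnitude of demonstration x's contribution to the query's
    linear self-attention output: ||((x W_K) . (q W_Q)) * (x W_V)||."""
    weight = (x @ W_K) @ (q @ W_Q)      # unnormalized attention weight
    return np.linalg.norm(weight * (x @ W_V))

scores = np.array([grad_flow_score(x, query) for x in demos])
top_k = np.argsort(scores)[::-1][:3]   # keep the 3 strongest demonstrations
print(top_k)
```

A demonstration whose key is orthogonal to the projected query (irrelevant) or whose value contributes nothing new gets a small score, matching the paper's two stated causes of ineffectiveness.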
Problem

Research questions and friction points this paper is trying to address.

Investigates reasons behind ineffective demonstrations in in-context learning
Analyzes gradient flow and self-attention to identify ineffective demonstrations
Proposes GradS method to select effective demonstrations using gradient flow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer attention amplifies demonstration effectiveness disparity
GradS uses gradient flow for effective demonstration selection
Effectiveness disparity grows with increasing model layers
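The amplification claim above can be made concrete with a toy numerical illustration (not the paper's derivation): if each layer re-weights demonstrations multiplicatively by a relevance factor and renormalizes, the gap between an effective and an ineffective demonstration widens geometrically with depth. The relevance values here are arbitrary placeholders.

```python
import numpy as np

relevance = np.array([1.2, 1.0, 0.8])  # hypothetical per-layer gains
weights = np.ones(3) / 3               # uniform attention at layer 0

for _ in range(4):                     # four attention layers
    weights = weights * relevance      # each layer re-weights by relevance
    weights = weights / weights.sum()  # renormalize like softmax attention

# Ratio of most- to least-effective demonstration after 4 layers:
print(weights[0] / weights[2])         # (1.2/0.8)**4 β‰ˆ 5.06, up from 1.0
```

Starting from a uniform 1:1 ratio, four layers stretch the disparity to about 5:1, which is the qualitative behavior the paper reports: deeper models concentrate attention on effective demonstrations.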
Authors
Dingzirui Wang, Harbin Institute of Technology (Semantic Parsing)
Xuangliang Zhang, Harbin Institute of Technology
Keyan Xu, Harbin Institute of Technology
Qingfu Zhu, Harbin Institute of Technology (NLP, Code LLM)
Wanxiang Che, Professor, Harbin Institute of Technology (Natural Language Processing)
Yang Deng, Singapore Management University