Multi-Layer Attention is the Amplifier of Demonstration Effectiveness

πŸ“… 2025-08-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work investigates the causes of example ineffectiveness in in-context learning (ICL), revealing that examples contribute negligibly when their information is already internalized by the model or irrelevant to the query; moreover, multi-layer attention progressively amplifies the advantage of effective examples. Based on these findings, we propose a novel perspective: example effectiveness is modulated by the model’s prior knowledge. To formalize this insight, we develop a linear self-attention theoretical framework grounded in gradient flow analysis and design GradSβ€”a method that selects examples based on the gradient flow strength from each example to the query. Extensive experiments across four state-of-the-art large language models and five benchmark datasets demonstrate that GradS achieves an average relative improvement of 6.8% over the strongest baseline. Both theoretical analysis and empirical results exhibit strong consistency, validating our framework and method.

πŸ“ Abstract
Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) to inspire the design of related methods. However, existing work predominantly assumes that the demonstrations provided within ICL are effective, while much research indicates that not all demonstrations are, some failing to yield any performance improvement during ICL. Therefore, in this paper, we investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either already been learned by the model or is irrelevant to the user query. Furthermore, we show that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as the number of layers increases, causing the model to focus more on effective ones. Considering that current demonstration selection methods primarily focus on relevance to the user query while overlooking the information the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow from a demonstration to a given user query as the selection criterion, thereby ensuring the effectiveness of the chosen demonstrations. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations grows with model depth, substantiating our derivations. Moreover, GradS achieves a relative improvement of 6.8% on average over the strongest baselines, demonstrating its effectiveness.
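The paper's exact scoring rule is not reproduced on this page, but the selection idea can be illustrated with a toy sketch: under a one-layer linear self-attention assumption, score each candidate demonstration by the magnitude of its attention contribution to the query's representation, then keep the top-scoring ones. All matrices and the `grad_flow_score` helper below are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Toy embeddings: 5 candidate demonstrations and one user query.
demos = rng.normal(size=(5, d))
query = rng.normal(size=(d,))

# Random projections standing in for trained key/query/value matrices.
W_K, W_Q, W_V = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))

def grad_flow_score(x, q):
    """Magnitude of demonstration x's contribution to the query's
    linear self-attention output: ||((x W_K) . (q W_Q)) * (x W_V)||."""
    weight = (x @ W_K) @ (q @ W_Q)      # unnormalized attention weight
    return np.linalg.norm(weight * (x @ W_V))

scores = np.array([grad_flow_score(x, query) for x in demos])
top_k = np.argsort(scores)[::-1][:3]   # keep the 3 strongest demonstrations
print(top_k)
```

A demonstration whose key is orthogonal to the projected query (irrelevant) or whose value contributes nothing new gets a small score, matching the paper's two stated causes of ineffectiveness.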
Problem

Research questions and friction points this paper is trying to address.

Investigates reasons behind ineffective demonstrations in in-context learning
Analyzes gradient flow and self-attention to identify ineffective demonstrations
Proposes GradS method to select effective demonstrations using gradient flow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer attention amplifies demonstration effectiveness disparity
GradS uses gradient flow for effective demonstration selection
Effectiveness disparity grows with increasing model layers
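The amplification claim above can be made concrete with a toy numerical illustration (not the paper's derivation): if each layer re-weights demonstrations multiplicatively by a relevance factor and renormalizes, the gap between an effective and an ineffective demonstration widens geometrically with depth. The relevance values here are arbitrary placeholders.

```python
import numpy as np

relevance = np.array([1.2, 1.0, 0.8])  # hypothetical per-layer gains
weights = np.ones(3) / 3               # uniform attention at layer 0

for _ in range(4):                     # four attention layers
    weights = weights * relevance      # each layer re-weights by relevance
    weights = weights / weights.sum()  # renormalize like softmax attention

# Ratio of most- to least-effective demonstration after 4 layers:
print(weights[0] / weights[2])         # (1.2/0.8)**4 β‰ˆ 5.06, up from 1.0
```

Starting from a uniform 1:1 ratio, four layers stretch the disparity to about 5:1, which is the qualitative behavior the paper reports: deeper models concentrate attention on effective demonstrations.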
Authors
Dingzirui Wang, Harbin Institute of Technology (Semantic Parsing)
Xuangliang Zhang, Harbin Institute of Technology
Keyan Xu, Harbin Institute of Technology
Qingfu Zhu, Harbin Institute of Technology (NLP, Code LLM)
Wanxiang Che, Professor, Harbin Institute of Technology (Natural Language Processing)
Yang Deng, Singapore Management University