AI Summary
This work investigates the causes of example ineffectiveness in in-context learning (ICL), revealing that examples contribute negligibly when their information is already internalized by the model or irrelevant to the query; moreover, multi-layer attention progressively amplifies the advantage of effective examples. Based on these findings, we propose a novel perspective: example effectiveness is modulated by the model's prior knowledge. To formalize this insight, we develop a linear self-attention theoretical framework grounded in gradient flow analysis and design GradS, a method that selects examples based on the strength of the gradient flow from each example to the query. Extensive experiments across four state-of-the-art large language models and five benchmark datasets demonstrate that GradS achieves an average relative improvement of 6.8% over the strongest baseline. The theoretical analysis and empirical results are strongly consistent, validating our framework and method.
Abstract
Numerous studies have investigated the underlying mechanisms of in-context learning (ICL) effectiveness to inspire the design of related methods. However, existing work predominantly assumes that the demonstrations provided within ICL are effective, whereas a substantial body of research indicates that not all demonstrations are: some yield no performance improvement during ICL. In this paper, we therefore investigate the reasons behind demonstration ineffectiveness. Our analysis is based on gradient flow and linear self-attention models. By setting the gradient flow to zero, we deduce that a demonstration becomes ineffective if its information has either already been learned by the model or is irrelevant to the user query. Furthermore, we show that in multi-layer models, the disparity in effectiveness among demonstrations is amplified as depth increases, causing the model to focus more on the effective ones. Since current demonstration selection methods primarily focus on relevance to the user query while overlooking the information the model has already assimilated, we propose a novel method called GradS, which leverages gradient flow for demonstration selection. We use the magnitude of the gradient flow from a demonstration to a given user query as the selection criterion, thereby ensuring the effectiveness of the chosen demonstrations. We validate our derivation and GradS on four prominent LLMs across five mainstream datasets. The experimental results confirm that the disparity in effectiveness among demonstrations is magnified as the model layer increases, substantiating our derivations. Moreover, GradS achieves a relative improvement of $6.8\%$ on average over the strongest baselines, demonstrating its effectiveness.
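The abstract's two ineffectiveness conditions can be made concrete in a toy linear-regression ICL setting: under a linear model, the gradient-flow contribution of a demonstration to the query factors into (i) the demonstration's residual under the current weights (information not yet learned) and (ii) its similarity to the query (relevance). A minimal sketch, assuming a linear model `W` and dot-product similarity; the function names and the exact scoring form are illustrative, not the paper's implementation:

```python
import numpy as np

def grads_scores(X_demo, y_demo, x_query, W):
    """Score each candidate demonstration for a given query.

    A demonstration's score is zero when either factor vanishes:
    - residual = 0: the model has already learned this demonstration,
    - relevance = 0: the demonstration is unrelated to the query.
    """
    residuals = y_demo - X_demo @ W      # what the model has NOT yet internalized
    relevance = X_demo @ x_query         # similarity between demonstration and query
    return np.abs(relevance * residuals)

def select_demonstrations(X_demo, y_demo, x_query, W, k):
    """Pick the k demonstrations with the largest (hypothetical) gradient-flow score."""
    scores = grads_scores(X_demo, y_demo, x_query, W)
    return np.argsort(scores)[::-1][:k]
```

In this sketch, a demonstration the model has already fitted (zero residual) and one orthogonal to the query both score zero, matching the two ineffectiveness conditions derived in the paper; only demonstrations that are both unlearned and query-relevant are selected.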