🤖 AI Summary
This study challenges the universality of common empirical heuristics in in-context learning (ICL), such as "more examples are always better" and "one-shot is inherently superior to zero-shot," by systematically investigating the joint effects of example quantity, ordering, and selection. Method: the authors propose the first Monte Carlo sampling framework that explicitly models the interaction between example selection and permutation, mitigating the attribution bias that arises from varying a single factor in isolation. Contribution/Results: experiments reveal that conventional quantity guidelines generalize poorly across example sets, and that optimal example configurations are highly task- and model-dependent: one-shot performance can even degrade below zero-shot baselines. Moreover, data-value-based "robust" selection strategies introduce implicit optimization pitfalls, yielding lower accuracy than random sampling. These findings point toward a new paradigm for interpretable ICL design and empirically grounded benchmarking.
📝 Abstract
Prior work has shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between these presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of the number of examples while explicitly accounting for effects from ordering and example selection. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and that whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection, choosing examples that perform well across different orderings. We find a negative result: while performance becomes robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.
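The core idea of jointly sampling example subsets and their orderings can be sketched as below. This is a minimal illustration, not the paper's implementation: the `evaluate` callback (which would run the model on a prompt built from the chosen examples and return an accuracy) and all names here are hypothetical.

```python
import random
from statistics import mean, stdev

def monte_carlo_icl(pool, k, evaluate, n_trials=100, seed=0):
    """Estimate the accuracy distribution of k-shot prompts by jointly
    sampling WHICH examples are used and in WHAT order, rather than
    ablating one factor while holding the other fixed.

    pool     : list of candidate in-context examples
    k        : number of examples per prompt (k=0 gives the zero-shot case)
    evaluate : callback mapping an ordered list of examples to a score
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        chosen = rng.sample(pool, k)  # random subset (selection)
        rng.shuffle(chosen)           # random permutation (ordering)
        scores.append(evaluate(chosen))
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Toy usage with a stand-in evaluator; a real study would call an LLM here.
pool = list(range(10))
avg, sd = monte_carlo_icl(pool, k=3, evaluate=lambda exs: len(exs) / 10)
```

Reporting the spread alongside the mean is what lets one ask whether a given shot count is reliably better, or only better for lucky selections and orderings.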