🤖 AI Summary
Problem: In in-context learning (ICL), standard perplexity-based ranking fails to distinguish clean from noisy demonstrations under high label noise, rendering it unreliable for demonstration selection.
Method: This paper proposes a dual-debiasing perplexity evaluation framework that first disentangles and corrects two inherent biases in perplexity—those arising from label errors and those induced by the LLM’s intrinsic domain knowledge. It further performs local perplexity recalibration via synthetically generated neighborhood samples, yielding an absolute cleanliness score independent of global noise level.
Contribution/Results: The framework overcomes the fundamental limitation of conventional ranking-based methods under high noise, enabling unsupervised, noise-aware demonstration selection. Experiments demonstrate robust clean-sample identification even at extremely high noise ratios; ICL performance matches that achieved with fully clean demonstrations, and noise detection AUC improves by 12.6% on average.
📝 Abstract
In-context learning (ICL) relies heavily on high-quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity-based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain-specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual-debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method's superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
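To make the core idea concrete, here is a minimal sketch of the two ingredients the abstract describes: per-sample perplexity from token log-probabilities, and a local recalibration of that perplexity against synthesized neighbor samples. The function names, the z-score-style recalibration, and the use of log-perplexity are illustrative assumptions for this sketch; the paper's actual dual-debiasing terms (correcting label-error bias and LLM domain-knowledge bias) are not reproduced here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-likelihood) of a sample's annotation."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def cleanliness_score(sample_ppl, neighbor_ppls):
    """Illustrative local recalibration (hypothetical, not the paper's exact
    formula): z-score the sample's log-perplexity against the perplexities of
    its synthesized neighbors, oriented so that higher = cleaner. Comparing to
    local neighbors, rather than ranking across the whole corpus, is what makes
    the score meaningful even when most of the corpus is noisy."""
    logs = [math.log(p) for p in neighbor_ppls]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs)) or 1.0
    return (mu - math.log(sample_ppl)) / sigma
```

A sample whose perplexity sits below its neighborhood average gets a positive score (likely clean); one above it gets a negative score (likely noisy), independent of how noisy the corpus is globally.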