The Few-shot Dilemma: Over-prompting Large Language Models

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies the “over-prompting” phenomenon in large language models (LLMs), wherein performance degrades beyond an optimal number of in-context examples, challenging the conventional assumption that more examples yield better performance. Through systematic evaluation on software requirements classification, the authors observe an inverted-U relationship between example count and model accuracy. To address this, they propose a selection strategy that combines TF-IDF-based example retrieval with stratified sampling, evaluated against random sampling and semantic embedding baselines. Across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral, the method surpasses the prior state of the art by 1% while using fewer examples, mitigating over-prompting and supporting efficient, sample-aware few-shot prompt engineering.

📝 Abstract
Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods (random sampling, semantic embedding, and TF-IDF vectors) and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state of the art by 1% in classifying functional and non-functional requirements.
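The abstract describes gradually increasing the few-shot example count to locate each LLM's optimum on the inverted-U curve. A minimal sketch of that sweep is below; `find_optimal_k` and the `evaluate` callback are hypothetical names for illustration, since the paper's actual harness is not shown here.

```python
# Sketch of the k-sweep idea: evaluate accuracy at each few-shot count
# and keep the peak. evaluate(k) is a hypothetical stand-in for running
# the LLM classifier with k few-shot examples in the prompt.
def find_optimal_k(evaluate, max_k=20):
    """Return the example count k that maximizes validation accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(max_k + 1):
        acc = evaluate(k)  # accuracy with k few-shot examples
        if acc > best_acc:  # keep the peak of the inverted-U curve
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Under the paper's inverted-U finding, accuracy past `best_k` declines rather than plateaus, which is why the sweep stores the peak instead of simply taking the largest k.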
Problem

Research questions and friction points this paper is trying to address.

Investigating over-prompting's negative impact on LLM performance
Determining optimal few-shot example quantity for different LLMs
Improving software requirement classification accuracy while avoiding over-prompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging three few-shot selection methods
Identifying optimal example quantity per LLM
Combining TF-IDF with stratified sampling approach
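The combination of TF-IDF similarity with stratified sampling could look roughly like the sketch below; it is an assumption-based illustration (the paper's code is not reproduced here), and names such as `select_examples` and `k_per_class` are invented for this example.

```python
# Minimal sketch: pick the top-k TF-IDF-similar pool examples per class
# (stratified), so every requirement class is represented in the prompt.
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Compute TF-IDF weight dicts for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] / len(doc) * idf[t] for t in tf})
    return vecs


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_examples(query, pool, labels, k_per_class=1):
    """Stratified selection: top-k TF-IDF-similar examples per class."""
    docs = [d.lower().split() for d in pool] + [query.lower().split()]
    vecs = tfidf_vectors(docs)
    sims = [cosine(vecs[-1], v) for v in vecs[:-1]]
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    chosen = []
    for lab, idxs in by_class.items():
        chosen += sorted(idxs, key=lambda i: sims[i], reverse=True)[:k_per_class]
    return [(pool[i], labels[i]) for i in chosen]
```

Stratifying per class keeps both functional and non-functional examples in the prompt even when the query resembles only one class, while the per-class cap bounds the total example count and so guards against over-prompting.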
Yongjian Tang
Siemens AG, Technical University of Munich
Doruk Tuncel
Siemens AG
Christian Koerner
Siemens AG
Thomas Runkler
Professor of Computer Science, Siemens AG, Technische Universität München
Data Analytics, Artificial Intelligence, Machine Learning, Computational Intelligence