🤖 AI Summary
This work investigates why prompt engineering for large language models (LLMs) is intrinsically difficult: effective prompts remain hard to discover and to interpret reliably, even when the underlying model is a near-optimal pretrained sequence predictor. The authors formalize prompt difficulty from two complementary perspectives, statistical learning theory and empirical realizability, treating prompts as conditional modulators of the pretrained distribution. Using exhaustive search over binary sequence prediction tasks, explicit modeling of the pretrained distribution, and analysis of neural predictor performance bounds, they show that optimal prompts exhibit counterintuitive structural properties and depend critically on characteristics of the pretrained data distribution that are typically unobservable in practice. Intuitively designed prompts, and prompts built from samples of the target task, are shown to be suboptimal. These findings challenge prevailing prompt design practice and take an initial step toward a principled foundation for prompt interpretability, optimization, and evaluation.
📝 Abstract
Large language models (LLMs) can be prompted to perform many tasks, but finding good prompts is not always easy, nor is it always easy to understand why some performant prompts work. We explore these issues by viewing prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. Through numerous prompt search experiments, we show that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice. Moreover, even with exhaustive search, reliably identifying optimal prompts for practical neural predictors can be difficult. Further, we demonstrate that common prompting methods, such as using intuitive prompts or samples from the targeted task, are in fact suboptimal. Thus, this work takes an initial step towards understanding, from both a statistical and an empirical perspective, why optimal prompts are hard to find and hard to interpret.
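The abstract's framing, conditioning a near-optimal predictor and exhaustively searching binary prompts, can be illustrated with a toy sketch. The setup below is an assumption for illustration, not the paper's actual experiments: the "pretraining distribution" is a two-component mixture of Bernoulli sources, the predictor is the exact Bayes-optimal mixture predictor, and we exhaustively score every binary prompt of length 4 by how strongly it steers the predictor toward a target task (the Bernoulli(0.8) source).

```python
import itertools

# Toy "pretraining distribution": a mixture of two Bernoulli sources that
# emit 1 with probability 0.2 or 0.8, with equal prior weight.
# (Assumed setup for illustration; not the paper's configuration.)
SOURCES = [0.2, 0.8]
PRIOR = [0.5, 0.5]

def posterior(prompt):
    """Posterior over sources after conditioning on a binary prompt."""
    ones = sum(prompt)
    zeros = len(prompt) - ones
    weights = [pr * p**ones * (1 - p)**zeros for pr, p in zip(PRIOR, SOURCES)]
    z = sum(weights)
    return [w / z for w in weights]

def predictive_prob_of_one(prompt):
    """Probability the Bayes-optimal mixture predictor assigns to '1' next."""
    return sum(w * p for w, p in zip(posterior(prompt), SOURCES))

# Exhaustive search over all binary prompts of length 4, scoring each by the
# predictive probability it induces for the target task's dominant symbol.
best = max(itertools.product([0, 1], repeat=4), key=predictive_prob_of_one)
print(best)  # (1, 1, 1, 1)
print(predictive_prob_of_one((1, 1, 0, 1)) < predictive_prob_of_one(best))  # True
```

The search returns the extreme all-ones prompt, which outscores a typical sample from the target task itself (e.g. `1101`), echoing, in miniature, the abstract's claim that prompting with samples from the targeted task can be suboptimal.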