🤖 AI Summary
Large language models (LLMs) are highly sensitive to prompting, and manually designing effective prompts remains challenging. Existing automated prompting methods face three key bottlenecks: reliance on large task-specific datasets, computationally expensive optimization, and inability to adapt prompts at the sample level. To address these limitations, we propose GPS, a general, sample-wise prompt generation framework that requires no task-specific training or optimization. GPS employs a prompter trained with reinforcement learning, augmented with a novel regularization constraint, and uses minimum Bayes risk decoding to enable stable, input-adaptive prompt generation. Evaluated in zero-shot settings, GPS improves both generalization and robustness across diverse tasks, including text simplification, summarization, and classification, remaining competitive with strong baselines that were trained on those tasks. Notably, on in-domain mathematical reasoning (GSM8K), GPS achieves state-of-the-art performance.
📝 Abstract
LLMs are sensitive to prompting: task performance often hinges on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization that adapts it effectively to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS is competitive: it attains the second-best results among baselines on text simplification, the third-best on summarization, and on-par results on classification, despite not training on any of these tasks, in contrast to the baselines. For in-domain prompting, it achieves state-of-the-art results on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at https://github.com/Batorskq/GPS.
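The Minimum Bayes Risk decoding step mentioned above can be illustrated with a minimal sketch: among several candidate prompts, select the one with the highest expected utility against the other candidates, which favors a "consensus" candidate over outliers. The candidate strings and the token-overlap utility below are illustrative assumptions, not the paper's actual prompter outputs or utility function.

```python
# Minimal MBR-style selection over candidate prompts (illustrative sketch).
# Assumption: a uniform distribution over candidates and a toy Jaccard utility;
# the paper's actual utility and candidate generation are not reproduced here.

def utility(a: str, b: str) -> float:
    """Toy utility: token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against the
    remaining candidates (its expected utility under a uniform model)."""
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(utility(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

candidates = [
    "Solve the problem step by step and give the final answer",
    "Think step by step then state the final answer",
    "Answer briefly",
]
print(mbr_select(candidates))  # the consensus-like candidate wins over the outlier
```

Under this toy utility, the two step-by-step prompts overlap heavily, so one of them is chosen and the dissimilar "Answer briefly" outlier is rejected, which is the stabilizing effect MBR decoding provides at inference time.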