🤖 AI Summary
Prompt optimization for large language models (LLMs) remains challenging in the absence of automatic evaluation metrics. Method: This paper proposes a few-shot prompt optimization framework requiring only a single round of human feedback. It employs a lightweight three-module architecture: (i) a learnable evaluator that replaces gold-standard reference comparisons, (ii) a feedback encoder that models human preferences, and (iii) a gradient-guided mechanism for parameterized prompt updates. Inspired by RLHF principles but eschewing reinforcement-learning training, the framework incorporates human feedback directly into the prompt generation loop, eliminating both multi-round interaction and surrogate reward modeling. Contribution/Results: Experiments on multiple public and industrial benchmarks show that the method significantly outperforms existing output-scoring-based prompt optimization approaches, achieving better task adaptability, higher feedback utilization efficiency, and faster convergence.
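To make the single-round loop concrete, here is a minimal, hypothetical sketch in Python. All names (`Evaluator`, `optimize_prompt`, `toy_llm`) are illustrative assumptions, not the paper's API; the learnable evaluator is reduced to a toy keyword scorer fitted from one round of human ratings, and the gradient-guided prompt update is replaced by simple candidate selection for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluator:
    """Toy stand-in for PLHF's learnable evaluator (an assumption, not
    the paper's model). It replaces gold-reference comparison by scoring
    outputs against words the human round marked as preferred."""
    preferred: set = field(default_factory=set)

    def fit(self, feedback):
        # feedback: list of (output, rating) pairs from ONE human round
        for output, rating in feedback:
            if rating > 0:
                self.preferred.update(output.split())

    def score(self, output):
        # Fraction of output words the fitted evaluator prefers
        words = output.split()
        return sum(w in self.preferred for w in words) / max(len(words), 1)


def optimize_prompt(candidates, llm, evaluator):
    """Pick the candidate prompt whose LLM output the evaluator rates
    highest (a crude proxy for the paper's parameterized prompt update)."""
    return max(candidates, key=lambda p: evaluator.score(llm(p)))


def toy_llm(prompt):
    # Deterministic stub standing in for a real LLM call
    return "concise answer" if "brief" in prompt else "a very long rambling answer"


# Single round of human feedback, then evaluator-guided prompt selection
ev = Evaluator()
ev.fit([("concise answer", 1), ("a very long rambling answer", -1)])
best = optimize_prompt(["Answer briefly.", "Answer in detail."], toy_llm, ev)
```

After fitting on one feedback round, the evaluator favors the prompt that elicits the human-preferred style, so no further human interaction or reward model is needed in this sketch.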
📝 Abstract
Automatic prompt optimization frameworks are developed to obtain suitable prompts for large language models (LLMs) with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when the output quality cannot be easily assessed by comparison with standard golden samples. Consequently, optimizing prompts effectively and efficiently without a clear metric becomes a critical challenge. To address this issue, we present PLHF (which stands for "P"rompt "L"earning with "H"uman "F"eedback), a few-shot prompt optimization framework inspired by the well-known RLHF technique. Unlike naive strategies, PLHF employs a dedicated evaluator module acting as the metric to estimate the output quality, and requires only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF outperforms prior output grading strategies for LLM prompt optimization.