QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing prompt optimization methods neglect query dependency and rely heavily on frequent online LLM interactions, resulting in suboptimal performance and high computational overhead. To address this, we propose QPO, a multi-loop offline reinforcement learning framework that establishes a query-dependent prompt optimization paradigm. Leveraging existing prompt demonstration data, our method iteratively enhances its training data within a fully offline closed loop via three components: prompt distillation, lightweight fine-tuning of a small surrogate model, and dynamic data augmentation, eliminating the need for any online LLM queries. Evaluated across diverse NLP and mathematical reasoning tasks, the approach significantly improves the zero-shot and few-shot performance of LLMs ranging from 7B to 70B parameters, and reduces the computational cost of prompt optimization by up to an order of magnitude compared to online baselines, enabling scalable, efficient, and query-adaptive prompt engineering.

📝 Abstract
Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods focus only on task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performance. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization (QPO), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expense of online interactions. Furthermore, we continuously augment the offline dataset with the prompts generated in each loop, since the prompts from the fine-tuned model are expected to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
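As a rough illustration of the multi-loop procedure the abstract describes, the sketch below bootstraps a query-dependent prompt generator from offline demonstration data. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Demo` record, the `finetune` and `score_offline` helpers, and the loop count are all hypothetical stand-ins for the paper's components.

```python
# Minimal, hypothetical sketch of QPO's multi-loop offline training.
# The data format, reward source, and fine-tuning step are illustrative
# stand-ins, not the authors' actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Demo:
    query: str     # input query from an open-sourced task
    prompt: str    # candidate prompt paired with the query
    reward: float  # cached benchmark score for (query, prompt); no online LLM call

def qpo_loops(
    data: List[Demo],
    finetune: Callable[[List[Demo]], Callable[[str], str]],  # offline RL step
    score_offline: Callable[[str, str], float],              # e.g. a learned reward model
    num_loops: int = 3,                                      # hypothetical loop count
) -> Callable[[str], str]:
    """Return a query -> prompt generator after `num_loops` offline loops."""
    generator = finetune(data)
    for _ in range(num_loops):
        # Generate a query-dependent prompt for every query in the dataset.
        new_rows = [Demo(d.query, generator(d.query), 0.0) for d in data]
        # Score the generated prompts offline (cached evaluations or a reward
        # model), keeping the whole loop free of target-LLM queries.
        for row in new_rows:
            row.reward = score_offline(row.query, row.prompt)
        # Augment: generated prompts are expected to outperform the source
        # prompts, so the next fine-tuning round trains on a stronger dataset.
        data = data + new_rows
        generator = finetune(data)
    return generator
```

The design point carried over from the abstract is that every reward inside the loop comes from cached or offline scoring, so the target LLM is never queried during optimization.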
Problem

Research questions and friction points this paper is trying to address.

Optimizing prompts for query-specific performance in LLMs
Reducing interaction costs with offline reinforcement learning
Enhancing prompt effectiveness without online LLM feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-loop offline reinforcement learning for prompts
Query-preferred prompts via a fine-tuned small model (inference-time use sketched after this list)
Offline dataset augmentation with generated prompts
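At inference time, the innovations above imply a simple deployment pattern: the small fine-tuned model produces a prompt conditioned on each incoming query before the large target LLM is called. A minimal sketch, assuming a plain text-in/text-out interface for both models (the function names and prompt concatenation format are hypothetical):

```python
# Hypothetical inference-time use of the fine-tuned prompt generator; the
# `complete` callable stands in for whatever interface the target LLM exposes.
from typing import Callable

def answer_with_qpo(
    query: str,
    prompt_generator: Callable[[str], str],  # the small fine-tuned model
    complete: Callable[[str], str],          # the large target LLM
) -> str:
    # One forward pass of the small model yields a prompt tailored to this
    # specific query, which is prepended before querying the target LLM.
    tailored_prompt = prompt_generator(query)
    return complete(f"{tailored_prompt}\n\n{query}")
```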
👥 Authors

Yilun Kong
Tsinghua University
Reinforcement Learning, Large Language Models

Hangyu Mao
SenseTime Research

Qi Zhao
Tsinghua University

Bin Zhang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Jingqing Ruan
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

Li Shen
Sun Yat-Sen University

Yongzhe Chang
UNSW/Data 61 PhD, Tsinghua postdoc.
machine learning, reinforcement learning

Xueqian Wang
Tsinghua University
Information Fusion, Target Detection, Radar Imaging, Image Processing

Rui Zhao
SenseTime Research

Dacheng Tao
Nanyang Technological University
artificial intelligence, machine learning, computer vision, image processing, data mining