Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of evaluation signals in automated prompt optimization, where existing approaches either rely on fixed subsets of samples—lacking prompt awareness—or employ heuristics without theoretical guarantees. The paper formulates the problem as online adaptive testing and introduces a novel prompt-aware evaluation scheduling framework that uniquely integrates Item Response Theory (IRT) with submodular optimization, achieving a (1−1/e)-approximation guarantee. The framework leverages IRT-based item discrimination, facility-location-inspired coverage, switch-cost-aware warm-starting, and an adaptive exploration–exploitation mechanism to enable efficient evaluation. Experiments across 36 tasks demonstrate an average accuracy improvement of 6.2% with only ~4% additional token overhead; notably, just 20 intelligently selected samples outperform 30–50 naively chosen ones, reducing token usage by 35–60%.

📝 Abstract
Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.
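The abstract's core mechanism is greedy maximization of a monotone submodular objective combining a facility-location coverage term with IRT-based discrimination weights. The sketch below illustrates that general technique under stated assumptions: a precomputed item-item similarity matrix and per-item discrimination weights are given as inputs, and the function name and structure are illustrative, not the paper's actual implementation.

```python
def facility_location_greedy(sim, discrimination, k):
    """Greedily select k items maximizing a discrimination-weighted
    facility-location objective:
        f(S) = sum_i w_i * max_{j in S} sim[i][j]
    f is monotone submodular, so greedy selection carries the classic
    (1 - 1/e) approximation guarantee cited in the abstract.

    sim            -- n x n similarity matrix (list of lists)
    discrimination -- length-n weights (e.g., IRT item discrimination)
    k              -- evaluation budget (number of items to select)
    """
    n = len(sim)
    selected = []
    best_cov = [0.0] * n  # current best coverage of each item by S
    for _ in range(k):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: weighted coverage improvement.
            gain = sum(
                discrimination[i] * max(0.0, sim[i][j] - best_cov[i])
                for i in range(n)
            )
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cov = [max(best_cov[i], sim[i][best_j]) for i in range(n)]
    return selected
```

With two tight clusters of items, the greedy picks one representative per cluster, which is the coverage behavior the facility-location term is meant to provide; the paper's full objective additionally folds in switching-cost-aware warm-start swaps, which this cold-start sketch omits.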
Problem

Research questions and friction points this paper is trying to address.

automatic prompt optimization
evaluation scheduling
submodular optimization
adaptive testing
prompt evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-Aware Evaluation
Submodular Optimization
Online Adaptive Testing
Automatic Prompt Optimization
Evaluation Scheduling
Xiaoyu Ma
Carnegie Mellon University
Transportation network modeling · machine learning · reinforcement learning · simulation · optimization
Yiwen Li
The Chinese University of Hong Kong, Shenzhen
Haoyue Liu
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Computer Vision · Event Camera
Zhichao Wang
The Chinese University of Hong Kong, Shenzhen
Ye Chen
Xi’an Jiaotong University
Yongxin Guo
Alibaba Group
Video Understanding · MLLM
Xiaoying Tang
The Chinese University of Hong Kong, Shenzhen