Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Random sampling of evaluation data in automated prompt optimization yields unrepresentative subsets, leading to unreliable evaluations and suboptimal prompts. Method: This paper proposes IPOMP, a two-stage iterative evaluation-data selection framework. Stage one selects representative and diverse samples via semantic clustering and decision boundary analysis; stage two incorporates real-time model performance feedback to iteratively replace redundant samples. Because it requires no prior performance labels, IPOMP also applies to new or private datasets where such labels are unavailable. Contribution/Results: On BIG-bench, IPOMP improves effectiveness by 1.6–5.3% and evaluation stability by at least 57% over SOTA baselines, with under 1% additional computational overhead. The real-time performance-guided refinement is plug-and-play and can augment existing coreset selection methods.
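The paper does not publish implementation details in this summary, but the stage-one idea (cluster the candidate pool semantically, then pick cluster-representative samples plus samples near cluster decision boundaries) can be sketched as below. The function name, the choice of k-means, and the margin-based boundary criterion are illustrative assumptions, not IPOMP's exact procedure:

```python
import numpy as np

def select_initial_eval_set(embeddings, k, n_boundary, seed=0):
    """Stage-1 sketch (assumed details): k-means over sample embeddings,
    then keep the sample nearest each centroid (representative) plus the
    samples with the smallest margin between their two nearest centroids
    (near a decision boundary, hence diverse/ambiguous)."""
    rng = np.random.default_rng(seed)
    # Plain Lloyd's k-means on the embedding vectors.
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)].astype(float)
    for _ in range(20):
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
    labels = d.argmin(axis=1)
    # Representative: the sample closest to each (non-empty) centroid.
    reps = [int(np.flatnonzero(labels == c)[d[labels == c, c].argmin()])
            for c in range(k) if (labels == c).any()]
    # Boundary: smallest gap between the two nearest centroid distances.
    two_nearest = np.sort(d, axis=1)[:, :2]
    margin = two_nearest[:, 1] - two_nearest[:, 0]
    boundary = [int(i) for i in margin.argsort() if int(i) not in reps][:n_boundary]
    return sorted(set(reps) | set(boundary))
```

On two well-separated clusters with one midpoint sample, this selects one representative per cluster and the midpoint as the boundary sample.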

📝 Abstract
Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge, but most rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection approach for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on the BIG-bench dataset show that IPOMP improves effectiveness by 1.6% to 5.3% and stability by at least 57% compared with SOTA baselines, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
Problem

Research questions and friction points this paper is trying to address.

Randomly selected evaluation subsets in automated prompt optimization fail to represent the full dataset
Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization
Performance data is unavailable for new or private datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic clustering and decision boundary analysis select representative, diverse samples
Real-time model performance feedback guides iterative refinement
Iterative refinement replaces redundant samples without requiring prior performance labels
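The refinement step above can be sketched as follows. The summary does not specify how IPOMP detects redundancy, so this sketch assumes a simple criterion: two evaluation samples are redundant if their score vectors across the prompts tried so far are strongly correlated. The function name and the Pearson-correlation test are hypothetical stand-ins:

```python
import numpy as np

def refine_eval_set(selected, perf, pool, threshold=0.95, seed=0):
    """Stage-2 sketch (assumed details): drop any selected sample whose
    per-prompt score vector nearly duplicates one already kept, then
    backfill from the candidate pool. `perf[i]` holds sample i's scores
    across the prompts evaluated so far."""
    rng = np.random.default_rng(seed)
    keep, redundant = [], []
    for i in selected:
        # Redundant if strongly correlated with any sample already kept.
        if any(abs(np.corrcoef(perf[i], perf[j])[0, 1]) > threshold
               for j in keep):
            redundant.append(i)
        else:
            keep.append(i)
    # Replace each dropped sample with a fresh candidate from the pool.
    candidates = [i for i in pool if i not in selected]
    n_refill = min(len(redundant), len(candidates))
    refill = rng.choice(len(candidates), size=n_refill, replace=False) if n_refill else []
    return keep + [candidates[int(r)] for r in refill]
```

Run once per optimization round, this keeps the evaluation set small while discarding samples that no longer add signal, which is consistent with the reported sub-1% overhead.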
🔎 Similar Papers
No similar papers found.