Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt generation methods rely primarily on task accuracy or similar outcome-based metrics and neglect the intrinsic stability of prompts (i.e., response consistency across repeated executions), which limits interpretability and robustness. Method: This paper introduces *prompt semantic stability* as a necessary condition for LLM prompt reliability and proposes the first stability-aware iterative prompt generation framework. It features (i) an automated stability evaluation mechanism grounded in multi-execution consistency and semantic similarity; (ii) a feedback-driven optimization pipeline; and (iii) a dedicated, fine-tuned LLaMA-based stability evaluator. Contribution/Results: Evaluated on both general-purpose and domain-specific tasks, the framework significantly improves task accuracy and output consistency; empirical results demonstrate that incorporating stability feedback systematically enhances prompt quality and end-to-end execution reliability.
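To make the stability criterion concrete, here is a minimal sketch of a multi-execution semantic-stability score: execute the same prompt several times and average pairwise semantic similarity across the responses. Note that the paper fine-tunes a LLaMA-based evaluator for this step; the off-the-shelf sentence encoder and model name below are stand-in assumptions, not the authors' implementation.

```python
# Minimal sketch of a multi-execution semantic-stability score.
# Assumption: an off-the-shelf sentence encoder stands in for the
# paper's fine-tuned LLaMA-based stability evaluator.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_stability(responses: list[str]) -> float:
    """Mean pairwise cosine similarity over repeated executions of one prompt."""
    if len(responses) < 2:
        raise ValueError("need at least two executions to measure stability")
    # Unit-normalized embeddings make the dot product a cosine similarity.
    emb = encoder.encode(responses, normalize_embeddings=True)
    sims = [float(np.dot(emb[i], emb[j]))
            for i, j in combinations(range(len(emb)), 2)]
    return float(np.mean(sims))
```

A score near 1.0 means repeated executions yield semantically interchangeable responses; lower scores flag prompts whose outputs drift from run to run.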

📝 Abstract
Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability-aware general-purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving that stability is a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating prompt stability in auto-generated prompts for reliability
Proposing semantic stability to measure response consistency across tasks
Developing stability-aware prompt generation to improve accuracy and consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes semantic stability as a criterion for prompt evaluation
Fine-tunes a LLaMA-based evaluator to measure stability automatically
Develops a stability-aware prompt generation system (see the sketch below)
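Read together, these pieces form a feedback loop: sample several executions of the current prompt, score their semantic stability, and, if the score falls below a threshold, have an LLM revise the prompt using that feedback. The sketch below is a hypothetical rendering of such a loop, not the authors' code; `run_llm`, `rewrite_prompt`, and `stability_fn` are placeholder callables, and the threshold and iteration budget are illustrative assumptions.

```python
# Hypothetical stability-aware prompt optimization loop (illustrative only).
from typing import Callable

def optimize_prompt(
    prompt: str,
    run_llm: Callable[[str], str],                           # placeholder
    rewrite_prompt: Callable[[str, list[str], float], str],  # placeholder
    stability_fn: Callable[[list[str]], float],  # e.g. semantic_stability above
    n_runs: int = 5,
    threshold: float = 0.9,
    max_iters: int = 10,
) -> str:
    """Iteratively revise a prompt until repeated executions converge."""
    for _ in range(max_iters):
        # Repeated executions of the same prompt expose the LLM's stochasticity.
        responses = [run_llm(prompt) for _ in range(n_runs)]
        score = stability_fn(responses)
        if score >= threshold:
            break
        # Feedback-driven step: the stability score and the divergent
        # responses guide the next prompt revision.
        prompt = rewrite_prompt(prompt, responses, score)
    return prompt
```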
Ke Chen
School of Information Sciences, University of Illinois Urbana-Champaign, Urbana, USA
Yufei Zhou
Department of Economics, Duke University, Durham, USA
Xitong Zhang
Qualcomm
Generalization, Large Language Model, Graph Neural Network
Haohan Wang
School of Information Sciences, University of Illinois Urbana-Champaign
Computational Biology, Agentic AI, AI4Science, AI security