🤖 AI Summary
Existing prompt generation methods rely primarily on task accuracy or similar outcome-based metrics, neglecting the intrinsic stability of prompts, i.e., the consistency of responses across repeated executions. This omission leads to poor interpretability and robustness.
Method: This paper introduces *prompt semantic stability* as a necessary condition for LLM prompt reliability and proposes the first stability-aware iterative prompt generation framework. It features (i) an automated stability evaluation mechanism grounded in multi-execution consistency and semantic similarity; (ii) a feedback-driven optimization pipeline; and (iii) a dedicated stability evaluator fine-tuned from LLaMA.
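The multi-execution consistency idea in (i) can be sketched as a mean pairwise-similarity score over repeated responses to the same prompt. This is a minimal illustration, not the paper's implementation: it substitutes a simple token-overlap (Jaccard) similarity for the paper's learned semantic-similarity evaluator, and the function names `jaccard_similarity` and `stability_score` are hypothetical.

```python
from itertools import combinations

def jaccard_similarity(a: str, b: str) -> float:
    # Token-overlap similarity; a crude stand-in for an
    # embedding- or evaluator-based semantic similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def stability_score(responses: list[str]) -> float:
    # Stability = mean pairwise similarity across N repeated
    # executions of the same prompt; 1.0 means fully consistent.
    if len(responses) < 2:
        return 1.0
    pairs = list(combinations(responses, 2))
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)

# A prompt that yields identical outputs scores higher than an erratic one.
consistent = ["the capital of france is paris"] * 3
erratic = ["paris", "the capital is paris", "france's capital city: paris"]
assert stability_score(consistent) == 1.0
assert stability_score(erratic) < stability_score(consistent)
```

In the paper's framework this score would come from the fine-tuned LLaMA evaluator rather than token overlap, and it feeds back into the iterative prompt-optimization loop as a selection signal alongside task accuracy.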
Contribution/Results: Evaluated on both general-purpose and domain-specific tasks, the framework significantly improves both task accuracy and output consistency. Empirical results demonstrate that incorporating stability feedback systematically enhances prompt quality and end-to-end execution reliability.
📝 Abstract
Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components enable us to develop the first stability-aware general-purpose prompt generation system, which leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability to be a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.