🤖 AI Summary
Existing prompt generation methods rely primarily on task accuracy or similar outcome-based metrics, neglecting the intrinsic stability of prompts, i.e., the consistency of responses across repeated executions. This omission leads to poor interpretability and robustness.
Method: This paper introduces *prompt semantic stability* as a necessary condition for LLM prompt reliability and proposes the first stability-aware iterative prompt generation framework. It features (i) an automated stability evaluation mechanism grounded in multi-execution consistency and semantic similarity; (ii) a feedback-driven optimization pipeline; and (iii) a dedicated stability evaluator fine-tuned from LLaMA.
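The multi-execution consistency idea in (i) can be sketched as a mean pairwise-similarity score over repeated responses to the same prompt. This is a minimal illustration, not the paper's implementation: it substitutes a simple token-overlap (Jaccard) similarity for the paper's learned semantic-similarity evaluator, and the function names `jaccard_similarity` and `stability_score` are hypothetical.

```python
from itertools import combinations

def jaccard_similarity(a: str, b: str) -> float:
    # Token-overlap similarity; a crude stand-in for an
    # embedding- or evaluator-based semantic similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def stability_score(responses: list[str]) -> float:
    # Stability = mean pairwise similarity across N repeated
    # executions of the same prompt; 1.0 means fully consistent.
    if len(responses) < 2:
        return 1.0
    pairs = list(combinations(responses, 2))
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)

# A prompt that yields identical outputs scores higher than an erratic one.
consistent = ["the capital of france is paris"] * 3
erratic = ["paris", "the capital is paris", "france's capital city: paris"]
assert stability_score(consistent) == 1.0
assert stability_score(erratic) < stability_score(consistent)
```

In the paper's framework this score would come from the fine-tuned LLaMA evaluator rather than token overlap, and it feeds back into the iterative prompt-optimization loop as a selection signal alongside task accuracy.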
Contribution/Results: Evaluated on both general-purpose and domain-specific tasks, the framework significantly improves both task accuracy and output consistency. Empirical results demonstrate that incorporating stability feedback systematically enhances prompt quality and end-to-end execution reliability.
📝 Abstract
Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components enable us to develop the first stability-aware general-purpose prompt generation system, which leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability to be a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.