Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code generation models are highly sensitive to the affective and personality cues embedded in prompts, yet existing benchmarks focus exclusively on peak performance and neglect output stability. Method: We propose PromptSE, a framework that constructs semantically equivalent prompt variants from emotion and personality templates, establishing prompt stability as an independent evaluation dimension. We introduce AUC-E, a metric that decouples performance from stability to enable fair cross-model comparison, and combine probability-aware continuous scoring with binary pass rates for fine-grained stability quantification. Contribution/Results: Experiments across 14 state-of-the-art models reveal non-monotonic effects of architecture and scale on stability, and the framework supports rapid screening of closed-source models. This work establishes a reproducible, stability-centric evaluation paradigm for trustworthy AI-powered programming assistants.

📝 Abstract
Code generation models are widely used in software development, yet their sensitivity to prompt phrasing remains under-examined. Identical requirements expressed with different emotions or communication styles can yield divergent outputs, while most benchmarks emphasize only peak performance. We present PromptSE (Prompt Sensitivity Evaluation), a framework that creates semantically equivalent prompt variants with emotion and personality templates, and that evaluates stability using probability-aware continuous scoring, or binary pass rates when logits are unavailable. The results are aggregated into a proposed area-under-curve metric (AUC-E) for cross-model comparison. Across 14 models from three families (Llama, Qwen, and DeepSeek), our study shows that performance and stability behave as largely decoupled optimization objectives, and it reveals architectural and scale-related patterns that challenge common assumptions about model robustness. The framework supports rapid screening for closed-source models as well as detailed stability analysis in research settings. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection, positioning prompt stability as a complementary evaluation dimension alongside performance and fairness, and contributing to more trustworthy AI-assisted software development tools.
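The variant-construction idea described above can be sketched in a few lines. Note that the template texts, the toy task, and the `stability` spread-based proxy below are illustrative assumptions, not the paper's actual templates or scoring procedure:

```python
# Hypothetical sketch of PromptSE-style variant generation with a toy
# stability proxy. All template wordings and the scoring rule here are
# assumptions for illustration, not the paper's implementation.

BASE_TASK = "Write a function that returns the n-th Fibonacci number."

# Illustrative emotion/personality framings around the same requirement.
TEMPLATES = [
    "{task}",                                        # neutral baseline
    "I'm really frustrated -- please help: {task}",  # negative affect
    "This is exciting! {task}",                      # positive affect
    "As a meticulous senior engineer, {task}",       # personality cue
]

def make_variants(task: str) -> list[str]:
    """Create semantically equivalent prompt variants from templates."""
    return [t.format(task=task) for t in TEMPLATES]

def stability(scores: list[float]) -> float:
    """Toy stability proxy: 1 minus the score spread across variants.
    (The paper aggregates into AUC-E; this is only a stand-in.)"""
    return 1.0 - (max(scores) - min(scores))

variants = make_variants(BASE_TASK)
# Suppose each variant's generated code received these scores (binary
# pass rates, or probability-aware continuous scores when logits exist):
scores = [1.0, 0.8, 1.0, 0.9]
print(len(variants), round(stability(scores), 2))  # → 4 0.8
```

A model that produces the same quality of code regardless of the emotional framing would show near-zero spread and a stability near 1.0.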
Problem

Research questions and friction points this paper is trying to address.

Measures prompt sensitivity in code LLMs
Evaluates output stability across emotional variations
Quantifies performance-stability tradeoffs for deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion-personality prompt variant framework
Probability-aware continuous scoring system
AUC-E cross-model comparison metric
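The page does not reproduce AUC-E's exact definition, but the general shape of an area-under-curve stability aggregate can be illustrated as follows. The threshold sweep and curve construction below are one plausible reading, included purely as an assumption-labeled sketch:

```python
# Illustrative area-under-curve aggregation in the spirit of AUC-E.
# The threshold grid and curve definition are assumptions for
# illustration; the paper's actual AUC-E formula may differ.

def auc_like(scores: list[float]) -> float:
    """For each tolerance t in [0, 1], compute the fraction of variant
    scores within t of the best score, then average over tolerances.
    A model that scores uniformly across variants approaches 1.0."""
    thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    best = max(scores)
    fracs = [
        sum(best - s <= t for s in scores) / len(scores)
        for t in thresholds
    ]
    return sum(fracs) / len(fracs)

stable = [0.9, 0.9, 0.9, 0.9]    # same score on every prompt variant
unstable = [0.9, 0.2, 0.9, 0.4]  # same peak, large variant-to-variant swings
print(auc_like(stable) > auc_like(unstable))  # → True
```

The point of an AUC-style aggregate is visible here: both score sets share the same peak performance (0.9), so a peak-only benchmark cannot distinguish them, while the area-based measure penalizes the unstable set.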