🤖 AI Summary
This work addresses the lack of quantifiable evaluation of iterative prompting in large language models (LLMs) across multi-turn interactions. We propose a cross-domain evaluation framework spanning creative generation, programming, and mathematical reasoning tasks. Through 12 controlled experimental rounds, we introduce three turn-wise metrics (semantic change rate, code size growth ratio, and reasoning refinement index), enabling the first cross-model, cross-task comparison of iterative prompting effects. Methodologically, we combine two feedback protocols, fuzzy feedback and directed guidance, with domain-adapted evaluation criteria including unit-test pass rate, answer equivalence, reasoning validity, originality, and feasibility. Results reveal pronounced domain heterogeneity in iterative gains: improvements appear early in creative and coding tasks but emerge only in later turns for mathematical reasoning. Directed prompting significantly improves output quality while mitigating performance degradation, providing both a theoretical foundation and practical guidance for optimizing LLM-based multi-turn workflows.
📝 Abstract
Large language models (LLMs) are now used in multi-turn workflows, but we still lack a clear way to measure when iteration helps and when it hurts. We present an evaluation framework for iterative refinement that spans ideation, code, and math. Our protocol runs controlled 12-turn conversations per task, using prompts that range from vague "improve it" feedback to targeted steering, and logs per-turn outputs. We score outcomes with domain-appropriate checks (unit tests for code; answer equivalence plus reasoning soundness for math; originality and feasibility for ideation) and track turn-level behavior with three families of metrics: semantic movement across turns, turn-to-turn change, and output size growth. Across models and tasks, gains are domain-dependent: they arrive early for ideas and code, while in math late turns matter when iteration is guided by elaboration. After the first few turns, vague feedback often plateaus or reverses correctness, whereas targeted prompts reliably shift the intended quality axis (novelty vs. feasibility in ideation; speed vs. readability in code; in math, elaboration outperforms exploration and drives late-turn gains). We also observe consistent domain patterns: ideation moves more in meaning across turns, code tends to grow in size with little semantic change, and math starts fixed but can break that path with late, elaborative iteration. Together, the framework and metrics make iteration measurable and comparable across models, and signal when to steer, stop, or switch strategies.