The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are highly sensitive to input ordering: minor permutations of the input can produce inconsistent or biased outputs, severely undermining reliability. Method: This work systematically quantifies the "order effect" in closed-source LLMs under a unified multi-task framework, conducting controlled permutation experiments across paraphrasing, relevance judgment, and multiple-choice tasks; it employs cross-task benchmarking and statistical significance testing, and evaluates few-shot prompting as a potential mitigation strategy. Contribution/Results: Input-order perturbations reduce task accuracy by 12.7%–34.1% on average; no mainstream closed-source LLM achieves order invariance; and few-shot prompting yields only marginal improvement, revealing a fundamental deficiency in model robustness. This study establishes empirical evidence and a methodological foundation for assessing and improving the trustworthiness and input-order robustness of LLMs.
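The controlled permutation experiments described above amount to a shuffle-and-compare loop: re-order the input, re-query the model, and measure how often the answer changes. A minimal sketch of such a metric is below; the `model_fn` callable, the stub responders, and the disagreement-rate definition are illustrative assumptions, not the paper's actual protocol.

```python
import random

def order_sensitivity(model_fn, question, options, n_perms=10, seed=0):
    """Fraction of shuffled option orders on which the model's answer
    differs from its answer under the original ordering."""
    rng = random.Random(seed)
    baseline = model_fn(question, options)
    disagreements = 0
    for _ in range(n_perms):
        shuffled = options[:]       # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        if model_fn(question, shuffled) != baseline:
            disagreements += 1
    return disagreements / n_perms

# Stub standing in for an LLM call: always picks the first option,
# i.e. a maximally order-sensitive responder (illustration only).
def first_option_model(question, options):
    return options[0]

# An order-invariant responder, by contrast, scores 0.0 on this metric.
def invariant_model(question, options):
    return min(options)
```

In a real experiment `model_fn` would wrap an API call to a closed-source LLM, and a score of 0.0 would indicate order invariance on the sampled permutations.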

📝 Abstract
As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness: it offers partial mitigation but fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
Problem

Research questions and friction points this paper is trying to address.

Investigates the extent of order sensitivity in closed-source LLMs.
Examines how input order affects performance across multiple tasks.
Highlights the need for order-robust LLMs in high-stakes applications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically quantifies order sensitivity in closed-source LLMs
Benchmarks performance across paraphrasing, relevance judgment, and multiple-choice tasks
Evaluates few-shot prompting as a mitigation strategy