🤖 AI Summary
This study investigates how system prompts improve the accuracy and behavioral robustness of large language models (LLMs) in multilingual settings. To this end, we propose the first four-dimensional framework for evaluating and optimizing system prompts in multilingual scenarios, integrating key prompt components (including chain-of-thought reasoning, affective cues, and contextual grounding), and analyze over ten million reasoning units across five languages, three mainstream LLMs, and three cross-lingual benchmarks. Experimental results show that high-performing prompts induce more structured and consistent reasoning paths while significantly suppressing unnecessary code-switching. Our optimization method yields average improvements of 5–10% across all metrics, substantially enhancing cross-lingual reasoning consistency and deployment reliability. The core contributions are: (1) uncovering interpretable associations between prompt components and multilingual performance, and (2) establishing a scalable, principled evaluation paradigm for multilingual system prompts.
📝 Abstract
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as chain-of-thought (CoT) prompting, emotional cues, and scenario framing, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5–10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language-switching. Together, these results highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
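To make the general shape of multilingual prompt optimization concrete, here is a toy sketch: candidate system prompts are scored on every target language and the candidate with the best mean cross-lingual score is kept. This is purely illustrative and not the paper's actual framework; the language list, the `evaluate` scoring stub, and the reward for a reasoning cue are all hypothetical stand-ins for real benchmark runs.

```python
from statistics import mean

# Hypothetical language set; the paper evaluates five languages.
LANGUAGES = ["en", "de", "zh", "ar", "hi"]

def evaluate(prompt: str, lang: str) -> float:
    """Stand-in for an LLM benchmark run: a deterministic mock that
    rewards prompts containing an explicit reasoning cue."""
    score = 0.5
    if "step by step" in prompt:
        score += 0.2  # mock bonus for a chain-of-thought-style cue
    if lang == "en":
        score += 0.1  # mock English head start
    return score

def select_best_prompt(candidates: list[str]) -> tuple[str, float]:
    """Score each candidate across all languages and return the one
    with the highest mean cross-lingual score."""
    scored = [
        (p, mean(evaluate(p, lang) for lang in LANGUAGES))
        for p in candidates
    ]
    return max(scored, key=lambda pair: pair[1])

best, score = select_best_prompt([
    "You are a helpful assistant.",
    "You are a helpful assistant. Think step by step.",
])
print(best, round(score, 2))
```

The key design point the sketch mirrors is that a prompt is selected by its aggregate behavior over all languages at once, rather than being tuned per-language.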