Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how system prompts improve the accuracy and behavioral robustness of large language models (LLMs) in multilingual settings. To this end, we propose the first four-dimensional system-prompt evaluation and optimization framework tailored to multilingual scenarios, integrating key prompt components (including chain-of-thought reasoning, affective cues, and contextual grounding), and analyze over ten million reasoning units across five languages, three mainstream LLMs, and three cross-lingual benchmarks. Experimental results show that high-performing prompts induce more structured and consistent reasoning paths while markedly suppressing unnecessary code-switching. Our optimization method yields average improvements of 5-10% across all metrics, substantially enhancing cross-lingual reasoning consistency and deployment reliability. The core contributions are: (1) uncovering interpretable associations between prompt components and multilingual performance, and (2) establishing a scalable, principled evaluation paradigm for multilingual system prompts.

📝 Abstract
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as chain-of-thought (CoT) reasoning, emotional cues, and scenario framing, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show that it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language-switching. Together, these findings highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
Problem

Research questions and friction points this paper is trying to address.

Optimizing system prompts for multilingual LLM behavior
Evaluating prompt components for cross-lingual robustness
Reducing language-switching in multilingual reasoning patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified evaluation framework for multilingual prompt assessment
Automated prompt optimization improving metrics by 5-10%
High-performing system prompts induce structured reasoning and reduce language-switching
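The multilingual prompt-optimization idea described above can be sketched as a loop that scores each candidate system prompt across a fixed language set and keeps the best average performer. This is a minimal illustrative sketch, not the paper's implementation: the candidate prompts, language codes, and the `score_prompt` stand-in (which would in reality run a multilingual benchmark and measure accuracy, consistency, and language fidelity) are all hypothetical.

```python
# Hypothetical sketch of multilingual system-prompt evaluation.
# Prompts, languages, and the scoring function are illustrative assumptions.

LANGUAGES = ["en", "zh", "es", "fr", "de"]  # example language set

CANDIDATE_PROMPTS = {
    "baseline": "You are a helpful assistant.",
    "cot": "You are a helpful assistant. Think step by step before answering.",
    "cot_scenario": ("You are an expert tutor helping a student. "
                     "Think step by step and answer in the user's language."),
}

def score_prompt(prompt: str, language: str) -> float:
    """Stand-in for benchmarking `prompt` in `language`.

    A real evaluation would run the model on a cross-lingual benchmark;
    here we return a deterministic dummy score so the loop is runnable.
    """
    bonus = 0.05 * ("step by step" in prompt) + 0.05 * ("user's language" in prompt)
    return 0.6 + bonus  # dummy: same score for every language

def evaluate(prompts: dict[str, str]) -> dict[str, float]:
    """Average each candidate prompt's score over all target languages."""
    return {
        name: sum(score_prompt(p, lang) for lang in LANGUAGES) / len(LANGUAGES)
        for name, p in prompts.items()
    }

scores = evaluate(CANDIDATE_PROMPTS)
best = max(scores, key=scores.get)
```

An actual optimizer would also mutate or regenerate candidates between rounds; the key point is that a single scalar per prompt, averaged over languages, makes cross-lingual prompt selection a straightforward search problem.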