🤖 AI Summary
This study investigates how system prompts improve the accuracy and behavioral robustness of large language models (LLMs) in multilingual settings. To this end, we propose the first four-dimensional framework for evaluating and optimizing system prompts in multilingual scenarios, integrating key prompt components (including chain-of-thought reasoning, affective cues, and contextual grounding), and analyze over ten million reasoning units across five languages, three mainstream LLMs, and three cross-lingual benchmarks. Experimental results show that high-performing prompts induce more structured and consistent reasoning paths while significantly suppressing unnecessary code-switching. Our optimization method yields average improvements of 5–10% across all metrics, substantially enhancing cross-lingual reasoning consistency and deployment reliability. The core contributions are: (1) uncovering interpretable associations between prompt components and multilingual performance, and (2) establishing a scalable, principled evaluation paradigm for multilingual system prompts.
📝 Abstract
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as chain-of-thought (CoT) prompting, emotional cues, and scenario framing, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5–10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language-switching. Together, these results highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
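To make the general shape of multilingual prompt optimization concrete, here is a toy sketch: candidate system prompts are scored on every target language and the candidate with the best mean cross-lingual score is kept. This is purely illustrative and not the paper's actual framework; the language list, the `evaluate` scoring stub, and the reward for a reasoning cue are all hypothetical stand-ins for real benchmark runs.

```python
from statistics import mean

# Hypothetical language set; the paper evaluates five languages.
LANGUAGES = ["en", "de", "zh", "ar", "hi"]

def evaluate(prompt: str, lang: str) -> float:
    """Stand-in for an LLM benchmark run: a deterministic mock that
    rewards prompts containing an explicit reasoning cue."""
    score = 0.5
    if "step by step" in prompt:
        score += 0.2  # mock bonus for a chain-of-thought-style cue
    if lang == "en":
        score += 0.1  # mock English head start
    return score

def select_best_prompt(candidates: list[str]) -> tuple[str, float]:
    """Score each candidate across all languages and return the one
    with the highest mean cross-lingual score."""
    scored = [
        (p, mean(evaluate(p, lang) for lang in LANGUAGES))
        for p in candidates
    ]
    return max(scored, key=lambda pair: pair[1])

best, score = select_best_prompt([
    "You are a helpful assistant.",
    "You are a helpful assistant. Think step by step.",
])
print(best, round(score, 2))
```

The key design point the sketch mirrors is that a prompt is selected by its aggregate behavior over all languages at once, rather than being tuned per-language.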