🤖 AI Summary
This study investigates the logical consistency and cross-lingual alignment capabilities of large language models (LLMs) in multilingual and code-switched settings. Method: We propose the first logic-controllable, synthetic multilingual natural language inference (NLI) evaluation framework: it automatically generates semantically precise premise–hypothesis pairs, constructs diverse test sets via multilingual translation and controlled code-switching, and validates semantic fidelity through embedding similarity analysis and visualization. Contribution/Results: Contrary to expectations, code-switching does not degrade model performance; instead, it enhances cross-lingual reasoning stability—attributed to implicit regularization induced by translation-driven lexical variation. Our work establishes a reproducible benchmark for multilingual LLM evaluation, releases open-source tooling, and introduces the “multilingual augmentation for robustness” paradigm—a novel approach to improving LLM resilience through controlled linguistic mixing.
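The "controlled code-switching" step above can be illustrated as token-level substitution from a bilingual lexicon at a chosen switch ratio. This is a minimal sketch under assumed details (the lexicon, the `ratio` parameter, and random token selection are illustrative choices, not the paper's actual pipeline):

```python
import random

def code_switch(tokens, lexicon, ratio=0.5, seed=0):
    """Replace a controlled fraction of translatable tokens with their
    counterparts from a bilingual lexicon (illustrative sketch only)."""
    rng = random.Random(seed)
    # Indices of tokens that have a translation available
    switchable = [i for i, t in enumerate(tokens) if t.lower() in lexicon]
    k = round(len(switchable) * ratio)
    chosen = set(rng.sample(switchable, k))
    return [lexicon[t.lower()] if i in chosen else t
            for i, t in enumerate(tokens)]

# Toy English->Spanish lexicon (hypothetical)
lexicon = {"cat": "gato", "dog": "perro", "sleeps": "duerme"}
mixed = code_switch("The cat sleeps near the dog".split(), lexicon, ratio=1.0)
# With ratio=1.0 every switchable token is replaced:
# ['The', 'gato', 'duerme', 'near', 'the', 'perro']
```

Varying `ratio` gives graded mixing levels, which is one way such a framework could sweep from fully monolingual to heavily code-switched test items.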
📝 Abstract
Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise–hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
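The embedding-based fidelity check can be sketched as a cosine-similarity threshold between the source sentence's embedding and its translation's embedding. The threshold value and the placeholder vectors below are assumptions for illustration; in practice the vectors would come from a multilingual sentence encoder:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_faithful(emb_src, emb_trans, threshold=0.85):
    """Flag a translated pair as semantically preserved when its
    embedding similarity clears a threshold (value is illustrative)."""
    return cosine_similarity(emb_src, emb_trans) >= threshold

# Placeholder vectors standing in for sentence embeddings
src = [0.2, 0.8, 0.1]
trans = [0.25, 0.75, 0.12]
print(is_faithful(src, trans))  # near-identical directions -> True
```

Pairs falling below the threshold would be flagged for review or excluded, keeping only translations whose embeddings stay close to the source.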