Optimizing Language Models for Crosslingual Knowledge Consistency

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a reliability problem of large language models (LLMs) in multilingual settings: inconsistent knowledge leads a model to give contradictory answers to semantically equivalent questions asked in different languages. To mitigate this, the paper proposes Direct Consistency Optimization (DCO), a method inspired by Direct Preference Optimization (DPO) that, like DPO, requires no explicit reward model and is derived directly from the LLM itself. DCO builds on reinforcement learning with a structured reward function whose optimal policy gives consistent crosslingual responses. Experiments across diverse LLMs show that DCO significantly improves crosslingual consistency and outperforms existing methods when training on samples from multiple languages. Additional experiments show that it is effective in bilingual settings, generalizes out of domain, and supports controllable alignment via direction hyperparameters. When gold labels are available, DCO complements DPO.
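For orientation, the standard DPO objective that DCO takes inspiration from can be sketched in a few lines. This is plain DPO for a single preference pair, not the DCO loss, which this page does not spell out. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    trained policy and under the frozen reference model. All names here
    are hypothetical; this is not the DCO objective.
    """
    # Implicit rewards: how far the policy moved away from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and falls as the policy shifts probability toward the chosen response, which is why no separately trained reward model is needed.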
📝 Abstract
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training on samples in multiple languages, while complementing DPO when gold labels are available. Additional experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
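To make "crosslingual consistency" concrete, one simple way to quantify it is pairwise agreement: for each pair of languages, count how often the model gives the same answer to the same question. The sketch below is an illustrative measure under that assumption, not necessarily the metric used in the paper. The function name and input layout are hypothetical, and answers are assumed to be already normalized into a comparable form (for example, a multiple-choice option label).

```python
from itertools import combinations


def crosslingual_consistency(answers_by_lang):
    """Fraction of (language pair, question) cases with identical answers.

    answers_by_lang maps a language code to a list of normalized answers,
    aligned by question index across languages. Hypothetical helper for
    illustration only.
    """
    langs = sorted(answers_by_lang)
    if len(langs) < 2:
        raise ValueError("need answers in at least two languages")

    n_questions = len(answers_by_lang[langs[0]])
    if n_questions == 0:
        raise ValueError("need at least one question")
    if any(len(answers_by_lang[lang]) != n_questions for lang in langs):
        raise ValueError("answer lists must be aligned and equally long")

    pairs = list(combinations(langs, 2))
    agreements = sum(
        answers_by_lang[a][i] == answers_by_lang[b][i]
        for a, b in pairs
        for i in range(n_questions)
    )
    return agreements / (len(pairs) * n_questions)


if __name__ == "__main__":
    answers = {
        "en": ["Paris", "Berlin"],
        "nl": ["Paris", "Bonn"],
        "zh": ["Paris", "Berlin"],
    }
    print(round(crosslingual_consistency(answers), 3))
```

Note that this measure rewards agreement regardless of correctness: a model that is consistently wrong in every language scores 1.0, so accuracy has to be tracked separately.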
Problem

Research questions and friction points this paper is trying to address.

crosslingual knowledge consistency
large language models
multilingual scenarios
knowledge inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

crosslingual consistency
reinforcement learning
Direct Consistency Optimization
multilingual LLMs
structured reward