Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for generating counterfactual explanations in non-English languages struggle to achieve both high validity (the counterfactual flips the model's prediction) and minimality (the input is perturbed as little as possible), because the two objectives trade off against each other. This work proposes Macro, a framework that formulates multilingual counterfactual generation as a preference alignment problem. Macro constructs preference pairs using a composite scoring function and fine-tunes multilingual large language models via Direct Preference Optimization (DPO), without relying on translation or supervised fine-tuning. Experiments across four models and seven languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline while preserving minimality, avoids the severe minimality violations of the translation-based baseline, reduces generation errors, and increases cross-lingual perturbation alignment.

📝 Abstract

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
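
To make the idea of turning the validity/minimality trade-off into preference signals more concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the scoring formula, so the weighted sum, the `alpha` weight, the token-level edit distance used as a minimality proxy, the `margin` threshold, and the `Candidate` class are all illustrative assumptions. The output uses the prompt/chosen/rejected layout that is common for DPO training data, with the original input standing in for the full prompt.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    """A generated counterfactual (hypothetical container, not from the paper)."""
    text: str
    flipped: bool  # assumed to be supplied by re-querying the model on `text`


def edit_distance(a: list[str], b: list[str]) -> int:
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def composite_score(original: str, cand: Candidate, alpha: float = 0.5) -> float:
    """Assumed composite: alpha * validity + (1 - alpha) * minimality, in [0, 1]."""
    src, tgt = original.split(), cand.text.split()
    dist = edit_distance(src, tgt)
    # Minimality proxy: 1.0 means unchanged, 0.0 means fully rewritten.
    minimality = 1.0 - min(1.0, dist / max(len(src), len(tgt), 1))
    validity = 1.0 if cand.flipped else 0.0
    return alpha * validity + (1.0 - alpha) * minimality


def build_preference_pairs(original: str, candidates: list[Candidate],
                           margin: float = 0.1) -> list[dict]:
    """Pair candidates whose scores differ by at least `margin`."""
    scored = [(composite_score(original, c), c) for c in candidates]
    pairs = []
    for (s1, c1), (s2, c2) in combinations(scored, 2):
        if abs(s1 - s2) < margin:
            continue  # too close to call; skip ambiguous pairs
        chosen, rejected = (c1, c2) if s1 > s2 else (c2, c1)
        pairs.append({"prompt": original,
                      "chosen": chosen.text,
                      "rejected": rejected.text})
    return pairs
```

Under these assumptions, a candidate that flips the prediction with a one-word edit outranks both a small edit that fails to flip and a full rewrite that does flip, which is the kind of ordering a DPO fine-tuning step could then learn from.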

Problem

Research questions and friction points this paper is trying to address.

counterfactual generation

multilingual

validity

minimality

preference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

preference optimization

counterfactual generation

multilingual alignment