🤖 AI Summary
This work addresses the implicit cultural bias in moral judgments exhibited by large language models and the limitations of existing alignment methods, which rely on fine-tuning or internal model access and thus cannot be applied to black-box API settings. The authors propose DISCA, a training-free, inference-time calibration method that leverages cross-national value divergences—rather than consensus—from the World Values Survey to construct a panel of persona-based agents. By incorporating a loss-aversion mechanism, DISCA translates these value disagreements into bounded logit adjustments, achieving cultural alignment without modifying model weights. Evaluated across 20 countries and seven open-source models (ranging from 2B to 70B parameters), DISCA reduces MultiTP cultural misalignment by 10–24% in models of at least 3.8B parameters and by 2–7% in open-ended scenarios, using only publicly available data and black-box model access.
📝 Abstract
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.