🤖 AI Summary
This study investigates what *triggers* gender bias in machine translation (MT) and large language models (LLMs), rather than merely measuring it. Working with gender-ambiguous natural source sentences, we combine *contrastive explanations* with saliency attribution to estimate how strongly each source token influences the gender inflection the model chooses in the target language. Because no standard scoring threshold exists, we examine several attribution levels. We then compare the salient source words with human perceptions of gender and add a linguistic analysis of those words. The results show a noticeable overlap between model attribution and human perception, which suggests that attribution analysis can help locate the context behind gendered translation decisions and inform bias mitigation.
📝 Abstract
Interpretability offers a means of understanding decisions taken by (black-box) models, such as machine translation (MT) systems or large language models (LLMs). Yet research in this area has paid limited attention to a well-documented problem in these models: gender bias. With this research, we aim to move away from simply measuring bias and towards exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model's choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge posed by the lack of a scoring threshold and specifically examine how different attribution levels of source words relate to the model's gender decisions in the translation. We then compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender and how these compare to human decisions, and it argues that this information should be leveraged to mitigate gender bias.
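The comparison between salient source words and human perceptions can be pictured with a small sketch. It is not the authors' implementation: the function names, the normalisation by total absolute attribution, the Jaccard overlap measure, and the example scores are all assumptions made for illustration. The sketch selects source positions whose share of attribution reaches a chosen threshold and measures how far that set agrees with positions a human annotator marked as gender cues.

```python
from typing import List, Set


def salient_positions(scores: List[float], threshold: float) -> Set[int]:
    """Return positions whose share of total absolute attribution >= threshold.

    `scores` holds one attribution value per source token (hypothetical values;
    in practice they would come from a saliency method).
    """
    total = sum(abs(s) for s in scores)
    if total == 0:
        return set()
    return {i for i, s in enumerate(scores) if abs(s) / total >= threshold}


def jaccard(model: Set[int], human: Set[int]) -> float:
    """Jaccard overlap between model-salient and human-marked positions."""
    union = model | human
    if not union:
        return 0.0
    return len(model & human) / len(union)


if __name__ == "__main__":
    # Made-up example: tokens, illustrative attribution scores, and the
    # position a hypothetical annotator marked as a gender cue.
    tokens = ["The", "nurse", "helped", "the", "patient"]
    scores = [0.05, 0.55, 0.15, 0.05, 0.20]
    human_marked = {1}

    # With no agreed threshold, sweep a few attribution levels.
    for threshold in (0.1, 0.3, 0.5):
        selected = salient_positions(scores, threshold)
        words = [tokens[i] for i in sorted(selected)]
        print(threshold, words, round(jaccard(selected, human_marked), 3))
```

Sweeping the threshold makes the trade-off visible: a low level admits many source words and dilutes agreement with the human annotation, while a high level keeps only the strongest candidates.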