What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models

📅 2025-12-09

🤖 AI Summary

This study investigates what *triggers* gender bias in machine translation and large language models, rather than merely measuring it. The authors combine *contrastive explanations* with saliency attribution to quantify how much each source token influences the choice of gender inflection in the target language, and they use linguistic analysis to characterize the salient words. Because attribution scores come with no established threshold, the study examines several attribution levels. It then compares the salient source words with human perceptions of gender and reports a noticeable overlap between the two. The authors argue that this information should be leveraged to mitigate gender bias.

📝 Abstract

Interpretability can be implemented as a means to understand decisions taken by (black-box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model's choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge posed by the lack of a scoring threshold and specifically examine different attribution levels of source words on the model's gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, shows how these compare to human decisions, and argues that this information should be leveraged to mitigate gender bias.
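
The contrastive setup in the abstract asks why the model chose one gendered form (the target) rather than the other (the foil). The sketch below illustrates that idea only; it is not the authors' implementation. It uses erasure as a stand-in attribution method, since the abstract does not name the exact saliency technique, and a hypothetical toy scorer (`log_probs`, with invented `CUE_WEIGHTS`) in place of a real translation model.

```python
import math

# Hypothetical toy scorer standing in for a real MT model. The cue weights
# are invented for illustration and carry no empirical meaning.
CUE_WEIGHTS = {
    "nurse": {"feminine": 1.5, "masculine": 0.2},
    "kind": {"feminine": 0.4, "masculine": 0.3},
}


def log_probs(source_tokens):
    """Return log-probabilities over the two gendered target forms."""
    scores = {"feminine": 0.0, "masculine": 0.0}
    for tok in source_tokens:
        for form, weight in CUE_WEIGHTS.get(tok, {}).items():
            scores[form] += weight
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {form: s - log_z for form, s in scores.items()}


def contrastive_margin(source_tokens, target, foil):
    """Log-probability of the chosen form minus that of the foil."""
    lp = log_probs(source_tokens)
    return lp[target] - lp[foil]


def contrastive_erasure_saliency(source_tokens, target, foil):
    """Score each source token by how much removing it shrinks the margin."""
    full = contrastive_margin(source_tokens, target, foil)
    saliency = []
    for i in range(len(source_tokens)):
        reduced = source_tokens[:i] + source_tokens[i + 1:]
        saliency.append(full - contrastive_margin(reduced, target, foil))
    return saliency


tokens = ["the", "kind", "nurse", "smiled"]
scores = contrastive_erasure_saliency(tokens, target="feminine", foil="masculine")
for tok, score in zip(tokens, scores):
    print(f"{tok:>8}  {score:+.2f}")
```

A positive score means the token pushes the model toward the target form and away from the foil. With a real model, a gradient-based saliency on the same target-minus-foil margin would play the role that erasure plays here.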

Problem

Research questions and friction points this paper is trying to address.

Investigates gender bias triggers in translation models

Explores source context influence on gender inflection choices

Compares model gender decisions with human perceptions
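
One way to make the comparison with human perceptions concrete is to threshold the attribution scores and measure how the resulting word set overlaps with the words that annotators marked. The sketch below is an assumption-laden illustration: the metrics (precision, recall, Jaccard), the max-normalised thresholding, and the example data are all chosen here for clarity and are not taken from the paper. Sweeping several thresholds mirrors the abstract's point that no scoring threshold is given in advance.

```python
def salient_tokens(tokens, scores, threshold):
    """Tokens whose absolute score, relative to the largest one, is >= threshold.

    Repeated tokens collapse into one set entry, which is adequate for a sketch.
    """
    peak = max((abs(s) for s in scores), default=0.0)
    if peak == 0.0:
        return set()
    return {t for t, s in zip(tokens, scores) if abs(s) / peak >= threshold}


def overlap(model_set, human_set):
    """Precision, recall, and Jaccard overlap between two word sets."""
    inter = model_set & human_set
    union = model_set | human_set
    return {
        "precision": len(inter) / len(model_set) if model_set else 0.0,
        "recall": len(inter) / len(human_set) if human_set else 0.0,
        "jaccard": len(inter) / len(union) if union else 0.0,
    }


# Hypothetical example data: attribution scores and human-marked gender cues.
tokens = ["the", "kind", "nurse", "smiled"]
scores = [0.0, 0.1, 1.3, 0.0]
human_marked = {"kind", "nurse"}

# Sweep several attribution levels instead of committing to one threshold.
results = {}
for threshold in (0.05, 0.5):
    model_marked = salient_tokens(tokens, scores, threshold)
    results[threshold] = overlap(model_marked, human_marked)
    print(threshold, sorted(model_marked), results[threshold])
```

A strict threshold keeps only the strongest cue and so trades recall for precision, while a lenient one admits weaker cues. This is why the choice of attribution level matters when comparing against human judgments.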

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive explanations analyze gender bias triggers

Saliency attribution identifies influential source tokens

Linguistic analysis characterizes salient words, and model attribution is compared with human perceptions of gender
