Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how text classification models rely on spurious correlations (shortcuts), using actor names in movie reviews as a controllable shortcut with known impact on the outcome. Using mechanistic interpretability methods, the authors find that specific attention heads attend to shortcut tokens and steer the model toward a label before the full input is processed, effectively making premature decisions that bypass contextual analysis. Based on these findings, they propose Head-based Token Attribution (HTA), which traces these intermediate decisions back to input tokens and attributes shortcut-driven behavior to specific attention heads. Experiments show that HTA detects shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related heads without retraining the model.
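The mitigation step described above, selectively deactivating attention heads, can be illustrated with a minimal sketch. The toy multi-head attention below is not the paper's implementation; the function name, weight layout, and ablation-by-zeroing strategy are illustrative assumptions. Heads listed in `ablate_heads` have their output zeroed before the output projection, which removes their contribution to the residual stream while leaving all other heads untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, ablate_heads=()):
    """Toy multi-head self-attention with selective head deactivation.

    x  : (seq, d_model) input activations
    Wq, Wk, Wv : (n_heads, d_model, d_head) per-head projections
    Wo : (n_heads * d_head, d_model) output projection
    Heads whose index appears in `ablate_heads` are zeroed out
    before the output projection (illustrative ablation strategy).
    """
    n_heads, d_model, d_head = Wq.shape
    head_outputs = []
    for h in range(n_heads):
        q = x @ Wq[h]                                  # (seq, d_head)
        k = x @ Wk[h]
        v = x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(d_head))      # (seq, seq)
        o = attn @ v                                   # (seq, d_head)
        if h in ablate_heads:
            o = np.zeros_like(o)                       # deactivate this head
        head_outputs.append(o)
    concat = np.concatenate(head_outputs, axis=-1)     # (seq, n_heads * d_head)
    return concat @ Wo                                 # (seq, d_model)
```

In a real model this would typically be done with forward hooks on the attention module rather than a reimplementation, but the effect is the same: the flagged head's contribution is removed while the rest of the computation proceeds normally.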

📝 Abstract
Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
Problem

Research questions and friction points this paper is trying to address.

Investigates how shortcuts are processed in model decision-making
Identifies attention heads focusing on shortcuts in text classification
Introduces Head-based Token Attribution to detect and mitigate shortcuts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses actor names as controllable shortcuts
Identifies specific shortcut-focused attention heads
Introduces Head-based Token Attribution for mitigation
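The attribution idea behind these contributions, tracing a head's decision back to input tokens, can be sketched generically. The helper below is an illustrative proxy, not HTA itself: given the attention matrix of a flagged head, it ranks input tokens by the attention mass the classification position places on them, surfacing candidate shortcut tokens such as an actor name. The function name, `cls_pos` convention, and thresholding by top-k are all assumptions for the sketch.

```python
import numpy as np

def rank_shortcut_candidates(attn, tokens, cls_pos=-1, top_k=3):
    """Rank tokens by attention received from the decision position.

    attn   : (seq, seq) attention matrix of one flagged head
    tokens : list of seq input tokens
    cls_pos: index of the position whose prediction is being attributed
    Returns the top_k (token, score) pairs, highest attention first.
    """
    scores = attn[cls_pos]                       # attention from decision position
    order = np.argsort(scores)[::-1][:top_k]     # indices sorted by descending score
    return [(tokens[i], float(scores[i])) for i in order]

# Hypothetical example: the head concentrates on the first token.
attn = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.8, 0.1],
                 [0.6, 0.1, 0.3]])
tokens = ["great", "DiCaprio", "movie"]
top = rank_shortcut_candidates(attn, tokens, cls_pos=-1, top_k=2)
```

Tokens that dominate a flagged head's attention, yet carry no sentiment on their own, are natural candidates for the kind of shortcut the paper studies.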