Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

📅 2026-04-14

🤖 AI Summary
Pretrained language models often rely on superficial shortcut features in their training data, which degrades generalization under distribution shift. This work proposes Shortcut Guardrail, a deployment-time debiasing framework that requires neither the original training data nor annotated shortcut labels. Using the model's own gradient-based attributions, the method identifies shortcut tokens and reduces their influence with a lightweight LoRA module trained under a masked contrastive learning (MaskCL) objective. The result is unsupervised, token-level mitigation of shortcut reliance at deployment time: across sentiment classification, toxicity detection, and natural language inference, Shortcut Guardrail improves both overall accuracy and worst-group accuracy under distribution shift relative to the unmitigated model, while preserving performance on the original distribution.
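
To make the attribution step concrete, here is a minimal sketch of gradient-times-input scoring on a toy classifier. Everything in it is an assumption for illustration: the vocabulary, the embedding values, the weights W, the tanh mean-pooling model, and the choice of gradient-times-input itself (the summary says only "gradient-based attributions"). The token "spielberg" is a hypothetical shortcut, not one taken from the paper.

```python
import math

# Hypothetical 2-d token embeddings (illustrative values, not from the paper).
EMBED = {
    "the": [0.1, 0.0],
    "movie": [0.2, 0.1],
    "was": [0.0, 0.1],
    "fine": [0.3, 0.2],
    "spielberg": [2.0, 1.5],  # stands in for a spurious shortcut token
}
W = [1.0, 0.5]  # toy classifier weights, assumed to be already biased


def logit(vectors):
    """Toy model: tanh of the mean-pooled embedding projected onto W."""
    n = len(vectors)
    pooled = [sum(v[d] for v in vectors) / n for d in range(len(W))]
    return math.tanh(sum(w * p for w, p in zip(W, pooled)))


def grad_x_input(tokens, eps=1e-5):
    """Score each token by |gradient . embedding|, with the gradient
    estimated by central finite differences instead of autograd."""
    vectors = [list(EMBED[t]) for t in tokens]
    scores = []
    for vec in vectors:
        score = 0.0
        for d in range(len(vec)):
            original = vec[d]
            vec[d] = original + eps
            up = logit(vectors)
            vec[d] = original - eps
            down = logit(vectors)
            vec[d] = original  # restore before moving to the next dimension
            score += (up - down) / (2 * eps) * original
        scores.append(abs(score))
    return scores


tokens = ["the", "movie", "was", "fine", "spielberg"]
scores = grad_x_input(tokens)
top_token = tokens[scores.index(max(scores))]
```

In this toy setup the token with the largest score is the candidate shortcut. A real pipeline would obtain gradients from the deployed model's autograd rather than from finite differences.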

📝 Abstract
Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
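
The abstract does not give the MaskCL loss itself, so the following is only one plausible reading, sketched as an InfoNCE-style objective: each full-input representation is pulled toward the representation of the same input with a flagged token masked, and pushed away from the masked representations of other inputs. The function names, the temperature value, and the 2-d vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def masked_contrastive_loss(full_reps, masked_reps, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's definition).

    full_reps[i] is the representation of input i; masked_reps[i] is the
    representation of the same input with a flagged token masked out.
    The positive pair is (full_reps[i], masked_reps[i]); the masked views
    of the other inputs in the batch serve as negatives.
    """
    losses = []
    for i, anchor in enumerate(full_reps):
        logits = [cosine(anchor, m) / temperature for m in masked_reps]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


full = [[1.0, 0.0], [0.0, 1.0]]
# Masking barely changes the representation: low loss expected.
masked_consistent = [[0.9, 0.1], [0.1, 0.9]]
# Masking flips the representation: the model leaned on the masked token.
masked_shifted = [[0.1, 0.9], [0.9, 0.1]]

loss_consistent = masked_contrastive_loss(full, masked_consistent)
loss_shifted = masked_contrastive_loss(full, masked_shifted)
```

Under this reading, minimizing the loss with respect to the LoRA parameters would reward representations that stay stable when a suspected shortcut token is removed, which matches the abstract's "consistent representations with or without individual tokens".
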
Problem

Research questions and friction points this paper is trying to address.

shortcut learning
pretrained language models
distribution shift
bias mitigation
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shortcut Learning
Deployment-Time Mitigation
Gradient-Based Attribution
LoRA
Masked Contrastive Learning