Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Pretrained language models often rely on superficial shortcut features in their training data, which degrades generalization under distribution shift. This work proposes Shortcut Guardrail, a deployment-time debiasing framework that requires neither the original training data nor shortcut annotations. The method uses gradient-based attribution on the biased model itself to identify shortcut tokens, then reduces their influence by training a lightweight LoRA module with a Masked Contrastive Learning (MaskCL) objective. Across sentiment classification, toxicity detection, and natural language inference, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shift while preserving in-distribution performance.
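To make the attribution step concrete, the sketch below computes gradient-times-input scores for a toy mean-pooled logistic classifier. It is an illustration only: the paper's exact attribution method is not specified here, and the vocabulary, embeddings, weights, and the names `EMB`, `W`, `B`, `predict`, and `attributions` are all hypothetical. The token "spielberg" stands in for a word the model has learned to over-rely on.

```python
import math

# Hypothetical toy vocabulary with 2-d embeddings; all values are illustrative.
EMB = {
    "the": [0.1, 0.0],
    "movie": [0.0, 0.1],
    "was": [0.05, 0.05],
    "fine": [0.3, 0.2],
    "spielberg": [2.0, 1.5],  # stands in for a token the model over-relies on
}
W = [1.0, 1.0]  # classifier weights (assumed, not learned here)
B = -0.5        # classifier bias (assumed)


def predict(tokens):
    """Positive-class probability of a mean-pooled logistic classifier."""
    n = len(tokens)
    pooled = [sum(EMB[t][d] for t in tokens) / n for d in range(len(W))]
    logit = sum(w * x for w, x in zip(W, pooled)) + B
    return 1.0 / (1.0 + math.exp(-logit))


def attributions(tokens):
    """Absolute gradient-times-input score for each token embedding."""
    p = predict(tokens)
    n = len(tokens)
    # For this model, d p / d e_i = p * (1 - p) * W / n for every token i.
    grad = [p * (1.0 - p) * w / n for w in W]
    return [abs(sum(g * x for g, x in zip(grad, EMB[t]))) for t in tokens]


tokens = ["the", "movie", "was", "fine", "spielberg"]
scores = attributions(tokens)
top_token = tokens[scores.index(max(scores))]
print(top_token)
```

In a real transformer the gradient would come from automatic differentiation with respect to the input embeddings rather than a closed form, but the ranking step (pick the highest-scoring tokens as shortcut candidates) would look the same.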

📝 Abstract

Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
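One way to read "consistent representations with or without individual tokens" is as a contrastive loss in which each input's full representation is pulled toward the representation of the same input with a token masked, and pushed away from the masked views of other inputs. The sketch below uses an InfoNCE-style loss for this purpose. This is an assumption for illustration, not the paper's stated MaskCL formulation; the function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_contrastive_loss(full_reps, masked_reps, temperature=0.1):
    """InfoNCE-style loss: full_reps[i] should match masked_reps[i].

    full_reps[i] is the representation of input i; masked_reps[i] is the
    representation of the same input with one token masked. The other
    masked views in the batch act as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(full_reps):
        logits = [cosine(anchor, m) / temperature for m in masked_reps]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(full_reps)


full = [[1.0, 0.0], [0.0, 1.0]]
# Masking barely moves the representation: the model does not hinge on the token.
consistent = [[0.9, 0.1], [0.1, 0.9]]
# Masking flips the representation: the model depended heavily on the token.
inconsistent = [[0.1, 0.9], [0.9, 0.1]]

loss_consistent = masked_contrastive_loss(full, consistent)
loss_inconsistent = masked_contrastive_loss(full, inconsistent)
print(loss_consistent < loss_inconsistent)
```

Under this reading, minimizing the loss penalizes representations that change sharply when a single token is removed, which is the behavior expected of a model leaning on a shortcut token. In the framework described above, only the LoRA parameters would be updated against such an objective.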

Problem

Research questions and friction points this paper is trying to address.

shortcut learning

pretrained language models

distribution shift

bias mitigation

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shortcut Learning

Deployment-Time Mitigation

Gradient-Based Attribution