AI Summary
Pretrained language models often rely on superficial shortcut features in their training data, which degrades generalization under distribution shift. This work proposes Shortcut Guardrail, a deployment-time debiasing framework that requires neither the original training data nor annotated shortcut labels. Using the model's own gradient-based attributions, the method identifies shortcut tokens and reduces their influence through a lightweight LoRA module trained with a masked contrastive learning objective. Shortcut Guardrail thus provides unsupervised, token-level mitigation of shortcut reliance at deployment time. Across sentiment classification, toxicity detection, and natural language inference, it improves both overall accuracy and worst-group accuracy under distribution shift relative to the unmitigated model, while preserving performance on the original distribution.
Abstract
Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
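To make the two ingredients above concrete, here is a minimal toy sketch of gradient-based attribution followed by a masked contrastive loss. It is an illustration under stated assumptions, not the authors' implementation. The pretrained model and LoRA module are replaced by a hypothetical mean-pooled embedding classifier whose gradient is written out analytically. The attribution is gradient × input. The contrastive loss takes an InfoNCE form, which is one plausible reading of "consistent representations with or without individual tokens". Only the single top-attributed token is masked. All names and values (`EMB`, `W`, `masked_contrastive_loss`, the temperature) are invented for this example.

```python
import math

# Hypothetical 2-d token embeddings and linear classifier weights.
EMB = {
    "movie": [0.1, 0.2],
    "great": [0.9, 0.1],
    "spielberg": [2.0, -0.5],  # stands in for a shortcut token
    "boring": [-0.8, 0.3],
    "plot": [0.0, 0.1],
}
W = [1.0, 0.2]


def mean_pool(tokens):
    """Sentence representation: mean of the token embeddings."""
    n = len(tokens)
    return [sum(EMB[t][d] for t in tokens) / n for d in range(len(W))]


def attributions(tokens):
    """Gradient x input scores for logit = W . mean_pool(tokens).

    Here d(logit)/d(e_i) = W / n, so token i scores |W . e_i| / n.
    """
    n = len(tokens)
    return [abs(sum(w * x for w, x in zip(W, EMB[t]))) / n for t in tokens]


def mask_top_token(tokens):
    """Drop the single highest-attribution token (assumes len(tokens) > 1)."""
    scores = attributions(tokens)
    top = max(range(len(tokens)), key=scores.__getitem__)
    return tokens[:top] + tokens[top + 1:]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_contrastive_loss(batch, temperature=0.1):
    """InfoNCE-style loss: each sentence should be closest to its own masked view.

    Assumed form for illustration; masked views of other sentences act as negatives.
    """
    full = [mean_pool(s) for s in batch]
    masked = [mean_pool(mask_top_token(s)) for s in batch]
    loss = 0.0
    for i, anchor in enumerate(full):
        logits = [cosine(anchor, m) / temperature for m in masked]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(batch)


batch = [["spielberg", "movie", "great"], ["boring", "plot", "movie"]]
print(attributions(batch[0]))          # per-token attribution scores
print(mask_top_token(batch[0]))        # sentence with the top token removed
print(masked_contrastive_loss(batch))  # scalar loss to minimize
```

In a real setting the loss would be minimized with respect to adapter parameters that shape the representation, while the base model stays frozen. In this toy nothing is trained, so the loss is only evaluated.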