Don't Just Pay Attention, PLANT It: Transfer L2R Models to Fine-tune Attention in Extreme Multi-Label Text Classification

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the difficulty of learning good attention weights in eXtreme Multi-Label Text Classification (XMTC), this paper proposes PLANT (Pretrained and Leveraged AtteNTion), a transfer learning strategy that "plants" a pretrained Learning-to-Rank (LTR) model as the attention layer of an XMTC decoder and fine-tunes it. Its core technical contributions are: (i) leveraging the pretrained LTR model as the planted attention layer; (ii) integrating mutual-information gain to enhance attention; (iii) an inattention mechanism; and (iv) a stateful decoder that maintains context across predictions. PLANT surpasses prior state-of-the-art methods on five standard benchmarks, including MIMIC-Full, and is especially strong in few-shot settings, outperforming previous models designed for few-shot scenarios by over 50 percentage points in F1 on MIMIC-Rare and over 36 points on MIMIC-Few. It is also data-efficient, achieving precision comparable to traditional models while using significantly less labeled data.
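To make the "planting" idea concrete, here is a minimal PyTorch sketch of a label-wise attention layer whose weights are copied from a pretrained model rather than learned from scratch. All names here (LabelWiseAttention, plant_attention, ltr_weights) are hypothetical illustrations, not the paper's actual code.

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Per-label attention over token representations (minimal sketch)."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        # One attention query vector per label.
        self.label_queries = nn.Parameter(torch.empty(num_labels, hidden_dim))
        nn.init.xavier_uniform_(self.label_queries)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim)
        scores = torch.einsum("bsh,lh->bls", token_states, self.label_queries)
        weights = torch.softmax(scores, dim=-1)  # attention over tokens, per label
        # Label-specific document vectors: (batch, num_labels, hidden_dim)
        return torch.einsum("bls,bsh->blh", weights, token_states)


def plant_attention(attn: LabelWiseAttention, ltr_weights: torch.Tensor) -> None:
    """Copy weights from a pretrained Learning-to-Rank model into the
    attention layer ("planting") instead of random initialization."""
    with torch.no_grad():
        attn.label_queries.copy_(ltr_weights)  # ltr_weights: (num_labels, hidden_dim)
```

After planting, the layer is fine-tuned with the rest of the XMTC decoder, which is the transfer step the title refers to.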

📝 Abstract
State-of-the-art Extreme Multi-Label Text Classification (XMTC) models rely heavily on multi-label attention layers to focus on key tokens in input text, but obtaining optimal attention weights is challenging and resource-intensive. To address this, we introduce PLANT -- Pretrained and Leveraged AtteNTion -- a novel transfer learning strategy for fine-tuning XMTC decoders. PLANT surpasses existing state-of-the-art methods across all metrics on mimicfull, mimicfifty, mimicfour, eurlex, and wikiten datasets. It particularly excels in few-shot scenarios, outperforming previous models specifically designed for few-shot scenarios by over 50 percentage points in F1 scores on mimicrare and by over 36 percentage points on mimicfew, demonstrating its superior capability in handling rare codes. PLANT also shows remarkable data efficiency in few-shot scenarios, achieving precision comparable to traditional models with significantly less data. These results are achieved through key technical innovations: leveraging a pretrained Learning-to-Rank model as the planted attention layer, integrating mutual-information gain to enhance attention, introducing an inattention mechanism, and implementing a stateful-decoder to maintain context. Comprehensive ablation studies validate the importance of these contributions in realizing the performance gains.
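The stateful decoder mentioned in the abstract can be pictured as a decoder that carries hidden state across prediction steps instead of scoring all labels independently. The GRU-based sketch below is an assumption for illustration; the paper's decoder may be structured differently.

```python
import torch
import torch.nn as nn

class StatefulDecoder(nn.Module):
    """Scores labels step by step, carrying hidden state so earlier
    predictions inform later ones (illustrative sketch only)."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.label_emb = nn.Embedding(num_labels, hidden_dim)
        self.score = nn.Linear(hidden_dim, num_labels)

    def forward(self, doc_vec: torch.Tensor, num_steps: int) -> torch.Tensor:
        # doc_vec: (batch, hidden_dim) pooled document representation
        state = torch.zeros_like(doc_vec)
        step_input = doc_vec
        all_logits = []
        for _ in range(num_steps):
            state = self.cell(step_input, state)   # context carried forward
            logits = self.score(state)             # (batch, num_labels)
            all_logits.append(logits)
            # Feed back the embedding of the currently top-scored label.
            step_input = self.label_emb(logits.argmax(dim=-1))
        return torch.stack(all_logits, dim=1)      # (batch, num_steps, num_labels)
```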
Problem

Research questions and friction points this paper is trying to address.

Improving attention weight learning in multi-label classification
Initializing attention via pretrained learning-to-rank models
Enhancing performance on rare labels in few-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrains attention weights using a Learning-to-Rank model
Initializes label-specific attention via mutual-information gain (see the sketch after this list)
Provides plug-and-play integration with LLM backbones
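As a rough illustration of the mutual-information-gain idea, the sketch below derives token-label association scores from co-occurrence counts; the smoothing and PMI-style scoring are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def mi_gain_matrix(counts: np.ndarray) -> np.ndarray:
    """Token-label association scores from co-occurrence counts (illustrative).

    counts: (vocab_size, num_labels) token/label co-occurrence counts.
    Returns a non-negative (vocab_size, num_labels) matrix that could be
    used to seed per-label attention toward informative tokens.
    """
    joint = counts.astype(np.float64) + 1e-9    # smoothed joint counts
    joint /= joint.sum()                        # P(token, label)
    p_token = joint.sum(axis=1, keepdims=True)  # marginal P(token)
    p_label = joint.sum(axis=0, keepdims=True)  # marginal P(label)
    pmi = np.log(joint / (p_token * p_label))   # pointwise mutual information
    return np.maximum(pmi, 0.0)                 # keep positive associations
```

A matrix like this could seed the per-label attention queries in the earlier sketch, biasing each label toward its most informative tokens before fine-tuning.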
👥 Authors
Debjyoti Saharoy
Khoury College of Computer Sciences, Northeastern University, Boston, Massachusetts
J. Aslam
Khoury College of Computer Sciences, Northeastern University, Boston, Massachusetts
Virgil Pavlu
Northeastern University