🤖 AI Summary
This work addresses the limitations of current language models in rhetorical role labeling, particularly their poor performance on low-confidence, difficult instances and their neglect of the semantic information embedded in label names. The authors propose RISE, a framework that enhances inference without retraining or modifying the original model. RISE leverages contrastive learning to construct semantic representations of labels and performs semantic reranking of predictions for challenging samples. Notably, this is the first approach to incorporate label semantics during inference to refine output ranking. Analysis using human-annotated difficulty labels reveals a moderate agreement between model- and human-assessed sample difficulty (Cohen’s κ = 0.40). Experiments across eight domain-specific datasets and seven language models demonstrate that RISE improves macro F1 on difficult samples by an average of 9.15 points.
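The inference-time idea summarized above (flag low-confidence predictions, then rerank the candidate labels using label representations) can be illustrated with a minimal sketch. This is not the authors' implementation: the max-softmax confidence threshold, the top-k candidate set, the cosine-similarity score, and the mixing weight `alpha` are all assumptions made for illustration, and `rerank_prediction` is a hypothetical name. The label vectors stand in for contrastively learned label embeddings, which are simply passed in here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_prediction(logits, sentence_vec, label_vecs,
                      threshold=0.6, top_k=2, alpha=0.5):
    """Return a label index, reranking only when the model is unsure.

    Assumptions (not taken from the paper): hardness is detected by
    max softmax probability < threshold, and the reranking score is
    alpha * probability + (1 - alpha) * cosine(sentence, label).
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Confident prediction: keep the model's top label unchanged.
    if probs[ranked[0]] >= threshold:
        return ranked[0]

    # Low-confidence prediction: rescore the top-k candidates using
    # similarity between the sentence and each label embedding.
    candidates = ranked[:top_k]

    def score(i):
        return alpha * probs[i] + (1 - alpha) * cosine(sentence_vec, label_vecs[i])

    return max(candidates, key=score)

# Toy usage with made-up vectors: the model slightly prefers label 0,
# but the sentence embedding is closest to label 1.
label_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hard = rerank_prediction([2.0, 1.8, 0.1], [0.0, 1.0], label_vecs)
easy = rerank_prediction([5.0, 0.0, 0.0], [0.0, 1.0], label_vecs)
```

Because the underlying model only supplies logits, a step of this kind leaves its weights untouched, which matches the "no retraining" property described in the summary.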
📝 Abstract
Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, i.e., those on which prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, which reveal moderate agreement between the two (Cohen's κ = 0.40).
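For readers unfamiliar with the agreement statistic reported above, the following is a small sketch of how Cohen's κ is computed from two sets of labels, here imagined as model-derived and human-assigned hard/easy flags. The function name `cohens_kappa` and the toy labels are illustrative assumptions; they are not the paper's data and do not reproduce its reported value.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy labels (1 = hard, 0 = easy); made up purely for illustration.
model_hard = [1, 1, 0, 0]
human_hard = [1, 0, 0, 0]
kappa = cohens_kappa(model_hard, human_hard)  # 0.5 for this toy input
```

In the toy input, the raters agree on 3 of 4 items (observed 0.75) while chance agreement is 0.5, giving κ = (0.75 - 0.5) / (1 - 0.5) = 0.5.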