Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

2025-10-11
AI Summary
Personalized text summarization suffers from a scarcity of suitable training data: existing datasets such as MS/CAS PENS contain user click/skip trajectories but no gold-reference summaries, and their limited topic-transition diversity hinders end-to-end supervised learning and generalization. To address this, we propose PerAugy, a data augmentation technique that generates more diverse training samples through cross-trajectory shuffling and summary-content perturbation. As a post-hoc analysis, we introduce three dataset diversity metrics, Topic Purity (TP), Relevance-to-Context (RTC), and Degree of Diversity (DegreeD), and find that TP and DegreeD correlate strongly with user-encoder performance. PerAugy improves the accuracy of four state-of-the-art user-encoders (best gain: +0.132 AUC), and two state-of-the-art summarizer frameworks equipped with the improved encoders show higher personalization (average gain: +61.2% on the PSE-SU4 metric).

Abstract
Document summarization enables efficient extraction of user-relevant content but is inherently shaped by individual subjectivity, making it challenging to identify subjectively salient information in multifaceted documents. This complexity underscores the necessity for personalized summarization. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., click-skip trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a valuable resource but includes only preference history without target summaries, preventing end-to-end supervised learning, and its limited topic-transition diversity further restricts generalization. To address this, we propose $\mathrm{PerAugy}$, a novel data augmentation technique based on cross-trajectory shuffling and summary-content perturbation that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $\text{0.132}\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that, when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $\text{61.2\%}\uparrow$ w.r.t. the PSE-SU4 metric). As a post-hoc analysis of the role of the diversity that $\mathrm{PerAugy}$ induces in the augmented dataset, we introduce three dataset diversity metrics, $\mathrm{TP}$, $\mathrm{RTC}$, and $\mathrm{DegreeD}$, to quantify the induced diversity. We find that $\mathrm{TP}$ and $\mathrm{DegreeD}$ strongly correlate with user-encoder performance on the PerAugy-generated dataset across all accuracy metrics, indicating that increased dataset diversity is a key factor driving performance gains.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarce user preference data for personalized summarization training
Enhancing generalization through diversity augmentation of training datasets
Improving user-encoder accuracy and personalization in summarization frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-trajectory shuffling for data augmentation
Summary-content perturbation to boost diversity
Introducing metrics to quantify dataset diversity gains
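The two augmentation ideas listed above can be illustrated with a small sketch. This is not the authors' algorithm: the trajectory schema (a list of `(doc_id, action)` steps), the single-segment swap, and the sentence-dropping perturbation are all assumptions made for illustration, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

# Assumed schema: a trajectory is an ordered list of (doc_id, action) steps,
# where action is "click" or "skip".
Step = Tuple[str, str]
Trajectory = List[Step]


def cross_trajectory_shuffle(
    target: Trajectory,
    donor: Trajectory,
    segment_len: int,
    rng: random.Random,
) -> Trajectory:
    """Return a copy of `target` in which one contiguous segment is replaced
    by a same-length segment taken from `donor` (illustrative step only)."""
    if segment_len <= 0 or segment_len > min(len(target), len(donor)):
        raise ValueError("segment_len must fit inside both trajectories")
    t_start = rng.randrange(len(target) - segment_len + 1)
    d_start = rng.randrange(len(donor) - segment_len + 1)
    augmented = list(target)  # copy so the input trajectory is not mutated
    augmented[t_start:t_start + segment_len] = donor[d_start:d_start + segment_len]
    return augmented


def perturb_summary(summary: str, drop_prob: float, rng: random.Random) -> str:
    """Toy content perturbation: drop each sentence with probability
    `drop_prob`, always keeping at least one sentence."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    kept = [s for s in sentences if rng.random() >= drop_prob]
    if not kept and sentences:
        kept = [rng.choice(sentences)]
    return ". ".join(kept) + ("." if kept else "")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
user_a = [("a1", "click"), ("a2", "skip"), ("a3", "click"), ("a4", "skip")]
user_b = [("b1", "click"), ("b2", "click"), ("b3", "skip")]
augmented = cross_trajectory_shuffle(user_a, user_b, segment_len=2, rng=rng)
```

Swapping a segment from another user's trajectory introduces topic transitions that the original trajectory did not contain, which is the intuition behind using shuffling to raise diversity.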
Authors
Parthiv Chatterjee
KDM Lab, Dhirubhai Ambani University, India
Shivam Sonawane
KDM Lab, Dhirubhai Ambani University, India
Amey Hengle
University of Maryland, College Park
Natural Language Processing, LLMs, Generative AI
Aditya Tanna
KDM Lab, Dhirubhai Ambani University, India
Sourish Dasgupta
Dhirubhai Ambani University
Natural Language Processing, Knowledge Graph Learning, Recommendation Systems
Tanmoy Chakraborty
LCS2 Lab, Indian Institute of Technology Delhi, India