Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation

📅 2026-04-20

🤖 AI Summary

This work addresses topic-controlled summarisation with small language models when real training data is limited. The authors propose a context-pairing data augmentation method that builds contrastive training samples by mixing contexts from different documents, which helps the model learn how a requested topic relates to its summary. The approach requires no additional real data: it uses the SciTLDR dataset with Wikipedia-derived topic annotations and is applied to compact models such as T5-base. Experiments show consistent improvements in win rate and semantic alignment as the augmentation scale increases while the amount of real training data stays fixed, so a smaller model trained on fewer real examples performs competitively with larger models.

📝 Abstract

Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to larger models, despite using significantly fewer parameters and substantially fewer real training examples.
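The abstract does not spell out the pairing procedure, so the following is only a minimal sketch of one way context pairing could work, not the authors' implementation. It assumes that each real example is a (context, topic, summary) triple, that a "pair" is formed by concatenating the anchor context with a context from a different document on a different topic, and that the training target stays the anchor's summary. The names `Example` and `pair_contexts` and the parameter `pairs_per_example` (standing in for the augmentation scale) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    context: str  # source document text
    topic: str    # topic label (e.g. a Wikipedia-derived topic)
    summary: str  # reference summary for that topic


def pair_contexts(examples, pairs_per_example, seed=0):
    """Build augmented examples by mixing each context with distractor contexts.

    Assumption: the distractor must come from another document and carry a
    different topic, so the topic is the only cue for which half to summarise.
    """
    rng = random.Random(seed)
    augmented = []
    for i, anchor in enumerate(examples):
        candidates = [
            other
            for j, other in enumerate(examples)
            if j != i and other.topic != anchor.topic
        ]
        # Never request more distractors than are available.
        k = min(pairs_per_example, len(candidates))
        for distractor in rng.sample(candidates, k):
            parts = [anchor.context, distractor.context]
            rng.shuffle(parts)  # avoid a positional shortcut to the answer
            mixed = "\n\n".join(parts)
            # The target is unchanged: the anchor's topic and summary.
            augmented.append(Example(mixed, anchor.topic, anchor.summary))
    return augmented
```

Under these assumptions the number of real examples stays fixed while `pairs_per_example` controls how many synthetic pairs are derived from each one, which mirrors the abstract's setup of scaling augmentation without adding real data.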

Problem

Research questions and friction points this paper is trying to address.

topic-controlled summarisation

data augmentation

small language models

educational summarisation

training data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

topic-controlled summarisation

pairwise data augmentation

small language models