Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of paired input-output data in low-resource natural language generation (NLG), this paper proposes PbT, a two-stage teacher-student framework. The teacher model compresses unpaired inputs and outputs separately into compact, shared intermediate representations; the student model then learns to reconstruct the original inputs from these representations, thereby synthesizing high-fidelity pseudo-paired data. Crucially, PbT bridges unpaired data via intermediate representations, eliminating reliance on costly human annotation and on direct large-model generation, which suffers from high computational expense and poor generalization. Evaluated on five benchmarks, an 8B student model trained solely on PbT-synthesized data achieves a ROUGE-L score that significantly surpasses training on data generated directly by a 70B model and approaches human-annotated performance, coming within 1.2 points and closing 82% of the oracle gap, while reducing annotation cost to one-third that of direct synthesis.

📝 Abstract
We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs (highlights, recaps, or questions) or only raw inputs (articles, dialogues, or paragraphs), but seldom both. This mismatch forces small models to learn from very few examples or to rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR) and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks: document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD), as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the domain mismatch limiting direct synthesis.
Problem

Research questions and friction points this paper is trying to address.

Generating input-output pairs without parallel data
Addressing data scarcity in low-resource text generation
Creating high-fidelity synthetic training data efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher LLM compresses unpaired data into intermediate representations
Student model reconstructs inputs from these representations
Generates high-quality synthetic pairs without parallel data
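The two-stage pairing loop above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the real method uses a teacher LLM for compression and a trained student model for reconstruction, whereas here both stages are simple stand-in functions (the names `teacher_compress`, `student_reconstruct`, and `synthesize_pairs` are hypothetical).

```python
# Sketch of the PbT two-stage pairing loop with toy stand-ins for the
# teacher LLM and the student model (assumptions, not the paper's code).

def teacher_compress(text: str, max_terms: int = 5) -> list[str]:
    """Stand-in for the teacher LLM: compress text into a compact
    intermediate representation (here, simply the longest words)."""
    words = sorted(set(text.lower().split()), key=len, reverse=True)
    return words[:max_terms]

def student_reconstruct(ir: list[str]) -> str:
    """Stand-in for the student model: generate a pseudo-input
    from the intermediate representation."""
    return " ".join(ir)

def synthesize_pairs(unpaired_outputs: list[str]) -> list[tuple[str, str]]:
    """Pair each raw output with a student-generated pseudo-input,
    yielding synthetic (input, output) training pairs."""
    pairs = []
    for output in unpaired_outputs:
        ir = teacher_compress(output)           # stage 1: compress to IR
        pseudo_input = student_reconstruct(ir)  # stage 2: reconstruct input
        pairs.append((pseudo_input, output))
    return pairs

pairs = synthesize_pairs(["The committee approved the new budget on Friday."])
```

The key design point the sketch preserves: the original output is never rewritten, only paired with a synthesized input, so fidelity on the output side is guaranteed by construction.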
Yen-Ju Lu
Center for Language and Speech Processing, Johns Hopkins University
Thomas Thebaud
Assistant Research Scientist, ECE Dept., Johns Hopkins University, Baltimore
Adversarial and Backdoor Attacks, Speech Emotion Recognition, Audio LLMs, Speaker Characterisation
Laureano Moro-Velazquez
Center for Language and Speech Processing, Johns Hopkins University
Najim Dehak
Associate Professor, ECE Department, Johns Hopkins University
Machine Learning, Speech Processing, Speaker Recognition, Language Recognition, Emotion Recognition
Jesus Villalba
Center for Language and Speech Processing, Johns Hopkins University