Relational Representation Distillation

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge distillation methods struggle to model the structured relationships among a teacher model's internal representations, while mainstream contrastive objectives (e.g., InfoNCE) impose overly strict instance-discrimination constraints, pushing apart even semantically close samples. To address these limitations, the paper proposes Relational Representation Distillation (RRD). Its core innovations are: (1) a dual-temperature softmax mechanism, in which a sharper (lower-temperature) student distribution emphasizes dominant relational patterns while the softer (higher-temperature) teacher distribution preserves secondary semantic similarities; and (2) a loss with theoretical connections to both InfoNCE and KL divergence, enabling alignment of relative similarity distributions. Evaluated on diverse knowledge-transfer benchmarks, RRD significantly improves teacher–student representation alignment. Notably, on several downstream tasks, student models trained with RRD even surpass their teachers, demonstrating both the effectiveness of structured relational modeling and its strong generalization.

📝 Abstract
Knowledge distillation involves transferring knowledge from large, cumbersome teacher models to more compact student models. The standard approach minimizes the Kullback-Leibler (KL) divergence between the probabilistic outputs of a teacher and student network. However, this approach fails to capture important structural relationships in the teacher's internal representations. Recent advances have turned to contrastive learning objectives, but these methods impose overly strict constraints through instance-discrimination, forcing apart semantically similar samples even when they should maintain similarity. This motivates an alternative objective by which we preserve relative relationships between instances. Our method employs separate temperature parameters for teacher and student distributions, with sharper student outputs, enabling precise learning of primary relationships while preserving secondary similarities. We show theoretical connections between our objective and both InfoNCE loss and KL divergence. Experiments demonstrate that our method significantly outperforms existing knowledge distillation methods across diverse knowledge transfer tasks, achieving better alignment with teacher models, and sometimes even outperforms the teacher network.
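The abstract describes matching relative relationships between instances using separate temperatures, with sharper student outputs. A minimal NumPy sketch of that idea, as we read it from the summary: each sample's similarity to the other samples in a batch is turned into a distribution (a low temperature for the student, a higher one for the teacher), and the student is penalized by the KL divergence from the teacher's relational distribution. The function name, the specific temperature values, and the use of in-batch samples as the relational anchors are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relational_kd_loss(student_feats, teacher_feats,
                       tau_student=0.05, tau_teacher=0.1):
    """KL divergence between teacher and student in-batch similarity
    distributions (hypothetical sketch of the dual-temperature objective).

    The sharper (lower) student temperature concentrates mass on each
    sample's dominant relations; the softer teacher distribution retains
    secondary similarities.
    """
    # L2-normalize so dot products are cosine similarities
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    sim_s = (s @ s.T) / tau_student
    sim_t = (t @ t.T) / tau_teacher
    # drop self-similarity: each row keeps the n-1 cross-sample entries
    n = s.shape[0]
    mask = ~np.eye(n, dtype=bool)
    p_s = _softmax(sim_s[mask].reshape(n, n - 1))
    p_t = _softmax(sim_t[mask].reshape(n, n - 1))
    # KL(teacher || student), averaged over anchor samples
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)))
```

With equal temperatures and identical features the two relational distributions coincide and the loss is zero; with the sharper student temperature, minimizing the loss pulls the student's strongest relations toward the teacher's without forcing exact instance-level matches.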
Problem

Research questions and friction points this paper is trying to address.

Capturing structural relationships in teacher models
Avoiding overly strict contrastive learning constraints
Preserving relative instance relationships effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preserves relative relationships between instances
Uses separate temperature parameters for distributions
Achieves better alignment with teacher models