SRA: Span Representation Alignment for Large Language Model Distillation

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses poor transfer performance in cross-tokenizer knowledge distillation (CTKD), which the authors attribute to fragile token-level alignment. They propose a semantic span-based alignment method that models knowledge transfer as a multi-particle dynamical system. Each semantic span is represented by an attention-weighted centroid that captures its state, and knowledge is distilled via logit matching between aligned spans, augmented with a geometric regularizer. The approach is presented as the first to use a physics-inspired multi-particle system for distillation, removing the reliance on brittle token-level correspondence. In cross-architecture experiments, the method is reported to significantly outperform current state-of-the-art CTKD baselines.

📝 Abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large teacher language model and a smaller student, even when they employ different tokenizers. Existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers; we argue that how tokens are aggregated into more robust representations before distillation is equally important. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. We use attention-derived weighting over these span centers of mass to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically grounded approach.
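
The abstract describes a span's Center of Mass as an attention-weighted average of its tokens but does not give the formula here. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function name `span_center_of_mass`, the use of a single non-negative weight per token, the normalization of weights within each span, and the half-open `(start, end)` span format are all assumptions made for illustration.

```python
import numpy as np

def span_center_of_mass(hidden, attn, spans):
    """Attention-weighted centroid of each span (illustrative sketch).

    hidden: (T, d) array of token hidden states.
    attn:   (T,) array of non-negative per-token weights (assumed form).
    spans:  list of half-open (start, end) token index ranges (assumed format).
    Returns an (S, d) array with one centroid per span.
    """
    centroids = []
    for start, end in spans:
        w = attn[start:end]
        # Normalize weights within the span so they sum to 1;
        # the small floor avoids division by zero for all-zero weights.
        w = w / max(w.sum(), 1e-12)
        centroids.append(w @ hidden[start:end])
    return np.stack(centroids)

# Toy example: three tokens in 2-D, grouped into two spans.
hidden = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
attn = np.array([0.25, 0.75, 1.0])
spans = [(0, 2), (2, 3)]
com = span_center_of_mass(hidden, attn, spans)
```

Because each centroid depends only on which tokens fall inside a span, the teacher and student can produce the same number of span vectors for the same text even when their tokenizers split it differently, which is the property the abstract relies on.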

Problem

Research questions and friction points this paper is trying to address.

Cross-Tokenizer Knowledge Distillation

Span Representation

Large Language Model Distillation

Tokenizer Discrepancy

Knowledge Transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Span Representation Alignment

Cross-Tokenizer Knowledge Distillation

Center of Mass