CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient reasoning capability transfer in large-model knowledge distillation under heterogeneous tokenizers, this paper proposes a general distillation framework integrating Chain-of-Thought (CoT) enhancement and cross-chain alignment. The framework introduces a novel cross-chain CoT alignment mechanism, extending optimal transport to both sequence-level and layer-level matching while preserving variable-length input handling and contextual integrity. It jointly models CoT-enhanced reasoning, heterogeneous vocabulary mapping, and sequence-layer coupled distillation. Under diverse vocabulary configurations, the proposed method consistently outperforms baselines, including ULD and DSKD, on reasoning tasks and domain robustness. Empirical results demonstrate substantial improvements across multiple benchmarks, establishing a new paradigm for efficient knowledge transfer in tokenizer-agnostic distillation scenarios.

📝 Abstract
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their flexibility. While approaches like Universal Logit Distillation (ULD) and Dual-Space Knowledge Distillation (DSKD) address vocabulary mismatches, they overlook the critical reasoning-aware distillation aspect. To bridge this gap, we propose CoT2Align, a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer. Additionally, we extend Optimal Transport beyond token-wise alignment to a sequence-level and layer-wise alignment approach that adapts to varying sequence lengths while preserving contextual integrity. Comprehensive experiments demonstrate that CoT2Align outperforms existing KD methods across different vocabulary settings, improving reasoning capabilities and robustness in domain-specific tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning transfer in distillation
Address vocabulary and tokenizer mismatches
Improve computational and memory efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Chain of Thought Distillation
Optimal Transport Alignment
Sequence-level Layer-wise Alignment
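To make the Optimal Transport alignment idea above concrete, here is a minimal sketch of how a sequence-level OT alignment loss between teacher and student hidden states could look. This is an illustration, not the paper's implementation: it assumes both sequences have already been projected to a common hidden dimension, uses uniform marginals, a cosine-distance cost, and entropic regularization solved with standard Sinkhorn iterations. All function names (`sinkhorn`, `ot_alignment_loss`) and hyperparameters are hypothetical.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.1, n_iters=100):
    """Entropically regularized OT plan between two uniform marginals.

    cost: (n, m) cost matrix between teacher and student positions.
    Returns a transport plan of shape (n, m) whose entries sum to 1.
    """
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-cost / epsilon)             # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                     # match row marginals
        v = b / (K.T @ u)                   # match column marginals
    return (u[:, None] * K) * v[None, :]

def ot_alignment_loss(teacher_h, student_h, epsilon=0.1):
    """Sequence-level OT distance between hidden-state sequences.

    teacher_h: (n_teacher, d), student_h: (n_student, d); the two
    sequence lengths may differ, which is exactly what OT handles.
    """
    t = teacher_h / np.linalg.norm(teacher_h, axis=1, keepdims=True)
    s = student_h / np.linalg.norm(student_h, axis=1, keepdims=True)
    cost = 1.0 - t @ s.T                    # cosine distance, (n, m)
    plan = sinkhorn(cost, epsilon)
    return float((plan * cost).sum())       # total transport cost
```

A layer-wise variant would apply the same loss at several matched teacher/student layers and average the results; because the marginals are uniform over positions, no token-to-token vocabulary mapping is needed, which is what makes this style of alignment tokenizer-agnostic.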