🤖 AI Summary
Explicit chain-of-thought (CoT) reasoning suffers from verbosity and inefficiency, making it difficult to compress without sacrificing performance or interpretability.
Method: We propose the first self-supervised framework that distills natural-language CoT into a continuous latent space, enabling joint modeling of explicit and implicit CoT with latent-state alignment. Our approach employs a shared-weight architecture incorporating three key components: (i) a latent-space alignment loss, (ii) continuous representation learning, and (iii) a differentiable thought decoding mechanism.
Contribution/Results: On GSM8k, our implicit CoT achieves performance on par with explicit CoT—surpassing prior state-of-the-art by 28.2% in accuracy—while attaining a 3.1× inference compression ratio. Crucially, it supports continuous thought decoding, yielding strong interpretability, cross-dataset generalization, and robust transferability. This work establishes a novel paradigm for efficient and interpretable neural reasoning.
📝 Abstract
Chain-of-Thought (CoT) enhances Large Language Models (LLMs) by enabling step-by-step reasoning in natural language. However, the language space may be suboptimal for reasoning. While implicit CoT methods attempt to enable reasoning without explicit CoT tokens, they have consistently lagged behind explicit CoT methods in task performance. We propose CODI (Continuous Chain-of-Thought via Self-Distillation), a novel framework that distills CoT into a continuous space, where a shared model acts as both teacher and student, jointly learning explicit and implicit CoT while aligning their hidden activations on the token generating the final answer. CODI is the first implicit CoT method to match explicit CoT's performance on GSM8k while achieving 3.1x compression, surpassing the previous state-of-the-art by 28.2% in accuracy. Furthermore, CODI demonstrates scalability, robustness, and generalizability to more complex CoT datasets. Additionally, CODI retains interpretability by decoding its continuous thoughts, making its reasoning process transparent. Our findings establish implicit CoT as not only a more efficient but also a more powerful alternative to explicit CoT.
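The self-distillation objective sketched in the abstract — aligning the teacher's (explicit CoT) and student's (implicit CoT) hidden activations on the token that generates the final answer — can be illustrated with a minimal sketch. The function name, the per-token hidden-state lists, and the choice of a mean absolute (L1) distance are assumptions for illustration; the paper's actual distance measure and training setup may differ.

```python
def latent_alignment_loss(teacher_hidden, student_hidden, answer_pos):
    """Illustrative latent-space alignment loss (not the paper's exact loss).

    teacher_hidden / student_hidden: lists of per-token hidden-state vectors
        from the shared-weight model run in explicit- and implicit-CoT mode.
    answer_pos: index of the token that produces the final answer, where the
        two reasoning modes are aligned.
    Returns the mean absolute difference between the two hidden states.
    """
    t = teacher_hidden[answer_pos]
    s = student_hidden[answer_pos]
    return sum(abs(ti - si) for ti, si in zip(t, s)) / len(t)

# Toy usage: two-token sequences with 2-dimensional hidden states.
teacher = [[1.0, 2.0], [0.5, -0.5]]
student = [[1.0, 2.0], [0.0, 0.0]]
loss = latent_alignment_loss(teacher, student, answer_pos=1)
print(loss)  # 0.5
```

Because teacher and student share weights, gradients from this alignment term update a single model, which is what lets explicit and implicit CoT be learned jointly.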