AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses token-level credit assignment when aligning large language models for complex reasoning, where existing self-distillation approaches often suffer from over-conditioned teacher distributions, answer leakage, and late-stage training collapse. The authors propose a reflection bottleneck that compresses diagnostic signals (verifier outcomes, peer rollouts, or reference feedback) into concise, self-generated Socratic hints and critiques. Causal Information Gain (CIG), passed through an asymmetric ReLU-gated threshold, then turns these reflections into sparse, precise token-level advantage modulations. Because the self-teacher is not conditioned directly on raw reference solutions, training stays stable over long horizons. The authors report that the approach significantly outperforms existing baselines on scientific, mathematical, and tool-use benchmarks while preventing late-stage collapse.
📝 Abstract
The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.
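The abstract does not give the CIG formula, the gate, or the annealing schedule, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the per-token gain is the log-probability difference between a reflection-conditioned self-teacher and the student, that the gate is ReLU(gain - tau), that the gated term is added to the GRPO-style sequence advantage, and that the modulation weight decays linearly. The names `anneal_weight`, `modulated_token_advantages`, `tau`, and `weight` are hypothetical.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized sequence-level advantages, in the style of GRPO.

    Every token of a rollout would normally share its rollout's value.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def anneal_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    """Assumed linear schedule: w0 at step 0, decaying to 0 at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w0 * (1.0 - frac)


def modulated_token_advantages(
    seq_adv: float,
    teacher_logp: List[float],
    student_logp: List[float],
    tau: float,
    weight: float,
) -> List[float]:
    """Add a sparse, one-sided token-level bonus to the sequence advantage.

    Assumption: gain_t = teacher_logp[t] - student_logp[t], where the teacher
    is the same model conditioned on a self-generated reflection.
    """
    out = []
    for lt, ls in zip(teacher_logp, student_logp):
        gain = lt - ls                  # assumed per-token information gain
        gate = max(gain - tau, 0.0)     # asymmetric: gains below tau are zeroed
        out.append(seq_adv + weight * gate)  # base advantage is always kept
    return out


if __name__ == "__main__":
    advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
    w = anneal_weight(step=0, total_steps=100)
    print(modulated_token_advantages(
        advs[0], [-0.1, -2.0, -0.5], [-1.5, -1.0, -0.6], tau=0.5, weight=w
    ))
```

In this sketch only the first token clears the threshold, so it alone receives a bonus; the other tokens fall back to the unmodified sequence-level advantage, and as `weight` anneals to zero the update reduces to the uniform GRPO-style signal.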
Problem

Research questions and friction points this paper is trying to address.

credit assignment
self-distillation
token-level reward
training collapse
LLM alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Meta-Reflective Self-Distillation
Token-Level Credit Assignment
Causal Information Gain
Self-Distillation
Reinforcement Learning with Verifiable Rewards