AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses token-level credit assignment when aligning large language models for complex reasoning, where existing self-distillation approaches often suffer from over-conditioned teacher distributions, answer leakage, and late-stage training collapse. The authors propose AMR-SD, which adds a reflection bottleneck that compresses diagnostic signals (verifier outcomes, peer rollouts, or reference feedback) into self-generated Socratic hints and critiques. Causal Information Gain (CIG), passed through an asymmetric ReLU-gated threshold, then turns these reflections into sparse, precise token-level advantage modulation. Because the self-teacher is not conditioned directly on raw reference solutions, the method is reported to remain stable over long training horizons. The authors report that AMR-SD significantly outperforms existing baselines on scientific, mathematical, and tool-use benchmarks while avoiding late-stage collapse.

📝 Abstract

The alignment of Large Language Models (LLMs) for complex reasoning relies heavily on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals (from verifier outcomes, peer rollouts, or reference feedback) into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and preventing late-stage collapse.
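The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that (a) the sequence-level advantage is the usual group-normalized reward used in GRPO-style training, (b) a token's information gain can be approximated by the difference between its log-probability under a reflection-conditioned self-teacher and under the unconditioned policy, (c) the gate is `max(0, gain - tau)`, and (d) annealing is a linear decay of the modulation weight. The function names and the parameters `tau`, `lam`, `step`, and `total_steps` are hypothetical.

```python
from math import sqrt


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized sequence-level advantages (GRPO-style baseline).

    Every token in a rollout would normally share its rollout's value.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def gated_token_advantages(seq_adv, logp_reflect, logp_base,
                           tau=0.5, lam=1.0, step=0, total_steps=1000):
    """Hypothetical ReLU-gated, annealed token-level modulation.

    ASSUMPTION: gain_t = logp_reflect[t] - logp_base[t] stands in for
    the paper's Causal Information Gain; the real definition may differ.
    """
    # Linear annealing: the modulation weight decays to zero over training.
    anneal = lam * max(0.0, 1.0 - step / total_steps)
    out = []
    for lr, lb in zip(logp_reflect, logp_base):
        gain = lr - lb
        # Asymmetric gate: only gains above the threshold pass through;
        # everything else is zeroed, so the modulation stays sparse.
        gate = max(0.0, gain - tau)
        # Ungated tokens keep the base sequence-level advantage unchanged.
        out.append(seq_adv * (1.0 + anneal * gate))
    return out


if __name__ == "__main__":
    # Four rollouts in one group, two of them verified correct.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    # Three tokens of one rollout; only the first clears the threshold.
    print(gated_token_advantages(1.0, [-0.2, -1.0, -0.1], [-1.5, -1.1, -0.3]))
```

In this sketch the multiplicative form keeps the sign and magnitude of the environment-derived advantage for any token whose gain stays below `tau`, which is one simple way to read "preserves the base environmental reward"; an additive form would be an equally plausible choice.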

Problem

Research questions and friction points this paper is trying to address.

credit assignment

self-distillation

token-level reward

training collapse

LLM alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Meta-Reflective Self-Distillation

Token-Level Credit Assignment

Causal Information Gain