Embarrassingly Simple Self-Distillation Improves Code Generation

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Simple Self-Distillation (SSD), a method that enhances the code generation capabilities of large language models without external supervision, leveraging only code samples generated by the model itself. To address the tension between decoding precision and exploration, SSD employs temperature-controlled and truncated sampling to produce diverse candidate outputs, suppressing distracting tail tokens at positions where precision matters while preserving beneficial diversity in a context-aware manner, thereby reshaping the output distribution. When combined with standard supervised fine-tuning, SSD significantly improves performance on LiveCodeBench v6, raising the pass@1 score of Qwen3-30B-Instruct from 42.4% to 55.3%. The approach is consistently effective across multiple model scales in both the Qwen and Llama families.
📝 Abstract
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
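The sampling step the abstract describes can be sketched as temperature scaling followed by nucleus (top-p) truncation of the token distribution. The sketch below is illustrative only, assuming plain top-p truncation over raw logits; the function name and parameters are hypothetical, and the paper's context-aware tail suppression is not reproduced here.

```python
import numpy as np

def sample_with_truncation(logits, temperature=0.8, top_p=0.9, rng=None):
    """Hypothetical sketch: temperature scaling flattens or sharpens the
    distribution; top-p truncation drops the low-probability tail (the
    'distractor' tokens) before renormalizing and sampling."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())      # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    keep = cumulative <= top_p                 # nucleus: smallest set covering top_p mass
    keep[0] = True                             # always keep the most probable token
    kept = order[keep]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```

Solutions sampled this way from the model itself would then be used as targets for ordinary supervised fine-tuning, which is what makes the recipe "embarrassingly simple."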
Problem

Research questions and friction points this paper is trying to address.

code generation
large language models
self-distillation
LLM decoding
supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-distillation
code generation
large language models
precision-exploration tradeoff
supervised fine-tuning