One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Balancing model performance and inference latency remains challenging in code retrieval. Method: This paper proposes MODULARSTARENCODER, a modular, multi-exit encoder with 1B parameters, trained with hierarchical self-distillation in which higher-layer outputs directly supervise early-layer representations to improve the quality of early-layer embeddings. It further introduces a repository-level context-aware loss and a new cross-lingual benchmark built via code translation that covers both text-to-code and code-to-code retrieval. Leveraging the modular architecture and multi-task joint training, the model supports flexible early-exit inference at no extra training cost. Contribution/Results: Experiments show notable Recall@10 improvements on both text-to-code and code-to-code retrieval, yielding a better accuracy-latency trade-off.
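
As a rough illustration of the multi-exit self-distillation described above, the PyTorch sketch below pools embeddings at several exit layers, applies the retrieval task loss at every exit, and adds a distillation term that pulls earlier exits toward the top layer. The exit-layer indices, mean pooling, and cosine-based distillation term are assumptions for illustration; the paper's exact loss formulation may differ.

```python
# Minimal sketch of hierarchical self-distillation over multi-exit layers.
# EXIT_LAYERS, mean pooling, and the cosine distillation term are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

EXIT_LAYERS = [4, 9, 18, 27, 36]  # hypothetical exit heads; the last is the top layer

def pool(hidden_state, attention_mask):
    """Mean-pool token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

def self_distillation_loss(all_hidden_states, attention_mask, task_loss_fn):
    """Top-layer embeddings supervise earlier exits in addition to the task loss."""
    embeddings = [pool(all_hidden_states[i], attention_mask) for i in EXIT_LAYERS]
    teacher = embeddings[-1].detach()  # highest exit acts as the teacher
    total = torch.zeros((), device=teacher.device)
    for emb in embeddings:                       # task loss at every exit head
        total = total + task_loss_fn(emb)        # e.g. a contrastive retrieval loss
    for emb in embeddings[:-1]:                  # distill teacher into earlier exits
        total = total + (1.0 - F.cosine_similarity(emb, teacher, dim=-1)).mean()
    return total
```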

📝 Abstract
Deploying language models often requires handling model size vs. performance trade-offs to satisfy downstream latency constraints while preserving the model's usefulness. Model distillation is commonly employed to reduce model size while maintaining acceptable performance. However, distillation can be inefficient since it involves multiple training steps. In this work, we introduce MODULARSTARENCODER, a modular multi-exit encoder with 1B parameters, useful for multiple tasks within the scope of code retrieval. MODULARSTARENCODER is trained with a novel self-distillation mechanism that significantly improves lower-layer representations, allowing different portions of the model to be used while still maintaining a good trade-off in terms of performance. Our architecture focuses on enhancing text-to-code and code-to-code search by systematically capturing syntactic and semantic structures across multiple levels of representation. Specific encoder layers are targeted as exit heads, allowing higher layers to guide earlier layers during training. This self-distillation effect improves intermediate representations, increasing retrieval recall at no extra training cost. In addition to the multi-exit scheme, our approach integrates a repository-level contextual loss that maximally utilizes the training context window, further enhancing the learned representations. We also release a new dataset constructed via code translation, seamlessly expanding traditional text-to-code benchmarks with code-to-code pairs across diverse programming languages. Experimental results highlight the benefits of self-distillation through multi-exit supervision.
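
The early-exit trade-off mentioned in the abstract can be sketched as follows: at inference time, embeddings are read from an intermediate hidden state rather than the final layer, trading some recall for lower latency. The checkpoint name and exit index below are hypothetical placeholders, not the released model's identifiers.

```python
# Sketch of early-exit embedding extraction: run the encoder with hidden states
# exposed and keep only the representation at a chosen exit layer.
# MODEL_NAME and EXIT_LAYER are hypothetical placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "modular-star-encoder"  # hypothetical checkpoint identifier
EXIT_LAYER = 18                      # earlier exit => lower latency, somewhat lower recall

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def embed(code_snippet: str) -> torch.Tensor:
    inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.hidden_states[EXIT_LAYER]            # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)            # mean-pooled embedding
```
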
Problem

Research questions and friction points this paper is trying to address.

Balancing model size and performance against downstream latency constraints in text-to-code and code-to-code search
Early-layer representations are typically too weak for truncated models to remain useful
Conventional distillation reduces model size but is inefficient, requiring multiple separate training steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

MODULARSTARENCODER: a modular, multi-exit encoder with 1B parameters
Hierarchical self-distillation, in which higher exit heads supervise earlier layers, improves early-layer embeddings at no extra training cost
Repository-level contextual loss maximizes use of the training context window (see the packing sketch after this list)
A new dataset built via code translation extends text-to-code benchmarks with cross-lingual code-to-code pairs
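
A minimal sketch of the repository-level context idea referenced above: files from the same repository are packed into one training context so the contextual loss can exploit cross-file signal. The window size, separator token, and greedy packing strategy are assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of repository-level context packing: concatenate files from
# one repository until the context window is full, so training examples span
# cross-file context. MAX_TOKENS and SEPARATOR are assumed values.
from typing import Callable, List

MAX_TOKENS = 2048          # assumed context window size
SEPARATOR = "<file_sep>"   # hypothetical file-separator token

def pack_repository(files: List[str], count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily group a repository's files into context-window-sized chunks."""
    chunks, current, current_len = [], [], 0
    for source in files:
        length = count_tokens(source) + 1  # +1 for the separator token
        if current and current_len + length > MAX_TOKENS:
            chunks.append(SEPARATOR.join(current))
            current, current_len = [], 0
        current.append(source)
        current_len += length
    if current:
        chunks.append(SEPARATOR.join(current))
    return chunks
```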