Large Language Models Explore by Latent Distilling

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty large language models have in achieving semantic diversity and coherence at the same time during inference, where conventional stochastic sampling yields limited gains. The authors propose Exploratory Sampling (ESamp), a decoding strategy that, for the first time, uses the error in predicting deep-layer representations from shallow-layer ones as a signal of semantic novelty, dynamically reweighting candidate tokens to promote diversity. A lightweight distillation module models the inter-layer representation shifts, and an asynchronous training-inference pipeline keeps deployment overhead low (under 5% latency overhead in the worst case). ESamp substantially improves Pass@k efficiency, outperforming or matching strong baselines across mathematical reasoning, scientific question answering, code generation, and creative writing, reconciling the diversity–coherence trade-off and generalizing well across tasks.
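The summary reports gains in Pass@k without saying how the metric is computed. As a point of reference only, the sketch below implements the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a decoding method that spreads its samples over more distinct solution attempts tends to raise c for hard problems, which is one way to read the "Pass@k efficiency" claim.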
📝 Abstract
Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations to model the LLM's depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training-inference pipeline, with less than 5% worst-case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code is released at: https://github.com/LinesHogan/tLLM.
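The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear `LinearDistiller`, the standardized additive bonus in `reweight`, the `beta` strength, and the learning rate are all assumptions made for illustration, and the shallow and deep hidden states are plain NumPy arrays rather than activations read from a real LLM.

```python
import numpy as np


class LinearDistiller:
    """Hypothetical stand-in for the lightweight Distiller.

    Learns a linear map from shallow-layer to deep-layer hidden states.
    """

    def __init__(self, dim: int, lr: float = 0.1):
        self.W = np.zeros((dim, dim))
        self.lr = lr

    def error(self, shallow: np.ndarray, deep: np.ndarray) -> np.ndarray:
        # shallow, deep: (n_candidates, dim). Returns one mean squared
        # prediction error per candidate, used here as the novelty signal.
        pred = shallow @ self.W.T
        return np.mean((pred - deep) ** 2, axis=-1)

    def update(self, shallow: np.ndarray, deep: np.ndarray) -> None:
        # One SGD step on 0.5 * mean squared error for a single (dim,) pair,
        # so representation transitions seen before become less "novel".
        pred = self.W @ shallow
        grad = np.outer(pred - deep, shallow) / shallow.size
        self.W -= self.lr * grad


def reweight(logits: np.ndarray, novelty: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Add a standardized novelty bonus to candidate logits, then softmax.

    The additive form and the standardization are illustrative assumptions.
    """
    z = (novelty - novelty.mean()) / (novelty.std() + 1e-8)
    scores = logits + beta * z
    scores = scores - scores.max()  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_candidates = 8, 5
    distiller = LinearDistiller(dim)
    shallow = rng.normal(size=(n_candidates, dim))  # placeholder activations
    deep = rng.normal(size=(n_candidates, dim))     # placeholder activations
    logits = rng.normal(size=n_candidates)

    probs = reweight(logits, distiller.error(shallow, deep))
    choice = rng.choice(n_candidates, p=probs)
    distiller.update(shallow[choice], deep[choice])  # adapt to what was sampled
    print(choice, probs.round(3))
```

In this sketch, each decoding step scores candidates by how poorly the Distiller predicts their deep-layer state, samples from the reweighted distribution, and then trains on the chosen pair so that repeating the same pattern earns a smaller bonus next time. The abstract's asynchronous training-inference pipeline is not modeled here; the update runs inline for simplicity.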
Problem

Research questions and friction points this paper is trying to address.

semantic diversity
large language models
test-time scaling
response generation
exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploratory Sampling
semantic diversity
latent distillation
test-time adaptation
decoding strategy
Yuanhao Zeng
School of Information Science and Technology, ShanghaiTech University, Shanghai, China; State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Ao Lu
School of Information Science and Technology, ShanghaiTech University, Shanghai, China
Lufei Li
School of Information Science and Technology, ShanghaiTech University, Shanghai, China
Zheng Zhang
School of Information Science and Technology, ShanghaiTech University, Shanghai, China; State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
Yexin Li
State Key Laboratory of General Artificial Intelligence, BIGAI
reinforcement learning, multi-agent system, multi-armed bandits, data mining
Kan Ren
Assistant Professor, ShanghaiTech University
Machine Learning, Data Mining, Large Language Model, Foundation Model