Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (including multimodal variants) suffer from inefficient exploration and redundant reasoning paths in reasoning and reinforcement learning (RL): random sampling fails to induce high-level strategic diversity. To address this, we propose an "implicit reasoning palette" mechanism based on variational autoencoders (VAEs): it maps question-answer pairs into semantically grounded latent variables, which are then decoded into learnable prefixes, enabling explicit, interpretable, and on-demand control over reasoning style and structure. Our method integrates latent-variable conditioning, learnable token injection, supervised fine-tuning (SFT) warm-starting, and subsequent RL optimization. Evaluated across multiple reasoning benchmarks, it significantly outperforms standard RL baselines, enhancing exploration efficiency, enabling continual learning, and supporting visualizable intervention in reasoning trajectories as well as controllable generation. To our knowledge, this is the first work to achieve explicit, latent-space-level control over reasoning strategies.

📝 Abstract
Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
Problem

Research questions and friction points this paper is trying to address.

Enhances reasoning diversity in language models via latent variables.
Controls model exploration for improved reinforcement learning efficiency.
Modulates internal planning to shape response style and structure.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a VAE to infer latent context from question-answer embeddings.
Prepends the decoded latent as token prefixes to modulate reasoning.
Enables structured exploration in RL via diverse reasoning modes.
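The mechanism sketched in the bullets above (mean-pool the QA embedding, infer a latent via the VAE's reparameterization trick, decode it into prefix embeddings, and prepend those to the prompt) can be illustrated with a minimal numpy toy. All dimensions, weight matrices, and function names here are hypothetical stand-ins; the paper's actual encoder/decoder are learned networks attached to the (V)LM.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_LAT, N_PREFIX = 16, 4, 3  # toy sizes (hypothetical)

# Random linear maps standing in for the trained VAE encoder/decoder.
W_mu = rng.normal(size=(D_EMB, D_LAT)) * 0.1
W_logvar = rng.normal(size=(D_EMB, D_LAT)) * 0.1
W_dec = rng.normal(size=(D_LAT, N_PREFIX * D_EMB)) * 0.1

def encode(qa_token_embs):
    """Mean-pool QA token embeddings, then map to latent mean / log-variance."""
    pooled = qa_token_embs.mean(axis=0)
    return pooled @ W_mu, pooled @ W_logvar

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode_to_prefix(z):
    """Decode a sampled latent into N_PREFIX prefix embeddings."""
    return (z @ W_dec).reshape(N_PREFIX, D_EMB)

# Prepend the decoded prefix to the prompt's token embeddings.
qa = rng.normal(size=(10, D_EMB))     # stand-in question-answer embeddings
prompt = rng.normal(size=(5, D_EMB))  # stand-in prompt embeddings
mu, logvar = encode(qa)
prefix = decode_to_prefix(sample_latent(mu, logvar))
conditioned = np.concatenate([prefix, prompt], axis=0)
print(conditioned.shape)  # (N_PREFIX + 5, D_EMB)
```

Resampling `z` for the same question yields a different prefix, which is the "internal sampling over reasoning strategies" the abstract describes: diversity is injected before token generation rather than through output-level temperature sampling.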
Authors
Rujiao Long — Tsinghua University, Alibaba
Yang Li — Alibaba Group, Shanghai Jiao Tong University
Xingyao Zhang — Microsoft
Weixun Wang — Alibaba Group
Tianqianjin Lin — Alibaba Group, Zhejiang University
Xi Zhao — Alibaba Group
Yuchi Xu — Alibaba Group
Wenbo Su — Alibaba Group
Junchi Yan — Shanghai Jiao Tong University
Bo Zheng — Alibaba Group