HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of reinforcement learning with verifiable rewards (RLVR) in low-resource settings, where entropy collapse severely restricts exploration and degrades reasoning. The authors propose the HEAL framework, which introduces a trajectory-level Entropy Dynamics Alignment (EDA) reward that matches both the magnitude and the fine-grained variation of entropy between the target and general domains. Combined with a selective hybrid-domain data strategy, HEAL transfers diverse exploratory behaviors from the general domain to the target domain. With only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained on 1,000 target-domain samples, and it improves few-shot RLVR performance consistently across multiple domains.

📝 Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also encourages the policy to acquire more diverse exploration behaviors from the general domain. Experiments across multiple domains show that HEAL consistently improves few-shot RLVR performance. Notably, using only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained with 1K target-domain samples.
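
The abstract describes EDA only at a high level and does not give its formula. The following is a minimal sketch of what a trajectory-level entropy alignment score could look like, not the authors' implementation. It assumes that each trajectory is summarized by its per-token policy entropy curve, that curves are averaged into a fixed number of bins so trajectories of different lengths can be compared, and that "magnitude" and "fine-grained variation" are approximated by the mean entropy and by bin-to-bin differences. The names `token_entropy`, `resample`, `alignment_reward`, `n_bins`, and `weight` are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def resample(curve, n_bins):
    """Average a per-token entropy curve into n_bins segments.

    This makes trajectories of different lengths comparable.
    """
    length = len(curve)
    if length == 0:
        raise ValueError("entropy curve must not be empty")
    binned = []
    for i in range(n_bins):
        start = min(i * length // n_bins, length - 1)
        stop = max(start + 1, (i + 1) * length // n_bins)
        segment = curve[start:stop]
        binned.append(sum(segment) / len(segment))
    return binned


def alignment_reward(target_curve, reference_curve, n_bins=8, weight=0.5):
    """Hypothetical alignment score; 0 is a perfect match, lower is worse.

    Assumption: magnitude is compared via mean entropy, and variation is
    compared via differences between consecutive bins.
    """
    t = resample(target_curve, n_bins)
    r = resample(reference_curve, n_bins)

    # Magnitude term: gap between the average entropy levels.
    magnitude_gap = abs(sum(t) / n_bins - sum(r) / n_bins)

    # Dynamics term: gap between the step-to-step entropy changes.
    t_delta = [b - a for a, b in zip(t, t[1:])]
    r_delta = [b - a for a, b in zip(r, r[1:])]
    if t_delta:
        dynamics_gap = sum(abs(a - b) for a, b in zip(t_delta, r_delta)) / len(t_delta)
    else:
        dynamics_gap = 0.0

    return -(weight * magnitude_gap + (1.0 - weight) * dynamics_gap)


if __name__ == "__main__":
    # Illustrative entropy curves (made-up numbers, not data from the paper).
    general_domain = [1.2, 1.0, 1.1, 0.8]
    collapsed_target = [0.05, 0.05, 0.05, 0.05]
    print(alignment_reward(collapsed_target, general_domain, n_bins=4))
```

Under these assumptions, a target-domain trajectory whose entropy has collapsed to near zero scores lower than one whose entropy curve tracks the general-domain reference, so adding such a term to the verifiable reward would push the policy away from collapse. How HEAL actually builds the reference, weights the terms, and combines them with the task reward is not specified in the abstract.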

Problem

Research questions and friction points this paper is trying to address.

entropy collapse

few-shot RLVR

exploration

low-resource reinforcement learning

reasoning-oriented LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-Shot RLVR

Entropy Collapse

Entropy Dynamics Alignment