AI Summary
This work addresses the challenge of enabling small language models to perform complex reasoning efficiently under limited computational resources, a task hindered by conventional exploration strategies that incur high computational costs and overlook semantic diversity. The authors propose SD-E², a framework that, for the first time, incorporates semantic diversity as an exploration reward signal in reinforcement learning. By leveraging a frozen sentence-embedding model, SD-E² computes semantic coverage and average pairwise dissimilarity without per-token overhead, and combines them with correctness and efficiency in a z-score-normalized multi-objective reward. This approach adapts the structure of the reasoning process itself, enabling efficient exploration. Experiments show that SD-E² outperforms Qwen2.5-3B-Instruct by 27.4 percentage points on GSM8K, reaches 49.64% accuracy on MedMCQA, and attains an AIME score of 13.28%, while generating an average of 9.8 semantically distinct reasoning strategies per question.
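The two diversity signals mentioned above, semantic coverage and average pairwise dissimilarity over trajectory embeddings, can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the greedy threshold-based clustering and the threshold `tau` are assumptions introduced here, and embeddings are plain Python vectors standing in for sentence-embedding outputs.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def avg_pairwise_dissimilarity(embs):
    """Mean (1 - cosine similarity) over all pairs of trajectory embeddings."""
    n = len(embs)
    if n < 2:
        return 0.0
    total = sum(1 - cosine_sim(embs[i], embs[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

def strategy_coverage(embs, tau=0.8):
    """Greedy count of semantically distinct strategies: a trajectory starts
    a new cluster only if its similarity to every existing cluster
    representative falls below tau. (Hypothetical stand-in for the paper's
    coverage measure.)"""
    reps = []
    for e in embs:
        if all(cosine_sim(e, r) < tau for r in reps):
            reps.append(e)
    return len(reps)
```

Because the embedding model is frozen, both quantities are computed once per sampled trajectory, so there is no per-token overhead during generation.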
Abstract
Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective reward that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve MedMCQA to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation, which adjusts the structure of the reasoning process rather than per-token computation, SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
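The z-score-normalized combination of correctness, efficiency, and diversity rewards can be sketched as follows. This is a minimal illustration under stated assumptions: the per-trajectory scalar scores and the weights `w` are hypothetical, and the paper's exact normalization and weighting scheme may differ.

```python
import math

def zscore(xs):
    """Standardize a batch of scalar rewards to zero mean, unit variance.
    Falls back to a unit denominator when the batch has zero variance."""
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - mu) / sd for x in xs]

def combined_rewards(correctness, efficiency, diversity, w=(1.0, 0.5, 0.5)):
    """Z-score-normalize each objective across the sampled batch, then take
    a weighted sum per trajectory. Weights are illustrative assumptions."""
    zc, ze, zd = zscore(correctness), zscore(efficiency), zscore(diversity)
    return [w[0] * c + w[1] * e + w[2] * d
            for c, e, d in zip(zc, ze, zd)]
```

Normalizing each objective over the batch before summing keeps any one reward scale (e.g., binary correctness versus a continuous diversity score) from dominating the gradient signal, which is one plausible reading of why the normalization stabilizes training.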