🤖 AI Summary
Standard reinforcement learning often struggles in scientific discovery because it penalizes high-variance mutations, which pushes the policy toward familiar patterns and hinders exploration of novel solutions. This work proposes UG-TTT, which maintains a small ensemble of low-rank adapters on top of a frozen base model. Prediction disagreement among the adapters, quantified as the mutual information between ensemble predictions and weight hypotheses, serves as a measure of epistemic uncertainty and is used as an exploration bonus that directs the policy toward regions with low training coverage. A nuclear norm regularizer keeps the adapters distinct from one another, so the signal continues to separate genuinely unexplored regions from intrinsically difficult ones. Across four scientific discovery benchmarks, the method increases the maximum reward on three tasks and maintains substantially higher solution diversity; an ablation study confirms that the regularizer is essential for sustaining this behavior.
📝 Abstract
Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. Such a signal has to measure disagreement across independently adapted weight hypotheses rather than rely on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty: it identifies positions where the adapters diverge because of insufficient training coverage rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, which is the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks and maintains substantially higher solution diversity, and an ablation study confirms that the regularizer is essential for sustaining this behavior.
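The per-token disagreement described above can be illustrated with the standard ensemble decomposition of mutual information: the entropy of the averaged prediction minus the average entropy of the individual predictions. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`epistemic_disagreement`, `shaped_advantage`, `nuclear_norm`), the additive form of the bonus, the weight `beta`, and the choice to take the nuclear norm of a matrix of stacked, flattened adapter updates are all hypothetical, since the abstract does not specify these details.

```python
import numpy as np


def epistemic_disagreement(probs, eps=1e-12):
    """Per-token mutual information between predictions and ensemble members.

    probs: array of shape (K, T, V) holding, for each of K adapters,
    a next-token distribution over V tokens at each of T positions.
    Returns an array of shape (T,) in nats.
    """
    mean_probs = probs.mean(axis=0)  # (T, V) ensemble-averaged prediction
    # Entropy of the averaged prediction (total uncertainty).
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    # Average entropy of each adapter's own prediction (aleatoric part).
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    # The difference is the disagreement (epistemic part).
    return total - aleatoric


def shaped_advantage(advantages, disagreement, beta=0.1):
    """Assumed additive exploration bonus; beta is a hypothetical weight."""
    return advantages + beta * disagreement


def nuclear_norm(stacked_updates):
    """Sum of singular values of a (K, D) matrix of flattened adapter updates.

    Assumption: how the paper builds this matrix and which sign the
    regularizer takes are not stated in the abstract.
    """
    return np.linalg.norm(stacked_updates, ord="nuc")


if __name__ == "__main__":
    # Two adapters, two token positions, vocabulary of size two.
    probs = np.array([
        [[0.5, 0.5], [1.0, 0.0]],  # adapter 1
        [[0.5, 0.5], [0.0, 1.0]],  # adapter 2
    ])
    mi = epistemic_disagreement(probs)
    # Position 0: both adapters are equally unsure, so disagreement is about 0.
    # Position 1: both are confident but contradict each other, about ln 2.
    print(mi)
    print(shaped_advantage(np.zeros(2), mi))
    # Distinct rows yield a larger nuclear norm than duplicated rows.
    print(nuclear_norm(np.eye(2)), nuclear_norm(np.array([[1.0, 0.0], [1.0, 0.0]])))
```

In the toy input, position 0 has high entropy but no disagreement (an intrinsically uncertain prediction), while position 1 has confident but contradictory adapters, which is the case the exploration bonus is meant to reward. The nuclear norm comparison shows why such a term can track adapter diversity: at the same Frobenius norm, duplicated adapters collapse to rank one and give a smaller sum of singular values.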