🤖 AI Summary
In scientific discovery, efficient screening of large combinatorial objects (e.g., molecules, proteins) is hindered by uncertainty in surrogate reward functions, leading existing reinforcement learning (RL) methods to produce low-quality, low-diversity candidates. To address this, we propose a robust RL framework centered on the first general-purpose soft operator that unifies strategy robustness and sampling sharpness, subsuming and extending mainstream soft RL operators. This operator jointly integrates uncertainty-aware optimization and surrogate reward modeling, enabling stable policy learning in discrete combinatorial spaces. Evaluated on both synthetic benchmarks and real-world scientific tasks, our method significantly improves candidate quality (e.g., drug-likeness, binding affinity) and structural diversity, consistently outperforming reward-proportional sampling baselines across all metrics.
📝 Abstract
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
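To make the "peakier sampling" intuition concrete, here is a minimal toy sketch (not the paper's operator): with reward-proportional sampling, a single excellent candidate in a large space receives almost no probability mass, while a sharpened distribution proportional to the reward raised to a power concentrates mass on it. The function `sample_probs` and the exponent `beta` are illustrative assumptions, not the unified operator introduced in the paper.

```python
def sample_probs(rewards, beta=1.0):
    """Sampling probabilities proportional to reward**beta.

    beta = 1 recovers reward-proportional sampling; beta > 1 yields a
    "peakier" distribution that concentrates mass on high-reward candidates.
    (Illustrative sketch only, not the paper's robust soft operator.)
    """
    weights = [r ** beta for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Toy search space: one high-reward candidate among 999 mediocre ones.
rewards = [10.0] + [1.0] * 999

p_prop = sample_probs(rewards, beta=1.0)   # reward-proportional
p_peaky = sample_probs(rewards, beta=4.0)  # sharpened

# Under proportional sampling the best candidate gets only ~1% of the
# mass; sharpening shifts most of the mass onto it.
print(f"P(best | proportional) = {p_prop[0]:.3f}")
print(f"P(best | beta=4)       = {p_peaky[0]:.3f}")
```

This illustrates the abstract's point that reward-proportional sampling dilutes mass across large search spaces; the paper's contribution is a principled operator that achieves such sharpening while remaining robust to uncertainty in the proxy reward.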