🤖 AI Summary
Generative models struggle to optimize under sparse rewards, where reward signals vanish on hard instances and each reward-function evaluation is costly.
Method: We propose a failure-only, negative-driven learning paradigm that treats failure-mode modeling as an *in-loop* generation problem. Leveraging Bayesian inference, we construct a posterior distribution over failures and use it to actively steer generation away from known failure regions, requiring no successful samples and only a small number of reward evaluations.
Contribution/Results: This framework unifies negative-evidence learning with sparse-reward generative optimization. On several highly sparse-reward benchmarks, it improves success rates by 2–3 orders of magnitude while drastically reducing reward-function calls, establishing a new pathway for aligning generative models in low-feedback, reward-scarce regimes with both theoretical and empirical support.
📝 Abstract
Today's generative models thrive with large amounts of supervised data and informative reward functions that characterize the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and that the reward function provides dense information about how to further improve generation quality and correctness. However, in the hardest instances of important problems, two challenges arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning the regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
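The core loop described above — fit a generative model to failed attempts, then steer new generations away from regions that model flags as likely failures — can be sketched on a 1-D toy problem. This is an illustrative sketch only, not the paper's algorithm: the Gaussian KDE failure model, the rejection-style steering rule, the threshold, and the standard-normal "base generator" are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def failure_log_density(x, failures, bandwidth=0.3):
    # In-loop generative model of failures, here a simple Gaussian KDE
    # (a stand-in; the paper does not commit to this model class).
    diffs = (x[:, None] - failures[None, :]) / bandwidth
    log_kernels = -0.5 * diffs**2 - 0.5 * np.log(2 * np.pi) - np.log(bandwidth)
    return np.logaddexp.reduce(log_kernels, axis=1) - np.log(len(failures))

def steered_sample(n, failures, threshold=-1.0):
    # Rejection-style steering: draw candidates from the base generator
    # and keep only those that do NOT resemble known failures. No reward
    # calls are made here; only the learned failure model is queried.
    kept = []
    while len(kept) < n:
        cand = rng.normal(0.0, 1.0, size=n)
        scores = failure_log_density(cand, failures)
        kept.extend(cand[scores < threshold].tolist())
    return np.array(kept[:n])

# Toy setup (assumption): observed failures cluster near x = -1.
failures = rng.normal(-1.0, 0.2, size=200)
base = rng.normal(0.0, 1.0, size=2000)       # un-steered base generator
steered = steered_sample(2000, failures)      # steered away from failures

print(f"mean base: {base.mean():+.2f}, mean steered: {steered.mean():+.2f}")
```

Running this, the steered samples avoid the neighborhood of the failure cluster at x = -1, so their mean shifts away from it relative to the base generator, which is the qualitative behavior the method relies on: negative evidence alone reshapes the sampling distribution without a single observed success.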