PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

📅 2026-03-18
🤖 AI Summary
This work addresses the limitations of existing unsupervised reinforcement learning methods for large language models, which rely on heuristic intrinsic rewards, lack a well-defined optimization objective, and are susceptible to length bias in autoregressive generation. The authors reformulate unsupervised fine-tuning as a distribution matching problem and propose a variational sampling framework grounded in Generative Flow Networks (GFlowNets). They introduce a length-aware trajectory balance objective to mitigate structural biases inherent in sequential generation and design an α-power distribution modulation mechanism: setting α > 1 enhances logical reasoning, while α < 1 promotes creative expression, simultaneously alleviating over-sharpening during alignment. Experiments demonstrate that the proposed method outperforms current unsupervised RLIF approaches across multiple tasks, achieving performance comparable to or exceeding supervised GRPO, and attains Pareto improvements in both diversity and quality on creative generation benchmarks.

📝 Abstract
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
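The abstract's core objective can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): it assumes the unnormalized target density is the $\alpha$-power of the frozen reference model's sequence likelihood, and that the length correction takes the simple form of a per-token offset added to the log-reward; the names `length_coef` and `log_z` are illustrative, and a real trainer would use per-token tensors and a learned partition estimate.

```python
def length_aware_tb_loss(policy_logps, ref_logp, alpha, log_z, length_coef=0.0):
    """Squared trajectory-balance residual for one sampled sequence.

    policy_logps : per-token log-probs of the sampled sequence under the
                   policy being trained (the GFlowNet forward policy).
    ref_logp     : total log-prob of the sequence under the frozen base
                   model; the unnormalized target is its alpha-power,
                   i.e. alpha * ref_logp in log space.
    log_z        : learned estimate of the log partition function.
    length_coef  : hypothetical per-token offset illustrating how a
                   structural bias toward short sequences could be
                   neutralized in the log-reward.
    """
    n = len(policy_logps)
    forward_logp = sum(policy_logps)                   # log P_F(trajectory)
    target_logr = alpha * ref_logp + length_coef * n   # length-corrected log-reward
    residual = log_z + forward_logp - target_logr
    return residual ** 2
```

At $\alpha = 1$ with a perfectly matched policy and `log_z = 0`, the residual vanishes; $\alpha > 1$ raises the log-reward of high-likelihood sequences (sharpening), while $\alpha < 1$ compresses it (flattening), which is the dual-mode behavior the abstract describes.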
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Reinforcement Learning
Intrinsic Rewards
Length Bias
Distribution Matching
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

PowerFlow
distribution matching
GFlowNet
α-power distributions
unsupervised reinforcement learning