Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

📅 2026-05-06
📈 Citations: 0
Influential: 0

📝 Abstract
Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.
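The pipeline described in the abstract (sample rollouts, cluster final answers, build a target distribution over clusters, reward turns that raise support for reliable clusters) can be illustrated with a small Python sketch. This is not the authors' implementation, and several pieces are assumptions made only for illustration: clustering uses normalized exact match as a stand-in for semantic clustering, cluster reliability is approximated by empirical frequency, the potential is taken to be the target-weighted log posterior, and the per-turn cluster posteriors are assumed to be supplied by some external estimator. All function names are hypothetical.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (a crude stand-in for semantic matching)."""
    return " ".join(answer.lower().split())


def cluster_answers(answers):
    """Assign each final answer to an outcome cluster id.

    Assumption: exact match after normalization replaces semantic clustering.
    """
    labels = {}
    ids = [labels.setdefault(normalize(a), len(labels)) for a in answers]
    return ids, len(labels)


def target_distribution(cluster_ids, n_clusters, temperature=1.0):
    """Build a target distribution over clusters.

    Assumption: reliability is proxied by cluster frequency, optionally
    sharpened with a temperature below 1.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    weights = [(counts[k] / total) ** (1.0 / temperature) for k in range(n_clusters)]
    z = sum(weights)
    return [w / z for w in weights]


def potential(posterior, target, eps=1e-8):
    """Assumed potential: sum_k target[k] * log posterior[k]."""
    return sum(q * math.log(p + eps) for q, p in zip(target, posterior))


def turn_rewards(posteriors, target):
    """Potential-based shaping with discount 1.

    posteriors[t] is the cluster posterior after t turns, so the reward for
    turn t is potential(posteriors[t + 1]) - potential(posteriors[t]).
    """
    phis = [potential(p, target) for p in posteriors]
    return [phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


# Usage: four rollouts for one query, three of which agree.
answers = ["Paris", "paris ", "Lyon", "Paris"]
ids, n = cluster_answers(answers)
target = target_distribution(ids, n)

# Hypothetical cluster posteriors before any turn and after each of two turns.
posteriors = [[0.5, 0.5], [0.6, 0.4], [0.75, 0.25]]
rewards = turn_rewards(posteriors, target)
print(ids, target, rewards)
```

Because the rewards are differences of a potential, they telescope: their sum over a trajectory equals the potential at the last turn minus the potential at the start, so each turn receives its own credit instead of a single rollout-level advantage broadcast to all turns.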
Problem

Research questions and friction points this paper is trying to address.

credit assignment
turn-level
verifier-free
long-horizon reasoning
self-supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment
self-induced outcome potential
semantic clustering
verifier-free RL
turn-level reward
🔎 Similar Papers
Senkang Hu
Hong Kong JC STEM Lab of Smart City, City University of Hong Kong, University of Oxford
Yong Dai
Fudan University
Xudong Han
University of Sussex
Multiple object tracking, LLM, MLLM
Zhengru Fang
Hong Kong JC STEM Lab of Smart City, City University of Hong Kong
Yuzhi Zhao
Ph.D., City University of Hong Kong; B.Eng., Huazhong University of Science and Technology
Low-level Vision, Computational Photography, LLM, MLLM
Sam Tak Wu Kwong
Lingnan University
Yuguang Fang
Hong Kong JC STEM Lab of Smart City, City University of Hong Kong