Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits

📅 2025-05-26
🤖 AI Summary
This work addresses the credit assignment problem in outcome-based online reinforcement learning under sparse and delayed rewards: the learner observes reward only at trajectory termini. It establishes the first theoretical framework for outcome-based RL with general function approximation, revealing a statistical identifiability limitation and an inherent exponential separation in sample complexity relative to dense-reward settings. Methodologically, the paper proposes an online policy optimization algorithm grounded in coverage-coefficient analysis, integrating trajectory-level counterfactual estimation, preference modeling, and comparative learning; the algorithm converges in deterministic MDPs without completeness assumptions and extends naturally to preference-based feedback. Theoretically, it achieves a sample complexity of $\widetilde{O}(C_{\mathrm{cov}} H^3/\epsilon^2)$, yielding the first online RL method for sparse feedback that simultaneously supports general function approximation and provides rigorous statistical guarantees.
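The defining feature of this setting is that per-step rewards exist in the environment but are hidden from the learner, which sees only a single scalar outcome at the end of each trajectory. A minimal sketch of this feedback protocol, using a hypothetical toy MDP (the environment, policy, and function names here are illustrative, not from the paper):

```python
def rollout_outcome_feedback(horizon, policy, step, init_state=0):
    """Roll out one episode where reward is revealed only at the terminus.

    The environment accumulates per-step rewards internally, but the learner
    observes only (trajectory, outcome): states and actions per step, plus a
    single scalar outcome at the end. `policy(state, h)` picks an action and
    `step(state, action)` returns (next_state, reward). All names here are
    illustrative placeholders, not the paper's notation.
    """
    state = init_state
    trajectory = []        # what the learner sees during the episode
    hidden_return = 0.0    # hidden from the learner until termination
    for h in range(horizon):
        action = policy(state, h)
        state, reward = step(state, action)
        hidden_return += reward        # never shown per step
        trajectory.append((state, action))
    return trajectory, hidden_return   # outcome revealed only here

# Toy deterministic chain MDP: action 1 moves right and earns reward 1.
traj, outcome = rollout_outcome_feedback(
    horizon=5,
    policy=lambda s, h: 1,
    step=lambda s, a: (s + a, float(a)),
)
```

Under this protocol the learner must attribute the single outcome value back across all `horizon` decisions, which is exactly the credit assignment problem the paper analyzes.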

📝 Abstract
Reinforcement learning with outcome-based feedback faces a fundamental challenge: when rewards are only observed at trajectory endpoints, how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation. We develop a provably sample-efficient algorithm achieving $\widetilde{O}(C_{\mathrm{cov}} H^3/\epsilon^2)$ sample complexity, where $C_{\mathrm{cov}}$ is the coverability coefficient of the underlying MDP. By leveraging general function approximation, our approach works effectively in large or infinite state spaces where tabular methods fail, requiring only that value functions and reward functions can be represented by appropriate function classes. Our results also characterize when outcome-based feedback is statistically separated from per-step rewards, revealing an unavoidable exponential separation for certain MDPs. For deterministic MDPs, we show how to eliminate the completeness assumption, dramatically simplifying the algorithm. We further extend our approach to preference-based feedback settings, proving that equivalent statistical efficiency can be achieved even under more limited information. Together, these results constitute a theoretical foundation for understanding the statistical properties of outcome-based reinforcement learning.
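The scaling behavior of the $\widetilde{O}(C_{\mathrm{cov}} H^3/\epsilon^2)$ bound can be made concrete with a small numeric sketch. The function below simply evaluates the bound's leading term, ignoring logarithmic factors and constants (the inputs are made-up illustrative values, not figures from the paper):

```python
def sample_complexity(c_cov, horizon, epsilon):
    """Leading term of the bound C_cov * H^3 / eps^2 (log factors dropped)."""
    return c_cov * horizon**3 / epsilon**2

# Illustrative values only: coverability 10, horizon 5, target accuracy 0.1.
n = sample_complexity(c_cov=10, horizon=5, epsilon=0.1)

# Halving the target accuracy epsilon quadruples the required samples,
# reflecting the 1/eps^2 dependence.
n_finer = sample_complexity(c_cov=10, horizon=5, epsilon=0.05)
```

The cubic dependence on the horizon $H$ and quadratic dependence on $1/\epsilon$ mirror the stated rate; the coverability coefficient $C_{\mathrm{cov}}$ enters only linearly.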
Problem

Research questions and friction points this paper is trying to address.

How to assign credit for actions with endpoint-only rewards
Sample-efficient algorithm for outcome-based RL with function approximation
Statistical separation between outcome-based and per-step feedback in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online RL algorithm with outcome-based feedback
Leverages general function approximation
Eliminates completeness assumption for deterministic MDPs