An Invitation to Deep Reinforcement Learning

📅 2023-12-13
🏛️ Found. Trends Optim.
📈 Citations: 5
Influential: 0
🤖 AI Summary
Deep neural networks struggle to directly optimize non-differentiable objectives (e.g., IoU, BLEU, reward signals) because such objectives provide no well-defined gradients.

Method: The paper reframes deep reinforcement learning (DRL) as a generalization of supervised learning, centering on gradient-based policy optimization (e.g., PPO) rather than tabular RL paradigms. It integrates surrogate-loss modeling, policy-gradient derivation, and human-feedback alignment techniques to bridge the gap from single-step non-differentiable optimization to multi-step sequential decision-making.

Contribution/Results: The work establishes a conceptual continuity between supervised learning and DRL, clarifying theoretical foundations, algorithmic boundaries, and practical implementation logic. This lowers the entry barrier to DRL, enabling researchers with only a supervised learning background to understand, adapt, and deploy state-of-the-art DRL methods across diverse application domains.
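The policy-gradient machinery the summary refers to rests on the score-function (log-derivative) identity, which turns the gradient of an expected, possibly non-differentiable reward into an expectation of a differentiable quantity. Sketched here for reference (notation ours, not quoted from the paper):

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \right]
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[ R(a) \, \nabla_\theta \log \pi_\theta(a) \right]
```

Only the log-probability of the policy must be differentiable in the parameters; the reward R (IoU, BLEU, human feedback) can remain a black box.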
📝 Abstract
Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning if the target objective is differentiable. For many interesting problems, this is, however, not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score, or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective. Reinforcement learning (RL) has emerged in recent years as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives. Examples include aligning large language models via human feedback, code generation, object detection, and control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time-intensive to approach due to the large range of methods, as well as the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.
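To make the abstract's central point concrete, here is a minimal, self-contained sketch (our illustration, not code from the paper) of optimizing a non-differentiable reward with the score-function (REINFORCE) estimator. The setup — a two-action categorical policy and an indicator reward — is an assumption chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(action):
    # Non-differentiable objective: an indicator. Like IoU or BLEU,
    # it yields a number but no gradient with respect to the policy.
    return 1.0 if action == 1 else 0.0

theta = np.zeros(2)      # logits of a categorical policy pi_theta
lr, batch = 0.5, 256

for _ in range(50):
    probs = softmax(theta)
    grad = np.zeros(2)
    for _ in range(batch):
        a = rng.choice(2, p=probs)
        # grad of log pi_theta(a) for a softmax policy: one_hot(a) - probs
        g_logp = -probs.copy()
        g_logp[a] += 1.0
        grad += reward(a) * g_logp
    theta += lr * grad / batch   # gradient ascent on E[R]

print(softmax(theta)[1])  # probability of the rewarded action; approaches 1
```

Note that the reward is only ever evaluated, never differentiated — the gradient flows entirely through the log-probability of the sampled actions, which is exactly what lets RL optimize objectives that supervised learning cannot.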
Problem

Research questions and friction points this paper is trying to address.

Optimizing non-differentiable objectives with reinforcement learning
Addressing suboptimal surrogate losses in supervised learning
Simplifying deep RL for non-tabular, temporal problems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses deep reinforcement learning for non-differentiable objectives
Introduces RL as supervised learning generalization
Teaches PPO assuming only basic supervised learning knowledge