ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards

📅 2025-01-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In quadrotor control, non-differentiable components of the reward function introduce gradient bias into Backpropagation Through Time (BPTT), degrading policy optimization. Method: We propose Amended Backpropagation Through Time (ABPT), a gradient-correction framework that combines 0-step and N-step returns. ABPT leverages a learned Q-value function to supply value-based gradients, mitigating the bias caused by reward non-differentiability. Entropy regularization and state initialization mechanisms are additionally integrated to encourage exploration during training. Contribution/Results: ABPT preserves the computational efficiency of standard BPTT while substantially reducing the gradient bias introduced by non-smooth rewards. Evaluated on four representative quadrotor control tasks, ABPT converges faster and achieves higher ultimate rewards than existing learning algorithms, particularly under partially differentiable reward settings. The approach advances model-based reinforcement learning for real-world physical systems by enabling stable, efficient gradient estimation in the presence of non-differentiable reward structures.

📝 Abstract
Using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT) enables high training performance for quadrotor tasks. However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks. Experimental results demonstrate that ABPT converges significantly faster and achieves higher ultimate rewards than existing learning algorithms, particularly in tasks involving partially differentiable rewards.
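The core idea described above, combining a 0-step return (the learned Q-value at the start state, whose gradient bypasses the non-differentiable reward terms) with an N-step return (discounted rollout rewards plus a bootstrapped terminal Q-value, whose gradient flows through the differentiable dynamics via BPTT), can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the mixing weight `alpha`, and the plain-scalar formulation are assumptions for exposition.

```python
def n_step_return(rewards, q_terminal, gamma=0.99):
    # N-step return: discounted sum of rollout rewards plus a
    # bootstrapped terminal Q-value. Under BPTT, gradients of this
    # quantity flow through the differentiable simulator.
    ret = q_terminal
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def abpt_objective(rewards, q_initial, q_terminal, alpha=0.5, gamma=0.99):
    # Hypothetical mixing of the two return estimates:
    #   - q_initial is the 0-step return, i.e. the learned Q-value at
    #     the initial state-action pair; its gradient comes from the
    #     critic and is unaffected by non-differentiable reward terms.
    #   - n_step_return(...) is the N-step BPTT return, which carries
    #     exact (but possibly biased) reward gradients.
    # alpha trades off critic-based and rollout-based gradients.
    return alpha * q_initial + (1.0 - alpha) * n_step_return(
        rewards, q_terminal, gamma
    )
```

In an actual training loop these scalars would be differentiable tensors (e.g. PyTorch), so that maximizing `abpt_objective` propagates both the value gradient and the BPTT gradient into the policy parameters.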
Problem

Research questions and friction points this paper is trying to address.

Backpropagation Through Time
Quadrotor Training
Reward Design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backpropagation_Through_Time
Entropy_Regularization
State_Initialization