🤖 AI Summary
In multi-vehicle cooperative driving with high-frequency continuous control, successive traffic states and neighboring actions are nearly indistinguishable, so conventional state-based reward design yields vanishing reward differences, a degraded signal-to-noise ratio (SNR) in policy gradients, and poor convergence. To address this, we propose a Hybrid Differential Reward (HDR) mechanism that combines a temporal-difference reward derived from a global potential function, which preserves optimal-policy invariance and keeps learning consistent with long-term objectives, with an action-gradient reward that measures the marginal utility of actions to provide high-SNR local guidance. We formulate the task as a partially observable Markov game with a time-varying agent set and instantiate HDR in both online planning (MCTS) and multi-agent reinforcement learning (QMIX, MAPPO, MADDPG). Experiments demonstrate that our approach significantly accelerates convergence, enhances policy stability, and improves both traffic efficiency and safety under dynamic agent populations.
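The paper's exact formulas are not reproduced here, so the following is a minimal Python sketch of the idea under assumed interfaces: `potential` (a global potential over traffic state), `action_utility` (a local utility whose finite difference approximates an action's marginal value), and the mixing weight `beta` are hypothetical names introduced for illustration, not taken from the paper.

```python
import numpy as np

def hybrid_differential_reward(state, next_state, action,
                               potential, action_utility,
                               gamma=0.99, beta=0.5):
    """Sketch of a hybrid differential reward, per the description above."""
    # (1) Temporal-difference reward (TDR): the discounted change in a
    #     global potential across the transition. Rewards of this
    #     potential-difference form are the classic shaping terms known
    #     to preserve optimal policies (Ng, Harada & Russell, 1999).
    tdr = gamma * potential(next_state) - potential(state)

    # (2) Action-gradient reward (AGR): a finite-difference estimate of
    #     the chosen action's marginal utility versus a null action,
    #     giving a local signal whose scale tracks the action itself
    #     rather than the slowly varying state.
    null_action = np.zeros_like(action)
    agr = action_utility(state, action) - action_utility(state, null_action)

    # Hybrid reward: long-horizon consistency from TDR plus
    # high-SNR local guidance from AGR.
    return tdr + beta * agr

# Toy usage with stand-in functions: potential = mean speed,
# utility = closeness of acceleration to a desired value.
r = hybrid_differential_reward(
    state=np.array([5.0]), next_state=np.array([5.2]),
    action=np.array([0.3]),
    potential=lambda s: float(s.mean()),
    action_utility=lambda s, a: -float(np.abs(a - 0.2).sum()),
)
```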
📝 Abstract
In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from vanishing reward differences. This phenomenon yields a low signal-to-noise ratio (SNR) in policy gradients, significantly hindering algorithm convergence and performance. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first analyze theoretically how the temporally quasi-steady nature of traffic states and the physical proximity of candidate actions cause traditional reward signals to fail. Building on this analysis, the HDR framework integrates two complementary components: (1) a Temporal-Difference Reward (TDR) based on a global potential function, which uses the evolution of the potential over time to guarantee optimal-policy invariance and consistency with long-term objectives; and (2) an Action-Gradient Reward (AGR), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a partially observable Markov game (POMG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted with both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
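For reference, the optimal-policy invariance claimed for the TDR component matches the standard potential-based shaping result (Ng, Harada & Russell, 1999), stated here for the single-agent MDP case; the weight $\beta$ and the symbol $r_{\mathrm{AGR}}$ below are illustrative notation, not the paper's:

```latex
% Potential-based shaping term over a global potential \Phi with
% discount \gamma:
F(s, s') = \gamma\,\Phi(s') - \Phi(s)
% Adding F to any base reward r leaves the optimal policy unchanged,
% which is the invariance property the TDR component relies on:
r_{\text{shaped}}(s, a, s') = r(s, a, s') + F(s, s')
% A hybrid reward of the kind described would then combine the two
% differential terms:
r_{\mathrm{HDR}}(s, a, s') = F(s, s') + \beta\, r_{\mathrm{AGR}}(s, a)
```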