Algorithm-Relative Trajectory Valuation in Policy Gradient Control

๐Ÿ“… 2025-11-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This study shows that trajectory value in policy-gradient control depends strongly on the learning algorithm, identifying a significant negative correlation (r ≈ −0.38) between a trajectory's persistence of excitation (PE) and its marginal value, driven primarily by gradient variance and termed the “variance-mediated mechanism.” Method: trajectory-level Shapley values and Leave-One-Out analysis quantify trajectory contributions, complemented by LQR-theoretic modeling, analytical gradient-variance characterization, and saddle-point escape-probability estimation. To counteract the adverse correlation, the authors introduce state whitening and Fisher preconditioning, which reverse the correlation (r → +0.29), and design a decision-alignment score for efficient trajectory pruning. Contribution/Results: experiments validate the variance-mediated mechanism and identify a class of harmful trajectories amenable to pruning, yielding a framework for principled sample selection and improved training stability.
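A minimal sketch of the two valuation tools named in the summary: trajectory-level Shapley values estimated by Monte Carlo permutation sampling, and Leave-One-Out marginal values. The `utility` callable, which is assumed to retrain or evaluate the policy on a subset of trajectories and return a scalar performance score, is a hypothetical stand-in for the paper's evaluation pipeline.

```python
import random
import numpy as np

def shapley_values(trajectories, utility, n_perms=200, seed=0):
    """Monte Carlo estimate of trajectory-level Shapley values.

    `utility(subset)` is assumed to retrain/evaluate the policy on the
    given list of trajectories and return a scalar performance score.
    """
    rng = random.Random(seed)
    n = len(trajectories)
    phi = np.zeros(n)
    for _ in range(n_perms):
        order = rng.sample(range(n), n)   # one random permutation
        subset, prev = [], utility([])
        for i in order:
            subset.append(trajectories[i])
            cur = utility(subset)
            phi[i] += cur - prev          # marginal gain of trajectory i
            prev = cur
    return phi / n_perms

def leave_one_out(trajectories, utility):
    """Marginal value of each trajectory relative to the full dataset."""
    full = utility(list(trajectories))
    return np.array([
        full - utility(trajectories[:i] + trajectories[i + 1:])
        for i in range(len(trajectories))
    ])
```

Permutation sampling keeps the Shapley estimate tractable: exact computation needs all 2^n subsets, while each sampled permutation contributes one marginal gain per trajectory.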

๐Ÿ“ Abstract
We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a negative correlation between Persistence of Excitation (PE) and marginal value under vanilla REINFORCE ($r \approx -0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddles, higher variance increases escape probability, raising marginal contribution. When stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive ($r \approx +0.29$). Hence, trajectory value is algorithm-relative. Experiments validate the mechanism and show decision-aligned scores (Leave-One-Out) complement Shapley for pruning, while Shapley identifies toxic subsets.
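The two stabilizers named in the abstract can be sketched directly. Below are hypothetical implementations (not the paper's code) of state whitening and empirical-Fisher preconditioning, assuming a batch of states and per-sample score-function vectors (grad-log-policy) are available.

```python
import numpy as np

def whiten_states(states, eps=1e-6):
    """Map a batch of states (N, d) to zero mean and identity covariance."""
    mu = states.mean(axis=0)
    cov = np.cov(states, rowvar=False) + eps * np.eye(states.shape[1])
    # inv(cov) = L @ L.T with L lower-triangular, so (X - mu) @ L
    # has identity covariance: L.T @ cov @ L = I.
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return (states - mu) @ L

def fisher_precondition(grad, scores, eps=1e-6):
    """Solve (F + eps*I) g_nat = grad with the empirical Fisher
    F = E[s s^T], where `scores` holds per-sample score vectors (N, d)."""
    F = scores.T @ scores / len(scores)
    return np.linalg.solve(F + eps * np.eye(F.shape[0]), grad)
```

Both operations equalize the gradient's variance across state directions, which is exactly the channel the abstract claims is responsible for the negative PE-value correlation.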
Problem

Research questions and friction points this paper is trying to address.

Investigates how trajectory value depends on the learning algorithm in policy-gradient control
Analyzes the correlation between Persistence of Excitation (PE) and marginal trajectory value under vanilla REINFORCE (a PE-scoring sketch follows this list)
Demonstrates algorithm-relative trajectory valuation through a variance-mediated mechanism
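PE is commonly quantified via the minimum eigenvalue of a trajectory's state Gram matrix. The energy normalization below is an assumption chosen to match the abstract's “for fixed energy” comparison, not necessarily the paper's exact definition.

```python
import numpy as np

def pe_score(states, eps=1e-12):
    """PE proxy: smallest eigenvalue of the energy-normalized Gram matrix
    (sum_t x_t x_t^T) / trace(.). Higher values mean the trajectory
    excites all state directions more evenly for the same total energy."""
    gram = states.T @ states  # (d, d) Gram matrix of a (T, d) trajectory
    return float(np.linalg.eigvalsh(gram / max(np.trace(gram), eps)).min())
```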
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses trajectory-level Shapley values to analyze how trajectory value correlates with PE
Proves a variance-mediated mechanism linking per-trajectory gradient variance to saddle-escape probability
Shows that stabilizing the algorithm (state whitening or Fisher preconditioning) flips the value correlation positive, enabling principled pruning (a sketch follows this list)
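The paper's decision-alignment score is not specified on this page; a plausible stand-in is the cosine similarity between each trajectory's gradient estimate and the full-batch gradient, used to drop the least-aligned trajectories. The names `decision_alignment` and `prune_trajectories` are hypothetical.

```python
import numpy as np

def decision_alignment(traj_grads, full_grad, eps=1e-12):
    """Cosine similarity of each per-trajectory gradient (N, d) with the
    full-batch gradient (d,); low or negative scores flag harmful
    trajectories whose updates pull against the aggregate direction."""
    g = full_grad / (np.linalg.norm(full_grad) + eps)
    norms = np.linalg.norm(traj_grads, axis=1, keepdims=True) + eps
    return (traj_grads / norms) @ g

def prune_trajectories(trajectories, scores, keep_frac=0.8):
    """Keep the top `keep_frac` fraction of trajectories by alignment."""
    k = max(1, int(len(trajectories) * keep_frac))
    keep = np.argsort(scores)[-k:]
    return [trajectories[i] for i in sorted(keep)]
```

Unlike Shapley values, this score needs only one gradient pass over the batch, which is why the abstract positions decision-aligned scores as a cheap complement for pruning.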
๐Ÿ”Ž Similar Papers
No similar papers found.
Shihao Li
The University of Texas at Austin
Jiachen Li
The University of Texas at Austin
Jiamin Xu
The University of Texas at Austin
Christopher Martin
The University of Texas at Austin
Wei Li
The University of Texas at Austin
Dongmei Chen
Department of Geography and Planning, Queen's University
remote sensing · GIS · spatial analysis