AI Summary
This study shows that trajectory value in policy-gradient control depends strongly on the learning algorithm, identifying a significant negative correlation (r ≈ −0.38) between a trajectory's persistence of excitation and its marginal value, driven primarily by gradient variance and termed the "variance-mediated mechanism."
Method: We propose trajectory-level Shapley values and Leave-One-Out analysis to quantify trajectory contributions, complemented by LQR-theoretic modeling, analytical gradient-variance characterization, and saddle-point escape probability estimation. To counteract the adverse correlation, we introduce state whitening and Fisher preconditioning, achieving correlation reversal (r ≈ +0.29), and design a decision-alignment score for efficient trajectory pruning.
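The two attribution scores above can be sketched as follows. This is a minimal Monte Carlo illustration, not the paper's implementation: `utility` is a hypothetical user-supplied function mapping a subset of trajectory indices to a scalar (e.g. post-training policy return), standing in for the full train-and-evaluate pipeline.

```python
import random

def trajectory_shapley(n_trajectories, utility, n_permutations=200, seed=0):
    """Monte Carlo estimate of each trajectory's Shapley value.

    `utility(subset)` is a hypothetical stand-in: it maps a tuple of
    trajectory indices to a scalar performance measure. The Shapley
    value of trajectory i averages its marginal contribution over
    random orderings of the dataset.
    """
    rng = random.Random(seed)
    values = [0.0] * n_trajectories
    for _ in range(n_permutations):
        order = list(range(n_trajectories))
        rng.shuffle(order)
        prefix, prev = [], utility(tuple())
        for i in order:
            prefix.append(i)
            cur = utility(tuple(sorted(prefix)))
            values[i] += (cur - prev) / n_permutations
            prev = cur
    return values

def leave_one_out(n_trajectories, utility):
    """LOO score: drop in utility when trajectory i alone is removed."""
    everyone = tuple(range(n_trajectories))
    full = utility(everyone)
    return [full - utility(tuple(j for j in everyone if j != i))
            for i in range(n_trajectories)]
```

For an additive toy utility (each trajectory contributes a fixed weight), both scores recover the weights exactly, which is a useful sanity check before running them on a real training pipeline.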
Contribution/Results: Experiments validate the variance-mediated mechanism and identify a class of harmful trajectories amenable to pruning. Our framework establishes a novel paradigm for robust policy optimization, enabling principled sample selection and improved training stability.
Abstract
We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a negative correlation between Persistence of Excitation (PE) and marginal value under vanilla REINFORCE ($r \approx -0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddles, higher variance increases escape probability, raising marginal contribution. When stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive ($r \approx +0.29$). Hence, trajectory value is algorithm-relative. Experiments validate the mechanism and show decision-aligned scores (Leave-One-Out) complement Shapley for pruning, while Shapley identifies toxic subsets.
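The state-whitening stabilizer mentioned above can be illustrated with a small sketch. This is a generic ZCA-style whitening of a state batch, assumed here as one plausible form of the paper's stabilizer; the sample data and `eps` regularizer are illustrative choices, not taken from the paper.

```python
import numpy as np

def whiten_states(X, eps=1e-8):
    """Whiten a batch of states X (n x d): zero mean, ~identity covariance.

    Uses the eigendecomposition-based inverse square root of the
    empirical covariance (ZCA whitening); `eps` guards against
    near-singular directions. A toy stand-in for the stabilizer.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

# Illustrative anisotropic state distribution: after whitening, the
# empirical covariance is close to the identity, removing the scale
# differences that drive the gradient-variance channel.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.2]])
Z = whiten_states(X)
```

With state coordinates equalized this way, no single high-magnitude direction dominates the policy-gradient estimate, which is the sense in which whitening "neutralizes" the variance channel.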