🤖 AI Summary
This work addresses the challenge of quantifying how much a reinforcement learning policy depends on its historical observations. We propose **Temporal Range**, a model-agnostic, axiomatically grounded measure of temporal memory extent. Temporal Range is computed from the mean Jacobian blocks between inputs and outputs (obtained via reverse-mode automatic differentiation), which form a temporal sensitivity spectrum; the spectrum's magnitude-weighted average delay characterizes the effective memory span of a policy. It is the first such metric to support vector-valued outputs while rigorously satisfying causality, monotonicity, and other foundational axioms. Evaluated on benchmarks including POPGym and Copy-$k$, Temporal Range aligns closely with task-inherent delays and accurately predicts the minimal context length required for near-optimal performance. It applies broadly across architectures, including MLPs, RNNs, and state space models (SSMs), enabling principled analysis and design of temporally aware policies.
📝 Abstract
How much does a trained RL policy actually use its past observations? We propose *Temporal Range*, a model-agnostic metric that treats the first-order sensitivities of a policy's vector-valued outputs over a temporal window to the input sequence as a temporal influence profile and summarizes that profile by its magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t \in \mathbb{R}^{c \times d}$, averaged over final timesteps $s \in \{t+1, \dots, T\}$, and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return, as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence, useful for comparing agents and environments and for selecting the shortest sufficient context.
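To make the quantity concrete, here is a minimal sketch of the magnitude-weighted average lag described above. The paper computes the Jacobian blocks $\partial y_s/\partial x_t$ with reverse-mode automatic differentiation; this sketch substitutes finite differences so it stays dependency-free, and the names (`temporal_range`, `copy2`) and the lag-wise aggregation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def temporal_range(policy_fn, x, eps=1e-6):
    """Illustrative Temporal Range readout (not the paper's implementation).

    policy_fn maps an input sequence x of shape (T, d) to outputs of
    shape (T, c). We estimate each Jacobian block dy_s/dx_t by finite
    differences, take its Frobenius norm, aggregate norms by lag
    k = s - t into a temporal influence profile, and return the
    magnitude-weighted average lag.
    """
    T, d = x.shape
    y0 = policy_fn(x)                        # (T, c)
    M = np.zeros((T, T))                     # M[s, t] -> ||dy_s/dx_t||_F
    for t in range(T):
        for j in range(d):
            xp = x.copy()
            xp[t, j] += eps
            dy = (policy_fn(xp) - y0) / eps  # (T, c): column j of each block
            M[:, t] += np.sum(dy ** 2, axis=1)
    M = np.sqrt(M)
    # Average block magnitude at each non-negative lag k = s - t
    # (for a causal policy, blocks with s < t vanish).
    profile = np.array([np.mean([M[t + k, t] for t in range(T - k)])
                        for k in range(T)])
    total = profile.sum()
    return float((np.arange(T) * profile).sum() / total) if total > 0 else 0.0

# Sanity check on a Copy-2 toy policy y_s = x_{s-2}: the only nonzero
# Jacobian blocks sit at lag 2, so the weighted average lag is 2.
def copy2(x):
    y = np.zeros_like(x)
    y[2:] = x[:-2]
    return y

x = np.random.default_rng(0).standard_normal((6, 1))
print(round(temporal_range(copy2, x), 3))  # -> 2.0
```

For a Copy-$k$ policy the profile concentrates at lag $k$, matching the abstract's claim that Temporal Range scales with the task's ground-truth lag; on a memoryless policy the profile concentrates at lag 0 and the readout stays small.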