Quantifying Memory Use in Reinforcement Learning with Temporal Range

📅 2025-12-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of quantifying historical observation dependence in reinforcement learning policies. We propose **Temporal Range**, a model-agnostic, axiomatically grounded measure of temporal memory extent. Temporal Range is derived from the mean Jacobian blocks between inputs and outputs—computed via reverse-mode automatic differentiation—to construct a temporal sensitivity spectrum; its weighted average delay characterizes the effective memory span of a policy. It is the first such metric supporting vector-valued outputs while rigorously satisfying causality, monotonicity, and other foundational axioms. Evaluated on benchmarks including POPGym and Copy-k, Temporal Range precisely aligns with task-inherent delays and accurately predicts the minimal context length required to achieve near-optimal performance. It is broadly applicable across diverse architectures—including MLPs, RNNs, and state space models (SSMs)—enabling principled analysis and design of temporally aware policies.

📝 Abstract
How much does a trained RL policy actually use its past observations? We propose \emph{Temporal Range}, a model-agnostic metric that treats the first-order sensitivities of multiple vector outputs across a temporal window to the input sequence as a temporal influence profile and summarizes it by the magnitude-weighted average lag. Temporal Range is computed via reverse-mode automatic differentiation from the Jacobian blocks $\partial y_s/\partial x_t \in \mathbb{R}^{c \times d}$ averaged over final timesteps $s \in \{t+1,\dots,T\}$ and is well-characterized in the linear setting by a small set of natural axioms. Across diagnostic and control tasks (POPGym; flicker/occlusion; Copy-$k$) and architectures (MLPs, RNNs, SSMs), Temporal Range (i) remains small in fully observed control, (ii) scales with the task's ground-truth lag in Copy-$k$, and (iii) aligns with the minimum history window required for near-optimal return, as confirmed by window ablations. We also report Temporal Range for a compact Long Expressive Memory (LEM) policy trained on the task, using it as a proxy readout of task-level memory. Our axiomatic treatment draws on recent work on range measures, specialized here to temporal lag and extended to vector-valued outputs in the RL setting. Temporal Range thus offers a practical per-sequence readout of memory dependence for comparing agents and environments and for selecting the shortest sufficient context.
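Once the per-lag Jacobian norms are in hand, the metric itself reduces to a magnitude-weighted average lag. A minimal sketch below, assuming a scalar linear recurrence $h_t = a h_{t-1} + x_t$ with readout $y_t = h_t$, for which the Jacobian $\partial y_s/\partial x_t = a^{s-t}$ is known in closed form and stands in for the reverse-mode autodiff pass; the function name `temporal_range` and the geometric test profile are illustrative, not from the paper.

```python
def temporal_range(influence):
    """Magnitude-weighted average lag of a temporal influence profile.

    influence[l] is the (averaged) norm of the Jacobian block at lag l,
    i.e. the sensitivity of an output y_s to the input x_{s-l}.
    """
    total = sum(influence)
    return sum(lag * w for lag, w in enumerate(influence)) / total

# Scalar linear recurrence h_t = a*h_{t-1} + x_t with readout y_t = h_t:
# dy_s/dx_t = a**(s - t), so the influence profile is geometric in the
# lag and Temporal Range approaches a / (1 - a) as the window grows.
a, T = 0.5, 20
profile = [a ** lag for lag in range(T)]
tr = temporal_range(profile)  # close to 0.5 / (1 - 0.5) = 1.0
```

A memoryless policy (all influence at lag 0) gives a Temporal Range of 0, while a Copy-$k$-style dependence concentrated at lag $k$ gives exactly $k$, matching the alignment with ground-truth delays reported above.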
Problem

Research questions and friction points this paper is trying to address.

Quantifies memory use in RL policies via temporal influence profiles.
Measures sensitivity of outputs to past observations across tasks.
Determines minimal history window needed for near-optimal performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Range metric measures memory use via sensitivity analysis.
Uses automatic differentiation to compute influence profiles across time.
Axiomatic framework extends to vector outputs in reinforcement learning.