Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the short memory horizon and accumulated drift that arise in long-sequence streaming 3D reconstruction when recurrent state updates are tightly bounded. The authors propose a training-free, parameter-free, closed-form frame-level scalar gate, denoted αₜ, which modulates each frame's contribution to the recurrent state. Whereas existing inference-time methods modulate updates only at the token level, this gate is content-aware at the frame level and acts as a continuous relaxation of classical SLAM keyframe selection. Derived from frame-to-frame changes in internal features, it keeps memory strictly constant and requires no extra forward pass. Evaluated on six benchmarks, it reduces absolute trajectory error (ATE) by 51% on long TUM-RGBD sequences and depth AbsRel by 12.8% on Bonn, and on KITTI long-sequence pose estimation it surpasses both LongStream and Keyframe-VO.
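To make the idea of a frame-level gate concrete, here is a minimal sketch. It is not the authors' formula, which this page does not give. It assumes a hypothetical gate equal to the relative change between consecutive pooled feature maps, and a hypothetical update in which αₜ simply rescales a per-token gate. The names `frame_gate`, `gated_update`, and `floor` are illustrative, and so is the assumption that larger feature change should mean a stronger update.

```python
import numpy as np


def frame_gate(prev_feat: np.ndarray, curr_feat: np.ndarray, floor: float = 1e-6) -> float:
    """Hypothetical frame-level gate in (0, 1] (an assumption, not the paper's formula).

    Uses the relative change ||f_t - f_{t-1}|| / (||f_t|| + ||f_{t-1}||), which the
    triangle inequality bounds by 1. `floor` keeps the value strictly positive.
    """
    num = np.linalg.norm(curr_feat - prev_feat)
    den = np.linalg.norm(curr_feat) + np.linalg.norm(prev_feat)
    if den == 0.0:
        return floor
    return float(np.clip(num / den, floor, 1.0))


def gated_update(state: np.ndarray, candidate: np.ndarray,
                 token_gate: np.ndarray, alpha: float) -> np.ndarray:
    """Assumed update rule: move each state token toward its candidate by alpha * token_gate.

    state, candidate: (num_tokens, dim); token_gate: (num_tokens,) with values in [0, 1].
    """
    return state + alpha * token_gate[:, None] * (candidate - state)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_prev = rng.normal(size=(16, 32))
    f_curr = f_prev + 0.1 * rng.normal(size=(16, 32))  # a nearly static frame
    alpha = frame_gate(f_prev, f_curr)
    state = np.zeros((4, 32))
    candidate = np.ones((4, 32))
    token_gate = np.full(4, 0.31)
    new_state = gated_update(state, candidate, token_gate, alpha)
    print(f"alpha = {alpha:.3f}, mean state value = {new_state.mean():.4f}")
```

Under these assumptions a nearly static frame yields a small αₜ, so the state barely moves, while a frame with large feature change yields αₜ close to 1 and receives the full per-token update. This mirrors how keyframe selection skips redundant frames, but with a continuous weight instead of a binary decision.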

📝 Abstract

Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median $0.31$; never exceeding $0.6$) and nearly frame-invariant, yielding an effective memory horizon of only $\sim 3$ frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of how strongly each frame should contribute to the state has been treated as content-independent. We close this gap with a scalar frame-level gate $α_t \in (0, 1]$ derived in closed form from frame-to-frame changes of internal features. This gate is a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to $4,541$ frames, our gate cuts ATE by $51\%$ on long TUM-RGBD pose sequences, reduces AbsRel by $12.8\%$ on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
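The link between a gate of about $0.31$ and a horizon of about 3 frames can be checked with a short calculation. The sketch below assumes, purely for illustration, that each state token follows a convex update $s_t = (1 - g)\,s_{t-1} + g\,u_t$ with a constant gate $g$. The abstract does not state the exact update rule or how the horizon is measured, so both function names and both horizon definitions here are assumptions.

```python
import math


def ema_time_constant(gate: float) -> float:
    """Return 1 / gate, the time constant of s_t = (1 - gate) * s_{t-1} + gate * u_t."""
    if not 0.0 < gate <= 1.0:
        raise ValueError("gate must lie in (0, 1]")
    return 1.0 / gate


def frames_until_weight_below(gate: float, fraction: float = 1.0 / math.e) -> float:
    """Frames k after which a frame's residual weight (1 - gate)**k drops to `fraction`."""
    if not 0.0 < gate < 1.0:
        raise ValueError("gate must lie in (0, 1)")
    return math.log(fraction) / math.log(1.0 - gate)


if __name__ == "__main__":
    for g in (0.31, 0.6):  # reported median and upper bound of the per-token gate
        print(f"gate={g}: 1/g = {ema_time_constant(g):.2f} frames, "
              f"e-folding = {frames_until_weight_below(g):.2f} frames")
```

Under this assumed model the median gate of $0.31$ gives a time constant of roughly 3.2 frames and an e-folding time of roughly 2.7 frames, both consistent with the reported horizon of about 3 frames. The same model shows why shrinking the effective update strength on some frames lengthens the horizon: halving the gate doubles the time constant.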

Problem

Research questions and friction points this paper is trying to address.

long-sequence drift

recurrent state update

constant-memory streaming

3D reconstruction

memory horizon

Innovation

Methods, ideas, or system contributions that make the work stand out.

frame-level gating

constant-memory streaming

long-sequence 3D reconstruction