Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

📅 2026-05-16

🤖 AI Summary
This work addresses the short memory horizon and accumulated drift that arise in long-sequence streaming 3D reconstruction when recurrent state updates are tightly bounded. The authors propose a training-free, parameter-free, closed-form scalar gate at the frame level, denoted αₜ, which modulates how strongly each frame contributes to the recurrent state. Where existing inference-time methods modulate updates only per token within a frame, this gate makes the frame-level contribution content-aware and acts as a continuous relaxation of classical SLAM keyframe selection. It is derived from frame-to-frame changes of internal features, needs no extra forward pass, and keeps memory strictly constant. Across six benchmarks, it reduces absolute trajectory error (ATE) by 51% on long TUM-RGBD pose sequences, reduces AbsRel by 12.8% on Bonn video depth, and surpasses both LongStream and Keyframe-VO on KITTI long-sequence pose estimation.
📝 Abstract
Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median $0.31$; never exceeding $0.6$) and nearly frame-invariant, yielding an effective memory horizon of only $\sim$3 frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of \emph{how strongly each frame should contribute to the state} has been treated as content-independent. We close this gap with a scalar frame-level gate $\alpha_t \in (0, 1]$ derived in closed form from frame-to-frame changes of internal features -- a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to $4{,}541$ frames, our gate cuts ATE by $51\%$ on long TUM-RGBD pose sequences, reduces AbsRel by $12.8\%$ on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
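A rough way to see why a median gate of 0.31 gives a horizon of about 3 frames is to assume the state follows a convex update, s_t = (1 - g) * s_{t-1} + g * new_t, so that old content decays by a factor (1 - g) per frame. This update form, the function names, and the way the frame-level scalar alpha multiplies the per-token gate are all illustrative assumptions; the abstract does not give the update rule or the closed form of the gate. Under these assumptions the 1/e horizon at g = 0.31 is roughly 2.7 updates, and about a third of a token's content survives three frames.

```python
import math


def retention(gate: float, k: int, alpha: float = 1.0) -> float:
    """Fraction of a state token's old content left after k updates.

    Assumes the convex update s_t = (1 - alpha*gate) * s_{t-1} + alpha*gate * new_t,
    which is an illustrative assumption, not the paper's stated rule.
    Requires 0 < alpha * gate < 1.
    """
    return (1.0 - alpha * gate) ** k


def efold_horizon(gate: float, alpha: float = 1.0) -> float:
    """Number of updates after which the retained fraction drops to 1/e."""
    return -1.0 / math.log(1.0 - alpha * gate)


MEDIAN_GATE = 0.31  # median per-token gate reported in the abstract

# Horizon and 3-frame retention with the per-token gate alone (alpha = 1).
print(efold_horizon(MEDIAN_GATE))
print(retention(MEDIAN_GATE, 3))

# Hypothetical frame-level scalar: a small alpha on a redundant frame
# shrinks the effective gate and so lengthens the horizon.
print(efold_horizon(MEDIAN_GATE, alpha=0.2))
```

In this toy model a frame-level scalar below 1 lengthens the horizon, which is the qualitative effect the abstract attributes to the frame-level gate. The numbers say nothing about the paper's reported gains.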
Problem

Research questions and friction points this paper is trying to address.

long-sequence drift
recurrent state update
constant-memory streaming
3D reconstruction
memory horizon
Innovation

Methods, ideas, or system contributions that make the work stand out.

frame-level gating
constant-memory streaming
long-sequence 3D reconstruction
recurrent state update
keyframe selection
Kejun Ren
Beijing University of Posts and Telecommunications, Beijing, China
Lei Jin
Beijing University of Posts and Telecommunications, Beijing, China
Tianxin Huang
The University of Hong Kong
Computer Vision, Computer Graphics
Lianming Xu
Beijing University of Posts and Telecommunications, Beijing, China
Li Wang
School of Computer Science, Beijing University of Posts and Telecommunications
D2D, Edge Computing and AI, Cooperative Caching, Matching Theory, Emergency Communications