SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Real-time reconstruction, semantic understanding, and low-latency streaming inference are hard to achieve simultaneously in dynamic scenes because existing methods are limited in motion modeling, semantic alignment, and memory efficiency. This work proposes SLARM, which it describes as the first unified feed-forward framework to integrate language-aligned semantics, high-order motion modeling learned without flow supervision, and streaming causal inference. SLARM distills semantic features from LSeg to support natural-language queries, uses windowed causal attention for low-latency streaming inference without accumulating memory overhead, and jointly optimizes geometry and semantics through differentiable rendering. The reported experiments show state-of-the-art performance in dynamic estimation, rendering quality, and scene parsing, with a 21% improvement in motion accuracy, a 1.6 dB gain in reconstruction PSNR, and a 20% increase in segmentation mIoU over existing methods.
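The windowed causal attention mentioned above can be illustrated with a frame-level mask. This is only a minimal sketch under stated assumptions: the function name `windowed_causal_mask`, the `window` parameter, and the frame-level granularity are hypothetical and are not taken from the paper, whose actual attention layout is not described in this summary. The point it shows is that each frame attends to itself and a bounded number of past frames, so the number of keys per frame stops growing as the stream gets longer.

```python
def windowed_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (not from the paper): attention is restricted to the
    current frame and the (window - 1) frames immediately before it.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        # Causal: j <= i. Windowed: j > i - window.
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = windowed_causal_mask(num_frames=6, window=3)

# Number of frames each query frame can see; it saturates at `window`.
keys_per_frame = [sum(row) for row in mask]
```

In this toy setting, `keys_per_frame` rises until it reaches the window size and then stays constant, which is the property that keeps per-step memory bounded during streaming.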

📝 Abstract
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
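Semantic querying with language-aligned features typically comes down to comparing each per-pixel feature with the embedding of a text prompt and picking the best match. The sketch below illustrates that idea with cosine similarity. It is a toy example under stated assumptions: the 3-dimensional vectors stand in for real text-encoder and distilled feature outputs, and the names `cosine` and `label_pixels` are hypothetical, not part of SLARM or LSeg.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def label_pixels(pixel_features, text_embeddings):
    """Assign each pixel feature the label whose text embedding is closest."""
    labels = list(text_embeddings)
    return [
        max(labels, key=lambda name: cosine(feature, text_embeddings[name]))
        for feature in pixel_features
    ]


# Placeholder embeddings (assumption): a real system would obtain these
# from a text encoder and from the model's rendered feature maps.
text_embeddings = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0]}
pixel_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predicted = label_pixels(pixel_features, text_embeddings)
```

Because the comparison happens in a shared embedding space, the label set can be changed at query time simply by supplying different text embeddings.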
Problem

Research questions and friction points this paper is trying to address.

dynamic scene reconstruction
semantic understanding
streaming inference
non-uniform motion
language-aligned representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic scene reconstruction
language-aligned semantics
streaming inference
high-order motion modeling
causal attention
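To make "high-order motion modeling" concrete, the sketch below contrasts a constant-velocity update with a second-order one that adds an acceleration term. This is an assumption made purely for illustration: the summary does not state which order or parameterization SLARM uses, and the function `displace` and its arguments are hypothetical.

```python
def displace(position, velocity, acceleration, dt):
    """Second-order motion: p(t + dt) = p + v * dt + 0.5 * a * dt^2."""
    return tuple(
        p + v * dt + 0.5 * a * dt * dt
        for p, v, a in zip(position, velocity, acceleration)
    )


p0 = (0.0, 0.0, 0.0)
v = (1.0, 0.0, 0.0)
a = (2.0, 0.0, 0.0)

# First-order (constant velocity) versus second-order prediction.
constant_velocity = displace(p0, v, (0.0, 0.0, 0.0), dt=0.5)
second_order = displace(p0, v, a, dt=0.5)
```

The two predictions differ whenever the acceleration is non-zero, which is the kind of non-uniform motion a purely linear model cannot represent.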
Authors

Zhicheng Qiu, Huawei Technologies Ltd.
Jiarui Meng, Peking University (3D Reconstruction, 3D Vision)
Tong-an Luo, Huawei Technologies Ltd.
Yican Huang, Huawei Technologies Ltd.
Xuan Feng, Huawei Technologies Ltd.
Xuanfu Li, Huawei Technologies Ltd.
Zhan Xu, Unknown affiliation (computer graphics, computer vision)