AI Summary
Real-time reconstruction, semantic understanding, and low-latency streaming inference in dynamic scenes are difficult to achieve simultaneously: existing methods fall short in motion modeling, semantic alignment, or memory efficiency. This work proposes SLARM, the first unified feed-forward framework to integrate language-aligned semantics, high-order unsupervised motion modeling, and streaming causal inference. SLARM leverages LSeg-based semantic distillation to enable natural-language queries, employs a windowed causal attention mechanism for low-latency streaming inference without accumulating memory overhead, and jointly optimizes geometry and semantics through differentiable rendering. Experiments show that SLARM achieves state-of-the-art performance in dynamic estimation, rendering quality, and scene parsing, with a 21% improvement in motion accuracy, a 1.6 dB gain in reconstruction PSNR, and a 20% increase in segmentation mIoU.
Abstract
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further improves the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences with window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
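The streaming property described above comes from the attention pattern: each frame may attend only to itself and a bounded number of past frames, so per-step memory does not grow with sequence length. A minimal NumPy sketch of such a windowed causal mask is shown below; the function name and the `window` hyperparameter are illustrative assumptions, not details from the paper.

```python
import numpy as np

def windowed_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean attention mask for window-based causal attention.

    Position i may attend to position j only if j <= i (causal: no
    access to future frames) and i - j < window (bounded history, so
    memory per step stays constant as the stream grows).

    NOTE: `window` and this exact masking rule are assumptions for
    illustration; the paper's actual window size is not specified here.
    """
    i = np.arange(seq_len)[:, None]   # query (current frame) indices
    j = np.arange(seq_len)[None, :]   # key (attended frame) indices
    return (j <= i) & (i - j < window)

# Example: a 6-frame stream with a 3-frame window.
mask = windowed_causal_mask(6, 3)
```

In this example, frame 5 attends only to frames 3, 4, and 5: never to future frames, and never to frames older than the window, which is what keeps streaming inference low-latency without accumulating memory cost.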