AI Summary
Real-time reconstruction, semantic understanding, and low-latency streaming inference in dynamic scenes are difficult to achieve simultaneously: existing methods fall short in motion modeling, semantic alignment, or memory efficiency. This work proposes SLARM, the first unified feed-forward framework to integrate language-aligned semantics, high-order unsupervised motion modeling, and streaming causal inference. SLARM leverages LSeg-based semantic distillation to enable natural-language queries, employs a windowed causal attention mechanism for low-latency streaming inference without accumulating memory overhead, and jointly optimizes geometry and semantics through differentiable rendering. Experiments show that SLARM achieves state-of-the-art performance in dynamic estimation, rendering quality, and scene parsing, with a 21% improvement in motion accuracy, a 1.6 dB gain in reconstruction PSNR, and a 20% increase in segmentation mIoU.
Abstract
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further improves the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences with window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
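The streaming property described above comes from the attention pattern: each frame may attend only to itself and a bounded number of past frames, so per-step memory does not grow with sequence length. A minimal NumPy sketch of such a windowed causal mask is shown below; the function name and the `window` hyperparameter are illustrative assumptions, not details from the paper.

```python
import numpy as np

def windowed_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean attention mask for window-based causal attention.

    Position i may attend to position j only if j <= i (causal: no
    access to future frames) and i - j < window (bounded history, so
    memory per step stays constant as the stream grows).

    NOTE: `window` and this exact masking rule are assumptions for
    illustration; the paper's actual window size is not specified here.
    """
    i = np.arange(seq_len)[:, None]   # query (current frame) indices
    j = np.arange(seq_len)[None, :]   # key (attended frame) indices
    return (j <= i) & (i - j < window)

# Example: a 6-frame stream with a 3-frame window.
mask = windowed_causal_mask(6, 3)
```

In this example, frame 5 attends only to frames 3, 4, and 5: never to future frames, and never to frames older than the window, which is what keeps streaming inference low-latency without accumulating memory cost.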