🤖 AI Summary
This work addresses two challenges in complex physical scenes: low-accuracy interactive motion reconstruction and coordinated control of multiple objects. We propose LiveScene, a language-driven interactive radiance field framework. Methodologically, we introduce a novel scene-level language embedding mechanism that combines local deformable field decomposition with interaction-aware language-object alignment for precise spatial localization, enabling motion-decoupled modeling and fine-grained control driven by natural-language instructions. Built on NeRF optimization, the approach is evaluated on the OmniSim and InterReal datasets. Results show a 2.1 dB PSNR improvement in novel-view synthesis, an 18.7% gain in cross-modal grounding accuracy, a 39% reduction in GPU memory consumption, and a 42.5% decrease in dynamic reconstruction error, substantially outperforming existing interactive reconstruction and language-controlled methods.
📝 Abstract
This paper scales object-level reconstruction to complex scenes, advancing interactive scene reconstruction. We introduce two datasets, OmniSim and InterReal, featuring 28 scenes with multiple interactive objects. To address inaccurate interactive motion recovery in complex scenes, we propose LiveScene, a scene-level language-embedded interactive radiance field that efficiently reconstructs and controls multiple objects. By decomposing the interactive scene into local deformable fields, LiveScene reconstructs each object's motion separately, reducing memory consumption. Additionally, an interaction-aware language embedding localizes individual interactive objects, allowing arbitrary control via natural language. Extensive experiments show that our approach significantly outperforms prior methods in novel view synthesis, interactive scene control, and language grounding. Project page: https://livescenes.github.io.
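The two mechanisms described above, routing scene points through per-object local deformable fields and selecting an object to control from a language query, can be sketched in a few lines. The sketch below is purely illustrative, not LiveScene's implementation: the names (`LocalDeformableField`, `route_and_deform`, `select_field_by_language`) are hypothetical, the "deformation" is a toy translation rather than a learned field, and language grounding is reduced to cosine similarity over precomputed per-object text embeddings.

```python
import numpy as np

class LocalDeformableField:
    """Toy stand-in for one object's local deformable field: an axis-aligned
    bounding box plus a translation along `axis` scaled by an interaction state.
    (Hypothetical class, for illustration only.)"""

    def __init__(self, bbox_min, bbox_max, axis):
        self.bbox_min = np.asarray(bbox_min, dtype=float)
        self.bbox_max = np.asarray(bbox_max, dtype=float)
        self.axis = np.asarray(axis, dtype=float)

    def contains(self, pts):
        # Boolean mask of points inside this object's bounding box.
        return np.all((pts >= self.bbox_min) & (pts <= self.bbox_max), axis=1)

    def deform(self, pts, state):
        # Toy deformation: translate along `axis` by the interaction state.
        return pts + state * self.axis

def route_and_deform(pts, fields, states):
    """Route each 3D sample point through the local field whose box contains
    it; points outside every box stay static (the rigid background)."""
    out = pts.copy()
    for field, state in zip(fields, states):
        mask = field.contains(pts)
        out[mask] = field.deform(pts[mask], state)
    return out

def select_field_by_language(query_emb, field_embs):
    """Pick the object/field whose text embedding best matches the query,
    by cosine similarity (a simplification of language-object alignment)."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = [float(np.dot(q, e / np.linalg.norm(e))) for e in field_embs]
    return int(np.argmax(sims))
```

Because only points inside an object's box are deformed, each object's motion is modeled independently of the others, which is the intuition behind the memory savings of decomposing one global deformable field into local ones.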