🤖 AI Summary
This work addresses the limitations of existing methods in mapping natural-language instructions to fine-grained interactive elements within 3D scenes, which are often hindered by passive, single-scale frame selection and visual occlusions. The authors propose UniFunc3D, a training-free unified framework that treats a multimodal large language model as an active observer. In a single forward pass, UniFunc3D jointly performs semantic understanding, spatiotemporal reasoning, and task decomposition. It adaptively selects keyframes and focuses on high-detail regions through an active spatiotemporal localization mechanism and a coarse-to-fine strategy, while preserving the global context needed to resolve ambiguities. Evaluated on SceneFun3D, the method achieves a 59.9% relative improvement in mIoU, significantly outperforming both training-based and training-free state-of-the-art approaches and setting a new performance benchmark.
📝 Abstract
Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified, training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning that grounds task decomposition in direct visual evidence. Our approach introduces active spatiotemporal grounding with a coarse-to-fine strategy, allowing the model to adaptively select the relevant video frames and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.