PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

📅 2026-04-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of natural language-driven semantic understanding in dynamic 4D scenes, where weak contextual reasoning, view-dependent noise, and cross-spatiotemporal semantic inconsistency hinder performance. The authors propose a query-time reasoning framework that, for the first time, integrates a multi-view semantic consensus mechanism with 4D Gaussian splatting reconstruction and neural field optimization to achieve structured 4D semantic grounding while preserving geometric consistency. By fusing multi-view, multi-frame 2D semantic predictions, the method effectively supports complex linguistic queries involving object attributes, actions, spatial relations, and multi-object interactions. Evaluated on the newly introduced Panoptic-L4D benchmark, the approach achieves state-of-the-art performance, significantly advancing language grounding capabilities in dynamic 4D environments.
๐Ÿ“ Abstract
Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.
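The core idea of the consensus step described in the abstract, fusing noisy per-view 2D semantic predictions into a single consistent label and filtering observations on which the views disagree, can be illustrated with a minimal sketch. This is not the paper's actual mechanism (which operates on 4D Gaussian splats via neural field optimization); it uses simple majority voting as a stand-in, and the function name and `min_agreement` parameter are illustrative assumptions.

```python
from collections import Counter

def multiview_consensus(predictions, min_agreement=0.5):
    """Illustrative stand-in for multi-view semantic consensus:
    fuse per-(view, frame) 2D labels observed for one scene point
    into a single label, discarding points whose views disagree.

    predictions: list of semantic labels, one per observation.
    min_agreement: fraction of observations that must share the
    winning label for it to be accepted.
    Returns the consensus label, or None if no label dominates.
    """
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    if votes / len(predictions) >= min_agreement:
        return label
    return None

# Views mostly agree: the single outlier prediction is filtered out.
print(multiview_consensus(["chair", "chair", "table", "chair"]))  # chair
# No dominant label: the point is left ungrounded.
print(multiview_consensus(["chair", "table"], min_agreement=0.6))  # None
```

In the paper's setting the filtered labels would then be lifted into the 4D representation, so that a language query matches objects consistently across space, time, and viewpoint rather than per image.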
Problem

Research questions and friction points this paper addresses.

4D scene understanding
natural language querying
semantic grounding
dynamic scenes
panoptic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D Gaussian Splatting
multi-view semantic consensus
neural field optimization
language-based querying
panoptic scene understanding
Ruilin Tang
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China; School of Computing and Information Systems, Singapore Management University, Singapore 188065
Yang Zhou
South China University of Technology
Computer Vision
Zhong Ye
School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
Wenxi Liu
Fuzhou University
Computer Vision
Yan Huang
South China University of Technology
Computer Vision · Image Processing · Deep Learning
Shengfeng He
Singapore Management University
Visual Computing · Generative Models · Computer Vision · Computational Photography · Computer Graphics