PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

📅 2026-04-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of natural language-driven semantic understanding in dynamic 4D scenes, where weak contextual reasoning, view-dependent noise, and cross-spatiotemporal semantic inconsistency hinder performance. The authors propose a query-time reasoning framework that, for the first time, integrates a multi-view semantic consensus mechanism with 4D Gaussian splatting reconstruction and neural field optimization to achieve structured 4D semantic grounding while preserving geometric consistency. By fusing multi-view, multi-frame 2D semantic predictions, the method effectively supports complex linguistic queries involving object attributes, actions, spatial relations, and multi-object interactions. Evaluated on the newly introduced Panoptic-L4D benchmark, the approach achieves state-of-the-art performance, significantly advancing language grounding capabilities in dynamic 4D environments.
๐Ÿ“ Abstract
Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.
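The core idea of the consensus step described in the abstract, fusing noisy per-view 2D semantic predictions into a single consistent label and filtering observations on which the views disagree, can be illustrated with a minimal sketch. This is not the paper's actual mechanism (which operates on 4D Gaussian splats via neural field optimization); it uses simple majority voting as a stand-in, and the function name and `min_agreement` parameter are illustrative assumptions.

```python
from collections import Counter

def multiview_consensus(predictions, min_agreement=0.5):
    """Illustrative stand-in for multi-view semantic consensus:
    fuse per-(view, frame) 2D labels observed for one scene point
    into a single label, discarding points whose views disagree.

    predictions: list of semantic labels, one per observation.
    min_agreement: fraction of observations that must share the
    winning label for it to be accepted.
    Returns the consensus label, or None if no label dominates.
    """
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    if votes / len(predictions) >= min_agreement:
        return label
    return None

# Views mostly agree: the single outlier prediction is filtered out.
print(multiview_consensus(["chair", "chair", "table", "chair"]))  # chair
# No dominant label: the point is left ungrounded.
print(multiview_consensus(["chair", "table"], min_agreement=0.6))  # None
```

In the paper's setting the filtered labels would then be lifted into the 4D representation, so that a language query matches objects consistently across space, time, and viewpoint rather than per image.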
Problem

Research questions and friction points this paper addresses.

4D scene understanding
natural language querying
semantic grounding
dynamic scenes
panoptic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D Gaussian Splatting
multi-view semantic consensus
neural field optimization
language-based querying
panoptic scene understanding
Ruilin Tang
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China; School of Computing and Information Systems, Singapore Management University, Singapore 188065
Yang Zhou
South China University of Technology
Computer Vision
Zhong Ye
School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
Wenxi Liu
Fuzhou University
Computer Vision
Yan Huang
South China University of Technology
Computer Vision · Image Processing · Deep Learning
Shengfeng He
Singapore Management University
Visual Computing · Generative Models · Computer Vision · Computational Photography · Computer Graphics