PARSE: Part-Aware Relational Spatial Modeling

📅 2026-03-08
🤖 AI Summary
Existing representations of spatial relationships, such as prepositions or object-level scene graphs, are too coarse to capture the precise contact, support, or containment regions between objects, which often results in ambiguous and physically inconsistent 3D scene layouts. To address this limitation, this work proposes PARSE, a framework that models interactions at the level of object parts. PARSE combines a Part-centric Assembly Graph (PAG) with a Part-Aware Spatial Configuration Solver that translates part-level geometric relations into constraints, producing collision-free, physically plausible 3D scenes. The authors also introduce PARSE-10K, a dataset of 10,000 3D indoor scenes with densely annotated contact structures, and fine-tune Qwen3-VL on it to improve the model's object-level layout reasoning and its understanding of part-level spatial relations. Experiments further indicate that using PAGs as structural priors in 3D generation models substantially improves physical realism and structural complexity.

📝 Abstract
Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
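The two components named in the abstract, a graph of part-to-part relations and a solver that turns those relations into geometric constraints, can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`Part`, `PartRelation`), the helper functions, the axis-aligned box approximation of parts, the z-up convention, and the restriction to a single "support" relation are all assumptions made for illustration.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Part:
    """A named object part, approximated by an axis-aligned box (hypothetical)."""
    obj: str   # owning object, e.g. "table"
    name: str  # part label, e.g. "top"
    lo: Vec3   # min corner in the object's local frame
    hi: Vec3   # max corner in the object's local frame


@dataclass(frozen=True)
class PartRelation:
    """A directed edge of a toy part-level graph: `src` relates to `dst`."""
    kind: str  # only "support" is handled in this sketch
    src: Part  # supporting part
    dst: Part  # supported part


def shift(p: Vec3, t: Vec3) -> Vec3:
    """Translate point p by offset t."""
    return (p[0] + t[0], p[1] + t[1], p[2] + t[2])


def solve_support(rel: PartRelation, src_offset: Vec3) -> Vec3:
    """Return a translation for the supported object so that the bottom face
    of `dst` rests on the top face of `src`, centered in x and y (z is up)."""
    if rel.kind != "support":
        raise ValueError(f"unsupported relation kind: {rel.kind}")
    src_lo, src_hi = shift(rel.src.lo, src_offset), shift(rel.src.hi, src_offset)
    tx = (src_lo[0] + src_hi[0]) / 2 - (rel.dst.lo[0] + rel.dst.hi[0]) / 2
    ty = (src_lo[1] + src_hi[1]) / 2 - (rel.dst.lo[1] + rel.dst.hi[1]) / 2
    tz = src_hi[2] - rel.dst.lo[2]
    return (tx, ty, tz)


def footprint_inside(rel: PartRelation, src_offset: Vec3, dst_offset: Vec3) -> bool:
    """Check that the x-y footprint of `dst` lies within that of `src`."""
    s_lo, s_hi = shift(rel.src.lo, src_offset), shift(rel.src.hi, src_offset)
    d_lo, d_hi = shift(rel.dst.lo, dst_offset), shift(rel.dst.hi, dst_offset)
    return all(s_lo[i] <= d_lo[i] and d_hi[i] <= s_hi[i] for i in (0, 1))


# Example: the base of a cup is supported by the top of a table.
table_top = Part("table", "top", (0.0, 0.0, 0.70), (1.0, 0.6, 0.75))
cup_base = Part("cup", "base", (-0.04, -0.04, 0.0), (0.04, 0.04, 0.01))
edge = PartRelation("support", table_top, cup_base)
cup_offset = solve_support(edge, (0.0, 0.0, 0.0))
print(cup_offset, footprint_inside(edge, (0.0, 0.0, 0.0), cup_offset))
```

Naming the parts on each edge (table top, cup base) rather than only the objects is what makes the placement unambiguous: the same "cup on table" relation at the object level would not say which surface carries the cup.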
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
object parts
scene layout
physical consistency
geometric relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Part-aware modeling
Spatial reasoning
Part-centric Assembly Graph
Geometric constraints
Physically consistent 3D generation
Authors
Yinuo Bai
ShanghaiTech University
Peijun Xu
ShanghaiTech University
Kuixiang Shao
ShanghaiTech University
Yuyang Jiao
ShanghaiTech University
Jingxuan Zhang
ShanghaiTech University
Kaixin Yao
ShanghaiTech University
Jiayuan Gu
Assistant Professor, ShanghaiTech University
Embodied AI, 3D Vision
Jingyi Yu
Professor, ShanghaiTech University
Computer Vision, Computer Graphics