PARSE: Part-Aware Relational Spatial Modeling

📅 2026-03-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing representations of spatial relationships, such as linguistic prepositions or object-level scene graphs, are too coarse to specify which regions of two objects actually contact, support, or contain one another, which often results in ambiguous and physically inconsistent 3D scene layouts. To address this limitation, this work proposes PARSE, a framework that explicitly models interactions at the level of object parts. PARSE combines a Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, with a Part-Aware Spatial Configuration Solver that converts those relations into geometric constraints, so that the assembled 3D scenes are collision-free and physically valid. The authors also introduce PARSE-10K, a dataset of 10,000 3D indoor scenes with dense contact structures and part-level contact graphs, and fine-tune Qwen3-VL on it, which improves both object-level layout reasoning and part-level relation understanding. In addition, using PAGs as structural priors in 3D generation models yields scenes with substantially improved physical realism and structural complexity.

📝 Abstract

Inter-object relations underpin spatial intelligence, yet existing representations -- linguistic prepositions or object-level scene graphs -- are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.
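
To make the idea of a part-level relation graph more concrete, here is a minimal Python sketch. It is not the authors' PAG schema or solver: the class names (`Part`, `PartGraph`), the axis-aligned box representation, the `"supports"` relation label, and the helper functions are all assumptions chosen for illustration. It shows how a relation between two specific parts (a table top and a mug base) could be turned into a simple geometric constraint, here a vertical offset that puts the two surfaces in contact.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Part:
    """A named object part with an axis-aligned box, z pointing up (hypothetical schema)."""
    obj: str
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


@dataclass
class PartGraph:
    """Toy part-level graph: nodes are parts, edges are typed relations between parts."""
    parts: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_part(self, part):
        self.parts[(part.obj, part.name)] = part

    def relate(self, src, relation, dst):
        # src and dst are (object, part-name) keys; relation is a free-form label.
        self.edges.append((src, relation, dst))


def resting_offset(supporter, supported):
    """Vertical shift that places the bottom of `supported` on the top of `supporter`."""
    return supporter.hi[2] - supported.lo[2]


def overlaps_xy(a, b):
    """True if the two boxes overlap when projected onto the ground plane."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(2))


# Illustrative scene: a mug base resting on a table top (dimensions are made up).
table_top = Part("table", "top", (0.0, 0.0, 0.70), (1.2, 0.8, 0.75))
mug_base = Part("mug", "base", (0.5, 0.3, 0.0), (0.6, 0.4, 0.01))

graph = PartGraph()
graph.add_part(table_top)
graph.add_part(mug_base)
graph.relate(("table", "top"), "supports", ("mug", "base"))

# Resolve every "supports" edge into a vertical offset for the supported part.
offsets = {}
for src, relation, dst in graph.edges:
    if relation == "supports" and overlaps_xy(graph.parts[src], graph.parts[dst]):
        offsets[dst] = resting_offset(graph.parts[src], graph.parts[dst])
```

The point of the sketch is that the edge names the contacting regions (`top` and `base`) rather than only the objects, so the resulting constraint is unambiguous in a way that "mug on table" is not. A real solver would also need to handle rotations, containment, non-box geometry, and collisions among many objects.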

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning

object parts

scene layout

physical consistency

geometric relations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Part-aware modeling

Spatial reasoning

Part-centric Assembly Graph