🤖 AI Summary
This paper introduces the task of language-guided 3D object placement: given a point cloud of a real-world scene, a 3D asset to be placed, and a natural language prompt, the goal is to predict a 6-DoF pose for the asset that is semantically plausible and geometrically feasible, meaning that the asset is supported, avoids collisions, and makes efficient use of free space. It is the first work to systematically address three challenges of this task: the multiplicity of valid solutions under highly ambiguous prompts, cross-modal alignment between geometry and language, and reasoning about free space. To this end, the authors establish the first dedicated benchmark, comprising a curated dataset, a standardized evaluation protocol, and baseline methods, and propose an end-to-end trainable 3D large model framework. The approach integrates multimodal feature alignment, explicit 3D spatial relation modeling, and language-guided pose optimization. Experiments show substantial improvements over heuristic baselines, and the authors propose the benchmark as a candidate for the suite used to evaluate the localization capabilities of general-purpose 3D foundation models.
📝 Abstract
We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.
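To make the geometric side of a "valid placement" concrete, the following is a minimal sketch of a support and collision check for a single candidate pose. It is an illustration only, not the paper's evaluation protocol or method: the function `placement_is_valid`, its parameters, the thresholds, the box approximation of the asset, and the restriction to rotation about the vertical axis are all assumptions made here. The semantic side, whether the placement respects the prompt, is not covered.

```python
import numpy as np


def placement_is_valid(scene_points, half_extents, center, yaw,
                       margin=0.01, support_gap=0.02, min_support_points=5):
    """Toy geometric check for one candidate placement (hypothetical helper).

    scene_points: (N, 3) scene point cloud, with z pointing up.
    half_extents: (3,) half sizes of the asset's bounding box.
    center, yaw:  box center and rotation about the z axis, a simplification
                  of a full 6-DoF pose (assumption of this sketch).
    """
    scene_points = np.asarray(scene_points, dtype=float)
    half_extents = np.asarray(half_extents, dtype=float)
    center = np.asarray(center, dtype=float)

    # Express scene points in the asset's local frame: R^T (p - center).
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (scene_points - center) @ rotation

    # Collision: any scene point strictly inside the box shrunk by `margin`.
    inside = np.all(np.abs(local) < half_extents - margin, axis=1)
    collision_free = not inside.any()

    # Support: enough scene points under the footprint, close to the box bottom.
    in_footprint = np.all(np.abs(local[:, :2]) <= half_extents[:2], axis=1)
    bottom = -half_extents[2]
    near_bottom = (local[:, 2] >= bottom - support_gap) & (local[:, 2] <= bottom + margin)
    supported = np.count_nonzero(in_footprint & near_bottom) >= min_support_points

    return bool(collision_free and supported)


# Demo scene: a flat floor sampled every 5 cm at z = 0.
xs = np.linspace(-1.0, 1.0, 41)
gx, gy = np.meshgrid(xs, xs)
floor = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)])
cube = [0.1, 0.1, 0.1]

print(placement_is_valid(floor, cube, [0.0, 0.0, 0.1], yaw=0.0))  # resting on the floor
print(placement_is_valid(floor, cube, [0.0, 0.0, 0.5], yaw=0.0))  # floating in the air
print(placement_is_valid(floor, cube, [0.0, 0.0, 0.0], yaw=0.0))  # sunk into the floor
```

In this toy scene many different centers on the floor pass the check, which mirrors the ambiguity noted in the abstract: a prompt typically admits a set of valid placements rather than a single ground-truth pose.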