Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

📅 2026-03-24

📝 Abstract
We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely either on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose the Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1 to 30 boxes and diverse natural-language constraints. RAMP-3D achieves a 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
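The reactive formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MaskPair`, `mask_centroid`, and `reactive_rollout` are hypothetical, the policy is treated as an opaque callable, and two details are assumptions the abstract does not state, namely that each mask is reduced to a pick or place position by taking its centroid, and that the policy signals completion by returning `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class MaskPair:
    """One predicted action: two boolean masks over the observed 3D points."""
    pick: List[bool]   # "which-object" mask: points on the box to pick
    place: List[bool]  # "which-target-region" mask: points where it should go


def mask_centroid(points: Sequence[Point], mask: Sequence[bool]) -> Point:
    """Average the masked points (assumed way of turning a mask into a position)."""
    selected = [p for p, m in zip(points, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    n = len(selected)
    return (
        sum(p[0] for p in selected) / n,
        sum(p[1] for p in selected) / n,
        sum(p[2] for p in selected) / n,
    )


def reactive_rollout(
    observe: Callable[[], Sequence[Point]],
    policy: Callable[[Sequence[Point], str], Optional[MaskPair]],
    execute: Callable[[Point, Point], None],
    instruction: str,
    max_steps: int = 50,
) -> int:
    """Re-observe, predict one mask pair, act, and repeat; return steps executed."""
    steps = 0
    for _ in range(max_steps):
        points = observe()                   # fresh observation before every action
        pair = policy(points, instruction)   # hypothetical 3D grounding model call
        if pair is None:                     # assumed "task finished" signal
            break
        execute(mask_centroid(points, pair.pick), mask_centroid(points, pair.place))
        steps += 1
    return steps
```

The point of the sketch is that no multi-step plan is produced up front: each iteration conditions only on the latest observation and the instruction, which is what distinguishes a reactive policy from a symbolic planning pipeline.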
Problem

Research questions and friction points this paper is trying to address.

long-horizon planning
3D rearrangement
vision-language grounding
reactive planning
3D segmentation masks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Vision-Language Grounding
Long-Horizon Planning
Reactive Policy
3D Instance Segmentation Mask
Pick-and-Place Manipulation