Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement

📅 2026-03-24
📈 Citations: 0
Influential: 0
📝 Abstract
We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely either on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both struggle to reason over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose the Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1 to 30 boxes and diverse natural-language constraints. RAMP-3D achieves a 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
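The reactive formulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `ToyScene`, `toy_policy`, `run_reactive`, and the single-point "boxes" are hypothetical stand-ins, and the hand-written rule in `toy_policy` only takes the place of a learned 3D VLM. The sketch shows only the control flow: observe, predict a pick mask and a place mask over the point cloud, execute one pick-and-place, then re-observe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]
Mask = List[bool]  # one flag per point in the observed point cloud


@dataclass
class Observation:
    points: List[Point]  # scene point cloud (would come from RGB-D in practice)
    box_ids: List[int]   # toy-only label: box id per point, -1 for background


# A policy maps (observation, instruction) to a (pick_mask, place_mask) pair,
# or returns None when it considers the task finished.
Policy = Callable[[Observation, str], Optional[Tuple[Mask, Mask]]]


def centroid(points: List[Point], mask: Mask) -> Point:
    """Mean position of the points selected by the mask."""
    selected = [p for p, m in zip(points, mask) if m]
    n = len(selected)
    return tuple(sum(p[i] for p in selected) / n for i in range(3))


class ToyScene:
    """Hypothetical scene: each box is one point; the floor is a row of points."""

    def __init__(self, boxes, floor):
        self.boxes = dict(boxes)  # box id -> position
        self.floor = list(floor)

    def observe(self) -> Observation:
        points = list(self.boxes.values()) + self.floor
        ids = list(self.boxes.keys()) + [-1] * len(self.floor)
        return Observation(points, ids)

    def pick_and_place(self, obs: Observation, pick: Mask, place: Mask) -> None:
        # The pick mask must select exactly one box in this toy setting.
        picked = {i for i, m in zip(obs.box_ids, pick) if m and i >= 0}
        assert len(picked) == 1
        self.boxes[picked.pop()] = centroid(obs.points, place)


def run_reactive(scene: ToyScene, policy: Policy, instruction: str,
                 max_steps: int = 50) -> int:
    """Re-observe after every action; return the number of actions executed."""
    steps = 0
    for _ in range(max_steps):
        obs = scene.observe()
        prediction = policy(obs, instruction)
        if prediction is None:
            break
        scene.pick_and_place(obs, *prediction)
        steps += 1
    return steps


def toy_policy(obs: Observation, instruction: str):
    """Stand-in rule (ignores the instruction): move boxes with x < 0 to free floor with x > 0."""
    occupied = {p for p, i in zip(obs.points, obs.box_ids) if i >= 0}
    indexed = list(enumerate(zip(obs.points, obs.box_ids)))
    movable = [k for k, (p, i) in indexed if i >= 0 and p[0] < 0]
    free = [k for k, (p, i) in indexed if i < 0 and p[0] > 0 and p not in occupied]
    if not movable or not free:
        return None
    pick = [k == movable[0] for k in range(len(obs.points))]
    place = [k == free[0] for k in range(len(obs.points))]
    return pick, place
```

Because the loop re-observes the scene before every prediction, the policy never commits to a full action sequence in advance. Each mask pair depends only on the current observation and the instruction.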
Problem

Research questions and friction points this paper is trying to address.

long-horizon planning
3D rearrangement
vision-language grounding
reactive planning
3D segmentation masks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Vision-Language Grounding
Long-Horizon Planning
Reactive Policy
3D Instance Segmentation Mask
Pick-and-Place Manipulation
Ashish Malik
Electrical Engineering and Computer Science, College of Engineering, Oregon State University
Caleb Lowe
Electrical Engineering and Computer Science, College of Engineering, Oregon State University
Aayam Shrestha
Electrical Engineering and Computer Science, College of Engineering, Oregon State University
Stefan Lee
Associate Professor, Oregon State University
Computer Vision, Natural Language Processing
Fuxin Li
Oregon State University
Deep Learning, Computer Vision, Point Clouds, Explainable AI, Explainable Deep Learning
Alan Fern
Oregon State University
Reinforcement Learning, Robotics, Agricultural AI