ReSem3D: Refinable 3D Spatial Constraints via Fine-Grained Semantic Grounding for Generalizable Robotic Manipulation

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing robotic manipulation methods suffer from coarse semantic granularity in constraint modeling, limited real-time closed-loop planning capability, and insufficient cross-scenario robustness. This paper proposes ReSem3D, a semantics-driven, vision-language manipulation framework that introduces a hierarchical recursive reasoning mechanism to generate 3D spatial constraints in two stages, part-level extraction followed by region-level refinement, enabling fine-grained semantic grounding and dynamic constraint optimization. The framework unifies multimodal large language models (MLLMs) and vision foundation models (VFMs) with RGB-D perception, natural language understanding, and real-time joint-space optimization into a single closed-loop manipulation pipeline. Evaluated in real-world settings, including domestic environments and chemical laboratories, the framework executes diverse manipulation tasks zero-shot. Both simulation and physical experiments demonstrate strong generalization and robustness to environmental perturbations. Code and demonstration videos are publicly available.

📝 Abstract
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments, leveraging the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. Subsequently, these constraints are encoded as real-time optimization objectives in joint space, enabling reactive behavior to dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization. Code and videos at https://resem3d.github.io.
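The two-stage constraint construction described in the abstract can be pictured as a simple pipeline: the MLLM proposes task-relevant object parts from the instruction, a VFM grounds each part to pixels, depth lifts the pixels to 3D, and a recursive refinement pass narrows each part to a finer region. The sketch below is purely illustrative; the `mllm`, `vfm`, and `rgbd` interfaces (`propose_parts`, `refine_region`, `segment`, `lift_to_3d`) are hypothetical stand-ins, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Constraint3D:
    """A 3D spatial constraint anchored to a semantic part or region."""
    label: str                                   # e.g. "mug handle"
    points: list = field(default_factory=list)   # 3D keypoints in camera frame
    stage: str = "part"                          # "part" or "region"

def build_constraints(instruction, rgbd, mllm, vfm):
    """Hypothetical two-stage constraint construction:
    part-level extraction, then region-level refinement."""
    # Stage 1: the MLLM proposes task-relevant object parts from the instruction.
    parts = mllm.propose_parts(instruction, rgbd.rgb)
    part_constraints = []
    for part in parts:
        # The VFM grounds each part to a pixel mask; depth lifts it to 3D.
        mask = vfm.segment(rgbd.rgb, part)
        part_constraints.append(Constraint3D(part, rgbd.lift_to_3d(mask), "part"))
    # Stage 2: the MLLM recursively refines each part to a finer region,
    # which is re-grounded and re-lifted to produce the final constraint.
    refined = []
    for c in part_constraints:
        region = mllm.refine_region(instruction, rgbd.rgb, c.label)
        mask = vfm.segment(rgbd.rgb, region)
        refined.append(Constraint3D(region, rgbd.lift_to_3d(mask), "region"))
    return refined
```

The key structural point is the recursion from coarse to fine: region-level constraints are derived from, and replace, the part-level ones before being handed to the controller.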
Problem

Research questions and friction points this paper is trying to address.

Coarse semantic granularity in robotic constraint modeling
Lack of real-time closed-loop planning in manipulation
Robustness issues in semantically diverse environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained semantic grounding for 3D constraints
Hierarchical recursive reasoning with MLLMs and VFMs
Real-time optimization for dynamic manipulation
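The last bullet, encoding constraints as objectives minimized directly in joint space, can be illustrated with a toy example: a planar 2-link arm whose joint angles are driven by gradient descent on the squared distance between the end-effector and a constraint point. This is a minimal sketch under strong assumptions (toy kinematics, numeric finite-difference gradients); it does not reflect the paper's actual solver.

```python
import numpy as np

def fk_2link(q, l1=0.4, l2=0.3):
    """Toy forward kinematics for a planar 2-link arm (illustrative only)."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def track_constraint(q, target, steps=300, lr=0.5, eps=1e-6):
    """Minimize the squared distance between the end-effector and a
    constraint point by numeric gradient descent in joint space."""
    cost = lambda qq: np.sum((fk_2link(qq) - target) ** 2)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            dq = np.zeros(2)
            dq[i] = eps
            # Central finite difference of the constraint cost.
            grad[i] = (cost(q + dq) - cost(q - dq)) / (2 * eps)
        q = q - lr * grad
    return q
```

Because the objective lives in joint space, re-running the inner loop each control cycle against a freshly updated constraint point gives the reactive, closed-loop behavior the bullet refers to.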