Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty vision-language models (VLMs) have in inferring a text-consistent goal 6D pose for a target object in a 3D scene. The authors propose a test-time, closed-loop agent framework that requires neither fine-tuning nor additional modules. The VLM iteratively observes the scene from multiple views, evaluates the current pose against the instruction, proposes a pose update (with rotations predicted one axis at a time), and inspects the re-rendered scene; a visualization of the object-centered coordinate system supports its 3D spatial reasoning. The approach surpasses prior methods at goal-pose prediction across both open- and closed-source VLMs, and when combined with simple motion planning it enables more successful robotic manipulation than existing methods.
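As an illustration of what a single-axis rotation update in an object-centered frame could look like, here is a minimal sketch. It is not the authors' code. The 4x4 object-to-world pose convention, the helper names `axis_rotation` and `apply_single_axis_update`, and the choice to apply translation in the world frame are all assumptions made for this example.

```python
import numpy as np


def axis_rotation(axis: str, degrees: float) -> np.ndarray:
    """Return a 3x3 rotation matrix about one principal axis ('x', 'y' or 'z')."""
    t = np.radians(degrees)
    c, s = np.cos(t), np.sin(t)
    if axis == "x":
        return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    if axis == "y":
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    if axis == "z":
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    raise ValueError(f"axis must be 'x', 'y' or 'z', got {axis!r}")


def apply_single_axis_update(pose, axis, degrees, translation=(0.0, 0.0, 0.0)):
    """Update a 4x4 object-to-world pose (hypothetical convention).

    The rotation is about the object's own axis, so it is right-multiplied
    onto the current rotation. The translation is added in the world frame,
    which is an assumption of this sketch.
    """
    updated = np.array(pose, dtype=float)
    updated[:3, :3] = updated[:3, :3] @ axis_rotation(axis, degrees)
    updated[:3, 3] = updated[:3, 3] + np.asarray(translation, dtype=float)
    return updated


# Usage: turn an object 90 degrees about its own z axis and lift it by 0.1.
new_pose = apply_single_axis_update(np.eye(4), "z", 90.0, translation=(0.0, 0.0, 0.1))
```

Restricting each step to one axis keeps the quantity the model must predict to an axis label and an angle, and successive steps compose into a full 3D rotation.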


📝 Abstract
Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with some inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate whether it is faithful to the instruction; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centered coordinate system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object. It works consistently across both closed-source and open-source VLMs. Moreover, when combining our 6D pose prediction with simple robot motion planning, it enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
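The loop described in the abstract (observe, evaluate, propose, apply, render) can be sketched as a small control skeleton. This is a hedged illustration only: `closed_loop`, its parameters, the `max_steps` budget, and the toy stand-ins below are hypothetical names chosen for this example, and the real system would back `evaluate` and `propose` with VLM calls and `render` with a renderer.

```python
from typing import Any, Callable, Tuple


def closed_loop(
    state: Any,
    instruction: str,
    render: Callable[[Any], Any],
    evaluate: Callable[[Any, str], bool],
    propose: Callable[[Any, str], Any],
    apply_update: Callable[[Any, Any], Any],
    max_steps: int = 10,
) -> Tuple[Any, int]:
    """Run observe -> evaluate -> propose -> apply -> render until accepted.

    Returns the final state and the number of updates applied.
    """
    views = render(state)  # observe the current scene
    for step in range(max_steps):
        if evaluate(views, instruction):  # is the scene faithful to the text?
            return state, step
        update = propose(views, instruction)  # pose update for the target
        state = apply_update(state, update)  # apply the update
        views = render(state)  # render the updated scene
    return state, max_steps


# Toy stand-ins (assumptions): the "scene" is a yaw angle in degrees,
# the "instruction" asks for 90 degrees, and each proposal adds 30.
def toy_render(angle):
    return angle


def toy_evaluate(view, instruction):
    return view == 90


def toy_propose(view, instruction):
    return 30


def toy_apply(angle, delta):
    return angle + delta


result = closed_loop(0, "rotate the mug to 90 degrees",
                     toy_render, toy_evaluate, toy_propose, toy_apply)
```

Passing the five stages in as callables keeps the loop independent of any particular VLM or renderer, which matches the abstract's claim that the approach works without fine-tuning or new modules.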
Problem

Research questions and friction points this paper is trying to address.

6D object pose
vision-language models
3D understanding
text-guided rearrangement
pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-loop VLM agents
6D object pose rearrangement
text-guided manipulation
multi-view reasoning
object-centered coordinate system