🤖 AI Summary
This work addresses the bottleneck of domain-specific fine-tuning in long-horizon robotic manipulation. We propose a fine-tuning-free, end-to-end language-to-action control framework. Methodologically, we introduce a dynamic scene graph-based environment representation that couples multimodal perception, a general-purpose reasoning model, and joint symbolic-spatial reasoning to enable zero-shot cross-task transfer. Our core contributions are: (1) explicit modeling of spatial and semantic inter-object relationships via scene graphs, enabling long-horizon instruction grounding and action planning; and (2) modular decoupling of perception, reasoning, and execution, ensuring compatibility with off-the-shelf multimodal foundation models (e.g., CLIP, LLaVA) and large language models (LLMs). Evaluated on multi-stage tabletop manipulation tasks, the framework achieves a +28.6% improvement in task success rate and substantially improves robustness under environmental perturbations, offering empirical support for training-free paradigms in complex robotic manipulation.
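To make the scene graph representation concrete, the sketch below shows one way such a structure could look: detected objects as nodes and spatial/semantic relations as typed edges, with a text serialization an off-the-shelf LLM can reason over. This is an illustrative assumption, not the authors' implementation; the `ObjectNode` and `SceneGraph` names and methods are hypothetical.

```python
# Hypothetical sketch (not from the paper): a minimal dynamic scene graph where
# objects are nodes and spatial/semantic relations are (subject, relation, object) edges.
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str                                       # e.g. "red_mug"
    position: tuple[float, float, float]            # estimated 3D position in the workspace frame
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red", "graspable": True}


@dataclass
class SceneGraph:
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def update_object(self, node: ObjectNode) -> None:
        """Insert or refresh an object after each perception cycle (keeps the graph dynamic)."""
        self.nodes[node.name] = node

    def add_relation(self, subj: str, relation: str, obj: str) -> None:
        """Record a spatial or semantic relation, e.g. ("red_mug", "on", "tray")."""
        self.edges.append((subj, relation, obj))

    def to_prompt(self) -> str:
        """Serialize the graph as plain text so a general-purpose LLM can ground instructions on it."""
        lines = [f"{n.name} at {n.position} {n.attributes}" for n in self.nodes.values()]
        lines += [f"{s} {r} {o}" for s, r, o in self.edges]
        return "\n".join(lines)


# Example: state that could ground "put the red mug on the tray"
graph = SceneGraph()
graph.update_object(ObjectNode("red_mug", (0.42, -0.10, 0.03), {"color": "red"}))
graph.update_object(ObjectNode("tray", (0.55, 0.20, 0.01)))
graph.add_relation("red_mug", "left_of", "tray")
print(graph.to_prompt())
```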
📝 Abstract
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf components, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building manipulation systems directly on top of off-the-shelf foundation models.
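As a rough illustration of the modular decoupling described above, the following sketch shows how perception, reasoning, and execution modules could interact in a closed loop around a dynamically maintained scene graph (reusing the `SceneGraph` sketch above). The `perception`, `reasoner`, and `executor` interfaces are assumptions for illustration, not components specified by the paper.

```python
# Hypothetical sketch (not the paper's code): a perception-reasoning-execution loop
# in which each module is swappable and no component requires task-specific fine-tuning.

def run_task(instruction: str, perception, reasoner, executor, scene_graph, max_steps: int = 20) -> bool:
    """Closed-loop control: re-perceive, re-plan with an LLM, and act until the task completes."""
    for _ in range(max_steps):
        # 1. Perception: an off-the-shelf vision-language model reports objects and
        #    relations; its output refreshes the dynamic scene graph.
        for node, relations in perception.observe():
            scene_graph.update_object(node)
            for subj, rel, obj in relations:
                scene_graph.add_relation(subj, rel, obj)

        # 2. Reasoning: a general-purpose LLM reads the serialized graph plus the
        #    instruction and proposes the next primitive action (e.g. "pick red_mug").
        action = reasoner.next_action(instruction, scene_graph.to_prompt())
        if action == "done":
            return True

        # 3. Execution: a low-level controller carries out the primitive action.
        executor.execute(action)
    return False
```

Because each stage communicates only through the scene graph and plain-text actions, any of the three modules can in principle be replaced by a different off-the-shelf model without retraining the others.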