Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the bottleneck of domain-specific fine-tuning in long-horizon robotic manipulation. We propose a fine-tuning-free, end-to-end language-to-action control framework. Methodologically, we introduce a novel dynamic scene graph–based environmental representation that unifies multimodal perception, general-purpose reasoning models, and symbolic-spatial joint reasoning to enable zero-shot cross-task transfer. Our core contributions are: (1) explicit modeling of spatial and semantic inter-object relationships via scene graphs, enabling long-horizon instruction grounding and action planning; and (2) modular decoupling of perception, reasoning, and execution components—ensuring compatibility with off-the-shelf multimodal foundation models (e.g., CLIP, LLaVA) and large language models (LLMs). Evaluated on multi-stage desktop manipulation tasks, the framework achieves a +28.6% improvement in task success rate and significantly enhances robustness under environmental perturbations, providing the first empirical validation of training-free paradigms for complex robotic manipulation.

Technology Category

Application Category

📝 Abstract
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
Problem

Research questions and friction points this paper is trying to address.

Enabling robotic manipulation without domain-specific training
Integrating multimodal perception with robust task sequencing
Maintaining spatial awareness through dynamic scene graphs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained foundation models without domain-specific training
Integrates multimodal perception with robust task sequencing
Uses dynamically maintained scene graphs for spatial awareness
🔎 Similar Papers
No similar papers found.
S
Sushil Samuel Dinesh
Department of Electrical and Computer Engineering, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
Shinkyu Park
Shinkyu Park
KAUST
Multi-Robot CoordinationRoboticsAI for RoboticsGame TheoryFeedback Control