🤖 AI Summary
This work addresses the bottleneck of domain-specific fine-tuning in long-horizon robotic manipulation. We propose a fine-tuning-free, end-to-end language-to-action control framework. Methodologically, we introduce a dynamic scene graph-based environment representation that couples multimodal perception, a general-purpose reasoning model, and joint symbolic-spatial reasoning to enable zero-shot cross-task transfer. Our core contributions are: (1) explicit modeling of spatial and semantic inter-object relationships via scene graphs, enabling long-horizon instruction grounding and action planning; and (2) modular decoupling of perception, reasoning, and execution, ensuring compatibility with off-the-shelf multimodal foundation models (e.g., CLIP, LLaVA) and large language models (LLMs). Evaluated on multi-stage tabletop manipulation tasks, the framework achieves a +28.6% improvement in task success rate and substantially improves robustness under environmental perturbations, offering empirical support for training-free paradigms in complex robotic manipulation.
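To make the scene graph representation concrete, the sketch below shows one way such a structure could look: detected objects as nodes and spatial/semantic relations as typed edges, with a text serialization an off-the-shelf LLM can reason over. This is an illustrative assumption, not the authors' implementation; the `ObjectNode` and `SceneGraph` names and methods are hypothetical.

```python
# Hypothetical sketch (not from the paper): a minimal dynamic scene graph where
# objects are nodes and spatial/semantic relations are (subject, relation, object) edges.
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str                                       # e.g. "red_mug"
    position: tuple[float, float, float]            # estimated 3D position in the workspace frame
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red", "graspable": True}


@dataclass
class SceneGraph:
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def update_object(self, node: ObjectNode) -> None:
        """Insert or refresh an object after each perception cycle (keeps the graph dynamic)."""
        self.nodes[node.name] = node

    def add_relation(self, subj: str, relation: str, obj: str) -> None:
        """Record a spatial or semantic relation, e.g. ("red_mug", "on", "tray")."""
        self.edges.append((subj, relation, obj))

    def to_prompt(self) -> str:
        """Serialize the graph as plain text so a general-purpose LLM can ground instructions on it."""
        lines = [f"{n.name} at {n.position} {n.attributes}" for n in self.nodes.values()]
        lines += [f"{s} {r} {o}" for s, r, o in self.edges]
        return "\n".join(lines)


# Example: state that could ground "put the red mug on the tray"
graph = SceneGraph()
graph.update_object(ObjectNode("red_mug", (0.42, -0.10, 0.03), {"color": "red"}))
graph.update_object(ObjectNode("tray", (0.55, 0.20, 0.01)))
graph.add_relation("red_mug", "left_of", "tray")
print(graph.to_prompt())
```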
📝 Abstract
This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf components, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building manipulation systems directly on top of off-the-shelf foundation models.
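As a rough illustration of the modular decoupling described above, the following sketch shows how perception, reasoning, and execution modules could interact in a closed loop around a dynamically maintained scene graph (reusing the `SceneGraph` sketch above). The `perception`, `reasoner`, and `executor` interfaces are assumptions for illustration, not components specified by the paper.

```python
# Hypothetical sketch (not the paper's code): a perception-reasoning-execution loop
# in which each module is swappable and no component requires task-specific fine-tuning.

def run_task(instruction: str, perception, reasoner, executor, scene_graph, max_steps: int = 20) -> bool:
    """Closed-loop control: re-perceive, re-plan with an LLM, and act until the task completes."""
    for _ in range(max_steps):
        # 1. Perception: an off-the-shelf vision-language model reports objects and
        #    relations; its output refreshes the dynamic scene graph.
        for node, relations in perception.observe():
            scene_graph.update_object(node)
            for subj, rel, obj in relations:
                scene_graph.add_relation(subj, rel, obj)

        # 2. Reasoning: a general-purpose LLM reads the serialized graph plus the
        #    instruction and proposes the next primitive action (e.g. "pick red_mug").
        action = reasoner.next_action(instruction, scene_graph.to_prompt())
        if action == "done":
            return True

        # 3. Execution: a low-level controller carries out the primitive action.
        executor.execute(action)
    return False
```

Because each stage communicates only through the scene graph and plain-text actions, any of the three modules can in principle be replaced by a different off-the-shelf model without retraining the others.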