Improving Generalization of Language-Conditioned Robot Manipulation

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language-conditioned robotic manipulation is hindered by weak generalization to unseen environments and a heavy reliance on large-scale data for fine-tuning vision-language models (VLMs). To address this, we propose a two-stage few-shot learning framework that first decouples the pick and place subtasks, then introduces an instance-level semantic fusion module enabling fine-grained alignment between textual instructions and image instance features. Combined with target localization and region determination mechanisms, the framework achieves lightweight adaptation from only a few demonstrations. Evaluated on both simulated and real-world robotic arm platforms, our method substantially improves cross-environment generalization and zero-shot transfer, enabling high-precision language-driven manipulation in unseen scenes without large-scale retraining. This work establishes an efficient, scalable paradigm for empowering embodied intelligence with VLMs.
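The instance-level alignment can be pictured as scoring per-instance image crops against the instruction embedding. The sketch below is a minimal illustration assuming an off-the-shelf CLIP backbone from Hugging Face; the paper's actual fusion module may differ, and `select_target_instance` is a hypothetical helper name, not from the paper.

```python
# Minimal sketch of instance-level text-image alignment, assuming an
# off-the-shelf CLIP backbone. The paper's actual fusion module may differ;
# `select_target_instance` is an illustrative name, not from the paper.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_target_instance(instruction, instance_crops):
    """Return the index of the instance crop best matching the instruction.

    instance_crops: list of PIL images, one per detected object instance.
    """
    text_inputs = processor(text=[instruction], return_tensors="pt", padding=True)
    image_inputs = processor(images=instance_crops, return_tensors="pt")
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)     # (1, D)
        crop_embs = model.get_image_features(**image_inputs)  # (N, D)
    # Cosine similarity between the instruction and every instance crop.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    crop_embs = crop_embs / crop_embs.norm(dim=-1, keepdim=True)
    scores = crop_embs @ text_emb.T                           # (N, 1)
    return int(scores.argmax())
```

Scoring crops rather than the whole image is what makes the alignment instance-level: each candidate object competes directly for the instruction.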

📝 Abstract
The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target localization stage for picking the object and a region determination stage for placing the object. We present an instance-level semantic fusion module that aligns instance-level image crops with the text embedding, enabling the model to identify the target objects specified by the natural language instructions. We validate our method in both simulated and real-world robotic environments. Our method, fine-tuned with a few demonstrations, improves generalization capability and demonstrates zero-shot ability in real-robot manipulation scenarios.
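As a rough illustration of the two-stage split, the hypothetical sketch below composes the `select_target_instance` scorer from the earlier sketch twice: once over object-instance crops (target localization) and once over candidate placement-region crops (region determination). The crop-based formulation and all names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical composition of the two stages, reusing the CLIP-based
# scorer from the earlier sketch. Crops and their centers are assumed to
# come from an upstream detector; all names here are illustrative.
def plan_arrangement(pick_phrase, place_phrase,
                     instance_crops, instance_centers,
                     region_crops, region_centers):
    """Return (pick_xy, place_xy) targets for the arm."""
    # Stage 1: target localization -- which object instance to pick.
    pick_idx = select_target_instance(pick_phrase, instance_crops)
    # Stage 2: region determination -- where to place it.
    place_idx = select_target_instance(place_phrase, region_crops)
    return instance_centers[pick_idx], region_centers[place_idx]
```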
Problem

Research questions and friction points this paper is trying to address.

Enhancing robot generalization with few demonstrations
Aligning visual-text data for language-conditioned manipulation
Enabling zero-shot ability in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for object-arrangement tasks
Instance-level semantic fusion module
Few-shot fine-tuning for generalization (see the sketch after this list)
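As a rough sketch of how few-shot adaptation could look: freeze the VLM backbone and train only a lightweight fusion head on a handful of demonstrations, each labeled with the correct target instance. `FusionHead` and the demonstration format are assumptions for illustration; the paper's actual training setup is not reproduced here.

```python
# Hypothetical few-shot adaptation loop: the VLM backbone stays frozen and
# only a small fusion head is trained on a few demonstrations. FusionHead
# and the demo format are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Scores an instruction embedding against N instance-crop embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, text_emb, crop_embs):
        # Pair the single instruction embedding with every crop embedding.
        text_rep = text_emb.expand(crop_embs.size(0), -1)
        return self.mlp(torch.cat([text_rep, crop_embs], dim=-1)).squeeze(-1)

# Placeholder demonstrations: (text_emb [1,D], crop_embs [N,D], target index).
D = 512
demos = [(torch.randn(1, D), torch.randn(4, D), 2),
         (torch.randn(1, D), torch.randn(3, D), 0)]

head = FusionHead(D)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):  # a few epochs suffice in the few-shot regime
    for text_emb, crop_embs, target in demos:
        logits = head(text_emb, crop_embs).unsqueeze(0)  # (1, N)
        loss = loss_fn(logits, torch.tensor([target]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Freezing the backbone keeps the trainable parameter count small, which is what makes adaptation from only a few demonstrations feasible.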
Chenglin Cui
Centre for Intelligent Sensing, Queen Mary University of London, UK
Chaoran Zhu
Centre for Intelligent Sensing, Queen Mary University of London, UK
Changjae Oh
Queen Mary University of London
computer vision, image processing, robotic perception
Andrea Cavallaro
Director, Idiap Research Institute; Professor, EPFL
Machine Learning, Computer Vision, Audio Processing, Robot Perception, Privacy