EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak compositional generalization, the combinatorial growth of the state-action space, and the limits that constrained dataset scale and model capacity place on multi-object visual manipulation, this paper proposes an offline visual behavior-cloning framework that combines an entity-centric Transformer with a diffusion model. Methodologically, it uses object-centric representation learning and an entity-centric attention mechanism to model the scene in a disentangled way at the object level, and pairs this with a diffusion model that captures multimodal action distributions for high-fidelity action generation. A key contribution is demonstrating zero-shot generalization, without additional training, to novel spatial configurations and compositions of objects, including more objects than were seen during training. Empirically, the approach substantially improves generalization and manipulation robustness on multi-object tasks, offering a scalable recipe for behavior cloning from high-dimensional pixel inputs.

📝 Abstract
Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into an object-centric representation, which is then processed by our entity-centric Transformer that computes attention at the object level, simultaneously predicting object dynamics and the agent's actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training. We provide video rollouts on our webpage: https://sites.google.com/view/ec-diffuser.
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Complex Image Recognition
Limited Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavioral Cloning (BC)
Transformer Models
Diffusion Techniques