Spotlighting Task-Relevant Features: Object-Centric Representations for Better Generalization in Robotic Manipulation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current robotic manipulation policies often generalize poorly under distribution shifts—such as changes in lighting, texture variations, or the presence of distractors—because their visual representations entangle task-relevant and task-irrelevant information. To address this, the work proposes Slot-Based Object-Centric Representations (SBOCR), which use a slot attention mechanism to aggregate dense features extracted by a pretrained encoder into a fixed set of object-like entities, yielding a structured intermediate representation. Notably, SBOCR separates task-critical information from noise without requiring task-specific pretraining. Evaluated across diverse simulated and real-world manipulation tasks, SBOCR substantially outperforms baselines that rely on global or dense feature representations, demonstrating markedly better policy generalization under complex environmental perturbations.

📝 Abstract
The generalization capabilities of robotic manipulation policies are heavily influenced by the choice of visual representations. Existing approaches typically rely on representations extracted from pre-trained encoders, using two dominant types of features: global features, which summarize an entire image via a single pooled vector, and dense features, which preserve a patch-wise embedding from the final encoder layer. While widely used, both feature types mix task-relevant and irrelevant information, leading to poor generalization under distribution shifts, such as changes in lighting, textures, or the presence of distractors. In this work, we explore an intermediate structured alternative: Slot-Based Object-Centric Representations (SBOCR), which group dense features into a finite set of object-like entities. This representation naturally reduces the noise passed to the robotic manipulation policy while retaining enough information to perform the task efficiently. We benchmark a range of global and dense representations against intermediate slot-based representations, across a suite of simulated and real-world manipulation tasks ranging from simple to complex. We evaluate their generalization under diverse visual conditions, including changes in lighting, texture, and the presence of distractors. Our findings reveal that SBOCR-based policies outperform dense and global representation-based policies in generalization settings, even without task-specific pretraining. These insights suggest that SBOCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
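To make the grouping step concrete, the sketch below shows the core of a slot-attention update: patch-wise dense features compete for a fixed set of slot vectors via a softmax over slots, and each slot is recomputed as a weighted mean of the patches it wins. This is a minimal NumPy illustration, not the paper's implementation; the learned query/key/value projections and the GRU-plus-MLP slot update of full slot attention are replaced here by identity maps and a direct reassignment, and all names (`slot_attention`, `num_slots`, `iters`) are illustrative.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, num_slots=4, iters=3, seed=0):
    """Group dense patch features into object-like slots.

    features: (num_patches, d) array of dense encoder features.
    Returns a (num_slots, d) array of slot vectors. Simplified sketch:
    no learned projections, no GRU update.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Slots are initialized randomly and refined iteratively.
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        # Scaled dot-product logits between every patch and every slot.
        logits = features @ slots.T / np.sqrt(d)        # (n, num_slots)
        # Softmax over the *slot* axis: slots compete for each patch.
        attn = softmax(logits, axis=1)                  # (n, num_slots)
        # Normalize over patches so each slot takes a weighted mean
        # of the patches assigned to it.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ features                    # (num_slots, d)
    return slots
```

The softmax over the slot axis (rather than the patch axis, as in standard cross-attention) is what forces slots to partition the input, which is the property the abstract credits with filtering out task-irrelevant content.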
Problem

Research questions and friction points this paper is trying to address.

generalization
visual representations
robotic manipulation
distribution shifts
object-centric representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-centric representation
slot-based representation
robotic manipulation
visual generalization
feature disentanglement