SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

📅 2025-11-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing robotic vision-language-action (VLA) models rely on dense visual embeddings, which makes them computationally inefficient and hard to interpret. SlotVLA addresses this with a framework grounded in object-centric representation and explicit relational modeling, using slot attention as a lightweight visual tokenizer for a structured, interpretable vision-to-action mapping. The work contributes LIBERO+, a fine-grained, instance-level robotic manipulation dataset; a relation-centric decoder; and a large language model (LLM)-driven action generation module. Experiments show that SlotVLA reduces the number of visual tokens by up to 72% while retaining competitive generalization across tasks. By explicitly encoding object identities and their spatial-relational structure, SlotVLA offers a basis for efficient, interpretable, and relation-aware robotic manipulation.
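The token-reduction figure quoted in the summary is a ratio of token counts. The sketch below shows only that arithmetic; the counts 256 and 64 and the helper name `token_reduction` are hypothetical, chosen for illustration, and are not taken from the paper.

```python
def token_reduction(dense_tokens: int, slot_tokens: int) -> float:
    """Fraction of visual tokens saved when slot tokens replace dense tokens."""
    if dense_tokens <= 0:
        raise ValueError("dense_tokens must be positive")
    return 1.0 - slot_tokens / dense_tokens


# Hypothetical counts: 256 dense patch tokens replaced by 64 slot tokens.
print(f"{token_reduction(256, 64):.0%}")  # prints "75%"
```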

📝 Abstract

Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.

Problem

Research questions and friction points this paper is trying to address.

Developing object-centric representations for robotic manipulation tasks

Addressing efficiency and interpretability in visuomotor control systems

Creating structured object-relation models for action decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Slot-based visual tokenizer maintains temporally consistent object representations

Relation-centric decoder produces task-relevant embeddings

LLM-driven module translates embeddings into executable actions
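To make the first item concrete, here is a minimal NumPy sketch of how a slot-based tokenizer can compress many patch features into a few slot tokens. It follows the general slot-attention recipe (softmax over slots so that slots compete for patches, then a weighted mean over patches) and is not SlotVLA's implementation: the random projection matrices stand in for learned weights, the learned recurrent update used in standard slot attention is replaced by direct assignment, and the function name, shapes, and defaults are all assumptions.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention(features, num_slots=4, iters=3, seed=0):
    """Simplified slot attention: (N, D) patch features -> (num_slots, D) slots."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Assumption: random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))
    k = features @ w_k
    v = features @ w_v
    for _ in range(iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)   # (N, num_slots)
        # Softmax over the slot axis: slots compete to explain each patch.
        attn = softmax(logits, axis=1)
        # Normalize over patches so each slot becomes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        # Simplification: direct assignment instead of a learned recurrent update.
        slots = weights.T @ v           # (num_slots, D)
    return slots, attn


# Hypothetical input: 196 patch features of width 32, compressed to 4 slot tokens.
patches = np.random.default_rng(1).normal(size=(196, 32))
slots, attn = slot_attention(patches)
print(slots.shape, attn.shape)
```

In a pipeline like the one listed above, the resulting slot tokens, rather than all patch tokens, would be passed on to the relation-centric decoder and the LLM-driven action module, which is where the reduction in visual tokens comes from.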

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey