SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

📅 2025-11-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing robotic vision-language-action (VLA) models rely on dense visual embeddings, resulting in computational inefficiency and poor interpretability. To address this, we propose SlotVLA—a framework grounded in object-centric representation and explicit relational modeling. It introduces slot attention as a lightweight visual tokenizer, enabling structured and interpretable vision-to-action mapping. Methodologically, we construct LIBERO+, a fine-grained, instance-level robotic manipulation dataset; design a relation-centric decoder; and integrate a large language model (LLM)-driven action generation module. Experiments demonstrate that SlotVLA reduces the number of visual tokens by up to 72% while achieving competitive cross-task generalization relative to state-of-the-art baselines. By explicitly encoding object identities and their spatial-relational structure, SlotVLA establishes a new paradigm for efficient, interpretable, and relation-aware robotic manipulation.

📝 Abstract
Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.
Problem

Research questions and friction points this paper is trying to address.

Developing object-centric representations for robotic manipulation tasks
Addressing efficiency and interpretability in visuomotor control systems
Creating structured object-relation models for action decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slot-based visual tokenizer maintains temporally consistent object representations
Relation-centric decoder produces compact, task-relevant relation embeddings
LLM-driven module translates these embeddings into executable actions (see the sketch after this list)
Authors
Taisei Hanyu
University of Arkansas, USA
Nhat Chung
FPT Software AI Center, Vietnam
Huy Le
FPT Software AI Center, Vietnam
Toan Nguyen
FPT Software AI Center, Vietnam
Yuki Ikebe
University of Arkansas, USA
Anthony Gunderman
University of Arkansas, USA
Duy Nguyen Ho Minh
University of Stuttgart, Germany
Khoa T. Vo
University of Arkansas, USA
Tung Kieu
Aalborg University, Department of Computer Science
Data Mining, Data Management, Spatio-Temporal Data, Time Series Analysis
Kashu Yamazaki
Carnegie Mellon University, Genesis AI
Robot Learning, Physical AI, Multimodal AI
Chase Rainwater
University of Arkansas
Logistics, Optimization, Security
Anh Nguyen
University of Liverpool, UK
Ngan Le
University of Arkansas
Artificial Intelligence, Machine Learning, Computer Vision